Python Mastery — From Zero to AI Engineering

Lesson 4

String Manipulation, Regular Expressions & Text Processing

30 min

Part 1: String Internals

How Python Stores Strings

A Python string is not just a sequence of bytes. In CPython it is a PyUnicodeObject, a structure that chooses its internal representation based on the highest Unicode code point in the string (the design introduced by PEP 393). CPython uses a flexible representation with four possible layouts:

Latin-1 (1 byte per character): for strings containing only code points 0–255
UCS-2 (2 bytes per character): for strings up to code point U+FFFF
UCS-4 (4 bytes per character): for strings with any code point above U+FFFF
Compact ASCII: a special fast path for pure ASCII strings (code points 0–127), also 1 byte per character but with a smaller object header

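This means "hello" and "héllo" both use 1 byte per character (the non-ASCII string only carries a slightly larger header), while a string containing even a single emoji uses 4 bytes for every character, including the ASCII ones. Python automatically picks the most compact representation that can hold every character in the string.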
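
The layouts above can be observed indirectly. The following is a minimal sketch that assumes CPython (sys.getsizeof reports implementation-specific sizes); the helper name bytes_per_char is hypothetical and not part of the lesson.

```python
import sys

def bytes_per_char(ch: str) -> int:
    """Estimate per-character storage by growing a string by one character."""
    return sys.getsizeof(ch * 1001) - sys.getsizeof(ch * 1000)

samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",      # é
    "UCS-2": "\u20ac",        # euro sign
    "UCS-4": "\U0001F600",    # an emoji above U+FFFF
}
for label, ch in samples.items():
    print(f"{label:8} {bytes_per_char(ch)} byte(s) per character")

# One wide character widens the storage of the whole string.
narrow = "a" * 1000
wide = "a" * 999 + "\U0001F600"
print(sys.getsizeof(narrow), sys.getsizeof(wide))
```

The exact totals differ between CPython versions because the header size changes, but the per-character widths of 1, 1, 2, and 4 bytes should hold.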

String Interning

CPython interns (caches) certain strings so that multiple variables pointing to the same string share a single object in memory. This is an optimization, not a language guarantee. Compile-time string constants that look like Python identifiers (ASCII letters, digits, and underscores) are typically interned automatically, but the exact rules vary between CPython versions. Always compare strings with ==, never with is.
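
As a small illustration, assuming CPython, the sketch below builds two equal strings at runtime and then interns them explicitly with sys.intern(). The identity result for the un-interned strings is an implementation detail, so it is printed rather than relied upon.

```python
import sys

# Two equal strings built at runtime are usually distinct objects in CPython.
a = "".join(["text", " ", "processing"])
b = "".join(["text", " ", "processing"])
print(a == b)   # True: same characters
print(a is b)   # usually False in CPython; identity is an implementation detail

# sys.intern() returns the single canonical copy for a given value.
c = sys.intern(a)
d = sys.intern(b)
print(c is d)   # True
```

Explicit interning is mostly useful when the same strings are used many times as dictionary keys.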

Why Immutability Matters for Performance

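Every string method that "modifies" a string returns a new string object. This has a critical consequence: concatenating n strings in a loop with += can create O(n) intermediate objects, each copying everything accumulated so far, which is O(n²) time in the worst case. CPython can sometimes extend the string in place when nothing else references it, but that is an implementation detail you should not depend on. The reliable fix is str.join(), which computes the final size once and copies each piece once.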
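
The two approaches can be compared with a short sketch. The function names build_with_plus and build_with_join are hypothetical, and the timings depend heavily on the interpreter; on CPython the gap may be small because of its in-place optimization for +=.

```python
import timeit

def build_with_plus(parts):
    result = ""
    for part in parts:
        result += part      # may copy everything accumulated so far
    return result

def build_with_join(parts):
    return "".join(parts)   # sizes the result once, then copies each part once

parts = [str(i) for i in range(10_000)]
assert build_with_plus(parts) == build_with_join(parts)

for func in (build_with_plus, build_with_join):
    seconds = timeit.timeit(lambda: func(parts), number=20)
    print(f"{func.__name__:16} {seconds:.4f} s")
```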

Formatting Performance

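Python offers four ways to format strings: %-formatting, str.format(), f-strings, and string.Template. Their performance differences matter in hot loops; f-strings are usually the fastest because the formatting is compiled directly into bytecode.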
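
For reference, here is the same output produced with each of the four styles. This is only a sketch with made-up sample values; it shows syntax rather than measuring speed.

```python
from string import Template

name, score = "Ada", 93.456

percent_style = "%s scored %.1f" % (name, score)
format_method = "{} scored {:.1f}".format(name, score)
f_string = f"{name} scored {score:.1f}"
# Template has no format specs, so the number is formatted before substitution.
template = Template("$name scored $score").substitute(name=name, score=f"{score:.1f}")

results = {
    "%": percent_style,
    "str.format": format_method,
    "f-string": f_string,
    "Template": template,
}
for label, text in results.items():
    print(f"{label:10} -> {text}")
```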

Part 2: All String Methods

Case Methods
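
A brief sketch of the built-in case methods, using sample strings chosen for illustration:

```python
s = "hello World"
print(s.upper())       # HELLO WORLD
print(s.lower())       # hello world
print(s.capitalize())  # Hello world
print(s.title())       # Hello World
print(s.swapcase())    # HELLO wORLD

# title() treats every non-letter as a word break, including apostrophes.
print("they're here".title())   # They'Re Here

# casefold() is a more aggressive lower() intended for caseless comparison.
print("Straße".casefold() == "STRASSE".casefold())   # True
```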

Whitespace and Split Methods
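
The sketch below contrasts split() with no argument, which collapses runs of whitespace, against split(" "), which does not. The sample strings are illustrative.

```python
raw = "  a  b "
print(repr(raw.strip()))    # 'a  b'
print(repr(raw.lstrip()))   # 'a  b '
print(repr(raw.rstrip()))   # '  a  b'

print(raw.split())          # ['a', 'b']
print(raw.split(" "))       # ['', '', 'a', '', 'b', '']

# strip() accepts a set of characters to remove, not a prefix or suffix.
print("xxhixx".strip("x"))  # hi

# splitlines() understands \n, \r\n, and other line boundaries.
print("line1\nline2\r\nline3".splitlines())   # ['line1', 'line2', 'line3']
```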

Search Methods
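
A short sketch of the search methods. Note the difference between find(), which returns -1 on failure, and index(), which raises.

```python
s = "banana"
print(s.find("na"))     # 2
print(s.rfind("na"))    # 4
print(s.find("z"))      # -1
print(s.count("a"))     # 3
print(s.count("ana"))   # 1 (matches are counted without overlap)
print("na" in s)        # True

# startswith() and endswith() accept a tuple of alternatives.
print(s.startswith(("ba", "pa")))   # True
print(s.endswith("na"))             # True

try:
    s.index("z")
except ValueError:
    print("index() raises ValueError when the substring is missing")
```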

Transform Methods
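
The following sketch covers replace(), translate(), and the prefix/suffix helpers. It assumes Python 3.9 or later for removeprefix() and removesuffix().

```python
print("a-b-c".replace("-", "+"))       # a+b+c
print("a-b-c".replace("-", "+", 1))    # a+b-c (at most one replacement)

# maketrans(from, to, delete) builds a per-character mapping table.
table = str.maketrans("abc", "xyz", "!")
print("aabbcc!".translate(table))      # xxyyzz

# Python 3.9+: remove an exact prefix or suffix, unlike strip().
print("report.txt".removesuffix(".txt"))    # report
print("test_parser".removeprefix("test_"))  # parser

print(repr("a\tb".expandtabs(4)))      # 'a   b'
```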

Alignment and Formatting
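
A minimal sketch of the padding methods, with an optional fill character:

```python
print(repr("hi".ljust(4, ".")))    # 'hi..'
print(repr("hi".rjust(4, ".")))    # '..hi'
print(repr("hi".center(6, "*")))   # '**hi**'

# zfill() pads with zeros and keeps a leading sign in front.
print("42".zfill(5))    # 00042
print("-42".zfill(5))   # -0042

# A simple aligned table built from ljust() and rjust().
rows = [("apples", 3), ("kiwis", 12)]
for name, qty in rows:
    print(name.ljust(10) + str(qty).rjust(4))
```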

Predicate Methods
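
The predicate methods differ in subtle ways for non-ASCII characters. The sketch below uses the superscript two (U+00B2) and the one-half fraction (U+00BD) as illustrative edge cases.

```python
print("123".isdigit())          # True
print("\u00b2".isdigit())       # True  (superscript two counts as a digit)
print("\u00b2".isdecimal())     # False (but not as a decimal character)
print("\u00bd".isnumeric())     # True  (one half is numeric)
print("\u00bd".isdigit())       # False

print("abc123".isalnum())       # True
print("abc 123".isalnum())      # False (space is not alphanumeric)
print("".isalpha())             # False (empty strings fail every predicate)
print(" \t\n".isspace())        # True

print("my_var".isidentifier())  # True
print("2fast".isidentifier())   # False
```

For validating user input that will be passed to int(), isdecimal() is usually the closest match among these predicates.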

Join, Partition, and Advanced Split
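
A short sketch of join(), partition(), and the maxsplit argument, using illustrative inputs:

```python
# join() needs strings, so convert other types first.
print(", ".join(map(str, [1, 2, 3])))   # 1, 2, 3

# partition() always returns a 3-tuple: (before, separator, after).
print("key=a=b".partition("="))     # ('key', '=', 'a=b')
print("key=a=b".rpartition("="))    # ('key=a', '=', 'b')
print("no separator".partition("="))  # ('no separator', '', '')

# maxsplit limits the number of splits; rsplit() counts from the right.
print("a,b,c,d".split(",", 1))      # ['a', 'b,c,d']
print("a,b,c,d".rsplit(",", 1))     # ['a,b,c', 'd']
```

partition() is often convenient for "key=value" parsing because it never raises and never returns a variable number of parts.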

Part 3: f-strings Mastery

Basic f-strings and Expressions
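
An f-string placeholder can hold any expression, not just a variable name. A minimal sketch with made-up values:

```python
name = "Ada"
items = [3, 1, 2]

print(f"{name} has {len(items)} items")             # Ada has 3 items
print(f"max={max(items)}, sorted={sorted(items)}")  # max=3, sorted=[1, 2, 3]
print(f"{name.upper()}")                            # ADA
print(f"{name!r}")                                  # 'Ada' (repr conversion)
print(f"{'even' if len(items) % 2 == 0 else 'odd'}")  # odd

# Double the braces to get literal braces.
print(f"{{literal braces}}")                        # {literal braces}
```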

Format Specification Mini-Language
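
The part after the colon in a placeholder follows the pattern [fill][align][sign][#][0][width][,][.precision][type]. The sketch below exercises the most common pieces:

```python
print(f"{3.14159:.2f}")       # 3.14
print(f"{1234567:,}")         # 1,234,567
print(f"{0.256:.1%}")         # 25.6%
print(f"{255:#x}")            # 0xff
print(f"{5:08b}")             # 00000101
print(f"{42:+d}")             # +42

print(repr(f"{'hi':>6}"))     # '    hi'
print(repr(f"{'hi':^6}"))     # '  hi  '
print(repr(f"{'hi':*<6}"))    # 'hi****'

# Width and precision can themselves come from variables.
width = 12
print(repr(f"{1234.5:>{width},.2f}"))   # '    1,234.50'
```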

The `=` Debug Specifier
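
A short sketch of the = specifier, which requires Python 3.8 or later. It prints the expression text followed by its value, using repr() unless a conversion or format spec is given.

```python
x = 10
name = "Ada"

print(f"{x=}")          # x=10
print(f"{name=}")       # name='Ada'  (repr by default)
print(f"{name=!s}")     # name=Ada    (explicit str conversion)
print(f"{x * 2 = }")    # x * 2 = 20  (whitespace is preserved)
print(f"{x=:>5}")       # x=   10     (a format spec switches to format())
```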

Multi-line f-strings and Template Strings
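
One common way to write a multi-line f-string is implicit concatenation inside parentheses. The sketch below also shows string.Template, which only substitutes names and evaluates no expressions, making it a reasonable choice for templates supplied by users. The values are made up.

```python
from string import Template

name, total, paid = "Ada", 120.0, 45.5

# Adjacent literals inside parentheses are joined at compile time;
# each piece that contains a placeholder needs its own f prefix.
summary = (
    f"Customer: {name}\n"
    f"Total:    {total:8.2f}\n"
    f"Balance:  {total - paid:8.2f}"
)
print(summary)

# $$ is a literal dollar sign; $who and $amount are placeholders.
receipt = Template("$who owes $$$amount")
print(receipt.substitute(who=name, amount="74.50"))   # Ada owes $74.50

# safe_substitute() leaves unknown placeholders untouched instead of raising.
print(Template("$a and $b").safe_substitute(a="x"))   # x and $b
```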

Part 4: Regular Expressions — Complete Guide

The re Module: Core Functions

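Regular expressions are a mini-language for describing text patterns. Python's re module compiles a pattern string into an internal bytecode program and runs it with a backtracking matching engine. Backtracking is what makes features such as backreferences and lookarounds possible, but it also means a poorly written pattern can take exponential time on some inputs.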
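
The core functions differ mainly in where they look and what they return. A minimal sketch with illustrative inputs:

```python
import re

# match() anchors at the start, search() scans, fullmatch() needs the whole string.
print(re.match(r"\d+", "abc 123"))              # None
print(re.search(r"\d+", "abc 123").group())     # 123
print(re.fullmatch(r"\d+", "123abc"))           # None

# findall() returns strings; finditer() yields match objects with positions.
print(re.findall(r"\d+", "a1 b22 c333"))                    # ['1', '22', '333']
print([m.span() for m in re.finditer(r"\d+", "a1 b22")])    # [(1, 2), (4, 6)]

# sub() replaces matches; split() cuts on them.
print(re.sub(r"\s+", " ", "a   b \n c"))        # a b c
print(re.split(r"[,;]\s*", "a, b;c"))           # ['a', 'b', 'c']
```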

Character Classes and Quantifiers
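
A short sketch of character classes (\d, \w, \s, and bracket sets) combined with quantifiers (?, +, *, {m,n}):

```python
import re

print(re.findall(r"[aeiou]", "regex"))                  # ['e', 'e']
print(re.findall(r"[^aeiou\s]+", "hi you"))             # ['h', 'y']
print(re.findall(r"colou?r", "color colour"))           # ['color', 'colour']
print(re.findall(r"\b\w{3}\b", "the quick red fox"))    # ['the', 'red', 'fox']
print(re.findall(r"\d{2,3}", "1 22 333 4444"))          # ['22', '333', '444']
print(bool(re.fullmatch(r"\d{3}-\d{4}", "555-1234")))   # True
```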

Greedy vs Non-Greedy
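
Quantifiers are greedy by default: they take as much as possible and give characters back only if the rest of the pattern fails. Adding ? makes them take as little as possible. A minimal sketch:

```python
import re

html = "<b>bold</b>"

# Greedy: .+ runs to the last '>' in the string.
print(re.findall(r"<.+>", html))    # ['<b>bold</b>']

# Non-greedy: .+? stops at the first '>' it can.
print(re.findall(r"<.+?>", html))   # ['<b>', '</b>']

# A negated class is often clearer and avoids backtracking.
print(re.findall(r"<[^>]+>", html)) # ['<b>', '</b>']
```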

Anchors and Word Boundaries
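
Anchors match positions rather than characters. The sketch below shows ^ and $ with and without re.MULTILINE, and \b for whole-word matching:

```python
import re

text = "one two\nthree four"

print(re.findall(r"^\w+", text))                 # ['one']
print(re.findall(r"^\w+", text, re.MULTILINE))   # ['one', 'three']
print(re.findall(r"\w+$", text))                 # ['four']
print(re.findall(r"\w+$", text, re.MULTILINE))   # ['two', 'four']

# \b matches between a word character and a non-word character.
print(re.findall(r"\bcat\b", "cat concat cat."))         # ['cat', 'cat']
print(re.sub(r"\bcat\b", "dog", "cat concatenate"))      # dog concatenate
```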

Groups: Capturing, Non-Capturing, Named
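
A sketch of the three group types, using an ISO-style date as the illustrative input:

```python
import re

date_re = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
m = re.search(date_re, "Date: 2024-03-15")

print(m.group(0))        # 2024-03-15 (the whole match)
print(m.groups())        # ('2024', '03', '15')
print(m.group("year"))   # 2024
print(m.groupdict())     # {'year': '2024', 'month': '03', 'day': '15'}

# Named groups can be referenced in the replacement with \g<name>.
print(re.sub(date_re, r"\g<day>/\g<month>/\g<year>", "2024-03-15"))   # 15/03/2024

# With one capturing group, findall() returns only that group's last capture.
print(re.findall(r"(ab)+", "ababab ab"))     # ['ab', 'ab']
# A non-capturing group (?:...) groups without capturing.
print(re.findall(r"(?:ab)+", "ababab ab"))   # ['ababab', 'ab']
```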

Lookahead and Lookbehind
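
Lookarounds assert what comes before or after a position without consuming it. A minimal sketch with illustrative price strings (note that lookbehind patterns in re must have a fixed width):

```python
import re

prices = "10 USD, 20 EUR, 30 USD"

# Positive lookahead: digits followed by ' USD'.
print(re.findall(r"\d+(?= USD)", prices))        # ['10', '30']

# Negative lookahead: whole numbers NOT followed by ' USD'.
print(re.findall(r"\b\d+\b(?! USD)", prices))    # ['20']

text = "cost $15 and 20"

# Positive lookbehind: digits preceded by '$'.
print(re.findall(r"(?<=\$)\d+", text))           # ['15']

# Negative lookbehind: numbers NOT preceded by '$'.
print(re.findall(r"(?<!\$)\b\d+", text))         # ['20']
```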

re.sub with Function Replacement and Flags
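
re.sub() accepts a function as the replacement; it is called with each match object and must return a string. The sketch below also shows re.IGNORECASE and re.VERBOSE. The helper name double is hypothetical.

```python
import re

def double(match):
    """Replacement callback: receives a match object, returns a string."""
    return str(int(match.group()) * 2)

print(re.sub(r"\d+", double, "3 apples and 10 pears"))   # 6 apples and 20 pears

# re.IGNORECASE (re.I) makes letters match regardless of case.
print(re.sub(r"python", "Python", "PYTHON python", flags=re.IGNORECASE))   # Python Python

# re.VERBOSE (re.X) ignores whitespace in the pattern and allows comments.
time_re = re.compile(r"""
    (?P<hours>\d{1,2})   # one or two digits for the hour
    :
    (?P<minutes>\d{2})   # exactly two digits for the minutes
""", re.VERBOSE)
print(time_re.search("Meet at 9:30").groupdict())   # {'hours': '9', 'minutes': '30'}
```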

Common Regex Patterns
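
The patterns below are deliberately simplified sketches, not validators: real email addresses and dates follow rules that no short regex fully captures. The constant names and the sample text are assumptions made for illustration.

```python
import re

# Simplified patterns; good enough for extraction, not for strict validation.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")

text = ("Mail ada@example.com or bob.smith@mail.example.org "
        "by 2024-03-15; theme #1a2b3c.")

print(EMAIL.findall(text))      # ['ada@example.com', 'bob.smith@mail.example.org']
print(ISO_DATE.findall(text))   # ['2024-03-15']
print(HEX_COLOR.findall(text))  # ['#1a2b3c']
```

The ISO_DATE pattern checks only the shape of the string; something like 2024-99-99 would still match, so parse the result with datetime if validity matters.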

Compiled Patterns for Performance
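
re.compile() returns a pattern object whose methods mirror the module-level functions. The re module already caches recently used patterns, so the gain from compiling is mostly skipping the cache lookup and giving the pattern a name. A minimal sketch, with the hypothetical helper count_words:

```python
import re

WORD = re.compile(r"[A-Za-z']+")   # compiled once, at import time

def count_words(lines):
    """Count words across many lines, reusing one compiled pattern."""
    return sum(len(WORD.findall(line)) for line in lines)

lines = ["Don't repeat yourself.", "Compile once, match many times."]
print(count_words(lines))   # 8

# The pattern object remembers its source and exposes the same methods.
print(WORD.pattern)
print(WORD.sub("_", "two words"))   # _ _
```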

Part 5: Text Processing

textwrap — Formatting Long Text
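
A short sketch of the most commonly used textwrap functions; the sample text is made up.

```python
import textwrap

text = ("Python's textwrap module reflows paragraphs so that every line "
        "fits inside a chosen width without splitting words.")

# fill() wraps and joins with newlines; wrap() returns the list of lines.
wrapped = textwrap.fill(text, width=30)
print(wrapped)

# shorten() collapses whitespace and truncates at a word boundary.
short = textwrap.shorten(text, width=40, placeholder=" ...")
print(short)

# dedent() removes whitespace common to all lines; indent() adds a prefix.
code = """
    def greet():
        return "hi"
"""
print(textwrap.dedent(code).strip())
print(textwrap.indent("line one\nline two", "> "))
```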

difflib — Fuzzy String Matching
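
difflib can rank candidate strings by similarity and produce line diffs. A minimal sketch with illustrative word lists:

```python
import difflib

words = ["ape", "apple", "peach", "puppy"]

# get_close_matches() returns the best candidates, most similar first.
print(difflib.get_close_matches("appel", words))   # ['apple', 'ape']

# SequenceMatcher.ratio() is a similarity score between 0.0 and 1.0.
ratio = difflib.SequenceMatcher(None, "kitten", "sitting").ratio()
print(f"{ratio:.3f}")

# unified_diff() compares two lists of lines.
old = ["a\n", "b\n"]
new = ["a\n", "c\n"]
diff = list(difflib.unified_diff(old, new, fromfile="old", tofile="new"))
print("".join(diff))
```

get_close_matches() takes optional n and cutoff arguments to control how many results are returned and how similar they must be.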

unicodedata — Working with Unicode
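
The same visible character can be stored as different code point sequences, which breaks naive comparison. The sketch below normalizes such strings and strips accents; the helper name strip_accents is hypothetical.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.name(composed))       # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.category("A"))        # Lu (letter, uppercase)

def strip_accents(s: str) -> str:
    """Decompose characters, then drop the combining marks."""
    parts = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in parts if not unicodedata.combining(ch))

print(strip_accents("café naïve"))      # cafe naive
```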

ast.literal_eval — Safe String Evaluation
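
ast.literal_eval() accepts only Python literals (strings, numbers, tuples, lists, dicts, sets, booleans, None) and rejects function calls and most operators, unlike eval(). A minimal sketch with illustrative inputs:

```python
import ast

data = ast.literal_eval("[1, 2.5, 'three', {'k': (4, 5)}, None, True]")
print(data)

rejected = []
for untrusted in ["__import__('os').getcwd()", "2 ** 10", "open('x')"]:
    try:
        ast.literal_eval(untrusted)
    except (ValueError, SyntaxError) as exc:
        rejected.append(untrusted)
        print(f"rejected {untrusted!r}: {type(exc).__name__}")
```

It will not execute code, but very large or deeply nested input can still exhaust memory or the parser's stack, so size limits on untrusted input remain a good idea.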

Project: Text Analysis Toolkit

Build a complete text analysis toolkit — word frequencies, sentence tokenizer, email/URL extractor, sensitive data redactor, and word cloud data generator.
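
One possible starting point for the project is sketched below. Every function name, pattern, and the sample text are assumptions made for illustration; the regexes are simplified, and the sentence splitter will mis-split abbreviations such as "Dr." or "e.g.".

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://[^\s<>\"']+")
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")

def word_frequencies(text, top=5):
    """Return the `top` most common lowercase words with their counts."""
    return Counter(WORD_RE.findall(text.lower())).most_common(top)

def split_sentences(text):
    """Naive tokenizer: split on whitespace that follows ., !, or ?."""
    return [s for s in SENTENCE_END_RE.split(text.strip()) if s]

def extract_emails(text):
    return EMAIL_RE.findall(text)

def extract_urls(text):
    # Trailing punctuation is usually part of the sentence, not the URL.
    return [url.rstrip(".,;") for url in URL_RE.findall(text)]

def redact(text):
    """Replace email addresses with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)

def word_cloud_data(text, top=5):
    """Scale counts to weights in (0, 1], relative to the most common word."""
    freqs = word_frequencies(text, top)
    if not freqs:
        return []
    highest = freqs[0][1]
    return [(word, count / highest) for word, count in freqs]

sample = ("Python is fun. Python is popular! "
          "Write to ada@example.com or see https://example.com/docs.")

print(word_frequencies(sample, top=3))
print(split_sentences(sample))
print(extract_emails(sample))
print(extract_urls(sample))
print(redact(sample))
print(word_cloud_data(sample, top=3))
```

Compiling the patterns once at module level keeps the individual functions short and makes the patterns easy to swap for stricter ones later.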

Exercises

Easy

1. Palindrome Check — Write a function is_palindrome(s) that returns True if s is a palindrome, ignoring case and non-alphanumeric characters.

2. Title Case Fixer — Implement smart_title(s) that applies title case but keeps articles/prepositions lowercase unless they are the first word.

Medium

3. Regex Log Parser — Parse Apache/Nginx access log lines and return structured dicts.

4. Template Engine — Write a mini template engine that replaces {{variable}} placeholders.

5. Caesar Cipher — Using str.maketrans(), implement encrypt and decrypt.

Hard

6. String Compression — Implement run-length encoding and decoding.

7. Word Wrap Algorithm — Implement a word wrap that minimizes ragged right margins (greedy approach).

8. Regex-based CSV Parser — Parse CSV (handling quoted fields with commas and escaped quotes) without using the csv module.
