String Manipulation, Regular Expressions & Text Processing
Part 1: String Internals
How Python Stores Strings
A Python string is not just a sequence of bytes. In CPython it is a PyUnicodeObject, a structure that chooses its internal representation based on the highest Unicode code point in the string. Since Python 3.3, CPython has used the flexible string representation described in PEP 393, which picks among four layouts:
- Latin-1 (1 byte per character): for strings containing only code points 0–255
- UCS-2 (2 bytes per character): for strings up to code point U+FFFF
- UCS-4 (4 bytes per character): for strings with any code point above U+FFFF
- Compact ASCII: a special fast path for pure ASCII strings
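This means the per-character cost is set by the widest character in the string. "hello" and "héllo" both use 1 byte per character (the non-ASCII string only carries a slightly larger header), but a single emoji forces 4 bytes for every character in the string, including the ASCII ones. CPython automatically picks the most compact representation that can hold every code point.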
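The per-character widths can be observed from Python itself. The following is a minimal sketch, assuming CPython 3.3 or later; bytes_per_char is a hypothetical helper name, and the measurement relies on sys.getsizeof, whose results are implementation-specific.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate storage per character for a string made of `ch` repeated."""
    # Subtracting two sizes cancels the fixed object header,
    # leaving only the cost of the n extra characters.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",      # e with acute accent
    "BMP": "\u20ac",          # euro sign, needs 2 bytes
    "astral": "\U0001F600",   # emoji above U+FFFF, needs 4 bytes
}
widths = {label: bytes_per_char(ch) for label, ch in samples.items()}
for label, width in widths.items():
    print(f"{label:8} {width:.0f} byte(s) per character")
```

On CPython this is expected to report 1, 1, 2, and 4 bytes respectively; other implementations may store strings differently.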
String Interning
CPython interns (caches) certain strings so that multiple variables pointing to the same string share a single object in memory. This is an optimization, not a language guarantee, so always compare strings with == rather than is. Strings that look like Python identifiers (letters, digits, underscores, starting with a letter or underscore) and are compile-time constants are typically interned automatically; the exact rules vary between CPython versions.
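Interning can also be requested explicitly with sys.intern. The sketch below builds two equal strings at run time (so the compiler cannot merge them) and then interns both; the variable names are illustrative only.

```python
import sys

# Build two equal strings at run time so they are not compile-time constants.
parts = ["user", "name"]
a = "_".join(parts)
b = "_".join(parts)

print(a == b)  # values are equal
print(a is b)  # identity is an implementation detail; do not rely on it

# sys.intern returns the canonical copy from the interpreter's intern table,
# so two interned strings with equal values are the same object.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)
```

Explicit interning is mostly useful when the same strings are compared or used as dictionary keys a very large number of times.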
Why Immutability Matters for Performance
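Every string method that "modifies" a string returns a new string object. This has a critical consequence: concatenating n strings in a loop with += can create O(n) intermediate objects and run in O(n²) time, because each step may copy everything accumulated so far. CPython can sometimes resize the string in place, but that is an implementation detail that should not be relied on. The fix is str.join(), which computes the final size once and copies each piece a single time.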
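The two approaches can be compared side by side. This is a minimal sketch with hypothetical function names; the timings it prints depend on the interpreter and machine, and on CPython the gap may be smaller than the worst case suggests.

```python
import timeit

def build_with_concat(pieces):
    result = ""
    for piece in pieces:
        result += piece  # may copy everything accumulated so far
    return result

def build_with_join(pieces):
    return "".join(pieces)  # sizes the result once, then copies each piece

pieces = [str(i) for i in range(10_000)]
same = build_with_concat(pieces) == build_with_join(pieces)

for fn in (build_with_concat, build_with_join):
    seconds = timeit.timeit(lambda: fn(pieces), number=20)
    print(f"{fn.__name__:18} {seconds:.4f}s")
```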
Formatting Performance
Python offers four ways to format strings: %-formatting, str.format(), f-strings, and string.Template. Their performance differences matter in hot loops.
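A rough way to compare the approaches is to time each one producing the same output. The sketch below uses timeit with an arbitrary repetition count; the sample values and labels are made up for illustration, and absolute numbers will vary by Python version.

```python
import timeit
from string import Template

name, count = "widget", 3
template = Template("$name x$count")

candidates = {
    "percent": lambda: "%s x%d" % (name, count),
    "str.format": lambda: "{} x{}".format(name, count),
    "f-string": lambda: f"{name} x{count}",
    "Template": lambda: template.substitute(name=name, count=count),
}

# Confirm every approach produces the same text before timing it.
results = {label: fn() for label, fn in candidates.items()}

for label, fn in candidates.items():
    seconds = timeit.timeit(fn, number=20_000)
    print(f"{label:10} {seconds:.4f}s -> {results[label]}")
```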
Part 2: All String Methods
Case Methods
Whitespace and Split Methods
Search Methods
Transform Methods
Alignment and Formatting
Predicate Methods
Join, Partition, and Advanced Split
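As a brief illustration of these methods, the sketch below splits a header-style line with partition, takes a file extension with rpartition, and limits splits with maxsplit. The sample strings are invented for the example.

```python
line = "Content-Type: text/html; charset=utf-8"

# partition splits at the first occurrence and always returns three parts.
key, sep, value = line.partition(": ")

# If the separator is missing, the whole string comes back in the first slot.
missing = "no separator here".partition("=")

# rpartition splits at the last occurrence.
stem, dot, ext = "archive.tar.gz".rpartition(".")

# maxsplit limits the number of splits; rsplit counts from the right.
left_limited = "a,b,c,d".split(",", maxsplit=2)
right_limited = "a,b,c,d".rsplit(",", maxsplit=1)

# join requires strings, so convert other types first.
joined = ", ".join(str(n) for n in [1, 2, 3])

print(key, "|", value)
print(missing)
print(stem, ext)
print(left_limited, right_limited, joined)
```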
Part 3: f-strings Mastery
Basic f-strings and Expressions
Format Specification Mini-Language
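A few commonly used format specifications are sketched below. The values are arbitrary; the dictionary is only there to keep the examples together.

```python
value = 1234567.891

formatted = {
    "thousands": f"{value:,.2f}",      # comma grouping, 2 decimal places
    "fixed_width": f"{3.14159:10.3f}", # right-aligned in 10 columns
    "percent": f"{0.256:.1%}",         # multiplies by 100 and appends %
    "hex": f"{255:#x}",                # '#' adds the 0x prefix
    "binary": f"{5:08b}",              # zero-padded to 8 binary digits
    "sign": f"{42:+d}",                # always show the sign
    "center": f"{'ab':^6}",            # centered in 6 columns
    "fill": f"{'ab':*<6}",             # left-aligned, padded with '*'
}

for label, text in formatted.items():
    print(f"{label:12} [{text}]")
```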
The = Debug Specifier
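The = specifier (available from Python 3.8) prints the expression text followed by its value, which is handy for quick debugging output. A short sketch with made-up variables:

```python
width = 80
ratio = 2 / 3
name = "x"

debug_lines = [
    f"{width=}",        # expression text, then repr of the value
    f"{name=}",         # strings show their quotes because repr is used
    f"{ratio=:.3f}",    # a format spec can follow the '='
    f"{width * 2 = }",  # whitespace around '=' is preserved
]

for line in debug_lines:
    print(line)
```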
Multi-line f-strings and Template Strings
Part 4: Regular Expressions — Complete Guide
The re Module: Core Functions
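Regular expressions are a mini-language for describing text patterns. Python's re module compiles a pattern string into an internal program that a backtracking matching engine runs against your text. It is not a pure finite automaton: features such as backreferences and lookarounds depend on backtracking, which is also why some patterns can become very slow on adversarial input.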
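The core functions of the re module can be sketched against one sample string. The text and the date pattern below are invented for illustration.

```python
import re

text = "Order 66 shipped 2024-01-15; order 67 shipped 2024-02-03."
date = r"\d{4}-\d{2}-\d{2}"

first = re.search(date, text)            # first match anywhere in the text
at_start = re.match(date, text)          # only matches at the beginning
whole = re.fullmatch(date, "2024-01-15") # must match the entire string
all_dates = re.findall(date, text)       # list of all matching substrings
positions = [m.start() for m in re.finditer(r"\d{4}-", text)]
masked = re.sub(date, "<date>", text)    # replace every match
clauses = re.split(r";\s*", text)        # split on a pattern

print(first.group(), at_start, whole is not None)
print(all_dates, positions)
print(masked)
print(clauses)
```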
Character Classes and Quantifiers
Greedy vs Non-Greedy
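The difference shows up clearly when matching delimited text. In this sketch, the greedy quantifier consumes as much as possible while the non-greedy form stops at the first closing bracket; the sample markup is made up.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: '.+' extends to the last '>' in the string.
greedy = re.findall(r"<.+>", html)

# Non-greedy: '.+?' stops at the first '>' that lets the pattern match.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```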
Anchors and Word Boundaries
Groups: Capturing, Non-Capturing, Named
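The three kinds of group can be compared in a short sketch. The date pattern and the sample sentences are assumptions made for the example.

```python
import re

# Named groups make matches self-describing.
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date_re.search("Released on 2023-10-02.")

year = m.group("year")   # access by name
parts = m.groups()       # all captured groups as a tuple
as_dict = m.groupdict()  # named groups as a dict

# A non-capturing group (?:...) groups alternatives without capturing them,
# so findall returns only the one capturing group.
surnames = re.findall(r"(?:Mr|Ms)\. (\w+)", "Mr. Smith met Ms. Jones")

print(year, parts, as_dict)
print(surnames)
```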
Lookahead and Lookbehind
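Lookarounds assert what surrounds a match without including it in the result. A small sketch with invented sample strings:

```python
import re

# Positive lookbehind: digits preceded by "USD ".
usd = re.findall(r"(?<=USD )\d+", "USD 100, EUR 200, USD 300")

# Positive lookahead: digits followed by " kg".
kilos = re.findall(r"\d+(?= kg)", "5 kg, 10 lb, 7 kg")

# Negative lookbehind: whole numbers not preceded by a dollar sign.
plain = re.findall(r"(?<!\$)\b\d+\b", "$5 and 10 and $20")

print(usd, kilos, plain)
```

Note that Python's re module requires lookbehind patterns to match strings of a fixed length.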
re.sub with Function Replacement and Flags
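When the replacement depends on the matched text, re.sub accepts a function that receives each match object. The sketch below also shows two common flags; double_number is a hypothetical helper.

```python
import re

def double_number(match):
    """Return the matched integer doubled, as a string."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double_number, "3 apples and 10 pears")

# IGNORECASE makes the pattern match regardless of letter case.
normalized = re.sub(r"python", "Python", "PYTHON and python", flags=re.IGNORECASE)

# MULTILINE makes '^' match at the start of every line, not just the string.
first_words = re.findall(r"^\w+", "alpha one\nbeta two", flags=re.MULTILINE)

print(doubled)
print(normalized)
print(first_words)
```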
Common Regex Patterns
Compiled Patterns for Performance
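Compiling a pattern once and reusing the resulting object avoids repeated pattern lookups in a hot loop. The re module also keeps an internal cache of recently used patterns, so the gain is often modest. WORD_RE and count_words below are hypothetical names used for illustration.

```python
import re

# Compile once at module level, then reuse the pattern object.
WORD_RE = re.compile(r"[A-Za-z']+")

def count_words(lines):
    """Count word-like tokens across an iterable of lines."""
    return sum(len(WORD_RE.findall(line)) for line in lines)

total = count_words(["the quick brown fox", "don't stop"])
print(total)
```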
Part 5: Text Processing
textwrap — Formatting Long Text
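A short sketch of the main textwrap helpers follows; the sample text and widths are arbitrary.

```python
import textwrap

text = "The quick brown fox jumps over the lazy dog and keeps running."

# fill wraps the text and joins the lines with newlines.
wrapped = textwrap.fill(text, width=30)

# shorten collapses whitespace and truncates to fit, adding a placeholder.
short = textwrap.shorten("Hello  world, this is long", width=15, placeholder="...")

# dedent removes common leading whitespace; indent adds a prefix to each line.
code = "    if x:\n        y = 1\n"
dedented = textwrap.dedent(code)
quoted = textwrap.indent(dedented, "> ")

print(wrapped)
print(short)
print(quoted)
```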
difflib — Fuzzy String Matching
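Two difflib tools cover most fuzzy-matching needs: get_close_matches for picking candidates from a list, and SequenceMatcher for a similarity score. A minimal sketch with sample words:

```python
import difflib

candidates = ["ape", "apple", "peach", "puppy"]

# Returns the best matches above a similarity cutoff, best first.
suggestions = difflib.get_close_matches("appel", candidates)

# ratio() returns a similarity score between 0.0 and 1.0.
score = difflib.SequenceMatcher(None, "abcd", "bcde").ratio()

print(suggestions)
print(score)
```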
unicodedata — Working with Unicode
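The same visible character can be encoded in more than one way, which is where normalization helps. The sketch below compares composed and decomposed forms and strips accents; strip_accents is a hypothetical helper, and dropping combining marks is only one possible approach.

```python
import unicodedata

composed = "\u00e9"     # single code point: e with acute accent
decomposed = "e\u0301"  # 'e' followed by a combining acute accent

# The two strings render the same but compare unequal until normalized.
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed

char_name = unicodedata.name(composed)

def strip_accents(s):
    """Decompose characters, then drop the combining marks."""
    decomposed_text = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed_text if not unicodedata.combining(c))

plain = strip_accents("caf\u00e9 na\u00efve")
print(equal_raw, equal_nfc, char_name, plain)
```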
ast.literal_eval — Safe String Evaluation
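ast.literal_eval parses a string containing a Python literal (numbers, strings, tuples, lists, dicts, sets, booleans, None) and rejects anything else, such as function calls. A minimal sketch; safe_parse is a hypothetical wrapper name.

```python
import ast

def safe_parse(source):
    """Return the literal value in `source`, or None if it is not a literal."""
    try:
        return ast.literal_eval(source)
    except (ValueError, SyntaxError):
        # Non-literal expressions raise ValueError; malformed text raises SyntaxError.
        return None

config = safe_parse("{'a': [1, 2, 3], 'b': (True, None)}")
rejected = safe_parse("__import__('os').getcwd()")
broken = safe_parse("[1, 2")

print(config, rejected, broken)
```

It avoids executing arbitrary code the way eval does, though very large or deeply nested input can still exhaust memory or the parser's limits.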
Project: Text Analysis Toolkit
Build a complete text analysis toolkit — word frequencies, sentence tokenizer, email/URL extractor, sensitive data redactor, and word cloud data generator.
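One possible starting point for three of the pieces is sketched below. All function names and both patterns are assumptions made for illustration; the email pattern in particular is a simplification and does not cover every valid address.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")
# Simplified email pattern: local part, '@', then dotted domain labels.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def word_frequencies(text, top=3):
    """Return the `top` most common lowercase words with their counts."""
    return Counter(WORD_RE.findall(text.lower())).most_common(top)

def extract_emails(text):
    """Return every substring that looks like an email address."""
    return EMAIL_RE.findall(text)

def redact_emails(text, placeholder="[REDACTED]"):
    """Replace email-like substrings with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)

sample = "Contact ada@example.com or bob@example.org. Ada writes code."
print(extract_emails(sample))
print(redact_emails(sample))
print(word_frequencies("the cat and the hat and the bat", top=2))
```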
Exercises
Easy
1. Palindrome Check — Write a function is_palindrome(s) that returns True if s is a palindrome, ignoring case and non-alphanumeric characters.
2. Title Case Fixer — Implement smart_title(s) that applies title case but keeps articles/prepositions lowercase unless they are the first word.
Medium
3. Regex Log Parser — Parse Apache/Nginx access log lines and return structured dicts.
4. Template Engine — Write a mini template engine that replaces {{variable}} placeholders.
5. Caesar Cipher — Using str.maketrans(), implement encrypt and decrypt.
Hard
6. String Compression — Implement run-length encoding and decoding.
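7. Word Wrap Algorithm — Implement a greedy word wrap, then a version that minimizes ragged right margins. Note that the greedy approach fills each line as far as possible but does not minimize raggedness; dynamic programming does.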
8. Regex-based CSV Parser — Parse CSV (handling quoted fields with commas and escaped quotes) without using the csv module.