String Manipulation, Regex & Text Processing
Strings as Immutable Sequences
A Python string (str) is an immutable sequence of Unicode code points. Every string operation that appears to "modify" a string actually creates a new string object. This matters for performance: concatenating strings in a loop with += is O(n²). Use str.join() instead.
Strings support all sequence operations — indexing, slicing, len, in — with the same semantics as tuples (immutable).
Essential String Methods
Python's str has over 40 methods. These are the ones you will use every day:
Format Specifications in f-strings
Python's f-string mini-language lets you control exactly how values are displayed. The format spec goes after the colon: f"{value:spec}".
Special String Literals
Regular Expressions
A regular expression (regex) is a pattern that describes a set of strings. Python's re module implements a full regex engine. The key insight: write patterns that match the structure of your data, not just the specific example in front of you.
Groups and Named Groups
Common Patterns
PROJECT: Log File Parser
A full Apache/Nginx log parser that generates realistic log data and extracts actionable statistics.
PROJECT: Data Extractor
Extract structured data from unstructured text using a portfolio of regex patterns.
Challenge
Sharpen your skills with these exercises:
-
Password validator — write a regex that checks a password meets all of: at least 8 chars, at least one uppercase, at least one lowercase, at least one digit, at least one special character (
!@#$%^&*). Return a list of unmet requirements. -
Markdown link extractor — write a function that extracts all links from Markdown text, returning a list of
(display_text, url)tuples. Handle[text](url)and[text][ref]reference-style links. -
CSV-to-dict parser — without using the
csvmodule, write a robust CSV parser using regex that handles quoted fields containing commas and newlines. -
Format a table — given a list of dicts (like database rows), write a
format_table(rows, headers=None)function that outputs a properly aligned ASCII table using f-string format specifications. Column widths should auto-fit to content. -
Log anonymizer — write a function that takes a log string and replaces: all IPv4 addresses with
[IP_REDACTED], all email addresses with[EMAIL_REDACTED], and all credit card numbers (16 digits, possibly space/dash-separated) with[CC_REDACTED].
Key Takeaways
- Strings are immutable Unicode sequences — concatenation in a loop is O(n²); use
str.join()for O(n) - The format mini-language (
f"{val:spec}") handles width, alignment, precision, and numeric bases — master it to eliminate manual string formatting code - Raw strings (
r"...") are essential for regex patterns and Windows paths where backslash is literal - Compile regex patterns you reuse with
re.compile()— the pattern is cached as a compiled finite automaton - Use named groups (
(?P<name>...)) in complex patterns; they make your code self-documenting and the match object easy to work with re.findall()returns strings;re.finditer()returns Match objects with positional information — preferfinditerwhen you need positionre.sub()with a callable replacement function is one of the most powerful text transformation tools in Python- Encoding is not optional — always specify
encoding="utf-8"when opening text files; never assume the platform default