GadaaLabs
Python Mastery — From Zero to AI Engineering
Lesson 4

String Manipulation, Regex & Text Processing

24 min

Strings as Immutable Sequences

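A Python string (str) is an immutable sequence of Unicode code points. Every string operation that appears to "modify" a string actually creates a new string object. This matters for performance: building a long string with += in a loop can take O(n²) time overall, because each concatenation may copy everything accumulated so far. CPython can sometimes resize the string in place, but that is an implementation detail, not a guarantee. Collect the pieces and use str.join() instead.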

python
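words = ["alpha", "beta", "gamma"]

# O(n²) in the worst case: each += may copy the whole string built so far
result = ""
for word in words:
    result += word + " "
# result == "alpha beta gamma " (note the trailing space)

# O(n): join sizes the result once and copies each piece once
result = " ".join(words)
# result == "alpha beta gamma"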

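Strings support the standard sequence operations (indexing, slicing, len(), in) and, like tuples, cannot be modified in place. One difference from tuples: in tests for a substring of any length, not just a single element.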

String indexing, slicing, immutability
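A minimal sketch of these operations. The sample string and variable names are made up for illustration:

```python
s = "Python"

# Indexing: positions start at 0; negative indices count from the end
first = s[0]           # 'P'
last = s[-1]           # 'n'

# Slicing: s[start:stop:step], where stop is exclusive
head = s[:2]           # 'Py'
tail = s[2:]           # 'thon'
reversed_s = s[::-1]   # 'nohtyP'

# len() and membership (in tests for a substring)
length = len(s)        # 6
has_th = "th" in s     # True

# Immutability: item assignment raises TypeError
raised = False
try:
    s[0] = "J"
except TypeError:
    raised = True

# "Changing" a string means building a new one; s itself is untouched
t = "J" + s[1:]        # 'Jython'

print(first, last, head, tail, reversed_s, length, has_th, raised, t)
```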

Essential String Methods

Python's str has over 40 methods. These are the ones you will use every day:

Key string methods
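The selection below is one reasonable reading of "everyday" methods, not an official list. All sample strings are made up:

```python
raw = "  Hello, World!  "

stripped = raw.strip()                           # 'Hello, World!'
lowered = stripped.lower()                       # 'hello, world!'
uppered = stripped.upper()                       # 'HELLO, WORLD!'
replaced = stripped.replace("World", "Python")   # 'Hello, Python!'
starts = stripped.startswith("Hello")            # True
ends = stripped.endswith("!")                    # True
position = stripped.find("World")                # 7
missing = stripped.find("xyz")                   # -1 (find never raises)
count_l = stripped.count("l")                    # 3

parts = "a,b,c".split(",")                       # ['a', 'b', 'c']
joined = "-".join(parts)                         # 'a-b-c'
words = "one  two\tthree\n".split()              # ['one', 'two', 'three']
key, sep, value = "name=Ada".partition("=")      # 'name', '=', 'Ada'
titled = "hello world".title()                   # 'Hello World'
is_num = "12345".isdigit()                       # True
padded = "42".zfill(5)                           # '00042'

print(stripped, replaced, parts, joined, words, (key, value), padded)
```

Note that split() with no argument splits on any run of whitespace and drops empty strings, while split(",") keeps empty fields.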

Format Specifications in f-strings

Python's f-string mini-language lets you control exactly how values are displayed. The format spec goes after the colon: f"{value:spec}".

f-string format specifications
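A short sketch of commonly used format specs. The values are arbitrary, and the f"{name=}" form assumes Python 3.8 or later:

```python
pi = 3.14159265
n = 1234567
ratio = 0.256
name = "Ada"

fixed = f"{pi:.2f}"        # '3.14'       two decimal places
thousands = f"{n:,}"       # '1,234,567'  thousands separator
percent = f"{ratio:.1%}"   # '25.6%'      multiply by 100, add %
left = f"{name:<8}|"       # 'Ada     |'  left-align in width 8
right = f"{name:>8}|"      # '     Ada|'  right-align in width 8
center = f"{name:*^9}"     # '***Ada***'  center, fill with *
binary = f"{10:b}"         # '1010'       base 2
hexa = f"{255:#x}"         # '0xff'       base 16 with prefix
zero_pad = f"{42:05d}"     # '00042'      zero-padded to width 5
sci = f"{n:.2e}"           # '1.23e+06'   scientific notation
debug = f"{name=}"         # "name='Ada'" expression plus repr

for label, shown in [("fixed", fixed), ("thousands", thousands), ("percent", percent)]:
    print(f"{label:<10} {shown:>12}")
```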

Special String Literals

Multi-line, raw, and byte strings
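A minimal sketch of the three literal forms. The path and sample text are hypothetical:

```python
# Triple-quoted strings span lines; the newlines are part of the value
poem = """Roses are red,
Violets are blue."""
line_count = len(poem.splitlines())    # 2

# Raw strings keep backslashes as literal characters
normal = "a\tb"                        # a, TAB, b          -> length 3
raw = r"a\tb"                          # a, backslash, t, b -> length 4
win_path = r"C:\Users\ada\notes.txt"   # hypothetical Windows path

# bytes hold raw 8-bit values; encode/decode convert to and from str
data = "café".encode("utf-8")          # b'caf\xc3\xa9'
text = data.decode("utf-8")            # 'café'

print(line_count, len(normal), len(raw), win_path, data, text)
```

One caveat: a raw string literal cannot end in a single backslash, so r"C:\dir\" is a syntax error.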

Regular Expressions

A regular expression (regex) is a pattern that describes a set of strings. Python's re module provides a backtracking regex engine with Perl-style syntax. The key insight: write patterns that match the structure of your data, not just the specific example in front of you.

re module basics
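A brief tour of the core re functions, as a sketch over a made-up sentence:

```python
import re

text = "Order 66 shipped on 2024-03-15; order 7 is pending."

# search: first match anywhere in the string
first = re.search(r"\d+", text)
first_value = first.group()            # '66'
first_span = first.span()              # (6, 8)

# match: only succeeds at the very start of the string
at_start = re.match(r"\d+", text)      # None, the text starts with 'O'

# fullmatch: the whole string must fit the pattern
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-03-15") is not None

# findall: every non-overlapping match, as a list
all_numbers = re.findall(r"\d+", text)        # ['66', '2024', '03', '15', '7']

# sub and split: replace or split on a pattern
masked = re.sub(r"\d", "#", "PIN 1234")       # 'PIN ####'
pieces = re.split(r"[;,]\s*", "a, b;c")       # ['a', 'b', 'c']

# compile: reuse a pattern, here with a flag
order_re = re.compile(r"order (\d+)", re.IGNORECASE)
order_ids = order_re.findall(text)            # ['66', '7']

print(first_value, first_span, at_start, is_date, all_numbers, masked, pieces, order_ids)
```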

Groups and Named Groups

Groups and named groups
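A small sketch contrasting numbered groups, named groups, and non-capturing groups. The date pattern is deliberately simple and does not validate month or day ranges:

```python
import re

# Numbered groups: parentheses capture, numbered from 1 left to right
date_re = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
m = date_re.search("Released 2024-03-15.")
whole = m.group(0)          # '2024-03-15' (group 0 is the full match)
numbered = m.groups()       # ('2024', '03', '15')

# Named groups: (?P<name>...) lets you look parts up by name
named_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
nm = named_re.search("Released 2024-03-15.")
year = nm.group("year")     # '2024'
month = nm["month"]         # '03' (Match objects support indexing)
as_dict = nm.groupdict()    # {'year': '2024', 'month': '03', 'day': '15'}

# Non-capturing group: (?:...) groups without creating a capture
units = re.findall(r"\d+(?:px|em)", "10px 2em 5pt")   # ['10px', '2em']

# Named groups can be referenced in a replacement with \g<name>
us_date = named_re.sub(r"\g<month>/\g<day>/\g<year>", "2024-03-15")

print(whole, numbered, year, month, as_dict, units, us_date)
```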

Common Patterns

Common regex patterns
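The patterns below are simplified illustrations, not validators. For example, the IPv4 pattern accepts 999.999.999.999 and the email pattern covers only common address shapes. The constant names and the sample sentence are assumptions made for this sketch:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
URL = re.compile(r"https?://[^\s<>\"']+")
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")

sample = (
    "Mail ops@example.com from 192.168.1.10 on 2024-03-15, "
    "see https://example.com/docs (#ff8800)."
)

emails = EMAIL.findall(sample)       # ['ops@example.com']
ips = IPV4.findall(sample)           # ['192.168.1.10']
dates = ISO_DATE.findall(sample)     # ['2024-03-15']
urls = URL.findall(sample)           # ['https://example.com/docs']
colors = HEX_COLOR.findall(sample)   # ['#ff8800']

print(emails, ips, dates, urls, colors)
```

Every group here is non-capturing on purpose: findall() returns the full match only when the pattern contains no capturing groups.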

PROJECT: Log File Parser

A full Apache/Nginx log parser that generates realistic log data and extracts actionable statistics.
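A compact sketch of such a parser, assuming the access log uses the Common Log Format line layout (IP, identity, user, timestamp, request, status, size). The sample lines are hand-written stand-ins for generated data, and the names LOG_RE, parse_line, and summarize are hypothetical:

```python
import re
from collections import Counter

# One named group per field of interest
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Made-up sample data (IPs come from documentation-only address ranges)
SAMPLE_LOG = """\
203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
203.0.113.5 - - [10/Oct/2024:13:55:40 +0000] "GET /missing HTTP/1.1" 404 153
198.51.100.7 - alice [10/Oct/2024:13:56:01 +0000] "POST /api/login HTTP/1.1" 500 -
this line is malformed
198.51.100.7 - - [10/Oct/2024:13:56:09 +0000] "GET /index.html HTTP/1.1" 200 2326
"""


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


def summarize(log_text):
    """Parse every line and aggregate a few basic statistics."""
    entries, skipped = [], 0
    for line in log_text.splitlines():
        entry = parse_line(line)
        if entry is None:
            skipped += 1
        else:
            entries.append(entry)
    errors = sum(1 for e in entries if e["status"] >= 400)
    return {
        "requests": len(entries),
        "skipped": skipped,
        "by_status": Counter(e["status"] for e in entries),
        "top_paths": Counter(e["path"] for e in entries).most_common(3),
        "top_ips": Counter(e["ip"] for e in entries).most_common(3),
        "bytes_sent": sum(e["size"] for e in entries),
        "error_rate": errors / len(entries) if entries else 0.0,
    }


stats = summarize(SAMPLE_LOG)
for key, value in stats.items():
    print(f"{key:<12} {value}")
```

Counting malformed lines instead of raising keeps the parser usable on real logs, where a few broken lines are common.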


PROJECT: Data Extractor

Extract structured data from unstructured text using a portfolio of regex patterns.
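One way to organize a "portfolio" is a dictionary that maps a label to a compiled pattern. The sketch below assumes four simplified patterns; the PATTERNS table, the extract helper, and the sample note are all hypothetical:

```python
import re

# Label -> compiled pattern; each pattern is intentionally simple
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
    "hashtag": re.compile(r"#\w+"),
}


def extract(text):
    """Run every pattern over the text and collect matches per label."""
    return {
        label: [m.group() for m in pattern.finditer(text)]
        for label, pattern in PATTERNS.items()
    }


note = (
    "Invoice sent 2024-03-15 to billing@example.com: total $1499.00, "
    "deposit $200. Due 2024-04-01. #invoice #q1"
)

found = extract(note)
for label, values in found.items():
    print(f"{label:<8} {values}")
```

Adding a new kind of field then means adding one entry to the table rather than touching the extraction logic.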


Challenge

Sharpen your skills with these exercises:

  1. Password validator — write a function that uses regexes to check that a password meets all of: at least 8 chars, at least one uppercase, at least one lowercase, at least one digit, at least one special character (!@#$%^&*). Return a list of unmet requirements (an empty list means the password is valid).

  2. Markdown link extractor — write a function that extracts all links from Markdown text, returning a list of (display_text, url) tuples. Handle [text](url) and [text][ref] reference-style links.

  3. CSV-to-dict parser — without using the csv module, write a robust CSV parser using regex that handles quoted fields containing commas and newlines.

  4. Format a table — given a list of dicts (like database rows), write a format_table(rows, headers=None) function that outputs a properly aligned ASCII table using f-string format specifications. Column widths should auto-fit to content.

  5. Log anonymizer — write a function that takes a log string and replaces: all IPv4 addresses with [IP_REDACTED], all email addresses with [EMAIL_REDACTED], and all credit card numbers (16 digits, possibly space/dash-separated) with [CC_REDACTED].

Key Takeaways

  • Strings are immutable Unicode sequences; repeated += in a loop can cost O(n²) overall, so collect the pieces and use str.join() for O(n)
  • The format mini-language (f"{val:spec}") handles width, alignment, precision, and numeric bases — master it to eliminate manual string formatting code
  • Raw strings (r"...") are essential for regex patterns and Windows paths where backslash is literal
  • Compile regex patterns you reuse with re.compile(): you get a named, reusable Pattern object and skip the module's internal cache lookup on every call
  • Use named groups ((?P<name>...)) in complex patterns; they make your code self-documenting and the match object easy to work with
  • re.findall() returns a list of strings (or tuples when the pattern has more than one group); re.finditer() yields Match objects with positional information, so prefer finditer when you need positions
  • re.sub() with a callable replacement function is one of the most powerful text transformation tools in Python
  • Encoding is not optional — always specify encoding="utf-8" when opening text files; never assume the platform default
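Two of the points above, sketched on a made-up sentence: re.sub() with a callable replacement, and the difference between findall() and finditer(). The helper name double is hypothetical:

```python
import re

text = "Widget costs 10 USD, gadget costs 25 USD."


def double(match):
    """Replacement callable: receives a Match, returns the new text."""
    return str(int(match.group()) * 2)


doubled = re.sub(r"\d+", double, text)
# 'Widget costs 20 USD, gadget costs 50 USD.'

# findall gives only the matched text
values = re.findall(r"\d+", text)          # ['10', '25']

# finditer gives Match objects, so positions are available too
located = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]
# [('10', 13, 15), ('25', 34, 36)]

print(doubled)
print(values, located)
```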