GadaaLabs
Python Mastery — From Zero to AI Engineering
Lesson 13

Concurrency — Threading, Multiprocessing & asyncio

35 min

Part 1: Concurrency vs Parallelism

These terms are often used interchangeably, but they have precise and distinct meanings:

Concurrency — the ability to manage multiple tasks at once. Tasks may be interleaved on a single CPU core; when one pauses (waiting for I/O), another runs. No two tasks necessarily execute at the same physical instant.

Parallelism — the ability to execute multiple tasks simultaneously on multiple CPU cores. True parallelism requires multiple processors.

A juggler managing three balls is concurrent (one ball in hand at a time). Three jugglers each with one ball are parallel. Python can do both — but through different mechanisms, and with an important constraint.

CPU-Bound vs I/O-Bound

This distinction determines which concurrency model to use:

I/O-bound tasks spend most of their time waiting: for a network response, a disk read, a database query. The CPU is idle during the wait. Examples: HTTP requests, file reads, database queries, subprocess calls.

CPU-bound tasks spend most of their time computing: number crunching, image processing, ML inference, encryption. The CPU is always busy.

Task timeline: I/O-bound
  Thread A: [compute][WAITING FOR NETWORK ........][compute]
  Thread B:           [compute][WAITING FOR DB ....][compute]
  Overlap: threads can share a core productively

Task timeline: CPU-bound
  Thread A: [compute compute compute compute compute]
  Thread B: [compute compute compute compute compute]
  Overlap: they compete for CPU — no benefit on one core

Python's Global Interpreter Lock (GIL)

The GIL is a mutex inside CPython (the reference Python implementation) that ensures only one thread executes Python bytecode at a time, even on a multi-core machine. It exists because CPython's memory management (reference counting) is not thread-safe — without the GIL, two threads decrementing the same object's reference count could corrupt memory.

Consequences:

  • Multi-threaded Python programs cannot parallelize pure-Python CPU work
  • The GIL is released during I/O operations (network, disk), so threads can overlap I/O waits
  • C extensions like NumPy often release the GIL during computation, allowing thread-level parallelism for those operations
  • multiprocessing bypasses the GIL entirely — separate processes have separate interpreters
python
# Illustration: GIL impact on CPU-bound work
import threading
import time

def count(n):
    total = 0
    for i in range(n):
        total += i
    return total

N = 50_000_000

# Sequential: runs in ~2s
start = time.time()
count(N)
count(N)
print(f"Sequential: {time.time() - start:.2f}s")

# Two threads: also ~2s (GIL prevents true parallelism)
start = time.time()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print(f"Two threads (GIL): {time.time() - start:.2f}s")
# For I/O-bound work, threads DO help because GIL is released during I/O

The Three Models at a Glance

| Model | Best for | True parallelism? | Overhead | Complexity |
|---|---|---|---|---|
| threading | I/O-bound | No (GIL) | Low | Medium |
| multiprocessing | CPU-bound | Yes | High | High |
| asyncio | High-concurrency I/O | No | Very low | Medium |


Part 2: threading — Deep Dive

Thread Lifecycle and Core API

A thread transitions through states: New (created, not started) → Runnable (started, waiting for the GIL) → Running (executing bytecode) → Blocked (waiting for I/O or a lock) → Terminated (finished or killed by an exception).

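A minimal sketch of the core API against those states (the worker function and delays here are illustrative):

python
import threading
import time

def worker(name, delay):
    print(f"{name}: running")      # Running: executing bytecode
    time.sleep(delay)              # Blocked: sleeping (or waiting on I/O)
    print(f"{name}: finishing")    # about to move to Terminated

t = threading.Thread(target=worker, args=("worker-1", 0.5), name="worker-1")
print(t.is_alive())    # False: New, created but not started
t.start()              # Runnable; Running once it acquires the GIL
print(t.is_alive())    # True: started, not yet terminated
t.join(timeout=2)      # block the caller until the thread finishes
print(t.is_alive())    # False: Terminated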

Race Conditions — Why Shared State is Dangerous

A race condition occurs when the result depends on the timing of thread execution. The classic example: incrementing a shared counter.

The += operator looks atomic, but at the bytecode level it is a load, an add, and a store. The OS scheduler can interrupt a thread between any two of those steps.

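A sketch of both versions, with four threads hammering a shared global (whether the unsafe run actually loses updates on a given machine and interpreter varies):

python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n):
    global counter
    for _ in range(n):
        counter += 1            # load, add, store: threads can interleave

def safe_increment(n):
    global counter
    for _ in range(n):
        with lock:              # only one thread at a time in here
            counter += 1

def run(target):
    global counter
    counter = 0
    threads = [threading.Thread(target=target, args=(100_000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    return counter

print(f"Without lock: {run(unsafe_increment)}")   # often less than 400000
print(f"With lock:    {run(safe_increment)}")     # always 400000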

All Synchronization Primitives

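A compact tour of three primitives: Semaphore (cap concurrent access), Event (one-shot broadcast flag), and Barrier (rendezvous point). The worker function and counts are illustrative:

python
import threading
import time

sem = threading.Semaphore(2)      # at most 2 threads inside at once
go = threading.Event()            # one-shot broadcast flag
barrier = threading.Barrier(3)    # all 3 workers rendezvous here

def worker(i):
    go.wait()                     # block until the main thread sets the flag
    with sem:                     # occupy one of the 2 slots
        print(f"worker-{i}: in the limited section")
        time.sleep(0.1)
    barrier.wait()                # wait until all 3 workers arrive
    print(f"worker-{i}: past the barrier")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads: t.start()
go.set()                          # release every waiter at once
for t in threads: t.join()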

Thread-Local Storage and Timer

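A short sketch: threading.local gives each thread its own private attribute namespace, and threading.Timer runs a callable once after a delay (the request_id attribute is just an example):

python
import threading

local = threading.local()         # each thread sees its own attributes

def handle(request_id):
    local.request_id = request_id             # invisible to other threads
    print(f"{threading.current_thread().name}: request {local.request_id}")

threads = [threading.Thread(target=handle, args=(i,), name=f"t{i}") for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()

# Timer: a Thread subclass that runs a callable once after a delay
timer = threading.Timer(0.5, lambda: print("timer fired"))
timer.start()                     # timer.cancel() aborts it before it fires
timer.join()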

threading.Condition — Wait/Notify Pattern

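A bounded-buffer sketch with one producer and one consumer (buffer size and item count are arbitrary). The key discipline: always re-check the predicate in a while loop, because wait() can return even when the condition is not yet true:

python
import threading
from collections import deque

buffer, MAX = deque(), 3
cond = threading.Condition()

def producer():
    for i in range(6):
        with cond:
            while len(buffer) >= MAX:   # re-check in a loop, never a bare if
                cond.wait()             # releases the lock while waiting
            buffer.append(i)
            cond.notify_all()           # wake any waiting consumer

def consumer():
    for _ in range(6):
        with cond:
            while not buffer:
                cond.wait()
            item = buffer.popleft()
            cond.notify_all()           # wake the producer if it is blocked
        print(f"consumed {item}")

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start(); p.join(); c.join()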

Producer-Consumer with queue.Queue

queue.Queue is the idiomatic Python mechanism for inter-thread communication. It is fully thread-safe, its put() and get() calls block (with optional timeouts), and the queue module also provides LifoQueue and PriorityQueue variants.

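A minimal producer-consumer sketch using a None sentinel to signal shutdown (item names and counts are placeholders):

python
import threading
import queue

q = queue.Queue(maxsize=5)        # put() blocks when full, get() when empty

def producer():
    for i in range(8):
        q.put(f"item-{i}")
    q.put(None)                   # sentinel: tells the consumer to stop

def consumer():
    while True:
        item = q.get()
        if item is None:
            break
        print(f"processed {item}")

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start(); p.join(); c.join()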

ThreadPoolExecutor — The Right Way to Manage Threads

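A sketch of the two main patterns: map() for results in input order, and submit() plus as_completed() for results in completion order. The fetch function only simulates network I/O with time.sleep, and the URLs are placeholders:

python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def fetch(url):
    time.sleep(0.2)               # stand-in for a blocking network call
    return url, 200

urls = [f"https://example.com/{i}" for i in range(6)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map(): blocks, yields results in input order
    for url, status in pool.map(fetch, urls):
        print(f"{url} -> {status}")

    # submit() + as_completed(): yields futures as they finish
    futures = {pool.submit(fetch, u): u for u in urls}
    for fut in as_completed(futures):
        url, status = fut.result()    # re-raises any worker exception here
        print(f"done: {url}")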

Part 3: multiprocessing — CPU Parallelism

Each Process has its own Python interpreter and memory space. No GIL contention. True parallelism on multi-core machines.

The tradeoff: inter-process communication requires serialization (pickling), and process startup takes ~50–100ms.

python
from multiprocessing import Process, Pool, Queue, Pipe, Value, Array, Manager
import os

def cpu_intensive(n):
    """Pure CPU work — each process uses a full core."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":    # REQUIRED guard on Windows/macOS with spawn
    # Direct Process usage
    p = Process(target=cpu_intensive, args=(1_000_000,))
    p.start()
    p.join()   # Wait for completion
    print(f"Process exit code: {p.exitcode}")  # 0 = success

Spawn vs Fork — Start Methods

python
import multiprocessing as mp

# Three start methods:
# 'fork'       — Copy parent process (fast, Unix only, can cause issues with threads)
# 'spawn'      — Start fresh Python interpreter (safe, Windows default, macOS default since 3.8)
# 'forkserver' — Dedicated server process handles forking (Unix, safer than fork)

# Set globally (call before creating any processes):
mp.set_start_method('spawn')   # Recommended for portability

# Or per-context:
ctx = mp.get_context('fork')
p = ctx.Process(target=cpu_intensive, args=(100_000,))

Sharing State Between Processes

Because processes have separate memory, sharing data requires special objects:

python
from multiprocessing import Process, Value, Array, Lock
import ctypes

def increment_shared(counter, lock, n):
    for _ in range(n):
        with lock:
            counter.value += 1

def write_array(arr, idx, val):
    arr[idx] = val * val

if __name__ == "__main__":
    # Value: single shared primitive
    counter = Value('i', 0)    # 'i' = C int, 'd' = C double, 'b' = signed char
    lock = Lock()

    procs = [Process(target=increment_shared, args=(counter, lock, 10_000))
             for _ in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(f"Counter: {counter.value}")  # 40000

    # Array: shared C array
    arr = Array('d', [0.0] * 8)   # 'd' = C double
    procs = [Process(target=write_array, args=(arr, i, i))
             for i in range(8)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(f"Array: {list(arr)}")

multiprocessing.Queue and Pipe

python
from multiprocessing import Process, Queue, Pipe

# Queue: multi-producer multi-consumer, process-safe
def producer_proc(q):
    for i in range(5):
        q.put(f"item-{i}")
    q.put(None)  # Sentinel

def consumer_proc(q):
    while True:
        item = q.get()
        if item is None: break
        print(f"Got: {item}")

if __name__ == "__main__":
    q = Queue(maxsize=10)
    p = Process(target=producer_proc, args=(q,))
    c = Process(target=consumer_proc, args=(q,))
    p.start(); c.start()
    p.join(); c.join()

# Pipe: bidirectional (or unidirectional) channel between exactly 2 processes
def pipe_worker(conn):
    msg = conn.recv()           # Receive from parent
    conn.send(f"Echo: {msg}")   # Send back
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe(duplex=True)
    p = Process(target=pipe_worker, args=(child_conn,))
    p.start()
    parent_conn.send("hello from parent")
    response = parent_conn.recv()
    print(response)
    p.join()

ProcessPoolExecutor — Recommended for Most CPU Work

python
from concurrent.futures import ProcessPoolExecutor, as_completed
import math

def factorize(n):
    """CPU-bound: prime factorization."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def expensive_computation(n):
    """Simulate heavy CPU work."""
    result = sum(math.log(i + 1) * math.sqrt(i) for i in range(n))
    return n, round(result, 4)

if __name__ == "__main__":
    numbers = list(range(50_000, 50_020))

    # map(): simple, ordered, blocks until all done
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(factorize, numbers, chunksize=5))
    print(f"Factorized {len(results)} numbers")

    # submit() + as_completed(): stream results as they finish
    workloads = [100_000 + i * 10_000 for i in range(8)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(expensive_computation, n): n for n in workloads}
        for future in as_completed(futures):
            n, result = future.result()
            print(f"  n={n}: {result}")

Pickling Constraint

Multiprocessing requires arguments and return values to be picklable. This means:

  • Built-in types: int, float, str, list, dict, tuple — always picklable
  • Functions defined at module level — picklable
  • Lambda functions — NOT picklable
  • Nested functions — NOT picklable (define workers at module level; functools.partial is only picklable if the function it wraps is)
  • File objects, sockets, database connections — NOT picklable
python
import pickle

# Test if something is picklable:
def is_picklable(obj):
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

print(is_picklable(42))          # True
print(is_picklable([1, 2, 3]))   # True
print(is_picklable(lambda x: x)) # False — lambdas not picklable

Part 4: asyncio — Event-Driven Async I/O

asyncio uses a single thread with a cooperative event loop. Coroutines voluntarily yield control with await. No OS context switching, no GIL issues — scales to thousands of concurrent connections.

Coroutines, Tasks, and the Event Loop

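A minimal sketch of both patterns (names and delays are illustrative):

python
import asyncio

async def fetch(name, delay):
    await asyncio.sleep(delay)        # yields control to the event loop
    return f"{name}: done after {delay}s"

async def main():
    # gather(): run coroutines concurrently, results come back in order
    results = await asyncio.gather(fetch("a", 0.3), fetch("b", 0.1))
    print(results)

    # create_task(): schedule now, await later
    task = asyncio.create_task(fetch("background", 0.2))
    print("task scheduled; doing other work...")
    print(await task)

asyncio.run(main())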

asyncio.wait — Fine-Grained Control

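A sketch using return_when=FIRST_EXCEPTION; the job coroutine and its failure condition are contrived for illustration:

python
import asyncio

async def job(i):
    await asyncio.sleep(0.1 * i)
    if i == 2:
        raise ValueError(f"job {i} failed")
    return i

async def main():
    tasks = [asyncio.create_task(job(i)) for i in range(4)]
    # return_when: FIRST_COMPLETED, FIRST_EXCEPTION, or ALL_COMPLETED
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
    for t in done:
        if t.exception():
            print(f"failed: {t.exception()}")
        else:
            print(f"ok: {t.result()}")
    for t in pending:                 # clean up whatever is still running
        t.cancel()

asyncio.run(main())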

asyncio Timeouts and Cancellation

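A sketch of both mechanisms; slow_operation is a stand-in for any long-running coroutine:

python
import asyncio

async def slow_operation():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        print("cleaning up...")
        raise                          # always re-raise CancelledError

async def main():
    # wait_for cancels the inner coroutine when the timeout expires
    try:
        await asyncio.wait_for(slow_operation(), timeout=0.5)
    except asyncio.TimeoutError:
        print("timed out")

    # Manual cancellation
    task = asyncio.create_task(slow_operation())
    await asyncio.sleep(0.1)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        print("task cancelled")

asyncio.run(main())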

asyncio Synchronization Primitives

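A sketch with Semaphore and Event, the async counterparts of the threading primitives; note the async with syntax (a plain with statement does not work on asyncio locks). The worker and counts are illustrative:

python
import asyncio

async def worker(i, sem, go):
    await go.wait()                 # wait for the start signal
    async with sem:                 # note: async with, not plain with
        print(f"worker {i}: running")
        await asyncio.sleep(0.1)

async def main():
    sem = asyncio.Semaphore(2)      # at most 2 workers inside at once
    go = asyncio.Event()
    tasks = [asyncio.create_task(worker(i, sem, go)) for i in range(4)]
    go.set()                        # release all waiters at once
    await asyncio.gather(*tasks)

asyncio.run(main())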

async for, async with, and run_in_executor

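A sketch: a hand-rolled async iterator to drive async for, plus run_in_executor to push a blocking call onto a thread pool so the event loop stays responsive (AsyncRange and blocking_io are made-up helpers):

python
import asyncio
import time

class AsyncRange:                      # hand-rolled async iterator
    def __init__(self, n):
        self.i, self.n = 0, n
    def __aiter__(self):
        return self
    async def __anext__(self):
        if self.i >= self.n:
            raise StopAsyncIteration
        await asyncio.sleep(0.05)      # pretend each item requires I/O
        self.i += 1
        return self.i - 1

def blocking_io():
    time.sleep(0.3)                    # a sync call that would freeze the loop
    return "blocking result"

async def main():
    async for item in AsyncRange(3):   # async for drives __anext__
        print(f"item {item}")
    loop = asyncio.get_running_loop()
    # run_in_executor: offload blocking work to the default thread pool
    result = await loop.run_in_executor(None, blocking_io)
    print(result)

asyncio.run(main())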

Part 5: Choosing the Right Model

Is your task I/O-bound? (waiting for network, disk, database)
  YES:
    Will you have 100+ concurrent operations?
      YES  → asyncio (lowest overhead, highest throughput)
      NO   → threading or ThreadPoolExecutor (simpler code)
    Is your I/O library async-aware (aiohttp, asyncpg, etc.)?
      YES  → asyncio
      NO   → threading (works with any sync library)
  NO (CPU-bound):
    Is it NumPy/Pandas/C extension code?
      YES  → threading may work (GIL released by C code)
      NO   → multiprocessing / ProcessPoolExecutor

Threading over asyncio when:

  • Working with legacy sync libraries (requests, psycopg2, etc.)
  • Simpler code is more important than maximum throughput
  • You have moderate concurrency (< 100 threads)

asyncio over threading when:

  • Building high-concurrency servers (thousands of connections)
  • Your I/O libraries support async (aiohttp, asyncpg, motor)
  • You want predictable, cooperative scheduling

multiprocessing when:

  • Pure Python CPU work (no NumPy)
  • Work per task takes > 100ms (startup overhead worthwhile)
  • You need true parallelism and can tolerate pickling constraints

Part 6: Advanced Patterns

Rate Limiting Async Calls

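A sketch of a token bucket: tokens refill at a fixed rate, each call consumes one, and callers sleep when the bucket is empty. The TokenBucket class and its parameters are illustrative, not a library API:

python
import asyncio
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self):
        async with self._lock:
            while True:
                now = time.monotonic()
                # refill proportionally to elapsed time, capped at capacity
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                await asyncio.sleep((1 - self.tokens) / self.rate)

async def call_api(bucket, i):
    await bucket.acquire()
    print(f"{time.monotonic():.2f}: request {i}")

async def main():
    bucket = TokenBucket(rate=5, capacity=5)   # roughly 5 requests/sec
    await asyncio.gather(*(call_api(bucket, i) for i in range(12)))

asyncio.run(main())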

Graceful Shutdown

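A sketch of one common shutdown pattern: drain the queue with join(), then cancel the now-idle workers and await them with return_exceptions=True so the CancelledErrors are absorbed. Worker count and items are arbitrary:

python
import asyncio

async def worker(q):
    while True:
        item = await q.get()
        try:
            await asyncio.sleep(0.1)        # process the item
            print(f"processed {item}")
        finally:
            q.task_done()

async def main():
    q = asyncio.Queue()
    workers = [asyncio.create_task(worker(q)) for _ in range(3)]
    for i in range(9):
        q.put_nowait(i)

    await q.join()                  # wait until every item is processed
    for w in workers:               # the workers are now idle: cancel them
        w.cancel()
    # return_exceptions=True absorbs the resulting CancelledErrors
    await asyncio.gather(*workers, return_exceptions=True)
    print("clean shutdown")

asyncio.run(main())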

PROJECT: Concurrent File Downloader

A simulated file downloader demonstrating threading with progress tracking, timeout handling, and retry logic.

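A minimal sketch of the idea, using time.sleep to fake network latency (filenames, failure rate, and retry counts are all arbitrary):

python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

FILES = [f"file-{i}.bin" for i in range(6)]     # hypothetical filenames

def download(name, max_retries=3):
    for attempt in range(1, max_retries + 1):
        time.sleep(random.uniform(0.1, 0.4))    # simulated network latency
        if random.random() < 0.3:               # simulated transient failure
            print(f"{name}: attempt {attempt} failed")
            continue
        return f"{name}: downloaded on attempt {attempt}"
    raise RuntimeError(f"{name}: gave up after {max_retries} attempts")

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(download, f): f for f in FILES}
    for fut in as_completed(futures, timeout=30):   # overall deadline
        try:
            print(fut.result())
        except RuntimeError as exc:
            print(f"FAILED: {exc}")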

Exercises

Exercise 1 — Thread-safe counter class


Exercise 2 — Async rate limiter: fetch N URLs with max M/sec


Exercise 3 — asyncio retry with exponential backoff


Exercise 4 — Thread pool for parallel file processing (simulated)


Exercise 5 — asyncio Pipeline: producer → transformer → writer


Exercise 6 — Semaphore for connection pool


Exercise 7 — Compute fibonacci in executor (don't block event loop)


Exercise 8 — Observer pattern with threading.Event


Key Takeaways

  • The GIL prevents parallel bytecode execution in threads — it makes single-threaded CPython safe but limits multi-threaded CPU work
  • Threading is ideal for I/O-bound work: the GIL is released during I/O, enabling true concurrency on a single core
  • Multiprocessing bypasses the GIL by running separate Python interpreters — true CPU parallelism, but higher overhead and pickling constraints
  • asyncio uses cooperative multitasking on a single thread — lowest overhead, highest concurrency, but requires async-aware code
  • threading.Lock prevents race conditions; always use with lock: over manual acquire/release
  • ThreadPoolExecutor and ProcessPoolExecutor share the same concurrent.futures API — swapping one for the other is a one-word change
  • asyncio.gather() runs coroutines concurrently; asyncio.create_task() schedules them immediately without awaiting
  • asyncio.Semaphore caps concurrent connections; asyncio.wait_for enforces timeouts; always cancel pending tasks
  • queue.Queue (threads) and asyncio.Queue are the canonical producer-consumer patterns — prefer them over raw shared state
  • Profile before choosing: I/O-bound or CPU-bound? How many concurrent tasks? Answers determine the right model