GadaaLabs
Python Mastery — From Zero to AI Engineering
Lesson 12

APIs, Web Scraping & Async HTTP

26 min

HTTP Fundamentals

Every web API interaction is an HTTP message exchange. Understanding the protocol makes debugging trivial and error handling natural.

HTTP Methods — each has a semantic contract:

| Method | Semantic | Body? | Idempotent? |
|--------|----------|-------|-------------|
| GET | Read a resource | No | Yes |
| POST | Create a resource | Yes | No |
| PUT | Replace a resource entirely | Yes | Yes |
| PATCH | Partially update a resource | Yes | No |
| DELETE | Remove a resource | No | Yes |

Idempotent means calling it N times has the same effect as calling it once. A GET on /users/42 always returns the same user (assuming no concurrent writes). A POST to /orders creates a new order each time.

Status Code Groups:

  • 1xx — Informational (rarely seen in REST APIs)
  • 2xx — Success: 200 OK, 201 Created, 204 No Content
  • 3xx — Redirection: 301 Moved Permanently, 302 Found (temporary)
  • 4xx — Client error: 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 422 Unprocessable Entity, 429 Too Many Requests
  • 5xx — Server error: 500 Internal Server Error, 503 Service Unavailable

Key Headers:

Request Headers:
  Authorization: Bearer <token>       ← authentication
  Content-Type: application/json      ← body format
  Accept: application/json            ← preferred response format
  User-Agent: MyApp/1.0               ← identifies your client

Response Headers:
  Content-Type: application/json      ← body format of response
  X-RateLimit-Remaining: 42           ← rate limit info
  Retry-After: 60                     ← seconds to wait before retrying

urllib: The Standard Library HTTP Client

Pyodide (which runs Python in your browser) doesn't include the requests library, but urllib is part of the standard library and always available. In practice, you'd use requests or httpx in real projects — they have better ergonomics — but urllib teaches the underlying mechanics clearly.

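To make those mechanics concrete, here is a minimal sketch of building a request with urllib. The URL and headers are placeholders; the actual send is left commented out because it needs a live endpoint:

```python
import urllib.error
import urllib.request

# Constructing a Request does no network I/O — it just builds the message
req = urllib.request.Request(
    "https://api.example.com/users/42",   # placeholder URL
    headers={
        "Accept": "application/json",
        "User-Agent": "MyApp/1.0",
    },
)

method = req.get_method()          # "GET" — the default when no body is attached
accept = req.get_header("Accept")  # headers are retrievable by name

# Actually sending it would look like this (requires a live endpoint):
# try:
#     with urllib.request.urlopen(req, timeout=10) as resp:
#         body = resp.read().decode("utf-8")   # then json.loads(body)
# except urllib.error.HTTPError as e:          # raised for 4xx/5xx responses
#     print(e.code, e.reason)
```

Note that `urlopen` raises `HTTPError` for 4xx/5xx — unlike requests/httpx, where you must opt in via `raise_for_status()`.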

JSON APIs: Parsing and Authentication

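A sketch of defensive JSON parsing plus the usual auth headers. The response body and its field names are invented for the demo, and the token is a placeholder:

```python
import json

# Canned response body — a stand-in for resp.read()
raw = b'{"user": {"id": 42, "name": "Ayantu", "roles": ["admin"]}}'
payload = json.loads(raw.decode("utf-8"))

# Defensive access: .get() with defaults, so missing fields don't raise KeyError
user = payload.get("user", {})
name = user.get("name", "<unknown>")
roles = user.get("roles", [])
email = user.get("email")          # absent in this payload → None, not a crash

# Typical request headers for an authenticated JSON API
headers = {
    "Authorization": "Bearer <your-token-here>",
    "Accept": "application/json",
}
```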

Error Handling for HTTP: Retries, Timeouts, and Circuit Breakers

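The retry policy can be sketched as two pieces: a whitelist of retryable status codes and a full-jitter backoff schedule. The constants here are illustrative defaults, not fixed rules:

```python
import random

# Only 429 and transient 5xx codes are worth retrying;
# 400/404 mean our request itself is wrong — retrying just masks the bug
RETRYABLE = {429, 500, 502, 503, 504}

def should_retry(status: int) -> bool:
    return status in RETRYABLE

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Delays grow exponentially on average but are randomized,
# so a crowd of clients doesn't retry in lockstep
delays = [backoff_delay(a) for a in range(5)]
```

A real retry loop would also honor the server's Retry-After header when present, preferring it over the computed delay.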

Web Scraping Concepts & HTML Parsing

Web scraping extracts structured data from HTML pages. In a browser-based environment we can't make network requests, but we can demonstrate the full parsing logic with mock HTML:

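Along those lines, here is a sketch using the stdlib html.parser on mock HTML — the markup, class names, and products are all invented for the demo (in real projects you would reach for BeautifulSoup or lxml, which handle messy HTML far more gracefully):

```python
from html.parser import HTMLParser

MOCK_HTML = """
<ul class="products">
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul>
"""

class ProductParser(HTMLParser):
    """Collects {"name": ..., "price": ...} dicts from li.product elements."""

    def __init__(self):
        super().__init__()
        self._field = None    # which field the next text node belongs to
        self.products = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "product":
            self.products.append({})          # start a new product record
        elif tag == "span" and cls in ("name", "price"):
            self._field = cls                  # remember where the text goes

    def handle_data(self, data):
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()
            self._field = None

parser = ProductParser()
parser.feed(MOCK_HTML)
```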

Async HTTP with asyncio

Synchronous HTTP is simple but wasteful. When making 10 API calls to 10 different servers, synchronous code waits for each one to complete before starting the next. With I/O-bound work, asyncio lets you start all 10 requests, then collect results as they arrive — the total time becomes roughly equal to the slowest single request, not the sum of all.

python
# Real async HTTP with aiohttp (install with pip install aiohttp)
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.json()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)

# All requests run concurrently
results = asyncio.run(fetch_all([
    "https://api.example.com/data/1",
    "https://api.example.com/data/2",
    "https://api.example.com/data/3",
]))

The key concept: await suspends this coroutine and gives control back to the event loop, which can run other coroutines while we wait for I/O. No threads, no locks — just cooperative multitasking.

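Since aiohttp isn't available in the browser, the same concurrency pattern can be demonstrated with asyncio.sleep standing in for network latency — the total elapsed time tracks the slowest "request", not the sum:

```python
import asyncio
import time

async def fake_fetch(name, latency):
    # await suspends this coroutine; the event loop runs the others meanwhile
    await asyncio.sleep(latency)
    return name

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_fetch("a", 0.1),
        fake_fetch("b", 0.2),
        fake_fetch("c", 0.3),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
# elapsed is roughly 0.3s (the slowest), not 0.6s (the sum)
```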

httpx: Modern HTTP for Sync and Async

httpx is the modern successor to requests. It has an almost identical API for synchronous use, but also supports async:

python
import httpx

# Synchronous (drop-in replacement for requests)
with httpx.Client(base_url="https://api.example.com",
                  headers={"Authorization": "Bearer token"},
                  timeout=10.0) as client:
    response = client.get("/users/1")
    response.raise_for_status()  # raises HTTPStatusError for 4xx/5xx
    user = response.json()

# Async
import asyncio

async def fetch_async():
    async with httpx.AsyncClient(base_url="https://api.example.com") as client:
        response = await client.get("/users/1")
        return response.json()

result = asyncio.run(fetch_async())

Key advantages over requests:

  • Built-in async support with httpx.AsyncClient
  • HTTP/2 support
  • raise_for_status() is clean and standard
  • Connection pooling via context managers
  • Timeouts configurable per-operation or per-client

PROJECT: Weather Data Aggregator

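One way the aggregator's core might look. The station names, response shapes, and readings are invented; a real version would first fetch each source's JSON, then feed the parsed dicts into the same aggregation step:

```python
# Mock per-source responses — stand-ins for parsed JSON from each weather API
MOCK_RESPONSES = {
    "station-a": {"city": "Addis Ababa", "temp_c": 21.0},
    "station-b": {"city": "Addis Ababa", "temp_c": 23.0},
    "station-c": {"city": "Addis Ababa"},   # missing reading → skipped, not a crash
}

def aggregate(responses):
    """Combine readings defensively: ignore sources missing the temp_c field."""
    temps = [r["temp_c"] for r in responses.values() if "temp_c" in r]
    if not temps:
        return None
    return {
        "sources": len(temps),
        "avg_temp_c": sum(temps) / len(temps),
        "min_temp_c": min(temps),
        "max_temp_c": max(temps),
    }

summary = aggregate(MOCK_RESPONSES)
```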

PROJECT: Async URL Processor

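A sketch of the processor's error-isolation pattern, with asyncio.sleep simulating the network and a made-up "bad" URL triggering a failure. The key is return_exceptions=True, so one failed URL doesn't cancel the rest of the batch:

```python
import asyncio

async def process(url):
    await asyncio.sleep(0.01)            # simulated network latency
    if "bad" in url:                      # contrived failure condition
        raise ValueError(f"failed: {url}")
    return f"ok: {url}"

async def process_all(urls):
    # return_exceptions=True → failures come back as values, not cancellations
    results = await asyncio.gather(
        *(process(u) for u in urls), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    return ok, failed

ok, failed = asyncio.run(process_all([
    "https://a.example/1",
    "https://bad.example/2",
    "https://a.example/3",
]))
```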

Rate Limiting and Being a Good API Citizen

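A minimal client-side token bucket, sketched with illustrative rate and capacity values. Tokens refill continuously at `rate` per second up to `capacity`; each request spends one token, and when the bucket is empty the caller should wait (or honor a Retry-After header):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity            # start with a full burst allowance
        self.last = time.monotonic()

    def acquire(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # out of tokens — back off before retrying

bucket = TokenBucket(rate=5, capacity=2)
allowed = [bucket.acquire() for _ in range(4)]
# The first two pass on the initial burst; back-to-back calls after that are denied
```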

Key Takeaways

  • HTTP methods carry semantic contracts: GET is safe and idempotent, POST creates resources, PUT replaces, PATCH partially updates — using the wrong verb breaks caching and causes unexpected client behavior
  • 4xx = client's fault, 5xx = server's fault: only 5xx and 429 should trigger retries; retrying on 400/404 is wasteful and masks bugs in your code
  • Exponential backoff with jitter prevents thundering herds: when many clients retry simultaneously, random jitter spreads the load; full jitter (random(0, min(cap, base · 2^attempt))) is better than equal-interval retries
  • urllib.request is stdlib; requests/httpx are ergonomic wrappers: in constrained environments urllib works everywhere; in real projects httpx adds HTTP/2, async support, and cleaner error handling
  • Concurrent async HTTP takes O(max_latency) time, not the sum: asyncio.gather(task1, task2, ..., task10) sends all requests at once and waits for the slowest — and coroutines are far lighter on memory than one thread per request
  • asyncio.gather with return_exceptions=True prevents one failure from aborting the rest: without it, a single failed request in a batch cancels all others; use it when partial results are useful
  • Rate limiting is your responsibility as an API consumer: implement token-bucket or leaky-bucket limiting client-side, respect Retry-After headers, and cache aggressively
  • Parse and validate API responses defensively: use .get() with defaults, check for missing fields, and handle API version changes — remote data is untrusted data