The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991) guarantees that a neural network with a single hidden layer containing enough neurons can approximate any continuous function on a compact subset of Rⁿ to arbitrary precision. This is the mathematical foundation.
In practice, networks succeed because of three ideas working together:
Differentiable composition — every operation in the network (matmul, ReLU, softmax) has a computable gradient. This means we can propagate error signals backward through the entire network.
Stochastic gradient descent — we don't need the optimal parameter update at each step; a noisy direction that's mostly correct is enough. Given enough steps and a sensible learning rate, the noise averages out and the loss keeps decreasing.
Efficient computation — GPUs can perform thousands of matrix-multiply operations in parallel. A modern GPU delivers 20+ teraFLOPS (trillions of floating-point operations per second).
The actual mechanics: a network is a parameterized function y = f(x; θ). We define a loss L(y, y_true) measuring how wrong the prediction is. We compute ∂L/∂θ for every parameter θ (via backpropagation), then subtract a small fraction of that gradient from θ. Repeat millions of times.
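To make that loop concrete, here is a minimal sketch for a one-parameter model; the data, learning rate, and step count are illustrative choices, not prescriptions:

```python
import numpy as np

# Fit y = w*x to noisy data with plain gradient descent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_true = 3.0 * x + rng.normal(scale=0.1, size=200)

w = 0.0      # the single parameter theta
lr = 0.1     # fraction of the gradient to subtract each step

for step in range(100):
    y_pred = w * x                              # forward pass: y = f(x; theta)
    loss = np.mean((y_pred - y_true) ** 2)      # loss L(y, y_true)
    grad = np.mean(2 * (y_pred - y_true) * x)   # dL/dw, computed analytically here
    w -= lr * grad                              # update: subtract a fraction of the gradient

print(w)  # converges toward 3.0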
The Mathematics of Backpropagation
Consider a 3-layer network:
x → [W₁, b₁] → ReLU → [W₂, b₂] → ReLU → [W₃, b₃] → sigmoid → ŷ → L
∂L/∂z₃ = ŷ - y                      (BCE + sigmoid derivatives cancel cleanly)
∂L/∂W₃ = (∂L/∂z₃) · a₂ᵀ / m          (m = batch size)
∂L/∂b₃ = mean(∂L/∂z₃)
∂L/∂a₂ = W₃ᵀ · (∂L/∂z₃)
∂L/∂z₂ = (∂L/∂a₂) ⊙ ReLU′(z₂)        (element-wise; ReLU′ = 1 if z > 0 else 0)
∂L/∂W₂ = (∂L/∂z₂) · a₁ᵀ / m
...and so on back to W₁
Key insight: each layer only needs the gradient from the layer above it. No layer needs to know about any other layer. This is why deep learning frameworks (PyTorch, JAX, TensorFlow) can handle architectures of arbitrary depth — backprop is just a recursive application of the chain rule.
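Here is a sketch of those equations in NumPy for the three-layer network above. The layer sizes, initialization scale, and random data are illustrative assumptions; the backward pass follows the formulas line by line:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 32                                  # batch size
X = rng.normal(size=(4, m))             # inputs as columns
y = rng.integers(0, 2, size=(1, m))     # binary labels

# Small illustrative layer sizes; the 0.1 init scale is an arbitrary choice
W1, b1 = 0.1 * rng.normal(size=(8, 4)), np.zeros((8, 1))
W2, b2 = 0.1 * rng.normal(size=(8, 8)), np.zeros((8, 1))
W3, b3 = 0.1 * rng.normal(size=(1, 8)), np.zeros((1, 1))

relu = lambda z: np.maximum(z, 0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Forward pass: cache every z and a for the backward pass
z1 = W1 @ X + b1;  a1 = relu(z1)
z2 = W2 @ a1 + b2; a2 = relu(z2)
z3 = W3 @ a2 + b3; y_hat = sigmoid(z3)

# Backward pass: exactly the equations above, layer by layer
dz3 = y_hat - y                          # BCE + sigmoid cancel
dW3 = dz3 @ a2.T / m
db3 = dz3.mean(axis=1, keepdims=True)
da2 = W3.T @ dz3
dz2 = da2 * (z2 > 0)                     # ReLU' = 1 if z > 0 else 0
dW2 = dz2 @ a1.T / m
db2 = dz2.mean(axis=1, keepdims=True)
da1 = W2.T @ dz2
dz1 = da1 * (z1 > 0)
dW1 = dz1 @ X.T / m
db1 = dz1.mean(axis=1, keepdims=True)

print(dW1.shape, dW2.shape, dW3.shape)   # each matches its weight matrix
```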
Activation Functions — Deep Dive
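As a quick sketch, here are the common activations and their derivatives in NumPy; the max-shift in softmax is the standard numerical-stability trick:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def relu_grad(z):
    return (z > 0).astype(z.dtype)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

z = np.linspace(-3, 3, 7)
print(relu(z))
print(sigmoid(z))
print(softmax(z))   # sums to 1
```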
Loss Functions — Theory and Implementation
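A short NumPy sketch of the standard losses; the epsilon clipping and log-sum-exp shift are the usual numerical-stability measures, not specific to this lesson:

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error, the default regression loss."""
    return np.mean((y_pred - y_true) ** 2)

def binary_cross_entropy(y_prob, y_true, eps=1e-12):
    """BCE for sigmoid outputs; clipping avoids log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def cross_entropy(logits, labels):
    """Multi-class cross-entropy from raw logits (log-sum-exp for stability)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

print(binary_cross_entropy(np.array([0.9, 0.2]), np.array([1, 0])))   # small loss
print(cross_entropy(np.array([[2.0, 0.5, -1.0]]), np.array([0])))
```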
Neural Network from Scratch — Full NumPy Implementation
Building a network from scratch makes every component concrete. After this, PyTorch will feel like it's doing the tedious parts for you.
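As a rough sketch of what such a from-scratch network looks like, here is a compact NeuralNetwork class (ReLU hidden layers, sigmoid output, binary cross-entropy). The interface and every hyperparameter below are illustrative assumptions, not a canonical implementation:

```python
import numpy as np

class NeuralNetwork:
    """From-scratch MLP for binary classification: ReLU hidden layers, sigmoid
    output, binary cross-entropy loss, plain full-batch gradient descent."""

    def __init__(self, sizes, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # He initialization suits the ReLU hidden layers
        self.W = [rng.normal(0.0, np.sqrt(2.0 / m), size=(n, m))
                  for m, n in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros((n, 1)) for n in sizes[1:]]
        self.lr = lr

    def forward(self, X):
        """X has shape (features, batch); caches activations for backprop."""
        self.z, self.a = [], [X]
        for i, (W, b) in enumerate(zip(self.W, self.b)):
            z = W @ self.a[-1] + b
            a = 1 / (1 + np.exp(-z)) if i == len(self.W) - 1 else np.maximum(z, 0)
            self.z.append(z)
            self.a.append(a)
        return self.a[-1]

    def backward(self, y):
        """y has shape (1, batch). Applies one gradient-descent update."""
        m = y.shape[1]
        dW, db = [], []
        dz = self.a[-1] - y                                    # BCE + sigmoid cancel
        for i in reversed(range(len(self.W))):
            dW.append(dz @ self.a[i].T / m)
            db.append(dz.mean(axis=1, keepdims=True))
            if i > 0:
                dz = (self.W[i].T @ dz) * (self.z[i - 1] > 0)  # ReLU'
        for W, b, gW, gb in zip(self.W, self.b, reversed(dW), reversed(db)):
            W -= self.lr * gW
            b -= self.lr * gb

    def train(self, X, y, epochs=500):
        for _ in range(epochs):
            self.forward(X)
            self.backward(y)

    def predict(self, X):
        return (self.forward(X) > 0.5).astype(int)

# Smoke test on a toy separable problem
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 200))
y = (X[0:1] + X[1:2] > 0).astype(float)
net = NeuralNetwork([2, 16, 16, 1], lr=0.5)
net.train(X, y, epochs=300)
print("train accuracy:", (net.predict(X) == y).mean())
```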
Optimizers: SGD, Momentum, RMSProp, Adam
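The update rules themselves fit in a few lines each. A NumPy sketch, where the beta and epsilon values are the commonly used defaults:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad                      # running average of gradients
    return w - lr * v, v

def rmsprop_step(w, grad, s, lr=0.001, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad**2      # running average of squared gradients
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (RMSProp-style scaling)
    m_hat = m / (1 - beta1**t)               # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Compare on a toy quadratic loss L(w) = w^2, whose gradient is 2w
w_sgd = w_adam = 5.0
m = v = 0.0
for t in range(1, 101):
    w_sgd = sgd_step(w_sgd, 2 * w_sgd, lr=0.01)
    w_adam, m, v = adam_step(w_adam, 2 * w_adam, m, v, t, lr=0.1)
print(w_sgd, w_adam)   # both head toward 0; Adam adapts its effective step size
```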
Regularization: Preventing Overfitting
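A small sketch of two workhorse techniques, L2 weight decay and (inverted) dropout, in NumPy; early stopping appears in the training loops later in this section. Function names and defaults here are illustrative:

```python
import numpy as np

def l2_penalty(weights, lam=1e-3):
    """Adds lam/2 * sum(w^2) to the loss; its gradient contribution is lam * w."""
    return 0.5 * lam * sum((W ** 2).sum() for W in weights)

def dropout(a, p=0.5, training=True, seed=None):
    """Inverted dropout: zero activations with probability p, rescale survivors by 1/(1-p)
    so the expected activation is unchanged; do nothing at inference time."""
    if not training or p == 0:
        return a
    rng = np.random.default_rng(seed)
    mask = (rng.random(a.shape) >= p) / (1 - p)
    return a * mask

print(l2_penalty([np.ones((2, 2))], lam=0.1))   # 0.2
a = np.ones((4, 5))
print(dropout(a, p=0.5, seed=0))        # roughly half the entries zeroed, rest scaled to 2.0
print(dropout(a, training=False))       # identity at inference time
```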
PyTorch: Tensors and Autograd
PyTorch is too large to run in the browser, but here is the complete reference for your local environment:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Tensor creation ───────────────────────────────────────────────────────────
x = torch.tensor([1.0, 2.0, 3.0])              # From Python list
X = torch.randn(32, 784)                       # Random normal, shape (32, 784)
W = torch.zeros(784, 256, requires_grad=True)  # Requires gradient tracking
I = torch.eye(5)                               # Identity matrix
Z = torch.zeros_like(W)                        # Same shape, all zeros

# ── GPU support ───────────────────────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X = X.to(device)
W = W.to(device)

# ── Autograd: automatic differentiation ───────────────────────────────────────
# Every tensor with requires_grad=True builds a computation graph
W = torch.randn(784, 256, requires_grad=True, device=device)
b = torch.zeros(256, requires_grad=True, device=device)

# Forward pass — builds the computation graph
out = X @ W + b        # Linear layer
out = F.relu(out)      # ReLU activation
loss = out.sum()

# Backward pass — computes all gradients via backprop
loss.backward()
print(W.grad.shape)    # (784, 256) — same shape as W
print(b.grad.shape)    # (256,)

# Zero gradients before the next step! Gradients accumulate by default
W.grad.zero_()
b.grad.zero_()

# ── Context managers ──────────────────────────────────────────────────────────
with torch.no_grad():              # Disable gradient tracking (inference, val)
    predictions = model(X_val)     # (model, X_val assumed defined elsewhere)

with torch.inference_mode():       # Faster than no_grad — stronger guarantees
    predictions = model(X_test)    # (X_test assumed defined elsewhere)
```
Building Models with nn.Module
```python
class MLP(nn.Module):
    """Multi-Layer Perceptron with BatchNorm and Dropout."""

    def __init__(self, input_size: int, hidden: list[int],
                 output_size: int, dropout: float = 0.3):
        super().__init__()
        layers: list[nn.Module] = []
        sizes = [input_size] + hidden
        for i in range(len(sizes) - 1):
            layers += [
                nn.Linear(sizes[i], sizes[i+1]),
                nn.BatchNorm1d(sizes[i+1]),   # Normalize layer inputs
                nn.ReLU(),
                nn.Dropout(dropout),          # Random zeroing
            ]
        layers.append(nn.Linear(hidden[-1], output_size))
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = MLP(input_size=784, hidden=[512, 256, 128], output_size=10)
print(model)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
```
The Complete Training Loop
```python
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim

def train(model, train_loader, val_loader, epochs=30):
    optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss()

    best_val_acc = 0.0
    patience_counter = 0

    for epoch in range(epochs):
        # ── Training phase ────────────────────────────────────────────────────
        model.train()             # Enables Dropout and BatchNorm training mode
        train_loss = 0.0
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            optimizer.zero_grad()                # Step 1: clear old gradients
            logits = model(X_batch)              # Step 2: forward pass
            loss = criterion(logits, y_batch)    # Step 3: compute loss
            loss.backward()                      # Step 4: backprop
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Clip
            optimizer.step()                     # Step 5: update weights

            train_loss += loss.item() * len(X_batch)

        # ── Validation phase ──────────────────────────────────────────────────
        model.eval()              # Disables Dropout; BatchNorm uses running stats
        correct = total = 0
        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                logits = model(X_batch)
                preds = logits.argmax(dim=1)
                correct += (preds == y_batch).sum().item()
                total += len(y_batch)
        val_acc = correct / total

        scheduler.step()          # Adjust learning rate

        # Early stopping
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), "best_model.pt")
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= 7:
                print(f"Early stopping at epoch {epoch}")
                break

        print(f"Epoch {epoch+1:2d}: loss={train_loss/len(train_loader.dataset):.4f}"
              f" val_acc={val_acc:.4f} lr={scheduler.get_last_lr()[0]:.6f}")

    # Load best weights
    model.load_state_dict(torch.load("best_model.pt"))
    return model
```
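A usage sketch for the loop above, wiring random tensors through DataLoaders purely to show the plumbing (the data here is noise, not a real dataset, and it reuses the MLP and device defined earlier):

```python
# Random tensors standing in for a real dataset
X = torch.randn(2000, 784)
y = torch.randint(0, 10, (2000,))
train_loader = DataLoader(TensorDataset(X[:1600], y[:1600]), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(X[1600:], y[1600:]), batch_size=256)

model = MLP(input_size=784, hidden=[512, 256, 128], output_size=10).to(device)
model = train(model, train_loader, val_loader, epochs=5)
```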
CNN Architecture — How Convolutions Work
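At its core, a convolution slides a small kernel over the image and takes a weighted sum at every position. A naive NumPy sketch (no padding, stride 1, single channel; real layers generalize all three):

```python
import numpy as np

def conv2d_naive(image, kernel):
    """Valid cross-correlation (what deep-learning 'convolution' layers compute):
    slide the kernel over the image and take a weighted sum at each position."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kH, j:j + kW] * kernel).sum()
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)    # responds to vertical edges
print(conv2d_naive(image, edge_kernel).shape)     # (4, 4): spatial size shrinks by k-1
print(conv2d_naive(image, edge_kernel))
```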
Project: Full Neural Network Training Pipeline
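A compact sketch of such a pipeline, reusing the NeuralNetwork class sketched earlier: synthetic data, an 80/20 split, validation-loss tracking, and a simple patience rule. Every value below (dataset, architecture, patience) is an illustrative assumption:

```python
import numpy as np

# Synthetic binary classification with a non-linear decision rule
rng = np.random.default_rng(2)
X = rng.normal(size=(2, 1000))
y = ((X[0] ** 2 + X[1]) > 1).astype(float).reshape(1, -1)

split = 800                                   # 80/20 train/validation split
X_tr, y_tr = X[:, :split], y[:, :split]
X_va, y_va = X[:, split:], y[:, split:]

def bce(p, t, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

net = NeuralNetwork([2, 32, 16, 1], lr=0.3)   # class from the NumPy section above
best_val, best_W, best_b, patience = np.inf, None, None, 0

for epoch in range(500):
    net.forward(X_tr)
    net.backward(y_tr)                        # one full-batch gradient step
    val_loss = bce(net.forward(X_va), y_va)
    if val_loss < best_val - 1e-4:            # improvement: snapshot the weights
        best_val = val_loss
        best_W = [W.copy() for W in net.W]
        best_b = [b.copy() for b in net.b]
        patience = 0
    else:
        patience += 1
        if patience >= 25:                    # early stopping
            break

net.W, net.b = best_W, best_b                 # restore the best weights
print(f"val loss {best_val:.3f}, val acc {(net.predict(X_va) == y_va).mean():.3f}")
```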
PyTorch Reference: CNN for Image Classification
```python
import torch
import torch.nn as nn
import torchvision

class ConvBlock(nn.Module):
    """Conv → BatchNorm → ReLU → optional MaxPool"""

    def __init__(self, in_channels, out_channels, pool=False):
        super().__init__()
        layers = [
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        ]
        if pool:
            layers.append(nn.MaxPool2d(2))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class CNN(nn.Module):
    """CNN for CIFAR-10 (32x32 RGB images, 10 classes)."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            ConvBlock(3, 64, pool=False),     # (3,32,32) → (64,32,32)
            ConvBlock(64, 64, pool=True),     # → (64,16,16)
            ConvBlock(64, 128, pool=False),   # → (128,16,16)
            ConvBlock(128, 128, pool=True),   # → (128,8,8)
            ConvBlock(128, 256, pool=False),  # → (256,8,8)
            ConvBlock(256, 256, pool=True),   # → (256,4,4)
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # → (256,1,1) — handles any input size
            nn.Flatten(),                     # → (256,)
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Transfer learning: fine-tune ResNet-18
def make_finetune_model(n_classes=10, freeze_backbone=True):
    model = torchvision.models.resnet18(weights="DEFAULT")
    if freeze_backbone:
        for param in model.parameters():
            param.requires_grad = False
    # Replace the final fully connected layer
    in_features = model.fc.in_features   # 512 for ResNet-18
    model.fc = nn.Sequential(
        nn.Linear(in_features, 256),
        nn.ReLU(),
        nn.Dropout(0.4),
        nn.Linear(256, n_classes),
    )
    return model

# Only train the new head initially, then unfreeze backbone for fine-tuning
model = make_finetune_model(n_classes=10, freeze_backbone=True)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({trainable/total:.1%})")
```
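A quick shape sanity check, with a random tensor standing in for a CIFAR-10 batch:

```python
x = torch.randn(8, 3, 32, 32)   # dummy batch standing in for CIFAR-10 images
cnn = CNN()
print(cnn(x).shape)             # torch.Size([8, 10])
```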
Exercises
Exercise 1 — Backpropagation by Hand
Implement forward and backward pass for a single linear layer with MSE loss. Given x = [1.0, 2.0], W = [[0.5, 0.3], [0.4, 0.6]], b = [0, 0], y_true = [1.0, 0.5], compute the loss and gradients dW and db. Verify by numerically computing the gradient as (L(W+eps) - L(W-eps)) / (2*eps) for each element.
Exercise 2 — XOR with Minimal Network
The XOR problem requires at least one hidden layer — it is not linearly separable. Implement a minimal [2, 4, 1] network, train on all 4 XOR inputs, and verify it reaches 100% accuracy. Then visualize the decision boundary by evaluating a 50×50 grid of points.
Exercise 3 — Implement Adam from Scratch
Using the NumPy NeuralNetwork class above as a base, add Adam optimizer support. Replace the simple gradient descent update with the Adam update rule. Compare convergence speed on the crescent dataset against plain SGD.
Exercise 4 — Batch Normalization
Implement batch_norm(x, gamma, beta, eps=1e-5) that normalizes a batch of activations to mean=0, std=1, then scales and shifts with learned parameters. Test on a batch where the activations have wildly different scales. Show that batch norm makes training stable.
Exercise 5 — Early Stopping
Add proper early stopping to the NeuralNetwork.train() method: track the best validation loss, save the best weights, and stop training if validation loss hasn't improved for patience epochs. Test with a dataset where the network starts to overfit after epoch 100.
Exercise 6 — Learning Rate Scheduler
Implement a cosine annealing learning rate scheduler: lr_t = lr_min + 0.5*(lr_max - lr_min)*(1 + cos(pi*t/T)) where t is the current epoch and T is the total epochs. Apply it to the crescent training and plot the learning rate curve.
Exercise 7 — Confusion Matrix and ROC Curve
Using the model from the project, implement: (a) a confusion matrix function, (b) precision-recall curve by varying the decision threshold from 0 to 1, (c) the area under the ROC curve (AUC-ROC). For (c), use the trapezoidal rule.
Exercise 8 — Neural Net for Tabular Data
Using the NeuralNetwork class, build a classifier for the following tabular problem: predict if a customer will buy a subscription based on [age, income, browsing_time, previous_purchases]. Generate a synthetic dataset (500 samples) where the true rule is income > 50000 AND browsing_time > 10 AND age < 50. Train with 80/20 split, report accuracy, and analyze which features the network weights most heavily by examining the magnitude of first-layer weights.