The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991) guarantees that a neural network with a single hidden layer containing enough neurons can approximate any continuous function on a compact subset of Rⁿ to arbitrary precision. This is the mathematical foundation.
In practice, networks succeed because of three ideas working together:
Differentiable composition — every operation in the network (matmul, ReLU, softmax) has a computable gradient. This means we can propagate error signals backward through the entire network.
Stochastic gradient descent — we don't need the optimal parameter update at each step; a noisy direction that's mostly correct is enough. Given enough steps and a sensible learning rate, the noise averages out and the loss keeps decreasing.
Efficient computation — GPUs can perform thousands of matrix-multiply operations in parallel. A modern GPU delivers 20+ teraFLOPS (trillions of floating-point operations per second).
The actual mechanics: a network is a parameterized function y = f(x; θ). We define a loss L(y, y_true) measuring how wrong the prediction is. We compute ∂L/∂θ for every parameter θ (via backpropagation), then subtract a small fraction of that gradient from θ. Repeat millions of times.
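To make that loop concrete, here is a minimal sketch for a one-parameter model; the data, learning rate, and step count are illustrative choices, not prescriptions:

```python
import numpy as np

# Fit y = w*x to noisy data with plain gradient descent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_true = 3.0 * x + rng.normal(scale=0.1, size=200)

w = 0.0      # the single parameter theta
lr = 0.1     # fraction of the gradient to subtract each step

for step in range(100):
    y_pred = w * x                              # forward pass: y = f(x; theta)
    loss = np.mean((y_pred - y_true) ** 2)      # loss L(y, y_true)
    grad = np.mean(2 * (y_pred - y_true) * x)   # dL/dw, computed analytically here
    w -= lr * grad                              # update: subtract a fraction of the gradient

print(w)  # converges toward 3.0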
The Mathematics of Backpropagation
Consider a 3-layer network:
x → [W₁, b₁] → ReLU → [W₂, b₂] → ReLU → [W₃, b₃] → sigmoid → ŷ → L
∂L/∂z₃ = ŷ - y                      (BCE + sigmoid derivatives cancel cleanly)
∂L/∂W₃ = (∂L/∂z₃) · a₂ᵀ / m          (m = batch size)
∂L/∂b₃ = mean(∂L/∂z₃)
∂L/∂a₂ = W₃ᵀ · (∂L/∂z₃)
∂L/∂z₂ = (∂L/∂a₂) ⊙ ReLU′(z₂)        (element-wise; ReLU′ = 1 if z > 0 else 0)
∂L/∂W₂ = (∂L/∂z₂) · a₁ᵀ / m
...and so on back to W₁
Key insight: each layer only needs the gradient from the layer above it. No layer needs to know about any other layer. This is why deep learning frameworks (PyTorch, JAX, TensorFlow) can handle architectures of arbitrary depth — backprop is just a recursive application of the chain rule.
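Here is a sketch of those equations in NumPy for the three-layer network above. The layer sizes, initialization scale, and random data are illustrative assumptions; the backward pass follows the formulas line by line:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 32                                  # batch size
X = rng.normal(size=(4, m))             # inputs as columns
y = rng.integers(0, 2, size=(1, m))     # binary labels

# Small illustrative layer sizes; the 0.1 init scale is an arbitrary choice
W1, b1 = 0.1 * rng.normal(size=(8, 4)), np.zeros((8, 1))
W2, b2 = 0.1 * rng.normal(size=(8, 8)), np.zeros((8, 1))
W3, b3 = 0.1 * rng.normal(size=(1, 8)), np.zeros((1, 1))

relu = lambda z: np.maximum(z, 0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Forward pass: cache every z and a for the backward pass
z1 = W1 @ X + b1;  a1 = relu(z1)
z2 = W2 @ a1 + b2; a2 = relu(z2)
z3 = W3 @ a2 + b3; y_hat = sigmoid(z3)

# Backward pass: exactly the equations above, layer by layer
dz3 = y_hat - y                          # BCE + sigmoid cancel
dW3 = dz3 @ a2.T / m
db3 = dz3.mean(axis=1, keepdims=True)
da2 = W3.T @ dz3
dz2 = da2 * (z2 > 0)                     # ReLU' = 1 if z > 0 else 0
dW2 = dz2 @ a1.T / m
db2 = dz2.mean(axis=1, keepdims=True)
da1 = W2.T @ dz2
dz1 = da1 * (z1 > 0)
dW1 = dz1 @ X.T / m
db1 = dz1.mean(axis=1, keepdims=True)

print(dW1.shape, dW2.shape, dW3.shape)   # each matches its weight matrix
```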
Activation Functions — Deep Dive
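As a quick sketch, here are the common activations and their derivatives in NumPy; the max-shift in softmax is the standard numerical-stability trick:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def relu_grad(z):
    return (z > 0).astype(z.dtype)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

z = np.linspace(-3, 3, 7)
print(relu(z))
print(sigmoid(z))
print(softmax(z))   # sums to 1
```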
Loss Functions — Theory and Implementation
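A short NumPy sketch of the standard losses; the epsilon clipping and log-sum-exp shift are the usual numerical-stability measures, not specific to this lesson:

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error, the default regression loss."""
    return np.mean((y_pred - y_true) ** 2)

def binary_cross_entropy(y_prob, y_true, eps=1e-12):
    """BCE for sigmoid outputs; clipping avoids log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def cross_entropy(logits, labels):
    """Multi-class cross-entropy from raw logits (log-sum-exp for stability)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

print(binary_cross_entropy(np.array([0.9, 0.2]), np.array([1, 0])))   # small loss
print(cross_entropy(np.array([[2.0, 0.5, -1.0]]), np.array([0])))
```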
Neural Network from Scratch — Full NumPy Implementation
Building a network from scratch makes every component concrete. After this, PyTorch will feel like it's doing the tedious parts for you.
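As a rough sketch of what such a from-scratch network looks like, here is a compact NeuralNetwork class (ReLU hidden layers, sigmoid output, binary cross-entropy). The interface and every hyperparameter below are illustrative assumptions, not a canonical implementation:

```python
import numpy as np

class NeuralNetwork:
    """From-scratch MLP for binary classification: ReLU hidden layers, sigmoid
    output, binary cross-entropy loss, plain full-batch gradient descent."""

    def __init__(self, sizes, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # He initialization suits the ReLU hidden layers
        self.W = [rng.normal(0.0, np.sqrt(2.0 / m), size=(n, m))
                  for m, n in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros((n, 1)) for n in sizes[1:]]
        self.lr = lr

    def forward(self, X):
        """X has shape (features, batch); caches activations for backprop."""
        self.z, self.a = [], [X]
        for i, (W, b) in enumerate(zip(self.W, self.b)):
            z = W @ self.a[-1] + b
            a = 1 / (1 + np.exp(-z)) if i == len(self.W) - 1 else np.maximum(z, 0)
            self.z.append(z)
            self.a.append(a)
        return self.a[-1]

    def backward(self, y):
        """y has shape (1, batch). Applies one gradient-descent update."""
        m = y.shape[1]
        dW, db = [], []
        dz = self.a[-1] - y                                    # BCE + sigmoid cancel
        for i in reversed(range(len(self.W))):
            dW.append(dz @ self.a[i].T / m)
            db.append(dz.mean(axis=1, keepdims=True))
            if i > 0:
                dz = (self.W[i].T @ dz) * (self.z[i - 1] > 0)  # ReLU'
        for W, b, gW, gb in zip(self.W, self.b, reversed(dW), reversed(db)):
            W -= self.lr * gW
            b -= self.lr * gb

    def train(self, X, y, epochs=500):
        for _ in range(epochs):
            self.forward(X)
            self.backward(y)

    def predict(self, X):
        return (self.forward(X) > 0.5).astype(int)

# Smoke test on a toy separable problem
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 200))
y = (X[0:1] + X[1:2] > 0).astype(float)
net = NeuralNetwork([2, 16, 16, 1], lr=0.5)
net.train(X, y, epochs=300)
print("train accuracy:", (net.predict(X) == y).mean())
```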
Optimizers: SGD, Momentum, RMSProp, Adam
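The update rules themselves fit in a few lines each. A NumPy sketch, where the beta and epsilon values are the commonly used defaults:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad                      # running average of gradients
    return w - lr * v, v

def rmsprop_step(w, grad, s, lr=0.001, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad**2      # running average of squared gradients
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (RMSProp-style scaling)
    m_hat = m / (1 - beta1**t)               # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Compare on a toy quadratic loss L(w) = w^2, whose gradient is 2w
w_sgd = w_adam = 5.0
m = v = 0.0
for t in range(1, 101):
    w_sgd = sgd_step(w_sgd, 2 * w_sgd, lr=0.01)
    w_adam, m, v = adam_step(w_adam, 2 * w_adam, m, v, t, lr=0.1)
print(w_sgd, w_adam)   # both head toward 0; Adam adapts its effective step size
```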
Regularization: Preventing Overfitting
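A small sketch of two workhorse techniques, L2 weight decay and (inverted) dropout, in NumPy; early stopping appears in the training loops later in this section. Function names and defaults here are illustrative:

```python
import numpy as np

def l2_penalty(weights, lam=1e-3):
    """Adds lam/2 * sum(w^2) to the loss; its gradient contribution is lam * w."""
    return 0.5 * lam * sum((W ** 2).sum() for W in weights)

def dropout(a, p=0.5, training=True, seed=None):
    """Inverted dropout: zero activations with probability p, rescale survivors by 1/(1-p)
    so the expected activation is unchanged; do nothing at inference time."""
    if not training or p == 0:
        return a
    rng = np.random.default_rng(seed)
    mask = (rng.random(a.shape) >= p) / (1 - p)
    return a * mask

print(l2_penalty([np.ones((2, 2))], lam=0.1))   # 0.2
a = np.ones((4, 5))
print(dropout(a, p=0.5, seed=0))        # roughly half the entries zeroed, rest scaled to 2.0
print(dropout(a, training=False))       # identity at inference time
```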
PyTorch: Tensors and Autograd
PyTorch is too large to run in the browser, but here is the complete reference for your local environment:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Tensor creation ───────────────────────────────────────────────────────────
x = torch.tensor([1.0, 2.0, 3.0])              # From Python list
X = torch.randn(32, 784)                       # Random normal, shape (32, 784)
W = torch.zeros(784, 256, requires_grad=True)  # Requires gradient tracking
I = torch.eye(5)                               # Identity matrix
Z = torch.zeros_like(W)                        # Same shape, all zeros

# ── GPU support ───────────────────────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X = X.to(device)
W = W.to(device)

# ── Autograd: automatic differentiation ───────────────────────────────────────
# Every tensor with requires_grad=True builds a computation graph
W = torch.randn(784, 256, requires_grad=True, device=device)
b = torch.zeros(256, requires_grad=True, device=device)

# Forward pass — builds the computation graph
out = X @ W + b        # Linear layer
out = F.relu(out)      # ReLU activation
loss = out.sum()

# Backward pass — computes all gradients via backprop
loss.backward()
print(W.grad.shape)    # (784, 256) — same shape as W
print(b.grad.shape)    # (256,)

# Zero gradients before the next step! Gradients accumulate by default
W.grad.zero_()
b.grad.zero_()

# ── Context managers ──────────────────────────────────────────────────────────
with torch.no_grad():              # Disable gradient tracking (inference, val)
    predictions = model(X_val)     # (model, X_val assumed defined elsewhere)

with torch.inference_mode():       # Faster than no_grad — stronger guarantees
    predictions = model(X_test)    # (X_test assumed defined elsewhere)
```
Building Models with nn.Module
```python
class MLP(nn.Module):
    """Multi-Layer Perceptron with BatchNorm and Dropout."""

    def __init__(self, input_size: int, hidden: list[int],
                 output_size: int, dropout: float = 0.3):
        super().__init__()
        layers: list[nn.Module] = []
        sizes = [input_size] + hidden
        for i in range(len(sizes) - 1):
            layers += [
                nn.Linear(sizes[i], sizes[i+1]),
                nn.BatchNorm1d(sizes[i+1]),   # Normalize layer inputs
                nn.ReLU(),
                nn.Dropout(dropout),          # Random zeroing
            ]
        layers.append(nn.Linear(hidden[-1], output_size))
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = MLP(input_size=784, hidden=[512, 256, 128], output_size=10)
print(model)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
```
The Complete Training Loop
```python
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim

def train(model, train_loader, val_loader, epochs=30):
    optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss()

    best_val_acc = 0.0
    patience_counter = 0

    for epoch in range(epochs):
        # ── Training phase ────────────────────────────────────────────────────
        model.train()             # Enables Dropout and BatchNorm training mode
        train_loss = 0.0
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            optimizer.zero_grad()                # Step 1: clear old gradients
            logits = model(X_batch)              # Step 2: forward pass
            loss = criterion(logits, y_batch)    # Step 3: compute loss
            loss.backward()                      # Step 4: backprop
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Clip
            optimizer.step()                     # Step 5: update weights

            train_loss += loss.item() * len(X_batch)

        # ── Validation phase ──────────────────────────────────────────────────
        model.eval()              # Disables Dropout; BatchNorm uses running stats
        correct = total = 0
        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                logits = model(X_batch)
                preds = logits.argmax(dim=1)
                correct += (preds == y_batch).sum().item()
                total += len(y_batch)
        val_acc = correct / total

        scheduler.step()          # Adjust learning rate

        # Early stopping
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), "best_model.pt")
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= 7:
                print(f"Early stopping at epoch {epoch}")
                break

        print(f"Epoch {epoch+1:2d}: loss={train_loss/len(train_loader.dataset):.4f}"
              f" val_acc={val_acc:.4f} lr={scheduler.get_last_lr()[0]:.6f}")

    # Load best weights
    model.load_state_dict(torch.load("best_model.pt"))
    return model
```
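A usage sketch for the loop above, wiring random tensors through DataLoaders purely to show the plumbing (the data here is noise, not a real dataset, and it reuses the MLP and device defined earlier):

```python
# Random tensors standing in for a real dataset
X = torch.randn(2000, 784)
y = torch.randint(0, 10, (2000,))
train_loader = DataLoader(TensorDataset(X[:1600], y[:1600]), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(X[1600:], y[1600:]), batch_size=256)

model = MLP(input_size=784, hidden=[512, 256, 128], output_size=10).to(device)
model = train(model, train_loader, val_loader, epochs=5)
```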
CNN Architecture — How Convolutions Work
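At its core, a convolution slides a small kernel over the image and takes a weighted sum at every position. A naive NumPy sketch (no padding, stride 1, single channel; real layers generalize all three):

```python
import numpy as np

def conv2d_naive(image, kernel):
    """Valid cross-correlation (what deep-learning 'convolution' layers compute):
    slide the kernel over the image and take a weighted sum at each position."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kH, j:j + kW] * kernel).sum()
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)    # responds to vertical edges
print(conv2d_naive(image, edge_kernel).shape)     # (4, 4): spatial size shrinks by k-1
print(conv2d_naive(image, edge_kernel))
```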
Project: Full Neural Network Training Pipeline
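A compact sketch of such a pipeline, reusing the NeuralNetwork class sketched earlier: synthetic data, an 80/20 split, validation-loss tracking, and a simple patience rule. Every value below (dataset, architecture, patience) is an illustrative assumption:

```python
import numpy as np

# Synthetic binary classification with a non-linear decision rule
rng = np.random.default_rng(2)
X = rng.normal(size=(2, 1000))
y = ((X[0] ** 2 + X[1]) > 1).astype(float).reshape(1, -1)

split = 800                                   # 80/20 train/validation split
X_tr, y_tr = X[:, :split], y[:, :split]
X_va, y_va = X[:, split:], y[:, split:]

def bce(p, t, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

net = NeuralNetwork([2, 32, 16, 1], lr=0.3)   # class from the NumPy section above
best_val, best_W, best_b, patience = np.inf, None, None, 0

for epoch in range(500):
    net.forward(X_tr)
    net.backward(y_tr)                        # one full-batch gradient step
    val_loss = bce(net.forward(X_va), y_va)
    if val_loss < best_val - 1e-4:            # improvement: snapshot the weights
        best_val = val_loss
        best_W = [W.copy() for W in net.W]
        best_b = [b.copy() for b in net.b]
        patience = 0
    else:
        patience += 1
        if patience >= 25:                    # early stopping
            break

net.W, net.b = best_W, best_b                 # restore the best weights
print(f"val loss {best_val:.3f}, val acc {(net.predict(X_va) == y_va).mean():.3f}")
```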
PyTorch Reference: CNN for Image Classification
```python
import torch
import torch.nn as nn
import torchvision

class ConvBlock(nn.Module):
    """Conv → BatchNorm → ReLU → optional MaxPool"""

    def __init__(self, in_channels, out_channels, pool=False):
        super().__init__()
        layers = [
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        ]
        if pool:
            layers.append(nn.MaxPool2d(2))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class CNN(nn.Module):
    """CNN for CIFAR-10 (32x32 RGB images, 10 classes)."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            ConvBlock(3, 64, pool=False),     # (3,32,32) → (64,32,32)
            ConvBlock(64, 64, pool=True),     # → (64,16,16)
            ConvBlock(64, 128, pool=False),   # → (128,16,16)
            ConvBlock(128, 128, pool=True),   # → (128,8,8)
            ConvBlock(128, 256, pool=False),  # → (256,8,8)
            ConvBlock(256, 256, pool=True),   # → (256,4,4)
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # → (256,1,1) — handles any input size
            nn.Flatten(),                     # → (256,)
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Transfer learning: fine-tune ResNet-18
def make_finetune_model(n_classes=10, freeze_backbone=True):
    model = torchvision.models.resnet18(weights="DEFAULT")
    if freeze_backbone:
        for param in model.parameters():
            param.requires_grad = False
    # Replace the final fully connected layer
    in_features = model.fc.in_features   # 512 for ResNet-18
    model.fc = nn.Sequential(
        nn.Linear(in_features, 256),
        nn.ReLU(),
        nn.Dropout(0.4),
        nn.Linear(256, n_classes),
    )
    return model

# Only train the new head initially, then unfreeze backbone for fine-tuning
model = make_finetune_model(n_classes=10, freeze_backbone=True)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({trainable/total:.1%})")
```
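A quick shape sanity check, with a random tensor standing in for a CIFAR-10 batch:

```python
x = torch.randn(8, 3, 32, 32)   # dummy batch standing in for CIFAR-10 images
cnn = CNN()
print(cnn(x).shape)             # torch.Size([8, 10])
```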
Exercises
Exercise 1 — Backpropagation by Hand
Implement forward and backward pass for a single linear layer with MSE loss. Given x = [1.0, 2.0], W = [[0.5, 0.3], [0.4, 0.6]], b = [0, 0], y_true = [1.0, 0.5], compute the loss and gradients dW and db. Verify by numerically computing the gradient as (L(W+eps) - L(W-eps)) / (2*eps) for each element.
Exercise 2 — XOR with Minimal Network
The XOR problem requires at least one hidden layer — it is not linearly separable. Implement a minimal [2, 4, 1] network, train on all 4 XOR inputs, and verify it reaches 100% accuracy. Then visualize the decision boundary by evaluating a 50×50 grid of points.
Exercise 3 — Implement Adam from Scratch
Using the NumPy NeuralNetwork class above as a base, add Adam optimizer support. Replace the simple gradient descent update with the Adam update rule. Compare convergence speed on the crescent dataset against plain SGD.
Exercise 4 — Batch Normalization
Implement batch_norm(x, gamma, beta, eps=1e-5) that normalizes a batch of activations to mean=0, std=1, then scales and shifts with learned parameters. Test on a batch where the activations have wildly different scales. Show that batch norm makes training stable.
Exercise 5 — Early Stopping
Add proper early stopping to the NeuralNetwork.train() method: track the best validation loss, save the best weights, and stop training if validation loss hasn't improved for patience epochs. Test with a dataset where the network starts to overfit after epoch 100.
Exercise 6 — Learning Rate Scheduler
Implement a cosine annealing learning rate scheduler: lr_t = lr_min + 0.5*(lr_max - lr_min)*(1 + cos(pi*t/T)) where t is the current epoch and T is the total epochs. Apply it to the crescent training and plot the learning rate curve.
Exercise 7 — Confusion Matrix and ROC Curve
Using the model from the project, implement: (a) a confusion matrix function, (b) precision-recall curve by varying the decision threshold from 0 to 1, (c) the area under the ROC curve (AUC-ROC). For (c), use the trapezoidal rule.
Exercise 8 — Neural Net for Tabular Data
Using the NeuralNetwork class, build a classifier for the following tabular problem: predict if a customer will buy a subscription based on [age, income, browsing_time, previous_purchases]. Generate a synthetic dataset (500 samples) where the true rule is income > 50000 AND browsing_time > 10 AND age < 50. Train with 80/20 split, report accuracy, and analyze which features the network weights most heavily by examining the magnitude of first-layer weights.