Building a model that performs well on the data you used to train it is trivially easy — you can memorise the training set. The hard problem is building a model that generalises to unseen data. Model evaluation is the set of practices that let you estimate generalisation performance honestly and diagnose why a model falls short.
The Train/Validation/Test Split
The fundamental principle: never evaluate a model on data it was trained on.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import (train_test_split, KFold, StratifiedKFold,
                                     cross_val_score, learning_curve)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc, RocCurveDisplay
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Hold out test set once — never touch it until the very end
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15, random_state=42)
```
K-Fold Cross-Validation
Single train/validation splits are noisy — the performance estimate depends heavily on which rows land in which split. K-fold cross-validation averages over k different splits to get a more stable estimate:
```python
model = LogisticRegression(max_iter=1000, random_state=42)

# 5-fold CV on the non-test data
cv_scores = cross_val_score(model, X_temp, y_temp, cv=5, scoring="accuracy")
print(f"CV scores: {cv_scores}")
print(f"Mean: {cv_scores.mean():.4f}")
print(f"Std: {cv_scores.std():.4f}")

# Rough two-sigma interval across fold scores
print(f"Approx. 95% interval: ({cv_scores.mean() - 2*cv_scores.std():.4f}, "
      f"{cv_scores.mean() + 2*cv_scores.std():.4f})")
```
Stratified K-Fold
With imbalanced datasets, random splits may produce folds where the minority class is absent or over-represented. Stratified K-fold preserves the class proportion in each fold:
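A minimal sketch of stratified CV, reusing the imports above but on a deliberately imbalanced dataset (the `X_imb`/`y_imb` names and the 95/5 class weights are illustrative, not from the original setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative imbalanced dataset: roughly 5% positive class
X_imb, y_imb = make_classification(n_samples=2000, n_features=20,
                                   weights=[0.95, 0.05], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000, random_state=42)

# F1 is a more informative scorer than accuracy at this imbalance
scores = cross_val_score(model, X_imb, y_imb, cv=skf, scoring="f1")
print(f"Stratified CV F1 scores: {scores}")

# Each validation fold preserves the overall ~5% positive rate
for _, val_idx in skf.split(X_imb, y_imb):
    print(f"Positive rate in fold: {y_imb[val_idx].mean():.3f}")
```

Note that `cross_val_score` already uses stratified folds by default for classifiers when you pass an integer `cv`; constructing `StratifiedKFold` explicitly matters when you want `shuffle=True` or a fixed `random_state`.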
Learning curves show how training and validation performance change as training set size grows. They are the primary diagnostic tool for bias-variance problems:
| Pattern | Diagnosis | Fix |
|---|---|---|
| Large gap between train and val (train high, val low) | High variance (overfitting) | More data, regularisation, simpler model |
| Both train and val low | High bias (underfitting) | More complex model, more features |
| Both converging and high | Good fit | Deploy with confidence |
| Both plateau and the validation score oscillates | Noisy data | More data, denoising |
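The patterns above can be read off a plot produced with scikit-learn's `learning_curve`. A minimal, self-contained sketch (it regenerates the synthetic dataset from earlier; the filename is illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, n_redundant=5, random_state=42)

# Train/validation scores at 8 increasing training-set sizes, 5-fold CV each
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000, random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

plt.plot(sizes, train_mean, "o-", label="Training score")
plt.plot(sizes, val_mean, "o-", label="Validation score")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.savefig("learning_curve.png")

# A shrinking train-val gap as size grows suggests more data is still helping
print(f"Final train-val gap: {train_mean[-1] - val_mean[-1]:.4f}")
```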
AUC-ROC
The ROC curve plots the True Positive Rate (recall) against the False Positive Rate at every possible classification threshold. AUC (Area Under the Curve) summarises the curve as a single number:
| AUC | Model quality |
|---|---|
| 1.00 | Perfect classifier |
| 0.90–0.99 | Excellent |
| 0.80–0.89 | Good |
| 0.70–0.79 | Fair |
| 0.50 | No better than random |
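A sketch of computing and plotting the curve with the `roc_curve` and `auc` helpers imported earlier (self-contained here, with an illustrative 75/25 split and output filename):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)

# roc_curve needs scores or probabilities, not hard class labels
y_scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)
print(f"AUC: {roc_auc:.4f}")

plt.plot(fpr, tpr, label=f"ROC (AUC = {roc_auc:.3f})")
plt.plot([0, 1], [0, 1], "k--", label="Chance (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.savefig("roc_curve.png")
```

Passing hard predictions from `model.predict` instead of probabilities is a common mistake: it collapses the curve to a single operating point and understates the AUC.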
Bias-Variance Decomposition
The expected prediction error decomposes into three components:
Error = Bias² + Variance + Irreducible Noise
- **Bias:** error from wrong assumptions in the model. A linear model fit to non-linear data has high bias.
- **Variance:** error from sensitivity to small fluctuations in the training data. A deep decision tree that memorises training examples has high variance.
- **Irreducible noise:** inherent randomness in the target variable; no model can reduce it.
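The decomposition can be estimated empirically by retraining the same model on many fresh noisy samples and comparing the average prediction to the true function. A sketch using polynomial regression on a synthetic sine target (the target function, noise level, and grid are illustrative choices, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(42)
grid = np.linspace(0.05, 0.95, 50)  # evaluation grid, avoiding edge extrapolation

def f(x):
    """True underlying function (illustrative)."""
    return np.sin(2 * np.pi * x)

def fit_and_predict(degree, n=30, noise=0.3):
    """Fit a polynomial to one fresh noisy sample, predict on the fixed grid."""
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, noise, n)  # noise term is the irreducible part
    return np.polyval(np.polyfit(x, y, degree), grid)

results = {}
for degree in (1, 3, 9):
    # Retrain 200 times on independent noisy samples of the same size
    preds = np.array([fit_and_predict(degree) for _ in range(200)])
    avg_pred = preds.mean(axis=0)
    bias_sq = ((avg_pred - f(grid)) ** 2).mean()  # squared bias of the average model
    variance = preds.var(axis=0).mean()           # spread across retrainings
    results[degree] = (bias_sq, variance)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

Degree 1 should show high bias and low variance (a line cannot follow a sine), while degree 9 shows the reverse, mirroring the tradeoff described above.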
```python
# Trace the bias-variance tradeoff across tree depths
depths = range(1, 20)
train_accs, val_accs = [], []
for d in depths:
    m = RandomForestClassifier(max_depth=d, n_estimators=50, random_state=42)
    m.fit(X_train, y_train)
    train_accs.append(m.score(X_train, y_train))  # accuracy on the training set
    val_accs.append(m.score(X_val, y_val))        # accuracy on held-out data

best_depth = list(depths)[int(np.argmax(val_accs))]
print(f"Best max_depth: {best_depth} (val accuracy: {max(val_accs):.4f})")
```
Summary
- Always hold out a test set before any modelling; it is touched exactly once — after all tuning decisions are made — to report final performance.
- K-fold cross-validation provides a more stable performance estimate than a single split; stratified K-fold preserves class balance across folds for imbalanced datasets.
- Learning curves diagnose the bias-variance tradeoff: a wide train-val gap indicates overfitting; both curves low indicates underfitting.
- AUC-ROC measures overall ranking ability across all thresholds; it is threshold-independent and appropriate for imbalanced problems where accuracy misleads.
- The bias-variance decomposition explains why there is no universally best model complexity — the optimal point balances underfitting against overfitting for the specific dataset size.