GadaaLabs
Data Science Fundamentals
Lesson 7

Model Evaluation

14 min

Building a model that performs well on the data you used to train it is trivially easy — you can memorise the training set. The hard problem is building a model that generalises to unseen data. Model evaluation is the set of practices that let you estimate generalisation performance honestly and diagnose why a model falls short.

The Train/Validation/Test Split

The fundamental principle: never evaluate a model on data it was trained on.

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import (train_test_split, KFold, StratifiedKFold,
                                      cross_val_score, learning_curve)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc, RocCurveDisplay
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                             n_redundant=5, random_state=42)

# Hold out test set once — never touch it until the very end
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

# Split the remainder into training and validation sets (X_temp/y_temp is kept for cross-validation)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.15, random_state=42)

K-Fold Cross-Validation

Single train/validation splits are noisy — the performance estimate depends heavily on which rows land in which split. K-fold cross-validation averages over k different splits to get a more stable estimate:

python
model = LogisticRegression(max_iter=1000, random_state=42)

# 5-fold CV on the non-test data
cv_scores = cross_val_score(model, X_temp, y_temp, cv=5, scoring="accuracy")

print(f"CV scores:  {cv_scores}")
print(f"Mean:       {cv_scores.mean():.4f}")
print(f"Std:        {cv_scores.std():.4f}")
print(f"± 2·std:    ({cv_scores.mean() - 2*cv_scores.std():.4f}, "
      f"{cv_scores.mean() + 2*cv_scores.std():.4f})  # rough spread, not a formal CI")
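If you need several metrics at once, `cross_validate` (a close cousin of `cross_val_score`) fits each fold once and evaluates every scorer on the same fitted model. A minimal sketch on a small synthetic dataset (the `X_demo`/`y_demo` names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# One CV run, several metrics — avoids refitting once per metric by hand
results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo, y_demo,
    cv=5,
    scoring=["accuracy", "roc_auc"],
)

print(f"accuracy: {results['test_accuracy'].mean():.4f}")
print(f"roc_auc:  {results['test_roc_auc'].mean():.4f}")
```

The returned dictionary also contains `fit_time` and `score_time`, which is handy when comparing models of very different cost.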

Stratified K-Fold

With imbalanced datasets, random splits may produce folds where the minority class is absent or over-represented. Stratified K-fold preserves the class proportion in each fold:

python
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Manual stratified CV to inspect each fold
fold_results = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_temp, y_temp), start=1):
    X_f_train, X_f_val = X_temp[train_idx], X_temp[val_idx]
    y_f_train, y_f_val = y_temp[train_idx], y_temp[val_idx]

    m = LogisticRegression(max_iter=1000, random_state=42)
    m.fit(X_f_train, y_f_train)
    score = m.score(X_f_val, y_f_val)
    fold_results.append({"fold": fold, "val_accuracy": score,
                          "positive_rate": y_f_val.mean()})

print(pd.DataFrame(fold_results))
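The manual loop is useful for inspecting each fold; when you only need the scores, the same stratified splits can be obtained by passing the splitter object directly as `cv`. A sketch on illustrative synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cv accepts any splitter object, not just an integer fold count
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         cv=skf, scoring="accuracy")
print(f"Stratified CV accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```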

Learning Curves

Learning curves show how training and validation performance change as training set size grows. They are the primary diagnostic tool for bias-variance problems:

python
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_temp, y_temp,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

train_mean = train_scores.mean(axis=1)
train_std  = train_scores.std(axis=1)
val_mean   = val_scores.mean(axis=1)
val_std    = val_scores.std(axis=1)

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(train_sizes, train_mean, "o-", color="#2196F3", label="Training accuracy")
ax.plot(train_sizes, val_mean,   "o-", color="#E91E63", label="Validation accuracy")
ax.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.15, color="#2196F3")
ax.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.15, color="#E91E63")
ax.set_xlabel("Training set size")
ax.set_ylabel("Accuracy")
ax.set_title("Learning Curves")
ax.legend()
plt.show()

How to read learning curves:

| Pattern | Diagnosis | Fix |
|---|---|---|
| Large gap between train and val (train high, val low) | High variance (overfitting) | More data, regularisation, simpler model |
| Both train and val low | High bias (underfitting) | More complex model, more features |
| Both converging and high | Good fit | Deploy with confidence |
| Both plateau and val oscillates | Noisy data | More data, denoising |
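The first two patterns can be checked numerically from the endpoints of the curves. A crude heuristic sketch (the function name and the 0.05/0.75 thresholds are arbitrary choices for illustration, not standard values):

```python
import numpy as np

def diagnose(train_mean, val_mean, gap_tol=0.05, low_tol=0.75):
    """Crude bias/variance flag from the final points of the learning curves."""
    gap = train_mean[-1] - val_mean[-1]
    if gap > gap_tol:
        return "high variance (overfitting)"
    if val_mean[-1] < low_tol:
        return "high bias (underfitting)"
    return "reasonable fit"

# Hypothetical curve endpoints for illustration
print(diagnose(np.array([0.99, 0.98]), np.array([0.80, 0.82])))  # large gap
print(diagnose(np.array([0.70, 0.71]), np.array([0.68, 0.69])))  # both low
```

In practice you would pass it the `train_mean` and `val_mean` arrays computed from `learning_curve` and tune the tolerances to your problem.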

AUC-ROC

The ROC curve plots the True Positive Rate (recall) against the False Positive Rate at every possible classification threshold. AUC (Area Under the Curve) summarises the curve as a single number:

python
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest":        RandomForestClassifier(n_estimators=100, random_state=42)
}

fig, ax = plt.subplots(figsize=(7, 6))

for name, m in models.items():
    m.fit(X_train, y_train)
    y_prob = m.predict_proba(X_val)[:, 1]
    fpr, tpr, _ = roc_curve(y_val, y_prob)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, linewidth=2, label=f"{name} (AUC = {roc_auc:.3f})")

ax.plot([0, 1], [0, 1], "k--", label="Random classifier (AUC = 0.500)")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate (Recall)")
ax.set_title("ROC Curves")
ax.legend(loc="lower right")
plt.show()
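The `fpr`/`tpr`/`thresholds` arrays returned by `roc_curve` can also be used to pick an operating point, for example the threshold maximising Youden's J statistic (TPR − FPR). A sketch on illustrative synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
probs = clf.predict_proba(Xva)[:, 1]

fpr, tpr, thresholds = roc_curve(yva, probs)
j = tpr - fpr                       # Youden's J at each candidate threshold
best = np.argmax(j)
print(f"Best threshold: {thresholds[best]:.3f} "
      f"(TPR={tpr[best]:.3f}, FPR={fpr[best]:.3f})")
```

Maximising J weights false positives and false negatives equally; if your application's costs are asymmetric, choose the threshold from the curve with those costs in mind instead.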

AUC interpretation:

| AUC | Model quality |
|---|---|
| 1.00 | Perfect classifier |
| 0.90–0.99 | Excellent |
| 0.80–0.89 | Good |
| 0.70–0.79 | Fair |
| 0.50 | No better than random |
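AUC also has a useful probabilistic reading: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A small check of this equivalence by brute-force pair counting (fine only for small samples; the data here is simulated purely for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
scores = rng.normal(size=200) + y_true    # positives shifted upward on average

pos = scores[y_true == 1]
neg = scores[y_true == 0]
# Fraction of (positive, negative) pairs ranked correctly; ties count half
pairs = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

print(f"Pairwise estimate: {pairs:.6f}")
print(f"roc_auc_score:     {roc_auc_score(y_true, scores):.6f}")
```

The two numbers agree because AUC is the Mann-Whitney U statistic normalised by the number of positive-negative pairs.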

Bias-Variance Decomposition

The expected prediction error decomposes into three components:

Error = Bias² + Variance + Irreducible Noise

  • Bias: Error from wrong assumptions in the model. A linear model fit to non-linear data has high bias.
  • Variance: Error from sensitivity to small fluctuations in training data. A deep decision tree that memorises training examples has high variance.
  • Irreducible noise: Inherent randomness in the target variable — cannot be reduced by any model.

python
# Sweep model complexity (tree depth) and compare train vs validation accuracy
depths = range(1, 20)
train_accs, val_accs = [], []

for d in depths:
    m = RandomForestClassifier(max_depth=d, n_estimators=50, random_state=42)
    m.fit(X_train, y_train)
    train_accs.append(m.score(X_train, y_train))  # accuracy on the data it saw
    val_accs.append(m.score(X_val, y_val))        # accuracy on held-out data

best_depth = depths[np.argmax(val_accs)]
print(f"Best max_depth: {best_depth} (val accuracy: {max(val_accs):.4f})")
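The variance term itself can be estimated empirically: train the same model on many bootstrap resamples and measure how much its predictions disagree on fixed evaluation points. A sketch comparing a shallow and a deep decision tree (the helper name, sample sizes, and depths are illustrative assumptions, not part of the formal decomposition):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_fit, X_eval = X_demo[:500], X_demo[500:]   # fixed evaluation points
y_fit = y_demo[:500]

def prediction_variance(max_depth, n_boot=30, seed=0):
    """Variance of predicted probabilities across bootstrap-trained models."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X_fit), size=len(X_fit))   # bootstrap resample
        m = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        m.fit(X_fit[idx], y_fit[idx])
        preds.append(m.predict_proba(X_eval)[:, 1])
    return np.var(np.stack(preds), axis=0).mean()

v_shallow = prediction_variance(2)
v_deep = prediction_variance(20)
print(f"Depth 2 variance:  {v_shallow:.4f}")
print(f"Depth 20 variance: {v_deep:.4f}")
```

The deep tree's predictions fluctuate far more from resample to resample, which is exactly the instability the variance term captures.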

Summary

  • Always hold out a test set before any modelling; it is touched exactly once — after all tuning decisions are made — to report final performance.
  • K-fold cross-validation provides a more stable performance estimate than a single split; stratified K-fold preserves class balance across folds for imbalanced datasets.
  • Learning curves diagnose the bias-variance tradeoff: wide train-val gap indicates overfitting; both curves low indicates underfitting.
  • AUC-ROC measures overall ranking ability across all thresholds; it is threshold-independent and appropriate for imbalanced problems where accuracy misleads.
  • The bias-variance decomposition explains why there is no universally best model complexity — the optimal point balances underfitting against overfitting for the specific dataset size.