GadaaLabs
Machine Learning Engineering
Lesson 4

Experiment Tracking with MLflow

14 min

Without systematic experiment tracking, ML development quickly devolves into "I think the model with 256 hidden units was better, but I'm not sure which checkpoint that was." MLflow gives every run a permanent, queryable record: hyperparameters, metrics, artifacts, and environment — all linked to a single run ID you can reproduce weeks later.

Core MLflow Concepts

| Concept | What it stores | Example |
|---|---|---|
| Experiment | A named group of runs | click-rate-prediction |
| Run | One training execution | run_id: a1b2c3d4 |
| Params | Hyperparameters (logged once) | lr=0.001, dropout=0.3 |
| Metrics | Scalar values over steps | val_loss at each epoch |
| Artifacts | Files (models, plots, configs) | best_model.pt, confusion_matrix.png |
| Tags | Free-form key-value labels | team=rec-sys, dataset=v7 |

Instrumenting a Training Run

```python
import mlflow
import mlflow.pytorch

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("click-rate-prediction")

NUM_EPOCHS = 30

with mlflow.start_run(run_name="resnet18-baseline") as run:
    # Log hyperparameters once, up front
    mlflow.log_params({
        "learning_rate": 3e-4,
        "batch_size": 64,
        "epochs": NUM_EPOCHS,
        "optimizer": "AdamW",
        "weight_decay": 1e-2,
    })

    for epoch in range(NUM_EPOCHS):
        train_loss, train_acc = train_epoch(...)
        val_loss, val_acc     = validate(...)

        # Log one scalar per metric per epoch
        mlflow.log_metrics({
            "train_loss": train_loss,
            "train_acc":  train_acc,
            "val_loss":   val_loss,
            "val_acc":    val_acc,
        }, step=epoch)

    # Log the trained model as an artifact of this run
    mlflow.pytorch.log_model(model, artifact_path="model")
    print(f"Run ID: {run.info.run_id}")
```

The with block automatically records run duration, the git commit hash (when training is launched from inside a git repository), and the final run status: FINISHED on a clean exit, FAILED if the body raises an exception.

Comparing Runs Programmatically

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

runs = client.search_runs(
    experiment_ids=["1"],
    filter_string="metrics.val_acc > 0.85",
    order_by=["metrics.val_loss ASC"],
    max_results=10,
)

for r in runs:
    print(
        f"{r.info.run_id[:8]}  "
        f"val_loss={r.data.metrics['val_loss']:.4f}  "
        f"lr={r.data.params['learning_rate']}"
    )
```

Model Registry and Promotion

```python
# Register the best run's model
model_uri = f"runs:/{best_run_id}/model"
mv = mlflow.register_model(model_uri, name="ClickRatePredictor")

# Promote to staging after evaluation
client.transition_model_version_stage(
    name="ClickRatePredictor",
    version=mv.version,
    stage="Staging",
    archive_existing_versions=False,
)

# Load the staging model for integration tests
staging_model = mlflow.pytorch.load_model(
    model_uri="models:/ClickRatePredictor/Staging"
)
```

The registry enforces an explicit promotion gate: a model must be manually (or programmatically) moved from None → Staging → Production. This prevents accidental deployment of untested checkpoints.

Reproducing an Experiment

```bash
# Serve the logged model in its recorded environment
mlflow models serve -m "runs:/a1b2c3d4/model" --port 5001

# Or re-run training with the run's parameters (requires an MLproject file)
mlflow run . -P learning_rate=3e-4 -P batch_size=64 --experiment-name repro-check
```

Summary

  • MLflow organises work into experiments → runs, with params, metrics, artifacts, and tags per run.
  • Wrap training in mlflow.start_run() and call log_params once and log_metrics per step.
  • Use client.search_runs() with filter strings to find the best run without manually scanning the UI.
  • Register production candidates in the Model Registry and require explicit stage transitions before deployment.
  • Reproduce any experiment by loading its logged params and re-running the training script against the same DVC-pinned dataset.