GadaaLabs
Machine Learning Engineering
Lesson 7

Monitoring and Drift Detection

15 min

Models degrade silently. The distribution of incoming requests shifts away from training data, labels change meaning, upstream systems change schema — and the model keeps returning confident wrong predictions. Production monitoring catches this before business metrics crater.

Types of Drift

| Drift type | What changes | Detection method |
|---|---|---|
| Data drift (covariate shift) | Input feature distribution P(X) | Statistical tests on feature values |
| Label drift (prior shift) | Target distribution P(Y) | Monitor prediction distribution |
| Concept drift | P(Y\|X) relationship | Monitor model accuracy on labeled windows |
| Schema drift | Feature names, types, nullability | Schema validation on every batch |

You need detection mechanisms for all four categories.
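Evidently handles the statistical checks below; schema drift is simpler to catch with a hand-rolled validation step before data ever reaches the drift job. A minimal sketch — the `EXPECTED_SCHEMA` dict here is illustrative (in practice, derive it from your training data):

```python
import pandas as pd

# Illustrative expected schema: column name -> pandas dtype string.
# In practice, generate this from the training DataFrame's dtypes.
EXPECTED_SCHEMA = {
    "age": "int64",
    "click_rate": "float64",
    "country": "object",
}

def validate_schema(df: pd.DataFrame, expected: dict) -> list[str]:
    """Return a list of human-readable schema violations (empty list = OK)."""
    errors = []
    for col, dtype in expected.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    for col in df.columns:
        if col not in expected:
            errors.append(f"unexpected column: {col}")
    return errors
```

Run this on every incoming batch and fail fast on any violation — a renamed or retyped column should never reach the model silently.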

Setting Up Evidently

```bash
pip install evidently
```

```python
import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, ClassificationPreset

# Reference: data from training period
reference_df = pd.read_parquet("data/reference_window.parquet")
# Current: last 24h of production requests (logged by the serving layer)
current_df   = pd.read_parquet("data/production_2024_03_27.parquet")

column_mapping = ColumnMapping(
    target="label",
    prediction="prediction",
    numerical_features=["age", "click_rate", "session_duration"],
    categorical_features=["country", "device_type"],
)

report = Report(metrics=[DataDriftPreset(), ClassificationPreset()])
report.run(reference_data=reference_df, current_data=current_df,
           column_mapping=column_mapping)
report.save_html("drift_report.html")
```

Under the hood, Evidently picks a statistical test per feature based on its type and sample size — for example, the Kolmogorov-Smirnov test or Wasserstein distance for numerical features, and the chi-squared test or Jensen-Shannon divergence for categorical ones. The Population Stability Index (PSI) is also available as a per-feature drift metric.
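PSI is worth understanding directly, since the retraining trigger later in this lesson keys off it. A minimal sketch of the computation (binning strategy is a choice; this version takes bin edges from the reference sample):

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two numeric samples.

    Bin edges come from the reference distribution; a small epsilon
    guards against log(0) in empty bins. Current values outside the
    reference range fall out of the histogram, which is acceptable
    for a monitoring signal.
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6
    ref_pct = np.clip(ref_pct, eps, None)
    cur_pct = np.clip(cur_pct, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```

By convention, PSI below 0.1 means no meaningful shift, 0.1–0.2 moderate shift, and above 0.2 significant drift — the threshold used in the retraining trigger below.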

Extracting Drift Signals Programmatically

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestShareOfDriftedColumns

suite = TestSuite(tests=[
    TestShareOfDriftedColumns(lt=0.3),  # fail if 30% or more of columns drift
])
suite.run(reference_data=reference_df, current_data=current_df,
          column_mapping=column_mapping)

result = suite.as_dict()
if not result["summary"]["all_passed"]:
    print("DRIFT ALERT: retraining trigger fired")
    trigger_retraining_pipeline()  # your dispatch hook, defined elsewhere
```

Logging Predictions for Monitoring

The monitoring pipeline only works if the serving layer logs inputs and outputs:

```python
import json
import datetime
from pathlib import Path

LOG_DIR = Path("logs/predictions")
LOG_DIR.mkdir(parents=True, exist_ok=True)

def log_prediction(request_id: str, features: dict, prediction: str, confidence: float):
    record = {
        "request_id":  request_id,
        "timestamp":   datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "features":    features,
        "prediction":  prediction,
        "confidence":  confidence,
    }
    # One JSONL file per day; appending a full line per record keeps writes simple
    log_path = LOG_DIR / f"{datetime.date.today()}.jsonl"
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Write logs to an append-only JSONL file or a streaming sink (Kafka, Kinesis). Batch them daily into Parquet for the monitoring job.

Retraining Triggers

```python
DRIFT_THRESHOLD_PSI     = 0.2   # PSI > 0.2 = significant drift
ACCURACY_DROP_THRESHOLD = 0.05  # more than 5pp drop from baseline

def should_retrain(drift_psi: float, current_accuracy: float, baseline_accuracy: float) -> bool:
    drift_triggered    = drift_psi > DRIFT_THRESHOLD_PSI
    accuracy_triggered = (baseline_accuracy - current_accuracy) > ACCURACY_DROP_THRESHOLD
    return drift_triggered or accuracy_triggered
```

Set a scheduled job (cron or Airflow DAG) to run this check daily. When it returns True, dispatch a training run in your CI/CD system.
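Putting the pieces together, the daily job reduces to a small entry point. A self-contained sketch — the `metrics` dict and the dispatch `print` are placeholders for your monitoring store and CI/CD trigger, and `should_retrain` is repeated here so the sketch runs on its own:

```python
DRIFT_THRESHOLD_PSI     = 0.2
ACCURACY_DROP_THRESHOLD = 0.05

def should_retrain(drift_psi: float, current_accuracy: float, baseline_accuracy: float) -> bool:
    return (drift_psi > DRIFT_THRESHOLD_PSI
            or (baseline_accuracy - current_accuracy) > ACCURACY_DROP_THRESHOLD)

def daily_check(metrics: dict) -> bool:
    """Entry point for the scheduled job: decide and, if needed, dispatch.

    `metrics` is assumed to hold the latest drift PSI plus current and
    baseline accuracy, produced by the monitoring pipeline above.
    """
    if should_retrain(metrics["drift_psi"],
                      metrics["current_accuracy"],
                      metrics["baseline_accuracy"]):
        print("dispatching training run")  # replace with your CI/CD trigger call
        return True
    return False
```

Returning a boolean keeps the decision testable separately from the dispatch side effect.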

Summary

  • Monitor four types of drift: data, label, concept, and schema — each requires a different detection approach.
  • Use Evidently to generate data drift reports comparing a reference window (training data) to a current production window.
  • Log every prediction with its input features at serving time; without logs, monitoring is impossible.
  • Automate retraining triggers based on PSI thresholds for data drift and accuracy drops relative to a baseline.
  • Schedule monitoring as a daily batch job and alert on-call engineers before accuracy degradation reaches users.