Production Monitoring, Data Drift & Model Decay
A deployed model is not a finished product — it is a system that continuously interacts with a changing world. The data distribution that produced your training set shifts as user behaviour evolves, upstream pipelines change their schemas, and the real-world phenomenon you modelled changes. Without monitoring, your model silently decays: accuracy falls, business decisions worsen, and nobody notices until a major incident.
What Can Fail in Production
Understanding the failure taxonomy determines what to monitor.
Data drift (covariate shift): P(X) changes. The input feature distribution shifts away from what the model was trained on. Example: an economic shock causes monthly_spend to drop sharply for all users — the model was never trained on this regime.
Concept drift: P(y|X) changes. The relationship between features and the target changes. Example: a competitor launches a competing product, so high-usage customers who previously stayed now churn — the same feature values now imply different outcomes.
Label drift: P(y) changes. The base rate of the target shifts. Example: a marketing campaign successfully retains customers, dropping churn from 25% to 10% — a calibrated model now over-predicts churn.
Upstream pipeline breakage: a schema change in an upstream data source introduces nulls, wrong units, or encoded values the preprocessing has never seen. In practice this is often the most common failure mode.
Monitoring Layers
| Layer | What to track | When to alert |
|-------|---------------|---------------|
| Data quality | null rate, schema violations, out-of-range values | null rate increases by > 2x; new unknown categories |
| Feature distribution | mean, std, histogram per feature | PSI > 0.2; KS p-value < 0.05 |
| Prediction distribution | score histogram, mean prediction | PSI > 0.1 on score distribution |
| Model performance | accuracy, AUC, F1 | when labels arrive — often days/weeks delayed |
| System health | latency p99, error rate, throughput | error rate > 1%; p99 > SLA |
Statistical Drift Tests
KS Test (Kolmogorov-Smirnov)
Tests whether two samples come from the same continuous distribution. The KS statistic is the maximum absolute difference between the two empirical CDFs. Note that with large production samples the test will flag even tiny, practically irrelevant shifts, so pair the p-value with a magnitude measure such as PSI.
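A minimal sketch using `scipy.stats.ks_2samp`; the simulated feature values and the 0.05 significance threshold follow the alerting table above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
reference = rng.normal(loc=50.0, scale=10.0, size=5000)  # training-window feature values
current = rng.normal(loc=55.0, scale=10.0, size=5000)    # production window, mean shifted

# KS statistic = max absolute gap between the two empirical CDFs
statistic, p_value = stats.ks_2samp(reference, current)
drifted = p_value < 0.05
```

With 5,000 samples per window, a half-standard-deviation mean shift like this one is detected with a vanishingly small p-value.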
Chi-Square Test for Categorical Features
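For categorical features, a chi-square goodness-of-fit test compares observed category counts in the production window against the counts expected under the reference proportions. A minimal sketch with `scipy.stats.chisquare` (the feature and its counts are invented for illustration):

```python
import numpy as np
from scipy import stats

# Category counts for a hypothetical categorical feature, e.g. `plan_type`
reference_counts = np.array([400, 350, 250])  # training window
current_counts = np.array([300, 300, 400])    # production window

# Expected counts under the reference proportions, scaled to the current sample size
expected = reference_counts / reference_counts.sum() * current_counts.sum()

chi2, p_value = stats.chisquare(f_obs=current_counts, f_exp=expected)
drifted = p_value < 0.05
```

Note that `chisquare` requires `f_obs` and `f_exp` to sum to the same total, which the scaling step above guarantees; a brand-new category that never appeared in the reference should instead trip the data-quality layer's unknown-category alert.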
Population Stability Index (PSI)
PSI is the standard drift metric in credit risk and finance. It quantifies how much a distribution has shifted relative to a reference.
Interpretation: PSI < 0.1 = no significant shift; 0.1 – 0.2 = moderate shift, investigate; > 0.2 = significant shift, retrain likely needed.
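A sketch of PSI with reference-quantile bins, so each bin holds roughly equal reference mass regardless of the feature's shape (the `eps` floor avoids log-of-zero in empty bins; bin count and epsilon are conventional choices, not fixed by the metric):

```python
import numpy as np

def psi(reference, current, n_bins=10, eps=1e-4):
    """Population Stability Index over reference-quantile bins."""
    # Bin edges at reference quantiles -> ~equal reference mass per bin
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))

    def bin_fracs(values):
        idx = np.searchsorted(edges, values, side="right") - 1
        idx = np.clip(idx, 0, n_bins - 1)  # out-of-range values land in edge bins
        return np.bincount(idx, minlength=n_bins) / len(values)

    p = np.clip(bin_fracs(reference), eps, None)
    q = np.clip(bin_fracs(current), eps, None)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(7)
reference = rng.normal(0.0, 1.0, 10_000)
stable = rng.normal(0.0, 1.0, 10_000)    # same distribution -> PSI near 0
shifted = rng.normal(1.0, 1.0, 10_000)   # one-sigma mean shift -> PSI well above 0.2
```

On these samples `psi(reference, stable)` lands in the "no significant shift" band and `psi(reference, shifted)` far exceeds the 0.2 retraining threshold.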
Jensen-Shannon Divergence
JSD is symmetric and, when computed with base-2 logarithms, bounded in [0, 1], which makes it easier to interpret and threshold than the asymmetric, unbounded KL divergence.
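A sketch using `scipy.spatial.distance.jensenshannon` on shared-bin histograms. Note that scipy returns the JS *distance* (the square root of the divergence), so the value is squared here; the bin count is an illustrative choice:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_drift(reference, current, n_bins=20):
    """Jensen-Shannon divergence (base 2, bounded [0, 1]) between two samples."""
    # Shared bin edges covering both samples, then normalised histograms
    lo = min(reference.min(), current.min())
    hi = max(reference.max(), current.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p = np.histogram(reference, bins=edges)[0] / len(reference)
    q = np.histogram(current, bins=edges)[0] / len(current)
    # scipy returns the JS distance = sqrt(divergence); square to get the divergence
    return jensenshannon(p, q, base=2) ** 2

rng = np.random.default_rng(3)
reference = rng.normal(0.0, 1.0, 10_000)
shifted = rng.normal(1.0, 1.0, 10_000)
```

Identical distributions give a divergence of 0; completely disjoint supports give 1.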
DriftMonitor Class
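One way to package the per-feature checks is a small class frozen on a reference window. This is an illustrative sketch (the class name, report schema, and 0.2 default threshold are assumptions, the threshold matching the PSI guidance above):

```python
import numpy as np
from scipy import stats

class DriftMonitor:
    """Per-feature drift checks against a frozen reference window (sketch)."""

    def __init__(self, reference: dict[str, np.ndarray], psi_threshold: float = 0.2):
        self.reference = reference
        self.psi_threshold = psi_threshold

    @staticmethod
    def _psi(ref, cur, n_bins=10, eps=1e-4):
        # PSI over reference-quantile bins, as defined in the section above
        edges = np.quantile(ref, np.linspace(0, 1, n_bins + 1))
        idx_r = np.clip(np.searchsorted(edges, ref, side="right") - 1, 0, n_bins - 1)
        idx_c = np.clip(np.searchsorted(edges, cur, side="right") - 1, 0, n_bins - 1)
        p = np.clip(np.bincount(idx_r, minlength=n_bins) / len(ref), eps, None)
        q = np.clip(np.bincount(idx_c, minlength=n_bins) / len(cur), eps, None)
        return float(np.sum((q - p) * np.log(q / p)))

    def check(self, current: dict[str, np.ndarray]) -> dict[str, dict]:
        report = {}
        for name, ref in self.reference.items():
            cur = current[name]
            ks_stat, ks_p = stats.ks_2samp(ref, cur)
            feat_psi = self._psi(ref, cur)
            report[name] = {"ks_p": float(ks_p), "psi": feat_psi,
                            "drifted": feat_psi > self.psi_threshold}
        return report

# Usage: one drifted feature, one stable feature
rng = np.random.default_rng(0)
monitor = DriftMonitor({
    "monthly_spend": rng.normal(100, 20, 5000),
    "age": rng.normal(40, 10, 5000),
})
report = monitor.check({
    "monthly_spend": rng.normal(130, 20, 5000),  # regime shift
    "age": rng.normal(40, 10, 5000),             # unchanged
})
```

Running `check` on each scoring batch (or on a daily window) and persisting the report gives you the time series the storage takeaway below relies on.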
Prediction Distribution Monitoring
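Because classifier scores already live in [0, 1], fixed equal-width bins are enough here; the table above suggests alerting at PSI > 0.1 on the score distribution. A sketch (the Beta-distributed scores are simulated stand-ins for model outputs):

```python
import numpy as np

def score_psi(ref_scores, cur_scores, n_bins=10, eps=1e-4):
    """PSI on prediction-score distributions; scores assumed in [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    p = np.clip(np.histogram(ref_scores, bins=edges)[0] / len(ref_scores), eps, None)
    q = np.clip(np.histogram(cur_scores, bins=edges)[0] / len(cur_scores), eps, None)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(11)
ref_scores = rng.beta(2, 5, 10_000)      # reference score distribution
same_scores = rng.beta(2, 5, 10_000)     # stable production scores
drifted_scores = rng.beta(5, 2, 10_000)  # scores shifted upward
```

A drifted score distribution on top of stable input features is the classic signature of concept drift or a silent serving bug, which is why this layer is monitored separately from the feature layer.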
Evidently AI Integration
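Libraries like Evidently generate full drift dashboards from a reference and a current DataFrame. A guarded sketch, assuming the `Report`/`DataDriftPreset` API of Evidently ~0.4 (the library's API has changed significantly across versions, so check the documentation for yours):

```python
def evidently_drift_report(reference_df, current_df, out_path="drift_report.html"):
    """Write an HTML drift dashboard with Evidently, if it is installed.

    Returns the output path, or None when the dependency is missing so callers
    can fall back to the manual KS/PSI checks above.
    """
    try:
        from evidently.report import Report
        from evidently.metric_preset import DataDriftPreset
    except ImportError:
        return None  # evidently not installed

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference_df, current_data=current_df)
    report.save_html(out_path)
    return out_path
```

Treat the dashboard as a diagnostic view, not the alerting source of truth: keep the thresholded KS/PSI checks in your own code so alerts do not change meaning when the library version does.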
Alerting Logic
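The retraining trigger from the takeaways below (PSI > 0.2 on two or more features, or PSI > 0.1 on the score distribution) can be encoded as a pure function so it is testable and automatable. The function name and return shape are illustrative:

```python
def should_retrain(feature_psi: dict[str, float], score_psi: float) -> tuple[bool, str]:
    """Objective retraining trigger: score-distribution drift, or widespread feature drift."""
    if score_psi > 0.1:
        return True, f"score distribution PSI {score_psi:.3f} > 0.1"
    drifted = sorted(name for name, value in feature_psi.items() if value > 0.2)
    if len(drifted) >= 2:
        return True, f"features {drifted} exceed PSI 0.2"
    return False, "no trigger fired"
```

Returning a reason string alongside the boolean keeps alert messages and audit logs consistent with the rule that actually fired.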
Key Takeaways
- Monitor all four failure modes separately: data drift (P(X) shifts), concept drift (P(y|X) shifts), label drift (P(y) shifts), and upstream pipeline breakage — they have different root causes and different remediation paths.
- PSI is the industry-standard drift metric: < 0.1 is stable, 0.1-0.2 warrants investigation, > 0.2 likely requires retraining. Use reference-quantile bins to make it distribution-aware.
- KS test detects differences in the full distribution shape; PSI quantifies how much the population has shifted. Use both — KS for statistical significance, PSI for magnitude.
- Monitor prediction score distributions in addition to input features: a stable input distribution with a drifted score distribution signals concept drift or a silent model bug.
- Performance monitoring with delayed labels requires storing prediction IDs and joining them to outcome events when labels arrive — design this pipeline at deployment time, not after an incident.
- Store all monitoring statistics in a time-series database (SQLite is adequate at small scale) to enable trend analysis and early detection of gradual drift.
- Evidently AI and similar libraries accelerate dashboard generation but add a dependency — always implement the core KS/PSI logic yourself so you understand what the tools are computing.
- Set your retraining trigger at PSI > 0.2 on two or more features, or PSI > 0.1 on the prediction score distribution — this gives you an objective, automatable retraining criterion.
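As a minimal illustration of the storage takeaway above, a SQLite-backed statistics store can be sketched in a few lines (the table name and schema are illustrative; swap in a dedicated time-series database as volume grows):

```python
import sqlite3
import time

def init_store(path="monitoring.db"):
    """Open (or create) the monitoring database with one stats table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS drift_stats (
               ts REAL, feature TEXT, psi REAL, ks_p REAL)"""
    )
    return conn

def record(conn, feature, psi, ks_p):
    """Append one per-feature measurement with a wall-clock timestamp."""
    conn.execute(
        "INSERT INTO drift_stats VALUES (?, ?, ?, ?)",
        (time.time(), feature, psi, ks_p),
    )
    conn.commit()
```

Querying this table by feature and time range is what turns one-off drift checks into trend lines, which is how gradual drift gets caught before it crosses an alert threshold.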