ML Model Testing — Beyond Accuracy
Overall accuracy is a lie. A model with 95% accuracy on an imbalanced dataset may be correct 100% of the time on the majority class and completely wrong on the minority class that actually matters. A model that passes every benchmark in the lab may fail catastrophically on the specific demographic that uses your product most. Model evaluation in production requires going far beyond aggregate metrics — it requires systematic investigation of where the model fails and proof that those failures are acceptable.
Slice-Based Evaluation
Slice-based evaluation (also called disaggregated evaluation) measures model performance on meaningful subsets of the data. A slice is any filter that creates a subpopulation: a demographic group, a geographic region, a time window, a product category, a device type. Every ML system should evaluate performance on all slices that the business cares about before deployment.
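At its core, slice-based evaluation is just a group-by over labeled predictions. The sketch below is illustrative: the record fields and the `slice_metrics` helper are assumptions, not a library API.

```python
from collections import defaultdict

def slice_metrics(records, slice_key, metric):
    """Group prediction records by a slice key and compute a metric per slice.

    `records` is a list of dicts with at least `slice_key`, "y_true", "y_pred".
    `metric` takes (y_true_list, y_pred_list) and returns a float.
    """
    groups = defaultdict(lambda: ([], []))
    for r in records:
        ys, ps = groups[r[slice_key]]
        ys.append(r["y_true"])
        ps.append(r["y_pred"])
    return {k: metric(ys, ps) for k, (ys, ps) in groups.items()}

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy records: aggregate accuracy is 0.75, but the slices tell a different story
records = [
    {"region": "EU",   "y_true": 1, "y_pred": 1},
    {"region": "EU",   "y_true": 0, "y_pred": 0},
    {"region": "APAC", "y_true": 1, "y_pred": 0},
    {"region": "APAC", "y_true": 1, "y_pred": 1},
]
per_slice = slice_metrics(records, "region", accuracy)
# → {"EU": 1.0, "APAC": 0.5} — the aggregate hides the APAC gap
```

In practice the `metric` argument would be recall, AUC, or whatever the business threshold is defined on, computed per slice with the same code path.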
Behavioral Testing
Behavioral testing defines expected model behaviors as explicit properties and verifies each one. The approach, adapted from CheckList (a behavioral-testing framework originally built for NLP models), defines three types of tests for any ML model.
MFT (Minimum Functionality Tests): basic cases the model must always get right. If a customer has purchased 10 times in the last month, their churn score should be low. These are not statistical — they are logical constraints.
INV (Invariance Tests): changing an irrelevant feature should not change the prediction. A customer's churn probability should not change if you randomly vary their name or primary key. If it does, the model has learned a spurious correlation.
DIR (Directional Expectation Tests): moving a feature in a known direction should move the prediction in a known direction. Higher spend should lower churn probability. More support tickets should raise it. Violations indicate a sign error, a multicollinearity issue, or overfitting to noise.
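The three test types above can be expressed directly as assertions against a model's predict function. The `churn_score` stand-in below and its feature names are assumptions for illustration, not a real model:

```python
def churn_score(features):
    # Toy stand-in model: rises with support tickets, falls with purchases/spend
    score = 0.5 + 0.05 * features["support_tickets"]
    score -= 0.03 * features["purchases_last_month"]
    score -= 0.0001 * features["monthly_spend"]
    return min(max(score, 0.0), 1.0)

base = {"support_tickets": 2, "purchases_last_month": 3,
        "monthly_spend": 200, "customer_name": "Ada"}

# MFT: a customer with 10 purchases last month must score low
loyal = dict(base, purchases_last_month=10)
assert churn_score(loyal) < 0.3

# INV: renaming the customer must not move the prediction at all
renamed = dict(base, customer_name="Bob")
assert churn_score(renamed) == churn_score(base)

# DIR: more support tickets must raise the score; more spend must lower it
assert churn_score(dict(base, support_tickets=5)) > churn_score(base)
assert churn_score(dict(base, monthly_spend=500)) < churn_score(base)
```

Note the INV test passes trivially here because the toy model never reads `customer_name`; against a real pipeline it is exactly the test that catches a leaked identifier feature.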
Shadow Mode Evaluation
Shadow mode runs a new model candidate in parallel with the current production model. The new model's predictions are logged but never served to users. This lets you compare distributions and catch obvious failures before any users are affected.
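A minimal sketch of the pattern: a wrapper that always serves the champion, silently logs the challenger, and summarises how far apart their score distributions sit. The class and report fields are illustrative, not a standard API.

```python
import statistics

class ShadowLogger:
    """Serve the champion's prediction; log the challenger's silently."""

    def __init__(self, champion, challenger):
        self.champion, self.challenger = champion, challenger
        self.log = []  # (champion_pred, challenger_pred) pairs

    def predict(self, x):
        served = self.champion(x)
        self.log.append((served, self.challenger(x)))
        return served  # users only ever see the champion's answer

    def divergence_report(self):
        champ = [c for c, _ in self.log]
        chall = [s for _, s in self.log]
        return {
            "mean_shift": statistics.mean(chall) - statistics.mean(champ),
            "max_abs_diff": max(abs(c - s) for c, s in self.log),
        }

champion = lambda x: 0.4 + 0.01 * x
challenger = lambda x: 0.5 + 0.01 * x  # systematically higher — should be caught

shadow = ShadowLogger(champion, challenger)
for x in range(10):
    shadow.predict(x)
report = shadow.divergence_report()
# mean_shift ≈ 0.10: the challenger scores run higher; investigate before promoting
```

A production version would log asynchronously and compare full distributions (e.g. a KS test) rather than means, but the control-flow guarantee is the same: the challenger's output never reaches a user.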
Model Cards
A model card is a standardised document that accompanies every production model. It defines intended use, limitations, training data, evaluation data, and ethical considerations.
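One way to make the card machine-checkable is to encode it as a typed structure with a validation step, so an incomplete card can block deployment. This schema is an illustrative sketch of the fields listed above, not a standard format.

```python
from dataclasses import dataclass

@dataclass
class ModelCard:
    """Minimal model card mirroring the sections described above."""
    model_name: str
    version: str
    intended_use: str
    out_of_scope_uses: list
    training_data: str
    evaluation_data: str
    slice_metrics: dict   # per-slice performance, e.g. {"region=APAC": {"recall": 0.71}}
    ethical_considerations: str
    limitations: str

    def validate(self):
        """Flag cards missing the sections most often skipped under deadline."""
        problems = []
        if not self.out_of_scope_uses:
            problems.append("no out-of-scope uses declared")
        if not self.slice_metrics:
            problems.append("no disaggregated slice metrics")
        return problems
```

Running `validate()` in CI means a model without declared out-of-scope uses or disaggregated metrics simply cannot ship.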
Cost-Sensitive Evaluation
For business decisions, raw AUC is insufficient. A missed churn prediction (false negative) might cost $5,000 in lost ARR. A false positive (incorrectly flagging a healthy customer for outreach) might cost $50 of customer success time. The optimal decision threshold is not 0.5.
pytest Test Suite for ML Models
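A suite covering the checks described in this chapter can be sketched as plain `test_*` functions, which pytest discovers and runs automatically. The `predict` stand-in, its feature names, and the thresholds are assumptions standing in for a real model and real evaluation data.

```python
# test_model.py — run with: pytest test_model.py
import math

def predict(features):
    """Stand-in for the real model's predict function (returns a churn probability)."""
    score = 0.5 + 0.05 * features["support_tickets"] \
                - 0.03 * features["purchases_last_month"]
    return min(max(score, 0.0), 1.0)

def base_customer():
    return {"support_tickets": 2, "purchases_last_month": 3, "customer_name": "Ada"}

def test_prediction_contract():
    # Contract: a float in [0, 1], never NaN
    p = predict(base_customer())
    assert isinstance(p, float)
    assert 0.0 <= p <= 1.0
    assert not math.isnan(p)

def test_mft_loyal_customer_scores_low():
    # MFT: 10 purchases last month must yield a low churn score
    loyal = dict(base_customer(), purchases_last_month=10)
    assert predict(loyal) < 0.35

def test_inv_name_is_irrelevant():
    # INV: renaming the customer must not change the prediction
    assert predict(dict(base_customer(), customer_name="Bob")) == predict(base_customer())

def test_dir_more_tickets_raises_churn():
    # DIR: more support tickets must raise the score
    worse = dict(base_customer(), support_tickets=6)
    assert predict(worse) > predict(base_customer())

def test_min_accuracy_per_slice():
    # Per-slice floor: each region must clear 90% accuracy on toy labeled data
    eval_set = [
        ({"support_tickets": 8, "purchases_last_month": 0, "region": "EU"},   1),
        ({"support_tickets": 0, "purchases_last_month": 9, "region": "EU"},   0),
        ({"support_tickets": 7, "purchases_last_month": 1, "region": "APAC"}, 1),
        ({"support_tickets": 1, "purchases_last_month": 8, "region": "APAC"}, 0),
    ]
    for region in ("EU", "APAC"):
        rows = [(f, y) for f, y in eval_set if f["region"] == region]
        acc = sum((predict(f) >= 0.5) == bool(y) for f, y in rows) / len(rows)
        assert acc >= 0.9, f"accuracy below floor on slice region={region}"
```

Wired into CI as blocking checks, any MFT, INV, DIR, contract, or slice-threshold regression stops the merge before the model reaches production.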
Key Takeaways
- Aggregate metrics hide slice-level failures; always evaluate model performance on every slice the business cares about and set minimum performance thresholds per slice before deployment.
- Behavioral testing gives you three specific guarantees: MFT (the model gets obvious cases right), INV (irrelevant features don't change predictions), DIR (the model responds correctly to feature changes).
- Shadow mode is the safest way to evaluate a new model in production conditions: run it in parallel, log its predictions, and compare distributions before any users see it.
- Model cards are not optional documentation — they force you to define out-of-scope uses, evaluate on disaggregated slices, and document ethical considerations before deployment.
- The champion vs challenger framework provides a structured protocol for model promotion: the challenger must beat the champion by a defined margin on a held-out evaluation set.
- Cost-sensitive evaluation finds the decision threshold that maximises business value, not statistical accuracy — the optimal threshold is almost never 0.5.
- Your pytest model test suite should test prediction contract (shape, range, no NaN), known inputs, invariances, directional expectations, and performance thresholds per slice — all as blocking CI checks.