GadaaLabs
Claude Code Superpowers: AI That Gets Smarter With Every Task
Lesson 6

GRADIENT — ML Pipelines to Production

22 min

ML systems fail differently from web applications. A web server either responds or it does not. An ML model always responds — but its responses may be silently wrong for weeks before anyone notices. Drift creeps in. Labels corrupt. Training-serving skew accumulates. The model degrades gracefully, then suddenly falls off a cliff.

The GRADIENT skill provides structured protocols for every stage of the ML lifecycle. It does not try to be a textbook. It provides the specific checklists, tests, and configurations that prevent the failure modes that kill ML systems in production.

The Stage Assessment

When you invoke the GRADIENT skill, the first thing it does is assess where you are in the lifecycle. This prevents loading irrelevant patterns — a data pipeline problem and a serving latency problem need completely different protocols.

STAGE ASSESSMENT:

"What stage are you at?"
A) Data collection / ingestion
B) Feature engineering / preprocessing  
C) Model training / experimentation
D) Model evaluation / validation
E) Model serving / inference
F) Production monitoring / MLOps
G) Debugging an ML failure

Each stage maps to a specific set of patterns. Stages A/B load the data pipeline patterns. Stage C loads the training checklist. Stage E loads the serving patterns. Stage F loads the MLOps configuration. Stage G invokes hunter first, then returns with evidence.

After identifying the stage, the skill asks one more question: "What's the primary constraint — accuracy, latency, cost, or reliability?" This single answer shapes every trade-off throughout the implementation.

Stage A/B: Data Pipeline Testing

The most common cause of ML failure is not the model. It is the data.

Data pipeline tests should run before any model training. They validate the contract between your data source and your training code.

Schema validation catches structural problems immediately:

python
def test_input_schema_matches_expectations():
    expected_columns = {'user_id', 'timestamp', 'features', 'label'}
    actual_columns = set(df.columns)
    assert expected_columns == actual_columns, \
        f"Missing: {expected_columns - actual_columns}, " \
        f"unexpected: {actual_columns - expected_columns}"

def test_label_distribution_reasonable():
    label_counts = df['label'].value_counts()
    assert len(label_counts) == NUM_CLASSES  # All classes present
    assert (label_counts / len(df)).max() < 0.9  # No class dominates

If the label distribution test fails — one class dominates — your model will predict the majority class for everything and achieve misleadingly high accuracy. This test catches it before you waste GPU hours.

Distribution shift detection catches the slow-moving problem: production data that looks different from training data. This is the leading cause of production model degradation.

The Kolmogorov-Smirnov test detects if two samples come from different distributions:

python
from scipy import stats

def test_feature_distribution_shift():
    for feature_idx in range(FEATURE_DIM):
        ks_stat, p_value = stats.ks_2samp(
            train_features[:, feature_idx],
            prod_features[:, feature_idx]
        )
        assert p_value > 0.01, \
            f"Feature {feature_idx} drift (p={p_value:.4f})"

Load the full pattern file for all distribution shift tests: the covariate shift AUC test, null handling tests, extreme value tests, and data leakage check are all there. The leakage check is particularly important — it prevents the silent error where features contain future information, making validation metrics artificially high.
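
The leakage check itself lives in the pattern file. As one hedged sketch, a correlation screen catches the crudest form of leakage: a feature that is effectively the label in disguise. The function name and threshold below are illustrative, not the pattern file's:

```python
import numpy as np

def check_leakage_by_correlation(features, labels, threshold=0.95):
    """Flag feature columns that correlate almost perfectly with the label.

    Near-perfect correlation is a classic symptom of a feature that encodes
    the target (e.g. a column derived from future information). Returns the
    indices of suspicious columns.
    """
    suspicious = []
    for idx in range(features.shape[1]):
        col = features[:, idx]
        if np.std(col) == 0:  # constant column: correlation undefined, skip
            continue
        corr = abs(np.corrcoef(col, labels)[0, 1])
        if corr > threshold:
            suspicious.append(idx)
    return suspicious
```

This only catches direct leakage; indirect leakage (e.g. features computed over windows that overlap the prediction target) still requires the full check in the pattern file.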

Rule: Write pipeline tests before writing any model code. A broken pipeline wastes training time and produces worthless models.

Stage C: Model Training Checklist

Before training any complex model, establish a baseline.

markdown
- [ ] Simple baseline implemented (logistic regression, majority class, or rule-based)?
- [ ] Baseline metrics documented (accuracy, F1, AUC-ROC, latency)?
- [ ] Complex model beats baseline?

If your deep learning model cannot beat logistic regression, you have a data problem, not a model architecture problem. The baseline forces this question before you spend days tuning hyperparameters.
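
A majority-class baseline takes a few lines. The helper names below are illustrative; the point is that on an imbalanced dataset this baseline can score 0.9 accuracy, so any "complex model" scoring 0.9 has learned nothing:

```python
import numpy as np

def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    values, counts = np.unique(train_labels, return_counts=True)
    majority = values[np.argmax(counts)]
    return lambda X: np.full(len(X), majority)

def accuracy(predict, X, y):
    """Fraction of predictions matching the true labels."""
    return float(np.mean(predict(X) == y))
```

With 90% positive labels, this baseline hits 0.9 accuracy without looking at a single feature, which is exactly why the checklist demands baseline metrics before hyperparameter tuning.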

Hyperparameter search: Bayesian optimization converges in far fewer trials than grid search for high-dimensional spaces. The training patterns file contains a Bayesian search configuration using Optuna-style settings: 50 trials, 24-hour timeout, log-uniform learning rate, categorical choices for layers and batch size, median pruning to kill underperforming trials early.
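
The actual configuration lives in the training patterns file. A sketch of the settings it describes, expressed as a plain dict with illustrative key names, looks like this:

```python
# Hypothetical Optuna-style search configuration mirroring the settings
# described in the training patterns file; key names are illustrative.
HYPERPARAM_SEARCH = {
    "n_trials": 50,
    "timeout_hours": 24,
    "pruner": "median",  # kill underperforming trials early
    "search_space": {
        "learning_rate": {"type": "loguniform", "low": 1e-5, "high": 1e-1},
        "num_layers":    {"type": "categorical", "choices": [2, 3, 4]},
        "batch_size":    {"type": "categorical", "choices": [32, 64, 128, 256]},
    },
}
```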

Early stopping: Never train without it. The configuration covers the metric to monitor (validation_loss), the mode (min), patience (10 epochs without improvement), minimum improvement threshold (1e-4), and weight restoration (always restore best weights, not final weights).
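
That configuration maps to a small amount of state. A minimal tracker with the same defaults (class and method names are illustrative):

```python
class EarlyStopping:
    """Minimal early-stopping tracker in 'min' mode.

    Stops after `patience` epochs without an improvement of at least
    `min_delta`, and remembers the best epoch so the caller can restore
    the best weights, not the final ones.
    """
    def __init__(self, patience=10, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; returns True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```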

Checkpointing: Save every epoch, keep last 5, save best only, monitor validation AUC. Include in metadata: git commit hash, config snapshot, data version, training metrics. Without the git commit, you cannot reproduce a specific model version 6 months later.
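
A sketch of the metadata capture. The `git` invocation assumes training runs inside a repository, and the helper name is illustrative:

```python
import subprocess
import time

def checkpoint_metadata(config, metrics, data_version):
    """Metadata to store beside each checkpoint (illustrative helper)."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL,
        ).strip()
    except Exception:
        commit = "unknown"  # e.g. not running inside a git repository
    return {
        "git_commit": commit,
        "config": config,          # full config snapshot
        "data_version": data_version,
        "metrics": metrics,        # training metrics at save time
        "saved_at": time.time(),
    }
```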

Ablation study: After training, verify that each component of your model actually contributes. If removing a feature group does not hurt accuracy, you do not need it. Simpler models generalize better and are cheaper to serve.

Stage E: Model Serving TDD

Model serving has unique failure modes that standard web service testing does not cover.

Input validation must happen at three levels. Schema validation rejects malformed requests (missing fields, wrong types). Range validation rejects impossible values (age: 500, income: -1000). Distribution checking logs a warning when input features are more than 4 standard deviations from the training distribution — not a rejection, but a signal that the model is being asked to make a prediction outside its training domain.
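
A sketch of the range and distribution levels. The bounds, training statistics, and function name below are invented for illustration:

```python
# Illustrative per-feature bounds and training statistics (assumptions).
RANGES = {"age": (0, 130), "income": (0, 1e7)}
TRAIN_STATS = {"age": (38.0, 12.0), "income": (52000.0, 31000.0)}  # (mean, std)

def validate_request(features):
    """Returns (ok, warnings).

    Range violations reject the request outright; distribution outliers
    (more than 4 standard deviations from the training mean) only warn.
    """
    warnings = []
    for name, value in features.items():
        low, high = RANGES[name]
        if not (low <= value <= high):
            return False, [f"{name}={value} outside [{low}, {high}]"]
        mean, std = TRAIN_STATS[name]
        if abs(value - mean) > 4 * std:
            warnings.append(f"{name}={value} is >4 sigma from training mean")
    return True, warnings
```

Note the asymmetry: impossible values are rejected, but merely unusual values pass through with a logged warning, since the model may still be right about them.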

Latency testing is not optional. P99 latency is the number that matters for user experience. Load the serving patterns file for the complete latency test: 1000 requests, measure each, assert P99 < LATENCY_BUDGET_MS. The throughput test runs EXPECTED_QPS concurrent requests and verifies actual throughput is ≥ 90% of target.
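
The nearest-rank P99 computation at the core of that test fits in a few lines (helper name is illustrative; `call` stands in for one model request):

```python
import time

def p99_latency_ms(call, n_requests=1000):
    """Time each request and return the nearest-rank 99th percentile in ms."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call()
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return latencies[int(0.99 * len(latencies)) - 1]
```

The test then becomes a single assertion: `assert p99_latency_ms(predict) < LATENCY_BUDGET_MS`.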

Fallback behavior is the test most teams skip. What happens when the model service is unavailable? What happens when it times out? These tests must pass before production deployment:

python
from unittest.mock import patch

def test_model_unavailable_fallback():
    # Any call into the model raises ConnectionError for this test.
    with patch('model_server.model', side_effect=ConnectionError()):
        response = model_server.predict(valid_input)
    assert response.status_code == 200
    assert response.json()['fallback_used'] is True
    assert response.json()['prediction'] == DEFAULT_PREDICTION

If you do not write this test, you will find out in production at 2am that your model service going down takes down your product. Write it now.

Output calibration is the most sophisticated test. A model's confidence score of 0.8 should mean it is correct 80% of the time, not 60% or 95%. Uncalibrated models give false confidence to downstream decision-making. The calibration test bins predictions by confidence level and checks actual accuracy in each bin is within 5% of the stated confidence.
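
The binning logic can be sketched as follows; the function name and bin count are illustrative:

```python
import numpy as np

def calibration_gaps(confidences, correct, n_bins=10):
    """For each populated confidence bin, return (mean_confidence, accuracy, gap).

    A well-calibrated model has gap < 0.05 in every populated bin:
    predictions made with ~0.8 confidence should be right ~80% of the time.
    """
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    gaps = []
    for low, high in zip(edges[:-1], edges[1:]):
        mask = (confidences >= low) & (confidences < high)
        if mask.sum() == 0:
            continue  # skip empty bins
        mean_conf = confidences[mask].mean()
        accuracy = correct[mask].mean()
        gaps.append((mean_conf, accuracy, abs(mean_conf - accuracy)))
    return gaps
```

The test asserts that the third element of every tuple stays under 0.05 on a held-out set.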

Stage F: MLOps Deployment

Production ML requires four operational components. Without all four, the model degrades silently.

Model versioning uses semantic versioning (MAJOR.MINOR.PATCH). Each version must record: git commit, data version, training config, train/val/test accuracy, and artifact locations (weights, preprocessor, inference config). Without this, you cannot audit why a model changed behavior or roll back to a specific version.
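
One way to make the record concrete is a dataclass; the field names below mirror the list above but are otherwise illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    """Everything needed to audit or roll back one model version."""
    version: str           # semantic: MAJOR.MINOR.PATCH
    git_commit: str
    data_version: str
    training_config: dict
    metrics: dict          # train/val/test accuracy, etc.
    artifacts: dict = field(default_factory=dict)  # weights, preprocessor, inference config paths
```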

A/B testing is how you safely deploy new models. The configuration covers traffic split (50/50 to start), primary metric (conversion rate), guardrail metrics (latency P99, error rate), stopping criteria (minimum 10k samples, 7 days duration, p=0.05 significance), and rollback triggers (5% metric degradation, 1% error rate increase, 50% latency increase).

Never deploy a new model to 100% of traffic. A/B testing catches problems before they affect all users.

Drift monitoring catches the leading cause of production model degradation. Two types matter.

Data drift is when the statistical distribution of input features changes. The Population Stability Index (PSI) is the standard metric. PSI > 0.1 indicates significant drift and triggers an alert. PSI > 0.25 triggers automatic rollback.
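
PSI compares binned proportions of a baseline sample against a current sample. A minimal implementation, assuming continuous features (the bin count and clipping constant are conventional choices, not the pattern file's):

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a baseline sample (expected,
    e.g. training data) and a current sample (actual, e.g. production)."""
    expected = np.asarray(expected)
    actual = np.asarray(actual)
    # Bin edges from baseline quantiles, widened so outliers still land in a bin.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0] = min(edges[0], actual.min()) - 1e-9
    edges[-1] = max(edges[-1], actual.max()) + 1e-9
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Clip to avoid log(0) for empty bins.
    e_pct = np.clip(e_counts / len(expected), 1e-6, None)
    a_pct = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

On identical distributions PSI stays near zero; a one-standard-deviation mean shift pushes it well past the 0.25 rollback threshold.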

Concept drift is when the relationship between inputs and correct outputs changes (the world changed). CUSUM (Cumulative Sum) detection on prediction accuracy is the standard method.
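
A one-sided CUSUM on per-batch accuracy accumulates shortfalls below an acceptable level and alarms once the cumulative drop crosses a threshold. Class name and parameter defaults below are illustrative:

```python
class CusumDetector:
    """One-sided CUSUM over a stream of per-batch accuracy values.

    Accumulates how far accuracy falls below (target - slack); signals
    drift when the cumulative shortfall exceeds `threshold`.
    """
    def __init__(self, target_accuracy, slack=0.01, threshold=0.05):
        self.target = target_accuracy
        self.slack = slack
        self.threshold = threshold
        self.cusum = 0.0

    def update(self, batch_accuracy):
        """Feed one batch's accuracy; returns True when drift is signalled."""
        shortfall = (self.target - self.slack) - batch_accuracy
        self.cusum = max(0.0, self.cusum + shortfall)  # never goes negative
        return self.cusum > self.threshold
```

The `slack` term absorbs normal batch-to-batch noise so small fluctuations never accumulate; only a sustained drop trips the alarm.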

Rollback configuration is your safety net. Automatic rollback should trigger on: validation accuracy drop > 5% from baseline, P99 latency > 2× budget, error rate > 5%, PSI score > 0.25. Rollback target: previous stable version. Notification: ML team Slack + on-call pager.
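
Wired together, those triggers reduce to a boolean check (function name and dict shape are illustrative):

```python
def should_rollback(metrics, baseline_accuracy, latency_budget_ms):
    """True if any rollback trigger from the configuration fires."""
    return (
        metrics["val_accuracy"] < baseline_accuracy * 0.95   # >5% accuracy drop
        or metrics["p99_latency_ms"] > 2 * latency_budget_ms # latency blowout
        or metrics["error_rate"] > 0.05                      # >5% errors
        or metrics["psi"] > 0.25                             # severe data drift
    )
```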

The automatic rollback configuration is the most important operational safety net in ML. Without it, a degraded model continues serving bad predictions until a human notices.

The Pattern Files

The GRADIENT skill references four pattern files that contain the actual code:

  • patterns/data-pipeline.md — schema validation, KS test, covariate shift AUC, null handling, leakage check
  • patterns/model-training.md — Bayesian hyperparameter config, early stopping, checkpoint config, ablation template
  • patterns/model-serving.md — input validation tests, P99 latency test, throughput test, fallback tests, calibration test
  • patterns/mlops.md — model versioning config, A/B test config, PSI drift monitoring, CUSUM concept drift, rollback config

Load only the pattern file relevant to your current stage. Loading all four costs ~3,000 tokens unnecessarily. Loading one pattern file costs ~700 tokens.

Integrating With the Broader System

The GRADIENT skill integrates with three other skills in your workflow.

forge governs how you write the pipeline tests and serving tests: failing tests first, then implementation. Never the other way around.

SENTINEL gates ML work before calling it done. The confidence scoring questions include: Do all pipeline tests pass? Does the model beat the baseline? Is the fallback tested? Is drift monitoring configured?

tribunal with DOMAIN: ml applies the ML-specific code review checklist: data leakage check, training-serving skew, label contamination, drift monitoring configured, fallback implemented.

Key Takeaway

ML systems fail from data problems, not model problems. Write pipeline tests before training. Establish a baseline before building complex models. Test fallback behavior before production deployment. Configure drift monitoring and automatic rollback as soon as you deploy. Use the stage assessment to load only the relevant patterns — you are at one stage at a time, and loading all patterns wastes context budget. The four pattern files contain everything you need to implement each stage correctly.