A complete playbook for taking a trained model from your laptop to a production API — serialisation, FastAPI, Docker, monitoring, and CI/CD.
Prerequisites

- Python
- Basic ML knowledge
- Docker basics
Deploying ML Models to Production
Training a model is 20% of the work. Getting it to serve reliable, low-latency predictions in production — and keeping it working six months later — is the other 80%. This guide is a complete playbook for that second 80%.
Model Serialisation: pickle vs joblib vs ONNX
After training, your first decision is how to save the model. Three options are common:

- pickle: Python's built-in serialisation. Convenient, but Python-only and sensitive to library version changes.
- joblib: a drop-in alternative to pickle that stores large numpy arrays more efficiently.
- ONNX: a cross-language, runtime-agnostic format. With ONNX Runtime, it typically runs 2-5x faster than native PyTorch/scikit-learn at inference time.

pickle and joblib are fine for local experimentation. In production, they are security liabilities: a malicious pickle file executes arbitrary code on deserialisation.
Note: Always call model.eval() before exporting PyTorch models. Dropout and BatchNorm behave differently in train vs eval mode — forgetting this is a common source of prediction inconsistencies between training and serving.
FastAPI Inference Endpoint
FastAPI is the standard for Python ML inference APIs. It is fast, async-native, and generates OpenAPI docs automatically.
Never bake secrets into Docker images or docker-compose files. Use environment variables injected at runtime:
```bash
# Local dev: .env file (gitignored)
API_KEY=secret123
MODEL_PATH=/app/model.onnx
MAX_BATCH_SIZE=32
```
In production (Kubernetes):
Use Kubernetes Secrets mounted as environment variables
Or use AWS Secrets Manager / GCP Secret Manager with a secrets sidecar
In your FastAPI app, load configuration with Pydantic Settings:
```python
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    model_path: str = "model.onnx"
    max_batch_size: int = 32
    log_level: str = "INFO"
    api_key: str  # Required — will raise if missing

    class Config:
        env_file = ".env"


settings = Settings()
```
Basic Drift Detection
Model drift occurs when the statistical distribution of incoming data shifts away from the training distribution. Undetected drift causes silent accuracy degradation.
A simple approach: log every prediction and periodically compare distributions.
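A stdlib-only sketch of that append-only prediction log (the file path and field names are illustrative):

```python
import json
import time


def log_prediction(features: dict, prediction: float, path: str = "predictions.jsonl") -> None:
    """Append one prediction record as a single JSON line."""
    record = {
        "ts": time.time(),         # When the prediction was served
        "features": features,      # Model inputs, kept for later drift comparison
        "prediction": prediction,  # Model output
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

JSONL keeps every record on its own line, so the drift job can stream the file without loading it all into memory.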
Run a weekly drift check with a script that computes the Population Stability Index (PSI) between the training feature distribution and the last 7 days of logged features. If PSI > 0.2 for any feature, trigger a retraining alert.
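The PSI computation itself fits in a few lines of numpy. This is a sketch using the common 10-bin quantile convention; the function name and defaults are my own choices, not a fixed API:

```python
import numpy as np


def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a training sample and recent data."""
    # Bin edges come from the training (expected) distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # Catch values outside the training range

    expected_pct = np.histogram(expected, edges)[0] / len(expected)
    actual_pct = np.histogram(actual, edges)[0] / len(actual)

    # Small floor avoids division by zero / log of zero for empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))
```

Run this per feature against the logged predictions; any feature whose PSI exceeds 0.2 triggers the retraining alert.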
Pre-Deployment Checklist

Before sending traffic to a new model version, verify:
- [ ] ONNX model validated: predictions match sklearn/PyTorch outputs within 1e-5
- [ ] /health endpoint returns 200 within 1s under load
- [ ] /ready endpoint returns 503 before model loads and 200 after
- [ ] Input validation tested with edge cases (missing fields, NaN, Inf, wrong shape)
- [ ] Structured JSON logs emitting to stdout with correct fields
- [ ] Docker image runs as non-root user
- [ ] Secrets loaded from environment, not hardcoded
- [ ] CI pipeline passes: all tests green, image builds and pushes successfully
- [ ] Drift logging active: every prediction written to append-only log
- [ ] Rollback plan documented: previous image tag recorded, rollback command ready
- [ ] Load test: p99 latency under 100ms at expected peak QPS
Summary
Use ONNX for production model serialisation — it is cross-language, runtime-agnostic, and cannot execute arbitrary code like pickle can.
Build your inference API with FastAPI using Pydantic request/response models for automatic validation, and run ONNX Runtime in a thread pool executor to keep the async event loop unblocked.
Write a multi-stage Dockerfile that separates build and runtime stages; run the final image as a non-root user; the result is a lean, secure image under 200 MB.
Use structlog with JSONRenderer in production so every prediction and error is machine-readable and can be ingested by any observability platform.
Log every prediction to a JSONL file and run periodic drift detection (PSI score) to catch distribution shift before it silently degrades accuracy in production.