# Designing ML Systems
Good ML engineering starts long before the first line of model code. A system design session forces you to surface hidden assumptions — about data freshness, labeling cost, acceptable latency, and failure modes — before those assumptions become expensive bugs. This lesson walks through a repeatable process for converting a product requirement into a deployable architecture.
## Requirements Gathering
Start with four axes:
| Axis | Question to answer | Example answer |
|---|---|---|
| Prediction type | Classification, regression, generation, ranking? | Binary classification |
| Latency budget | Real-time (<100 ms), near-real-time (<2 s), batch? | Near-real-time |
| Throughput | Requests per second at peak | 500 RPS |
| Data freshness | How stale can features be? | 15 minutes |
Write down answers before opening a code editor. They drive every downstream decision.
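One lightweight way to make the answers durable is to capture them in code so they live in the repo alongside the system they constrain. A minimal sketch (the class name, field names, and values are illustrative, taken from the example column above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MLRequirements:
    """The four axes, pinned down before any model code is written."""
    prediction_type: str            # e.g. "binary_classification"
    latency_budget_ms: int          # end-to-end budget per request
    peak_rps: int                   # peak requests per second
    max_feature_staleness_min: int  # how stale features may be

# Example answers from the table above
reqs = MLRequirements(
    prediction_type="binary_classification",
    latency_budget_ms=2000,         # near-real-time (<2 s)
    peak_rps=500,
    max_feature_staleness_min=15,
)
print(reqs)
```

A frozen dataclass makes the requirements immutable, so downstream code can treat them as a contract rather than a mutable config.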
## Anatomy of an ML System
A production system has five distinct planes:
- Data plane — raw sources, ingestion, feature store
- Training plane — pipeline, compute, experiment registry
- Serving plane — model server, API gateway, caching
- Monitoring plane — metrics, drift detectors, alerting
- Control plane — CI/CD, retraining triggers, rollback logic
Sketch these as boxes and arrows before choosing frameworks. The sketch also becomes documentation.
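As a concrete starting point, a first boxes-and-arrows sketch of the five planes might look like:

```
        ┌──────── control plane (CI/CD, retraining triggers, rollback) ────────┐
        ▼                        ▼                         ▼
  [data plane] ──features──► [training plane] ──model──► [serving plane] ──► users
   raw sources,               pipeline,                   model server,
   ingestion,                 compute,                    API gateway,
   feature store              experiment registry         caching
                                                               │
                                                     [monitoring plane]
                                              metrics, drift detectors, alerting
```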
## Data Flow Planning
Trace every feature from its raw source to the model input tensor, noting each hop along the way: ingestion job, transformations, feature store write, and serving-time read.
Documenting latency SLAs per feature reveals mismatches early: if your model needs a 15-minute-fresh feature but the pipeline runs hourly, you have a design gap to resolve before training.
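One way to catch such gaps mechanically (all names and numbers here are illustrative) is to record each feature's lineage and freshness SLA in code, then compare it against the worst-case staleness of the pipeline that produces it:

```python
# Each entry records a feature's lineage and required freshness (minutes).
# Feature names, sources, and SLAs are illustrative.
features = {
    "user_purchase_count_7d": {
        "source": "orders_db",
        "pipeline": "hourly_batch",
        "required_freshness_min": 15,
    },
    "session_click_rate": {
        "source": "clickstream",
        "pipeline": "streaming",
        "required_freshness_min": 1,
    },
}

# Worst-case staleness each pipeline type can deliver, in minutes.
pipeline_staleness_min = {"hourly_batch": 60, "streaming": 1}

def freshness_gaps(features, pipeline_staleness_min):
    """Return features whose pipeline cannot meet the required freshness."""
    return [
        name
        for name, spec in features.items()
        if pipeline_staleness_min[spec["pipeline"]] > spec["required_freshness_min"]
    ]

print(freshness_gaps(features, pipeline_staleness_min))
```

Here the check flags `user_purchase_count_7d`: it needs 15-minute freshness, but its hourly pipeline can be up to 60 minutes stale — exactly the design gap described above.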
## Choosing the Right Architecture Pattern
Three patterns cover most use cases:

- Pattern A — request-time inference: compute each prediction synchronously when the request arrives.
- Pattern B — pre-computed inference: score entities offline in batch and serve results from a lookup store.
- Pattern C — streaming inference: update features and predictions continuously from an event stream.

Pattern A is simplest but fails above ~1000 RPS without caching. Pattern B trades freshness for cost. Pattern C is most complex but enables sub-second freshness at scale.
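These tradeoffs can be condensed into a rough decision rule. The thresholds below are the illustrative ones from this section, not universal constants, and the function is a sketch rather than a complete selection procedure:

```python
def choose_pattern(peak_rps: int, max_staleness_min: float,
                   has_cache: bool = False) -> str:
    """Rough heuristic mapping requirements to an inference pattern."""
    if max_staleness_min < 1 / 60:
        # Sub-second freshness at scale calls for streaming (Pattern C).
        return "C: streaming inference"
    if peak_rps > 1000 and not has_cache:
        # Request-time inference struggles past ~1000 RPS without caching.
        return "B: pre-computed inference"
    return "A: request-time inference"

# Example answers from the requirements table: 500 RPS, 15-minute freshness.
print(choose_pattern(peak_rps=500, max_staleness_min=15))
# → "A: request-time inference"
```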
## Capacity Estimation
Back-of-envelope math catches sizing mistakes before deployment. A useful starting point: replicas needed = (peak RPS × inference time per request) / target utilisation.
Add 30% headroom for traffic spikes and rolling restarts.
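Worked through with the example numbers from the requirements table (500 RPS) and an assumed 40 ms inference time — the inference time is illustrative, not from the table:

```python
import math

def replicas_needed(peak_rps: float, inference_time_s: float,
                    target_utilisation: float = 0.7,
                    headroom: float = 0.30) -> int:
    """Replicas = (RPS x inference time) / target utilisation, plus headroom."""
    concurrent = peak_rps * inference_time_s   # requests in flight at peak
    base = concurrent / target_utilisation     # replicas at target load
    return math.ceil(base * (1 + headroom))    # spikes + rolling restarts

# 500 RPS x 0.04 s = 20 concurrent; / 0.7 ≈ 28.6; x 1.3 ≈ 37.2 → 38 replicas
print(replicas_needed(peak_rps=500, inference_time_s=0.04))
# → 38
```

Each replica is assumed to handle one request at a time; adjust the concurrency term if your model server batches or multiplexes requests.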
## Summary
- Gather requirements on four axes before writing code: prediction type, latency, throughput, and data freshness.
- Split every ML system into five planes: data, training, serving, monitoring, and control.
- Document feature lineage and per-feature latency SLAs to catch pipeline gaps early.
- Choose among request-time, pre-computed, or streaming inference based on freshness vs. cost tradeoffs.
- Do capacity math upfront — replicas needed = (RPS × inference time) / target utilisation.