ML System Design — Requirements, Constraints & Architecture
Why Design Before You Model
Every senior ML engineer has a war story that begins the same way: a team spent three months training a model, then discovered at deploy time that a required feature was not available at inference, that the latency budget was 50 ms but the model took 800 ms, or that daily retraining was assumed but the data pipeline only ran weekly. These failures have nothing to do with model quality. They are system design failures.
ML system design is a discipline distinct from algorithm selection. It asks: what are we actually trying to build, what are the constraints we must operate within, and what architecture satisfies both? This lesson gives you a repeatable framework for answering those questions.
The ML Canvas — A Requirements Framework
Before choosing a model architecture, you need to fill out what practitioners call the ML canvas. It is a structured set of questions that force you to articulate decisions that are usually left implicit; a sketch of the canvas as code follows the two requirement lists below.
Functional Requirements
The functional requirements define what the system must do:
- Prediction task: What is the model predicting? (binary classification, multi-class, regression, ranking, generation)
- Output contract: What does the consumer of predictions expect? A score, a label, a ranked list, a probability with calibration guarantees?
- Input contract: What data is available at inference time? Be specific — list every field and its source.
- Freshness requirement: How stale can predictions be before they cause harm? Real-time, near-real-time (seconds), batch (hours)?
Non-Functional Requirements
Non-functional requirements are the constraints that determine which architectures are even feasible:
- Latency SLA: p99 latency budget in milliseconds. This single number rules out entire classes of models.
- Throughput: Requests per second at peak. Determines horizontal scaling strategy.
- Availability: 99.9% vs 99.99% has enormous operational implications.
- Cost envelope: GPU inference costs money. Sometimes a smaller, faster model is the right answer even if it is less accurate.
- Retraining frequency: How often must the model be retrained to stay accurate? This determines your pipeline complexity.
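One lightweight way to make the canvas concrete is to encode both lists as a structured object that must be filled in before any model code is written. The sketch below is ours, not a standard library; the field names and example values are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MLCanvas:
    """Forces the design conversation to happen before any model code exists."""
    # Functional requirements
    prediction_task: str          # e.g. "binary classification"
    output_contract: str          # e.g. "calibrated probability in [0, 1]"
    inference_inputs: list[str]   # every field available at request time
    freshness: str                # "real-time" | "seconds" | "hours"
    # Non-functional requirements
    p99_latency_ms: int           # this single number rules out model classes
    peak_throughput_rps: int
    availability_slo: float       # e.g. 0.9995
    monthly_cost_budget_usd: int
    retraining_cadence: str       # e.g. "daily"


# Illustrative values, anticipating the fraud system designed later in this lesson.
fraud_canvas = MLCanvas(
    prediction_task="binary classification",
    output_contract="fraud score in [0, 1]",
    inference_inputs=["amount", "currency", "mcc", "card_bin", "velocity_1h"],
    freshness="real-time",
    p99_latency_ms=150,
    peak_throughput_rps=2_000,
    availability_slo=0.9995,
    monthly_cost_budget_usd=20_000,   # illustrative, not from the worked example
    retraining_cadence="daily",
)
```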
Framing — ML vs Non-ML
Not every prediction problem needs an ML model. Before building a system, ask honestly:
- Can a rule-based system hit 80% of the accuracy at 5% of the cost?
- Is there enough labelled data to train a model that beats heuristics?
- Is the pattern stable enough that a model trained today will still be useful in six months?
If the answer to any of these is unclear, start with heuristics, instrument them, collect labels from production, then replace with an ML model once you have evidence it helps.
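In code, the heuristics-first path can be as small as a rule function whose inputs and outputs are logged, so that production traffic generates the labelled data a future model will need. A minimal sketch; the rules, field names, and thresholds are invented for illustration:

```python
import json
import logging
import time

logger = logging.getLogger("baseline")


def heuristic_fraud_score(txn: dict) -> float:
    """Rule-based stand-in for a model: cheap, explainable, easy to instrument."""
    score = 0.0
    if txn["amount_usd"] > 1_000:
        score += 0.4
    if txn["card_country"] != txn["ip_country"]:
        score += 0.4
    if txn["txns_last_hour"] > 5:
        score += 0.2
    return min(score, 1.0)


def score_and_log(txn: dict) -> float:
    score = heuristic_fraud_score(txn)
    # Log features and score now; join the delayed label later to build the
    # first supervised training set and to benchmark any future model.
    logger.info(json.dumps({"ts": time.time(), "features": txn, "score": score}))
    return score
```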
Online vs Batch Inference
The single most consequential architecture decision is whether predictions are generated online (at request time) or offline (pre-computed and cached).
Batch Inference
In batch inference, you run the model over all inputs on a schedule and store predictions in a lookup table. At serving time you just read from the table.
Batch inference is appropriate when: predictions can be stale (hours), the input space is bounded (you can enumerate all users), and latency at serving time must be sub-millisecond (cache hit).
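A minimal sketch of the batch pattern, assuming a scikit-learn-style `predict_proba` model and using a plain dict where production would use a key-value store such as Redis or DynamoDB:

```python
def run_batch_scoring(model, all_user_ids, feature_table, prediction_store):
    """Scheduled job: pre-compute a score for every enumerable input."""
    for user_id in all_user_ids:
        features = feature_table[user_id]
        prediction_store[user_id] = float(model.predict_proba([features])[0][1])


def serve(prediction_store, user_id, default=0.0):
    """The serving path is a cache read: sub-millisecond, no model in the loop."""
    return prediction_store.get(user_id, default)
```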
Online Inference
Online inference generates predictions synchronously in response to a request. The model runs on each request in real time.
Online inference is required when: the prediction depends on context only available at request time (current cart, live session), decisions must be made in real time, or the input space is too large to pre-compute.
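A minimal online-serving sketch using FastAPI (a common choice, not something this lesson mandates); the toy model and field names stand in for a real loaded artefact:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ScoreRequest(BaseModel):
    user_id: str
    cart_value_usd: float   # request-time context: this is what forces online inference
    session_clicks: int


def toy_model(features: list[float]) -> float:
    """Stand-in for a real model loaded from ONNX/TorchScript at startup."""
    return min(sum(features) / 1_000.0, 1.0)


@app.post("/score")
def score(req: ScoreRequest) -> dict:
    features = [req.cart_value_usd, float(req.session_clicks)]
    return {"user_id": req.user_id, "score": toy_model(features)}
```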
Training Frequency and Data Availability
How often you retrain is determined by two factors: how fast the data distribution drifts, and how expensive retraining is. Together, these two set your retraining cadence.
For a fraud detection model, transaction patterns can shift within hours when a new attack vector emerges. For a content recommendation model, preferences drift over weeks. For a credit risk model, regulatory constraints may prohibit retraining more than quarterly.
Data availability is the other half of this equation. You cannot retrain on data that does not exist yet. If labels are delayed (e.g., chargeback data takes 30 days to arrive), your feedback loop is 30 days regardless of how fast you can run training.
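The arithmetic is worth making explicit: with delayed labels, "daily retraining" means training each day on a window that ends at the label delay, not at today. A small sketch:

```python
from datetime import date, timedelta


def usable_training_window(today: date, history_days: int, label_delay_days: int):
    """Labels arrive `label_delay_days` after the event, so the most recent
    `label_delay_days` of transactions cannot be used for training yet."""
    end = today - timedelta(days=label_delay_days)
    start = end - timedelta(days=history_days)
    return start, end


# 90 days of history with a 30-day chargeback delay: the window ends 30 days ago.
print(usable_training_window(date(2024, 6, 1), history_days=90, label_delay_days=30))
# (datetime.date(2024, 2, 2), datetime.date(2024, 5, 2))
```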
The Three Types of ML System Failures
Production ML systems fail in three distinct ways, and each requires a different mitigation strategy.
Data Failures
The model receives malformed, missing, or out-of-distribution input. This is the most common failure mode and the hardest to debug because the model still produces a prediction — it just silently produces the wrong one.
Mitigations: input validation with strict schemas, feature monitoring with distribution alerts, upstream data quality SLAs.
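As a sketch of the first mitigation, a strict schema at the model boundary (here with pydantic v2; the fields and bounds are illustrative) rejects malformed input instead of letting the model silently score garbage:

```python
from pydantic import BaseModel, Field, ValidationError


class TransactionFeatures(BaseModel):
    """Strict input contract: reject, don't guess."""
    amount_usd: float = Field(gt=0, lt=1_000_000)
    merchant_category_code: str = Field(pattern=r"^\d{4}$")
    card_bin: str = Field(pattern=r"^\d{6}$")
    txns_last_24h: int = Field(ge=0)


def validate_or_none(raw: dict):
    try:
        return TransactionFeatures(**raw)
    except ValidationError:
        # Malformed input: route to a rule-based fallback and raise an alert
        # rather than feeding the model out-of-contract data.
        return None
```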
Model Failures
The model's learned patterns no longer match reality. This happens due to distribution shift (the world changed), label shift (the definition of the target changed), or feedback loops (the model's own decisions altered the distribution it was trained on).
Mitigations: continuous monitoring of prediction distributions, held-out evaluation sets refreshed regularly, automated retraining pipelines.
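Distribution monitoring often uses the population stability index (PSI), which also appears in the failure table of the worked example below. A sketch for a continuous feature, comparing a training-time sample against a live sample:

```python
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between training-time ("expected") and live ("actual") samples.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 shifted."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # cover out-of-range live values
    e_frac = np.histogram(expected, edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```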
System Failures
Infrastructure components fail: the model server crashes, the feature store has elevated latency, the database connection pool exhausts. These are standard software reliability problems, but ML systems are often less battle-hardened than traditional services.
Mitigations: health checks, circuit breakers, fallback strategies (rule-based backup, cached predictions), runbooks.
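The circuit-breaker pattern mentioned above, sketched in a few lines (thresholds illustrative): after repeated model-server failures, traffic routes to the fallback until a cool-down expires.

```python
import time


class CircuitBreaker:
    """Trip to a fallback scorer after repeated model-server failures."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, model_score, fallback_score, features):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback_score(features)      # open: skip the model entirely
            self.opened_at, self.failures = None, 0  # half-open: try the model again
        try:
            score = model_score(features)
            self.failures = 0
            return score
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback_score(features)
```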
Architecture Patterns
Two-Tower Architecture
Used widely in recommendation and retrieval systems. Two separate neural networks — one for the query (user), one for the document (item) — produce embeddings that can be compared with dot product or cosine similarity.
The key advantage is that item embeddings can be pre-computed offline and indexed for approximate nearest-neighbour (ANN) search with Faiss or ScaNN. At serving time you only run the user tower (fast) and do an ANN lookup, as in the sketch below.
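Here, random vectors stand in for trained towers, and exact dot-product search stands in for a real ANN index:

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline: run the item tower once over the catalogue and index the result.
item_embeddings = rng.normal(size=(100_000, 64)).astype(np.float32)
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)


def retrieve(user_embedding: np.ndarray, k: int = 100) -> np.ndarray:
    """Online: one user-tower forward pass, then a similarity lookup."""
    scores = item_embeddings @ user_embedding   # exact search; ANN replaces this at scale
    return np.argpartition(-scores, k)[:k]      # indices of the top-k items


user_embedding = rng.normal(size=64).astype(np.float32)
top_items = retrieve(user_embedding / np.linalg.norm(user_embedding))
```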
Cascaded (Funnel) Architecture
Used when the full candidate space is too large to score with an expensive model. A cheap retrieval stage narrows millions of candidates to hundreds, which a more expensive ranking model re-scores.
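The funnel reduces to a few lines once the two scorers exist; the function below is a shape sketch, with `cheap_score` and `expensive_score` as placeholders for the retrieval and ranking models:

```python
def cascade_rank(user, candidates, cheap_score, expensive_score,
                 retrieve_k: int = 500, return_k: int = 20):
    """Funnel: a cheap model prunes millions of candidates, an expensive
    model re-scores only the survivors."""
    shortlist = sorted(candidates, key=lambda c: cheap_score(user, c),
                       reverse=True)[:retrieve_k]
    return sorted(shortlist, key=lambda c: expensive_score(user, c),
                  reverse=True)[:return_k]
```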
Multi-Task Architecture
A single model trained on multiple related objectives simultaneously. Useful when tasks share representation (e.g., click prediction and purchase prediction for the same items).
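A minimal PyTorch sketch of the shared-trunk pattern for the click/purchase example; layer sizes are arbitrary:

```python
import torch
import torch.nn as nn


class MultiTaskModel(nn.Module):
    """Shared trunk, one head per objective."""

    def __init__(self, n_features: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.click_head = nn.Linear(hidden, 1)
        self.purchase_head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor):
        shared = self.trunk(x)   # the representation both tasks learn through
        return self.click_head(shared), self.purchase_head(shared)


# Joint loss: each task regularises the shared representation for the other.
model = MultiTaskModel(n_features=32)
x, y_click, y_buy = torch.randn(8, 32), torch.rand(8, 1), torch.rand(8, 1)
click_logit, buy_logit = model(x)
loss = (nn.functional.binary_cross_entropy_with_logits(click_logit, y_click)
        + nn.functional.binary_cross_entropy_with_logits(buy_logit, y_buy))
```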
Scalability From Day One
Scalability is not something you add later. Certain decisions made early are expensive to reverse:
- Stateless servers: If your model server accumulates state (user session, running averages), horizontal scaling becomes hard. Design for statelessness from the start.
- Feature computation location: Computing features inside the model server couples computation to serving. Extract feature computation to a feature store.
- Model size: A 10 GB model cannot be cheaply loaded onto every replica of a large fleet; memory cost scales with replica count. Plan for model size relative to fleet size.
- Serialisation format: Pickle ties your artefact to specific Python and library versions. Commit to ONNX or TorchScript early (see the export sketch below).
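Exporting to either format is a few lines in PyTorch; a sketch with a toy model (paths and layer sizes illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1)).eval()
example_input = torch.randn(1, 32)

# TorchScript: a self-contained artefact, decoupled from the training code.
torch.jit.trace(model, example_input).save("model.pt")

# ONNX: framework-neutral; servable from onnxruntime, Triton, and similar runtimes.
torch.onnx.export(model, example_input, "model.onnx",
                  input_names=["features"], output_names=["score"])
```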
Worked Design — Real-Time Fraud Detection
Requirements
Functional: Score each payment transaction for fraud risk within 200 ms of the authorisation request. The decision to block or allow is made by the rule engine; the model provides a score between 0 and 1.
Non-functional:
- p99 latency: 150 ms (leaving 50 ms for the rule engine)
- Throughput: 2,000 TPS peak
- Availability: 99.95%
- Retraining: daily on prior 90 days of transactions
Framing
Binary classification. Label = chargeback filed within 60 days. Class imbalance is severe (~0.3% fraud rate). Will require careful threshold calibration and cost-sensitive evaluation.
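Cost-sensitive evaluation can be made concrete with a threshold sweep over a held-out set; the false-negative and false-positive costs below are invented for illustration:

```python
import numpy as np


def expected_cost(threshold: float, scores: np.ndarray, labels: np.ndarray,
                  cost_fn: float = 100.0, cost_fp: float = 5.0) -> float:
    """Choose the operating threshold by business cost, not accuracy:
    at a 0.3% fraud rate, 'always predict legitimate' is 99.7% accurate."""
    preds = scores >= threshold
    false_negatives = np.sum(~preds & (labels == 1))
    false_positives = np.sum(preds & (labels == 0))
    return cost_fn * false_negatives + cost_fp * false_positives


# Sweep on a held-out set to find the cost minimiser:
# best_t = min(np.linspace(0.01, 0.99, 99),
#              key=lambda t: expected_cost(t, holdout_scores, holdout_labels))
```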
Data Availability at Inference
Available within 200 ms of the transaction:
- Transaction amount, currency, merchant category code
- Card BIN (first 6 digits), card country, issuing bank
- User velocity features (transactions in last 1h, 24h, 7d) — must come from online feature store
- Device fingerprint, IP geolocation
Not available at inference:
- Chargeback label (arrives 30–60 days later)
- Merchant settlement data
Architecture Decision — Online Inference
The 200 ms budget and context-dependence (velocity features are live) mandate online inference. Batch inference is not viable.
High-Level Architecture
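One serving topology consistent with the requirements above; the components and layout are a sketch, not the only valid arrangement:

```
Authorisation request ──► Rule engine ──► block / allow (within 200 ms)
                             │   ▲
                     request │   │ fraud score in [0, 1], p99 ≤ 150 ms
                             ▼   │
                         Model server ── on failure ──► rule-based fallback score
                             │
                             ▼
              Online feature store (1h / 24h / 7d velocity features)
```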
Training Pipeline (Daily)
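A daily pipeline consistent with the stated requirements; the step breakdown is ours:

- Pull the prior 90 days of transactions, excluding the label-delay window (chargeback labels for the most recent 30–60 days have not arrived yet)
- Join chargeback labels and recompute features from the offline feature store, using the same feature definitions as the online store
- Train, then evaluate against the current production model on a refreshed held-out set using cost-sensitive metrics
- Export to ONNX or TorchScript, register the artefact, canary on a slice of traffic, and promote on pass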
Failure Modes and Mitigations
| Failure | Detection | Mitigation |
|---------|-----------|------------|
| Feature store latency spike | p99 latency alert | Fallback to cached features or rule-based score |
| Model server crash | Health check failure | Auto-restart + circuit breaker to rule engine |
| Data drift (new merchant category) | PSI alert on feature distribution | Alert + trigger off-schedule retraining |
| Label delay (chargeback data late) | Pipeline SLA alert | Hold previous model, flag for manual review |
Key Takeaways
- Fill out the ML canvas before writing model code: task framing, input/output contracts, latency SLA, throughput, retraining frequency.
- The online vs batch inference decision is the most consequential architecture choice. It is driven by freshness requirements and input space size.
- ML systems fail in three distinct categories: data failures, model failures, and system failures. Design mitigations for each.
- Two-tower, cascaded, and multi-task architectures solve different scale and accuracy tradeoffs — choose based on your latency budget and candidate space size.
- Scalability constraints (stateless servers, feature store separation, model serialisation format) must be decided at design time, not retrofitted.
- A fraud detection system at 2,000 TPS with a 150 ms p99 budget requires online inference, a low-latency feature store, and a model serialised to ONNX or TorchScript — not a scikit-learn pickle loaded from disk.