Data Pipelines for ML
A model is only as good as its training data. Silent data-quality failures can poison a model without raising a single exception: a column that was never null suddenly contains NaNs, a categorical field shows up with an unseen label, a timestamp lands in the wrong timezone. This lesson builds a pipeline that fails loudly and keeps every dataset version reproducible.
Pipeline Stages
Every ML data pipeline passes through the same logical stages:
| Stage | Responsibility | Failure mode |
|---|---|---|
| Ingestion | Pull raw data from sources | Source unavailable, partial read |
| Validation | Assert schema and value ranges | Silent corruption passes through |
| Transformation | Compute features, encode categoricals | Leakage from future data |
| Versioning | Snapshot the processed dataset | Irreproducible experiments |
| Loading | Write to feature store or disk | Partial write, wrong partition |
Ingestion with Error Boundaries
Use context managers and explicit error handling so partial reads never reach the transformation step:
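A minimal sketch with pandas; the source path and the minimum-row threshold are hypothetical placeholders for your own pipeline:

```python
import pandas as pd

RAW_PATH = "data/raw/events.csv"   # hypothetical source location
MIN_ROWS = 100_000                 # hypothetical sanity threshold for a full extract

class IngestionError(RuntimeError):
    """Raised when a read fails or looks partial, so bad data stops here."""

def ingest(path: str = RAW_PATH) -> pd.DataFrame:
    try:
        # The context manager guarantees the handle is closed even if parsing fails.
        with open(path, "rb") as f:
            df = pd.read_csv(f)
    except (OSError, pd.errors.ParserError) as exc:
        raise IngestionError(f"failed to read {path}") from exc
    if len(df) < MIN_ROWS:
        # A truncated extract often parses cleanly; a size check is a cheap partial-read guard.
        raise IngestionError(f"suspected partial read: {len(df)} rows < {MIN_ROWS}")
    return df
```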
Schema Validation with Great Expectations
Great Expectations lets you encode your assumptions as executable tests:
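A sketch using the classic pandas-dataset API (Great Expectations 0.x; newer releases organize this into suites and checkpoints instead), with hypothetical column names and value sets:

```python
import great_expectations as ge
import pandas as pd

def validate(raw_df: pd.DataFrame) -> pd.DataFrame:
    df = ge.from_pandas(raw_df)
    # Each expectation is an executable assertion about the data.
    df.expect_column_values_to_not_be_null("user_id")
    df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
    df.expect_column_values_to_be_in_set("country", ["US", "GB", "DE"])
    result = df.validate()
    if not result.success:
        # Fail loudly: surface every broken expectation instead of passing bad rows on.
        raise ValueError(f"schema validation failed:\n{result}")
    return raw_df
```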
If validation fails, the pipeline raises an exception — no silently corrupted datasets downstream.
Dataset Versioning with DVC
DVC (Data Version Control) tracks large binary files in Git-compatible fashion:
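A typical command-line workflow; the dataset path is illustrative:

```bash
dvc init                                  # one-time setup inside the Git repo
dvc add data/processed/train.parquet      # moves data into DVC's cache, writes a pointer file
git add data/processed/train.parquet.dvc data/processed/.gitignore
git commit -m "Track processed training set with DVC"
dvc push                                  # upload the actual bytes to the configured remote
```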
Each DVC-tracked file generates a .dvc pointer file that you commit to Git. Your experiment code and its exact dataset are coupled through the Git commit SHA.
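To recover the exact bytes behind a past experiment, DVC's Python API can read a tracked path at any Git revision; the commit SHA below is a placeholder:

```python
import io
import dvc.api
import pandas as pd

# rev accepts a commit SHA, tag, or branch name; "a1b2c3d" is a placeholder.
raw = dvc.api.read("data/processed/train.parquet", rev="a1b2c3d", mode="rb")
train = pd.read_parquet(io.BytesIO(raw))
```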
Feature Transformation Without Leakage
Fit scalers only on training data, then apply the same fitted object to validation and test:
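For example, with scikit-learn's StandardScaler (the random arrays are stand-ins for real feature matrices):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(800, 5))   # placeholder training features
X_val = rng.normal(size=(100, 5))
X_test = rng.normal(size=(100, 5))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_val_scaled = scaler.transform(X_val)          # same fitted statistics, no refitting
X_test_scaled = scaler.transform(X_test)        # refitting here would leak the test distribution
```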
Save the fitted scaler alongside the model artifact; at inference time you must apply the identical transformation.
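Continuing the sketch above, one way to persist the fitted object is joblib, which scikit-learn recommends for estimator serialization; the artifact path is illustrative:

```python
import joblib
import numpy as np

# Assumes the "artifacts/" directory exists; ship this file next to the model weights.
joblib.dump(scaler, "artifacts/scaler.joblib")

# At serving time, load and apply the identical transformation:
scaler = joblib.load("artifacts/scaler.joblib")
X_request = np.zeros((1, 5))          # placeholder for an incoming feature row
scaled = scaler.transform(X_request)
```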
Summary
- Structure every pipeline into five stages: ingestion, validation, transformation, versioning, and loading.
- Use Great Expectations to turn data assumptions into automated, executable checks that fail loudly.
- Version datasets with DVC so every experiment can be traced back to the exact data it was trained on.
- Fit preprocessing transformers only on training data and persist the fitted object with the model to prevent train-serve skew.
- Treat partial writes as failures: use atomic rename patterns or transactional writes so corrupt partitions never reach downstream consumers (see the sketch below).
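A minimal sketch of the atomic-rename pattern on a POSIX filesystem; the function name and destination path are illustrative:

```python
import os
import tempfile
import pandas as pd

def atomic_write_parquet(df: pd.DataFrame, dest: str) -> None:
    """Write to a temp file in dest's directory, then rename into place.

    os.replace is atomic on POSIX filesystems (same-filesystem rename),
    so readers see either the old file or the complete new one, never a
    partial write.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest) or ".", suffix=".parquet.tmp")
    os.close(fd)  # pandas reopens the path itself
    try:
        df.to_parquet(tmp)
        os.replace(tmp, dest)
    except BaseException:
        os.unlink(tmp)  # never leave a half-written temp file behind
        raise
```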