Raw data is almost always messy. Fields are missing, values are stored as the wrong type, strings are inconsistently capitalised, and dates come in six different formats. Cleaning is not glamorous, but it is where most of the analytical risk lives: a miscounted null or a silently failed type cast will corrupt every downstream result.
Identifying and Handling Null Values
Start by quantifying the null landscape:
```python
import pandas as pd
import numpy as np

df = pd.read_csv("customers.csv")

# Count nulls per column
null_counts = df.isnull().sum()
null_pct = (df.isnull().mean() * 100).round(2)

null_report = pd.DataFrame({
    "missing_count": null_counts,
    "missing_pct": null_pct,
}).sort_values("missing_pct", ascending=False)

print(null_report[null_report["missing_count"] > 0])
```
Once you understand where nulls live, choose a strategy per column:
```python
# Drop rows missing a critical field
df = df.dropna(subset=["customer_id", "purchase_date"])

# Drop columns where more than 60% of values are missing
threshold = 0.6
df = df.dropna(axis=1, thresh=int(len(df) * (1 - threshold)))

# Fill with a constant
df["country"] = df["country"].fillna("Unknown")

# Fill numeric columns with the column median (robust to outliers)
df["age"] = df["age"].fillna(df["age"].median())

# Forward-fill ordered time series data
df["price"] = df["price"].ffill()

# Fill with group-specific means (more contextually accurate)
df["salary"] = df.groupby("department")["salary"].transform(
    lambda x: x.fillna(x.mean())
)
```
| Strategy | When to use |
|---|---|
| Drop row | Missing value is mandatory; row is unusable without it |
| Drop column | >50–70% of rows are missing; column adds no signal |
| Fill with median/mode | Numeric column; distribution is skewed |
| Fill with mean | Numeric column; distribution is approximately normal |
| Forward/back fill | Ordered time series; values persist until changed |
| Group-conditional fill | Meaningful subgroups exist (e.g., fill salary by job title) |
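To see why the table recommends the median for skewed numeric columns, here is a small self-contained sketch (the toy salary values are invented for illustration):

```python
import pandas as pd

# Toy, right-skewed salary column with one missing value and one outlier
s = pd.Series([30_000, 32_000, 35_000, 38_000, 500_000, None])

mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

# The single outlier drags the mean far above a typical salary,
# while the median stays representative of the bulk of the data.
print(f"mean fill:   {mean_filled.iloc[-1]:,.0f}")    # 127,000
print(f"median fill: {median_filled.iloc[-1]:,.0f}")  # 35,000
```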
Removing Duplicates
Duplicate rows silently inflate counts and skew aggregations:
```python
# Check total duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")

# Check duplicates on specific columns (logical duplicates)
print(f"Duplicate orders: {df.duplicated(subset=['order_id']).sum()}")

# Inspect duplicate records before removing
dups = df[df.duplicated(subset=["order_id"], keep=False)]
print(dups.sort_values("order_id").head(10))

# Drop duplicates, keeping the first occurrence
df = df.drop_duplicates(subset=["order_id"], keep="first")
```
Dtype Casting
Pandas infers dtypes on read, but it often gets them wrong. Storing a zip code as float64 strips its leading zeros, and a Boolean flag stored as object wastes memory and breaks downstream logic:
```python
print(df.dtypes)

# Cast to the correct types
df["age"] = df["age"].astype(int)
df["is_premium"] = df["is_premium"].astype(bool)
df["zip_code"] = df["zip_code"].astype(str).str.zfill(5)  # keep leading zeros
df["product_id"] = df["product_id"].astype("category")    # low-cardinality strings

# Safe numeric casting — coerce errors to NaN instead of raising
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Downcast to save memory
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")
```
String Normalisation
Free-text fields are a cleaning minefield — inconsistent capitalisation, whitespace, and special characters create false duplicates:
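A minimal normalisation pass might look like the following sketch on a toy frame (the column name and city variants are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"city": ["  New York", "new york ", "NEW  YORK", "Boston"]})

df["city"] = (
    df["city"]
    .str.strip()                           # remove leading/trailing whitespace
    .str.lower()                           # normalise case
    .str.replace(r"\s+", " ", regex=True)  # collapse internal runs of whitespace
)

# The three "New York" variants now collapse to a single value
print(df["city"].nunique())  # 2
```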
Parsing Dates
Dates stored as strings are a constant source of errors. Parse them explicitly:
```python
# Simple ISO format — Pandas handles automatically
df["created_at"] = pd.to_datetime(df["created_at"])

# Explicit format — much faster for large datasets (no format sniffing)
df["purchase_date"] = pd.to_datetime(df["purchase_date"], format="%d/%m/%Y")

# Mixed or ambiguous formats — use errors='coerce' to surface bad values
df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
bad_dates = df[df["event_time"].isna() & df["event_time_raw"].notna()]

# Extract date components for feature engineering
df["year"] = df["created_at"].dt.year
df["month"] = df["created_at"].dt.month
df["dow"] = df["created_at"].dt.dayofweek  # 0=Monday
df["is_weekend"] = df["dow"].isin([5, 6])
```
A Cleaning Checklist
Run through this mental checklist on every new dataset:
Null audit — count, locate, decide strategy per column.
Duplicate audit — exact and logical duplicates on key columns.
Dtype audit — every column should have its correct Pandas dtype.
String audit — normalise case and whitespace before matching or deduplicating.
Date audit — parse dates explicitly and check for values coerced to NaT.
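The checklist above can be sketched as a single reusable audit helper — a hedged starting point rather than a full framework (the function name and toy data are illustrative):

```python
import pandas as pd

def audit(df: pd.DataFrame, key_cols: list[str]) -> pd.DataFrame:
    """Summarise nulls, duplicates, and dtypes for a first look at a dataset."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": (df.isnull().mean() * 100).round(2),
    })
    print(f"Exact duplicate rows: {df.duplicated().sum()}")
    print(f"Duplicates on {key_cols}: {df.duplicated(subset=key_cols).sum()}")
    return report.sort_values("missing_pct", ascending=False)

# Usage on a toy frame with one logical duplicate and one missing value
toy = pd.DataFrame({"order_id": [1, 1, 2], "amount": [9.99, 9.99, None]})
print(audit(toy, key_cols=["order_id"]))
```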