GadaaLabs
Data Analysis with Python
Lesson 5

Aggregation & GroupBy

GroupBy operations follow a pattern called split-apply-combine: split the DataFrame into groups based on one or more key columns, apply an aggregation function to each group independently, then combine the results into a new DataFrame. This pattern covers the vast majority of analytical questions like "what is the average order value per customer segment?" or "which product categories grew the most last quarter?"
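The three steps can be spelled out by hand before collapsing them into one call; a minimal sketch on made-up sample data (standing in for the lesson's orders table):

```python
import pandas as pd

# Made-up sample data standing in for the lesson's orders table
df = pd.DataFrame({
    "region":  ["North", "South", "North", "South", "North"],
    "revenue": [100.0, 200.0, 150.0, 50.0, 250.0],
})

# Split: one sub-DataFrame per region
groups = {region: sub for region, sub in df.groupby("region")}

# Apply: aggregate each group independently
applied = {region: sub["revenue"].sum() for region, sub in groups.items()}

# Combine: assemble the per-group results into a single Series
combined = pd.Series(applied, name="revenue")

# groupby performs all three steps in one call
assert combined.to_dict() == df.groupby("region")["revenue"].sum().to_dict()
```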

Basic .groupby()

python
import pandas as pd
import numpy as np

df = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Single key, single aggregation
revenue_by_region = df.groupby("region")["revenue"].sum()
print(revenue_by_region)

# Multiple keys
yearly_region = df.groupby(["year", "region"])["revenue"].sum()
print(yearly_region)

# Retain the grouping columns as regular columns (not index)
result = df.groupby("region", as_index=False)["revenue"].sum()

By default, groupby puts the key column(s) in the index. Pass as_index=False to get a flat DataFrame back — often more convenient for subsequent merges or plotting.
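A quick way to see the difference, using a small hypothetical frame in place of orders.csv:

```python
import pandas as pd

# Hypothetical sample data in place of orders.csv
df = pd.DataFrame({
    "region":  ["North", "South", "North"],
    "revenue": [100, 50, 200],
})

# Default: the grouping key becomes the index of the result
by_index = df.groupby("region")["revenue"].sum()
print(by_index.index.tolist())   # ['North', 'South']

# as_index=False: the key stays a regular column
flat = df.groupby("region", as_index=False)["revenue"].sum()
print(flat.columns.tolist())     # ['region', 'revenue']
```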

Multiple Aggregations with .agg()

.agg() lets you compute multiple statistics simultaneously — including custom functions — in a single pass:

python
# Apply different functions to different columns
summary = df.groupby("category").agg(
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
    order_count=("order_id", "count"),
    unique_customers=("customer_id", "nunique"),
    median_basket=("items", "median"),
    revenue_std=("revenue", "std"),
).round(2)

print(summary)

# Apply a custom function
def revenue_range(s):
    return s.max() - s.min()

df.groupby("category")["revenue"].agg(["mean", "std", revenue_range])

Common aggregation functions available by name string:

| Function | Result |
|---|---|
| "sum" | Total of all values |
| "mean" | Arithmetic mean |
| "median" | Middle value (robust to outliers) |
| "std" | Sample standard deviation |
| "min" / "max" | Minimum / maximum |
| "count" | Count of non-null values |
| "nunique" | Count of distinct values |
| "first" / "last" | First or last value in group |
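Any of these name strings can be passed straight to .agg(). A small sketch on hypothetical data, which also shows that "first"/"last" depend on row order within each group:

```python
import pandas as pd

# Hypothetical sample data; "first"/"last" pick values by row order within each group
df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "revenue":  [10, 30, 20, 40],
})

stats = df.groupby("category")["revenue"].agg(["min", "max", "nunique", "first", "last"])
print(stats)
```

Sorting the frame before grouping changes what "first" and "last" return, so sort deliberately when using them.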

Transform vs Aggregate

.agg() reduces each group to one row. .transform() returns a result with the same shape as the input, which is useful for computing group-level statistics as new columns:

python
# Add a column showing each order's revenue as % of its category total
df["category_total"] = df.groupby("category")["revenue"].transform("sum")
df["pct_of_category"] = df["revenue"] / df["category_total"] * 100

# Compute z-scores within each group
df["revenue_zscore"] = df.groupby("category")["revenue"].transform(
    lambda x: (x - x.mean()) / x.std()
)

.pivot_table()

pivot_table is a convenience layer over groupby that produces a 2-D cross-tabulation — rows represent one category, columns represent another:

python
pivot = df.pivot_table(
    values  = "revenue",
    index   = "region",
    columns = "product_line",
    aggfunc = "sum",
    fill_value = 0,      # replace NaN with 0 in sparse cells
    margins    = True    # add row/column totals
)

print(pivot)

For simple frequency counts, use pd.crosstab:

python
ct = pd.crosstab(df["region"], df["customer_tier"], margins=True)
print(ct)

# Normalize to row percentages
ct_pct = pd.crosstab(df["region"], df["customer_tier"], normalize="index").round(3)
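With normalize="index" each row becomes proportions that sum to 1; "columns" and "all" normalize per column or over the whole table instead. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical sample data for the region x tier cross-tabulation
df = pd.DataFrame({
    "region":        ["North", "North", "North", "South"],
    "customer_tier": ["gold", "gold", "silver", "gold"],
})

# Each row of the result sums to 1: within-region tier proportions
ct_pct = pd.crosstab(df["region"], df["customer_tier"], normalize="index")
print(ct_pct)
```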

Multi-Level (Hierarchical) Indices

When you group by multiple keys, the result has a MultiIndex. Navigating it requires .xs(), tuple indexing, or flattening:

python
multi = df.groupby(["year", "quarter", "region"])["revenue"].sum()
print(multi)

# Access a specific group with a full index tuple
print(multi.loc[(2024, "Q2", "North")])

# Cross-section: all Q2 data across years and regions
print(multi.xs("Q2", level="quarter"))

# Flatten MultiIndex columns, e.g. after a pivot_table with multiple values columns.
# (The pivot above has flat columns -- joining those would split strings
# character by character -- so guard with an isinstance check first.)
if isinstance(pivot.columns, pd.MultiIndex):
    pivot.columns = ["_".join(str(c) for c in col) for col in pivot.columns]

# Reset index to turn MultiIndex into regular columns
flat = multi.reset_index()

Applying GroupBy Results

A common workflow pattern: compute group-level aggregations, then merge them back onto the original DataFrame as enrichment columns:

python
region_avg = df.groupby("region")["revenue"].mean().rename("region_avg_revenue")
df = df.merge(region_avg, on="region", how="left")
df["above_regional_avg"] = df["revenue"] > df["region_avg_revenue"]

print(df[["order_id", "region", "revenue", "region_avg_revenue", "above_regional_avg"]].head())
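The same enrichment can be done without a merge by using .transform() from earlier, which broadcasts the group statistic back onto every row directly. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical sample data in place of the orders DataFrame
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region":   ["North", "North", "South", "South"],
    "revenue":  [100.0, 300.0, 50.0, 150.0],
})

# transform aligns each group's mean to the original rows, so no merge is needed
df["region_avg_revenue"] = df.groupby("region")["revenue"].transform("mean")
df["above_regional_avg"] = df["revenue"] > df["region_avg_revenue"]
```

Prefer transform when the result stays in the same DataFrame; merge is the better fit when the aggregation is reused across several tables.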

Summary

  • groupby follows split-apply-combine: groups are split by key(s), aggregation functions are applied independently per group, and results are combined into a new structure.
  • .agg() with named aggregations (the result_col=(source_col, func) syntax) produces clean, readable summary DataFrames in one pass.
  • .transform() returns group statistics aligned to the original DataFrame's row index — the right tool for adding contextual columns without changing row count.
  • pivot_table organises aggregation results into a cross-tabular format; pd.crosstab handles frequency counts with normalization built in.
  • Multi-level indices arise naturally from multi-key groupby; use .xs(), .reset_index(), or column flattening to work with them conveniently.