Aggregation & GroupBy
GroupBy operations follow a pattern called split-apply-combine: split the DataFrame into groups based on one or more key columns, apply an aggregation function to each group independently, then combine the results into a new Series or DataFrame. This pattern covers a large share of everyday analytical questions, such as "what is the average order value per customer segment?" or "which product categories grew the most last quarter?"
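To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and compares the result with the groupby shortcut. The orders DataFrame and its column names (segment, order_value) are invented for illustration.

```python
import pandas as pd

# Hypothetical order data; the column names are assumptions for this sketch.
orders = pd.DataFrame({
    "segment": ["retail", "retail", "wholesale", "wholesale", "retail"],
    "order_value": [20.0, 40.0, 100.0, 300.0, 30.0],
})

means = {}
for key in orders["segment"].unique():
    group = orders[orders["segment"] == key]    # split: rows for one key
    means[key] = group["order_value"].mean()    # apply: one number per group

manual = pd.Series(means).sort_index()          # combine: collect the results

# The same computation expressed with groupby.
shortcut = orders.groupby("segment")["order_value"].mean()
```

Both manual and shortcut hold one mean per segment; groupby simply hides the loop and is usually much faster on large data.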
Basic .groupby()
By default, groupby puts the key column(s) in the index. Pass as_index=False to get a flat DataFrame back — often more convenient for subsequent merges or plotting.
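A small sketch of the difference, using an invented orders DataFrame (the segment and order_value columns are assumptions, not part of any real dataset):

```python
import pandas as pd

# Hypothetical order data; the column names are assumptions for this sketch.
orders = pd.DataFrame({
    "segment": ["retail", "retail", "wholesale", "wholesale", "retail"],
    "order_value": [20.0, 40.0, 100.0, 300.0, 30.0],
})

# Default behaviour: the key column becomes the index of the result.
mean_by_segment = orders.groupby("segment")["order_value"].mean()

# as_index=False keeps "segment" as an ordinary column of a flat DataFrame.
mean_flat = orders.groupby("segment", as_index=False)["order_value"].mean()
```

Here mean_by_segment is a Series indexed by segment, while mean_flat is a two-column DataFrame that can be passed straight to merge or a plotting call.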
Multiple Aggregations with .agg()
.agg() lets you compute multiple statistics, including custom functions, in a single call:
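One way this might look, using named aggregations and a hypothetical custom function called value_range; the data and column names are again invented for illustration:

```python
import pandas as pd

# Hypothetical order data; the column names are assumptions for this sketch.
orders = pd.DataFrame({
    "segment": ["retail", "retail", "wholesale", "wholesale", "retail"],
    "order_value": [20.0, 40.0, 100.0, 300.0, 30.0],
})

def value_range(values):
    """Custom aggregation: gap between the largest and smallest value."""
    return values.max() - values.min()

# Each keyword is an output column: name=(source column, function).
summary = orders.groupby("segment").agg(
    total_value=("order_value", "sum"),
    avg_value=("order_value", "mean"),
    n_orders=("order_value", "count"),
    value_spread=("order_value", value_range),
)
```

The result has one row per segment and one column per keyword, so the output names are chosen up front instead of being renamed afterwards.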
Common aggregation functions available by name string:
| Function | Result |
|---|---|
| "sum" | Total of all values |
| "mean" | Arithmetic mean |
| "median" | Middle value (robust to outliers) |
| "std" | Sample standard deviation |
| "min" / "max" | Minimum / maximum |
| "count" | Count of non-null values |
| "nunique" | Count of distinct values |
| "first" / "last" | First / last non-null value in each group |
Transform vs Aggregate
.agg() reduces each group to one row. .transform() returns a result with the same shape as the input, which is useful for computing group-level statistics as new columns:
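A brief sketch of that use, assuming the same invented orders data (segment and order_value are illustrative names):

```python
import pandas as pd

# Hypothetical order data; the column names are assumptions for this sketch.
orders = pd.DataFrame({
    "segment": ["retail", "retail", "wholesale", "wholesale", "retail"],
    "order_value": [20.0, 40.0, 100.0, 300.0, 30.0],
})

by_segment = orders.groupby("segment")["order_value"]

# transform broadcasts each group's statistic back to every row of that group.
orders["segment_mean"] = by_segment.transform("mean")

# Row-level value as a fraction of its group's total.
orders["share_of_segment"] = orders["order_value"] / by_segment.transform("sum")
```

The DataFrame still has five rows; each row now carries its segment's mean and its own share of the segment total.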
.pivot_table()
pivot_table is a convenience layer over groupby that produces a 2-D cross-tabulation — rows represent one category, columns represent another:
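As an illustration, the sketch below totals revenue by region and category. The sales DataFrame and its column names are invented for this example.

```python
import pandas as pd

# Hypothetical sales data; the column names are assumptions for this sketch.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "north"],
    "category": ["toys", "books", "toys", "books", "toys"],
    "revenue": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Rows are regions, columns are categories, cells hold summed revenue.
table = sales.pivot_table(
    index="region",
    columns="category",
    values="revenue",
    aggfunc="sum",
    fill_value=0,  # shown in place of NaN for combinations with no rows
)
```

The same numbers could be produced with groupby(["region", "category"]) followed by unstack(); pivot_table just states the intended layout directly.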
For simple frequency counts, use pd.crosstab:
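A minimal sketch with invented sales data (region and category are illustrative column names), showing raw counts and row-normalized shares:

```python
import pandas as pd

# Hypothetical sales data; the column names are assumptions for this sketch.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "north"],
    "category": ["toys", "books", "toys", "books", "toys"],
})

# Number of rows for each region/category combination.
counts = pd.crosstab(sales["region"], sales["category"])

# normalize="index" turns each row into fractions that sum to 1.
row_share = pd.crosstab(sales["region"], sales["category"], normalize="index")
```

Other accepted values for normalize include "columns" and "all", depending on which total the fractions should be relative to.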
Multi-Level (Hierarchical) Indices
When you group by multiple keys, the result has a MultiIndex. Navigating it requires .xs(), tuple indexing, or flattening:
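The sketch below walks through those three options on an invented sales DataFrame; all column and variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sales data; the column names are assumptions for this sketch.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "north"],
    "category": ["toys", "books", "toys", "books", "toys"],
    "revenue": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Two grouping keys give a result indexed by (region, category).
grouped = sales.groupby(["region", "category"]).agg(
    total=("revenue", "sum"),
    avg=("revenue", "mean"),
)

# Tuple indexing: select one (region, category) combination.
north_toys_total = grouped.loc[("north", "toys"), "total"]

# .xs(): cross-section on one level, here every region's "toys" row.
toys = grouped.xs("toys", level="category")

# Flattening the row index back into ordinary columns.
flat = grouped.reset_index()

# A dict of lists in .agg() yields MultiIndex columns; join the levels to flatten.
wide = sales.groupby("region").agg({"revenue": ["sum", "mean"]})
wide.columns = ["_".join(col) for col in wide.columns]
```

Which option fits best tends to depend on what comes next: tuple indexing and .xs() suit lookups, while flattened frames are easier to merge, export, or plot.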
Applying GroupBy Results
A common workflow pattern: compute group-level aggregations, then merge them back onto the original DataFrame as enrichment columns:
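A sketch of that workflow under the same assumptions as before (an invented orders DataFrame with segment and order_value columns):

```python
import pandas as pd

# Hypothetical order data; the column names are assumptions for this sketch.
orders = pd.DataFrame({
    "segment": ["retail", "retail", "wholesale", "wholesale", "retail"],
    "order_value": [20.0, 40.0, 100.0, 300.0, 30.0],
})

# Step 1: one row of statistics per segment, with the key kept as a column.
segment_stats = orders.groupby("segment", as_index=False).agg(
    segment_avg=("order_value", "mean"),
    segment_orders=("order_value", "count"),
)

# Step 2: a left merge attaches the statistics to every original row.
enriched = orders.merge(segment_stats, on="segment", how="left")

# Step 3: derive a row-level comparison against the group statistic.
enriched["vs_segment_avg"] = enriched["order_value"] - enriched["segment_avg"]
```

For a single statistic, .transform() reaches the same result with less code; the merge approach is more convenient when several group-level columns are added at once or when the summary table is reused elsewhere.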
Summary
- groupby follows split-apply-combine: rows are split into groups by key(s), aggregation functions are applied independently per group, and results are combined into a new structure.
- .agg() with named aggregations (the result_col=(source_col, func) syntax) produces clean, readable summary DataFrames in one call.
- .transform() returns group statistics aligned to the original DataFrame's row index, which makes it the right tool for adding contextual columns without changing row count.
- pivot_table organises aggregation results into a cross-tabular format; pd.crosstab handles frequency counts with normalization built in.
- Multi-level indices arise naturally from multi-key groupby; use .xs(), .reset_index(), or column flattening to work with them conveniently.