Pandas — DataFrames, Data Cleaning, Analysis & Transformation
Part 1: Series — The 1D Building Block
A Series is a one-dimensional labeled array. Think of it as a column in a spreadsheet where every row has a name (the index). Unlike a plain Python list, operations on two Series automatically align on the index — a property that makes pandas both powerful and occasionally surprising.
Every Series has:
- values — a NumPy array holding the actual data
- index — a label array (default: 0, 1, 2, … RangeIndex)
- dtype — the data type (int64, float64, object, datetime64, category, …)
- name — optional string used as the column header when the Series is inside a DataFrame
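Index alignment is worth seeing once in code. A minimal sketch (the Series names and values are illustrative): adding two Series matches values by label, not by position, and labels present in only one operand produce NaN.

```python
import pandas as pd

# Two Series with overlapping but unequal indexes.
a = pd.Series([10, 20, 30], index=["x", "y", "z"])
b = pd.Series([1, 2, 3], index=["y", "z", "w"])

# Addition aligns on labels: "y" and "z" match, "x" and "w" do not.
result = a + b
# "y" -> 20 + 1 = 21.0; "w" exists only in b, so result["w"] is NaN
```

This is exactly the "powerful and occasionally surprising" behavior: you never have to worry about row order, but a mismatched index silently yields NaN instead of an error.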
Part 2: DataFrame Fundamentals
A DataFrame is a 2D table: a dict of Series that all share the same index. Every column is a Series; every row is a cross-section with a label from the index.
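A quick sketch of that mental model (column names and data are made up for illustration): constructing a DataFrame from a dict of columns, then pulling out one column and one row — both come back as Series.

```python
import pandas as pd

# A dict of equal-length columns sharing one index.
df = pd.DataFrame(
    {"city": ["Oslo", "Bergen", "Oslo"], "temp": [4.5, 7.1, 3.9]},
    index=["a", "b", "c"],
)

col = df["temp"]    # a column is a Series (index: a, b, c)
row = df.loc["b"]   # a row is also a Series (index: city, temp)
```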
Part 3: Data Cleaning — Production Techniques
Real data is messy. Missing values, wrong types, duplicate rows, inconsistent casing, hidden whitespace — all of these exist in production datasets. Systematic cleaning is the difference between analysis you can trust and analysis that quietly lies.
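The problems listed above each have a standard fix. A minimal sketch (the sample data is invented): coerce bad strings to NaN instead of crashing, strip hidden whitespace, normalize casing, and drop the duplicates that surface only after normalization.

```python
import pandas as pd

raw = pd.DataFrame({
    "price": ["$1,200.50", "980", "n/a", "980"],
    "city": [" Oslo ", "bergen", "Oslo", "bergen"],
})

clean = raw.copy()
# Strip currency formatting; errors="coerce" turns unparseable values into NaN.
clean["price"] = pd.to_numeric(
    clean["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)
# Remove hidden whitespace and unify casing.
clean["city"] = clean["city"].str.strip().str.title()
# Exact duplicates often only appear after normalization.
clean = clean.drop_duplicates()
```

Order matters here: de-duplicating before normalizing whitespace and casing would have missed the repeated (980, bergen) row.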
Part 4: Indexing Mastery
Part 5: Apply, Map, and Transform
Pandas provides several function-application methods. Choosing correctly between them matters for correctness and performance.
| Method | Applied to | Input | Output shape | Use when |
|--------|-----------|-------|-------------|----------|
| Series.map() | Series | one element | same length | element-wise with dict or function |
| Series.apply() | Series | one element | any | element-wise when no vectorized op exists |
| DataFrame.apply(axis=0) | DataFrame | one column (Series) | flexible | column-wise aggregation |
| DataFrame.apply(axis=1) | DataFrame | one row (Series) | same rows | row-wise custom logic |
| DataFrame.map() | DataFrame | one cell | same shape | element-wise on entire DataFrame |
| groupby.transform() | GroupBy | group (Series) | same rows as input | add group statistics as a column |
| .pipe() | any | object itself | any | method chaining |
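Three rows of the table above, side by side in one sketch (the DataFrame is illustrative): map for element-wise lookup, apply(axis=0) for a per-column reduction, and groupby.transform for a group statistic broadcast back to every row.

```python
import pandas as pd

df = pd.DataFrame({"team": ["a", "a", "b"], "score": [10, 20, 30]})

# Series.map: element-wise lookup via a dict (unmapped labels become NaN).
labels = df["team"].map({"a": "Alpha", "b": "Beta"})

# DataFrame.apply(axis=0): function receives each column, returns one value per column.
col_max = df[["score"]].apply(max)

# groupby.transform: per-group mean, broadcast back to the original row count.
df["team_mean"] = df.groupby("team")["score"].transform("mean")
```

Note the shapes: map and transform preserve length, while apply(axis=0) collapses each column to a single value — exactly the "Output shape" column of the table.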
Part 6: GroupBy — Split-Apply-Combine
GroupBy is one of pandas' most important abstractions. It lets you split a DataFrame into groups, apply a function to each, and combine the results — all in one expressive call.
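A compact sketch of split-apply-combine (sample data invented), using named aggregation so the output columns are labeled in the same call:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "amount": [100, 150, 80, 120],
})

# Split by region, apply sum/count to each group, combine into one frame.
summary = sales.groupby("region").agg(
    total=("amount", "sum"),
    deals=("amount", "count"),
)
```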
Part 7: Merging, Joining, and Concatenating
Part 8: Reshaping Data
Part 9: Time Series
Part 10: Performance
PROJECT: Full Real-Estate Data Analysis Pipeline
This project loads a messy real-estate dataset, cleans it systematically, analyzes it with groupby and merging, and generates a polished summary report.
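One way the pipeline's skeleton could look — a hedged sketch, not the project solution; the column names (price, neighborhood) and sample rows are assumptions for illustration:

```python
import pandas as pd

def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Clean listings, then summarize price by neighborhood."""
    out = df.copy()
    # Coerce unparseable prices to NaN, then drop them with the duplicates.
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    out = out.dropna(subset=["price"]).drop_duplicates()
    return out.groupby("neighborhood").agg(
        listings=("price", "count"),
        median_price=("price", "median"),
    )

raw = pd.DataFrame({
    "neighborhood": ["east", "east", "west", "west"],
    "price": ["300000", "bad", "450000", "450000"],
})
report = run_pipeline(raw)
```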
Exercises
Exercise 1 (Easy): Create a DataFrame with columns name, score, grade for 6 students. Use .loc to select students with score > 80. Add a column passed that is True when score >= 60. Print the result.
Exercise 2 (Easy): Given a Series of transaction amounts as strings (e.g., "$1,200.50"), clean them and compute the total, mean, and max.
Exercise 3 (Medium): Load the CSV below into a DataFrame. Clean it: fix types, handle missing values, remove duplicates. Then group by category and compute total and average amount.
Exercise 4 (Medium): Implement a z-score outlier removal function. Given a DataFrame and column name, return a copy with rows removed where the z-score of that column exceeds a threshold.
Exercise 5 (Medium): Create two DataFrames — orders and products — and merge them. Compute revenue (quantity * price) per product category.
Exercise 6 (Medium): Build a pivot table from sales data showing total sales by region (rows) and product (columns). Add a totals row and column.
Exercise 7 (Hard): Implement a complete time-series feature engineering pipeline: given daily price data, compute 7-day and 30-day moving averages, daily returns, 7-day rolling volatility, and a boolean flag for "golden cross" (MA7 crosses above MA30).
Exercise 8 (Hard): Build a data quality checker class that takes a DataFrame and produces a structured report: column types, missing percentages, unique counts, outlier counts (IQR), and a quality score.
Key Takeaways
- Index alignment is pandas' superpower and footgun: operations between Series/DataFrames automatically align on the index — invaluable for correctness, but they produce all-NaN results when your indexes unexpectedly fail to match
- .loc is label-based, .iloc is position-based: always use .loc[mask, col] = value for assignment — chained indexing (df["col"][mask] = val) may silently operate on a copy
- pd.to_numeric(errors='coerce') and pd.to_datetime() are your first cleaning moves: they convert bad data to NaN rather than crashing, giving you explicit control over the repair
- Named aggregations are more readable: agg(total=("sales","sum"), count=("sales","count")) produces clean, named columns in one call instead of chaining .rename()
- transform keeps the original shape: use it when you want to add a group statistic back to the original DataFrame without reducing rows; .agg() collapses rows
- Categorical dtype is a free 5–20x memory saving: any string column with low cardinality (status, region, type) should be astype("category")
- merge vs concat: merge joins on column values (SQL JOIN semantics); concat stacks DataFrames along an axis (UNION ALL semantics) — they solve completely different problems
- Prefer vectorized ops over apply(): apply(axis=1) is a Python loop over rows — it's 50–200x slower than equivalent NumPy/pandas vectorized operations
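The last takeaway in one small sketch (toy data, so only the mechanism is shown, not the timing): both forms compute the same result, but the vectorized version runs in compiled code over whole columns instead of calling a Python function per row.

```python
import pandas as pd

df = pd.DataFrame({"qty": [2, 3, 4], "price": [10.0, 20.0, 5.0]})

# Slow pattern: apply(axis=1) invokes the lambda once per row.
slow = df.apply(lambda row: row["qty"] * row["price"], axis=1)

# Fast pattern: one vectorized multiplication over the whole columns.
fast = df["qty"] * df["price"]
```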