Real-world data is distributed across multiple tables — a fact table of transactions, a dimension table of customers, a reference table of product metadata. Combining these tables correctly, and reshaping the result into the form your analysis needs, is one of the most frequently executed operations in data work.
.merge() — SQL-Style Joins
pd.merge() (or df.merge()) joins two DataFrames on one or more key columns. The how parameter controls which rows are retained:
```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [101, 102, 103, 999],
    "amount": [50, 80, 30, 20]
})

customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "tier": ["Gold", "Silver", "Gold", "Bronze"]
})

# Inner join — only matched rows (default)
inner = orders.merge(customers, on="customer_id", how="inner")

# Left join — all orders; NaN in customer columns where no match
left = orders.merge(customers, on="customer_id", how="left")

# Outer join — all rows from both tables
outer = orders.merge(customers, on="customer_id", how="outer")

# Right join — all customers; NaN in order columns where no match
right = orders.merge(customers, on="customer_id", how="right")
```
Join type summary:
| how | Rows retained |
|---|---|
| "inner" | Only rows where the key exists in both DataFrames |
| "left" | All rows from the left DataFrame; NaN for unmatched right |
| "right" | All rows from the right DataFrame; NaN for unmatched left |
| "outer" | All rows from both DataFrames; NaN where unmatched |
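To audit which side each output row came from, merge accepts indicator=True, which appends a _merge column holding "left_only", "right_only", or "both". A small sketch reusing the orders and customers tables from above:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [101, 102, 103, 999],
    "amount": [50, 80, 30, 20]
})
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "name": ["Alice", "Bob", "Carol", "Dave"]
})

# Outer join keeps everything; _merge shows each row's provenance:
# 101-103 match ("both"), order 999 is "left_only", customer 104 is "right_only"
outer = orders.merge(customers, on="customer_id", how="outer",
                     indicator=True)
print(outer[["customer_id", "_merge"]])
```

Filtering on _merge (e.g. `outer[outer["_merge"] == "left_only"]`) is a quick way to surface orphaned keys after a join.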
Joining on Different Column Names
```python
# Hypothetical variant of orders whose key column is named "cust_id"
merged = orders.merge(
    customers,
    left_on="cust_id",        # key column in the left DataFrame
    right_on="customer_id",   # key column in the right DataFrame
    how="left",
    suffixes=("_order", "_customer")  # disambiguate duplicate column names
)
```
Detecting Join Explosions
When the join key is duplicated in both tables, the merge produces one output row for every left-right pairing, so the result can contain far more rows than either input — a "many-to-many explosion". Always validate:
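pandas can enforce the expected key cardinality at merge time through the validate parameter, which raises a MergeError when the relationship is violated. A minimal sketch, reusing the orders and customers tables from above (the duplicated table is constructed here purely to trigger the error):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [101, 102, 103, 999],
    "amount": [50, 80, 30, 20]
})
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "name": ["Alice", "Bob", "Carol", "Dave"]
})

# Each order matches at most one customer: many-to-one is safe
ok = orders.merge(customers, on="customer_id", how="left",
                  validate="many_to_one")

# Duplicate keys on the right would silently multiply rows;
# validate="many_to_one" turns that into a loud failure instead
dup_customers = pd.concat([customers, customers], ignore_index=True)
try:
    orders.merge(dup_customers, on="customer_id", validate="many_to_one")
except pd.errors.MergeError as e:
    print("merge refused:", e)

# A cheap sanity check regardless of validate: compare row counts
assert len(ok) == len(orders)
```

Accepted values are "one_to_one", "one_to_many", "many_to_one", and "many_to_many"; comparing `len(result)` against the inputs after every merge catches explosions that validate alone cannot anticipate.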
.concat() — Stacking DataFrames
pd.concat() stacks DataFrames either vertically (appending rows) or horizontally (appending columns):
```python
q1 = pd.read_csv("sales_q1.csv")
q2 = pd.read_csv("sales_q2.csv")
q3 = pd.read_csv("sales_q3.csv")

# Stack vertically — requires the same column schema
full_year = pd.concat([q1, q2, q3], axis=0, ignore_index=True)

# Add a label to track each row's source
full_year = pd.concat(
    [q1, q2, q3],
    keys=["Q1", "Q2", "Q3"],
    names=["quarter", "original_idx"]
)

# Stack horizontally — rows are aligned by index label, not position
features = pd.concat([df_numeric, df_categorical], axis=1)
```
.melt() — Wide to Long
Many datasets arrive in "wide" format, where each measurement has its own column. Long format has one row per observation and is the shape expected by Seaborn's plotting functions and many ML pipelines:
```python
wide = pd.DataFrame({
    "country": ["USA", "Germany", "Japan"],
    "2021": [500, 300, 420],
    "2022": [530, 310, 440],
    "2023": [560, 325, 460]
})

long = wide.melt(
    id_vars=["country"],                  # columns to keep as identifiers
    value_vars=["2021", "2022", "2023"],  # columns to unpivot
    var_name="year",                      # name for the new key column
    value_name="gdp_index"                # name for the new value column
)

print(long.head(6))
#    country  year  gdp_index
#        USA  2021        500
#    Germany  2021        300
# ...
```
.pivot() — Long to Wide
.pivot() is the inverse of .melt(). It takes a long-format DataFrame and spreads one column's values across new columns:
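A minimal sketch, starting from the same long-format data that .melt() produced above (rebuilt inline here so the snippet runs on its own):

```python
import pandas as pd

# Long-format data: one row per (country, year) observation
long = pd.DataFrame({
    "country": ["USA", "Germany", "Japan", "USA", "Germany", "Japan"],
    "year": ["2021", "2021", "2021", "2022", "2022", "2022"],
    "gdp_index": [500, 300, 420, 530, 310, 440]
})

# Spread "year" values into columns, one row per country
wide_again = long.pivot(index="country", columns="year",
                        values="gdp_index")
print(wide_again)
# year     2021  2022
# country
# Germany   300   310
# Japan     420   440
# USA       500   530
```

Note that .pivot() raises a ValueError if any (index, columns) pair appears more than once; for data with duplicates, .pivot_table() aggregates them instead (mean by default).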