GadaaLabs
Data Analysis with Python
Lesson 3

Pandas DataFrames

15 min

Pandas introduces two central data structures: the Series (a labelled 1-D array) and the DataFrame (a labelled 2-D table of Series). Together they provide the ergonomic, label-aware interface that makes Python data analysis feel as natural as working in a spreadsheet — while remaining far more powerful and programmable.

Series vs DataFrame

A Series is essentially a NumPy array with an index. Each element has both a positional location (integer 0, 1, 2 …) and a label (the index value, which can be strings, dates, or anything hashable).

python
import pandas as pd
import numpy as np

# Create a Series
revenue = pd.Series(
    [120_000, 95_000, 140_000, 87_000],
    index=["Q1", "Q2", "Q3", "Q4"],
    name="revenue_usd"
)

print(revenue["Q3"])    # 140000 — label access
print(revenue.iloc[2])  # 140000 — positional access (prefer .iloc over plain revenue[2])
print(revenue.mean())   # 110500.0

A DataFrame is a dictionary of Series that share the same index. Every column is a Series; the DataFrame simply aligns them on a common row index.

python
df = pd.DataFrame({
    "product": ["Widget", "Gadget", "Doohickey"],
    "price":   [9.99, 24.99, 4.49],
    "units":   [340, 120, 890]
})
df["revenue"] = df["price"] * df["units"]
print(df)
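Label alignment is the key consequence of this design: arithmetic between Series matches rows by index label, not by position. A minimal sketch with invented fruit data:

```python
import pandas as pd

# Two Series with the same labels in different orders (hypothetical data)
prices = pd.Series({"apple": 2.0, "banana": 0.5})
units  = pd.Series({"banana": 10, "apple": 4})

# Multiplication aligns on labels, so the differing order is harmless
revenue = prices * units
print(revenue["apple"])   # 8.0
print(revenue["banana"])  # 5.0
```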

Reading Data from Multiple Formats

Pandas ships with first-class readers for the most common data formats:

python
# CSV
df_csv = pd.read_csv("sales.csv", parse_dates=["date"], index_col="id")

# Excel (requires openpyxl)
df_xl  = pd.read_excel("report.xlsx", sheet_name="Q4")

# JSON (records orientation)
df_json = pd.read_json("events.json", orient="records")

# Parquet (columnar, efficient for large datasets)
df_parq = pd.read_parquet("dataset.parquet", engine="pyarrow")

# From a URL
df_web = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")

Common read_csv parameters to know:

| Parameter | Purpose | Example |
|---|---|---|
| sep | Delimiter character | sep="\t" for TSV |
| parse_dates | Auto-parse these columns as dates | ["created_at"] |
| index_col | Use this column as the row index | "id" |
| usecols | Load only these columns | ["name", "age", "salary"] |
| dtype | Force column dtypes | {"zip": str} |
| na_values | Treat these strings as NaN | ["N/A", "none", "–"] |
| nrows | Read only first N rows (useful for inspection) | 1000 |
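To make the table concrete, here is a sketch that exercises several of these parameters at once on an in-memory, tab-separated sample. The column names and values are invented for illustration; a file path or URL would be passed the same way:

```python
import io
import pandas as pd

# Invented tab-separated data for illustration
raw = (
    "id\tname\tzip\tcreated_at\tsalary\n"
    "1\tAda\t02139\t2024-01-15\t95000\n"
    "2\tGrace\t10001\t2024-02-03\tN/A\n"
)

df = pd.read_csv(
    io.StringIO(raw),            # a path like "people.tsv" works the same way
    sep="\t",                    # tab delimiter
    index_col="id",              # use the "id" column as the row index
    parse_dates=["created_at"],  # parse this column as datetimes
    dtype={"zip": str},          # keep ZIPs as strings (preserves leading zeros)
    na_values=["N/A"],           # treat "N/A" as missing
)
print(df.dtypes)
```

Note that without dtype={"zip": str}, the value 02139 would load as the integer 2139.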

First-Pass Inspection

Before doing anything analytical, always run a quick inspection suite:

python
df = pd.read_csv("titanic.csv")

print(df.shape)          # (891, 12) — rows × columns
print(df.dtypes)         # dtype of each column
df.info()                # dtypes + non-null counts + memory usage
df.describe()            # count/mean/std/min/quartiles for numeric cols
df.describe(include="O") # count/unique/top/freq for object (string) columns
df.head(5)               # first 5 rows
df.tail(5)               # last 5 rows
df.sample(10, random_state=42)  # 10 random rows

df.info() is particularly valuable because it shows the count of non-null values per column, immediately revealing which columns have missing data and whether dtypes were inferred correctly (a column of integers stored as float64 is a signal that nulls snuck in).
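When keeping true integers matters, that float64 upcast can be avoided with pandas' nullable Int64 extension dtype, sketched here:

```python
import pandas as pd

# A missing value silently upcasts an integer column to float64
s = pd.Series([1, 2, None, 4])
print(s.dtype)             # float64

# The nullable extension dtype keeps integers alongside missing values
s_int = s.astype("Int64")  # note the capital "I"
print(s_int.dtype)         # Int64
print(s_int.isna().sum())  # 1
```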

Boolean Indexing

Boolean indexing is the primary filtering mechanism. You build a boolean Series (a mask) and pass it to the bracket operator:

python
df = pd.read_csv("titanic.csv")

# Single condition
survivors = df[df["survived"] == 1]

# Multiple conditions — use & (and), | (or), ~ (not)
# Always wrap each condition in parentheses!
first_class_survivors = df[(df["pclass"] == 1) & (df["survived"] == 1)]

# String conditions
women = df[df["sex"] == "female"]

# Membership test
second_or_third = df[df["pclass"].isin([2, 3])]

# Null checks
missing_age = df[df["age"].isna()]
has_age     = df[df["age"].notna()]
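The ~ (not) operator mentioned above is worth a quick demonstration of its own; this sketch uses a tiny invented DataFrame rather than the Titanic file:

```python
import pandas as pd

df = pd.DataFrame({
    "pclass":   [1, 2, 3, 1],
    "survived": [1, 0, 1, 0],
})

# ~ inverts a mask: rows that are NOT first class
not_first = df[~(df["pclass"] == 1)]
print(len(not_first))  # 2
```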

.loc and .iloc

Always use .loc (label-based) or .iloc (position-based) when you want to select both rows and columns simultaneously:

python
# .loc — label-based: df.loc[row_labels, column_labels]
subset = df.loc[df["survived"] == 1, ["name", "age", "fare"]]

# .iloc — position-based: df.iloc[row_ints, col_ints]
top_left = df.iloc[:5, :3]   # first 5 rows, first 3 columns

Mixing label access and positional access (e.g., using plain [] with a row integer when the index is a string) is a common source of bugs. Using .loc/.iloc explicitly eliminates ambiguity.
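That ambiguity is easiest to see on a Series whose integer index does not start at 0, sketched here with invented values:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s.loc["b"])  # 20, label-based
print(s.iloc[1])   # 20, position-based

# With an integer index, plain [] is ambiguous; .loc/.iloc are not
t = pd.Series([10, 20, 30], index=[3, 2, 1])
print(t.loc[1])    # 30, the row labelled 1
print(t.iloc[1])   # 20, the row at position 1
```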

Summary

  • A Series is a labelled 1-D array; a DataFrame is a dictionary of aligned Series sharing a common index.
  • Pandas can read CSV, Excel, JSON, Parquet, SQL, and more — each reader has parameters for parsing dates, specifying delimiters, and controlling dtypes.
  • Always run .shape, .info(), .describe(), and .head() immediately after loading a dataset to understand its structure before touching the data.
  • Boolean indexing with &, |, and ~ is the standard row-filtering approach; use .loc and .iloc for combined row-and-column selection to avoid ambiguous indexing.