Exploratory Data Analysis (EDA) is the practice of systematically interrogating a dataset to understand its structure, quality, and the stories it contains. Good EDA is not random — it follows a disciplined cycle: frame the business question, load and inspect the data, clean it, explore distributions and relationships, and synthesise findings into actionable insights.
This lesson walks through that full cycle using the publicly available NYC Yellow Taxi trip dataset.
Step 1 — Frame the Question
Before writing a single line of code, articulate what you are trying to learn. Vague EDA produces vague conclusions. For this analysis:
Question: What factors determine trip fare amount, and which time periods and borough pairs generate the most revenue for drivers?
Framing the question tells you which columns matter (fare_amount, pickup_datetime, dropoff_datetime, pickup_locationid, passenger_count, trip_distance) and what constitutes a useful finding.
Step 2 — Load and First Inspection
python
import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snssns.set_theme(style="whitegrid", font_scale=1.1)plt.rcParams["figure.dpi"] = 120# Load a sample month of NYC taxi datadf = pd.read_parquet( "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet")print(f"Shape: {df.shape}")print(f"\nColumn dtypes:\n{df.dtypes}")print(f"\nNull counts:\n{df.isnull().sum()}")print(f"\nSample:\n{df.sample(5, random_state=42)}")df.describe().T.style.background_gradient(cmap="Blues", subset=["mean", "std"])
Inspection reveals:
3.0 million rows and 19 columns for January 2023.
store_and_fwd_flag has ~98% nulls — a candidate for dropping.
fare_amount, tip_amount, and total_amount include negative values — physically impossible.
All three variables are right-skewed — a small number of long, expensive trips pull the mean well above the median.
Step 5 — Bivariate and Temporal Analysis
python
# Fare vs distance — the primary relationshipfig, ax = plt.subplots(figsize=(8, 5))sample = df.sample(5000, random_state=42)sns.scatterplot(data=sample, x="trip_distance", y="fare_amount", hue="is_weekend", alpha=0.4, ax=ax, palette=["#2196F3","#E91E63"])ax.set_title("Fare Amount vs Trip Distance")plt.show()# Average fare by hour of dayhourly = df.groupby("pickup_hour")["fare_amount"].mean().reset_index()fig, ax = plt.subplots(figsize=(10, 4))sns.lineplot(data=hourly, x="pickup_hour", y="fare_amount", marker="o", color="#2196F3", ax=ax)ax.axhspan(0, 24, alpha=0.05, color="grey")ax.set_xticks(range(0, 24))ax.set_title("Average Fare by Hour of Day")ax.set_xlabel("Hour (0 = midnight)")ax.set_ylabel("Avg Fare ($)")plt.show()# Correlation heatmapnumeric_cols = ["fare_amount", "trip_distance", "trip_duration_min", "passenger_count", "tip_amount"]corr = df[numeric_cols].corr()sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)plt.title("Correlation Matrix — Taxi Trip Variables")plt.show()
Step 6 — Synthesise Findings
After running the analysis, structure your findings as answers to the original question:
Finding 1 — Distance is the dominant fare predictor (r = 0.87). Trip distance explains the vast majority of fare variation. Duration adds modest predictive power beyond distance (r = 0.72 with fare), likely because traffic delays are metered.
Finding 2 — Surge-equivalent pricing appears during late-night hours. Fares between midnight and 4 AM average 22% higher per mile than the daily mean, consistent with taxi surcharges for overnight trips.
Finding 3 — Weekend vs weekday patterns differ mainly in volume, not price. Weekend fares are only $0.40 higher on average, but trip volume drops 18% on Sundays vs Saturdays.
Finding 4 — Passenger count has near-zero correlation with fare (r = 0.03). NYC metered fares do not scale with passenger count, so this variable will not be predictive in a regression model.
Step 7 — Communicate Insights
Structure your notebook conclusion as a short narrative:
## Key Takeaways- Trip distance is the primary driver of fare; model accordingly.- Late-night and JFK/LGA airport trips command higher per-mile fares and should be segmented separately in revenue analyses.- Passenger count is analytically irrelevant for fare prediction.- Approximately 4.2% of raw records required cleaning (negative fares, zero-distance trips, extreme durations) — document these exclusions in any published report.
Summary
EDA begins with a crisp question; without one, exploration is directionless.
The cleaning step typically removes 2–10% of raw records; always document what was removed and why.
Univariate analysis (distributions) precedes bivariate analysis (relationships) — understand each variable alone before asking how they relate.
Temporal slicing (by hour, day of week, month) almost always reveals operationally significant patterns in transactional data.
Synthesis means translating statistical observations back into the business vocabulary of the original question — that translation is what makes EDA useful.