GadaaLabs
Data Science Fundamentals
Lesson 1

Descriptive Statistics Every Data Scientist Must Know

12 min

Descriptive statistics are the language you use to summarise a dataset in a handful of numbers. Choosing the wrong summary statistic — the mean of a skewed salary distribution, the mode of a continuous variable — leads to misleading analyses and flawed models. This lesson gives you the vocabulary and the judgment to choose correctly.

Measures of Central Tendency

The three classical measures of centre each ask a different question:

  • Mean — the balance point of the distribution (sum divided by count).
  • Median — the middle value when sorted; the 50th percentile.
  • Mode — the most frequently occurring value.
python
import numpy as np
import pandas as pd
from scipy import stats

salaries = np.array([42000, 45000, 47000, 48000, 52000, 55000, 58000, 60000, 65000, 250000])

mean   = np.mean(salaries)      # 72_200 — pulled up by the outlier
median = np.median(salaries)    # 53_500 — unaffected by the outlier
mode   = stats.mode(salaries).mode  # 42000 — every value is unique, so this is just the smallest; mode suits categorical data

print(f"Mean:   ${mean:,.0f}")
print(f"Median: ${median:,.0f}")
print(f"Mode:   ${mode:,.0f}")

When to use each:

| Statistic | Best situation | Avoid when |
|---|---|---|
| Mean | Symmetric distributions, no extreme outliers | Heavy skew or outliers are present |
| Median | Skewed distributions (income, house prices) | Data is symmetric and you need sensitivity |
| Mode | Categorical data; finding the most common category | Data is continuous (rarely unique values) |

The salary example is instructive: a CEO earning $250,000 in a team of 10 pulls the mean to $72,200 — $7,200 more than anyone else on the team actually earns. The median ($53,500) is a far more representative measure of "typical" compensation.
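A middle ground between the mean and the median is the trimmed mean, which discards a fixed fraction of the extreme values before averaging. A quick sketch using `scipy.stats.trim_mean` on the same salary data:

```python
import numpy as np
from scipy import stats

salaries = np.array([42000, 45000, 47000, 48000, 52000, 55000,
                     58000, 60000, 65000, 250000])

# Trim 10% from each end (here: one value per end, including the CEO)
# before averaging — a compromise between mean and median.
trimmed = stats.trim_mean(salaries, proportiontocut=0.1)
print(f"Trimmed mean: ${trimmed:,.0f}")  # $53,750 — close to the median
```

With the outlier trimmed away, the result lands near the median, which is exactly the behaviour you want for "typical salary" questions.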

Measures of Spread

Central tendency alone is insufficient. Two datasets can share the same mean but have radically different shapes:

python
data_a = np.array([48, 50, 50, 51, 52])   # tightly clustered
data_b = np.array([10, 30, 50, 70, 90])   # widely spread

for label, data in [("A", data_a), ("B", data_b)]:
    print(f"\nDataset {label}:")
    print(f"  Mean:     {data.mean():.1f}")
    print(f"  Variance: {data.var(ddof=1):.1f}")   # sample variance (ddof=1)
    print(f"  Std Dev:  {data.std(ddof=1):.1f}")
    print(f"  Range:    {data.max() - data.min()}")
    q1, q3 = np.percentile(data, [25, 75])
    print(f"  IQR:      {q3 - q1:.1f}")

Variance is the average of squared deviations from the mean. Squaring puts it in squared units (e.g. dollars²), which is hard to interpret directly. Standard deviation is the square root of variance, restoring the original units.
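The definition is easy to verify by hand against NumPy — a useful sanity check that `ddof=1` means dividing by n − 1 (the sample variance), not n:

```python
import numpy as np

data = np.array([48, 50, 50, 51, 52])

mean = data.sum() / len(data)                # 50.2
sq_dev = (data - mean) ** 2                  # squared deviations from mean
sample_var = sq_dev.sum() / (len(data) - 1)  # divide by n - 1, not n

assert np.isclose(sample_var, data.var(ddof=1))
print(f"Sample variance: {sample_var:.1f}")  # 2.2
print(f"Sample std dev:  {sample_var ** 0.5:.2f}")
```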

IQR (interquartile range = Q3 − Q1) is robust to outliers because it only looks at the middle 50% of the data. It also anchors the box plot's whisker calculation: whiskers extend to the most extreme points within 1.5 × IQR of the quartiles, and anything beyond is drawn as an outlier.
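The 1.5 × IQR rule (Tukey's fences) doubles as a simple outlier detector. Applied to the earlier salary data, it correctly isolates the CEO:

```python
import numpy as np

salaries = np.array([42000, 45000, 47000, 48000, 52000, 55000,
                     58000, 60000, 65000, 250000])

q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey's fences

outliers = salaries[(salaries < lower) | (salaries > upper)]
print(f"Fences:   [${lower:,.0f}, ${upper:,.0f}]")
print(f"Outliers: {outliers}")                   # [250000]
```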

Skewness and When It Matters

Skewness measures the asymmetry of a distribution:

python
from scipy.stats import skew

data_normal    = np.random.default_rng(0).normal(50, 10, 1000)
data_right_skew = np.random.default_rng(0).exponential(scale=50, size=1000)

print(f"Normal skewness:   {skew(data_normal):.2f}")       # ~0
print(f"Right skew:        {skew(data_right_skew):.2f}")   # >1
  • Right (positive) skew: A long right tail. Mean > Median. Common in income, transaction amounts, and response times. Use median and log-transform before modelling.
  • Left (negative) skew: A long left tail. Mean < Median. Less common; seen in test scores with a ceiling effect.
  • Symmetric (near-zero skew): Mean ≈ Median ≈ Mode. Suitable for methods that assume normality.
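The log-transform advice for right-skewed data can be demonstrated directly. A sketch using `np.log1p` (log(1 + x), which also handles zeros safely) on the exponential sample from above:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
incomes = rng.exponential(scale=50, size=1000)   # strongly right-skewed

raw_skew = skew(incomes)
log_skew = skew(np.log1p(incomes))               # log-transform: log(1 + x)

print(f"Skewness before log: {raw_skew:.2f}")    # well above 1
print(f"Skewness after log:  {log_skew:.2f}")    # much smaller in magnitude
```

The transform does not make the data perfectly symmetric, but it pulls the skewness far closer to zero — often enough for normality-assuming methods to behave reasonably.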

The Five-Number Summary and Box Plots

The five-number summary (minimum, Q1, median, Q3, maximum) gives a complete picture of a distribution's shape with just five numbers:

python
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    "Engineering": np.random.default_rng(1).normal(95000, 15000, 200),
    "Marketing":   np.random.default_rng(2).normal(75000, 20000, 200),
    "Operations":  np.random.default_rng(3).normal(60000, 10000, 200)
})

long_df = df.melt(var_name="department", value_name="salary")

fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=long_df, x="department", y="salary",
            hue="department", palette="Set2", legend=False, ax=ax)
ax.set_title("Salary Distribution by Department")
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f"${x/1000:.0f}K"))
plt.show()

# Programmatic five-number summary
print(df.describe(percentiles=[0.25, 0.5, 0.75]).T[["min","25%","50%","75%","max"]])

Covariance and Correlation

When two variables are related, covariance measures the direction (and unit-dependent strength) of their linear relationship; correlation normalises it to the range [−1, 1]:

python
study_hours = np.array([2, 3, 4, 5, 6, 7, 8, 9])
test_scores = np.array([55, 60, 65, 70, 78, 82, 88, 91])

cov  = np.cov(study_hours, test_scores)[0, 1]
corr = np.corrcoef(study_hours, test_scores)[0, 1]

print(f"Covariance:  {cov:.2f}")   # depends on units
print(f"Correlation: {corr:.4f}")  # unit-free; ~0.998

A correlation of 0.998 indicates a very strong positive linear relationship. Correlation does not imply causation — both variables could be driven by a third confounding factor.
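Pearson correlation only captures linear association. For relationships that are monotonic but curved, Spearman's rank correlation (`scipy.stats.spearmanr`) is a useful companion — a sketch on a deliberately nonlinear example:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(1, 10, 50)
y = x ** 3                      # perfectly monotonic, but nonlinear

pearson, _ = pearsonr(x, y)
spearman, _ = spearmanr(x, y)

print(f"Pearson:  {pearson:.3f}")   # < 1 — penalised by the curvature
print(f"Spearman: {spearman:.3f}")  # exactly 1 — ranks move in lockstep
```

When the two coefficients disagree substantially, that is itself a diagnostic: the relationship is probably monotonic but not linear, and a scatter plot will show you how.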

Summary

  • Use the median for skewed or outlier-prone data (income, prices, durations); use the mean for symmetric distributions.
  • Standard deviation quantifies typical distance from the mean; IQR quantifies the spread of the middle 50% and is resistant to outliers.
  • Absolute skewness above 1 signals that log-transforming the variable may be worthwhile before applying models that assume normality.
  • The five-number summary (via .describe()) and box plots together give an efficient view of a variable's shape, centre, spread, and outliers.
  • Correlation is the unit-free, bounded measure of linear association; always inspect scatter plots alongside correlation coefficients to avoid being misled.