Descriptive Statistics Every Data Scientist Must Know
Descriptive statistics are the language you use to summarise a dataset in a handful of numbers. Choosing the wrong summary statistic — the mean of a skewed salary distribution, the mode of a continuous variable — leads to misleading analyses and flawed models. This lesson gives you the vocabulary and the judgment to choose correctly.
Measures of Central Tendency
The three classical measures of centre each ask a different question:
- Mean — the balance point of the distribution (sum divided by count).
- Median — the middle value when sorted; the 50th percentile.
- Mode — the most frequently occurring value.
When to use each:
| Statistic | Best situation | Avoid when |
|---|---|---|
| Mean | Symmetric distributions, no extreme outliers | Heavy skew or outliers are present |
| Median | Skewed distributions (income, house prices) | Data is symmetric and you need sensitivity to every value |
| Mode | Categorical data; finding the most common category | Data is continuous (values rarely repeat) |
The salary example is instructive: a CEO earning $250,000 in a team of 10 pulls the mean to $72,200, almost $20,000 above the median of $53,500. The median is a far more representative measure of "typical" compensation.
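To make the comparison concrete, here is a minimal sketch using Python's standard `statistics` module. The nine non-CEO salaries are assumed values, picked only so that the mean and median match the figures quoted above; the department labels are likewise hypothetical.

```python
import statistics

# Hypothetical team of ten: nine staff salaries (assumed values) plus a CEO on $250,000.
salaries = [45_000, 48_000, 50_000, 52_000, 53_000,
            54_000, 55_000, 57_000, 58_000, 250_000]

mean_salary = statistics.mean(salaries)      # balance point, dragged up by the CEO
median_salary = statistics.median(salaries)  # midpoint of the two middle values

print(f"mean:   ${mean_salary:,.0f}")    # mean:   $72,200
print(f"median: ${median_salary:,.0f}")  # median: $53,500

# The mode is the natural summary for categorical data (labels are hypothetical).
departments = ["engineering", "sales", "engineering", "support", "engineering", "sales"]
top_department = statistics.mode(departments)
print(f"most common department: {top_department}")  # most common department: engineering
```

Removing the last salary from the list and re-running shows how little the median moves compared with the mean.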
Measures of Spread
Central tendency alone is insufficient. Two datasets can share the same mean but have radically different spreads.
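A small sketch of that point, using two made-up samples (both lists are assumptions for illustration) that have identical means but very different spreads:

```python
import statistics

# Two hypothetical samples constructed to share the same mean.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

tight_mean = statistics.mean(tight)
wide_mean = statistics.mean(wide)

# Population standard deviation: typical distance of a value from the mean.
tight_std = statistics.pstdev(tight)
wide_std = statistics.pstdev(wide)

print(f"tight: mean={tight_mean}, std={tight_std:.2f}")
print(f"wide:  mean={wide_mean}, std={wide_std:.2f}")
```

The means agree, yet the second sample's standard deviation is many times larger, which is exactly the information a measure of centre cannot carry.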
Variance is the average of the squared deviations from the mean. Squaring puts it in squared units (dollars², for salaries), which are hard to interpret. Standard deviation is the square root of the variance, which restores the original units.
IQR (interquartile range = Q3 − Q1) is robust to outliers because it depends only on the middle 50% of the data. This makes it the basis of the box plot's whisker rule: by the usual convention, whiskers reach no further than 1.5 × IQR beyond Q1 and Q3, and points outside that range are drawn as outliers.
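The robustness claim can be checked directly. The sketch below assumes a hypothetical salary list (values invented for illustration) and compares standard deviation and IQR with and without the single extreme value. The `iqr` helper is not a library function; it is defined here on top of `statistics.quantiles`.

```python
import statistics

# Hypothetical salaries in dollars (assumed values); the last entry is the outlier.
salaries = [45_000, 48_000, 50_000, 52_000, 53_000,
            54_000, 55_000, 57_000, 58_000, 250_000]

def iqr(values):
    """Interquartile range, Q3 - Q1, using linearly interpolated quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

with_outlier = salaries
without_outlier = salaries[:-1]

variance = statistics.variance(with_outlier)  # sample variance (n - 1), in dollars²
std = statistics.stdev(with_outlier)          # square root of the variance, in dollars
std_trimmed = statistics.stdev(without_outlier)

print(f"variance: {variance:,.0f} dollars²")
print(f"std with / without outlier: {std:,.0f} / {std_trimmed:,.0f}")
print(f"IQR with / without outlier: {iqr(with_outlier):,.0f} / {iqr(without_outlier):,.0f}")
```

With these numbers the single outlier inflates the standard deviation more than tenfold, while the IQR only moves from 5,000 to 6,000. Note that `statistics.variance` and `statistics.stdev` divide by n − 1 (sample estimates); `pvariance` and `pstdev` divide by n.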
Skewness and When It Matters
Skewness measures the asymmetry of a distribution:
- Right (positive) skew: A long right tail. Mean > Median. Common in income, transaction amounts, and response times. Summarise with the median and consider a log transform before modelling.
- Left (negative) skew: A long left tail. Mean < Median. Less common; seen in test scores with a ceiling effect.
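- Symmetric (near-zero skew): Mean ≈ Median (and ≈ Mode when the distribution has a single peak). A reasonable starting point for methods that assume normality, although symmetry alone does not guarantee it.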
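As a rough illustration of the right-skew case, the sketch below computes a moment-based skewness coefficient by hand and shows the effect of a log transform. The `skewness` helper and the response-time values are assumptions made for this example, not part of any library or dataset.

```python
import math
import statistics

def skewness(values):
    """Moment-based (population) skewness: third central moment / variance**1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical response times in milliseconds (assumed values) with one slow request.
response_ms = [100, 120, 130, 150, 160, 180, 200, 250, 400, 2000]
log_ms = [math.log(v) for v in response_ms]

raw_skew = skewness(response_ms)
log_skew = skewness(log_ms)

print(f"mean={statistics.fmean(response_ms):.0f}, median={statistics.median(response_ms):.0f}")
print(f"skewness raw={raw_skew:.2f}, after log={log_skew:.2f}")
```

The mean sits well above the median, the skewness is positive, and the log transform pulls it closer to zero without removing it entirely. Library implementations such as pandas' `Series.skew()` apply a sample-size correction, so their values will differ somewhat from this simple formula.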
The Five-Number Summary and Box Plots
The five-number summary (minimum, Q1, median, Q3, maximum) gives a compact picture of a distribution's centre, spread, and extremes with just five numbers; a box plot is its graphical form.
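A minimal sketch of the summary and of the quantities a box plot draws, again on a hypothetical salary list (assumed values). The 1.5 × IQR fences follow the common Tukey convention; plotting libraries may use different defaults.

```python
import statistics

# Hypothetical salaries in dollars (assumed values for illustration).
salaries = [45_000, 48_000, 50_000, 52_000, 53_000,
            54_000, 55_000, 57_000, 58_000, 250_000]

# Quartiles by linear interpolation between sorted values.
q1, median, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
five_number = (min(salaries), q1, median, q3, max(salaries))

for label, value in zip(("min", "Q1", "median", "Q3", "max"), five_number):
    print(f"{label:>6}: {value:,.0f}")

# Box plot ingredients: fences at 1.5 * IQR beyond the quartiles.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences.
whisker_low = min(s for s in salaries if s >= lower_fence)
whisker_high = max(s for s in salaries if s <= upper_fence)
outliers = [s for s in salaries if s < lower_fence or s > upper_fence]

print(f"whiskers: {whisker_low:,} to {whisker_high:,}")
print(f"outliers: {outliers}")  # outliers: [250000]
```

Quartile values depend on the interpolation method, so other tools can report slightly different Q1 and Q3 for the same data.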
Covariance and Correlation
Covariance measures the direction in which two variables move together, but its magnitude depends on their units. Correlation divides the covariance by the product of the two standard deviations, normalising it to the range [−1, 1].
A correlation close to 1, such as 0.998, indicates a very strong positive linear relationship. Correlation does not imply causation: both variables could be driven by a third, confounding factor.
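The relationship between the two measures can be sketched in a few lines. The helper functions and the study-hours data below are assumptions written for this example; the same quantities are available from standard libraries.

```python
import math

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]
minutes = [h * 60 for h in hours]  # same information, different unit

cov_hours = covariance(hours, scores)
cov_minutes = covariance(minutes, scores)
r_hours = correlation(hours, scores)
r_minutes = correlation(minutes, scores)

print(f"covariance:  {cov_hours} (hours) vs {cov_minutes} (minutes)")
print(f"correlation: {r_hours:.3f} (hours) vs {r_minutes:.3f} (minutes)")
```

Changing the unit from hours to minutes multiplies the covariance by 60 but leaves the correlation untouched, which is why correlation is the measure to compare across variable pairs.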
Summary
- Use the median for skewed or outlier-prone data (income, prices, durations); use the mean for symmetric distributions.
- Standard deviation quantifies typical distance from the mean; IQR quantifies the spread of the middle 50% and is resistant to outliers.
- As a rule of thumb, skewness beyond ±1 (absolute value above 1) signals that a transform, such as a log transform for right-skewed positive data, may be needed before applying models that assume normality.
- The five-number summary (via .describe()) and box plots together give an efficient view of a variable's shape, centre, spread, and outliers.
- Correlation is the unit-free, bounded measure of linear association; always inspect scatter plots alongside correlation coefficients to avoid being misled.