Statistical Foundations Every Data Scientist Must Own
Why Basic Statistics Is Not Enough
Most introductions to data science teach mean, median, mode, and standard deviation, then move quickly to machine learning. The result is practitioners who build models on data they don't actually understand. You cannot diagnose why your model fails on a particular segment if you can't characterise that segment's distribution rigorously. You cannot evaluate whether an A/B test result is meaningful without understanding effect size and power. You cannot trust a confidence interval you cannot derive.
This lesson fills those gaps. The tools here are the ones you reach for when something goes wrong — when a metric moves unexpectedly, when a model underperforms on a subgroup, when a stakeholder asks "but is that difference real?" They are not glamorous, but they are load-bearing.
```python
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Synthetic SaaS customer dataset: monthly revenue per customer
n = 2000
# Mix of SMB and enterprise customers — heavy right skew
revenue_smb = rng.lognormal(mean=5.5, sigma=0.8, size=int(n * 0.85))
revenue_enterprise = rng.lognormal(mean=8.0, sigma=0.5, size=int(n * 0.15))
revenue = np.concatenate([revenue_smb, revenue_enterprise])

df = pd.DataFrame({
    "monthly_revenue": revenue,
    "segment": (["smb"] * int(n * 0.85)) + (["enterprise"] * int(n * 0.15)),
})
print(f"Shape: {df.shape}")
print(df["monthly_revenue"].describe())
```
Robust Descriptive Statistics
The Problem with Mean and Standard Deviation
Mean and standard deviation are optimal summaries for Gaussian data. Real business data is almost never Gaussian. It is skewed, heavy-tailed, and contaminated with outliers. A single enterprise contract can drag your mean revenue up by an order of magnitude while the median stays put. Standard deviation then misrepresents the spread experienced by 85% of your customers.
The trimmed mean discards the top and bottom alpha fraction of values before computing the average. It gives you a mean that is robust to outliers without throwing away all information about the tails.
```python
# 10% trimmed mean: discard bottom 10% and top 10%
trimmed_mean = stats.trim_mean(revenue, proportiontocut=0.10)
print(f"10% Trimmed mean: {trimmed_mean:.2f}")

# Compare across trim levels
for trim in [0.0, 0.05, 0.10, 0.20]:
    t = stats.trim_mean(revenue, proportiontocut=trim)
    print(f"  Trim {trim:.0%}: {t:.2f}")
```
Median Absolute Deviation (MAD)
Standard deviation squares deviations, making it sensitive to large outliers. The Median Absolute Deviation is the median of absolute deviations from the median — it has a breakdown point of 50%, meaning up to half your data can be replaced by arbitrarily large values before it breaks.
The ratio of the MAD-based robust scale estimate (MAD × 1.4826, which matches the standard deviation for Gaussian data) to the classical standard deviation tells you how much the standard deviation is inflated by outliers. A ratio well below 1 signals heavy tails or outliers dominating the spread.
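A minimal sketch of the comparison, regenerating the revenue mix from the setup above so it runs standalone (the names `robust_std` and `std_rev` are illustrative choices, not from a library):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Regenerate the SMB/enterprise revenue mix from the setup (self-contained)
revenue = np.concatenate([
    rng.lognormal(mean=5.5, sigma=0.8, size=1700),
    rng.lognormal(mean=8.0, sigma=0.5, size=300),
])

# Raw MAD: median of absolute deviations from the median
mad = stats.median_abs_deviation(revenue)
# scale="normal" multiplies by ~1.4826 so it estimates sigma for Gaussian data
robust_std = stats.median_abs_deviation(revenue, scale="normal")
std_rev = revenue.std()

print(f"MAD:                    {mad:.2f}")
print(f"Robust std (scaled):    {robust_std:.2f}")
print(f"Classical std:          {std_rev:.2f}")
print(f"Ratio robust/classical: {robust_std / std_rev:.2f}")  # well below 1 here
```

For this heavy-tailed mixture the ratio comes out far below 1: the enterprise tail dominates the classical standard deviation while barely moving the MAD.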
Quartile Coefficient of Dispersion
The coefficient of variation (std / mean) is useless when the mean is near zero or the data spans multiple orders of magnitude. The Quartile Coefficient of Dispersion (QCD) uses only quartiles and is therefore both robust and scale-invariant.
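A short sketch of the QCD alongside the classical coefficient of variation, again regenerating the revenue data so the block runs standalone:

```python
import numpy as np

rng = np.random.default_rng(42)
# Regenerate the SMB/enterprise revenue mix from the setup (self-contained)
revenue = np.concatenate([
    rng.lognormal(mean=5.5, sigma=0.8, size=1700),
    rng.lognormal(mean=8.0, sigma=0.5, size=300),
])

# QCD = (Q3 - Q1) / (Q3 + Q1): uses only quartiles, so it is robust
q1, q3 = np.percentile(revenue, [25, 75])
qcd = (q3 - q1) / (q3 + q1)

# Classical coefficient of variation for comparison
cv = revenue.std() / revenue.mean()
print(f"Q1={q1:.2f}, Q3={q3:.2f}")
print(f"QCD: {qcd:.3f}  (robust, scale-invariant)")
print(f"CV:  {cv:.3f}  (inflated by the enterprise tail)")
```

Multiplying every revenue figure by a constant leaves the QCD unchanged, which is what scale invariance means in practice: you can compare dispersion across currencies or units directly.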
The Central Limit Theorem — What It Really Guarantees
Statement and Conditions
The CLT states that the sampling distribution of the sample mean converges to a Normal distribution as sample size increases, regardless of the population distribution — provided the population has finite variance.
The practical implication: even when your raw data is wildly non-normal, the mean of a sufficiently large sample behaves normally. This is why nearly all classical inference procedures work.
```python
# Demonstrate CLT on our heavy-tailed revenue data
sample_sizes = [5, 30, 100, 500]
n_simulations = 5000

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, n_samples in zip(axes, sample_sizes):
    sample_means = [
        rng.choice(revenue, size=n_samples).mean()
        for _ in range(n_simulations)
    ]
    ax.hist(sample_means, bins=50, density=True, alpha=0.7, color="steelblue")
    # Overlay theoretical normal
    mu = np.mean(sample_means)
    sigma = np.std(sample_means)
    x = np.linspace(mu - 4*sigma, mu + 4*sigma, 200)
    ax.plot(x, stats.norm.pdf(x, mu, sigma), "r-", lw=2)
    ax.set_title(f"n = {n_samples}")
    ax.set_xlabel("Sample mean")
plt.suptitle("CLT: Sampling Distribution of the Mean (Revenue Data)")
plt.tight_layout()
plt.savefig("clt_demo.png", dpi=150)
plt.show()
```
Standard Error
The standard error (SE) of the mean is the standard deviation of the sampling distribution. It tells you how precisely your sample mean estimates the population mean.
```python
population_std = revenue.std()
for n_s in [10, 50, 100, 500, 1000]:
    se = population_std / np.sqrt(n_s)
    print(f"  n={n_s:>5}: SE = {se:.2f} (95% CI half-width ≈ {1.96*se:.2f})")
```
SE shrinks at rate 1/sqrt(n). To halve your uncertainty you need four times as many observations — a fact that has real budget implications.
Confidence Intervals
Z-Interval (Known Population Std, Large n)
```python
sample = rng.choice(revenue, size=200)
n_s = len(sample)
x_bar = sample.mean()
se = revenue.std() / np.sqrt(n_s)  # using population std (known)
z_crit = stats.norm.ppf(0.975)     # 1.96
ci_low = x_bar - z_crit * se
ci_high = x_bar + z_crit * se
print(f"95% Z-interval: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Population mean: {revenue.mean():.2f} (should be inside ~95% of the time)")
```
T-Interval (Unknown Population Std — the Common Case)
In practice you almost never know the population standard deviation. The t-distribution accounts for the additional uncertainty of estimating std from data.
```python
# scipy makes this trivial, but build it manually first
sample = rng.choice(revenue, size=40)  # small n where t vs z matters
n_s = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)  # unbiased estimator (Bessel's correction)
se_t = s / np.sqrt(n_s)
t_crit = stats.t.ppf(0.975, df=n_s - 1)
ci_low = x_bar - t_crit * se_t
ci_high = x_bar + t_crit * se_t
print(f"n={n_s}, t_crit={t_crit:.4f} (vs z_crit=1.9600)")
print(f"95% t-interval: ({ci_low:.2f}, {ci_high:.2f})")

# scipy one-liner
ci_scipy = stats.t.interval(0.95, df=n_s - 1, loc=x_bar, scale=se_t)
print(f"scipy version: ({ci_scipy[0]:.2f}, {ci_scipy[1]:.2f})")
```
As degrees of freedom increase, t converges to z. With n=40 the difference is already small. With n=10 it matters considerably.
Bootstrap Confidence Intervals
Bootstrap CIs make no distributional assumptions. They are the right tool when the CLT hasn't fully kicked in or when your statistic is complex (a ratio, a quantile, a model coefficient).
```python
def bootstrap_ci(data, statistic=np.mean, n_boot=5000, ci=0.95, rng=None):
    rng = rng or np.random.default_rng()
    boot_stats = np.array([
        statistic(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_boot)
    ])
    alpha = (1 - ci) / 2
    return np.percentile(boot_stats, [alpha * 100, (1 - alpha) * 100])

sample = rng.choice(revenue, size=200)
ci_boot = bootstrap_ci(sample, statistic=np.mean, rng=rng)
ci_boot_med = bootstrap_ci(sample, statistic=np.median, rng=rng)
print(f"Bootstrap 95% CI for mean:   ({ci_boot[0]:.2f}, {ci_boot[1]:.2f})")
print(f"Bootstrap 95% CI for median: ({ci_boot_med[0]:.2f}, {ci_boot_med[1]:.2f})")
```
Effect Size — Statistical vs Practical Significance
A p-value tells you the probability of observing a result at least as extreme as yours, assuming the null hypothesis is true. It does not tell you whether the difference is meaningful. With large enough samples, trivially small differences become highly significant. Effect size bridges this gap.
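The large-sample trap is easy to demonstrate with a small simulation (the numbers here are illustrative, not from the lesson's dataset): two groups whose true means differ by a practically negligible 0.02 standard deviations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# True difference of 0.02 standard deviations — practically negligible
for n in [1_000, 100_000, 1_000_000]:
    a = rng.normal(0.00, 1, size=n)
    b = rng.normal(0.02, 1, size=n)
    t, p = stats.ttest_ind(a, b)
    print(f"n={n:>9,}: t={t:.2f}, p={p:.2e}")
```

At n=1,000 the difference is invisible; at n=1,000,000 per group the p-value is vanishingly small even though nothing about the effect's practical importance has changed.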
Cohen's d
Cohen's d expresses the difference between two means in units of the pooled standard deviation. It is interpretable independent of sample size.
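A minimal implementation using the classical pooled-standard-deviation formula — the two groups here are hypothetical session times invented for illustration, not the lesson's data:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical session times (minutes) for two segments; true d = 2/4 = 0.5
group_a = rng.normal(12, 4, size=300)
group_b = rng.normal(14, 4, size=300)

def cohens_d(x, y):
    """Cohen's d: mean difference in units of the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (y.mean() - x.mean()) / np.sqrt(pooled_var)

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.3f}")  # conventional benchmarks: 0.2 small, 0.5 medium, 0.8 large
```

Because d is expressed in standard-deviation units, the value is directly comparable across metrics and studies, regardless of sample size.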
For correlation and linear relationships, Pearson r doubles as an effect size. Cohen's benchmarks: 0.1 small, 0.3 medium, 0.5 large.
```python
# Simulate revenue ~ tenure relationship
tenure_months = rng.integers(1, 61, size=500)
revenue_by_tenure = 200 + 15 * tenure_months + rng.normal(0, 80, size=500)

r, p = stats.pearsonr(tenure_months, revenue_by_tenure)
print(f"r = {r:.4f}, p = {p:.2e}")
print(f"r² = {r**2:.4f} — {r**2*100:.1f}% of revenue variance explained by tenure")
```
Statistical Power and Sample Size Planning
Type I and Type II Errors
A Type I error (false positive, alpha) occurs when you reject a true null hypothesis. A Type II error (false negative, beta) occurs when you fail to reject a false null. Power = 1 - beta = the probability of detecting an effect that truly exists.
```python
# Visualise the tradeoff
from scipy.stats import norm

alpha = 0.05
effect_size_d = 0.3  # small effect (Cohen's d)
n_per_group = 100
pooled_se = np.sqrt(2 / n_per_group)  # for equal groups, unit variance

# Null distribution: mean = 0
# Alternative distribution: mean = effect_size_d
z_crit = norm.ppf(1 - alpha / 2)  # two-tailed critical value
power = 1 - norm.cdf(z_crit - effect_size_d / pooled_se) + \
    norm.cdf(-z_crit - effect_size_d / pooled_se)
print(f"Detected power: {power:.3f}")
```
Power Analysis with scipy
```python
from scipy.stats import norm as _norm

def power_analysis(effect_d, n_per_group, alpha=0.05, two_tailed=True):
    """Power for two-sample t-test (normal approximation)."""
    z_alpha = _norm.ppf(1 - alpha / (2 if two_tailed else 1))
    se = np.sqrt(2 / n_per_group)
    ncp = effect_d / se  # non-centrality parameter
    power = 1 - _norm.cdf(z_alpha - ncp) + _norm.cdf(-z_alpha - ncp)
    return power

# What sample size do we need to detect a small effect (d=0.2) with 80% power?
for n in [50, 100, 200, 400, 800]:
    pw = power_analysis(effect_d=0.2, n_per_group=n)
    print(f"  n={n:>4}: power = {pw:.3f}")
```
```python
# Using statsmodels for a one-liner power analysis
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_required = analysis.solve_power(effect_size=0.2, power=0.80, alpha=0.05)
print(f"\nRequired n per group for d=0.2, 80% power, alpha=0.05: {n_required:.0f}")

n_required_d05 = analysis.solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(f"Required n per group for d=0.5, 80% power, alpha=0.05: {n_required_d05:.0f}")
```
The Multiple Testing Problem
Every additional hypothesis you test at alpha=0.05 inflates your false positive rate. If you test 20 hypotheses, you expect one false positive by chance even when all nulls are true.
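A quick sketch of the inflation and the two standard corrections, using `multipletests` from statsmodels (the simulated p-values come from deliberately true-null comparisons, so every rejection is a false positive):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)

# 20 true-null tests: both samples drawn from the SAME distribution
p_values = np.array([
    stats.ttest_ind(rng.normal(size=100), rng.normal(size=100)).pvalue
    for _ in range(20)
])
print(f"Raw rejections at alpha=0.05: {(p_values < 0.05).sum()}")
print(f"Family-wise error rate for 20 independent tests: {1 - 0.95**20:.2f}")

# Bonferroni: strict family-wise control (test each p against alpha/20)
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
# Benjamini-Hochberg: controls the false discovery rate instead
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"Bonferroni rejections:        {reject_bonf.sum()}")
print(f"Benjamini-Hochberg rejections: {reject_bh.sum()}")
```

With 20 tests the chance of at least one false positive is roughly 64% even though every null is true; both corrections pull the rejection count back toward zero, with Bonferroni being the stricter of the two.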
Putting It All Together: A Worked Example
Understanding sampling distributions lets you reason about whether your results could be due to chance. Here is a complete worked example: we have two customer segments and want to know whether their mean session times differ.
```python
# Generate realistic session-time data
smb_sessions = rng.gamma(shape=3, scale=4, size=300)  # mean ~12 min
ent_sessions = rng.gamma(shape=4, scale=5, size=100)  # mean ~20 min

# Step 1: Characterise distributions before testing
for label, data in [("SMB", smb_sessions), ("Enterprise", ent_sessions)]:
    print(f"\n{label}:")
    print(f"  n={len(data)}, mean={data.mean():.2f}, median={np.median(data):.2f}")
    print(f"  std={data.std():.2f}, skew={stats.skew(data):.2f}")
    print(f"  95% CI: {stats.t.interval(0.95, df=len(data)-1, loc=data.mean(), scale=stats.sem(data))}")

# Step 2: Test normality (only to check CLT adequacy for small n)
for label, data in [("SMB", smb_sessions), ("Enterprise", ent_sessions)]:
    stat, p = stats.shapiro(data[:50])  # Shapiro limited to n<5000
    print(f"{label} Shapiro-Wilk p={p:.4f} (reject normality if p<0.05)")

# Step 3: Two-sample t-test (robust to non-normality via CLT for these n)
t_stat, p_val = stats.ttest_ind(ent_sessions, smb_sessions, equal_var=False)  # Welch's
d = (ent_sessions.mean() - smb_sessions.mean()) / np.sqrt(
    (ent_sessions.std()**2 + smb_sessions.std()**2) / 2
)
print(f"\nWelch's t-test: t={t_stat:.4f}, p={p_val:.4e}")
print(f"Cohen's d: {d:.3f}")
```
Key Takeaways
Mean and standard deviation are unreliable summaries for skewed or heavy-tailed data; trimmed mean, MAD, and QCD are robust alternatives worth reaching for first
The Central Limit Theorem guarantees that sample means are approximately normally distributed for large n — this is why classical inference procedures work even on non-normal raw data
Confidence intervals quantify estimation uncertainty, not the probability that the true parameter is inside them (frequentist CIs are about procedure, not posterior probability)
Effect size (Cohen's d, Pearson r) is what tells you whether a statistically significant difference is practically meaningful — large samples make tiny effects significant
Statistical power is the probability of detecting a real effect; running an underpowered study is a waste of resources that produces unreliable results
The family-wise Type I error rate compounds across tests (1 − (1 − alpha)^m for m independent tests); apply Bonferroni for strict family-wise control or Benjamini-Hochberg when controlling the false discovery rate is an acceptable trade-off
Bootstrap confidence intervals are your fallback when distributional assumptions are uncertain and n is not large enough for the CLT to have fully converged
Standard error scales as 1/sqrt(n) — halving uncertainty requires quadrupling sample size, a fact with direct implications for experiment design budgets