Statistical Foundations Every Data Scientist Must Own
Why Basic Statistics Is Not Enough
Most introductions to data science teach mean, median, mode, and standard deviation, then move quickly to machine learning. The result is practitioners who build models on data they don't actually understand. You cannot diagnose why your model fails on a particular segment if you can't characterise that segment's distribution rigorously. You cannot evaluate whether an A/B test result is meaningful without understanding effect size and power. You cannot trust a confidence interval you cannot derive.
This lesson fills those gaps. The tools here are the ones you reach for when something goes wrong — when a metric moves unexpectedly, when a model underperforms on a subgroup, when a stakeholder asks "but is that difference real?" They are not glamorous, but they are load-bearing.
```python
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Synthetic SaaS customer dataset: monthly revenue per customer
n = 2000
# Mix of SMB and enterprise customers — heavy right skew
revenue_smb = rng.lognormal(mean=5.5, sigma=0.8, size=int(n * 0.85))
revenue_enterprise = rng.lognormal(mean=8.0, sigma=0.5, size=int(n * 0.15))
revenue = np.concatenate([revenue_smb, revenue_enterprise])

df = pd.DataFrame({
    "monthly_revenue": revenue,
    "segment": (["smb"] * int(n * 0.85)) + (["enterprise"] * int(n * 0.15)),
})
print(f"Shape: {df.shape}")
print(df["monthly_revenue"].describe())
```
Robust Descriptive Statistics
The Problem with Mean and Standard Deviation
Mean and standard deviation are optimal summaries for Gaussian data. Real business data is almost never Gaussian. It is skewed, heavy-tailed, and contaminated with outliers. A single enterprise contract can drag your mean revenue up by an order of magnitude while the median stays put. Standard deviation then misrepresents the spread experienced by 85% of your customers.
The trimmed mean discards the top and bottom alpha fraction of values before computing the average. It gives you a mean that is robust to outliers without throwing away all information about the tails.
```python
# 10% trimmed mean: discard bottom 10% and top 10%
trimmed_mean = stats.trim_mean(revenue, proportiontocut=0.10)
print(f"10% Trimmed mean: {trimmed_mean:.2f}")

# Compare across trim levels
for trim in [0.0, 0.05, 0.10, 0.20]:
    t = stats.trim_mean(revenue, proportiontocut=trim)
    print(f"  Trim {trim:.0%}: {t:.2f}")
```
Median Absolute Deviation (MAD)
Standard deviation squares deviations, making it sensitive to large outliers. The Median Absolute Deviation is the median of absolute deviations from the median — it has a breakdown point of 50%, meaning up to half your data can be replaced by arbitrarily large values before it breaks.
The ratio of the MAD-based robust scale estimate (MAD × 1.4826, which matches the standard deviation for Gaussian data) to the classical standard deviation tells you how much the standard deviation is inflated by outliers. A ratio well below 1 signals heavy tails or outliers dominating the spread.
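A minimal sketch of the comparison, regenerating the revenue mix from the setup above so it runs standalone (the names `robust_std` and `std_rev` are illustrative choices, not from a library):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Regenerate the SMB/enterprise revenue mix from the setup (self-contained)
revenue = np.concatenate([
    rng.lognormal(mean=5.5, sigma=0.8, size=1700),
    rng.lognormal(mean=8.0, sigma=0.5, size=300),
])

# Raw MAD: median of absolute deviations from the median
mad = stats.median_abs_deviation(revenue)
# scale="normal" multiplies by ~1.4826 so it estimates sigma for Gaussian data
robust_std = stats.median_abs_deviation(revenue, scale="normal")
std_rev = revenue.std()

print(f"MAD:                    {mad:.2f}")
print(f"Robust std (scaled):    {robust_std:.2f}")
print(f"Classical std:          {std_rev:.2f}")
print(f"Ratio robust/classical: {robust_std / std_rev:.2f}")  # well below 1 here
```

For this heavy-tailed mixture the ratio comes out far below 1: the enterprise tail dominates the classical standard deviation while barely moving the MAD.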
Quartile Coefficient of Dispersion
The coefficient of variation (std / mean) is useless when the mean is near zero or the data spans multiple orders of magnitude. The Quartile Coefficient of Dispersion (QCD) uses only quartiles and is therefore both robust and scale-invariant.
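A short sketch of the QCD alongside the classical coefficient of variation, again regenerating the revenue data so the block runs standalone:

```python
import numpy as np

rng = np.random.default_rng(42)
# Regenerate the SMB/enterprise revenue mix from the setup (self-contained)
revenue = np.concatenate([
    rng.lognormal(mean=5.5, sigma=0.8, size=1700),
    rng.lognormal(mean=8.0, sigma=0.5, size=300),
])

# QCD = (Q3 - Q1) / (Q3 + Q1): uses only quartiles, so it is robust
q1, q3 = np.percentile(revenue, [25, 75])
qcd = (q3 - q1) / (q3 + q1)

# Classical coefficient of variation for comparison
cv = revenue.std() / revenue.mean()
print(f"Q1={q1:.2f}, Q3={q3:.2f}")
print(f"QCD: {qcd:.3f}  (robust, scale-invariant)")
print(f"CV:  {cv:.3f}  (inflated by the enterprise tail)")
```

Multiplying every revenue figure by a constant leaves the QCD unchanged, which is what scale invariance means in practice: you can compare dispersion across currencies or units directly.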
The Central Limit Theorem — What It Really Guarantees
Statement and Conditions
The CLT states that the sampling distribution of the sample mean converges to a Normal distribution as sample size increases, regardless of the population distribution — provided the population has finite variance.
The practical implication: even when your raw data is wildly non-normal, the mean of a sufficiently large sample behaves normally. This is why nearly all classical inference procedures work.
```python
# Demonstrate CLT on our heavy-tailed revenue data
sample_sizes = [5, 30, 100, 500]
n_simulations = 5000

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, n_samples in zip(axes, sample_sizes):
    sample_means = [
        rng.choice(revenue, size=n_samples).mean()
        for _ in range(n_simulations)
    ]
    ax.hist(sample_means, bins=50, density=True, alpha=0.7, color="steelblue")
    # Overlay theoretical normal
    mu = np.mean(sample_means)
    sigma = np.std(sample_means)
    x = np.linspace(mu - 4*sigma, mu + 4*sigma, 200)
    ax.plot(x, stats.norm.pdf(x, mu, sigma), "r-", lw=2)
    ax.set_title(f"n = {n_samples}")
    ax.set_xlabel("Sample mean")
plt.suptitle("CLT: Sampling Distribution of the Mean (Revenue Data)")
plt.tight_layout()
plt.savefig("clt_demo.png", dpi=150)
plt.show()
```
Standard Error
The standard error (SE) of the mean is the standard deviation of the sampling distribution. It tells you how precisely your sample mean estimates the population mean.
```python
population_std = revenue.std()
for n_s in [10, 50, 100, 500, 1000]:
    se = population_std / np.sqrt(n_s)
    print(f"  n={n_s:>5}: SE = {se:.2f} (95% CI half-width ≈ {1.96*se:.2f})")
```
SE shrinks at rate 1/sqrt(n). To halve your uncertainty you need four times as many observations — a fact that has real budget implications.
Confidence Intervals
Z-Interval (Known Population Std, Large n)
```python
sample = rng.choice(revenue, size=200)
n_s = len(sample)
x_bar = sample.mean()
se = revenue.std() / np.sqrt(n_s)  # using population std (known)
z_crit = stats.norm.ppf(0.975)     # 1.96
ci_low = x_bar - z_crit * se
ci_high = x_bar + z_crit * se
print(f"95% Z-interval: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Population mean: {revenue.mean():.2f} (should be inside ~95% of the time)")
```
T-Interval (Unknown Population Std — the Common Case)
In practice you almost never know the population standard deviation. The t-distribution accounts for the additional uncertainty of estimating std from data.
```python
# scipy makes this trivial, but build it manually first
sample = rng.choice(revenue, size=40)  # small n where t vs z matters
n_s = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)  # unbiased estimator (Bessel's correction)
se_t = s / np.sqrt(n_s)
t_crit = stats.t.ppf(0.975, df=n_s - 1)
ci_low = x_bar - t_crit * se_t
ci_high = x_bar + t_crit * se_t
print(f"n={n_s}, t_crit={t_crit:.4f} (vs z_crit=1.9600)")
print(f"95% t-interval: ({ci_low:.2f}, {ci_high:.2f})")

# scipy one-liner
ci_scipy = stats.t.interval(0.95, df=n_s - 1, loc=x_bar, scale=se_t)
print(f"scipy version: ({ci_scipy[0]:.2f}, {ci_scipy[1]:.2f})")
```
As degrees of freedom increase, t converges to z. With n=40 the difference is already small. With n=10 it matters considerably.
Bootstrap Confidence Intervals
Bootstrap CIs make no distributional assumptions. They are the right tool when the CLT hasn't fully kicked in or when your statistic is complex (a ratio, a quantile, a model coefficient).
```python
def bootstrap_ci(data, statistic=np.mean, n_boot=5000, ci=0.95, rng=None):
    rng = rng or np.random.default_rng()
    boot_stats = np.array([
        statistic(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_boot)
    ])
    alpha = (1 - ci) / 2
    return np.percentile(boot_stats, [alpha * 100, (1 - alpha) * 100])

sample = rng.choice(revenue, size=200)
ci_boot = bootstrap_ci(sample, statistic=np.mean, rng=rng)
ci_boot_med = bootstrap_ci(sample, statistic=np.median, rng=rng)
print(f"Bootstrap 95% CI for mean:   ({ci_boot[0]:.2f}, {ci_boot[1]:.2f})")
print(f"Bootstrap 95% CI for median: ({ci_boot_med[0]:.2f}, {ci_boot_med[1]:.2f})")
```
Effect Size — Statistical vs Practical Significance
A p-value tells you the probability of observing a result at least as extreme as yours, assuming the null hypothesis is true. It does not tell you whether the difference is meaningful. With large enough samples, trivially small differences become highly significant. Effect size bridges this gap.
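The large-sample trap is easy to demonstrate with a small simulation (the numbers here are illustrative, not from the lesson's dataset): two groups whose true means differ by a practically negligible 0.02 standard deviations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# True difference of 0.02 standard deviations — practically negligible
for n in [1_000, 100_000, 1_000_000]:
    a = rng.normal(0.00, 1, size=n)
    b = rng.normal(0.02, 1, size=n)
    t, p = stats.ttest_ind(a, b)
    print(f"n={n:>9,}: t={t:.2f}, p={p:.2e}")
```

At n=1,000 the difference is invisible; at n=1,000,000 per group the p-value is vanishingly small even though nothing about the effect's practical importance has changed.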
Cohen's d
Cohen's d expresses the difference between two means in units of the pooled standard deviation. It is interpretable independent of sample size.
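A minimal implementation using the classical pooled-standard-deviation formula — the two groups here are hypothetical session times invented for illustration, not the lesson's data:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical session times (minutes) for two segments; true d = 2/4 = 0.5
group_a = rng.normal(12, 4, size=300)
group_b = rng.normal(14, 4, size=300)

def cohens_d(x, y):
    """Cohen's d: mean difference in units of the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (y.mean() - x.mean()) / np.sqrt(pooled_var)

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.3f}")  # conventional benchmarks: 0.2 small, 0.5 medium, 0.8 large
```

Because d is expressed in standard-deviation units, the value is directly comparable across metrics and studies, regardless of sample size.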
For correlation and linear relationships, Pearson r doubles as an effect size. Cohen's benchmarks: 0.1 small, 0.3 medium, 0.5 large.
```python
# Simulate revenue ~ tenure relationship
tenure_months = rng.integers(1, 61, size=500)
revenue_by_tenure = 200 + 15 * tenure_months + rng.normal(0, 80, size=500)

r, p = stats.pearsonr(tenure_months, revenue_by_tenure)
print(f"r = {r:.4f}, p = {p:.2e}")
print(f"r² = {r**2:.4f} — {r**2*100:.1f}% of revenue variance explained by tenure")
```
Statistical Power and Sample Size Planning
Type I and Type II Errors
A Type I error (false positive, alpha) occurs when you reject a true null hypothesis. A Type II error (false negative, beta) occurs when you fail to reject a false null. Power = 1 - beta = the probability of detecting an effect that truly exists.
```python
# Visualise the tradeoff
from scipy.stats import norm

alpha = 0.05
effect_size_d = 0.3  # small effect (Cohen's d)
n_per_group = 100
pooled_se = np.sqrt(2 / n_per_group)  # for equal groups, unit variance

# Null distribution: mean = 0
# Alternative distribution: mean = effect_size_d
z_crit = norm.ppf(1 - alpha / 2)  # two-tailed critical value
power = 1 - norm.cdf(z_crit - effect_size_d / pooled_se) + \
    norm.cdf(-z_crit - effect_size_d / pooled_se)
print(f"Detected power: {power:.3f}")
```
Power Analysis with scipy
```python
from scipy.stats import norm as _norm

def power_analysis(effect_d, n_per_group, alpha=0.05, two_tailed=True):
    """Power for two-sample t-test (normal approximation)."""
    z_alpha = _norm.ppf(1 - alpha / (2 if two_tailed else 1))
    se = np.sqrt(2 / n_per_group)
    ncp = effect_d / se  # non-centrality parameter
    power = 1 - _norm.cdf(z_alpha - ncp) + _norm.cdf(-z_alpha - ncp)
    return power

# What sample size do we need to detect a small effect (d=0.2) with 80% power?
for n in [50, 100, 200, 400, 800]:
    pw = power_analysis(effect_d=0.2, n_per_group=n)
    print(f"  n={n:>4}: power = {pw:.3f}")
```
```python
# Using statsmodels for a one-liner power analysis
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_required = analysis.solve_power(effect_size=0.2, power=0.80, alpha=0.05)
print(f"\nRequired n per group for d=0.2, 80% power, alpha=0.05: {n_required:.0f}")

n_required_d05 = analysis.solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(f"Required n per group for d=0.5, 80% power, alpha=0.05: {n_required_d05:.0f}")
```
The Multiple Testing Problem
Every additional hypothesis you test at alpha=0.05 inflates your false positive rate. If you test 20 hypotheses, you expect one false positive by chance even when all nulls are true.
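A quick sketch of the inflation and the two standard corrections, using `multipletests` from statsmodels (the simulated p-values come from deliberately true-null comparisons, so every rejection is a false positive):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)

# 20 true-null tests: both samples drawn from the SAME distribution
p_values = np.array([
    stats.ttest_ind(rng.normal(size=100), rng.normal(size=100)).pvalue
    for _ in range(20)
])
print(f"Raw rejections at alpha=0.05: {(p_values < 0.05).sum()}")
print(f"Family-wise error rate for 20 independent tests: {1 - 0.95**20:.2f}")

# Bonferroni: strict family-wise control (test each p against alpha/20)
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
# Benjamini-Hochberg: controls the false discovery rate instead
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"Bonferroni rejections:        {reject_bonf.sum()}")
print(f"Benjamini-Hochberg rejections: {reject_bh.sum()}")
```

With 20 tests the chance of at least one false positive is roughly 64% even though every null is true; both corrections pull the rejection count back toward zero, with Bonferroni being the stricter of the two.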
Putting It All Together: A Worked Example
Understanding sampling distributions lets you reason about whether your results could be due to chance. Here is a complete worked example: we have two customer segments and want to know whether their mean session times differ.
```python
# Generate realistic session-time data
smb_sessions = rng.gamma(shape=3, scale=4, size=300)  # mean ~12 min
ent_sessions = rng.gamma(shape=4, scale=5, size=100)  # mean ~20 min

# Step 1: Characterise distributions before testing
for label, data in [("SMB", smb_sessions), ("Enterprise", ent_sessions)]:
    print(f"\n{label}:")
    print(f"  n={len(data)}, mean={data.mean():.2f}, median={np.median(data):.2f}")
    print(f"  std={data.std():.2f}, skew={stats.skew(data):.2f}")
    print(f"  95% CI: {stats.t.interval(0.95, df=len(data)-1, loc=data.mean(), scale=stats.sem(data))}")

# Step 2: Test normality (only to check CLT adequacy for small n)
for label, data in [("SMB", smb_sessions), ("Enterprise", ent_sessions)]:
    stat, p = stats.shapiro(data[:50])  # Shapiro limited to n<5000
    print(f"{label} Shapiro-Wilk p={p:.4f} (reject normality if p<0.05)")

# Step 3: Two-sample t-test (robust to non-normality via CLT for these n)
t_stat, p_val = stats.ttest_ind(ent_sessions, smb_sessions, equal_var=False)  # Welch's
d = (ent_sessions.mean() - smb_sessions.mean()) / np.sqrt(
    (ent_sessions.std()**2 + smb_sessions.std()**2) / 2
)
print(f"\nWelch's t-test: t={t_stat:.4f}, p={p_val:.4e}")
print(f"Cohen's d: {d:.3f}")
```
Key Takeaways
Mean and standard deviation are unreliable summaries for skewed or heavy-tailed data; trimmed mean, MAD, and QCD are robust alternatives worth reaching for first
The Central Limit Theorem guarantees that sample means are approximately normally distributed for large n — this is why classical inference procedures work even on non-normal raw data
Confidence intervals quantify estimation uncertainty, not the probability that the true parameter is inside them (frequentist CIs are about procedure, not posterior probability)
Effect size (Cohen's d, Pearson r) is what tells you whether a statistically significant difference is practically meaningful — large samples make tiny effects significant
Statistical power is the probability of detecting a real effect; running an underpowered study is a waste of resources that produces unreliable results
The family-wise Type I error rate compounds across tests (1 − (1 − alpha)^m for m independent tests); apply Bonferroni for strict family-wise control or Benjamini-Hochberg when controlling the false discovery rate is an acceptable trade-off
Bootstrap confidence intervals are your fallback when distributional assumptions are uncertain and n is not large enough for the CLT to have fully converged
Standard error scales as 1/sqrt(n) — halving uncertainty requires quadrupling sample size, a fact with direct implications for experiment design budgets