Regression is where theory meets measurement. Every time you estimate an effect, predict a continuous outcome, or test whether one variable explains variance in another, you are doing regression. This lesson goes beyond fitting a line — it covers the assumptions that must hold for OLS estimates to be meaningful, the diagnostics that detect when they are violated, and the regularisation techniques that keep models honest when those assumptions break down.
The OLS Objective
Ordinary Least Squares regression finds the coefficient vector beta that minimises the sum of squared residuals (SSR):
SSR = sum over i of (y_i - y_hat_i)^2 = (y - Xbeta)^T * (y - Xbeta)
The closed-form solution — the normal equations — is:
beta = (X^T X)^(-1) X^T y
This solution is exact, requiring no iterations. It exists when X^T X is invertible (full column rank — no perfect multicollinearity). For n observations and p features, this involves a p x p matrix inversion that costs O(p^3), which becomes expensive for wide datasets (p > 10,000).
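The normal equations are easy to verify numerically against sklearn. A minimal sketch on synthetic data (all values here are made up for illustration), solving the system directly rather than forming the inverse:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known coefficient vector
rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 3))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations, solved as a linear system (more stable than inverting X^T X)
Xd = np.column_stack([np.ones(n), X])  # prepend an intercept column
beta_hat = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

# sklearn's fit should agree to numerical precision
lr = LinearRegression().fit(X, y)
print(beta_hat[1:], lr.coef_)
```

In practice you solve the linear system (or use QR/SVD, as sklearn does) instead of computing `(X^T X)^(-1)` explicitly, which is slower and less numerically stable.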
OLS estimates are BLUE (Best Linear Unbiased Estimators) only when the Gauss-Markov assumptions hold. In practice these are taught as the LINE checklist (strictly, normality is needed for valid small-sample inference, not for the BLUE property itself):
Linearity: the relationship between X and y is linear
Independence: residuals are independent of each other (no autocorrelation)
Normality: residuals are normally distributed (for valid inference)
Equal variance (Homoscedasticity): residual variance is constant across fitted values
Violating these does not necessarily make the point estimates wrong, but it invalidates the standard errors, p-values, and confidence intervals — which is what you use to make decisions.
Diagnostic Plots
Residuals vs Fitted: should show a random horizontal band around zero. A curved pattern indicates non-linearity (add polynomial terms). A funnel shape (variance grows with fitted values) indicates heteroscedasticity (consider log-transforming y or using weighted least squares).
QQ Plot of Residuals: should show points on the diagonal line. Heavy tails (S-shape) suggest the normality assumption is violated — bootstrapped confidence intervals may be more appropriate than t-based ones.
Scale-Location: similar to Residuals vs Fitted but uses sqrt of absolute standardised residuals. A positive slope confirms heteroscedasticity.
Residuals vs Leverage: points in the upper or lower right (high leverage, large residual) are influential — they are pulling the regression line toward them. Cook's distance contours (the dotted lines) mark combinations that have outsized influence on the fit. Points outside the contour warrant investigation.
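The quantities behind the leverage plot can be computed directly. A sketch on synthetic data, using the hat-matrix diagonal for leverage and the standard Cook's distance formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Leverage = diagonal of the hat matrix H = X (X^T X)^(-1) X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
p = X.shape[1]
sigma2 = resid @ resid / (n - p)  # residual variance estimate

# Cook's distance: large when a point has BOTH a big residual and high leverage
cooks_d = (resid**2 / (p * sigma2)) * (leverage / (1 - leverage) ** 2)
print("most influential point:", np.argmax(cooks_d))
```

A useful sanity check: the leverages always sum to the number of parameters (the trace of the hat matrix equals its rank).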
Multicollinearity and VIF
When predictor variables are strongly correlated, the normal equations become numerically unstable — small changes in data produce large swings in coefficients. The Variance Inflation Factor (VIF) quantifies how much the variance of a coefficient is inflated due to collinearity.
VIF_j = 1 / (1 - R²_j)
where R²_j is the R² from regressing feature j on all other features. VIF = 1 means no collinearity; VIF > 5 is concerning; VIF > 10 indicates severe collinearity requiring action.
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def compute_vif(X_df):
    """Compute VIF for each feature in a DataFrame."""
    vif_data = {}
    cols = X_df.columns.tolist()
    for col in cols:
        # Regress feature j on all other features; VIF_j = 1 / (1 - R^2_j)
        y_col = X_df[col].values
        X_rest = X_df[[c for c in cols if c != col]].values
        r2 = LinearRegression().fit(X_rest, y_col).score(X_rest, y_col)
        vif_data[col] = 1 / (1 - r2) if r2 < 1 else np.inf
    return pd.Series(vif_data, name='VIF').round(2)

# df, rng and n are assumed to be defined earlier in the lesson
X_df = df[['sqft', 'bedrooms', 'age', 'dist_center']]
print("VIF values:")
print(compute_vif(X_df))

# Demonstrate severe multicollinearity
X_collinear = X_df.copy()
X_collinear['sqft_nearly_same'] = X_df['sqft'] + rng.normal(0, 10, n)  # almost identical to sqft
print("\nVIF with near-duplicate feature:")
print(compute_vif(X_collinear))

# Remedies: drop one collinear feature, use PCA, or apply Ridge (which handles collinearity)
```
Polynomial Features and the Overfitting Trap
Polynomial regression fits curves by adding polynomial transformations of X as new features. PolynomialFeatures(degree=2) with one input feature x produces [1, x, x²]; with two features [x1, x2] it produces [1, x1, x2, x1², x1*x2, x2²].
The degree-1 fit misses the curvature. The degree-3 fit captures the signal. The degree-15 fit memorises the training data, oscillating wildly in regions with no training points — it will fail badly on new data.
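One way to see the trap numerically (synthetic sine data, illustrative only): train error keeps falling as the degree grows, while test error does not.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 16).reshape(-1, 1)
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(scale=0.3, size=16)
x_test = np.linspace(0.01, 0.99, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel() + rng.normal(scale=0.3, size=100)

mse_train, mse_test = {}, {}
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_train, y_train)
    mse_train[degree] = mean_squared_error(y_train, model.predict(x_train))
    mse_test[degree] = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:2d}: train MSE {mse_train[degree]:.4f}, test MSE {mse_test[degree]:.4f}")
```

With 16 training points, the degree-15 polynomial can fit the training data almost exactly, so its training MSE is near zero while its test MSE stays well above it.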
Ridge Regression (L2 Regularisation)
Ridge adds a penalty on the squared magnitude of coefficients to the OLS objective:
Loss = SSR + alpha * sum over j of beta_j^2
This shrinks all coefficients toward zero but never exactly to zero. The effect is largest on coefficients associated with redundant or weakly-predictive features. Crucially, Ridge makes X^T X + alpha * I always invertible, solving the multicollinearity problem.
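A sketch of the closed-form Ridge solution on deliberately collinear synthetic data (the alpha value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a duplicate of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

# OLS: near-singular X^T X makes individual coefficients unstable
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge: X^T X + alpha * I is always invertible
alpha = 1.0
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)
print("OLS:  ", beta_ols)
print("Ridge:", beta_ridge)
```

Ridge splits the shared effect roughly evenly across the two near-duplicate columns, so the coefficients stay bounded and their sum stays close to the true total effect of 3.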
Lasso Regression (L1 Regularisation)
Lasso adds a penalty on the absolute magnitude of coefficients:
Loss = SSR + alpha * sum over j of |beta_j|
The L1 penalty has a geometric property that drives some coefficients to exactly zero, performing automatic feature selection. This makes Lasso invaluable when you have many features and expect only a subset to matter.
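A sketch on synthetic data where only 3 of 20 features carry signal (the alpha value is chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]  # only the first 3 of 20 features matter
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0))
print(n_zero, "of", p, "coefficients driven to exactly zero")
```

In practice, select alpha by cross-validation (e.g. `LassoCV`) rather than fixing it by hand.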
ElasticNet
ElasticNet combines both penalties:
Loss = SSR + alpha * (r * sum over j of |beta_j| + (1 - r) * sum over j of beta_j^2)
where r controls the mixing ratio (l1_ratio in sklearn). l1_ratio=1 is pure Lasso; l1_ratio=0 is pure Ridge. ElasticNet is preferred over pure Lasso when features are correlated (Lasso arbitrarily picks one; ElasticNet tends to group them).
sklearn vs statsmodels
sklearn optimises for prediction; statsmodels optimises for statistical inference. When you need p-values, standard errors, confidence intervals, and F-statistics, use statsmodels.
Logistic Regression
Logistic regression models the log-odds of a binary outcome as a linear function of the features: log(p / (1 - p)) = Xbeta. The loss function is log-loss (cross-entropy), not SSR. There is no closed-form solution; logistic regression is solved iteratively (IRLS or gradient descent).
A coefficient of beta_j = 0.4 means that a 1-unit increase in x_j multiplies the odds of the positive class by exp(0.4) = 1.49 — a 49% increase in odds, holding all other features constant. Note that this is the odds ratio, not the probability ratio. With standardised features, coefficients are comparable in magnitude.
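The arithmetic, spelled out (the baseline odds of 1.0 here is an arbitrary illustration):

```python
import numpy as np

beta_j = 0.4
odds_ratio = np.exp(beta_j)  # multiplicative change in odds per 1-unit increase in x_j
print(round(odds_ratio, 2))  # → 1.49

# The effect on probability depends on the baseline.
# Starting from odds of 1.0 (i.e. p = 0.5):
def prob_from_odds(o):
    return o / (1 + o)

p0 = prob_from_odds(1.0)
p1 = prob_from_odds(1.0 * odds_ratio)
print(round(p0, 3), "->", round(p1, 3))  # 0.5 -> 0.599
```

A 49% increase in odds moves the probability from 0.500 to only 0.599 at this baseline, which is why odds ratios and probability ratios must not be conflated.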
The regularisation parameter C = 1/alpha. Large C means less regularisation (closer to MLE); small C means more regularisation (stronger shrinkage). Use cross-validation (LogisticRegressionCV) to select C.
Key Takeaways
OLS minimises SSR; the normal equations beta = (X^T X)^(-1) X^T y give the closed-form solution, but require X^T X to be invertible and cost O(p^3) — gradient descent scales better for large p.
The LINE assumptions (Linearity, Independence, Normality of residuals, Equal variance) must hold for OLS standard errors and p-values to be valid; residuals-vs-fitted and QQ plots are the primary diagnostics.
VIF above 5 signals collinearity that inflates coefficient variance; above 10 it is severe. Remedies include dropping one correlated feature, PCA, or switching to Ridge regression.
Ridge (L2) shrinks all coefficients toward zero and always produces a unique solution, making it the right choice when multicollinearity is present or when you need all features.
Lasso (L1) drives some coefficients to exactly zero, performing automatic feature selection; it is preferred when you believe only a sparse subset of features matter.
ElasticNet combines L1 and L2 penalties and is preferred over pure Lasso when correlated features should be selected together rather than arbitrarily.
Logistic regression models log-odds as a linear function; coefficients represent log-odds changes and exponentiate to odds ratios; the loss is cross-entropy, not SSR.
Use statsmodels for p-values, confidence intervals, and model fit statistics; use sklearn for pipelines, cross-validation, and production prediction.