GadaaLabs
Data Science Fundamentals
Lesson 5

Linear Regression

13 min

Linear regression is both the simplest and one of the most interpretable supervised learning models. Despite its simplicity, it remains the backbone of much quantitative analysis in finance, economics, and social science — and a conceptually essential stepping stone to understanding more complex models.

Ordinary Least Squares (OLS)

Linear regression models the relationship between a dependent variable y and one or more independent variables X as:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε

where β₀ is the intercept, β₁…βₙ are coefficients, and ε is the residual error.

OLS minimises the sum of squared residuals: it finds the β values that make the total squared vertical distance between each data point and the regression line as small as possible.

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler

np.random.seed(42)

# Simulate: house price = 50k + 150 * sqft + 30k * bedrooms + noise
n = 300
sqft     = np.random.uniform(500, 3000, n)
bedrooms = np.random.randint(1, 6, n)
price    = 50000 + 150 * sqft + 30000 * bedrooms + np.random.normal(0, 25000, n)

df = pd.DataFrame({"sqft": sqft, "bedrooms": bedrooms, "price": price})

X = df[["sqft", "bedrooms"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print(f"Intercept:            ${model.intercept_:,.0f}")
print(f"Coefficient (sqft):   ${model.coef_[0]:,.2f}")
print(f"Coefficient (beds):   ${model.coef_[1]:,.0f}")

y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)
print(f"\nRMSE: ${rmse:,.0f}")
print(f"R²:   {r2:.4f}")
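Under the hood, OLS has a closed-form solution: β = (XᵀX)⁻¹Xᵀy, the normal equations. A minimal standalone sketch (with simulated data; `np.linalg.lstsq` solves the same problem more stably than inverting XᵀX directly):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 100)   # true intercept 2, slope 3

# Design matrix: a column of ones carries the intercept term
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X @ beta = y (equivalent to the normal equations)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [2, 3]
```

This is exactly what `LinearRegression.fit` computes, so the recovered coefficients match sklearn's to numerical precision.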

Interpreting Coefficients

Each coefficient represents the expected change in y for a one-unit increase in that feature, holding all other features constant.

From the example above:

  • Intercept (≈$50,000): The predicted price of a house with 0 sqft and 0 bedrooms; an extrapolation far outside the observed data with no practical meaning on its own.
  • sqft coefficient ($150): Each additional square foot of space is associated with a $150 increase in price.
  • bedrooms coefficient ($30,000): Each additional bedroom is associated with a $30,000 increase, holding sqft constant.

R² (coefficient of determination) measures the proportion of variance in y explained by the model. R² = 0.90 means the model explains 90% of the variation in house prices.
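The definition unpacks as R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total variation around the mean. A minimal sketch with toy numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat  = np.array([2.8, 5.1, 7.3, 8.8])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
r2 = 1 - ss_res / ss_tot
print(f"R² = {r2:.4f}")  # 0.9910
```

A model that always predicts the mean gets R² = 0; a model can score below 0 on test data if it predicts worse than the mean.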

python
# Residual analysis — check model assumptions
residuals = y_test - y_pred

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Residuals vs fitted values — should be randomly scattered around 0
axes[0].scatter(y_pred, residuals, alpha=0.5, color="#2196F3")
axes[0].axhline(0, color="red", linestyle="--")
axes[0].set_xlabel("Predicted Price"); axes[0].set_ylabel("Residuals")
axes[0].set_title("Residuals vs Fitted")

# Q-Q plot — check normality of residuals
from scipy import stats as scipy_stats
scipy_stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title("Q-Q Plot of Residuals")
plt.tight_layout(); plt.show()

Patterns in residual plots indicate problems:

| Residual Pattern | Implication |
|---|---|
| Random scatter around 0 | Assumptions satisfied |
| Funnel shape (heteroscedasticity) | Variance increases with predicted value; try log(y) |
| Curved pattern | A non-linear term is needed |
| Systematic clusters | A missing categorical variable |
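The log(y) fix for funnel-shaped residuals works because multiplicative noise on the original scale becomes additive, constant-variance noise on the log scale. A minimal sketch with simulated multiplicative noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = rng.uniform(1, 10, 200).reshape(-1, 1)
# Multiplicative noise: spread of y grows with x (funnel-shaped residuals)
y = 100 * np.exp(0.3 * x.ravel()) * rng.lognormal(0, 0.2, 200)

# Fitting log(y) makes the noise additive with constant variance
model = LinearRegression().fit(x, np.log(y))
print(model.coef_[0])  # slope on the log scale, approximately 0.3
```

One caveat: coefficients then describe multiplicative effects, so a slope of 0.3 means each unit of x multiplies y by roughly e^0.3 ≈ 1.35.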

Ridge and Lasso Regularisation

When a model has many features (or features that are correlated), OLS can overfit — it fits noise in the training data and performs poorly on new data. Regularisation adds a penalty term to the loss function that discourages large coefficient values.

Ridge (L2 regularisation) adds the sum of squared coefficients: Loss = RSS + α × Σβᵢ²

python
# Scale features before regularisation — essential because penalty is scale-sensitive
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# Ridge — shrinks all coefficients toward zero but rarely to exactly zero
ridge = Ridge(alpha=10.0)   # alpha is the regularisation strength
ridge.fit(X_train_s, y_train)
print(f"Ridge R²: {ridge.score(X_test_s, y_test):.4f}")
print(f"Ridge coefficients: {ridge.coef_}")

Lasso (L1 regularisation) adds the sum of absolute coefficient values: Loss = RSS + α × Σ|βᵢ|

Lasso's key property is sparsity: it can drive coefficients exactly to zero, performing automatic feature selection.

python
# Lasso — drives some coefficients exactly to zero (feature selection)
lasso = Lasso(alpha=500, max_iter=10000)
lasso.fit(X_train_s, y_train)
print(f"Lasso R²: {lasso.score(X_test_s, y_test):.4f}")
print(f"Non-zero coefficients: {(lasso.coef_ != 0).sum()}")

| Method | Penalty | Effect on coefficients | Use when |
|---|---|---|---|
| OLS | None | Unpenalised | Few features, n >> p |
| Ridge | L2 (sum of squares) | Shrinks toward zero, never exactly | Many correlated features |
| Lasso | L1 (sum of absolutes) | Drives some to exactly zero | Feature selection needed |
| ElasticNet | L1 + L2 | Combination of both | Many features with correlated groups |
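ElasticNet, listed in the table, blends both penalties via an `l1_ratio` parameter (1.0 is pure Lasso, 0.0 is pure Ridge). A sketch on simulated data with two nearly duplicate features, where pure Lasso would tend to arbitrarily keep one and drop the other while ElasticNet spreads weight across both:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + rng.normal(0, 0.01, 200)   # feature 1 nearly duplicates feature 0
y = 3 * X[:, 0] + rng.normal(0, 1, 200)        # only the correlated pair matters

X_s = StandardScaler().fit_transform(X)
# l1_ratio=0.5 mixes the Lasso and Ridge penalties equally
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_s, y)
print(np.round(enet.coef_, 2))  # weight shared between features 0 and 1
```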

Tuning the Regularisation Parameter

python
from sklearn.linear_model import RidgeCV

# RidgeCV performs k-fold cross-validation to find the best alpha automatically
ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0, 1000.0], cv=5)
ridge_cv.fit(X_train_s, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")
print(f"CV R²:      {ridge_cv.score(X_test_s, y_test):.4f}")
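The same cross-validated tuning exists for Lasso: `LassoCV` fits along an automatically generated path of alpha values and picks the one with the best fold-averaged error. A standalone sketch on simulated data with known irrelevant features:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, 300)  # features 2-4 are noise

X_s = StandardScaler().fit_transform(X)
# 5-fold CV over an automatically chosen alpha grid
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_s, y)
print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"Non-zero coefficients: {(lasso_cv.coef_ != 0).sum()}")
```

Note that CV picks the alpha that predicts best, which is often small enough to leave tiny non-zero weights on noise features; for stricter selection, a slightly larger alpha can be chosen manually.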

Summary

  • OLS finds the coefficient values that minimise the sum of squared residuals — the ordinary least squares criterion.
  • Each coefficient represents the marginal effect of one feature on the target, holding all others constant; always inspect residuals to verify assumptions.
  • R² measures the fraction of variance explained; RMSE measures prediction error in the original units.
  • Ridge (L2) regularisation shrinks coefficients to reduce overfitting; Lasso (L1) additionally performs feature selection by zeroing out weak predictors.
  • Always scale features before applying regularised models — the penalty is applied uniformly and is sensitive to feature magnitude.