GadaaLabs
Data Science Fundamentals — From Theory to Production Models
Lesson 9

Feature Engineering & Selection for ML

24 min

Raw data almost never arrives in a form that maximises model performance. A tenure column measured in days has a long right tail — log-transforming it makes gradient descent converge faster and linear models fit better. A timestamp is opaque to a tree model until you decompose it into day-of-week and hour. A categorical "plan tier" column must be encoded before any algorithm sees it. Feature engineering is the craft of converting domain knowledge and statistical intuition into representations that models can exploit. Feature selection then prunes away the noise so the signal remains.

Numeric Transformations

Right-skewed numeric distributions compress informative variation at the high end. Transforming them before modelling — especially for linear models and distance-based algorithms — can significantly improve performance.

python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import PowerTransformer
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Simulate right-skewed revenue data
revenue = rng.exponential(scale=200, size=2000) + 1  # always positive

# log1p: safe for zero values (log(1+x)), the most common choice
log1p_rev = np.log1p(revenue)

# sqrt: milder than log, tolerates zero, good for count data
sqrt_rev = np.sqrt(revenue)

# Box-Cox: finds the optimal lambda; requires strictly positive values
bc_rev, lambda_bc = stats.boxcox(revenue)
print(f"Box-Cox lambda: {lambda_bc:.4f}")  # near 0 = log; near 0.5 = sqrt

# Yeo-Johnson: same idea but handles zero and negative values
pt = PowerTransformer(method="yeo-johnson", standardize=True)
yj_rev = pt.fit_transform(revenue.reshape(-1, 1)).ravel()
print(f"Yeo-Johnson lambda: {pt.lambdas_[0]:.4f}")

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, data, label in zip(
    axes,
    [revenue, log1p_rev, sqrt_rev, yj_rev],
    ["Raw", "log1p", "sqrt", "Yeo-Johnson"],
):
    ax.hist(data, bins=50, color="#7c3aed", alpha=0.75)
    skew = stats.skew(data)
    ax.set_title(f"{label}\nskewness={skew:.2f}")
plt.tight_layout()
plt.show()

Rule of thumb: use log1p for monetary amounts and counts. Prefer PowerTransformer(method='yeo-johnson') inside a sklearn Pipeline because it handles zeros and negatives and is invertible.
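The invertibility is worth a quick check: fit_transform followed by inverse_transform should round-trip the original values. A minimal sketch (variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.exponential(scale=200, size=100).reshape(-1, 1) + 1

pt = PowerTransformer(method="yeo-johnson", standardize=True)
x_t = pt.fit_transform(x)

# Recover the original units, e.g. to report predictions in dollars
x_back = pt.inverse_transform(x_t)
print(np.allclose(x, x_back))  # True
```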


Polynomial and Interaction Features

Linear models cannot learn interactions between features unless you encode them explicitly. PolynomialFeatures generates all degree-N combinations.

python
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X = np.array([[2, 3],
              [4, 5]])

# degree=2 with include_bias=False produces: [x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
X_poly = poly.fit_transform(X)
print("Feature names:", poly.get_feature_names_out(["age", "spend"]))
# ['age', 'spend', 'age^2', 'age spend', 'spend^2']

# ⚠ Degree explosion warning (counts below include the bias column: C(n+d, d))
# With 20 features and degree=2:  C(22, 2)  = 231 features
# With 20 features and degree=3:  C(23, 3)  = 1771 features
# With 100 features and degree=2: C(102, 2) = 5151 features → overfitting risk, slow training

# Use interaction_only=True to drop the pure-power terms (x^2, x^3, ...) and keep only products of distinct features
poly_interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = poly_interact.fit_transform(X)
print("Interaction-only names:", poly_interact.get_feature_names_out(["age", "spend"]))
# ['age', 'spend', 'age spend']

# Practical guideline: only add polynomial features when you have a strong
# domain reason to believe an interaction exists, or after selection confirms it.
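The counts quoted in the warning can be verified directly: with the bias column included, a degree-d expansion of n inputs yields C(n+d, d) monomials. A small check (the helper name n_poly_features is illustrative):

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_inputs: int, degree: int) -> int:
    """Number of monomials of degree <= `degree` in `n_inputs` variables, bias included."""
    return comb(n_inputs + degree, degree)

print(n_poly_features(20, 2))    # 231
print(n_poly_features(20, 3))    # 1771
print(n_poly_features(100, 2))   # 5151

# Cross-check against sklearn (include_bias=True matches the formula exactly)
poly = PolynomialFeatures(degree=2, include_bias=True)
assert poly.fit_transform(np.zeros((1, 20))).shape[1] == n_poly_features(20, 2)
```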

Encoding Categorical Variables

One-Hot Encoding

The default choice when there is no natural ordering among categories.

python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "plan": ["free", "pro", "enterprise", "free", "pro"],
    "region": ["US", "EU", "US", "APAC", "EU"],
})

ohe = OneHotEncoder(
    handle_unknown="ignore",   # unknown categories at inference → all zeros, no error
    sparse_output=False,       # return dense numpy array (sklearn >= 1.2)
    drop="if_binary",          # drop one column for binary features (avoid perfect collinearity)
)
X_ohe = ohe.fit_transform(df[["plan", "region"]])
cols = ohe.get_feature_names_out(["plan", "region"])
print(pd.DataFrame(X_ohe, columns=cols))

Ordinal Encoding

Use when there is a meaningful order (e.g. education level, severity rating).

python
from sklearn.preprocessing import OrdinalEncoder

df_ord = pd.DataFrame({"severity": ["low", "medium", "high", "critical", "low"]})

oe = OrdinalEncoder(
    categories=[["low", "medium", "high", "critical"]],  # explicit order
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
print(oe.fit_transform(df_ord))
# [[0.], [1.], [2.], [3.], [0.]]

Target Encoding

Replace each category with the mean target value for that category. Extremely powerful but leaky if done naively.

python
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

# ---- LEAKY version (never do this) ----
def target_encode_leaky(df, col, target):
    """Computes means on the FULL training set → target leakage."""
    means = df.groupby(col)[target].mean()
    return df[col].map(means)

# ---- SAFE cross-fold version ----
def target_encode_kfold(df: pd.DataFrame, col: str, target: str, n_splits: int = 5) -> pd.Series:
    """
    Out-of-fold target encoding: each row's encoding uses statistics
    computed on other folds, eliminating leakage.
    """
    encoded = pd.Series(np.nan, index=df.index)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

    for train_idx, val_idx in kf.split(df):
        train_fold = df.iloc[train_idx]
        val_fold   = df.iloc[val_idx]
        fold_means = train_fold.groupby(col)[target].mean()
        # .to_numpy() avoids index-alignment surprises when assigning through iloc
        encoded.iloc[val_idx] = val_fold[col].map(fold_means).fillna(global_mean).to_numpy()

    return encoded

# sklearn 1.3+ provides TargetEncoder with built-in cross-fitting
from sklearn.preprocessing import TargetEncoder

df_te = pd.DataFrame({
    "plan": ["free", "pro", "enterprise", "free", "pro", "enterprise"] * 50,
    "churned": np.random.default_rng(0).integers(0, 2, 300),
})

te = TargetEncoder(target_type="binary", cv=5, smooth="auto")
X_te = te.fit_transform(df_te[["plan"]], df_te["churned"])
print("TargetEncoder output sample:", X_te[:5].ravel())
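One detail the hand-rolled out-of-fold function leaves implicit: at inference time there are no folds. Encode new data with category means computed on the full training set, falling back to the global mean for unseen categories. A minimal sketch (the helper name target_encode_transform is illustrative):

```python
import pandas as pd

def target_encode_transform(train: pd.DataFrame, new_values: pd.Series,
                            col: str, target: str) -> pd.Series:
    """Encode unseen data with full-training-set category means;
    categories never seen in training fall back to the global mean."""
    means = train.groupby(col)[target].mean()
    return new_values.map(means).fillna(train[target].mean())

train = pd.DataFrame({"plan": ["free", "pro", "free", "pro"],
                      "churned": [1, 0, 1, 0]})
test_plans = pd.Series(["pro", "free", "enterprise"])  # "enterprise" is unseen

print(target_encode_transform(train, test_plans, "plan", "churned").tolist())
# [0.0, 1.0, 0.5]
```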

Frequency Encoding

Replace each category with its frequency (count or proportion) in the training set.

python
def frequency_encode(series: pd.Series, normalize: bool = True) -> pd.Series:
    """Replace each category with its relative frequency."""
    freq = series.value_counts(normalize=normalize)
    return series.map(freq)

df["plan_freq"] = frequency_encode(df["plan"])
print(df[["plan", "plan_freq"]])

Frequency encoding carries no target information, so it is leakage-free; it works well as a lightweight complement to one-hot encoding in tree models.
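One caveat worth a quick demonstration: two categories that happen to occur equally often collapse to the same encoded value, so frequency encoding is not injective.

```python
import pandas as pd

s = pd.Series(["a", "a", "b", "b", "c"])
encoded = s.map(s.value_counts(normalize=True))

print(encoded.tolist())  # [0.4, 0.4, 0.4, 0.4, 0.2]
# "a" and "b" both encode to 0.4, so the model cannot distinguish them;
# pair frequency encoding with another encoding rather than relying on it alone.
```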


Datetime Feature Extraction

python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "event_ts": pd.date_range("2023-01-01", periods=500, freq="6h"),
    "user_signup_ts": pd.to_datetime(["2022-03-15"] * 500),
})

df["year"]        = df["event_ts"].dt.year
df["month"]       = df["event_ts"].dt.month          # 1-12
df["day_of_week"] = df["event_ts"].dt.dayofweek      # 0=Monday, 6=Sunday
df["is_weekend"]  = df["day_of_week"].isin([5, 6]).astype(int)
df["hour"]        = df["event_ts"].dt.hour
df["quarter"]     = df["event_ts"].dt.quarter
df["days_since_signup"] = (df["event_ts"] - df["user_signup_ts"]).dt.days

# Cyclical encoding for periodic features
# hour 23 and hour 0 are adjacent; raw integer encoding doesn't capture this.
# Encode as (sin, cos) pair so periodicity is preserved in Euclidean space.
def cyclical_encode(series: pd.Series, period: int) -> pd.DataFrame:
    theta = 2 * np.pi * series / period
    return pd.DataFrame({
        f"{series.name}_sin": np.sin(theta),
        f"{series.name}_cos": np.cos(theta),
    }, index=series.index)

hour_encoded    = cyclical_encode(df["hour"],        period=24)
month_encoded   = cyclical_encode(df["month"],       period=12)
dow_encoded     = cyclical_encode(df["day_of_week"], period=7)

df = pd.concat([df, hour_encoded, month_encoded, dow_encoded], axis=1)
print(df[["hour", "hour_sin", "hour_cos"]].head())
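A quick sanity check confirms that the (sin, cos) pair really does make hour 23 and hour 0 neighbours while keeping hour 12 maximally far from hour 0:

```python
import numpy as np

def hour_to_xy(hour: int) -> np.ndarray:
    theta = 2 * np.pi * hour / 24
    return np.array([np.sin(theta), np.cos(theta)])

d_enc_adjacent = np.linalg.norm(hour_to_xy(23) - hour_to_xy(0))
d_enc_opposite = np.linalg.norm(hour_to_xy(12) - hour_to_xy(0))

print(f"encoded dist(23, 0) = {d_enc_adjacent:.3f}")  # ~0.261: nearly neighbours
print(f"encoded dist(12, 0) = {d_enc_opposite:.3f}")  # 2.000: opposite ends of the circle
```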

Aggregation Features

Group-level statistics bring relational context into flat feature tables.

python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
n = 2000

transactions = pd.DataFrame({
    "user_id":  rng.integers(1, 201, n),
    "amount":   rng.exponential(50, n),
    "n_items":  rng.integers(1, 10, n),
    "ts":       pd.date_range("2023-01-01", periods=n, freq="1h"),
})

# Compute per-user aggregations on the training set
user_agg = (
    transactions.groupby("user_id")["amount"]
    .agg(
        user_mean_amount="mean",
        user_std_amount="std",
        user_max_amount="max",
        user_txn_count="count",
    )
    .reset_index()
)

# Join aggregations back to the transaction-level table
transactions = transactions.merge(user_agg, on="user_id", how="left")
print(transactions[["user_id", "amount", "user_mean_amount", "user_txn_count"]].head())

Lag Features

Lag features capture the previous value of a time-varying quantity for each entity.

python
import pandas as pd
import numpy as np

df_ts = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "week":    [1, 2, 3, 1, 2, 3],
    "revenue": [100, 120, 95, 200, 180, 210],
})
df_ts = df_ts.sort_values(["user_id", "week"])

df_ts["revenue_lag1"] = df_ts.groupby("user_id")["revenue"].shift(1)
df_ts["revenue_lag2"] = df_ts.groupby("user_id")["revenue"].shift(2)
df_ts["revenue_diff"] = df_ts["revenue"] - df_ts["revenue_lag1"]
df_ts["revenue_roll_mean2"] = (
    df_ts.groupby("user_id")["revenue"]
    .transform(lambda s: s.shift(1).rolling(2).mean())
)
print(df_ts)

Always shift before computing rolling statistics to avoid leaking the current observation's value into its own feature.
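The cost of forgetting the shift is easy to demonstrate on a toy series: without it, each row's rolling mean includes that row's own value.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

leaky = s.rolling(2).mean()           # window includes the current row
safe  = s.shift(1).rolling(2).mean()  # window ends at the previous row

print(leaky.tolist())  # [nan, 15.0, 25.0, 35.0] (row 1's feature uses row 1's own value)
print(safe.tolist())   # [nan, nan, 15.0, 25.0] (strictly past information)
```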


Feature Selection

Filter Methods

python
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                            n_redundant=4, random_state=42)

# Pearson correlation with target (linear relationships only)
correlations = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
top_pearson = np.argsort(np.abs(correlations))[::-1][:10]
print("Top features by |Pearson r|:", top_pearson)

# Mutual information: captures non-linear relationships, better for tree models
mi_scores = mutual_info_classif(X, y, random_state=42)
top_mi = np.argsort(mi_scores)[::-1][:10]
print("Top features by MI:", top_mi)

Wrapper Methods: RFECV

python
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,                             # remove one feature per round
    cv=StratifiedKFold(5),
    scoring="roc_auc",
    min_features_to_select=3,
    n_jobs=-1,
)
rfecv.fit(X, y)
print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected feature indices:   {np.where(rfecv.support_)[0]}")

Embedded Methods

python
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Lasso drives uninformative coefficients exactly to zero at the chosen alpha.
# (LassoCV is a regression model; treating the 0/1 target as continuous works
# as a selection heuristic, but L1-penalised logistic regression is the proper
# classification analogue.)
pipe_lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
pipe_lasso.fit(X, y.astype(float))
lasso_coef = pipe_lasso.named_steps["lassocv"].coef_
selected_lasso = np.where(lasso_coef != 0)[0]
print(f"Lasso selected {len(selected_lasso)} features:", selected_lasso)

# Tree feature importances (impurity-based: fast but biased toward high cardinality)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X, y)
imp_impurity = rf.feature_importances_
print("Top 5 by impurity importance:", np.argsort(imp_impurity)[::-1][:5])

# Permutation importance: model-agnostic, slower, more reliable
# (ideally computed on a held-out set rather than the training data)
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=42, n_jobs=-1)
imp_perm = perm.importances_mean
print("Top 5 by permutation importance:", np.argsort(imp_perm)[::-1][:5])

Variance Threshold

Remove near-zero-variance features before any other selection step.

python
from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(threshold=0.01)  # remove features with variance < 0.01
X_high_var = vt.fit_transform(X)
print(f"Features before: {X.shape[1]}, after: {X_high_var.shape[1]}")

Complete Pipeline: ColumnTransformer on Synthetic Churn Data

python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

rng = np.random.default_rng(42)
n = 1500

# Synthetic SaaS churn dataset
df = pd.DataFrame({
    "tenure_months":   rng.integers(1, 60, n).astype(float),
    "monthly_spend":   rng.exponential(80, n),
    "support_tickets": rng.integers(0, 20, n).astype(float),
    "feature_usage":   rng.uniform(0, 1, n),
    "contract_type":   rng.choice(["monthly", "annual", "multi-year"], n),
    "plan_tier":       rng.choice(["starter", "growth", "enterprise"], n),
})

# Inject ~5% missing values
for col in ["tenure_months", "monthly_spend"]:
    mask = rng.random(n) < 0.05
    df.loc[mask, col] = np.nan

y = (
    (df["tenure_months"].fillna(df["tenure_months"].median()) < 12)
    & (df["support_tickets"] > 5)
).astype(int)

X = df.copy()  # keep all columns for the pipeline

numeric_features     = ["tenure_months", "monthly_spend", "support_tickets", "feature_usage"]
categorical_features = ["contract_type", "plan_tier"]

# Numeric pipeline: impute median → power-transform (removes skew) → scale
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("power",   PowerTransformer(method="yeo-johnson", standardize=False)),
    ("scaler",  StandardScaler()),
])

# Categorical pipeline: impute most frequent → one-hot encode
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe",     OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline,     numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier",   RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)),
])

scores = cross_val_score(full_pipeline, X, y, cv=5, scoring="roc_auc")
print(f"CV ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}")

# Fit final pipeline and inspect feature names
full_pipeline.fit(X, y)
ohe_cats = (
    full_pipeline.named_steps["preprocessor"]
    .named_transformers_["cat"]
    .named_steps["ohe"]
    .get_feature_names_out(categorical_features)
)
all_features = numeric_features + list(ohe_cats)
importances  = full_pipeline.named_steps["classifier"].feature_importances_
feat_imp = pd.Series(importances, index=all_features).sort_values(ascending=False)
print(feat_imp.head(8))

Key Takeaways

  • Apply log1p or PowerTransformer(method='yeo-johnson') to right-skewed numeric features before linear models or distance-based algorithms; tree-based models (including gradient boosting) are invariant to monotonic feature transforms, so skew correction matters far less for them.
  • Target encoding is powerful but leaky by default — always use out-of-fold (cross-fit) encoding; sklearn 1.3+ TargetEncoder handles this correctly.
  • Cyclical features (hour, day-of-week, month) must be encoded as (sin, cos) pairs so the model understands that hour 23 and hour 0 are adjacent.
  • Polynomial feature counts grow combinatorially (C(n+d, d) columns for n features at degree d): use interaction_only=True and apply selection afterward, or add only interactions motivated by domain knowledge.
  • Prefer permutation importance over impurity-based importance when your dataset has high-cardinality or continuous features; impurity importance is biased toward those features.
  • RFECV gives you the optimal feature count via cross-validation but is expensive — run it on a fast estimator (e.g. a shallow random forest) rather than your final model.
  • Wrap all preprocessing in a ColumnTransformer inside a Pipeline — this ensures transforms are fit only on training data during cross-validation, preventing data leakage.
  • Variance thresholding is a cheap first step in any selection process: near-zero-variance features carry almost no signal yet still cost compute. Remember the threshold is scale-dependent, and thresholding is pointless after standardisation, which sets every variance to 1.