Raw data almost never arrives in a form that maximises model performance. A tenure column measured in days has a long right tail — log-transforming it makes gradient descent converge faster and linear models fit better. A timestamp is opaque to a tree model until you decompose it into day-of-week and hour. A categorical "plan tier" column must be encoded before any algorithm sees it. Feature engineering is the craft of converting domain knowledge and statistical intuition into representations that models can exploit. Feature selection then prunes away the noise to leave the signal.
Numeric Transformations
Right-skewed numeric distributions compress informative variation at the high end. Transforming them before modelling — especially for linear models and distance-based algorithms — can significantly improve performance.
```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import PowerTransformer
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Simulate right-skewed revenue data
revenue = rng.exponential(scale=200, size=2000) + 1  # always positive

# log1p: safe for zero values (log(1+x)), the most common choice
log1p_rev = np.log1p(revenue)

# sqrt: milder than log, tolerates zero, good for count data
sqrt_rev = np.sqrt(revenue)

# Box-Cox: finds the optimal lambda; requires strictly positive values
bc_rev, lambda_bc = stats.boxcox(revenue)
print(f"Box-Cox lambda: {lambda_bc:.4f}")  # near 0 = log; near 0.5 = sqrt

# Yeo-Johnson: same idea but handles zero and negative values
pt = PowerTransformer(method="yeo-johnson", standardize=True)
yj_rev = pt.fit_transform(revenue.reshape(-1, 1)).ravel()
print(f"Yeo-Johnson lambda: {pt.lambdas_[0]:.4f}")

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, data, label in zip(
    axes,
    [revenue, log1p_rev, sqrt_rev, yj_rev],
    ["Raw", "log1p", "sqrt", "Yeo-Johnson"],
):
    ax.hist(data, bins=50, color="#7c3aed", alpha=0.75)
    skew = stats.skew(data)
    ax.set_title(f"{label}\nskewness={skew:.2f}")
plt.tight_layout()
plt.show()
```
Rule of thumb: use log1p for monetary amounts and counts. Prefer PowerTransformer(method='yeo-johnson') inside a sklearn Pipeline because it handles zeros and negatives and is invertible.
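As a minimal sketch of that recommendation (the data and model here are illustrative, not from the examples above), the transformer slots into a Pipeline like any other step, so it is refit on each training fold during cross-validation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.exponential(scale=100, size=(500, 3))  # right-skewed inputs
y = np.log1p(X[:, 0]) + 0.5 * X[:, 1] ** 0.3 + rng.normal(size=500)

# Yeo-Johnson is fit only on each fold's training split — no leakage
pipe = make_pipeline(PowerTransformer(method="yeo-johnson"), Ridge())
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(f"Mean R^2 across folds: {scores.mean():.3f}")
```

Because PowerTransformer is invertible, predictions made in the transformed space can be mapped back to original units with `inverse_transform`.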
Polynomial and Interaction Features
Linear models cannot learn interactions between features unless you encode them explicitly. PolynomialFeatures generates all degree-N combinations.
```python
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X = np.array([[2, 3], [4, 5]])

# degree=2 with include_bias=False produces: [x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
X_poly = poly.fit_transform(X)
print("Feature names:", poly.get_feature_names_out(["age", "spend"]))
# ['age', 'spend', 'age^2', 'age spend', 'spend^2']

# ⚠ Degree explosion warning (counts include the bias column):
# With 20 features and degree=2:  C(20+2, 2) = 231 features
# With 20 features and degree=3:  C(20+3, 3) = 1771 features
# With 100 features and degree=2: C(100+2, 2) = 5151 features → overfitting risk, slow training

# Use interaction_only=True to skip the pure power terms
poly_interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = poly_interact.fit_transform(X)
print("Interaction-only names:", poly_interact.get_feature_names_out(["age", "spend"]))
# ['age', 'spend', 'age spend']

# Practical guideline: only add polynomial features when you have a strong
# domain reason to believe an interaction exists, or after selection confirms it.
```
Encoding Categorical Variables
One-Hot Encoding
The default choice when there is no natural ordering among categories.
```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({
    "plan": ["free", "pro", "enterprise", "free", "pro"],
    "region": ["US", "EU", "US", "APAC", "EU"],
})

ohe = OneHotEncoder(
    handle_unknown="ignore",  # unknown categories at inference → all zeros, no error
    sparse_output=False,      # return dense numpy array (sklearn >= 1.2)
    drop="if_binary",         # drop one column for binary features (avoids perfect collinearity)
)
X_ohe = ohe.fit_transform(df[["plan", "region"]])
cols = ohe.get_feature_names_out(["plan", "region"])
print(pd.DataFrame(X_ohe, columns=cols))
```
Ordinal Encoding
Use when there is a meaningful order (e.g. education level, severity rating).
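A minimal sketch using sklearn's OrdinalEncoder; the education categories and their order here are illustrative assumptions:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"education": ["high_school", "bachelor", "master", "phd", "bachelor"]})

# Pass the order explicitly — the default is alphabetical, which is rarely what you want
enc = OrdinalEncoder(
    categories=[["high_school", "bachelor", "master", "phd"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # unseen categories at inference map to -1 instead of raising
)
df["education_level"] = enc.fit_transform(df[["education"]])
print(df)
# high_school → 0.0, bachelor → 1.0, master → 2.0, phd → 3.0
```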
When engineering lag or rolling-window features for time series, always shift before computing rolling statistics to avoid leaking the current observation's value into its own feature.
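A sketch of the pattern in pandas (the `user_id`/`spend` columns are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2],
    "spend":   [10.0, 20.0, 30.0, 40.0, 5.0, 15.0, 25.0],
})

g = df.groupby("user_id")["spend"]

# LEAKY: the current row sits inside its own window
df["roll3_leaky"] = g.transform(lambda s: s.rolling(3, min_periods=1).mean())

# SAFE: shift(1) first, so the window sees only strictly past rows
df["roll3_safe"] = g.transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())

print(df)
# The first row of each user is NaN in roll3_safe — there is no past to aggregate
```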
Feature Selection
Filter Methods
```python
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_redundant=4, random_state=42)

# Pearson correlation with target (linear relationships only)
correlations = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
top_pearson = np.argsort(np.abs(correlations))[::-1][:10]
print("Top features by |Pearson r|:", top_pearson)

# Mutual information: captures non-linear relationships, better for tree models
mi_scores = mutual_info_classif(X, y, random_state=42)
top_mi = np.argsort(mi_scores)[::-1][:10]
print("Top features by MI:", top_mi)
```
Wrapper Methods: RFECV
```python
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
import numpy as np

# X, y: the classification data from the filter-methods example above
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,  # remove one feature per round
    cv=StratifiedKFold(5),
    scoring="roc_auc",
    min_features_to_select=3,
    n_jobs=-1,
)
rfecv.fit(X, y)
print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected feature indices: {np.where(rfecv.support_)[0]}")
```
Embedded Methods
```python
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np

# X, y: the classification data from the filter-methods example above

# Lasso path: features whose coefficients shrink to zero at the chosen alpha
pipe_lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
pipe_lasso.fit(X, y.astype(float))
lasso_coef = pipe_lasso.named_steps["lassocv"].coef_
selected_lasso = np.where(lasso_coef != 0)[0]
print(f"Lasso selected {len(selected_lasso)} features:", selected_lasso)

# Tree feature importances (impurity-based: fast but biased toward high-cardinality features)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X, y)
imp_impurity = rf.feature_importances_
print("Top 5 by impurity importance:", np.argsort(imp_impurity)[::-1][:5])

# Permutation importance: model-agnostic, slower, more reliable
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=42, n_jobs=-1)
imp_perm = perm.importances_mean
print("Top 5 by permutation importance:", np.argsort(imp_perm)[::-1][:5])
```
Variance Threshold
Remove near-zero-variance features before any other selection step.
```python
from sklearn.feature_selection import VarianceThreshold

# X: the feature matrix from the examples above
vt = VarianceThreshold(threshold=0.01)  # remove features with variance < 0.01
X_high_var = vt.fit_transform(X)
print(f"Features before: {X.shape[1]}, after: {X_high_var.shape[1]}")
```
Complete Pipeline: ColumnTransformer on Synthetic Churn Data
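A sketch of such a pipeline on a small synthetic churn frame; the column names, the random target, and the logistic-regression model are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "tenure_days": rng.exponential(scale=300, size=n),
    "monthly_spend": rng.exponential(scale=80, size=n),
    "plan": rng.choice(["free", "pro", "enterprise"], size=n),
    "region": rng.choice(["US", "EU", "APAC"], size=n),
})
churn = (rng.random(n) < 0.3).astype(int)

numeric = ["tenure_days", "monthly_spend"]
categorical = ["plan", "region"]

# Route each column group to its own transform
pre = ColumnTransformer([
    ("num", PowerTransformer(method="yeo-johnson"), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Wrapping everything in a Pipeline means both transforms are fit
# only on each training fold during cross-validation — no leakage
clf = Pipeline([("pre", pre), ("model", LogisticRegression(max_iter=1000))])
scores = cross_val_score(clf, df, churn, cv=5, scoring="roc_auc")
print(f"CV ROC-AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```

Since the target here is random noise, the AUC hovers around 0.5; the point is the structure, not the score.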
Key Takeaways

- Apply log1p or PowerTransformer(method='yeo-johnson') to right-skewed numeric features before linear models or distance-based algorithms; tree models are invariant to monotonic transforms, but transforming can still speed up gradient boosting convergence.
- Target encoding is powerful but leaky by default — always use out-of-fold (cross-fit) encoding; sklearn 1.3+ TargetEncoder handles this correctly.
- Cyclical features (hour, day-of-week, month) must be encoded as (sin, cos) pairs so the model understands that hour 23 and hour 0 are adjacent.
- Polynomial features explode combinatorially — use interaction_only=True and apply selection afterward, or add only features motivated by domain knowledge.
- Prefer permutation importance over impurity-based importance when your dataset has high-cardinality or continuous features; impurity importance is biased toward those features.
- RFECV gives you the optimal feature count via cross-validation but is expensive — run it on a fast estimator (e.g. a shallow random forest) rather than your final model.
- Wrap all preprocessing in a ColumnTransformer inside a Pipeline — this ensures transforms are fit only on training data during cross-validation, preventing data leakage.
- Variance threshold should be the first step in any selection process; near-zero-variance features contribute noise and slow down downstream methods.
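The cyclical-encoding point above can be sketched in a few lines (the `hour` column is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})

# Map hour onto the unit circle so 23:00 and 00:00 end up close together
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Euclidean distance between hour 23 and hour 0 is now small,
# whereas the raw integers are 23 apart
p23 = df.loc[df["hour"] == 23, ["hour_sin", "hour_cos"]].to_numpy()
p0 = df.loc[df["hour"] == 0, ["hour_sin", "hour_cos"]].to_numpy()
print("distance(23h, 0h):", float(np.linalg.norm(p23 - p0)))
```

The same pattern works for day-of-week (divide by 7) and month (divide by 12).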