Classification is the task of predicting which discrete category a new observation belongs to. It underlies spam detection, medical diagnosis, credit scoring, and churn prediction. This lesson covers two foundational algorithms — logistic regression and decision trees — and the evaluation metrics that distinguish good classifiers from lucky ones.
Logistic Regression and the Sigmoid Function
Despite its name, logistic regression is a classification algorithm. It predicts the probability that an observation belongs to the positive class by passing a linear combination of features through the sigmoid function:
σ(z) = 1 / (1 + e^(−z))
The sigmoid squashes any real-valued input to the range (0, 1), making it interpretable as a probability.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_score, recall_score, f1_score)
from sklearn.preprocessing import StandardScaler

# Visualise the sigmoid function
z = np.linspace(-8, 8, 200)
sigmoid = 1 / (1 + np.exp(-z))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(z, sigmoid, color="#2196F3", linewidth=2.5)
ax.axhline(0.5, color="grey", linestyle="--", alpha=0.6)
ax.axvline(0, color="grey", linestyle="--", alpha=0.6)
ax.fill_between(z, sigmoid, 0.5, where=(sigmoid > 0.5), alpha=0.15, color="#4CAF50")
ax.fill_between(z, sigmoid, 0.5, where=(sigmoid < 0.5), alpha=0.15, color="#F44336")
ax.set_xlabel("z (linear combination of features)")
ax.set_ylabel("Predicted probability")
ax.set_title("Sigmoid Function — Converts Linear Score to Probability")
plt.show()
```
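To connect the sigmoid back to a fitted model, here is a minimal sketch that fits a LogisticRegression on a synthetic dataset (the data is illustrative, not the lesson's) and verifies that predict_proba is exactly the sigmoid applied to the linear score. It also exponentiates the coefficients to recover odds ratios.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic dataset (assumption: not the lesson's data)
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

model = LogisticRegression()
model.fit(X, y)

# predict_proba for the positive class is sigmoid(X @ coef_ + intercept_)
z = X @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1 / (1 + np.exp(-z))
print(np.allclose(manual_proba, model.predict_proba(X)[:, 1]))  # True

# Coefficients are on the log-odds scale; exponentiate for odds ratios
print(np.exp(model.coef_.ravel()))
```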
Decision Trees and Gini Impurity
Decision trees partition the feature space into rectangular regions using a sequence of binary splits. The CART (Classification and Regression Trees) algorithm greedily selects the feature and threshold that best separates the classes at each node.
Gini impurity measures how often a randomly chosen element from the node would be incorrectly classified:
Gini = 1 − Σ(pᵢ²)
A pure node (all one class) has Gini = 0. CART chooses the split that minimises the weighted average Gini of the two child nodes.
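The split criterion is simple enough to compute by hand. A short sketch, using hypothetical class counts rather than any dataset from the lesson:

```python
import numpy as np

def gini(counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has Gini = 0
print(gini([10, 0]))   # 0.0

# A 50/50 node has the maximum binary Gini of 0.5
print(gini([5, 5]))    # 0.5

# Weighted Gini of a candidate split (hypothetical child counts):
# left child [8, 2], right child [1, 9]
left, right = [8, 2], [1, 9]
n_left, n_right = sum(left), sum(right)
n = n_left + n_right
weighted = (n_left / n) * gini(left) + (n_right / n) * gini(right)
print(weighted)        # 0.25 — CART picks the split minimising this
```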
```python
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=42)
tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)
print(classification_report(y_test, y_pred_tree, target_names=["Died", "Survived"]))

# Visualise the tree
fig, ax = plt.subplots(figsize=(20, 8))
plot_tree(tree, feature_names=X.columns, class_names=["Died", "Survived"],
          filled=True, rounded=True, fontsize=9, ax=ax)
plt.title("Decision Tree — Titanic Survival")
plt.tight_layout()
plt.show()

# Feature importance from the tree
imp = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(f"\nFeature importance:\n{imp}")
```
Precision, Recall, and F1 Score
Accuracy alone is misleading when classes are imbalanced. If 95% of credit card transactions are legitimate, a model that always predicts "not fraud" achieves 95% accuracy while being completely useless.
| Metric | Formula | Answers |
|---|---|---|
| Accuracy | (TP + TN) / All | What fraction of all predictions are correct? |
| Precision | TP / (TP + FP) | Of those predicted positive, how many actually are? |
| Recall (Sensitivity) | TP / (TP + FN) | Of all actual positives, how many did we catch? |
| F1 Score | 2 × P × R / (P + R) | Harmonic mean — balances precision and recall |
| Specificity | TN / (TN + FP) | Of all actual negatives, how many did we correctly reject? |
When to prioritise precision: When false positives are costly (spam filter flagging important emails).
When to prioritise recall: When false negatives are costly (cancer screening missing a tumour).
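The formulas in the table are easy to verify by hand. A sketch with hypothetical confusion-matrix counts for an imbalanced test set (95% negative), showing how an always-negative baseline scores high accuracy with zero recall:

```python
# Hypothetical counts: 950 actual negatives, 50 actual positives;
# the model catches 30 of the positives.
TP, FN = 30, 20
TN, FP = 940, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} specificity={specificity:.3f}")

# An always-negative baseline has TP = 0, so recall = 0
# despite 95% accuracy on this class balance.
baseline_accuracy = 950 / 1000
print(f"baseline accuracy={baseline_accuracy:.2f}, baseline recall=0.0")
```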
Adjusting the Decision Threshold
By default, predict() uses a threshold of 0.5: an observation is assigned to the positive class when its predicted probability exceeds 0.5. Shifting that threshold trades precision for recall — lowering it catches more positives at the cost of more false alarms, raising it does the reverse.
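A sketch of threshold shifting via predict_proba, on a synthetic imbalanced dataset (the data and the specific threshold values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Illustrative imbalanced dataset (assumption: ~20% positive class)
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
model = LogisticRegression().fit(X, y)

# Work with probabilities, then threshold them explicitly
proba = model.predict_proba(X)[:, 1]

for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    p = precision_score(y, y_pred, zero_division=0)
    r = recall_score(y, y_pred)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold can only add positive predictions, so recall never decreases as the threshold drops; precision typically moves the other way.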
Key Takeaways
- Logistic regression applies the sigmoid function to a linear combination of features, producing a probability between 0 and 1; the decision boundary is linear in feature space.
- Coefficients are on the log-odds scale; exponentiating them gives odds ratios, directly interpretable as multiplicative effects on the odds of the positive class.
- Decision trees (CART) partition feature space by greedily minimising Gini impurity at each split; max_depth and min_samples_leaf are the primary overfitting controls.
- Accuracy is a misleading metric for imbalanced datasets; use precision (positive predictive value), recall (sensitivity), and their harmonic mean, F1.
- The classification threshold of 0.5 is not sacred — shift it to tune the precision-recall tradeoff based on the asymmetric cost of each error type.