Unsupervised learning operates without labels. Instead of predicting a target, it discovers structure — natural groupings, latent dimensions, anomalies, and compressed representations. These are not convenience tools for when labels are unavailable; they are often the most revealing analyses in a project. Customer segmentation, anomaly detection, feature compression, and data exploration all depend on unsupervised methods used correctly.
K-Means: Lloyd's Algorithm
K-Means partitions n data points into k clusters by minimising the within-cluster sum of squares (inertia):
inertia = sum over clusters c, sum over points x in c of ||x - mu_c||^2
where mu_c is the centroid of cluster c.
Lloyd's algorithm (the standard K-Means procedure):
1. Initialise k centroids (randomly or with the K-Means++ heuristic)
2. Assign each point to its nearest centroid (E step)
3. Recompute each centroid as the mean of its assigned points (M step)
4. Repeat steps 2-3 until assignments no longer change
This converges to a local minimum — not necessarily the global minimum — so multiple restarts (n_init) are essential.
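The loop above can be written out in plain NumPy; `lloyd_kmeans` is an illustrative helper, not a library function, and uses plain random initialisation rather than K-Means++:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """One run of Lloyd's algorithm: alternate assignment (E) and update (M)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # E step: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M step: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members):               # keep old centroid if cluster empties
                new_centroids[c] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                          # assignments stable: local minimum
        centroids = new_centroids
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

# Demo: best of 5 restarts on two well-separated blobs
rng = np.random.default_rng(7)
X_demo = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, cents, inertia = min((lloyd_kmeans(X_demo, 2, seed=s) for s in range(5)),
                             key=lambda t: t[2])
print("inertia:", round(inertia, 2))
```

The `min` over seeds is exactly the multiple-restart strategy sklearn's `n_init` automates.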
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import (KMeans, MiniBatchKMeans, DBSCAN,
                             AgglomerativeClustering)
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.datasets import make_blobs, make_moons
from scipy.cluster.hierarchy import dendrogram, linkage
import warnings

warnings.filterwarnings('ignore')
rng = np.random.default_rng(42)

# Synthetic customer dataset for segmentation
n_customers = 2000

# True 4-cluster structure with different scales
X_cust, y_true = make_blobs(
    n_samples=n_customers, centers=4,
    cluster_std=[1.2, 0.8, 1.5, 1.0], random_state=42
)

# Add three irrelevant noise features
X_cust = np.hstack([
    X_cust,
    rng.normal(0, 1, (n_customers, 3))
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cust)
print(f"Dataset: {X_cust.shape[0]} customers, {X_cust.shape[1]} features")
```
Elbow Method and Silhouette Score
The elbow method plots inertia against k and looks for the point of diminishing returns. The silhouette score measures cluster cohesion and separation for each point:
silhouette(i) = (b(i) - a(i)) / max(a(i), b(i))
where a(i) is the mean intra-cluster distance and b(i) is the mean distance to the nearest other cluster. The score ranges from -1 (wrong cluster) to 1 (perfect separation), with 0 indicating overlap.
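To make the formula concrete, a(i), b(i), and silhouette(i) can be computed by hand for a single point and checked against sklearn's `silhouette_samples` (the toy data below is illustrative):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
# Two small, well-separated groups; point 0 belongs to cluster 0
X_demo = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)

# a(0): mean distance to the OTHER members of its own cluster
a_0 = np.linalg.norm(X_demo[1:20] - X_demo[0], axis=1).mean()
# b(0): mean distance to the nearest other cluster (only one other here)
b_0 = np.linalg.norm(X_demo[20:] - X_demo[0], axis=1).mean()
s_0 = (b_0 - a_0) / max(a_0, b_0)

print("manual :", round(s_0, 4))
print("sklearn:", round(silhouette_samples(X_demo, labels)[0], 4))
```

Note that a(i) excludes the point itself; sklearn divides by n_c - 1 for the same reason.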
```python
k_range = range(2, 11)
inertias = []
silhouette_scores = []

for k in k_range:
    km = KMeans(n_clusters=k, n_init=20, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, labels))

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

axes[0].plot(list(k_range), inertias, 'bo-', markersize=6)
axes[0].set_xlabel('Number of clusters k')
axes[0].set_ylabel('Inertia (within-cluster SSE)')
axes[0].set_title('Elbow Method\n(look for the "elbow" — diminishing returns)')
axes[0].axvline(4, color='red', linestyle='--', alpha=0.7, label='True k=4')
axes[0].legend()

axes[1].plot(list(k_range), silhouette_scores, 'ro-', markersize=6)
axes[1].set_xlabel('Number of clusters k')
axes[1].set_ylabel('Mean silhouette score')
axes[1].set_title('Silhouette Score\n(higher = better defined clusters)')
axes[1].axvline(4, color='red', linestyle='--', alpha=0.7, label='True k=4')
axes[1].legend()

plt.tight_layout()
plt.savefig('kmeans_selection.png', dpi=150, bbox_inches='tight')
plt.show()

print("Best k by silhouette:",
      k_range.start + silhouette_scores.index(max(silhouette_scores)))
```
Silhouette Plot Per Sample
The per-sample silhouette plot reveals whether clusters are well-separated and equally sized. Clusters with many samples below the mean silhouette score, or with negative values, indicate poor assignment.
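A sketch of such a plot using `silhouette_samples`; the toy dataset and styling choices are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=4, n_init=20, random_state=42).fit_predict(X)

sil_values = silhouette_samples(X, labels)   # one score per point
mean_sil = silhouette_score(X, labels)       # the mean of sil_values

fig, ax = plt.subplots(figsize=(7, 5))
y_lower = 10
for c in range(4):
    # One horizontal band per cluster, points sorted by silhouette value
    vals = np.sort(sil_values[labels == c])
    ax.fill_betweenx(np.arange(y_lower, y_lower + len(vals)), 0, vals, alpha=0.7)
    ax.text(-0.05, y_lower + 0.5 * len(vals), str(c))
    y_lower += len(vals) + 10

ax.axvline(mean_sil, color='red', linestyle='--', label=f'mean = {mean_sil:.2f}')
ax.set_xlabel('Silhouette value')
ax.set_ylabel('Cluster')
ax.legend()
plt.tight_layout()
plt.savefig('silhouette_per_sample.png', dpi=150, bbox_inches='tight')
plt.show()
```

Bands that fall mostly left of the red mean line, or dip below zero, flag the clusters to investigate.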
K-Means++ improves centroid initialisation: instead of random initialisation, it selects each new centroid with probability proportional to its squared distance from the nearest already-selected centroid. This typically reduces the number of restarts needed and avoids degenerate solutions. It is the default in sklearn (init='k-means++').
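The squared-distance sampling can be sketched directly; `kmeans_pp_init` is an illustrative helper and a simplified version of what sklearn does internally (it omits sklearn's greedy candidate trials):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k centroids: the first uniformly at random, each subsequent one
    with probability proportional to its squared distance to the nearest
    centroid chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Demo: with two far-apart blobs, the second centroid almost always
# lands in the blob the first one missed
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centroids = kmeans_pp_init(X_demo, 2, rng)
print("Chosen centroids:\n", centroids.round(2))
```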
For large datasets (n > 100,000), MiniBatchKMeans uses mini-batches to update centroids, converging much faster with only a small accuracy trade-off.
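A quick sanity check of that trade-off on synthetic blobs; exact timings and the size of the inertia gap will vary by machine and data:

```python
import time
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X_big, _ = make_blobs(n_samples=100_000, centers=8, random_state=42)

t0 = time.perf_counter()
km = KMeans(n_clusters=8, n_init=3, random_state=42).fit(X_big)
t_km = time.perf_counter() - t0

t0 = time.perf_counter()
mbk = MiniBatchKMeans(n_clusters=8, n_init=3, batch_size=1024,
                      random_state=42).fit(X_big)
t_mbk = time.perf_counter() - t0

print(f"KMeans:          {t_km:.2f}s, inertia={km.inertia_:.0f}")
print(f"MiniBatchKMeans: {t_mbk:.2f}s, inertia={mbk.inertia_:.0f}")
print(f"Relative inertia gap: {(mbk.inertia_ / km.inertia_ - 1):.2%}")
```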
K-Means assumes clusters are spherical (isotropic) and roughly equal in size. It is sensitive to outliers (outliers pull centroids toward them), requires scaling (features with larger ranges dominate distances), and always produces exactly k clusters — even if no natural clusters exist.
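The last point is worth demonstrating: on structureless data, K-Means still reports exactly k "clusters", and only a validation metric gives the game away:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X_noise = rng.uniform(0, 1, (1000, 2))  # uniform noise: no cluster structure

km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_noise)
print("Clusters reported:", len(np.unique(km.labels_)))
# A low silhouette is the warning sign that the "clusters" are an
# artefact of the algorithm, not of the data
print("Silhouette:", round(silhouette_score(X_noise, km.labels_), 3))
```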
DBSCAN: Density-Based Spatial Clustering
DBSCAN groups points by density. A point is a core point if at least min_samples points (including itself) fall within distance epsilon. Core points within epsilon of each other form a cluster. Border points are within epsilon of a core point but below the min_samples threshold. Noise points belong to no cluster.
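The core/border/noise taxonomy in action on the classic two-moons dataset; `eps` and `min_samples` here are hand-picked for this data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaved half-moons: non-convex clusters K-Means cannot separate
X_moons, _ = make_moons(n_samples=400, noise=0.05, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

db = DBSCAN(eps=0.3, min_samples=5).fit(X_moons)
labels_db = db.labels_                       # -1 marks noise points

n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
n_core = len(db.core_sample_indices_)
n_noise = int((labels_db == -1).sum())
n_border = len(X_moons) - n_core - n_noise   # assigned, but not dense enough

print(f"Clusters: {n_clusters}, core: {n_core}, "
      f"border: {n_border}, noise: {n_noise}")
```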
```python
# DBSCAN strengths and weaknesses
print("DBSCAN advantages:")
print("  - Finds arbitrary-shaped clusters")
print("  - Identifies noise points (outlier detection)")
print("  - Does not require specifying k upfront")
print("  - Robust to outliers\n")
print("DBSCAN disadvantages:")
print("  - Struggles with varying density clusters")
print("  - Sensitive to epsilon — small changes can merge/split clusters")
print("  - Scales poorly (O(n log n) with spatial index; O(n^2) naive)")
print("  - Feature scaling required (same as K-Means)")

# Demonstrate sensitivity to varying density
X_vary_dense = np.vstack([
    rng.normal([0, 0], 0.3, (200, 2)),  # Dense cluster
    rng.normal([5, 5], 1.5, (200, 2)),  # Sparse cluster
])
for eps in [0.3, 0.5, 0.8]:
    labels_v = DBSCAN(eps=eps, min_samples=5).fit_predict(X_vary_dense)
    n_clusters_v = len(np.unique(labels_v[labels_v >= 0]))
    n_noise_v = (labels_v == -1).sum()
    print(f"eps={eps}: {n_clusters_v} clusters, {n_noise_v} noise points")
```
Hierarchical Clustering and Dendrograms
Agglomerative hierarchical clustering (bottom-up) starts with each point as its own cluster and repeatedly merges the two most similar clusters. The result is a dendrogram — a tree showing the order and distance of merges. Cutting the dendrogram at any level yields a flat clustering.
Linkage methods define how distance between clusters is computed:
Single linkage: distance between closest points — prone to chaining
Complete linkage: distance between farthest points — prefers compact clusters
Average linkage: mean pairwise distance — balance between single and complete
Ward's linkage: minimise within-cluster variance at each merge — often best for balanced clusters
```python
# Use a smaller subset for a clear dendrogram
X_hier, _ = make_blobs(n_samples=80, centers=4, cluster_std=0.8, random_state=42)
X_hier_s = StandardScaler().fit_transform(X_hier)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
linkage_methods = ['single', 'complete', 'average', 'ward']

for ax, method in zip(axes.flatten(), linkage_methods):
    Z = linkage(X_hier_s, method=method)
    dendrogram(Z, ax=ax, truncate_mode='lastp', p=20,
               leaf_rotation=90, leaf_font_size=8,
               color_threshold=Z[-3, 2])
    ax.set_title(f'{method.capitalize()} linkage dendrogram')
    ax.set_xlabel('Cluster (size in brackets)')
    ax.set_ylabel('Distance')
    # Mark where to cut (above the threshold of the top 3 merges)
    cut_height = Z[-4, 2]
    ax.axhline(cut_height, color='red', linestyle='--', alpha=0.7,
               label=f'Cut at {cut_height:.2f}')
    ax.legend(fontsize=8)

plt.suptitle('Hierarchical Clustering Dendrograms — 4 Linkage Methods', fontsize=13)
plt.tight_layout()
plt.savefig('dendrograms.png', dpi=150, bbox_inches='tight')
plt.show()

# Apply agglomerative clustering
for method in ['ward', 'complete', 'average']:
    agg = AgglomerativeClustering(n_clusters=4, linkage=method)
    labels_agg = agg.fit_predict(X_hier_s)
    print(f"{method:10s} linkage silhouette: "
          f"{silhouette_score(X_hier_s, labels_agg):.4f}")
```
Gaussian Mixture Models
GMM is a probabilistic model that assumes data comes from a mixture of K Gaussian distributions. Unlike K-Means (hard assignment), GMM assigns soft probabilities: each point has a probability of belonging to each component.
GMM is fit via the Expectation-Maximisation (EM) algorithm:
E step: compute the probability that each point belongs to each Gaussian (soft assignments)
M step: update each Gaussian's mean, covariance, and mixing weight using the weighted points
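The two steps can be made explicit with a minimal EM loop for a 1-D two-component mixture; variable names are illustrative, and in practice you would use `GaussianMixture` rather than this sketch:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
# 1-D data from two Gaussians, N(0, 1) and N(5, 1), mixed 50/50
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])

# Deliberately rough initial guesses
mu = np.array([-1.0, 6.0])
sigma = np.array([1.5, 1.5])
pi = np.array([0.5, 0.5])

for _ in range(50):
    # E step: responsibility of each component for each point (soft assignment)
    dens = pi * norm.pdf(x[:, None], mu, sigma)      # shape (600, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: responsibility-weighted updates of mean, std, and mixing weight
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

print("means  :", mu.round(2))
print("stds   :", sigma.round(2))
print("weights:", pi.round(2))
```

After a few dozen iterations the estimates recover the generating parameters, even from rough starting guesses.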
BIC (Bayesian Information Criterion) and AIC (Akaike Information Criterion) help select the number of components: lower is better, but BIC penalises complexity more heavily.
```python
# GMM with BIC/AIC selection
X_gmm, _ = make_blobs(n_samples=500, centers=3,
                      cluster_std=[0.8, 1.2, 0.6], random_state=42)
X_gmm_s = StandardScaler().fit_transform(X_gmm)

k_range_gmm = range(1, 9)
bic_scores = []
aic_scores = []

# Test 'full' covariance (other options: 'tied', 'diag', 'spherical')
for k in k_range_gmm:
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          n_init=5, random_state=42)
    gmm.fit(X_gmm_s)
    bic_scores.append(gmm.bic(X_gmm_s))
    aic_scores.append(gmm.aic(X_gmm_s))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(list(k_range_gmm), bic_scores, 'bo-', label='BIC')
ax.plot(list(k_range_gmm), aic_scores, 'ro-', label='AIC')
ax.axvline(3, color='black', linestyle=':', alpha=0.5, label='True k=3')
ax.set_xlabel('Number of components k')
ax.set_ylabel('Information Criterion (lower = better)')
ax.set_title('GMM: BIC and AIC for Model Selection')
ax.legend()
plt.tight_layout()
plt.savefig('gmm_bic_aic.png', dpi=150, bbox_inches='tight')
plt.show()

best_k_gmm = k_range_gmm.start + np.argmin(bic_scores)
print(f"Best k by BIC: {best_k_gmm}")

# Fit the best GMM and show soft assignments
gmm_best = GaussianMixture(n_components=best_k_gmm, covariance_type='full',
                           n_init=10, random_state=42)
gmm_best.fit(X_gmm_s)
proba_gmm = gmm_best.predict_proba(X_gmm_s)

print("\nSoft assignment example (first point):")
print(f"  P(component 0) = {proba_gmm[0, 0]:.4f}")
print(f"  P(component 1) = {proba_gmm[0, 1]:.4f}")
if best_k_gmm > 2:
    print(f"  P(component 2) = {proba_gmm[0, 2]:.4f}")
```
PCA: Principal Component Analysis
PCA finds a new set of orthogonal axes (principal components) that capture maximum variance. The first component points in the direction of highest variance; the second is orthogonal to the first and captures the next highest variance; and so on.
Mathematically, PCA computes the eigenvectors of the covariance matrix. The eigenvalues indicate how much variance each component explains.
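That statement can be checked directly: eigendecomposing the covariance matrix reproduces sklearn's explained variances (the correlated synthetic data below is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Correlated 3-D data: iid normals pushed through a mixing matrix
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix (ddof=1, matching sklearn)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA().fit(X)
print("Eigenvalues:          ", eigvals.round(4))
print("sklearn explained var:", pca.explained_variance_.round(4))
```

The eigenvectors agree with `pca.components_` up to sign, since an eigenvector's direction is only defined up to a factor of -1.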
```python
# Digits dataset setup (defines the variables used below)
from sklearn.datasets import load_digits

digits = load_digits()
X_digits, y_digits = digits.data, digits.target
scaler_d = StandardScaler()
X_digits_s = scaler_d.fit_transform(X_digits)

# Full PCA to find how many components retain 95% of the variance
pca_full = PCA(random_state=42).fit(X_digits_s)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
n_for_95 = int(np.argmax(cumulative_variance >= 0.95) + 1)

# PCA for compression and reconstruction
n_components_95 = n_for_95
pca_95 = PCA(n_components=n_components_95, random_state=42)
X_compressed = pca_95.fit_transform(X_digits_s)
X_reconstructed = pca_95.inverse_transform(X_compressed)

# Reconstruction error
reconstruction_mse = np.mean((X_digits_s - X_reconstructed) ** 2)
print(f"\nReconstruction MSE with {n_components_95} components: "
      f"{reconstruction_mse:.4f}")
print(f"Compression ratio: {X_digits_s.shape[1]} features -> "
      f"{n_components_95} components "
      f"({n_components_95/X_digits_s.shape[1]:.1%} of original)")

# Visualise original vs reconstructed
fig, axes = plt.subplots(2, 8, figsize=(16, 4))
for i in range(8):
    axes[0, i].imshow(X_digits[i].reshape(8, 8), cmap='gray_r')
    axes[0, i].axis('off')
    axes[0, i].set_title(f'Original\n({digits.target[i]})', fontsize=8)
    axes[1, i].imshow(
        scaler_d.inverse_transform(X_reconstructed[i:i+1])[0].reshape(8, 8),
        cmap='gray_r')
    axes[1, i].axis('off')
    axes[1, i].set_title(f'n_comp={n_components_95}', fontsize=8)
plt.suptitle(f'PCA Reconstruction: {n_components_95} components '
             f'({cumulative_variance[n_components_95-1]:.1%} variance)',
             fontsize=12)
plt.tight_layout()
plt.savefig('pca_reconstruction.png', dpi=150, bbox_inches='tight')
plt.show()

# 2D PCA for visualisation
pca_2d = PCA(n_components=2, random_state=42)
X_2d = pca_2d.fit_transform(X_digits_s)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_digits, cmap='tab10',
                      s=8, alpha=0.6)
plt.colorbar(scatter, label='Digit')
plt.title(f'PCA 2D Projection of Digits\n'
          f'(explains {cumulative_variance[1]:.1%} of variance)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.tight_layout()
plt.savefig('pca_2d_digits.png', dpi=150, bbox_inches='tight')
plt.show()
```
t-SNE: Neighbourhood-Preserving Visualisation
t-SNE (t-Distributed Stochastic Neighbour Embedding) is designed for visualisation of high-dimensional data in 2D or 3D. It converts pairwise distances to probabilities and minimises the KL divergence between high-dimensional and low-dimensional probability distributions.
Critical caveat: t-SNE preserves local topology (nearby points remain nearby) but does NOT preserve global distances. Clusters that look far apart in a t-SNE plot may not actually be far apart in feature space. Never use t-SNE distances for downstream analysis.
```python
# t-SNE on digits (use PCA pre-reduction for speed — standard practice)
X_pca_50 = PCA(n_components=50, random_state=42).fit_transform(X_digits_s)

tsne = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=42,
            learning_rate='auto', init='pca')
X_tsne = tsne.fit_transform(X_pca_50)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_digits, cmap='tab10',
                      s=8, alpha=0.7)
plt.colorbar(scatter, label='Digit')
plt.title('t-SNE 2D Embedding of Digits (perplexity=30)')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.tight_layout()
plt.savefig('tsne_digits.png', dpi=150, bbox_inches='tight')
plt.show()

print("t-SNE caveats:")
print("  - NOT suitable for preserving global structure or distances")
print("  - Different runs produce different layouts (non-deterministic by default)")
print("  - Perplexity should be between 5 and 50; try several values")
print("  - Slow: O(n^2) naive; BH approximation reduces to O(n log n)")
```
UMAP: Better Global Structure with Speed
UMAP (Uniform Manifold Approximation and Projection) is a newer dimensionality reduction technique that often preserves both local and global structure better than t-SNE, while being significantly faster.
```python
try:
    import umap

    reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                        random_state=42)
    X_umap = reducer.fit_transform(X_digits_s)

    plt.figure(figsize=(8, 6))
    scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y_digits, cmap='tab10',
                          s=8, alpha=0.7)
    plt.colorbar(scatter, label='Digit')
    plt.title('UMAP 2D Embedding of Digits (n_neighbors=15, min_dist=0.1)')
    plt.xlabel('UMAP 1')
    plt.ylabel('UMAP 2')
    plt.tight_layout()
    plt.savefig('umap_digits.png', dpi=150, bbox_inches='tight')
    plt.show()

    print("UMAP advantages over t-SNE:")
    print("  - Faster (especially on large datasets)")
    print("  - Better global structure preservation")
    print("  - Can embed new points (transform() works on unseen data)")
    print("  - General dimensionality reduction, not just 2D visualisation")
except ImportError:
    print("UMAP not installed. Run: pip install umap-learn")
```
Choosing a Clustering Algorithm
```python
print("""Algorithm Selection Guide:
---
K-Means:
  Use when: clusters are roughly spherical and equal-sized, n is large
  Tune: k (elbow + silhouette), n_init=20, always scale features first
  Avoid: elongated/non-convex clusters, presence of many outliers

DBSCAN:
  Use when: clusters have arbitrary shapes, outlier detection needed
  Tune: eps (k-NN distance plot), min_samples (rule of thumb: >= 2*n_features)
  Avoid: widely varying cluster densities, very high-dimensional data

Hierarchical (Agglomerative):
  Use when: you need a hierarchy of clusters, no fixed k, small-medium n
  Tune: linkage method (Ward usually best), cutting threshold from dendrogram
  Avoid: very large n (O(n^2) memory for full linkage matrix)

GMM:
  Use when: you want soft cluster membership probabilities, clusters are elliptical
  Tune: n_components (BIC), covariance_type (full, tied, diag, spherical)
  Avoid: highly non-Gaussian cluster shapes""")
```
Key Takeaways
K-Means minimises within-cluster sum of squares via Lloyd's algorithm; K-Means++ initialisation avoids degenerate solutions; always scale features and use multiple restarts (n_init >= 10).
The elbow method (plot inertia vs k) and silhouette score (higher = better separation) are complementary tools for selecting k; per-sample silhouette plots reveal cluster quality at the individual point level.
DBSCAN discovers arbitrary-shaped clusters without specifying k; core/border/noise classification enables outlier detection; epsilon selection via k-NN distance plots; fails on datasets with varying cluster density.
Hierarchical clustering produces a dendrogram that can be cut at any level; Ward linkage minimises within-cluster variance at each merge and typically produces the most balanced clusters.
GMM provides soft (probabilistic) cluster assignments and handles elliptical clusters; BIC balances fit quality vs model complexity for selecting the number of components.
PCA finds orthogonal axes of maximum variance; scree plots and cumulative explained variance guide the number of components to retain; Kaiser rule (eigenvalue greater than 1) is a common heuristic.
t-SNE produces visually appealing 2D layouts but preserves only local topology — never interpret t-SNE distances as true feature-space distances; pre-reduce with PCA to 50 dimensions before t-SNE for speed.
UMAP is faster than t-SNE, preserves global structure better, and supports out-of-sample transformation — prefer it for most visualisation tasks on large or high-dimensional datasets.