Clustering & Dimensionality Reduction
Unsupervised learning finds patterns in data without labels. The two main tasks are:
- Clustering: grouping similar data points together (e.g., K-Means, DBSCAN)
- Dimensionality reduction: compressing many features into fewer while preserving structure (e.g., PCA, t-SNE)
These techniques are essential for exploratory data analysis, visualization, preprocessing, and discovering hidden patterns.
K-Means Clustering
K-Means is the most widely used clustering algorithm. It partitions data into K clusters by:
1. Initialize K random centroids
2. Assign each point to the nearest centroid
3. Recalculate each centroid as the mean of its assigned points
4. Repeat steps 2-3 until convergence
K-Means is fast and works well on spherical, evenly-sized clusters. But you need to choose K in advance.
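The four steps above can be sketched in plain NumPy. This is an illustrative toy implementation (the function name `kmeans` and its defaults are ours, not sklearn's), not the optimized algorithm scikit-learn uses:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=42):
    """Toy K-Means: the four steps from the text, in plain NumPy."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize K centroids by picking K distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumes no cluster goes empty, which real implementations must handle)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On two well-separated blobs this recovers the grouping; sklearn's version adds smarter initialization (k-means++), multiple restarts, and empty-cluster handling.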
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np

# Generate synthetic clustered data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans.fit(X)

print(f"Cluster centers:\n{kmeans.cluster_centers_}")
print(f"Labels (first 10): {kmeans.labels_[:10]}")
print(f"Inertia (sum of squared distances): {kmeans.inertia_:.2f}")
```

The Elbow Method: Choosing K
The elbow method runs K-Means for a range of K values and plots the inertia (sum of squared distances from each point to its centroid). The "elbow" where the curve bends is a good choice for K -- beyond that point, adding more clusters gives diminishing returns.
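The bend can also be picked programmatically. One rough heuristic: the elbow is where a large inertia drop is followed by a small one, i.e. the largest ratio of successive drops. A sketch (the helper `pick_elbow` is our own, not part of sklearn):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_
            for k in ks]

def pick_elbow(ks, inertias):
    # The elbow sits where a big inertia drop is followed by a small one,
    # so take the K with the largest ratio of consecutive drops.
    drops = -np.diff(inertias)
    ratios = drops[:-1] / drops[1:]
    return list(ks)[int(ratios.argmax()) + 1]

print(f"Suggested K: {pick_elbow(ks, inertias)}")
```

This is only a heuristic: on noisy curves (or when inertia barely changes between large K values) the ratio can spike spuriously, so it should confirm, not replace, looking at the plot.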
```python
from sklearn.cluster import KMeans
import numpy as np

# Try different values of K (X is the make_blobs data from above)
inertias = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

# Print inertia values to find the elbow
print("K | Inertia")
print("-" * 25)
for k, inertia in zip(K_range, inertias):
    bar = "=" * int(inertia / max(inertias) * 40)
    print(f"{k:2d} | {inertia:8.1f} {bar}")
```

Silhouette Score: Validating Clusters
The silhouette score measures how similar a point is to its own cluster compared to other clusters. It ranges from -1 to 1:
- Near +1: the point is well matched to its own cluster and far from neighboring clusters
- Near 0: the point sits on or near the boundary between two clusters
- Negative: the point is probably assigned to the wrong cluster

Averaging the score over all points gives a single measure of clustering quality, which makes it useful for comparing different values of K.
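Concretely, for a point i the score is s(i) = (b - a) / max(a, b), where a is the mean distance to the other points in i's own cluster and b is the mean distance to the nearest other cluster. A hand-rolled sketch of this formula (the helper `silhouette_one` is ours), checked against sklearn's `silhouette_samples`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

def silhouette_one(X, labels, i):
    own = labels[i]
    dists = np.linalg.norm(X - X[i], axis=1)
    # a: mean distance to the other points in i's own cluster
    # (the sum includes dist(i, i) == 0, so divide by count - 1)
    same = labels == own
    a = dists[same].sum() / (same.sum() - 1)
    # b: mean distance to the closest foreign cluster
    b = min(dists[labels == c].mean() for c in set(labels) if c != own)
    return (b - a) / max(a, b)

mine = silhouette_one(X, labels, 0)
ref = silhouette_samples(X, labels)[0]
print(f"manual: {mine:.6f}  sklearn: {ref:.6f}")
```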
```python
from sklearn.metrics import silhouette_score

# Compute silhouette score for different K
print("K | Silhouette Score")
print("-" * 30)
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    score = silhouette_score(X, labels)
    bar = "=" * int(score * 40)
    print(f"{k:2d} | {score:.4f} {bar}")
```

K-Means Assumptions and Limitations

K-Means works best when its assumptions hold, and degrades when they don't:
- Clusters are roughly spherical and similar in size and density
- K must be chosen in advance
- Every point gets assigned to a cluster, so outliers distort the centroids
- Results depend on initialization (mitigated by the n_init restarts)
DBSCAN: Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters based on density rather than distance to centroids. It has two key parameters:
- eps: the maximum distance at which two points count as neighbors
- min_samples: the minimum number of neighbors a point needs to be a "core" point

DBSCAN's advantages:
- No need to choose the number of clusters in advance
- Finds arbitrarily shaped clusters, not just spherical ones
- Labels low-density points as noise (label -1) instead of forcing them into a cluster
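A common heuristic for picking eps is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbor and look for the knee where the curve turns sharply upward. A sketch using sklearn's `NearestNeighbors` (printing percentiles in place of the usual plot):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
X = StandardScaler().fit_transform(X)

min_samples = 5
# Distance from each point to its min_samples-th nearest neighbor
# (kneighbors counts the point itself, hence min_samples + 1)
nn = NearestNeighbors(n_neighbors=min_samples + 1).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])

# Coarse profile of the sorted k-distance curve; eps is usually chosen
# near the knee, where the distances start climbing steeply
for q in [0.50, 0.90, 0.95, 0.99]:
    print(f"{int(q * 100):2d}th percentile k-distance: "
          f"{k_dist[int(q * len(k_dist)) - 1]:.3f}")
```

Points whose k-distance lies well above the knee are exactly the ones DBSCAN will tend to mark as noise.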
```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Generate non-spherical clusters that K-Means struggles with
X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)
X_moons_scaled = StandardScaler().fit_transform(X_moons)

# DBSCAN clustering
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_moons_scaled)

n_clusters = len(set(labels) - {-1})
n_noise = list(labels).count(-1)

print(f"Number of clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
print(f"Labels (unique): {sorted(set(labels))}")

# Compare with K-Means on the same data
km = KMeans(n_clusters=2, random_state=42, n_init=10)
km_labels = km.fit_predict(X_moons_scaled)
print(f"\nK-Means silhouette: {silhouette_score(X_moons_scaled, km_labels):.4f}")
print(f"DBSCAN silhouette: {silhouette_score(X_moons_scaled, labels):.4f}")
```

Note that DBSCAN recovers the true moon shapes here, yet its silhouette score can come out lower than K-Means': the silhouette metric assumes compact, convex clusters, so it penalizes exactly the elongated shapes DBSCAN is designed to find.

When to Use DBSCAN over K-Means
- Clusters have irregular, non-spherical shapes
- The number of clusters is unknown
- The data contains noise or outliers that shouldn't be forced into a cluster
- Density defines the clusters (though DBSCAN struggles when cluster densities vary widely, and eps is sensitive to feature scaling)
PCA: Principal Component Analysis
PCA is the most popular dimensionality reduction technique. It finds new axes (principal components) that capture the maximum variance in the data.
Key ideas:
- Standardize features first, so large-scale features don't dominate the variance
- Components are orthogonal directions, ordered by how much variance they explain
- Keeping only the top components compresses the data while discarding mostly noise
```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

# Load the digits dataset (64 features = 8x8 pixel images)
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Reduce from 64 dimensions to 2 for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(f"Original shape: {X_scaled.shape}")
print(f"Reduced shape: {X_2d.shape}")
print(f"Variance explained by 2 components: {sum(pca.explained_variance_ratio_):.4f}")

# How many components to retain 95% of the variance?
pca_full = PCA(n_components=0.95)  # Keep 95% of variance
X_95 = pca_full.fit_transform(X_scaled)
print(f"Components for 95% variance: {pca_full.n_components_}")

# Show cumulative explained variance
pca_all = PCA().fit(X_scaled)
cumvar = pca_all.explained_variance_ratio_.cumsum()
for n_comp in [2, 5, 10, 20, 30]:
    print(f"  {n_comp} components: {cumvar[n_comp-1]:.4f} variance explained")
```

What PCA Maximizes

The first principal component is the direction along which the projected data has maximum variance; each subsequent component maximizes the remaining variance while staying orthogonal to the previous ones. Equivalently, the components are the eigenvectors of the data's covariance matrix, and projecting onto the top components minimizes squared reconstruction error among all linear projections of that dimension.
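The eigenvector view is easy to verify numerically: the eigenvalues of the covariance matrix should equal the per-component variances sklearn reports. A sketch with NumPy (note each component's sign is arbitrary, so we compare variances rather than the vectors themselves):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Eigendecomposition of the (n-1)-normalized covariance matrix
cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigh(cov)[0][::-1]  # eigh returns ascending; flip to descending

pca = PCA().fit(X)

# Covariance eigenvalues == sklearn's explained_variance_
print("eigvals (top 5): ", np.round(eigvals[:5], 4))
print("sklearn (top 5): ", np.round(pca.explained_variance_[:5], 4))
print("match:", np.allclose(eigvals[:10], pca.explained_variance_[:10]))
```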
t-SNE and UMAP: Nonlinear Visualization
PCA is linear -- it can only find straight-line patterns. For complex datasets, nonlinear methods often produce better visualizations.
t-SNE (t-distributed Stochastic Neighbor Embedding)

t-SNE preserves local neighborhoods: points that are close in the original space stay close in the embedding. It is excellent for visualization, but it is slow on large datasets, its output depends on the perplexity parameter and random seed, and distances between clusters in the plot are not meaningful.

UMAP (Uniform Manifold Approximation and Projection)

UMAP is a newer alternative that is typically much faster than t-SNE and tends to preserve more global structure. It is not part of scikit-learn; it ships in the third-party umap-learn package with a similar fit_transform API.
```python
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

# Load digits dataset
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# t-SNE reduction to 2D (the iteration cap defaults to 1000;
# it is n_iter in older scikit-learn, max_iter since 1.5)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

print(f"t-SNE output shape: {X_tsne.shape}")
print(f"t-SNE KL divergence: {tsne.kl_divergence_:.4f}")

# Show cluster separation for each digit
for digit in range(10):
    mask = y == digit
    center = X_tsne[mask].mean(axis=0)
    print(f"  Digit {digit}: center at ({center[0]:.1f}, {center[1]:.1f})")
```