# Collaborative Filtering
Collaborative filtering (CF) is the most widely used recommendation technique. It predicts a user's interests by collecting preferences from many users -- the core assumption is that users who agreed in the past will tend to agree in the future.
## Two Flavors of CF
1. **User-Based CF**: Find users similar to the target user, then recommend items those similar users liked
2. **Item-Based CF**: Find items similar to items the target user liked, then recommend those similar items
## Explicit vs. Implicit Feedback
| Explicit Feedback | Implicit Feedback |
|---|---|
| Star ratings, thumbs up/down, reviews | Clicks, views, purchases, time spent |
| Clear signal of preference | Must infer preference from behavior |
| Sparse (users rarely rate) | Dense (every interaction is data) |
| May contain bias (rating inflation) | No negative signal (absence != dislike) |
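For illustration, a minimal sketch of turning an implicit event log into a weighted preference matrix. The event names and weights here are made up; real systems tune these empirically:

```python
import numpy as np

# Hypothetical interaction log: (user, item, event_type)
events = [
    (0, 2, "view"), (0, 2, "click"), (0, 2, "purchase"),
    (1, 0, "view"), (1, 3, "click"),
    (2, 1, "view"), (2, 1, "view"),
]

# Assumed weights: stronger actions imply stronger preference
weights = {"view": 1.0, "click": 2.0, "purchase": 5.0}

n_users, n_items = 3, 4
implicit = np.zeros((n_users, n_items))
for user, item, event in events:
    implicit[user, item] += weights[event]

print(implicit)
# Note: a zero here means "no observed interaction", not dislike
```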
```python
import numpy as np

# Example user-item rating matrix (0 = unrated)
# Rows = users, Columns = items (movies)
ratings = np.array([
    [5, 3, 0, 1, 4],  # User 0
    [4, 0, 0, 1, 2],  # User 1
    [1, 1, 0, 5, 4],  # User 2
    [0, 0, 5, 4, 0],  # User 3
    [0, 3, 4, 0, 3],  # User 4
])

movie_names = ["Action_1", "Comedy_1", "Drama_1", "SciFi_1", "Comedy_2"]
n_users, n_items = ratings.shape

print(f"Rating matrix: {n_users} users x {n_items} items")
print(f"Total possible ratings: {n_users * n_items}")
print(f"Observed ratings: {np.count_nonzero(ratings)}")
print(f"Sparsity: {1 - np.count_nonzero(ratings) / (n_users * n_items):.1%}")
```

## Similarity Metrics
The heart of collaborative filtering is measuring how similar two users (or items) are. The two most common metrics:
### Cosine Similarity

Measures the cosine of the angle between two vectors. Ranges from -1 to 1 (or 0 to 1 for non-negative ratings):

sim(a, b) = (a . b) / (||a|| * ||b||)
### Pearson Correlation

Centers the vectors by subtracting each user's mean rating before computing cosine similarity:

sim(a, b) = sum((a_i - mean_a)(b_i - mean_b)) / sqrt(sum((a_i - mean_a)^2) * sum((b_i - mean_b)^2))
### Adjusted Cosine (for Item-Based)

Subtracts the user mean from each rating before computing item similarity. This accounts for the fact that different users have different rating baselines.

```python
import numpy as np

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors, considering only co-rated items."""
    # Only consider items both users have rated
    mask = (a > 0) & (b > 0)
    if mask.sum() == 0:
        return 0.0
    a_filtered = a[mask]
    b_filtered = b[mask]
    dot = np.dot(a_filtered, b_filtered)
    norm = np.linalg.norm(a_filtered) * np.linalg.norm(b_filtered)
    return dot / norm if norm > 0 else 0.0


def pearson_correlation(a, b):
    """Compute Pearson correlation between two users on co-rated items."""
    mask = (a > 0) & (b > 0)
    if mask.sum() < 2:  # Need at least 2 co-rated items
        return 0.0
    a_filtered = a[mask]
    b_filtered = b[mask]
    a_centered = a_filtered - a_filtered.mean()
    b_centered = b_filtered - b_filtered.mean()
    num = np.dot(a_centered, b_centered)
    den = np.linalg.norm(a_centered) * np.linalg.norm(b_centered)
    return num / den if den > 0 else 0.0


# Compute user-user similarity matrix
ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
])
n_users = ratings.shape[0]

print("=== Cosine Similarity ===")
cos_sim = np.zeros((n_users, n_users))
for i in range(n_users):
    for j in range(n_users):
        cos_sim[i, j] = cosine_similarity(ratings[i], ratings[j])
print(np.round(cos_sim, 3))

print("\n=== Pearson Correlation ===")
pear_sim = np.zeros((n_users, n_users))
for i in range(n_users):
    for j in range(n_users):
        pear_sim[i, j] = pearson_correlation(ratings[i], ratings[j])
print(np.round(pear_sim, 3))
```

## User-Based CF: Predicting Ratings
Once we have a similarity matrix, we predict ratings using a weighted average of similar users' ratings:
pred(u, i) = mean_u + sum(sim(u, v) * (r_vi - mean_v)) / sum(|sim(u, v)|)
where the sum is over neighbors v who have rated item i.
Steps:

1. Compute similarity between the target user and all other users
2. Select the top-K most similar users (neighbors) who have rated the target item
3. Compute a weighted average of their ratings, adjusting for each user's mean

## Item-Based CF
Instead of finding similar users, we find similar items. Item-based CF is often preferred in production because item-item similarities tend to be more stable over time and can be precomputed offline. The prediction is a weighted average of the user's own ratings on similar items:
pred(u, i) = sum(sim(i, j) * r_uj) / sum(|sim(i, j)|)
where the sum is over items j that user u has rated and are similar to item i.
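A minimal sketch of the item-based formula above, using adjusted cosine for the item-item similarity. Function names are illustrative; a production system would precompute the full similarity matrix offline:

```python
import numpy as np

def adjusted_cosine_item_sim(ratings, i, j):
    """Adjusted cosine: subtract each user's mean rating, then
    compute cosine over users who rated both items i and j."""
    mask = (ratings[:, i] > 0) & (ratings[:, j] > 0)
    if mask.sum() == 0:
        return 0.0
    # Each user's mean over that user's rated items
    means = np.array([row[row > 0].mean() for row in ratings])
    a = ratings[mask, i] - means[mask]
    b = ratings[mask, j] - means[mask]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def predict_item_based(ratings, user, item):
    """Weighted average of the user's own ratings on items similar to `item`."""
    num = den = 0.0
    for j in range(ratings.shape[1]):
        if j != item and ratings[user, j] > 0:
            s = adjusted_cosine_item_sim(ratings, item, j)
            num += s * ratings[user, j]
            den += abs(s)
    return num / den if den > 0 else 0.0

ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
], dtype=float)

print(f"Predicted rating for User 1 on item 2: {predict_item_based(ratings, 1, 2):.2f}")
```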
```python
import numpy as np

def predict_user_based(ratings, user_sim, target_user, target_item, k=3):
    """
    Predict rating for target_user on target_item using user-based CF.

    Uses top-k most similar users who have rated the target item.
    """
    # User means (only over rated items)
    user_means = np.array([
        ratings[u][ratings[u] > 0].mean() if (ratings[u] > 0).any() else 0
        for u in range(ratings.shape[0])
    ])

    target_mean = user_means[target_user]

    # Find users who rated this item
    rated_mask = ratings[:, target_item] > 0
    rated_mask[target_user] = False  # Exclude the target user

    candidates = np.where(rated_mask)[0]
    if len(candidates) == 0:
        return target_mean  # Fallback to user mean

    # Get similarities and select top-k
    sims = user_sim[target_user, candidates]
    top_k_idx = np.argsort(-np.abs(sims))[:k]
    neighbors = candidates[top_k_idx]

    # Weighted average
    numerator = 0.0
    denominator = 0.0
    for v in neighbors:
        sim = user_sim[target_user, v]
        numerator += sim * (ratings[v, target_item] - user_means[v])
        denominator += abs(sim)

    if denominator == 0:
        return target_mean

    return target_mean + numerator / denominator


# Using our previous ratings and similarity
ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
])

# Compute Pearson similarity
n_users = ratings.shape[0]
user_sim = np.zeros((n_users, n_users))
for i in range(n_users):
    for j in range(n_users):
        if i == j:
            user_sim[i, j] = 1.0
        else:
            mask = (ratings[i] > 0) & (ratings[j] > 0)
            if mask.sum() >= 2:
                a = ratings[i][mask] - ratings[i][ratings[i] > 0].mean()
                b = ratings[j][mask] - ratings[j][ratings[j] > 0].mean()
                d = np.linalg.norm(a) * np.linalg.norm(b)
                user_sim[i, j] = np.dot(a, b) / d if d > 0 else 0

# Predict: What would User 1 rate Movie 2 (Drama_1)?
pred = predict_user_based(ratings, user_sim, target_user=1, target_item=2, k=3)
print(f"Predicted rating for User 1 on Drama_1: {pred:.2f}")

# Predict all missing ratings for User 1
print("\nUser 1 predicted ratings:")
for item in range(ratings.shape[1]):
    if ratings[1, item] == 0:
        p = predict_user_based(ratings, user_sim, 1, item, k=3)
        print(f"  Item {item}: {p:.2f}")
    else:
        print(f"  Item {item}: {ratings[1, item]:.1f} (actual)")
```

## Matrix Factorization (SVD & ALS)
Memory-based CF (user-based/item-based) does not scale well to millions of users and items. Matrix factorization offers a scalable alternative by learning latent factors that explain the observed ratings.
The idea: decompose the user-item rating matrix R into two lower-rank matrices:

R ≈ U * V^T

where U is an (n_users × k) matrix of user latent factors and V is an (n_items × k) matrix of item latent factors, with k much smaller than the number of users or items.
### SVD (Singular Value Decomposition)

A classic linear-algebra approach. For a complete matrix:

R = U * Σ * V^T

But rating matrices are sparse (mostly missing entries), so in practice we use truncated SVD or a regularized variant that fits only the observed entries.
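As a sketch, one simple (and biased) workaround is to impute each user's missing ratings with that user's mean, then apply truncated SVD via NumPy; this is an illustration of the decomposition, not a production recipe:

```python
import numpy as np

R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
], dtype=float)

# Crude workaround for missing data: fill each user's unrated
# items with that user's mean rating (this biases the result)
filled = R.copy()
for u in range(R.shape[0]):
    rated = R[u] > 0
    filled[u, ~rated] = R[u, rated].mean()

# Truncated SVD: keep only the top-k singular values
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2
R_approx = (U[:, :k] * s[:k]) @ Vt[:k, :]
print(np.round(R_approx, 2))
```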
### ALS (Alternating Least Squares)

1. Initialize U and V randomly
2. Fix V, solve for U (a least-squares problem)
3. Fix U, solve for V (a least-squares problem)
4. Repeat until convergence

ALS is popular because each alternating step has a closed-form solution, the per-user and per-item solves parallelize trivially, and it extends naturally to implicit-feedback data.
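The four steps above can be sketched as a minimal ALS loop (the function name `als` is illustrative; a real system would use a library such as `implicit` or Spark MLlib):

```python
import numpy as np

def als(R, k=2, reg=0.1, iters=20, seed=0):
    """Alternating least squares on observed entries only.

    Each alternating step is a ridge regression solved in closed form."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(0, 0.1, (n_users, k))
    V = rng.normal(0, 0.1, (n_items, k))

    for _ in range(iters):
        # Fix V, solve for each user's factors
        for u in range(n_users):
            idx = R[u] > 0
            Vu = V[idx]
            A = Vu.T @ Vu + reg * np.eye(k)
            b = Vu.T @ R[u, idx]
            U[u] = np.linalg.solve(A, b)
        # Fix U, solve for each item's factors
        for i in range(n_items):
            idx = R[:, i] > 0
            Ui = U[idx]
            A = Ui.T @ Ui + reg * np.eye(k)
            b = Ui.T @ R[idx, i]
            V[i] = np.linalg.solve(A, b)
    return U, V

R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
], dtype=float)

U, V = als(R, k=2)
pred = U @ V.T
print(np.round(pred, 1))
```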
```python
import numpy as np

def matrix_factorization_sgd(R, k=2, lr=0.01, reg=0.02, epochs=100):
    """
    Learn latent factors using Stochastic Gradient Descent.

    Minimizes: sum over observed (u,i) of (R_ui - U_u . V_i)^2
               + reg * (||U||^2 + ||V||^2)
    """
    n_users, n_items = R.shape
    # Initialize latent factors
    U = np.random.normal(0, 0.1, (n_users, k))
    V = np.random.normal(0, 0.1, (n_items, k))
    # Bias terms
    bu = np.zeros(n_users)
    bi = np.zeros(n_items)
    mu = R[R > 0].mean()  # Global mean

    # Get observed entries
    observed = list(zip(*np.where(R > 0)))

    losses = []
    for epoch in range(epochs):
        np.random.shuffle(observed)
        total_loss = 0

        for u, i in observed:
            # Prediction
            pred = mu + bu[u] + bi[i] + np.dot(U[u], V[i])
            error = R[u, i] - pred

            # Update biases
            bu[u] += lr * (error - reg * bu[u])
            bi[i] += lr * (error - reg * bi[i])

            # Update latent factors
            U_old = U[u].copy()
            U[u] += lr * (error * V[i] - reg * U[u])
            V[i] += lr * (error * U_old - reg * V[i])

            total_loss += error ** 2 + reg * (
                np.sum(U[u] ** 2) + np.sum(V[i] ** 2) +
                bu[u] ** 2 + bi[i] ** 2
            )

        losses.append(total_loss / len(observed))
        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1}: loss = {losses[-1]:.4f}")

    return U, V, bu, bi, mu


# Example
R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
    [0, 3, 4, 0, 3],
], dtype=float)

U, V, bu, bi, mu = matrix_factorization_sgd(R, k=3, lr=0.01, reg=0.02, epochs=100)

# Predict all ratings
R_pred = mu + bu[:, None] + bi[None, :] + U @ V.T
print("\nOriginal matrix:")
print(R.astype(int))
print("\nPredicted matrix (rounded):")
print(np.round(R_pred, 1))
print("\nPredicted missing entries:")
for u in range(R.shape[0]):
    for i in range(R.shape[1]):
        if R[u, i] == 0:
            print(f"  User {u}, Item {i}: {R_pred[u, i]:.2f}")
```