Feature Stores: Bridging Training and Serving
One of the most insidious bugs in production ML is training-serving skew: when the features used during training differ from those used during inference. A model trained on "average order value over the last 30 days" might see a different computation of that feature at serving time due to different code paths, different data sources, or different timing semantics. Feature stores solve this by providing a single source of truth for feature definitions and values.
This lesson covers why feature stores exist, how to use Feast (the most popular open-source feature store), and best practices for feature engineering in production.
Training-Serving Skew

Skew creeps in whenever the training pipeline and the serving path compute "the same" feature through different code, different data sources, or different time semantics. The remedy is to define each feature once and reuse that single definition for both training and serving.
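To make the failure concrete, here is a hypothetical sketch (all names and data are invented for illustration) of two code paths that both claim to compute "average order value over the last 30 days" but disagree on whether the window includes the as-of day:

```python
import pandas as pd

# Toy order history for one customer (illustrative data).
orders = pd.DataFrame({
    "amount": [100.0, 200.0, 50.0],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-31"]),
})
as_of = pd.Timestamp("2024-01-31")

def avg_30d_training(df, as_of):
    # Training pipeline: strictly-before window (excludes the as-of day)
    window = df[(df.ts >= as_of - pd.Timedelta(days=30)) & (df.ts < as_of)]
    return window.amount.mean()

def avg_30d_serving(df, as_of):
    # Serving path: inclusive window (includes the as-of day)
    window = df[(df.ts >= as_of - pd.Timedelta(days=30)) & (df.ts <= as_of)]
    return window.amount.mean()

print(avg_30d_training(orders, as_of))  # 150.0  (100 and 200)
print(avg_30d_serving(orders, as_of))   # ~116.67 (100, 200, and 50)
```

Both functions are "correct" in isolation; the model simply never sees at serving time the distribution it was trained on. A feature store removes this class of bug by making one definition authoritative.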
What a Feature Store Does
A feature store has three core responsibilities:
1. Feature Registry
A central catalog of all features with metadata.

2. Offline Store (for training)

Historical feature values used for training and batch scoring.

3. Online Store (for serving)

Low-latency access to the latest feature values for real-time inference.

| Component | Purpose | Latency | Storage |
|---|---|---|---|
| Registry | Feature catalog & metadata | N/A | File/DB |
| Offline Store | Training data & historical lookups | Seconds-minutes | Warehouse |
| Online Store | Real-time serving | < 10ms | Key-value |
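Before diving into Feast, a toy in-memory version of these three components can make the division of labor concrete. The class and method names below are invented for illustration, not Feast's API:

```python
from dataclasses import dataclass, field

@dataclass
class ToyFeatureStore:
    """Illustrative only: registry + offline history + online key-value view."""
    registry: dict = field(default_factory=dict)   # feature name -> metadata
    offline: list = field(default_factory=list)    # append-only history
    online: dict = field(default_factory=dict)     # entity -> latest values

    def register(self, name, dtype, description=""):
        self.registry[name] = {"dtype": dtype, "description": description}

    def write(self, entity_id, values, ts):
        # Every write lands in the offline history...
        self.offline.append({"entity_id": entity_id, "ts": ts, **values})
        # ...and overwrites the online view with the latest values.
        self.online[entity_id] = values

    def get_online(self, entity_id):
        return self.online.get(entity_id)

store = ToyFeatureStore()
store.register("total_orders", "int64")
store.write("C001", {"total_orders": 5}, ts="2024-01-01")
store.write("C001", {"total_orders": 7}, ts="2024-01-02")
print(store.get_online("C001"))  # {'total_orders': 7}
print(len(store.offline))        # 2 historical records
```

The key design point: the online view is a projection of the offline history, never an independent computation, so training and serving can only disagree about staleness, not about semantics.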
```python
# === Feast Feature Store Setup ===
# This demonstrates the Feast workflow: define, materialize, serve

# --- Step 1: Define feature repository structure ---
# feast_repo/
#   feature_store.yaml    # Configuration
#   features.py           # Feature definitions
#   data/                 # Offline data source

# --- feature_store.yaml ---
feast_config = """
project: ml_platform
provider: local
registry: data/registry.db
online_store:
  type: sqlite
  path: data/online_store.db
offline_store:
  type: file
entity_key_serialization_version: 2
"""
print("=== feature_store.yaml ===")
print(feast_config)

# --- features.py ---
feature_definitions = """
from datetime import timedelta

import pandas as pd
from feast import (
    Entity, FeatureView, Field, FileSource,
    RequestSource, on_demand_feature_view,
)
from feast.types import Float32, Int64

# Entity: the "who" or "what" features describe
customer = Entity(
    name="customer_id",
    join_keys=["customer_id"],
    description="Unique customer identifier",
)

# Data source: where raw feature data lives
customer_stats_source = FileSource(
    path="data/customer_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

# Feature View: a group of related features from one source
customer_stats_fv = FeatureView(
    name="customer_stats",
    entities=[customer],
    ttl=timedelta(days=1),  # How stale can features be?
    schema=[
        Field(name="total_orders", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_order", dtype=Int64),
        Field(name="lifetime_value", dtype=Float32),
    ],
    source=customer_stats_source,
    online=True,  # Materialize to online store
)

# Feature View with on-demand transformations
input_request = RequestSource(
    name="order_amount",
    schema=[Field(name="current_order_amount", dtype=Float32)],
)

@on_demand_feature_view(
    sources=[customer_stats_fv, input_request],
    schema=[
        Field(name="order_vs_average", dtype=Float32),
        Field(name="is_high_value", dtype=Int64),
    ],
)
def order_context(inputs: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df["order_vs_average"] = (
        inputs["current_order_amount"] / inputs["avg_order_value"]
    )
    df["is_high_value"] = (inputs["lifetime_value"] > 500).astype(int)
    return df
"""
print("=== features.py ===")
print(feature_definitions)
```

Feast Workflow
1. Apply: Register features

```bash
feast apply
```

This reads your feature definitions and updates the registry.

2. Materialize: Populate online store

```bash
feast materialize 2024-01-01T00:00:00 2024-12-31T23:59:59
```

This copies the latest feature values from the offline store to the online store for low-latency serving.

3. Get Training Data (offline)

```python
training_df = store.get_historical_features(
    entity_df=entity_with_timestamps,
    features=["customer_stats:total_orders", "customer_stats:avg_order_value"],
).to_df()
```

Uses point-in-time joins to prevent data leakage.

4. Get Serving Data (online)

```python
features = store.get_online_features(
    features=["customer_stats:total_orders", "customer_stats:avg_order_value"],
    entity_rows=[{"customer_id": "C123"}],
).to_dict()
```
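The `store` object used in these snippets is a `feast.FeatureStore` client pointed at the repository directory containing `feature_store.yaml` (the path below is illustrative):

```python
from feast import FeatureStore

# Client for both offline and online retrieval; repo path is an assumption
# matching the layout sketched earlier.
store = FeatureStore(repo_path="feast_repo")
```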
Returns the latest feature values with sub-10ms latency.

```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# === Simulating Feast Operations ===
# (Full Feast requires infrastructure; this simulates the concepts)

np.random.seed(42)

# --- Simulate historical feature data ---
n_customers = 100
n_days = 90
records = []

for cid in range(n_customers):
    for day_offset in range(n_days):
        ts = datetime(2024, 1, 1) + timedelta(days=day_offset)
        records.append({
            "customer_id": f"C{cid:03d}",
            "event_timestamp": ts,
            "total_orders": int(np.random.poisson(5 + day_offset * 0.05)),
            "avg_order_value": round(np.random.normal(50, 15), 2),
            "days_since_last_order": int(np.random.exponential(7)),
            "lifetime_value": round(np.random.normal(300 + day_offset, 100), 2),
        })

feature_df = pd.DataFrame(records)
print(f"Feature store: {len(feature_df)} records, "
      f"{n_customers} customers, {n_days} days")
print(feature_df.head())

# --- Point-in-Time Join ---
def point_in_time_join(entity_df, feature_df,
                       entity_col="customer_id",
                       timestamp_col="event_timestamp"):
    """
    For each entity at a given timestamp, return the most recent
    feature values at or before that timestamp (no future leakage).
    """
    results = []
    for _, row in entity_df.iterrows():
        cid = row[entity_col]
        ts = row[timestamp_col]

        # Filter: same customer, timestamp at or before query time
        mask = (
            (feature_df[entity_col] == cid) &
            (feature_df[timestamp_col] <= ts)
        )
        matching = feature_df[mask]

        if len(matching) > 0:
            # Get the most recent record
            latest = matching.sort_values(timestamp_col).iloc[-1]
            result = row.to_dict()
            for col in ["total_orders", "avg_order_value",
                        "days_since_last_order", "lifetime_value"]:
                result[col] = latest[col]
            results.append(result)

    return pd.DataFrame(results)

# --- Create training entity DataFrame ---
# "What features did customer X have on date Y?"
training_entities = pd.DataFrame({
    "customer_id": ["C001", "C001", "C050", "C050", "C099"],
    "event_timestamp": [
        datetime(2024, 2, 1),
        datetime(2024, 3, 1),
        datetime(2024, 2, 15),
        datetime(2024, 3, 15),
        datetime(2024, 3, 30),
    ],
    "label": [1, 0, 1, 1, 0],
})

print("\n=== Training Entities ===")
print(training_entities)

# --- Point-in-time join ---
training_data = point_in_time_join(training_entities, feature_df)
print("\n=== Training Data (point-in-time joined) ===")
print(training_data)

# --- Online serving (latest features) ---
def get_online_features(feature_df, entity_ids,
                        entity_col="customer_id"):
    """Simulate online store: return latest features per entity."""
    results = []
    for cid in entity_ids:
        mask = feature_df[entity_col] == cid
        latest = feature_df[mask].sort_values(
            "event_timestamp").iloc[-1]
        results.append(latest.to_dict())
    return pd.DataFrame(results)

print("\n=== Online Features (latest) ===")
online = get_online_features(feature_df, ["C001", "C050"])
print(online[["customer_id", "total_orders",
              "avg_order_value", "lifetime_value"]])
```

Point-in-Time Correctness

A training set is only valid if every feature value was knowable at the label's timestamp; joining in the latest values instead silently leaks the future into training, inflating offline metrics that will not hold up in production.
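The loop-based join above is O(entities × history); pandas provides `merge_asof`, which implements the same at-or-before semantics in one vectorized call. The sketch below (synthetic data, illustrative names) also shows what a naive "latest value" join gets wrong: it pulls in a value from after the label timestamp.

```python
import pandas as pd

# Synthetic history: one customer's lifetime_value over time.
history = pd.DataFrame({
    "customer_id": ["C001"] * 3,
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "lifetime_value": [100.0, 200.0, 900.0],
})

# Training entity: label observed on Feb 15 -- before the jump to 900.
entities = pd.DataFrame({
    "customer_id": ["C001"],
    "event_timestamp": pd.to_datetime(["2024-02-15"]),
    "label": [1],
})

# Point-in-time join: latest value AT OR BEFORE each entity timestamp.
pit = pd.merge_asof(
    entities.sort_values("event_timestamp"),
    history.sort_values("event_timestamp"),
    on="event_timestamp",
    by="customer_id",
    direction="backward",
)
print(pit["lifetime_value"].iloc[0])  # 200.0 -- the Feb 1 value, no leakage

# Naive join: grab each customer's overall latest value -- leaks the future.
latest = (history.sort_values("event_timestamp")
          .groupby("customer_id", as_index=False).last())
naive = entities.merge(latest[["customer_id", "lifetime_value"]],
                       on="customer_id")
print(naive["lifetime_value"].iloc[0])  # 900.0 -- a value from March 1
```

This is exactly the join `get_historical_features` performs for you; the naive variant is what hand-rolled training pipelines often do by accident.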
Feature Engineering Best Practices
Naming Conventions
Use consistent, descriptive names:

{entity}_{feature}_{aggregation}_{window}

- customer_order_count_30d
- product_avg_rating_7d
- user_session_duration_median_24h

Feature Freshness
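A small helper can enforce a convention like this mechanically at registration time. The function and regex below are hypothetical, not part of Feast; the aggregation vocabulary is an assumption you would adapt to your own platform:

```python
import re

# Pattern for {entity}_{feature}_{aggregation}_{window}; the window suffix
# (e.g. 30d, 24h) is optional for point-in-time features.
FEATURE_NAME = re.compile(
    r"^[a-z]+_[a-z_]+_(count|sum|avg|min|max|median)(_\d+[dhm])?$"
)

def make_feature_name(entity, feature, aggregation, window=None):
    """Compose a feature name and reject anything off-convention."""
    parts = [entity, feature, aggregation]
    if window:
        parts.append(window)
    name = "_".join(parts)
    if not FEATURE_NAME.match(name):
        raise ValueError(f"non-conforming feature name: {name}")
    return name

print(make_feature_name("customer", "order", "count", "30d"))
# customer_order_count_30d
```

Centralizing name construction means a typo'd aggregation or a missing window unit fails loudly at definition time rather than producing a near-duplicate feature.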
Match freshness to business needs.

Feature Stores Comparison
| Feature Store | Type | Strengths |
|---|---|---|
| Feast | Open source | Simple, self-hosted, great for starting |
| Tecton | Managed | Production-grade, real-time transforms; commercial steward of Feast |
| Hopsworks | Open source / managed | Integrated with ML pipelines |
| Databricks Feature Store | Managed | Tight integration with Spark/Unity Catalog |
| AWS SageMaker Feature Store | Managed | Tight integration with AWS ecosystem |