
Feature Stores

What feature stores solve (training-serving skew), Feast (offline/online stores, feature services, materialization), Tecton, feature engineering best practices, and point-in-time correctness

~45 min

Feature Stores: Bridging Training and Serving

One of the most insidious bugs in production ML is training-serving skew: when the features used during training differ from those used during inference. A model trained on "average order value over the last 30 days" might see a different computation of that feature at serving time due to different code paths, different data sources, or different timing semantics. Feature stores solve this by providing a single source of truth for feature definitions and values.

This lesson covers why feature stores exist, how to use Feast (the most popular open-source feature store), and best practices for feature engineering in production.

Training-Serving Skew

Training-serving skew occurs when the feature values seen during training differ from those seen during serving. Common causes: (1) different code computes features for training vs serving, (2) data leakage from the future during training, (3) different data freshness between batch training and real-time serving. Feature stores eliminate skew by using the same feature definitions for both training and serving.
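To make cause (1) concrete, here is a minimal sketch (hypothetical feature, illustrative numbers) of how two independently written code paths can compute "average order value" differently and silently disagree:

```python
import pandas as pd

# Hypothetical order history for one customer (illustrative numbers)
orders = pd.DataFrame({
    "amount": [10.0, 60.0, 20.0],
    "days_ago": [5, 40, 2],
})

# Training path (batch pipeline): mean over ALL historical orders
train_avg = orders["amount"].mean()

# Serving path (written separately): mean over the last 30 days only
recent = orders[orders["days_ago"] <= 30]
serve_avg = recent["amount"].mean()

print(train_avg)  # 30.0
print(serve_avg)  # 15.0 -- same "feature" name, two different values
```

The model learned against one definition and is scored against another; a feature store prevents this by making both paths read the same definition.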

What a Feature Store Does

A feature store has three core responsibilities:

1. Feature Registry

A central catalog of all features with metadata:
  • Feature name, type, description, owner
  • How it is computed (transformation logic)
  • Data source and freshness requirements
  • Which models consume it
2. Offline Store (for training)

Historical feature values used for training and batch scoring:
  • Stores time-stamped feature values
  • Supports point-in-time joins: for a given entity at a given time, return the feature values that were available at that exact moment (no future leakage)
  • Backed by data warehouses: BigQuery, Snowflake, Redshift, Parquet files

3. Online Store (for serving)

Low-latency access to the latest feature values for real-time inference:
  • Stores only the most recent value per entity
  • Backed by key-value stores: Redis, DynamoDB, Bigtable
  • Sub-millisecond latency for production serving

| Component     | Purpose                            | Latency         | Storage   |
| ------------- | ---------------------------------- | --------------- | --------- |
| Registry      | Feature catalog & metadata         | N/A             | File/DB   |
| Offline Store | Training data & historical lookups | Seconds-minutes | Warehouse |
| Online Store  | Real-time serving                  | < 10 ms         | Key-value |

```python
# === Feast Feature Store Setup ===
# This demonstrates the Feast workflow: define, materialize, serve

# --- Step 1: Define feature repository structure ---
# feast_repo/
#   feature_store.yaml      # Configuration
#   features.py             # Feature definitions
#   data/                   # Offline data source

# --- feature_store.yaml ---
feast_config = """
project: ml_platform
provider: local
registry: data/registry.db
online_store:
  type: sqlite
  path: data/online_store.db
offline_store:
  type: file
entity_key_serialization_version: 2
"""
print("=== feature_store.yaml ===")
print(feast_config)

# --- features.py ---
feature_definitions = """
from datetime import timedelta

import pandas as pd

from feast import (
    Entity, FeatureView, Field, FileSource,
    RequestSource, on_demand_feature_view,
)
from feast.types import Float32, Int64

# Entity: the "who" or "what" features describe
customer = Entity(
    name="customer_id",
    description="Unique customer identifier",
)

# Data source: where raw feature data lives
customer_stats_source = FileSource(
    path="data/customer_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

# Feature View: a group of related features from one source
customer_stats_fv = FeatureView(
    name="customer_stats",
    entities=[customer],
    ttl=timedelta(days=1),  # How stale can features be?
    schema=[
        Field(name="total_orders", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_order", dtype=Int64),
        Field(name="lifetime_value", dtype=Float32),
    ],
    source=customer_stats_source,
    online=True,  # Materialize to online store
)

# Request source: values only known at request time
input_request = RequestSource(
    name="order_amount",
    schema=[Field(name="current_order_amount", dtype=Float32)],
)

# On-demand feature view: transformations computed at request time
@on_demand_feature_view(
    sources=[customer_stats_fv, input_request],
    schema=[
        Field(name="order_vs_average", dtype=Float32),
        Field(name="is_high_value", dtype=Int64),
    ],
)
def order_context(inputs: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df["order_vs_average"] = (
        inputs["current_order_amount"] / inputs["avg_order_value"]
    )
    df["is_high_value"] = (inputs["lifetime_value"] > 500).astype(int)
    return df
"""
print("=== features.py ===")
print(feature_definitions)
```

Feast Workflow

1. Apply: Register features

```shell
feast apply
```

This reads your feature definitions and updates the registry.

2. Materialize: Populate online store

```shell
feast materialize 2024-01-01T00:00:00 2024-12-31T23:59:59
```

This copies the latest feature values from the offline store to the online store for low-latency serving.

3. Get Training Data (offline)

```python
from feast import FeatureStore

store = FeatureStore(repo_path="feast_repo/")

training_df = store.get_historical_features(
    entity_df=entity_with_timestamps,
    features=["customer_stats:total_orders", "customer_stats:avg_order_value"],
).to_df()
```

Uses point-in-time joins to prevent data leakage.

4. Get Serving Data (online)

```python
features = store.get_online_features(
    features=["customer_stats:total_orders", "customer_stats:avg_order_value"],
    entity_rows=[{"customer_id": "C123"}],
).to_dict()
```

Returns the latest feature values with sub-10ms latency.

```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# === Simulating Feast Operations ===
# (Full Feast requires infrastructure; this simulates the concepts)

np.random.seed(42)

# --- Simulate historical feature data ---
n_customers = 100
n_days = 90
records = []

for cid in range(n_customers):
    for day_offset in range(n_days):
        ts = datetime(2024, 1, 1) + timedelta(days=day_offset)
        records.append({
            "customer_id": f"C{cid:03d}",
            "event_timestamp": ts,
            "total_orders": int(np.random.poisson(5 + day_offset * 0.05)),
            "avg_order_value": round(np.random.normal(50, 15), 2),
            "days_since_last_order": int(np.random.exponential(7)),
            "lifetime_value": round(np.random.normal(300 + day_offset, 100), 2),
        })

feature_df = pd.DataFrame(records)
print(f"Feature store: {len(feature_df)} records, "
      f"{n_customers} customers, {n_days} days")
print(feature_df.head())

# --- Point-in-Time Join ---
def point_in_time_join(entity_df, feature_df,
                       entity_col="customer_id",
                       timestamp_col="event_timestamp"):
    """
    For each entity at a given timestamp, return the most recent
    feature values at or before that timestamp (no future leakage).
    """
    results = []
    for _, row in entity_df.iterrows():
        cid = row[entity_col]
        ts = row[timestamp_col]

        # Filter: same customer, timestamp at or before query time
        mask = (
            (feature_df[entity_col] == cid) &
            (feature_df[timestamp_col] <= ts)
        )
        matching = feature_df[mask]

        if len(matching) > 0:
            # Get the most recent record
            latest = matching.sort_values(timestamp_col).iloc[-1]
            result = row.to_dict()
            for col in ["total_orders", "avg_order_value",
                        "days_since_last_order", "lifetime_value"]:
                result[col] = latest[col]
            results.append(result)

    return pd.DataFrame(results)

# --- Create training entity DataFrame ---
# "What features did customer X have on date Y?"
training_entities = pd.DataFrame({
    "customer_id": ["C001", "C001", "C050", "C050", "C099"],
    "event_timestamp": [
        datetime(2024, 2, 1),
        datetime(2024, 3, 1),
        datetime(2024, 2, 15),
        datetime(2024, 3, 15),
        datetime(2024, 3, 30),
    ],
    "label": [1, 0, 1, 1, 0],
})

print("\n=== Training Entities ===")
print(training_entities)

# --- Point-in-time join ---
training_data = point_in_time_join(training_entities, feature_df)
print("\n=== Training Data (point-in-time joined) ===")
print(training_data)

# --- Online serving (latest features) ---
def get_online_features(feature_df, entity_ids,
                        entity_col="customer_id"):
    """Simulate online store: return latest features per entity."""
    results = []
    for cid in entity_ids:
        mask = feature_df[entity_col] == cid
        latest = feature_df[mask].sort_values(
            "event_timestamp").iloc[-1]
        results.append(latest.to_dict())
    return pd.DataFrame(results)

print("\n=== Online Features (latest) ===")
online = get_online_features(feature_df, ["C001", "C050"])
print(online[["customer_id", "total_orders",
              "avg_order_value", "lifetime_value"]])
```

Point-in-Time Correctness

Point-in-time correctness is the most critical concept in feature stores. When building training data, you must use only feature values that were available at the time the label was observed. Using future feature values creates data leakage and makes your model appear more accurate during training than it will be in production. Feature stores enforce this automatically.
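Feast performs this lookup for you; as a sketch of the mechanics, pandas' `merge_asof` implements the same backward-looking join directly (toy entity and feature values, illustrative only):

```python
import pandas as pd

# Toy feature history for one entity (illustrative values)
features = pd.DataFrame({
    "customer_id": ["C001"] * 3,
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "total_orders": [3, 7, 12],
})

# Label observed on 2024-02-15: only values known by then are legal
entities = pd.DataFrame({
    "customer_id": ["C001"],
    "event_timestamp": pd.to_datetime(["2024-02-15"]),
})

# merge_asof picks the most recent feature row at or before each label time
# (direction="backward" is the default)
correct = pd.merge_asof(
    entities.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="customer_id",
)
print(correct["total_orders"].item())  # 7, not the future value 12
```

A naive join on the latest row would instead pick up the 2024-03-01 value, leaking information from after the label was observed.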

Feature Engineering Best Practices

Naming Conventions

Use consistent, descriptive names: {entity}_{feature}_{aggregation}_{window}
  • customer_order_count_30d
  • product_avg_rating_7d
  • user_session_duration_median_24h
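A small helper can enforce the pattern so names never drift between teams; this is a hypothetical utility, not part of Feast:

```python
# Hypothetical helper enforcing the {entity}_{feature}_{aggregation}_{window} pattern
def feature_name(entity: str, feature: str,
                 aggregation: str = "", window: str = "") -> str:
    """Join the non-empty parts with underscores, in canonical order."""
    parts = [entity, feature, aggregation, window]
    return "_".join(p for p in parts if p)

print(feature_name("customer", "order", "count", "30d"))   # customer_order_count_30d
print(feature_name("product", "avg_rating", window="7d"))  # product_avg_rating_7d
```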
Feature Freshness

Match freshness to business needs:
  • Real-time (seconds): Fraud detection, pricing
  • Near real-time (minutes): Recommendations, search ranking
  • Batch (hours/daily): Reporting, risk scoring
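These tiers translate naturally into per-feature TTLs, like the `ttl=` argument on a Feast FeatureView. A sketch with assumed values (tune per use case):

```python
from datetime import timedelta

# Illustrative TTLs per freshness tier (values are assumptions, not defaults)
FRESHNESS_TTLS = {
    "real_time": timedelta(seconds=30),      # fraud detection, pricing
    "near_real_time": timedelta(minutes=5),  # recommendations, search ranking
    "batch": timedelta(days=1),              # reporting, risk scoring
}

def is_fresh(age: timedelta, tier: str) -> bool:
    """A feature older than its tier's TTL should not be served."""
    return age <= FRESHNESS_TTLS[tier]

print(is_fresh(timedelta(minutes=2), "near_real_time"))  # True
print(is_fresh(timedelta(hours=3), "near_real_time"))    # False
```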
Feature Stores Comparison

| Feature Store                | Type                     | Strengths                                   |
| ---------------------------- | ------------------------ | ------------------------------------------- |
| Feast                        | Open source              | Simple, self-hosted, great for starting     |
| Tecton                       | Managed (built on Feast) | Production-grade, real-time transforms      |
| Hopsworks                    | Open source / managed    | Integrated with ML pipelines                |
| Databricks Feature Store     | Managed                  | Tight integration with Spark/Unity Catalog  |
| AWS SageMaker Feature Store  | Managed                  | Tight integration with AWS ecosystem        |