
Feature Stores

What feature stores solve (training-serving skew), Feast (offline/online stores, feature services, materialization), Tecton, feature engineering best practices, and point-in-time correctness

~45 min

Feature Stores: Bridging Training and Serving

One of the most insidious bugs in production ML is training-serving skew: when the features used during training differ from those used during inference. A model trained on "average order value over the last 30 days" might see a different computation of that feature at serving time due to different code paths, different data sources, or different timing semantics. Feature stores solve this by providing a single source of truth for feature definitions and values.

This lesson covers why feature stores exist, how to use Feast (the most popular open-source feature store), and best practices for feature engineering in production.

Training-Serving Skew

Training-serving skew occurs when the feature values seen during training differ from those seen during serving. Common causes: (1) different code computes features for training vs serving, (2) data leakage from the future during training, (3) different data freshness between batch training and real-time serving. Feature stores eliminate skew by using the same feature definitions for both training and serving.
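To make cause (1) concrete, here is a minimal sketch (hypothetical feature, illustrative numbers) of how two independently written code paths can compute "average order value" differently and silently disagree:

```python
import pandas as pd

# Hypothetical order history for one customer (illustrative numbers)
orders = pd.DataFrame({
    "amount": [10.0, 60.0, 20.0],
    "days_ago": [5, 40, 2],
})

# Training path (batch pipeline): mean over ALL historical orders
train_avg = orders["amount"].mean()

# Serving path (written separately): mean over the last 30 days only
recent = orders[orders["days_ago"] <= 30]
serve_avg = recent["amount"].mean()

print(train_avg)  # 30.0
print(serve_avg)  # 15.0 -- same "feature" name, two different values
```

The model learned against one definition and is scored against another; a feature store prevents this by making both paths read the same definition.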

What a Feature Store Does

A feature store has three core responsibilities:

1. Feature Registry

A central catalog of all features with metadata:
  • Feature name, type, description, owner
  • How it is computed (transformation logic)
  • Data source and freshness requirements
  • Which models consume it
2. Offline Store (for training)

Historical feature values used for training and batch scoring:
  • Stores time-stamped feature values
  • Supports point-in-time joins: for a given entity at a given time, return the feature values that were available at that exact moment (no future leakage)
  • Backed by data warehouses: BigQuery, Snowflake, Redshift, Parquet files

3. Online Store (for serving)

Low-latency access to the latest feature values for real-time inference:
  • Stores only the most recent value per entity
  • Backed by key-value stores: Redis, DynamoDB, Bigtable
  • Sub-millisecond latency for production serving

| Component     | Purpose                            | Latency         | Storage   |
| ------------- | ---------------------------------- | --------------- | --------- |
| Registry      | Feature catalog & metadata         | N/A             | File/DB   |
| Offline Store | Training data & historical lookups | Seconds-minutes | Warehouse |
| Online Store  | Real-time serving                  | < 10 ms         | Key-value |

```python
# === Feast Feature Store Setup ===
# This demonstrates the Feast workflow: define, materialize, serve

# --- Step 1: Define feature repository structure ---
# feast_repo/
#   feature_store.yaml      # Configuration
#   features.py             # Feature definitions
#   data/                   # Offline data source

# --- feature_store.yaml ---
feast_config = """
project: ml_platform
provider: local
registry: data/registry.db
online_store:
  type: sqlite
  path: data/online_store.db
offline_store:
  type: file
entity_key_serialization_version: 2
"""
print("=== feature_store.yaml ===")
print(feast_config)

# --- features.py ---
feature_definitions = """
from datetime import timedelta

import pandas as pd

from feast import (
    Entity, FeatureView, Field, FileSource,
    RequestSource, on_demand_feature_view,
)
from feast.types import Float32, Int64

# Entity: the "who" or "what" features describe
customer = Entity(
    name="customer_id",
    description="Unique customer identifier",
)

# Data source: where raw feature data lives
customer_stats_source = FileSource(
    path="data/customer_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

# Feature View: a group of related features from one source
customer_stats_fv = FeatureView(
    name="customer_stats",
    entities=[customer],
    ttl=timedelta(days=1),  # How stale can features be?
    schema=[
        Field(name="total_orders", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_order", dtype=Int64),
        Field(name="lifetime_value", dtype=Float32),
    ],
    source=customer_stats_source,
    online=True,  # Materialize to online store
)

# Request source: values only known at request time
input_request = RequestSource(
    name="order_amount",
    schema=[Field(name="current_order_amount", dtype=Float32)],
)

# On-demand feature view: transformations computed at request time
@on_demand_feature_view(
    sources=[customer_stats_fv, input_request],
    schema=[
        Field(name="order_vs_average", dtype=Float32),
        Field(name="is_high_value", dtype=Int64),
    ],
)
def order_context(inputs: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df["order_vs_average"] = (
        inputs["current_order_amount"] / inputs["avg_order_value"]
    )
    df["is_high_value"] = (inputs["lifetime_value"] > 500).astype(int)
    return df
"""
print("=== features.py ===")
print(feature_definitions)
```

Feast Workflow

1. Apply: Register features

```shell
feast apply
```

This reads your feature definitions and updates the registry.

2. Materialize: Populate online store

```shell
feast materialize 2024-01-01T00:00:00 2024-12-31T23:59:59
```

This copies the latest feature values from the offline store to the online store for low-latency serving.

3. Get Training Data (offline)

```python
from feast import FeatureStore

store = FeatureStore(repo_path="feast_repo/")

training_df = store.get_historical_features(
    entity_df=entity_with_timestamps,
    features=["customer_stats:total_orders", "customer_stats:avg_order_value"],
).to_df()
```

Uses point-in-time joins to prevent data leakage.

4. Get Serving Data (online)

```python
features = store.get_online_features(
    features=["customer_stats:total_orders", "customer_stats:avg_order_value"],
    entity_rows=[{"customer_id": "C123"}],
).to_dict()
```

Returns the latest feature values with sub-10ms latency.

```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# === Simulating Feast Operations ===
# (Full Feast requires infrastructure; this simulates the concepts)

np.random.seed(42)

# --- Simulate historical feature data ---
n_customers = 100
n_days = 90
records = []

for cid in range(n_customers):
    for day_offset in range(n_days):
        ts = datetime(2024, 1, 1) + timedelta(days=day_offset)
        records.append({
            "customer_id": f"C{cid:03d}",
            "event_timestamp": ts,
            "total_orders": int(np.random.poisson(5 + day_offset * 0.05)),
            "avg_order_value": round(np.random.normal(50, 15), 2),
            "days_since_last_order": int(np.random.exponential(7)),
            "lifetime_value": round(np.random.normal(300 + day_offset, 100), 2),
        })

feature_df = pd.DataFrame(records)
print(f"Feature store: {len(feature_df)} records, "
      f"{n_customers} customers, {n_days} days")
print(feature_df.head())

# --- Point-in-Time Join ---
def point_in_time_join(entity_df, feature_df,
                       entity_col="customer_id",
                       timestamp_col="event_timestamp"):
    """
    For each entity at a given timestamp, return the most recent
    feature values at or before that timestamp (no future leakage).
    """
    results = []
    for _, row in entity_df.iterrows():
        cid = row[entity_col]
        ts = row[timestamp_col]

        # Filter: same customer, timestamp at or before query time
        mask = (
            (feature_df[entity_col] == cid) &
            (feature_df[timestamp_col] <= ts)
        )
        matching = feature_df[mask]

        if len(matching) > 0:
            # Get the most recent record
            latest = matching.sort_values(timestamp_col).iloc[-1]
            result = row.to_dict()
            for col in ["total_orders", "avg_order_value",
                        "days_since_last_order", "lifetime_value"]:
                result[col] = latest[col]
            results.append(result)

    return pd.DataFrame(results)

# --- Create training entity DataFrame ---
# "What features did customer X have on date Y?"
training_entities = pd.DataFrame({
    "customer_id": ["C001", "C001", "C050", "C050", "C099"],
    "event_timestamp": [
        datetime(2024, 2, 1),
        datetime(2024, 3, 1),
        datetime(2024, 2, 15),
        datetime(2024, 3, 15),
        datetime(2024, 3, 30),
    ],
    "label": [1, 0, 1, 1, 0],
})

print("\n=== Training Entities ===")
print(training_entities)

# --- Point-in-time join ---
training_data = point_in_time_join(training_entities, feature_df)
print("\n=== Training Data (point-in-time joined) ===")
print(training_data)

# --- Online serving (latest features) ---
def get_online_features(feature_df, entity_ids,
                        entity_col="customer_id"):
    """Simulate online store: return latest features per entity."""
    results = []
    for cid in entity_ids:
        mask = feature_df[entity_col] == cid
        latest = feature_df[mask].sort_values(
            "event_timestamp").iloc[-1]
        results.append(latest.to_dict())
    return pd.DataFrame(results)

print("\n=== Online Features (latest) ===")
online = get_online_features(feature_df, ["C001", "C050"])
print(online[["customer_id", "total_orders",
              "avg_order_value", "lifetime_value"]])
```

Point-in-Time Correctness

Point-in-time correctness is the most critical concept in feature stores. When building training data, you must use only feature values that were available at the time the label was observed. Using future feature values creates data leakage and makes your model appear more accurate during training than it will be in production. Feature stores enforce this automatically.
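Feast performs this lookup for you; as a sketch of the mechanics, pandas' `merge_asof` implements the same backward-looking join directly (toy entity and feature values, illustrative only):

```python
import pandas as pd

# Toy feature history for one entity (illustrative values)
features = pd.DataFrame({
    "customer_id": ["C001"] * 3,
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "total_orders": [3, 7, 12],
})

# Label observed on 2024-02-15: only values known by then are legal
entities = pd.DataFrame({
    "customer_id": ["C001"],
    "event_timestamp": pd.to_datetime(["2024-02-15"]),
})

# merge_asof picks the most recent feature row at or before each label time
# (direction="backward" is the default)
correct = pd.merge_asof(
    entities.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="customer_id",
)
print(correct["total_orders"].item())  # 7, not the future value 12
```

A naive join on the latest row would instead pick up the 2024-03-01 value, leaking information from after the label was observed.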

Feature Engineering Best Practices

Naming Conventions

Use consistent, descriptive names: {entity}_{feature}_{aggregation}_{window}
  • customer_order_count_30d
  • product_avg_rating_7d
  • user_session_duration_median_24h
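A small helper can enforce the pattern so names never drift between teams; this is a hypothetical utility, not part of Feast:

```python
# Hypothetical helper enforcing the {entity}_{feature}_{aggregation}_{window} pattern
def feature_name(entity: str, feature: str,
                 aggregation: str = "", window: str = "") -> str:
    """Join the non-empty parts with underscores, in canonical order."""
    parts = [entity, feature, aggregation, window]
    return "_".join(p for p in parts if p)

print(feature_name("customer", "order", "count", "30d"))   # customer_order_count_30d
print(feature_name("product", "avg_rating", window="7d"))  # product_avg_rating_7d
```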
Feature Freshness

Match freshness to business needs:
  • Real-time (seconds): Fraud detection, pricing
  • Near real-time (minutes): Recommendations, search ranking
  • Batch (hours/daily): Reporting, risk scoring
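These tiers translate naturally into per-feature TTLs, like the `ttl=` argument on a Feast FeatureView. A sketch with assumed values (tune per use case):

```python
from datetime import timedelta

# Illustrative TTLs per freshness tier (values are assumptions, not defaults)
FRESHNESS_TTLS = {
    "real_time": timedelta(seconds=30),      # fraud detection, pricing
    "near_real_time": timedelta(minutes=5),  # recommendations, search ranking
    "batch": timedelta(days=1),              # reporting, risk scoring
}

def is_fresh(age: timedelta, tier: str) -> bool:
    """A feature older than its tier's TTL should not be served."""
    return age <= FRESHNESS_TTLS[tier]

print(is_fresh(timedelta(minutes=2), "near_real_time"))  # True
print(is_fresh(timedelta(hours=3), "near_real_time"))    # False
```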
Feature Stores Comparison

| Feature Store                | Type                     | Strengths                                   |
| ---------------------------- | ------------------------ | ------------------------------------------- |
| Feast                        | Open source              | Simple, self-hosted, great for starting     |
| Tecton                       | Managed (built on Feast) | Production-grade, real-time transforms      |
| Hopsworks                    | Open source / managed    | Integrated with ML pipelines                |
| Databricks Feature Store     | Managed                  | Tight integration with Spark/Unity Catalog  |
| AWS SageMaker Feature Store  | Managed                  | Tight integration with AWS ecosystem        |