
Time Series Fundamentals

Components, stationarity, autocorrelation, and proper train/test splitting


A time series is a sequence of data points ordered by time. Unlike tabular data where rows are independent, time series data has temporal dependencies -- each observation is related to its neighbors in time.

Time Series Components

Every time series can be decomposed into four components:

1. Trend: Long-term increase or decrease in the data (e.g., rising global temperatures)
2. Seasonality: Regular periodic patterns (e.g., higher ice cream sales in summer)
3. Cyclical: Longer-term fluctuations without a fixed period (e.g., business cycles)
4. Noise/Residual: Random variation that cannot be explained by the other components

These can combine additively or multiplicatively:

  • Additive: y(t) = Trend + Seasonality + Noise
  • Multiplicative: y(t) = Trend * Seasonality * Noise
  • Use multiplicative when seasonal fluctuations grow with the level of the series.

```python
import numpy as np

# Generate a synthetic time series with known components
np.random.seed(42)
n_points = 365  # One year of daily data

# Time index
t = np.arange(n_points)

# Trend: gradual upward
trend = 0.05 * t + 50

# Seasonality: yearly cycle
seasonality = 10 * np.sin(2 * np.pi * t / 365)

# Weekly pattern (smaller amplitude)
weekly = 3 * np.sin(2 * np.pi * t / 7)

# Noise
noise = np.random.normal(0, 2, n_points)

# Combine (additive)
y = trend + seasonality + weekly + noise

print(f"Time series length: {len(y)}")
print(f"Mean: {y.mean():.2f}")
print(f"Std: {y.std():.2f}")
print(f"First 10 values: {np.round(y[:10], 2)}")
```
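For contrast, here is a minimal sketch of a multiplicative combination of the same ingredients (the 0.2 seasonal factor and 0.02 noise scale are illustrative choices, not from the lesson):

```python
import numpy as np

np.random.seed(42)
t = np.arange(365)

# Same ingredients as above, but seasonality and noise act as factors near 1.0
trend = t + 50.0
seasonal_factor = 1 + 0.2 * np.sin(2 * np.pi * t / 365)
noise_factor = 1 + np.random.normal(0, 0.02, len(t))

# Multiplicative combination: seasonal swings scale with the trend level
y_mult = trend * seasonal_factor * noise_factor

# Detrend to expose the seasonal deviation; its spread grows over time
deviation = y_mult - trend
print(f"Seasonal deviation (std), first quarter: {deviation[:90].std():.1f}")
print(f"Seasonal deviation (std), last quarter:  {deviation[-90:].std():.1f}")
```

The widening spread is the signature to look for when choosing between additive and multiplicative models; a log transform turns a multiplicative series back into an additive one.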

Stationarity

A time series is stationary if its statistical properties (mean, variance, autocorrelation) don't change over time. Most forecasting models assume or require stationarity.

Why Stationarity Matters

  • Non-stationary series are harder to model because patterns keep changing
  • Many statistical tests and models (ARIMA, etc.) assume stationarity
  • A model trained on one period may fail on another if the series is non-stationary

Making a Series Stationary

1. Differencing: Subtract the previous value: y'(t) = y(t) - y(t-1)
2. Seasonal differencing: Subtract the value from one season ago
3. Log transformation: Stabilizes variance when it grows with the level
4. Detrending: Remove the trend component
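The four transformations can be sketched in NumPy on a synthetic trending, weekly-seasonal series (the series and period are illustrative):

```python
import numpy as np

np.random.seed(0)
t = np.arange(120)
# Trending series with a weekly cycle (period 7) and noise
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / 7) + 50 + np.random.randn(120)

# 1. First difference: removes a linear trend
diff1 = np.diff(y)  # y'(t) = y(t) - y(t-1)

# 2. Seasonal difference at period 7: removes the weekly pattern
seasonal_diff = y[7:] - y[:-7]

# 3. Log transform: stabilizes variance that grows with the level
#    (only valid for strictly positive series)
log_y = np.log(y)

# 4. Detrend: subtract a fitted linear trend
slope, intercept = np.polyfit(t, y, 1)
detrended = y - (slope * t + intercept)

# After first differencing, the mean is roughly constant across halves
print(f"Original means by half:    {y[:60].mean():6.1f} vs {y[60:].mean():6.1f}")
print(f"Differenced means by half: {diff1[:60].mean():6.2f} vs {diff1[60:].mean():6.2f}")
```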

Augmented Dickey-Fuller (ADF) Test

The standard statistical test for stationarity. The null hypothesis is that the series has a unit root (is non-stationary):

  • p-value < 0.05: Reject the null hypothesis -- the series is likely stationary
  • p-value >= 0.05: Fail to reject -- the series may be non-stationary
```python
import numpy as np

def simple_stationarity_check(series, n_segments=4):
    """
    Simplified stationarity check: compare mean and variance across segments.
    (The full ADF test requires statsmodels.)
    """
    series = np.asarray(series, dtype=float)
    segment_size = len(series) // n_segments

    means, variances = [], []
    for i in range(n_segments):
        segment = series[i * segment_size:(i + 1) * segment_size]
        means.append(np.mean(segment))
        variances.append(np.var(segment))

    variance_ratio = max(variances) / (min(variances) + 1e-10)
    # Mean drift measured relative to the typical within-segment spread
    mean_drift = (max(means) - min(means)) / (np.sqrt(np.mean(variances)) + 1e-10)

    # Simple heuristics: roughly constant mean and variance => likely stationary
    is_stationary = variance_ratio < 3.0 and mean_drift < 1.0

    return {
        "likely_stationary": is_stationary,
        "variance_ratio": variance_ratio,
        "mean_drift": mean_drift,
        "segment_variances": variances,
    }

# Test with stationary vs non-stationary series
np.random.seed(42)
stationary = np.random.randn(200)                       # White noise (stationary)
non_stationary = np.cumsum(np.random.randn(200))        # Random walk (non-stationary)
trending = np.arange(200) * 0.5 + np.random.randn(200)  # Trend (non-stationary)

for name, series in [("White noise", stationary),
                     ("Random walk", non_stationary),
                     ("Trending", trending)]:
    result = simple_stationarity_check(series)
    print(f"{name}: stationary={result['likely_stationary']}, "
          f"var_ratio={result['variance_ratio']:.2f}, "
          f"mean_drift={result['mean_drift']:.2f}")
```
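The actual ADF test works differently from a variance heuristic: it regresses the first difference on the lagged level and examines the t-statistic of the lag coefficient (in practice, use `adfuller` from `statsmodels.tsa.stattools`). A hand-rolled sketch of the simplest constant-only, no-lag variant, using the approximate 5% critical value of -2.86:

```python
import numpy as np

def adf_tstat(series):
    """t-statistic of rho in: diff(y)[t] = alpha + rho * y[t-1] + e[t].
    A strongly negative value means the level pulls changes back toward
    the mean -- evidence of stationarity."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    y_lag = y[:-1]

    # OLS with a constant: X = [1, y_lag]
    X = np.column_stack([np.ones_like(y_lag), y_lag])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - 2)
    se_rho = np.sqrt((sigma2 * np.linalg.inv(X.T @ X))[1, 1])
    return beta[1] / se_rho

np.random.seed(42)
white_noise = np.random.randn(300)
random_walk = np.cumsum(np.random.randn(300))

# Compare against the approximate 5% critical value for this variant
for name, s in [("White noise", white_noise), ("Random walk", random_walk)]:
    t_stat = adf_tstat(s)
    verdict = "stationary" if t_stat < -2.86 else "cannot reject unit root"
    print(f"{name}: t = {t_stat:.2f} -> {verdict}")
```

Note that the t-statistic here does not follow a standard t-distribution under the null; that is why the Dickey-Fuller critical values (like -2.86) are used instead.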

Autocorrelation (ACF & PACF)

Autocorrelation measures how a time series correlates with lagged versions of itself. It reveals the memory and structure of the data.

ACF (Autocorrelation Function): Correlation between y(t) and y(t-k) for lag k. Includes indirect correlations through intermediate lags.

PACF (Partial Autocorrelation Function): Correlation between y(t) and y(t-k) after removing the effect of intermediate lags. Shows only the direct relationship.

Reading ACF/PACF Plots

| Pattern | ACF | PACF | Model Suggestion |
| --- | --- | --- | --- |
| AR(p) | Tails off (decays) | Cuts off after lag p | Use ARIMA(p, d, 0) |
| MA(q) | Cuts off after lag q | Tails off | Use ARIMA(0, d, q) |
| ARMA(p,q) | Tails off | Tails off | Use ARIMA(p, d, q) |

Decomposition

Decomposition separates a time series into its components, helping you understand what's driving the data.
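A classical additive decomposition can be sketched with a centered moving average (real code would use `seasonal_decompose` from `statsmodels.tsa.seasonal`; this hand-rolled version assumes the period is known and tolerates crude edge handling):

```python
import numpy as np

def decompose_additive(y, period):
    """Classical additive decomposition via a centered moving average."""
    y = np.asarray(y, dtype=float)

    # Trend: moving average over one full period
    # (np.convolve zero-pads at the edges; fine for a sketch)
    kernel = np.ones(period) / period
    trend = np.convolve(y, kernel, mode="same")

    # Seasonal: average the detrended values at each position in the cycle
    detrended = y - trend
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal -= seasonal.mean()  # center the seasonal pattern around zero
    seasonal_full = np.tile(seasonal, len(y) // period + 1)[:len(y)]

    # Residual: whatever trend and seasonality don't explain
    residual = y - trend - seasonal_full
    return trend, seasonal_full, residual

# Synthetic series: linear trend + weekly cycle + noise
np.random.seed(42)
t = np.arange(140)
y = 0.1 * t + 5 * np.sin(2 * np.pi * t / 7) + np.random.randn(140)

trend, seasonal, residual = decompose_additive(y, period=7)
print(f"Recovered seasonal amplitude: {seasonal.max() - seasonal.min():.2f}")
print(f"Residual std (interior): {residual[7:-7].std():.2f}")
```

If the recovered seasonal amplitude roughly matches what you built in (here, a sine of amplitude 5, so a peak-to-trough near 10) and the residual looks like noise, the decomposition has captured the structure.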

```python
import numpy as np

def compute_acf(series, max_lag=20):
    """Compute autocorrelation function."""
    n = len(series)
    mean = np.mean(series)
    var = np.var(series)
    acf_values = []

    for lag in range(max_lag + 1):
        if lag == 0:
            acf_values.append(1.0)
            continue
        cov = np.mean((series[lag:] - mean) * (series[:-lag] - mean))
        acf_values.append(cov / var)

    return np.array(acf_values)


def compute_pacf(series, max_lag=20):
    """Compute partial autocorrelation using Durbin-Levinson recursion."""
    acf = compute_acf(series, max_lag)
    pacf_values = [1.0, acf[1]]

    for k in range(2, max_lag + 1):
        # Durbin-Levinson algorithm
        phi = np.zeros((k + 1, k + 1))
        phi[1, 1] = acf[1]

        for i in range(2, k + 1):
            num = acf[i] - sum(phi[i-1, j] * acf[i-j] for j in range(1, i))
            den = 1 - sum(phi[i-1, j] * acf[j] for j in range(1, i))
            phi[i, i] = num / den if abs(den) > 1e-10 else 0

            for j in range(1, i):
                phi[i, j] = phi[i-1, j] - phi[i, i] * phi[i-1, i-j]

        pacf_values.append(phi[k, k])

    return np.array(pacf_values)


# Example: AR(2) process
np.random.seed(42)
n = 500
ar2 = np.zeros(n)
for t in range(2, n):
    ar2[t] = 0.6 * ar2[t-1] - 0.3 * ar2[t-2] + np.random.randn()

acf = compute_acf(ar2, max_lag=10)
pacf = compute_pacf(ar2, max_lag=10)

print("ACF values (should tail off):")
print(np.round(acf, 3))
print("\nPACF values (should cut off after lag 2):")
print(np.round(pacf, 3))
```

Time Series Train/Test Split: No Shuffling!

Never randomly shuffle time series data for train/test splitting! This would leak future information into the training set. Always use a temporal split: train on earlier data, test on later data. For cross-validation, use expanding or sliding window approaches where the test set always comes after the training set.
```python
import numpy as np

def time_series_split(data, test_ratio=0.2):
    """Temporal split: train on past, test on future."""
    split_idx = int(len(data) * (1 - test_ratio))
    return data[:split_idx], data[split_idx:]


def expanding_window_cv(data, n_splits=5, min_train_size=50):
    """
    Expanding window cross-validation for time series.
    Training window grows, test window stays the same size.
    """
    n = len(data)
    test_size = (n - min_train_size) // n_splits
    folds = []

    for i in range(n_splits):
        train_end = min_train_size + i * test_size
        test_end = train_end + test_size
        if test_end > n:
            break
        folds.append({
            "train": (0, train_end),
            "test": (train_end, test_end),
        })

    return folds


def sliding_window_features(series, window_size=5):
    """Create windowed features for ML models."""
    X, y = [], []
    for i in range(window_size, len(series)):
        X.append(series[i - window_size:i])
        y.append(series[i])
    return np.array(X), np.array(y)


# Demo
np.random.seed(42)
data = np.cumsum(np.random.randn(200)) + 100

# Temporal split
train, test = time_series_split(data, test_ratio=0.2)
print(f"Train: {len(train)} points, Test: {len(test)} points")
print(f"Train period: indices 0-{len(train)-1}")
print(f"Test period: indices {len(train)}-{len(data)-1}")

# Expanding window CV
folds = expanding_window_cv(data, n_splits=4, min_train_size=50)
for i, fold in enumerate(folds):
    print(f"Fold {i}: train={fold['train']}, test={fold['test']}")

# Windowed features
X, y = sliding_window_features(data[:20], window_size=5)
print(f"\nWindowed features shape: X={X.shape}, y={y.shape}")
print(f"First window: {np.round(X[0], 2)} -> {y[0]:.2f}")
```
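With a temporal split in place, a natural next step is a naive persistence baseline that any real forecasting model should beat. This sketch re-creates the random-walk demo series; the function names and the MAE metric choice are illustrative:

```python
import numpy as np

def persistence_forecast(train, horizon):
    """Naive baseline: repeat the last observed value for every future step."""
    return np.full(horizon, train[-1])

def mae(actual, predicted):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

# Re-create the random-walk demo series and split it temporally
np.random.seed(42)
data = np.cumsum(np.random.randn(200)) + 100
split_idx = int(len(data) * 0.8)
train, test = data[:split_idx], data[split_idx:]

# Fixed persistence: one value held for the whole test horizon
baseline = persistence_forecast(train, len(test))
print(f"Fixed persistence MAE:    {mae(test, baseline):.2f}")

# One-step-ahead persistence: re-anchored on each true value -- a tougher bar
one_step = np.concatenate([[train[-1]], test[:-1]])
print(f"One-step persistence MAE: {mae(test, one_step):.2f}")
```

Reporting a model's error alongside these baselines is the easiest sanity check in forecasting: a model that cannot beat persistence on a random walk is adding nothing.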