Methodological Details of Bayesian State-Space Models with Variable Selection

José Mauricio Gómez Julián

October 2025

State-Space Model Framework for Time Series

State-space models constitute a unified framework for time series analysis that explicitly separates the underlying signal from observational noise. This methodology implements full Bayesian inference on structural time series components (Bayesian Structural Time Series, BSTS), combining the flexibility of structural decomposition¹ with automatic variable selection through spike-and-slab² priors. The approach allows uncertainty quantification at all levels of the model while maintaining the economic interpretability of the components.

General State-Space Representation

Mathematical Formulation

A state-space model is defined by two fundamental equations:

Observation Equation: \[y_t = Z_t' \alpha_t + \beta' x_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma^2_{\epsilon})\]

State (Transition) Equation: \[\alpha_{t+1} = T_t \alpha_t + R_t \eta_t, \quad \eta_t \sim N(0, Q_t)\]

where:
  • \(y_t\) is the observation at time \(t\)
  • \(\alpha_t\) is the latent state vector of dimension \(m\)
  • \(x_t\) is the vector of exogenous regressors of dimension \(k\)
  • \(\beta\) are the regression coefficients
  • \(Z_t\) connects the state to the observations
  • \(T_t\) is the state transition matrix
  • \(R_t\) and \(Q_t\) define the variance structure of the state process
  • \(\epsilon_t\) is the observation error

Advantages of the State-Space Framework

  1. Structural decomposition: Separates trend, seasonality, and irregular components
  2. Natural handling of missing data: The Kalman filter automatically interpolates
  3. Complete uncertainty: Posterior distributions for states and forecasts
  4. Modular flexibility: Components can be added as needed

Implemented Structural Components

Local Level (LL)

The simplest model includes only a stochastic level:

\[\begin{aligned} y_t &= \mu_t + \beta' x_t + \epsilon_t \\ \mu_{t+1} &= \mu_t + \eta_{\mu,t} \end{aligned}\]

where \(\eta_{\mu,t} \sim N(0, \sigma^2_\mu)\) represents innovations in the level. This model is appropriate for series with varying level but no systematic trend.

Local Linear Trend (LLT)

Extends the LL model by adding a stochastic trend:

\[\begin{aligned} y_t &= \mu_t + \beta' x_t + \epsilon_t \\ \mu_{t+1} &= \mu_t + \delta_t + \eta_{\mu,t} \\ \delta_{t+1} &= \delta_t + \eta_{\delta,t} \end{aligned}\]

where:
  • \(\mu_t\) is the level at time \(t\)
  • \(\delta_t\) is the slope (rate of change) at time \(t\)
  • \(\eta_{\mu,t} \sim N(0, \sigma^2_\mu)\) are level innovations
  • \(\eta_{\delta,t} \sim N(0, \sigma^2_\delta)\) are slope innovations
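In the notation of the general state-space representation, the LLT model corresponds to a two-dimensional state; restating the equations above in matrix form:

\[\alpha_t = \begin{pmatrix} \mu_t \\ \delta_t \end{pmatrix}, \quad Z_t = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad T_t = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \quad R_t = I_2, \quad Q_t = \begin{pmatrix} \sigma^2_\mu & 0 \\ 0 & \sigma^2_\delta \end{pmatrix}\]

The Local Level model is the special case with \(\alpha_t = \mu_t\), \(Z_t = T_t = R_t = 1\), and \(Q_t = \sigma^2_\mu\).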

Seasonal Component (Optional)

For series with seasonal patterns:

\[\gamma_{t+1} = -\sum_{s=1}^{S-1} \gamma_{t-s+1} + \eta_{\gamma,t}\]

where \(S\) is the seasonal period (e.g., 12 for monthly data) and \(\eta_{\gamma,t} \sim N(0, \sigma^2_\gamma)\).
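The text does not tie these components to a specific library; as a hedged illustration (one possible implementation, not necessarily the one used here), the bsts R package builds them as an additive state specification. The series y below is a placeholder.

library(bsts)

# y: the observed series (placeholder). Each Add* call appends one
# structural component to the state specification.
state_ll  <- AddLocalLevel(list(), y)         # Local Level: stochastic level only
state_llt <- AddLocalLinearTrend(list(), y)   # Local Linear Trend: level + stochastic slope

# Optional seasonal component, e.g. S = 12 for monthly data
state_llt_seas <- AddSeasonal(state_llt, y, nseasons = 12)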

Bayesian Variable Selection with Spike-and-Slab

Economic Motivation

In high-dimensional contexts with multiple potential lags, variable selection is crucial for:
  • Avoiding overfitting
  • Maintaining interpretability
  • Identifying genuine predictors

Spike-and-Slab Prior

For each coefficient \(\beta_j\) we define a hierarchical prior:

\[\beta_j | \gamma_j \sim \begin{cases} N(0, \tau^2 v_j) & \text{if } \gamma_j = 1 \text{ (slab)} \\ \delta_0 & \text{if } \gamma_j = 0 \text{ (spike)} \end{cases}\]

where:
  • \(\gamma_j \in \{0,1\}\) is the inclusion indicator
  • \(\tau^2 v_j\) is the slab variance (non-zero component)
  • \(\delta_0\) is a point mass at zero

The prior on the indicators is:

\[\gamma_j \sim \text{Bernoulli}(\pi_j)\]

with \(\pi_j\) calibrated³ according to the expected model size.

Hyperparameter Calibration

Hyperparameters are configured via:

  1. Expected model size: \(E[\sum \gamma_j] = \max(1, \min(5, k))\) where \(k\) is the number of predictors
  2. Prior information weight: \(w = 0.01\) (weakly informative prior)
  3. Diagonal shrinkage: \(\kappa = 0.5\) for moderate regularization

This configuration balances flexibility with parsimony, allowing data to dominate inference while maintaining smooth regularization.
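As a sketch of how these settings could map onto a spike-and-slab prior object, one option is the SpikeSlabPrior constructor from the BoomSpikeSlab package; the design matrix X and response y are placeholders, and this is an illustration rather than the authors' exact configuration.

library(BoomSpikeSlab)

# X: T x k matrix of (lagged) regressors, y: response vector (placeholders)
k <- ncol(X)
ss_prior <- SpikeSlabPrior(
  x = cbind(Intercept = 1, X), y = y,
  expected.model.size      = max(1, min(5, k)),  # E[sum(gamma_j)]
  prior.information.weight = 0.01,               # weakly informative prior weight
  diagonal.shrinkage       = 0.5                 # moderate shrinkage toward the diagonal (kappa = 0.5)
)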

Bayesian Inference via MCMC

Sampling Algorithm

Inference is performed through a hybrid Gibbs sampling scheme:

  1. Latent states (\(\alpha_{1:T}\)): Forward Filtering Backward Sampling⁴ (FFBS)
  2. Variances (\(\sigma^2_\epsilon, \sigma^2_\mu, \sigma^2_\delta\)): Gibbs with conjugate inverse-gamma priors
  3. Coefficients and indicators (\(\beta, \gamma\)): Stochastic Search Variable Selection⁵ (SSVS)

MCMC Configuration

  • Total iterations: 2000
  • Burn-in: 500 (25%)
  • Thinning: Not applied (all post-burn samples)
  • Chains: 1 (parallelization at pair level)
  • Seed: Fixed for reproducibility
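A minimal sketch of how the base and full fits could be specified with this configuration, again assuming the bsts package; the object names train_df, state_llt, and ss_prior, and the seed value, are illustrative only.

set.seed(2025)                      # fixed seed for reproducibility (arbitrary value)

# Base model: structural components only, no regressors
fit_base <- bsts(train_df$y,
                 state.specification = state_llt,
                 niter = 2000, ping = 0, seed = 2025)

# Full model: structural components + lagged regressors with spike-and-slab prior
fit_full <- bsts(y ~ ., data = train_df,
                 state.specification = state_llt,
                 prior = ss_prior,
                 niter = 2000, ping = 0, seed = 2025)

# The first 500 draws are discarded as burn-in when summarizing or predicting (burn = 500)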

Convergence Diagnostics

Although convergence diagnostics are not reported explicitly (to limit computational cost), standard checks include:
  • Visual inspection of trace plots
  • Within-chain autocorrelation
  • Effective sample size for key parameters
  • Stability of the inclusion probabilities for \(\gamma_j\)

Temporal Validation with Leave-Future-Out (LFO)⁶

Validation Scheme

We implement Leave-Future-Out (LFO) with rolling origin:

  1. Initial set: 80% of data or minimum 30 observations
  2. Forecast horizon: \(h = 6\) months
  3. Step between origins: 6 months
  4. Window type: Expanding (accumulates history)

This scheme is more granular than the one used for ECM-MARS (6-month horizon and step instead of 12), allowing finer temporal resolution in the evaluation of stability.
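A minimal sketch of how such expanding-origin splits can be generated; the function name and defaults are illustrative, not taken from the source code.

# Expanding-window Leave-Future-Out splits:
# initial training set = 80% of the data (at least 30 obs.),
# forecast horizon h = 6, step between origins = 6.
make_lfo_splits <- function(n, h = 6, step = 6, init_frac = 0.8, min_init = 30) {
  init <- max(min_init, floor(init_frac * n))
  origins <- seq(init, n - h, by = step)
  lapply(origins, function(o) {
    list(train_idx = 1:o,               # expanding window: all history up to the origin
         test_idx  = (o + 1):(o + h))   # next h observations
  })
}

# Example: 120 monthly observations -> 4 folds with training sizes 96, 102, 108, 114
splits <- make_lfo_splits(120)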

Procedure per Fold

For each fold \(f\) (a code sketch of the scaling and prediction steps follows the list):

  1. Data preparation:
    • Split into train/test according to split
    • Scale regressors per fold using train statistics
  2. Model fitting:
    • Base: Only structural components without regressors
    • Full: Structural components + regressors with spike-and-slab
  3. Probabilistic prediction:
    • Generate complete posterior predictive distribution
    • Extract mean, standard deviation, and quantiles
  4. Evaluation:
    • Calculate ELPD, point metrics, and calibration
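Sketch of steps 1 and 3 under the same assumptions as above; X, y, train_idx, test_idx, and fit_full are placeholders, and pred$distribution is the matrix of posterior predictive draws returned by predict in the bsts package.

# Scale regressors with training-set statistics only (no look-ahead)
mu_x <- colMeans(X[train_idx, , drop = FALSE])
sd_x <- apply(X[train_idx, , drop = FALSE], 2, sd)
train_df <- data.frame(y = y[train_idx],
                       scale(X[train_idx, , drop = FALSE], center = mu_x, scale = sd_x))
test_df  <- data.frame(scale(X[test_idx, , drop = FALSE], center = mu_x, scale = sd_x))

# Posterior predictive draws over the h = 6 test periods
pred  <- predict(fit_full, newdata = test_df, burn = 500)
draws <- pred$distribution                     # (niter - burn) x h matrix of draws
y_hat <- colMeans(draws)                       # predictive mean per test period
y_sd  <- apply(draws, 2, sd)                   # predictive standard deviation
qs    <- apply(draws, 2, quantile, probs = c(0.025, 0.10, 0.90, 0.975))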

Predictive Evaluation Metrics

Expected Log Predictive Density (ELPD)

For each future observation \(y_t^*\):

\[\text{ELPD}_t = \log \int p(y_t^* | \theta) p(\theta | \mathcal{D}_{\text{train}}) d\theta\]

Approximated via MCMC samples as:

\[\widehat{\text{ELPD}}_t = \log \left( \frac{1}{S} \sum_{s=1}^S p(y_t^* | \theta^{(s)}) \right)\]

where we assume normality: \(p(y_t^* | \theta^{(s)}) = N(y_t^*; \mu_t^{(s)}, \sigma_t^{2(s)})\), with \(\mu_t^{(s)}\) and \(\sigma_t^{(s)}\) the predictive mean and standard deviation implied by draw \(s\).
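This estimate can be computed directly from the MCMC output; a sketch in base R, assuming per-draw predictive means mu_draws (an S x h matrix) and per-draw observation standard deviations sigma_draws (e.g., the post-burn-in sigma.obs draws in bsts), with a log-sum-exp step for numerical stability.

# Per-draw log predictive densities for each held-out observation (S x h matrix)
log_lik <- sapply(seq_along(y_test), function(t)
  dnorm(y_test[t], mean = mu_draws[, t], sd = sigma_draws, log = TRUE))

# ELPD_t = log( (1/S) * sum_s exp(log_lik[s, t]) ), computed stably
elpd_t <- apply(log_lik, 2, function(ll) max(ll) + log(mean(exp(ll - max(ll)))))
elpd   <- sum(elpd_t)   # total ELPD over the h-step test window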

Point Error Metrics

Using the posterior mean as point prediction:

  • RMSE: \(\sqrt{\frac{1}{h} \sum_{t=1}^h (y_t - \bar{y}_t)^2}\)
  • MAE: \(\frac{1}{h} \sum_{t=1}^h |y_t - \bar{y}_t|\)

where \(\bar{y}_t = \frac{1}{S} \sum_{s=1}^S y_t^{(s)}\) is the posterior mean.

Calibration Metrics

We evaluate the quality of quantified uncertainty:

  • 80% Coverage: Proportion of observations within 80% credible interval
  • 95% Coverage: Proportion of observations within 95% credible interval
  • PIT (Probability Integral Transform): \(u_t = F_{Y_t}(y_t)\) where \(F_{Y_t}\) is the predictive CDF

If the model is well-calibrated, PIT values follow a uniform distribution on [0,1].
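A sketch of these calibration checks from the predictive draws (draws and y_test as in the per-fold step above):

# Empirical coverage of central credible intervals
q80 <- apply(draws, 2, quantile, probs = c(0.10, 0.90))
q95 <- apply(draws, 2, quantile, probs = c(0.025, 0.975))
cover80 <- mean(y_test >= q80[1, ] & y_test <= q80[2, ])
cover95 <- mean(y_test >= q95[1, ] & y_test <= q95[2, ])

# PIT values via the empirical predictive CDF; they should look uniform on [0, 1]
pit <- sapply(seq_along(y_test), function(t) mean(draws[, t] <= y_test[t]))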

Optimal Structure Selection

Structure Grid

We systematically evaluate:

| Model | Level      | Trend      | Seasonality | Free Parameters |
|-------|------------|------------|-------------|-----------------|
| LL    | Stochastic | No         | Optional    | \(\sigma^2_\mu, \sigma^2_\epsilon\) |
| LLT   | Stochastic | Stochastic | Optional    | \(\sigma^2_\mu, \sigma^2_\delta, \sigma^2_\epsilon\) |

Selection Criterion

The optimal structure is selected through:

  1. Primary: Highest average \(\Delta\text{ELPD}\) across folds
  2. Secondary: Highest support (proportion of winning folds)
  3. Tertiary: Highest \(\Delta\text{RMSE}\) (error reduction)

This hierarchy prioritizes probabilistic predictive capacity over point fit.
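With the per-structure summaries collected in a tibble (using the column names of the output structure shown later), this hierarchy reduces to a simple ordering; a sketch with dplyr:

library(dplyr)

best_structure <- all_summaries %>%
  arrange(desc(dELPD_mean),     # primary: average ELPD gain across folds
          desc(support),        # secondary: proportion of winning folds
          desc(dRMSE_mean)) %>% # tertiary: average error reduction
  slice(1)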

Victory and Stability Criterion

Victory per Fold

A full model “wins” in fold \(f\) if:

\[\begin{cases} \Delta\text{ELPD}_f > 0 & \text{(better predictive density)} \\ \Delta\text{RMSE}_f > 0 & \text{(lower error, i.e., RMSE}_{\text{base}} - \text{RMSE}_{\text{full}} > 0\text{)} \end{cases}\]

Note: Unlike GLM-AR(1), here \(\Delta\text{RMSE} = \text{RMSE}_{\text{base}} - \text{RMSE}_{\text{full}}\), so positive values indicate improvement.

Stability Metrics

  • Support: \(\frac{\text{wins}}{\text{folds}}\) = proportion of victorious folds
  • Strict threshold: support \(\geq 0.70\) with minimum 5 folds
  • Moderate threshold: support \(\geq 0.60\) with minimum 5 folds
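Per-fold victories and the resulting support can be computed directly from the fold-level results tibble (column names as in the output structure below); a sketch:

library(dplyr)

summary_tbl <- best_results %>%
  mutate(win = (dELPD > 0) & (dRMSE > 0)) %>%       # victory criterion per fold
  summarise(folds      = n(),
            wins       = sum(win),
            support    = wins / folds,
            dELPD_mean = mean(dELPD),
            dRMSE_mean = mean(dRMSE))

# Stability thresholds
passes_strict   <- summary_tbl$support >= 0.70 & summary_tbl$folds >= 5
passes_moderate <- summary_tbl$support >= 0.60 & summary_tbl$folds >= 5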

Computational Implementation and Optimizations

Processing Architecture

  1. Pair-level processing: The 84 pairs are processed sequentially and independently, so parallelization (when used) occurs at the pair level rather than within a model
  2. MCMC efficiency: Single chain per model (speed vs diagnostics trade-off)
  3. Matrix caching: Reuse of decompositions in Kalman filter

Special Case Management

The code includes robust handling of:

  • Zero variance: Skip fold if \(\text{sd}(y) \approx 0\)
  • Convergence failures: Try-catch with informative messages
  • Singular matrices: Automatic regularization in Kalman filter
  • Degenerate predictions: Fallback to quantile-based intervals

Output Data Structure

Each pair generates:

list(
  best_summary = tibble(     # Best model summary
    spec, folds, wins, support, 
    dELPD_mean, dRMSE_mean, ...
  ),
  best_results = tibble(     # Details per fold
    fold, n_train, n_test,
    ELPD_base, ELPD_full, dELPD,
    RMSE_base, RMSE_full, dRMSE,
    cover80, cover95, pit, ...
  ),
  all_summaries = tibble()   # All tested structures
)

Comparative Analysis with Previous Methodologies

Comparative Approach Table

| Aspect              | ECM-MARS            | GLM-AR(1)            | BSTS                         |
|---------------------|---------------------|----------------------|------------------------------|
| Paradigm            | Hybrid frequentist  | Parametric Bayesian  | Structural Bayesian          |
| Prerequisites       | I(1), cointegration | None                 | None                         |
| Non-linearity       | MARS (splines)      | No                   | No (but flexible components) |
| Variable selection  | Manual (pre-tests)  | No                   | Automatic (spike-and-slab)   |
| Temporal components | ECM only            | Global AR(1)         | Level, trend, seasonality    |
| Uncertainty         | Implicit bootstrap  | Complete posterior   | Complete posterior + states  |
| Interpretability    | High (economic)     | Medium               | High (decomposition)         |
| Computational cost  | Medium              | High                 | Very high                    |

Distinctive Advantages of BSTS

  1. Interpretable decomposition: Separates signal from noise with economically meaningful components

  2. Automatic selection: Spike-and-slab identifies relevant predictors without pre-specification

  3. Structural uncertainty handling: Quantifies uncertainty in unobservable components

  4. Robustness to specification: Does not require assumptions about integration order or cointegration

  5. Interval forecasts: Natural and well-calibrated prediction intervals

Relative Limitations

  1. High computational cost: FFBS + SSVS is intensive, especially with many regressors

  2. Linearity in predictors: Does not capture non-linear interactions like MARS

  3. Complex diagnostics: Convergence harder to verify than OLS

  4. Sensitivity to priors in small samples: Especially for component variances

Extensions and Future Developments

Immediate Improvements

  1. Additional components:
    • Calendar effects: Business days, moving holidays
    • Level shifts: Automatic detection of structural breaks
    • Dynamic regressors: Time-varying coefficients
  2. Informative priors:
    • Incorporate economic information in inclusion priors
    • Hierarchical priors for groups of related variables
  3. Enhanced validation:
    • Approximate cross-validation via Pareto-smoothed importance sampling (PSIS)
    • Backtesting with multiple horizons

Methodological Extensions

  1. Non-linearity:
    • Gaussian processes for non-linear components
    • Bayesian neural networks for interaction capture
  2. Hierarchical models:
    • Partial pooling between related pairs
    • Shared latent factors
  3. Causality:
    • Intervention models with causal inference
    • Dynamic causal graphs

Complete Pipeline Pseudocode

INPUT: DataFrame with time series, variable specification
OUTPUT: Ranking of pairs by predictive capacity and stability

# PREPARATION
1. LOAD data and clean names
2. IDENTIFY 6 circulation variables and 7 production variables
3. GENERATE 84 pairs (2 × 6 × 7)

# PROCESSING PER PAIR
FOR each pair (Y, X):
  
  4. BUILD matrix with Y, X, lags(X, 1:6)
  5. REMOVE observations with NA
  
  # STRUCTURE TUNING
  FOR each structure in {LL, LLT}:
    
    # LEAVE-FUTURE-OUT
    6. GENERATE LFO splits (init=80%, h=6, step=6)
    
    FOR each fold:
      7. SPLIT train/test
      8. SCALE regressors with train stats
      
      # MODELS
      9. FIT base_model:
         - state_spec = structure without regressors
         - MCMC with 2000 iter, 500 burn-in
      
      10. FIT full_model:
          - state_spec = structure
          - prior = spike_slab(X_lags, expected_size=5)
          - MCMC with 2000 iter, 500 burn-in
      
      # PREDICTION
      11. GENERATE predictive distributions (h=6)
      12. EXTRACT mean, sd, quantiles
      
      # EVALUATION
      13. CALCULATE:
          - ELPD_base, ELPD_full → ΔELPD
          - RMSE_base, RMSE_full → ΔRMSE
          - Coverage 80%, 95%
          - PIT values
      
      14. DETERMINE victory: (ΔELPD > 0) ∧ (ΔRMSE > 0)
    
    15. SUMMARIZE structure:
        - support = wins/folds
        - metric averages
  
  16. SELECT best structure by ΔELPD
  17. SAVE pair results

# RANKING AND EXPORT
18. ORDER pairs by (support DESC, ΔELPD DESC, ΔRMSE DESC)
19. FILTER winners by thresholds (0.70, 0.60)
20. EXPORT CSVs and visualizations
21. GENERATE plots for last fold (without re-fitting)

Methodological Conclusions

The BSTS methodology represents the most comprehensive and flexible approach of the three implemented, combining the rigor of Bayesian inference with the interpretability of structural decomposition. Unlike ECM-MARS which requires prior validation of strict econometric assumptions, and GLM-AR(1) which models only global serial dependence, BSTS decomposes the series into interpretable components while automatically selecting relevant predictors.

Empirical results suggest that BSTS identifies robust predictive relationships with high precision, albeit at substantially higher computational cost. The ability to quantify uncertainty at all levels—from structural components to predictions—makes it particularly valuable for applications where risk assessment is critical.

The implementation with a 6-month horizon (vs 12 in other methodologies) provides more granular evaluation of temporal stability, revealing that many relationships appearing stable at long horizons show variability at shorter scales. This suggests the importance of considering multiple horizons in predictive model evaluation.

The BSTS framework is especially appropriate when:
  • Component interpretation is a priority
  • There is uncertainty about which predictors to include
  • Well-calibrated prediction intervals are required
  • Data exhibit multiple sources of variation (trend, seasonality, regressors)

The combination of automatic variable selection with structural modeling positions BSTS as a bridge between classical econometric methods and modern machine learning techniques, maintaining interpretability while embracing the inherent complexity in economic data.



  1. Structural decomposition separates systematic variation (trend, seasonality) from random variation, facilitating economic interpretation and improving forecasts by modeling each component appropriately.

  2. The spike-and-slab prior combines a continuous distribution (slab) for active coefficients with a point mass at zero (spike) for inactive coefficients, enabling exact variable selection.

  3. Calibration of predictive probabilities is crucial for decision-making under uncertainty. A well-calibrated model assigns probabilities that reflect long-run empirical frequencies.

  4. Forward Filtering Backward Sampling (FFBS) is the standard algorithm for sampling states in linear Gaussian models, combining forward Kalman filtering with backward sampling conditioned on complete data.

  5. Stochastic Search Variable Selection (SSVS) is an MCMC method for exploring model space, alternating between updating coefficients given the model and updating the model given coefficients.

  6. The validation scheme with h=6 and step=6 avoids overlap between test sets, ensuring independence in evaluation while maximizing data usage.