This document details the methodology implemented for analyzing causal relationships between production and circulation variables using a Bayesian Generalized Linear Model (BGLM) with a first-order autoregressive, AR(1), error structure. This approach differs substantially from the ECM-MARS methodology by adopting a fully Bayesian inferential paradigm, allowing comprehensive quantification of predictive uncertainty and model comparison through predictive information criteria.
Let \(Y_t\) be the dependent variable and \(X_t\) the independent variable at time \(t\). The complete model is specified as:
\[Y_{t,s} = \alpha + \beta_0 \cdot t_s + \sum_{i=1}^{L} \beta_i X_{t-i,s} + \epsilon_t\]
where:

- \(Y_{t,s}\): the dependent variable at time \(t\), on the standardized scale (subscript \(s\));
- \(X_{t-i,s}\): the \(i\)-th lag of the independent variable, also standardized;
- \(t_s\): the standardized linear time trend, with coefficient \(\beta_0\);
- \(\alpha\), \(\beta_i\): the intercept and the coefficients of the \(L\) lags of \(X\);
- \(\epsilon_t\): the error term, which follows an AR(1) process.
Standardization is critical for the efficiency of the NUTS (No-U-Turn Sampler) algorithm[^1], ensuring that all parameters operate on comparable scales and avoiding convergence issues in the Hamiltonian sampler.
As a comparison benchmark, we estimate a baseline model that excludes information from the independent variable:
\[Y_{t,s} = \alpha + \beta_0 \cdot t_s + \epsilon_t\]
This model captures only the temporal trend and the intrinsic autoregressive structure of \(Y\), serving as an effective null hypothesis against which to evaluate the predictive contribution of the lags of \(X\).
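For concreteness, the two competing specifications could be written in `brms` formula syntax (the estimation machinery described later). This is a sketch: the variable names are illustrative placeholders and the number of lags is set to 3 only as an example.

```r
library(brms)

# Baseline: standardized linear trend with AR(1) residuals (illustrative names)
f_base <- bf(y_s ~ 1 + t_s + ar(time = date, p = 1))

# Full model: adds the standardized lags of X (L = 3 shown only as an example)
f_full <- bf(y_s ~ 1 + t_s + x_lag1 + x_lag2 + x_lag3 + ar(time = date, p = 1))
```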
Priors are specified on the standardized scale to maintain consistency. The choice of weakly informative priors[^2] reflects a balance between mild regularization and the flexibility to learn from the data: a Student-t prior on the intercept provides robustness against outliers, while an exponential prior on \(\sigma\) ensures positivity with a heavy tail.
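As an illustration of how such priors might be declared in `brms`, with the caveat that the exact hyperparameters shown (a normal prior on the slopes, Student-t on the intercept, exponential on \(\sigma\)) are assumed values rather than the production settings:

```r
library(brms)

# Weakly informative priors on the standardized scale (illustrative hyperparameters)
priors <- c(
  set_prior("normal(0, 1)", class = "b"),                  # trend and lag coefficients
  set_prior("student_t(3, 0, 2.5)", class = "Intercept"),  # robust to outliers
  set_prior("exponential(1)", class = "sigma")             # positive residual scale
)
```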
Unlike traditional cross-validation that violates the principle of temporal causality, we implement Leave-Future-Out (LFO) with a sliding window. This scheme strictly respects temporal ordering and simulates the real context of prospective forecasting.
The procedure is structured as follows:

- **Initialization:** the initial training set comprises 70% of the data or a minimum of 90 observations, whichever is greater.
- **Test horizon:** each fold evaluates \(h = 12\) months ahead, simulating annual forecasts.
- **Step between folds:** the forecast origin shifts 12 months between successive evaluations.
- **Sliding window:** the training set size remains constant[^3], discarding old observations as new ones are incorporated. This contrasts with an expanding window, where the training set grows monotonically.
The advantage of the sliding window lies in its sensitivity to regime changes and its ability to adapt to broadly non-stationary dynamics. While an expanding window assumes constant parameters throughout the entire history, the sliding window recognizes that economic relationships may evolve, maintaining only the most relevant “recent memory” for forecasting.
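A minimal sketch of how the sliding-window fold indices could be generated; the helper name is hypothetical, while the initial window rule, the 12-month horizon, and the 12-month step follow the description above.

```r
# Sliding-window Leave-Future-Out folds (hypothetical helper)
make_lfo_folds <- function(n, h = 12, step = 12) {
  init    <- max(90, floor(0.7 * n))        # initial training size: 70% or 90 obs
  origins <- seq(init, n - h, by = step)    # forecast origins, 12 months apart
  lapply(origins, function(o) {
    list(
      train = (o - init + 1):o,             # constant-size window: old points drop out
      test  = (o + 1):(o + h)               # next h = 12 months
    )
  })
}

folds <- make_lfo_folds(n = 240)            # e.g., 20 years of monthly data
```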
For each fold \(k\), we compare the predictive performance of the complete model against the baseline model using two complementary criteria:
The ELPD[^4] quantifies the expected log predictive density for new observations:
\[\text{ELPD}_k = \sum_{t \in \text{test}_k} \log \int p(y_t | \theta) p(\theta | \text{data}_{\text{train},k}) d\theta\]
In practice, we approximate this integral with a Monte Carlo average over the posterior draws, computed stably via the log-sum-exp trick:
\[\widehat{\text{ELPD}}_k = \sum_{t \in \text{test}_k} \log \left( \frac{1}{S} \sum_{s=1}^S p(y_t | \theta^{(s)}) \right)\]
where \(\theta^{(s)}\) are the posterior samples and \(S\) is the number of post-warmup iterations.
The difference \(\Delta\text{ELPD}_k = \text{ELPD}_{k,\text{full}} - \text{ELPD}_{k,\text{base}}\) measures the gain in predictive capacity. A \(\Delta\text{ELPD} > 0\) indicates that the complete model assigns higher probability to the observed out-of-sample data.
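As a rough sketch of how \(\widehat{\text{ELPD}}_k\) and \(\Delta\text{ELPD}_k\) could be computed from fitted `brms` models, assuming `fit_full`, `fit_base`, and a held-out data frame `test` (all names are placeholders):

```r
library(brms)
library(matrixStats)

# Pointwise ELPD on a held-out fold: log of the Monte Carlo mean of p(y_t | theta)
elpd_fold <- function(fit, test_data) {
  ll <- log_lik(fit, newdata = test_data)                 # S draws x n_test log densities
  pointwise <- apply(ll, 2, logSumExp) - log(nrow(ll))    # stable log-mean-exp
  sum(pointwise)
}

delta_elpd <- elpd_fold(fit_full, test) - elpd_fold(fit_base, test)
```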
We complement the ELPD with traditional point metrics, RMSE, MAE, sMAPE, and \(R^2\), computed on the original scale of the variables. In all of them, \(\hat{y}_t\) denotes the posterior mean of the predictions, obtained via `posterior_epred()` and transformed back to the original scale.
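A sketch of these metrics as they might be implemented, assuming `y` holds the observed test values and `yhat` the back-transformed posterior mean predictions; the function name and the exact sMAPE variant are illustrative assumptions.

```r
# Point-forecast metrics on the original scale (illustrative implementation)
forecast_metrics <- function(y, yhat) {
  data.frame(
    rmse  = sqrt(mean((y - yhat)^2)),
    mae   = mean(abs(y - yhat)),
    smape = 100 * mean(2 * abs(y - yhat) / (abs(y) + abs(yhat))),
    r2    = 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)  # can be negative out of sample
  )
}
```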
A complete model “wins” in fold \(k\) if and only if:
\[\begin{cases} \Delta\text{ELPD}_k > 0 & \text{(better predictive density)} \\ \Delta\text{RMSE}_k < 0 & \text{(lower squared error)} \end{cases}\]
This dual criterion requires superiority in both probabilistic (ELPD) and deterministic (RMSE) terms, avoiding spurious victories from improvements in a single dimension.
Inference is performed using the NUTS algorithm[^5], implemented in Stan through `brms` with the `cmdstanr` backend.
Sampling parameters (number of chains, iterations, and sampler control settings) are calibrated to ensure robust convergence, as illustrated in the sketch below.
Parallelization is implemented at the chain level (`parallel_chains = 4`), not within each chain, to maintain sampler efficiency without CPU oversubscription.
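An illustrative fitting call combining the pieces above; the Gaussian family, iteration counts, `adapt_delta`, and seed are assumptions for the sketch, while four parallel chains and the `cmdstanr` backend follow the text.

```r
library(brms)

fit_full <- brm(
  formula = f_full,                      # full specification sketched earlier
  data    = train_s,                     # standardized training fold (placeholder name)
  family  = gaussian(),                  # assumed likelihood
  prior   = priors,
  chains  = 4,
  cores   = 4,                           # passed to cmdstanr as parallel_chains = 4
  iter    = 2000, warmup = 1000,         # assumed iteration counts
  control = list(adapt_delta = 0.95),    # assumed sampler setting
  backend = "cmdstanr",
  seed    = 123
)
```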
Although not reported explicitly in the production code for efficiency reasons, standard diagnostics include \(\hat{R}\) values close to 1, sufficient bulk and tail effective sample sizes, and the absence of divergent transitions.
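These checks could be run post hoc on any fold's fit, for example:

```r
# Post-hoc convergence checks for a fitted fold (not part of the production loop)
summary(fit_full)                                   # Rhat and bulk/tail ESS per parameter
any(brms::rhat(fit_full) > 1.01, na.rm = TRUE)      # flag parameters with poor mixing
np <- brms::nuts_params(fit_full)
sum(subset(np, Parameter == "divergent__")$Value)   # number of divergent transitions
```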
We define support as:
\[\text{support} = \frac{\text{folds\_pass}}{\text{folds}}\]
where `folds_pass` counts the folds in which the complete model satisfies the dual victory criterion. This metric quantifies the temporal consistency of the predictive power.
We establish two confidence levels: high confidence for support above 0.70, and moderate confidence for support between 0.60 and 0.70, in both cases subject to a minimum absolute number of folds.
The requirement of an absolute minimum of folds prevents false positives from small denominators (e.g., 1/1 = 100% support does not constitute robust evidence).
High support indicates that the predictive relationship persists across different temporal regimes: values above 0.70 are read as robust evidence, while values between 0.60 and 0.70 suggest weaker or more regime-dependent relationships.
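A sketch of how per-fold victories, support, and the confidence thresholds might be aggregated, assuming `fold_results` is a data frame with one row per fold containing `delta_elpd` and `delta_rmse`; the minimum fold count shown is an assumed value.

```r
# Per-fold dual victory criterion and aggregate support (illustrative)
fold_results$win <- fold_results$delta_elpd > 0 & fold_results$delta_rmse < 0

folds_pass <- sum(fold_results$win)
support    <- folds_pass / nrow(fold_results)

# Thresholds of 0.70 / 0.60 follow the text; the minimum number of folds is assumed
high_support     <- support > 0.70 && nrow(fold_results) >= 3
moderate_support <- support >= 0.60 && support <= 0.70 && nrow(fold_results) >= 3
```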
| Aspect | ECM-MARS | Bayesian GLM AR(1) |
|---|---|---|
| Paradigm | Frequentist with sequential tests | Fully Bayesian |
| Cointegration | Explicit requirement (Engle-Granger/Johansen) | Implicit in AR(1) structure |
| Non-linearity | MARS with adaptive splines | Linear in parameters |
| Uncertainty | HAC standard errors | Complete posterior distribution |
| Comparison | Significance tests | ELPD + predictive metrics |
| Complexity | High (multiple pre-tests) | Moderate (direct model) |
- **Comprehensive uncertainty quantification:** posterior distributions capture all parametric uncertainty and propagate it naturally to predictions.
- **Direct model comparison:** ELPD provides a unified metric for comparing non-nested models without multiplicity adjustments.
- **Robustness to assumption violations:** regularizing priors and the AR(1) structure accommodate moderate deviations from classical assumptions.
- **Interpretability:** credible intervals have a direct probabilistic interpretation, unlike frequentist confidence intervals.
- **Computational cost:** MCMC sampling is orders of magnitude slower than OLS/MARS estimation.
- **Prior sensitivity:** in small samples, the choice of priors can substantially influence results.
- **Linearity:** the absence of non-linear terms (splines) may limit the flexibility to capture complex relationships.
- **Pair-level processing:** each pair \((X \to Y)\) is processed sequentially, but the MCMC chains within each model are parallelized.
- **Memory release:** Stan objects are released implicitly after each fold through R's garbage collector.
- **`cmdstanr` backend:** more efficient than `rstan` for large models, with better memory management and diagnostics.
The code includes safeguards for failure cases: runs whose models cannot be fitted return `NULL` and are excluded from the analysis. The full pipeline can be summarized in pseudocode:

```
FOR each pair (X, Y) in {production × circulation} × {2 directions}:
  1. Build matrix with Y, X, and lags(X, 1:L)
  2. Remove rows with missing values
  FOR each fold in rolling_splits(sliding_window):
    3.  Split into train/test
    4.  Standardize variables using train statistics
    5.  Fit baseline_model: Y_s ~ 1 + t_s + AR(1)
    6.  Fit full_model:     Y_s ~ 1 + t_s + X_lags + AR(1)
    7.  Calculate ELPD for both models on test
    8.  Generate point predictions (posterior_epred)
    9.  Transform predictions to original scale
    10. Compute metrics (RMSE, MAE, sMAPE, R²)
    11. Determine victory: (ΔELPD > 0) ∧ (ΔRMSE < 0)
  12. Calculate support = wins / total_folds
  13. Average metrics across folds

14. Rank pairs by (support DESC, ΔELPD DESC, ΔRMSE ASC)
15. Filter winners by support thresholds
16. Export results and visualizations
```
Per-fold scaling (not global) is crucial because it prevents data leakage[^6]: standardizing with statistics computed over the full series would let information from the test period contaminate the training process.
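A minimal sketch of per-fold standardization and back-transformation, using only training-fold statistics so that no test information enters the scaling (column and object names are illustrative):

```r
# Standardize with training-fold statistics only (no test information used)
mu_y <- mean(train$y); sd_y <- sd(train$y)
train$y_s <- (train$y - mu_y) / sd_y
test$y_s  <- (test$y  - mu_y) / sd_y      # same training statistics reused on the test fold

# Posterior mean predictions on the standardized scale, then back to the original scale
yhat_s <- colMeans(posterior_epred(fit_full, newdata = test))
yhat   <- yhat_s * sd_y + mu_y            # used for RMSE, MAE, sMAPE, R²
```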
It is possible to obtain \(R^2 < 0\) in out-of-sample evaluation when the model predicts worse than the sample mean. This does not indicate calculation error but genuine poor predictive performance. Similarly, RMSE ratios > 1 indicate that the complete model worsens predictions relative to the baseline.
The current framework admits several natural extensions, for example replacing the linear terms with smooth splines (via `s()` in `brms`) to capture non-linear relationships.
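Reusing the illustrative names from the earlier sketches, such an extension might look like:

```r
# Possible extension: smooth trend via a spline term (illustrative names as before)
f_spline <- bf(y_s ~ s(t_s) + x_lag1 + x_lag2 + x_lag3 + ar(time = date, p = 1))
```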
The Bayesian GLM methodology with AR(1) offers a robust and principled framework for evaluating predictive relationships between economic variables. Unlike the ECM-MARS approach that requires sequential validation of multiple assumptions (stationarity, cointegration, adjustment speed), the Bayesian model integrates all uncertainty into a unified framework.
Results suggest that the most robust relationships (support > 0.70) typically correspond to pairs where economic theory predicts strong causality, while relationships with moderate support (0.60-0.70) may reflect indirect transmission channels or dependence on the prevailing economic regime.
Dual evaluation via ELPD and RMSE ensures that selected models not only fit well in terms of squared error but also adequately capture the probabilistic structure of the data. This double requirement effectively filters spurious relationships that might appear promising under a single criterion.
[^1]: NUTS (No-U-Turn Sampler) is an adaptive extension of the Hamiltonian Monte Carlo algorithm that automatically adjusts trajectory lengths to maximize sampling efficiency without requiring manual tuning of the number of leapfrog steps.

[^2]: Weakly informative priors provide mild regularization without imposing strong a priori beliefs. In the limit of abundant data, their influence vanishes, recovering estimates similar to maximum likelihood.

[^3]: The constant size of the training set in the sliding window ensures that all folds have comparable statistical power, avoiding the bias toward late folds that occurs with expanding windows.

[^4]: ELPD (Expected Log Predictive Density) is the fundamental predictive evaluation criterion in Bayesian statistics, generalized by WAIC and LOO-CV. It measures the model's ability to assign high probability to new observations.

[^5]: The NUTS algorithm eliminates the need to manually tune the number of steps in HMC through a stopping criterion based on "U-turns" in parameter space, where the trajectory begins to double back toward its origin.

[^6]: Data leakage occurs when information from the test set contaminates the training process, artificially inflating performance metrics. Scaling with global statistics would constitute a subtle but pernicious form of leakage.