Introduction to RobustFlow

Overview

RobustFlow is an R package for auditing temporal drift, subgroup disparities, and robustness in longitudinal decision systems.

It is built around three methodological contributions:

Metric Definition Key function
Drift Intensity Index (DII) Frobenius norm of consecutive transition matrix differences compute_drift()
Bias Amplification Index (BAI) Change in disparity gap from first to last time point compute_bai()
Temporal Fragility Index (TFI) Minimum hidden-bias perturbation to nullify a trend conclusion compute_tfi_simple()

An interactive Shiny application (run_app()) provides a seven-tab workflow with point-and-click access to all analyses plus downloadable HTML reports, CSV bundles, and reproducible R scripts.


Simulated dataset

We construct a panel that mimics a study of mathematics risk status across five elementary school waves, with three SES groups.

set.seed(2024)
n_children <- 200
n_waves    <- 5

# Simulate SES group with unequal sizes
ses <- sample(
  c("Low SES", "Mid SES", "High SES"),
  size    = n_children,
  replace = TRUE,
  prob    = c(0.35, 0.40, 0.25)
)

# Risk probability increases over waves for Low SES,
# decreases slightly for High SES
risk_prob <- function(ses_grp, wave) {
  base <- switch(ses_grp,
    "Low SES"  = 0.55,
    "Mid SES"  = 0.35,
    "High SES" = 0.20
  )
  trend <- switch(ses_grp,
    "Low SES"  =  0.04,
    "Mid SES"  =  0.01,
    "High SES" = -0.02
  )
  pmin(pmax(base + trend * (wave - 1), 0.02), 0.98)
}

rows <- lapply(seq_len(n_children), function(i) {
  data.frame(
    child_id  = i,
    wave      = seq_len(n_waves),
    ses_group = ses[i],
    school_id = sample(seq_len(20), 1),
    risk_math = rbinom(n_waves, 1,
                       vapply(seq_len(n_waves),
                              function(w) risk_prob(ses[i], w),
                              numeric(1)))
  )
})
ecls_demo <- do.call(rbind, rows)
head(ecls_demo, 8)
#>   child_id wave ses_group school_id risk_math
#> 1        1    1  High SES         5         0
#> 2        1    2  High SES         5         0
#> 3        1    3  High SES         5         0
#> 4        1    4  High SES         5         0
#> 5        1    5  High SES         5         0
#> 6        2    1   Mid SES        14         0
#> 7        2    2   Mid SES        14         0
#> 8        2    3   Mid SES        14         1

Step 1: Validate panel data

validate_panel_data() checks variable existence, sorts the data, and reports panel balance and missingness.

validated <- validate_panel_data(
  data     = ecls_demo,
  id       = "child_id",
  time     = "wave",
  decision = "risk_math",
  group    = "ses_group",
  cluster  = "school_id"
)

cat(sprintf(
  "Individuals: %d\nWaves:       %d\nBalanced:    %s\n",
  validated$n_ids,
  validated$n_times,
  validated$balanced
))
#> Individuals: 200
#> Waves:       5
#> Balanced:    TRUE

Step 2: Build decision paths

build_paths() constructs each child’s complete risk trajectory and returns aggregate path frequencies, the pooled transition matrix, and Shannon entropy.

paths <- build_paths(
  data     = validated$data,
  id       = "child_id",
  time     = "wave",
  decision = "risk_math"
)

# Top 10 paths
head(paths$path_counts, 10)
#>             path  n  pct
#> 1  0->0->0->0->0 28 14.0
#> 2  0->0->0->1->0 16  8.0
#> 3  0->0->1->0->0 11  5.5
#> 4  1->0->0->0->0 11  5.5
#> 5  0->0->0->0->1 10  5.0
#> 6  0->0->0->1->1  8  4.0
#> 7  0->0->1->0->1  8  4.0
#> 8  1->1->0->1->1  8  4.0
#> 9  0->1->0->0->0  7  3.5
#> 10 0->1->0->0->1  7  3.5

Shannon entropy measures trajectory diversity — higher values mean more heterogeneous individual histories.

cat(sprintf("Path entropy: %.3f bits\n", paths$path_entropy))
#> Path entropy: 4.652 bits

Aggregate transition matrix

paths$transition_matrix
#>    
#>       0   1
#>   0 319 177
#>   1 159 145
df_paths <- head(paths$path_counts, 10)
df_paths$path <- factor(df_paths$path, levels = rev(df_paths$path))

ggplot(df_paths, aes(x = path, y = n)) +
  geom_col(fill = "#2c7bb6") +
  coord_flip() +
  labs(x = "Path", y = "Count",
       title = "Top 10 Decision Paths") +
  theme_minimal()
Top 10 most common five-wave risk trajectories.

Top 10 most common five-wave risk trajectories.


Step 3: Drift Intensity Index (DII)

compute_drift() estimates a period-specific transition matrix for each consecutive wave pair and computes the Frobenius norm of their difference:

\[DII_t = \| P_t - P_{t-1} \|_F\]

A larger DII signals greater structural instability between two periods.

drift <- compute_drift(
  data      = validated$data,
  id        = "child_id",
  time      = "wave",
  decision  = "risk_math",
  normalize = TRUE
)

drift$summary
#>   time      DII
#> 1    1       NA
#> 2    2       NA
#> 3    3 0.089542
#> 4    4 0.140190
#> 5    5 0.095655
cat(sprintf("Mean DII:             %.4f\n", drift$mean_dii))
#> Mean DII:             0.1085
cat(sprintf("Period of max drift:  wave %s\n", drift$max_dii_period))
#> Period of max drift:  wave 4
ggplot(drift$summary, aes(x = time, y = DII)) +
  geom_line(color = "#1a9641", linewidth = 1) +
  geom_point(size = 2.5) +
  labs(x = "Wave", y = "DII",
       title = "Drift Intensity Index Over Time") +
  theme_minimal()
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).
DII over time. Higher values indicate greater structural change in risk transitions.

DII over time. Higher values indicate greater structural change in risk transitions.


Step 4: Group trajectories and Bias Amplification Index (BAI)

compute_group_gaps() computes the at-risk rate for each SES group at each wave and the pairwise gap (alphabetically first group minus second).

compute_bai() then summarizes whether that gap widens or narrows:

\[BAI = Gap_T - Gap_1\]

gaps <- compute_group_gaps(
  data        = validated$data,
  time        = "wave",
  decision    = "risk_math",
  group       = "ses_group",
  focal_value = 1
)

gaps$gap_df
#>   time       gap
#> 1    1 -0.277778
#> 2    2 -0.534722
#> 3    3 -0.347222
#> 4    4 -0.479167
#> 5    5 -0.590277
ggplot(gaps$long_format,
       aes(x = time, y = rate, color = group, group = group)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2.5) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, 1)) +
  labs(x = "Wave", y = "At-Risk Rate", color = "SES Group",
       title = "Math Risk Trajectories by SES Group") +
  theme_minimal() +
  theme(legend.position = "bottom")
At-risk rates by SES group across five waves.

At-risk rates by SES group across five waves.

ggplot(gaps$gap_df, aes(x = time, y = gap)) +
  geom_line(color = "#762a83", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey40") +
  geom_point(size = 2.5) +
  labs(x = "Wave", y = "Gap",
       title = "Disparity Gap Over Time") +
  theme_minimal()
Disparity gap (first two alphabetic SES groups) over time.

Disparity gap (first two alphabetic SES groups) over time.

bai <- compute_bai(gaps$gap)
cat(sprintf("BAI:       %.4f\n", bai$bai))
#> BAI:       -0.3125
cat(sprintf("Direction: %s\n",   bai$direction))
#> Direction: convergence
cat(sprintf("Gap at t=1: %.4f  |  Gap at t=T: %.4f\n",
            bai$gap_start, bai$gap_end))
#> Gap at t=1: -0.2778  |  Gap at t=T: -0.5903

The standardized version divides by the SD of the gap series:

bai_std <- compute_bai(gaps$gap, standardize = TRUE)
cat(sprintf("Standardized BAI: %.4f\n", bai_std$bai))
#> Standardized BAI: -2.3995

Step 5: Temporal Fragility Index (TFI)

compute_tfi_simple() estimates how much hidden bias — modelled as a scalar attenuation \(u\) — would be required to nullify the observed longitudinal trend in DII. The adjusted slope under perturbation \(u\) is:

\[\hat{\beta}(u) = \hat{\beta}_{obs} \times (1 - u)\]

The TFI is the minimum \(u\) at which the trend reverses or reaches zero.

dii_vals <- drift$summary$DII[!is.na(drift$summary$DII)]

tfi <- compute_tfi_simple(dii_vals)
tfi$summary_table
#>           Metric             Value
#> 1 Observed slope          0.003056
#> 2            TFI                 1
#> 3 Interpretation Moderately robust
sc <- tfi$sensitivity_curve

ggplot(sc, aes(x = perturbation, y = adjusted_effect)) +
  geom_line(color = "#4575b4", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "#d73027",
             linewidth = 0.8) +
  labs(x = "Hidden Bias Parameter (u)",
       y = "Adjusted Slope",
       title = "TFI Sensitivity Curve") +
  theme_minimal()
TFI sensitivity curve. The red dashed line marks the null effect threshold.

TFI sensitivity curve. The red dashed line marks the null effect threshold.

Interpretation guide

TFI value Interpretation
Inf Conclusion cannot be nullified by scalar attenuation — highly robust
> 0.5 Moderately robust
0.1 – 0.5 Requires scrutiny — modest hidden bias could reverse conclusions
< 0.1 Fragile — small bias sufficient to nullify the trend

Step 6: Exporting a reproducible R script

generate_r_script(
  id_var       = "child_id",
  time_var     = "wave",
  decision_var = "risk_math",
  group_var    = "ses_group",
  cluster_var  = "school_id",
  focal_value  = 1,
  output_file  = "my_robustflow_analysis.R"
)

The generated script is a self-contained .R file that a collaborator can run to reproduce the full analysis without the Shiny app.


Step 7: Launch the interactive app

run_app()

The app loads in your default browser and provides:

Tab Content
Data Upload, variable mapping, balance diagnostics, missingness
Paths Path frequencies, transition matrix, entropy
Drift Prevalence over time, DII trend, period-specific matrices
Disparities Group trajectories, gap evolution, BAI
Robustness TFI, sensitivity curve
Intervention Highest-risk transitions, disparity-generating steps
Report HTML report, CSV bundle, reproducible R script

Citation

If you use RobustFlow in published research, please cite:

Hait, S. (2025). RobustFlow: Robustness and drift auditing for
longitudinal decision systems. R package version 0.1.0.
https://github.com/subirhait/RobustFlow

Session info

sessionInfo()
#> R version 4.5.1 (2025-06-13 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/Chicago
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_4.0.2    RobustFlow_0.1.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] config_0.3.2       plotly_4.11.0      sass_0.4.10        generics_0.1.4    
#>  [5] tidyr_1.3.2        digest_0.6.37      magrittr_2.0.4     evaluate_1.0.5    
#>  [9] attempt_0.3.1      grid_4.5.1         RColorBrewer_1.1-3 golem_0.5.1       
#> [13] fastmap_1.2.0      jsonlite_2.0.0     promises_1.3.3     httr_1.4.7        
#> [17] purrr_1.2.1        viridisLite_0.4.2  scales_1.4.0       lazyeval_0.2.2    
#> [21] jquerylib_0.1.4    cli_3.6.5          shiny_1.11.1       rlang_1.2.0       
#> [25] withr_3.0.2        cachem_1.1.0       yaml_2.3.10        tools_4.5.1       
#> [29] dplyr_1.2.1        httpuv_1.6.16      DT_0.34.0          vctrs_0.7.1       
#> [33] R6_2.6.1           mime_0.13          lifecycle_1.0.5    htmlwidgets_1.6.4 
#> [37] pkgconfig_2.0.3    pillar_1.11.1      bslib_0.9.0        later_1.4.4       
#> [41] gtable_0.3.6       glue_1.8.0         data.table_1.17.8  Rcpp_1.1.0        
#> [45] xfun_0.53          tibble_3.3.1       tidyselect_1.2.1   rstudioapi_0.17.1 
#> [49] knitr_1.51         farver_2.1.2       xtable_1.8-4       htmltools_0.5.8.1 
#> [53] rmarkdown_2.31     labeling_0.4.3     compiler_4.5.1     S7_0.2.0