Introduction to RobustFlow

Overview

RobustFlow is an R package for auditing temporal drift, subgroup disparities, and robustness in longitudinal decision systems.

It is built around three methodological contributions:

Metric	Definition	Key function
Drift Intensity Index (DII)	Frobenius norm of consecutive transition matrix differences	`compute_drift()`
Bias Amplification Index (BAI)	Change in disparity gap from first to last time point	`compute_bai()`
Temporal Fragility Index (TFI)	Minimum hidden-bias perturbation to nullify a trend conclusion	`compute_tfi_simple()`

An interactive Shiny application (run_app()) provides a seven-tab workflow with point-and-click access to all analyses plus downloadable HTML reports, CSV bundles, and reproducible R scripts.

Simulated dataset

We construct a panel that mimics a study of mathematics risk status across five elementary school waves, with three SES groups.

set.seed(2024)
n_children <- 200
n_waves    <- 5

# Simulate SES group with unequal sizes
ses <- sample(
  c("Low SES", "Mid SES", "High SES"),
  size    = n_children,
  replace = TRUE,
  prob    = c(0.35, 0.40, 0.25)
)

# Risk probability increases over waves for Low SES,
# decreases slightly for High SES
risk_prob <- function(ses_grp, wave) {
  base <- switch(ses_grp,
    "Low SES"  = 0.55,
    "Mid SES"  = 0.35,
    "High SES" = 0.20
  )
  trend <- switch(ses_grp,
    "Low SES"  =  0.04,
    "Mid SES"  =  0.01,
    "High SES" = -0.02
  )
  pmin(pmax(base + trend * (wave - 1), 0.02), 0.98)
}

rows <- lapply(seq_len(n_children), function(i) {
  data.frame(
    child_id  = i,
    wave      = seq_len(n_waves),
    ses_group = ses[i],
    school_id = sample(seq_len(20), 1),
    risk_math = rbinom(n_waves, 1,
                       vapply(seq_len(n_waves),
                              function(w) risk_prob(ses[i], w),
                              numeric(1)))
  )
})
ecls_demo <- do.call(rbind, rows)
head(ecls_demo, 8)
#>   child_id wave ses_group school_id risk_math
#> 1        1    1  High SES         5         0
#> 2        1    2  High SES         5         0
#> 3        1    3  High SES         5         0
#> 4        1    4  High SES         5         0
#> 5        1    5  High SES         5         0
#> 6        2    1   Mid SES        14         0
#> 7        2    2   Mid SES        14         0
#> 8        2    3   Mid SES        14         1

Step 1: Validate panel data

validate_panel_data() checks variable existence, sorts the data, and reports panel balance and missingness.

validated <- validate_panel_data(
  data     = ecls_demo,
  id       = "child_id",
  time     = "wave",
  decision = "risk_math",
  group    = "ses_group",
  cluster  = "school_id"
)

cat(sprintf(
  "Individuals: %d\nWaves:       %d\nBalanced:    %s\n",
  validated$n_ids,
  validated$n_times,
  validated$balanced
))
#> Individuals: 200
#> Waves:       5
#> Balanced:    TRUE

Step 2: Build decision paths

build_paths() constructs each child’s complete risk trajectory and returns aggregate path frequencies, the pooled transition matrix, and Shannon entropy.

paths <- build_paths(
  data     = validated$data,
  id       = "child_id",
  time     = "wave",
  decision = "risk_math"
)

# Top 10 paths
head(paths$path_counts, 10)
#>             path  n  pct
#> 1  0->0->0->0->0 28 14.0
#> 2  0->0->0->1->0 16  8.0
#> 3  0->0->1->0->0 11  5.5
#> 4  1->0->0->0->0 11  5.5
#> 5  0->0->0->0->1 10  5.0
#> 6  0->0->0->1->1  8  4.0
#> 7  0->0->1->0->1  8  4.0
#> 8  1->1->0->1->1  8  4.0
#> 9  0->1->0->0->0  7  3.5
#> 10 0->1->0->0->1  7  3.5

Shannon entropy measures trajectory diversity — higher values mean more heterogeneous individual histories.

cat(sprintf("Path entropy: %.3f bits\n", paths$path_entropy))
#> Path entropy: 4.652 bits

Aggregate transition matrix

paths$transition_matrix
#>    
#>       0   1
#>   0 319 177
#>   1 159 145

df_paths <- head(paths$path_counts, 10)
df_paths$path <- factor(df_paths$path, levels = rev(df_paths$path))

ggplot(df_paths, aes(x = path, y = n)) +
  geom_col(fill = "#2c7bb6") +
  coord_flip() +
  labs(x = "Path", y = "Count",
       title = "Top 10 Decision Paths") +
  theme_minimal()

Top 10 most common five-wave risk trajectories.

Step 3: Drift Intensity Index (DII)

compute_drift() estimates a period-specific transition matrix for each consecutive wave pair and computes the Frobenius norm of their difference:

\[DII_t = \| P_t - P_{t-1} \|_F\]

A larger DII signals greater structural instability between two periods.

drift <- compute_drift(
  data      = validated$data,
  id        = "child_id",
  time      = "wave",
  decision  = "risk_math",
  normalize = TRUE
)

drift$summary
#>   time      DII
#> 1    1       NA
#> 2    2       NA
#> 3    3 0.089542
#> 4    4 0.140190
#> 5    5 0.095655
cat(sprintf("Mean DII:             %.4f\n", drift$mean_dii))
#> Mean DII:             0.1085
cat(sprintf("Period of max drift:  wave %s\n", drift$max_dii_period))
#> Period of max drift:  wave 4

ggplot(drift$summary, aes(x = time, y = DII)) +
  geom_line(color = "#1a9641", linewidth = 1) +
  geom_point(size = 2.5) +
  labs(x = "Wave", y = "DII",
       title = "Drift Intensity Index Over Time") +
  theme_minimal()
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).

DII over time. Higher values indicate greater structural change in risk transitions.

Step 4: Group trajectories and Bias Amplification Index (BAI)

compute_group_gaps() computes the at-risk rate for each SES group at each wave and the pairwise gap (alphabetically first group minus second).

compute_bai() then summarizes whether that gap widens or narrows:

\[BAI = Gap_T - Gap_1\]

gaps <- compute_group_gaps(
  data        = validated$data,
  time        = "wave",
  decision    = "risk_math",
  group       = "ses_group",
  focal_value = 1
)

gaps$gap_df
#>   time       gap
#> 1    1 -0.277778
#> 2    2 -0.534722
#> 3    3 -0.347222
#> 4    4 -0.479167
#> 5    5 -0.590277

ggplot(gaps$long_format,
       aes(x = time, y = rate, color = group, group = group)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2.5) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, 1)) +
  labs(x = "Wave", y = "At-Risk Rate", color = "SES Group",
       title = "Math Risk Trajectories by SES Group") +
  theme_minimal() +
  theme(legend.position = "bottom")

At-risk rates by SES group across five waves.

ggplot(gaps$gap_df, aes(x = time, y = gap)) +
  geom_line(color = "#762a83", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey40") +
  geom_point(size = 2.5) +
  labs(x = "Wave", y = "Gap",
       title = "Disparity Gap Over Time") +
  theme_minimal()

Disparity gap (first two alphabetic SES groups) over time.

bai <- compute_bai(gaps$gap)
cat(sprintf("BAI:       %.4f\n", bai$bai))
#> BAI:       -0.3125
cat(sprintf("Direction: %s\n",   bai$direction))
#> Direction: convergence
cat(sprintf("Gap at t=1: %.4f  |  Gap at t=T: %.4f\n",
            bai$gap_start, bai$gap_end))
#> Gap at t=1: -0.2778  |  Gap at t=T: -0.5903

The standardized version divides by the SD of the gap series:

bai_std <- compute_bai(gaps$gap, standardize = TRUE)
cat(sprintf("Standardized BAI: %.4f\n", bai_std$bai))
#> Standardized BAI: -2.3995

Step 5: Temporal Fragility Index (TFI)

compute_tfi_simple() estimates how much hidden bias — modelled as a scalar attenuation \(u\) — would be required to nullify the observed longitudinal trend in DII. The adjusted slope under perturbation \(u\) is:

\[\hat{\beta}(u) = \hat{\beta}_{obs} \times (1 - u)\]

The TFI is the minimum \(u\) at which the trend reverses or reaches zero.

dii_vals <- drift$summary$DII[!is.na(drift$summary$DII)]

tfi <- compute_tfi_simple(dii_vals)
tfi$summary_table
#>           Metric             Value
#> 1 Observed slope          0.003056
#> 2            TFI                 1
#> 3 Interpretation Moderately robust

sc <- tfi$sensitivity_curve

ggplot(sc, aes(x = perturbation, y = adjusted_effect)) +
  geom_line(color = "#4575b4", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "#d73027",
             linewidth = 0.8) +
  labs(x = "Hidden Bias Parameter (u)",
       y = "Adjusted Slope",
       title = "TFI Sensitivity Curve") +
  theme_minimal()

TFI sensitivity curve. The red dashed line marks the null effect threshold.

Interpretation guide

TFI value	Interpretation
`Inf`	Conclusion cannot be nullified by scalar attenuation — highly robust
`> 0.5`	Moderately robust
`0.1 – 0.5`	Requires scrutiny — modest hidden bias could reverse conclusions
`< 0.1`	Fragile — small bias sufficient to nullify the trend

Step 6: Exporting a reproducible R script

generate_r_script(
  id_var       = "child_id",
  time_var     = "wave",
  decision_var = "risk_math",
  group_var    = "ses_group",
  cluster_var  = "school_id",
  focal_value  = 1,
  output_file  = "my_robustflow_analysis.R"
)

The generated script is a self-contained .R file that a collaborator can run to reproduce the full analysis without the Shiny app.

Step 7: Launch the interactive app

run_app()

The app loads in your default browser and provides:

Tab	Content
Data	Upload, variable mapping, balance diagnostics, missingness
Paths	Path frequencies, transition matrix, entropy
Drift	Prevalence over time, DII trend, period-specific matrices
Disparities	Group trajectories, gap evolution, BAI
Robustness	TFI, sensitivity curve
Intervention	Highest-risk transitions, disparity-generating steps
Report	HTML report, CSV bundle, reproducible R script

Citation

If you use RobustFlow in published research, please cite:

Hait, S. (2025). RobustFlow: Robustness and drift auditing for
longitudinal decision systems. R package version 0.1.0.
https://github.com/subirhait/RobustFlow

Session info

sessionInfo()
#> R version 4.5.1 (2025-06-13 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/Chicago
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_4.0.2    RobustFlow_0.1.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] config_0.3.2       plotly_4.11.0      sass_0.4.10        generics_0.1.4    
#>  [5] tidyr_1.3.2        digest_0.6.37      magrittr_2.0.4     evaluate_1.0.5    
#>  [9] attempt_0.3.1      grid_4.5.1         RColorBrewer_1.1-3 golem_0.5.1       
#> [13] fastmap_1.2.0      jsonlite_2.0.0     promises_1.3.3     httr_1.4.7        
#> [17] purrr_1.2.1        viridisLite_0.4.2  scales_1.4.0       lazyeval_0.2.2    
#> [21] jquerylib_0.1.4    cli_3.6.5          shiny_1.11.1       rlang_1.2.0       
#> [25] withr_3.0.2        cachem_1.1.0       yaml_2.3.10        tools_4.5.1       
#> [29] dplyr_1.2.1        httpuv_1.6.16      DT_0.34.0          vctrs_0.7.1       
#> [33] R6_2.6.1           mime_0.13          lifecycle_1.0.5    htmlwidgets_1.6.4 
#> [37] pkgconfig_2.0.3    pillar_1.11.1      bslib_0.9.0        later_1.4.4       
#> [41] gtable_0.3.6       glue_1.8.0         data.table_1.17.8  Rcpp_1.1.0        
#> [45] xfun_0.53          tibble_3.3.1       tidyselect_1.2.1   rstudioapi_0.17.1 
#> [49] knitr_1.51         farver_2.1.2       xtable_1.8-4       htmltools_0.5.8.1 
#> [53] rmarkdown_2.31     labeling_0.4.3     compiler_4.5.1     S7_0.2.0