By default, the Poisson, binomial, and normal models in bage
assume that any measurement errors in the input data are small enough to be ignored. These models can, however, be extended to accommodate various types of measurement error. This is done by adding a “data model”—also referred to as a “measurement error model”—to the base model. The data models that have been implemented so far in bage
are fairly generic: they aim to perform reasonably well on a wide range of applications, rather than performing optimally on any particular application. Future versions of bage
will add some more specialised data models.
This vignette begins with an overview of the current menu of data models. It then shows how the simple models presented in the overview can be extended to deal with more complicated situations, and discusses forecasting and confidentialization.
The overview will use a model of births in a single Korean province in a single year. The input data is:
births
#> # A tibble: 9 × 3
#> age births popn
#> <chr> <int> <int>
#> 1 10-14 0 27084
#> 2 15-19 9 25322
#> 3 20-24 110 23935
#> 4 25-29 912 28936
#> 5 30-34 2547 30964
#> 6 35-39 1235 31611
#> 7 40-44 262 44567
#> 8 45-49 6 41774
#> 9 50-54 0 51312
Our base model treats the input data as error free. We specify and fit the base model as follows.
library(bage)
library(dplyr)
mod_base <- mod_pois(births ~ age,
data = births,
exposure = popn) |>
fit()
#> Building log-posterior function...
#> Finding maximum...
#> Drawing values for hyper-parameters...
mod_base
#>
#> ------ Fitted Poisson model ------
#>
#> births ~ age
#>
#> exposure: popn
#>
#> term prior along n_par n_par_free std_dev
#> (Intercept) NFix() - 1 1 -
#> age RW() age 9 9 1.27
#>
#> disp: mean = 1
#>
#> n_draw var_age optimizer
#> 1000 age nlminb
#>
#> time_total time_max time_draw iter converged
#> 0.02 0.01 0.01 15 TRUE
#> message
#> both X-convergence and relative convergence (5)
The base model yields the following estimates for birth rates:
In full-sized applications, we generally want inclusion probabilities, coverage rates, standard deviations, and coefficients of variation to vary across dimensions such as age, sex, and time. The data models implemented in bage
allow this sort of variation.
We illustrate the use of varying data model parameters using a slightly extended version of our births model. We use a new dataset that includes a time variable:
births_time
#> # A tibble: 27 × 4
#> age time births popn
#> <chr> <int> <int> <int>
#> 1 10-14 2021 0 27208
#> 2 10-14 2022 0 27185
#> 3 10-14 2023 0 27084
#> 4 15-19 2021 3 25039
#> 5 15-19 2022 16 25276
#> 6 15-19 2023 9 25322
#> 7 20-24 2021 178 29008
#> 8 20-24 2022 120 26201
#> 9 20-24 2023 110 23935
#> 10 25-29 2021 1212 30826
#> # ℹ 17 more rows
We create a new prior for inclusion probabilities where the mean and dispersion vary over time:
prob_under_time <- data.frame(time = c(2021, 2022, 2023),
mean = c(0.99, 0.8, 0.99),
disp = c(0.01, 0.02, 0.01))
We implement one model with the old time-constant prior, and one with the new time-varying prior.
mod_timeconst <- mod_pois(births ~ age + time,
data = births_time,
exposure = popn) |>
set_datamod_undercount(prob = prob_under)
mod_timevarying <- mod_timeconst |>
set_datamod_undercount(prob = prob_under_time)
#> → Replacing existing "undercount" data model with new "undercount" data model
As we would expect, these two models give different results.
Data model parameters can vary across more than one dimension. We set up a prior that varies across age and time. Rather than using different inclusion probabilities for every age group, however, we use one set for people aged 10-34, and one for people aged 35+. To implement this, we need to create a new, aggregated age group.
births_time <- births_time |>
mutate(agegp = if_else(age %in% c("10-14", "15-19",
"20-24", "25-29",
"30-34"),
"10-34",
"35+"))
prob_under_agetime <- data.frame(
time = c(2021, 2022, 2023, 2021, 2022, 2023),
agegp = c("10-34", "10-34", "10-34", "35+", "35+", "35+"),
mean = c(0.95, 0.95, 0.95, 0.95, 0.5, 0.95),
disp = c(0.01, 0.01, 0.01, 0.01, 0.1, 0.01))
mod_agetime <- mod_pois(births ~ age + time,
data = births_time,
exposure = popn) |>
set_datamod_undercount(prob = prob_under_agetime)
The model assumes that births for women aged 35+ in 2022 were subject to an unusual degree of under-reporting. The results reflect that assumption.
Data models can be used in forecasting applications. Implementation is easiest with the data model does not contain any time-varying parameters, like the mod_timeconst
constructed above:
mod_timeconst |>
fit() |>
forecast(label = 2024)
#> Building log-posterior function...
#> Finding maximum...
#> Drawing values for hyper-parameters...
#> `components()` for past values...
#> `components()` for future values...
#> `augment()` for future values...
#> # A tibble: 9 × 8
#> age time births .births popn .observed .fitted
#> <chr> <dbl> <dbl> <dbl> <int> <dbl> <rdbl<1000>>
#> 1 10-14 2024 NA NA NA NA 5.4e-05 (2e-05, 0.00015)
#> 2 15-19 2024 NA NA NA NA 0.00045 (0.00025, 0.00077)
#> 3 20-24 2024 NA NA NA NA 0.0061 (0.004, 0.0093)
#> 4 25-29 2024 NA NA NA NA 0.04 (0.027, 0.064)
#> 5 30-34 2024 NA NA NA NA 0.1 (0.07, 0.16)
#> 6 35-39 2024 NA NA NA NA 0.048 (0.032, 0.073)
#> 7 40-44 2024 NA NA NA NA 0.0075 (0.0048, 0.012)
#> 8 45-49 2024 NA NA NA NA 0.00013 (6.6e-05, 0.00024)
#> 9 50-54 2024 NA NA NA NA 1.1e-05 (2.6e-06, 5.2e-05)
#> # ℹ 1 more variable: .expected <rdbl<1000>>
If future values for population are supplied, then the forecast will include true and reported values for the outcome variable:
newdata_births <- data.frame(
age = c("10-14", "15-19", "20-24", "25-29", "30-34",
"35-39", "40-44", "45-49", "50-54"),
time = rep(2024, 9),
popn = c(27084, 25322, 23935, 28936, 30964,
31611, 44567, 41774, 51312))
mod_timeconst |>
fit() |>
forecast(newdata = newdata_births)
#> Building log-posterior function...
#> Finding maximum...
#> Drawing values for hyper-parameters...
#> `components()` for past values...
#> `components()` for future values...
#> `augment()` for future values...
#> # A tibble: 9 × 8
#> age time births .births popn .observed
#> <chr> <dbl> <rdbl<1000>> <rdbl<1000>> <dbl> <dbl>
#> 1 10-14 2024 1 (0, 4) 1 (0, 6) 27084 NA
#> 2 15-19 2024 9 (3, 18) 11 (4, 22) 25322 NA
#> 3 20-24 2024 115 (71, 186) 144 (92, 230) 23935 NA
#> 4 25-29 2024 937 (608, 1520) 1174 (805, 1897) 28936 NA
#> 5 30-34 2024 2598 (1649, 4076) 3237 (2126, 4945) 30964 NA
#> 6 35-39 2024 1214 (756, 1841) 1518 (1012, 2287) 31611 NA
#> 7 40-44 2024 264 (160, 416) 331 (211, 514) 44567 NA
#> 8 45-49 2024 4 (0, 11) 5 (1, 13) 41774 NA
#> 9 50-54 2024 0 (0, 3) 0 (0, 3) 51312 NA
#> # ℹ 2 more variables: .fitted <rdbl<1000>>, .expected <rdbl<1000>>
When the data model contains parameters that vary over time, future values for these parameters must be specified. This is done when the data model is first created. Here, for, instance, we create a version of the prob_under_time
prior that includes values for 2024.
prob_under_time_ext <- rbind(
prob_under_time,
data.frame(time = 2024,
mean = 0.95,
disp = 0.05))
prob_under_time_ext
#> time mean disp
#> 1 2021 0.99 0.01
#> 2 2022 0.80 0.02
#> 3 2023 0.99 0.01
#> 4 2024 0.95 0.05
Our dataset only contains values up to 2023. When we fit our model, bage
only uses prior values for the data model up to 2023. But when we forecast, bage
uses the prior values for 2024.
bage
allows for the possibility that the outcome variable has been subject to measurement errors and has been confidentialized. The following model assumes that births have been under-reported, and have been randomly rounded to multiples of 3.
The undercount, overcount, miscount, noise, and exposure data models allow analysts to account for general types of measurement error commonly encountered in applied demography. Like all models in bage
, the data models are designed to be fast, even with large datasets.
As bage
develops further, we would like to complement the existing suite of data models with additional, more specialized models. We would, for instance, like to add a special model for data, such as births data, where there are gaps between the date of occurrence and the date of registration. We welcome suggestions for specialised models.