| Title: | A Common API for Synthesizing Data |
| Version: | 0.1.2 |
| Maintainer: | Aaron R. Williams <awilliams@urban.org> |
| Description: | A system built on 'tidymodels' for generating synthetic tabular data. We provide tools for ordering a sequential synthesis, feature and target engineering, sampling, hyperparameter tuning, enforcing constraints, and adding extra noise during a synthesis. |
| URL: | https://ui-research.github.io/tidysynthesis-documentation/ |
| BugReports: | https://github.com/UrbanInstitute/tidysynthesis/issues |
| Depends: | R (≥ 4.1.0) |
| Imports: | dplyr, forcats, parsnip, pillar, purrr, progressr, recipes, rlang, rsample, stringr, tibble, tidyr (≥ 1.0.0), tune, vctrs, workflows, yardstick, ExtDist, dapper |
| Suggests: | hardhat, palmerpenguins, poissonreg, randomForest, ranger, rpart, rpart.LAD (≥ 0.1.2), testthat (≥ 2.1.0), usethis, VGAM |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.2 |
| NeedsCompilation: | no |
| Packaged: | 2025-11-06 15:30:08 UTC; aaronwilliams |
| Author: | Aaron R. Williams |
| Repository: | CRAN |
| Date/Publication: | 2025-11-11 21:20:07 UTC |
constant mapping between component names and inspections
Description
constant mapping between component names and inspections
Usage
.name_to_inspect
Format
An object of class list of length 7.
American Community Survey confidential microdata (with weights)
Description
An extract constructed from the 2019 American Community Survey containing a survey sample of n = 1500 Nebraska respondents, with survey weights included.
Usage
acs_conf
Format
acs_conf
A data frame with 1,500 rows and 12 columns:
- county
fct, county
- gq
fct, group quarter kind
- sex
fct, sex
- marst
fct, marital status
- hcovany
fct, health insurance status
- empstat
fct, employment status; contains empty levels.
- classwkr
fct, employment kind (e.g., self-employed); contains "N/A" levels.
- age
dbl, age (in years)
- famsize
dbl, household/family size
- transit_time
dbl, transit time to work (in minutes)
- inctot
dbl, annual income; contains missing values
- wgt
dbl, survey weight
Details
Original data source: Steven Ruggles, Sarah Flood, Matthew Sobek, Daniel Backman, Annie Chen, Grace Cooper, Stephanie Richards, Renae Rogers, and Megan Schouweiler. IPUMS USA: Version 15.0 [dataset]. Minneapolis, MN: IPUMS, 2024. https://doi.org/10.18128/D010.V15.0
Source
American Community Survey confidential microdata (without weights)
Description
An extract constructed from the 2019 American Community Survey containing a survey sample of n = 1500 Nebraska respondents, without survey weights.
Usage
acs_conf_nw
Format
acs_conf_nw
A data frame with 1,500 rows and 11 columns:
- county
fct, county
- gq
fct, group quarter kind
- sex
fct, sex
- marst
fct, marital status
- hcovany
fct, health insurance status
- empstat
fct, employment status; contains empty levels.
- classwkr
fct, employment kind (e.g., self-employed); contains "N/A" levels.
- age
dbl, age (in years)
- famsize
dbl, household/family size
- transit_time
dbl, transit time to work (in minutes)
- inctot
dbl, annual income; contains missing values
Details
Original data source: Steven Ruggles, Sarah Flood, Matthew Sobek, Daniel Backman, Annie Chen, Grace Cooper, Stephanie Richards, Renae Rogers, and Megan Schouweiler. IPUMS USA: Version 15.0 [dataset]. Minneapolis, MN: IPUMS, 2024. https://doi.org/10.18128/D010.V15.0
Source
American Community Survey starting microdata (with weights)
Description
An extract constructed from the 2019 American Community Survey containing a survey sample of n = 500 Nebraska respondents, with survey weights included.
Usage
acs_start
Format
acs_start
A data frame with 500 rows and 5 columns:
- county
fct, county
- gq
fct, group quarter kind
- sex
fct, sex
- marst
fct, marital status
- wgt
dbl, survey weight
Details
Original data source: Steven Ruggles, Sarah Flood, Matthew Sobek, Daniel Backman, Annie Chen, Grace Cooper, Stephanie Richards, Renae Rogers, and Megan Schouweiler. IPUMS USA: Version 15.0 [dataset]. Minneapolis, MN: IPUMS, 2024. https://doi.org/10.18128/D010.V15.0
Source
American Community Survey starting microdata (without weights)
Description
An extract constructed from the 2019 American Community Survey containing a survey sample of n = 500 Nebraska respondents, without survey weights.
Usage
acs_start_nw
Format
acs_start_nw
A data frame with 500 rows and 4 columns:
- county
fct, county
- gq
fct, group quarter kind
- sex
fct, sex
- marst
fct, marital status
Details
Original data source: Steven Ruggles, Sarah Flood, Matthew Sobek, Daniel Backman, Annie Chen, Grace Cooper, Stephanie Richards, Renae Rogers, and Megan Schouweiler. IPUMS USA: Version 15.0 [dataset]. Minneapolis, MN: IPUMS, 2024. https://doi.org/10.18128/D010.V15.0
Source
Inject noise into a categorical random variable by mixing a sample of uniform records into the predictions.
Description
Inject noise into a categorical random variable by mixing a sample of uniform records into the predictions.
Usage
add_noise_cat_unif(
  model,
  new_data,
  conf_model_data,
  outcome_var,
  col_schema,
  pred,
  unif_prop,
  resample_props = NULL,
  observed_levels = FALSE
)
Arguments
model |
A |
new_data |
A data frame used to generate predictions |
conf_model_data |
A data frame for estimating the predictive model |
outcome_var |
A string name representing the outcome variable |
col_schema |
A list of column schema specifications for the new variable |
pred |
A vector of values predicted by the model |
unif_prop |
A proportion of records to resample with uniform noise |
resample_props |
An optional named vector of probabilities for resampling,
defaults to uniform over all levels supplied in |
observed_levels |
An optional Boolean to only resample from observed levels in the confidential data. |
Value
A numeric vector with noise added to each prediction
Examples
conf_model_data <- mtcars |>
  dplyr::mutate(gear = factor(.data[["gear"]]))
col_schema <- list(
  "dtype" = "fct",
  "levels" = c("3", "4", "5"),
  "na_prop" = 0
)
add_noise_cat_unif(
  model = conf_model_data,
  new_data = NULL,
  conf_model_data = NULL,
  outcome_var = "gear",
  col_schema = col_schema,
  pred = factor(c(rep("3", 10), rep("4", 10), rep("5", 10))),
  unif_prop = 0.5
)
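The mixing idea described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: `mix_uniform_sketch()` is a hypothetical helper, and the choice to resample exactly `round(unif_prop * n)` records (rather than flipping a coin per record) is an assumption.

```r
# Hypothetical helper (not part of tidysynthesis): replace a share of the
# predictions with draws that are uniform over the supplied levels.
mix_uniform_sketch <- function(pred, levels, unif_prop) {
  stopifnot(is.factor(pred), unif_prop >= 0, unif_prop <= 1)
  n <- length(pred)
  n_resample <- round(unif_prop * n)        # assumed rounding rule
  idx <- sample.int(n, size = n_resample)   # records that receive noise
  out <- as.character(pred)
  out[idx] <- sample(levels, size = n_resample, replace = TRUE)
  factor(out, levels = levels)
}

set.seed(1)
pred <- factor(c(rep("3", 10), rep("4", 10), rep("5", 10)))
noisy <- mix_uniform_sketch(pred, levels = c("3", "4", "5"), unif_prop = 0.5)
table(pred, noisy)
```

A resampled record can land on its original level, so fewer than half of the records change in practice.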
Add discrete normal noise with mean 0 to predicted values with constant variance
Description
Add discrete normal noise with mean 0 to predicted values with constant variance
Usage
add_noise_disc_gaussian(
  model,
  new_data,
  conf_model_data,
  outcome_var,
  col_schema,
  pred,
  variance = NULL,
  rho = NULL,
  sensitivity = NULL,
  increment = 1
)
Arguments
model |
A |
new_data |
A data frame used to generate predictions |
conf_model_data |
A data frame for estimating the predictive model |
outcome_var |
A string name representing the outcome variable |
col_schema |
A list of column schema specifications for the new variable |
pred |
A vector of values predicted by the model |
variance |
float, sampling variance for additive noise |
rho |
float, alternative privacy loss budget prescribed by the Gaussian mechanism under rho-zero-concentrated differential privacy. |
sensitivity |
float, alternative sample sensitivity prescribed by the Gaussian mechanism under rho-zero-concentrated differential privacy. |
increment |
Numeric indicating space between discrete noise samples, defaults to 1. Note that this does not impact the noise sampling variance, as the increment rescales noise distributions specified by sampling variance. |
Value
A numeric vector with noise added to each prediction
Examples
add_noise_disc_gaussian(
  model = NULL,
  new_data = NULL,
  conf_model_data = NULL,
  outcome_var = NULL,
  col_schema = NULL,
  pred = 1:100,
  variance = 3
)
Add discrete Laplace noise with mean 0 to predicted values with constant variance
Description
Add discrete Laplace noise with mean 0 to predicted values with constant variance
Usage
add_noise_disc_laplace(
  model,
  new_data,
  conf_model_data,
  outcome_var,
  col_schema,
  pred,
  variance = NULL,
  epsilon = NULL,
  sensitivity = NULL,
  increment = 1
)
Arguments
model |
A |
new_data |
A data frame used to generate predictions |
conf_model_data |
A data frame for estimating the predictive model |
outcome_var |
A string name representing the outcome variable |
col_schema |
A list of column schema specifications for the new variable |
pred |
A vector of values predicted by the model |
variance |
float, sampling variance for additive noise |
epsilon |
float, alternative privacy loss budget prescribed by the Laplace mechanism under epsilon differential privacy. |
sensitivity |
float, alternative sample sensitivity prescribed by the Laplace mechanism under epsilon differential privacy. |
increment |
Numeric indicating space between discrete noise samples, defaults to 1. Note that this does not impact the noise sampling variance, as the increment rescales noise distributions specified by sampling variance. |
Value
A numeric vector with noise added to each prediction
Examples
add_noise_disc_laplace(
  model = NULL,
  new_data = NULL,
  conf_model_data = NULL,
  outcome_var = NULL,
  col_schema = NULL,
  pred = 1:100,
  variance = 3
)
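For intuition about integer-valued Laplace noise, the following base R sketch draws discrete Laplace noise as the difference of two geometric variables. It is an illustration only: `r_disc_laplace_sketch()` and `disc_laplace_variance_sketch()` are hypothetical helpers, and the parameterization through `epsilon` and `sensitivity` with `p = exp(-epsilon / sensitivity)` is an assumption about how a privacy budget could map to the noise distribution, not a description of the package's sampler.

```r
# Hypothetical helpers (not part of tidysynthesis).
r_disc_laplace_sketch <- function(n, epsilon, sensitivity = 1) {
  stopifnot(epsilon > 0, sensitivity > 0)
  p <- exp(-epsilon / sensitivity)
  # rgeom() counts failures before the first success, so each draw has
  # P(G = g) = (1 - p) * p^g. The difference of two independent draws has
  # P(X = k) proportional to p^|k|, a discrete Laplace distribution.
  rgeom(n, prob = 1 - p) - rgeom(n, prob = 1 - p)
}

# Variance of the difference of two such geometric draws.
disc_laplace_variance_sketch <- function(epsilon, sensitivity = 1) {
  p <- exp(-epsilon / sensitivity)
  2 * p / (1 - p)^2
}

set.seed(1)
noise <- r_disc_laplace_sketch(n = 100, epsilon = 1)
noisy_pred <- 1:100 + noise
```

Because the noise is integer-valued, integer predictions stay integers after noise is added.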
Add normal noise with mean 0 to predicted values with constant variance
Description
Add normal noise with mean 0 to predicted values with constant variance
Usage
add_noise_gaussian(
  model,
  new_data,
  conf_model_data,
  outcome_var,
  col_schema,
  pred,
  variance = NULL,
  rho = NULL,
  sensitivity = NULL
)
Arguments
model |
A |
new_data |
A data frame used to generate predictions |
conf_model_data |
A data frame for estimating the predictive model |
outcome_var |
A string name representing the outcome variable |
col_schema |
A list of column schema specifications for the new variable |
pred |
A vector of values predicted by the model |
variance |
Sampling variance for additive noise |
rho |
Alternative privacy loss budget prescribed by the Gaussian mechanism under rho-zero-concentrated differential privacy. |
sensitivity |
Alternative sample sensitivity prescribed by the Gaussian mechanism under rho-zero-concentrated differential privacy. |
Value
A numeric vector with noise added to each prediction
Examples
add_noise_gaussian(
  model = NULL,
  new_data = NULL,
  conf_model_data = NULL,
  outcome_var = NULL,
  col_schema = NULL,
  pred = 1:100,
  variance = 3
)
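The `rho` and `sensitivity` arguments offer an alternative to passing `variance` directly. Under the standard Gaussian mechanism for rho-zero-concentrated differential privacy, the noise variance is `sensitivity^2 / (2 * rho)`. The base R sketch below shows that calibration; `gaussian_variance_sketch()` is a hypothetical helper, and whether the package applies exactly this formula internally is an assumption.

```r
# Hypothetical helper (not part of tidysynthesis): variance implied by the
# Gaussian mechanism under rho-zCDP.
gaussian_variance_sketch <- function(rho, sensitivity) {
  stopifnot(rho > 0, sensitivity > 0)
  sensitivity^2 / (2 * rho)
}

variance <- gaussian_variance_sketch(rho = 0.5, sensitivity = 1)

set.seed(1)
pred <- 1:100
# Mean-zero normal noise with constant variance added to every prediction.
noisy_pred <- pred + rnorm(length(pred), mean = 0, sd = sqrt(variance))
```

A smaller `rho` (a tighter privacy budget) or a larger `sensitivity` both increase the variance.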
Add normal noise to predicted values with variances calculated for ntiles using Gaussian kernel density estimators
Description
Add normal noise to predicted values with variances calculated for ntiles using Gaussian kernel density estimators
Usage
add_noise_kde(
  model,
  new_data,
  conf_model_data,
  outcome_var,
  col_schema,
  pred,
  exclusions = NULL,
  n_ntiles = NULL,
  obs_per_ntile = NULL,
  ties_method = "collapse",
  sd_scale = 1
)
Arguments
model |
A |
new_data |
A data frame used to generate predictions |
conf_model_data |
A data frame for estimating the predictive model |
outcome_var |
A string name representing the outcome variable |
col_schema |
A list of column schema specifications for the new variable |
pred |
A vector of values predicted by the model |
exclusions |
Numeric values that should not receive extra noise |
n_ntiles |
The number of ntiles |
obs_per_ntile |
A numeric for the minimum number of observations to be
in an ntile. Cannot be used in conjunction with the |
ties_method |
The ntiles approach to adding noise requires a one-to-one mapping from model-generated values to ntiles in the original data. The methods "collapse", "random", and "exclusions" deal with situations where the ntiles lack unique bounds. "collapse" collapses ntile breaks to preserve the one-to-one relationship; "random" adds a small random perturbation to the derived boundaries; finally, "exclusions" treats ntile tie values as derived exclusions. |
sd_scale |
float, a positive number to scale the estimated KDE variance. Defaults to 1.0 |
Value
A numeric vector with noise added to each prediction
Examples
add_noise_kde(
  model = NULL,
  new_data = tibble::tibble(x = 1:100),
  conf_model_data = tibble::tibble(x = 1:100),
  outcome_var = "x",
  col_schema = NULL,
  pred = 1:100,
  n_ntiles = 4
)
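The ntile approach can be pictured with a rough base R sketch. Everything here is an assumption made for illustration: `kde_noise_sketch()` is a hypothetical helper, the ntile boundaries come from `stats::quantile()`, and the noise standard deviation in each ntile is taken to be the default Gaussian KDE bandwidth (`stats::bw.nrd0()`) of the confidential values in that ntile. The package may compute boundaries and variances differently, and this sketch does not handle `exclusions` or tied boundaries.

```r
# Hypothetical helper (not part of tidysynthesis).
kde_noise_sketch <- function(conf, pred, n_ntiles, sd_scale = 1) {
  breaks <- stats::quantile(
    conf,
    probs = seq(0, 1, length.out = n_ntiles + 1),
    names = FALSE
  )
  # This sketch requires unique boundaries; see ties_method for how the
  # package handles ties.
  stopifnot(!anyDuplicated(breaks))
  conf_ntile <- cut(conf, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  # One bandwidth per ntile, estimated from the confidential values.
  bw <- vapply(split(conf, conf_ntile), stats::bw.nrd0, numeric(1))
  # Map each prediction to an ntile, clamping values outside the range.
  pred_ntile <- findInterval(pred, breaks, all.inside = TRUE)
  pred + stats::rnorm(length(pred), mean = 0, sd = bw[pred_ntile] * sd_scale)
}

set.seed(1)
noisy_pred <- kde_noise_sketch(conf = 1:100, pred = 1:100, n_ntiles = 4)
```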
Add Laplace noise with mean 0 to predicted values with constant variance
Description
Add Laplace noise with mean 0 to predicted values with constant variance
Usage
add_noise_laplace(
  model,
  new_data,
  conf_model_data,
  outcome_var,
  col_schema,
  pred,
  variance = NULL,
  epsilon = NULL,
  sensitivity = NULL
)
Arguments
model |
A |
new_data |
A data frame used to generate predictions |
conf_model_data |
A data frame for estimating the predictive model |
outcome_var |
A string name representing the outcome variable |
col_schema |
A list of column schema specifications for the new variable |
pred |
A vector of values predicted by the model |
variance |
Sampling variance for additive noise |
epsilon |
Alternative privacy loss budget prescribed by the Laplace mechanism under epsilon differential privacy. |
sensitivity |
Alternative sample sensitivity prescribed by the Laplace mechanism under epsilon differential privacy. |
Value
A numeric vector with noise added to each prediction
Examples
add_noise_laplace(
  model = NULL,
  new_data = NULL,
  conf_model_data = NULL,
  outcome_var = NULL,
  col_schema = NULL,
  pred = 1:100,
  variance = 3
)
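As with the Gaussian case, `epsilon` and `sensitivity` are an alternative to `variance`. The standard Laplace mechanism uses scale `b = sensitivity / epsilon`, and a Laplace distribution with scale `b` has variance `2 * b^2`. The base R sketch below shows both conversions; the helpers are hypothetical, and whether the package uses exactly these conversions internally is an assumption.

```r
# Hypothetical helpers (not part of tidysynthesis).
laplace_scale_from_privacy <- function(epsilon, sensitivity) {
  stopifnot(epsilon > 0, sensitivity > 0)
  sensitivity / epsilon
}
laplace_scale_from_variance <- function(variance) sqrt(variance / 2)

# The difference of two independent exponential draws with rate 1 / b
# follows a Laplace distribution with mean 0 and scale b.
r_laplace_sketch <- function(n, b) {
  rexp(n, rate = 1 / b) - rexp(n, rate = 1 / b)
}

b <- laplace_scale_from_privacy(epsilon = 1, sensitivity = 1)
implied_variance <- 2 * b^2

set.seed(1)
noisy_pred <- 1:100 + r_laplace_sketch(n = 100, b = laplace_scale_from_variance(3))
```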
Add to visit sequence for factor variables
Description
Add to visit sequence for factor variables
Usage
add_sequence_factor(roadmap, ..., method = c("entropy", "gini"))
Arguments
roadmap |
A |
... |
|
method |
A quoted name for the method used to sort the visit_sequence. Current methods include "entropy" and "gini". |
Value
An updated visit_sequence
Examples
roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
) |>
  add_sequence_factor(dplyr::where(is.factor), method = "gini")
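The two sorting criteria can be computed by hand for intuition. The base R sketch below uses hypothetical helpers and assumes Shannon entropy (base 2) and Gini impurity as the definitions; the package's exact definitions and the direction in which it sorts the visit sequence are not stated here.

```r
# Hypothetical helpers (not part of tidysynthesis).
entropy_sketch <- function(x) {
  p <- prop.table(table(x))
  p <- p[p > 0]               # drop empty levels to avoid log2(0)
  -sum(p * log2(p))
}
gini_sketch <- function(x) {
  p <- prop.table(table(x))
  1 - sum(p^2)
}

toy <- data.frame(
  a = factor(c("x", "x", "y", "y")),  # balanced levels
  b = factor(c("x", "x", "x", "y"))   # unbalanced levels
)
vapply(toy, entropy_sketch, numeric(1))
vapply(toy, gini_sketch, numeric(1))
```

Both measures are largest for the balanced factor `a`, so either one gives the same ordering of these two toy variables.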
Add to visit sequence using a manual method
Description
Add to visit sequence using a manual method
Usage
add_sequence_manual(roadmap, ...)
Arguments
roadmap |
A |
... |
|
Value
An updated roadmap object.
Examples
roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
) |>
  add_sequence_manual(
    c("inctot", "hcovany", "empstat", "classwkr", "age", "famsize",
      "transit_time")
  )
Add to visit sequence for numeric variables
Description
Add to visit sequence for numeric variables
Usage
add_sequence_numeric(
  roadmap,
  ...,
  method = c("correlation", "proportion", "weighted total", "absolute weighted total",
    "weighted absolute total"),
  cor_var = NULL,
  na.rm = FALSE,
  cor_use = "everything"
)
Arguments
roadmap |
A |
... |
|
method |
A quoted name for the method used to sort the visit_sequence |
cor_var |
A numeric variable for the correlation method |
na.rm |
Boolean that if TRUE, removes |
cor_use |
A string correlation data method passed to |
Value
An updated visit_sequence
Examples
roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
) |>
  add_sequence_numeric(
    dplyr::where(is.numeric),
    method = "correlation",
    cor_var = "age",
    na.rm = TRUE
  )
Collapse data frames with _NA variables to coerce related variables to include NA
Description
Collapse data frames with _NA variables to coerce related variables to include NA
Usage
collapse_na(data)
Arguments
data |
A data frame with columns ending in _NA |
Value
A data frame with no _NA columns and NA values
Examples
example_na_expanded <- expand_na(data = example_na)
collapse_na(data = example_na_expanded)
Create a constraints object
Description
Create a constraints object
Usage
constraints(
  schema,
  constraints_df_num = NULL,
  constraints_df_cat = NULL,
  max_z_num = 0,
  max_z_cat = 0
)
Arguments
schema |
A |
constraints_df_num |
A specially formatted data frame with constraints to be imposed during the synthesis process. See examples for formatting. |
constraints_df_cat |
A specially formatted data frame with constraints to be imposed during the synthesis process. See examples for formatting. |
max_z_num |
Numeric vector(s) for the number of times a value should be resampled before hardbounding if it violates a constraint. |
max_z_cat |
Numeric vector(s) for the number of times a value should be resampled before hardbounding if it violates a constraint. |
Value
A constraints object.
Examples
constraints(
  schema = schema(
    conf_data = mtcars |> dplyr::mutate(vs = factor(vs)),
    start_data = dplyr::select(mtcars, cyl)
  ),
  constraints_df_num = tibble::tribble(
    ~var, ~min, ~max, ~conditions,
    # ensure all mpg values are at least 0
    "mpg", 0, Inf, "TRUE",
    # ensure when cyl == 6, mpg is at most 15
    "mpg", -Inf, 15, "cyl == 6",
    # ensure disp is always between 0 and 150
    "disp", 0, 150, "TRUE"
  ),
  constraints_df_cat = tibble::tribble(
    ~var, ~allowed, ~forbidden, ~conditions,
    # ensure vs != 1 when gear >= 5
    "vs", NA, 1, "gear >= 5",
    # ensure vs == 0 when gear == 4
    "vs", 0, NA, "gear == 4"
  )
)
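The role of `max_z_num` can be illustrated with a small base R sketch: values that violate a bound are redrawn up to `max_z` times, and anything still out of bounds is then clamped to the bound. `resample_then_bound_sketch()` and `draw` are hypothetical names, and the loop is an assumption about the general idea, not the package's implementation.

```r
# Hypothetical helper (not part of tidysynthesis).
# `draw` is any function that returns n new candidate values.
resample_then_bound_sketch <- function(draw, n, min, max, max_z) {
  x <- draw(n)
  for (z in seq_len(max_z)) {
    bad <- x < min | x > max
    if (!any(bad)) break
    x[bad] <- draw(sum(bad))   # redraw only the violating values
  }
  # Hard bound whatever still violates the constraint.
  pmin(pmax(x, min), max)
}

set.seed(1)
mpg_draw <- function(n) rnorm(n, mean = 5, sd = 10)
synth <- resample_then_bound_sketch(mpg_draw, n = 50, min = 0, max = Inf, max_z = 3)
range(synth)
```

With `max_z = 0`, the default shown in the usage above, the loop never runs and violating values are bounded immediately.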
Add, update, or reset a constraints object within an existing roadmap.
Description
Add, update, or reset a constraints object within an existing roadmap.
Usage
add_constraints(roadmap, constraints)
update_constraints(roadmap, ...)
reset_constraints(roadmap)
Arguments
roadmap |
A |
constraints |
A |
... |
Optional named parameters passed to |
Value
A new roadmap object.
A roadmap object with added constraints.
A roadmap object with updated constraints.
A roadmap object with reset constraints.
Examples
rm <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)
constraints_df_num <-
  tibble::tribble(~var, ~min, ~max, ~conditions,
                  "transit_time", 0, 300, "TRUE")
constraints <- constraints(
  schema = rm[["schema"]],
  constraints_df_num = constraints_df_num,
  max_z_num = 0
)
rm |>
  add_constraints(constraints)
rm <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)
constraints_df_num <-
  tibble::tribble(~var, ~min, ~max, ~conditions,
                  "transit_time", 0, 300, "TRUE")
constraints <- constraints(
  schema = rm[["schema"]],
  constraints_df_num = constraints_df_num,
  max_z_num = 0
)
rm |>
  update_constraints(constraints)
rm <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)
constraints_df_num <-
  tibble::tribble(~var, ~min, ~max, ~conditions,
                  "transit_time", 0, 300, "TRUE")
constraints <- constraints(
  schema = rm[["schema"]],
  constraints_df_num = constraints_df_num,
  max_z_num = 0
)
rm <- rm |>
  add_constraints(constraints)
reset_constraints(rm)
Construct a list of extractors for parsnip models
Description
Construct a list of extractors for parsnip models
Usage
construct_extractors(
  roadmap,
  default_extractor = NULL,
  custom_extractors = NULL
)
Arguments
roadmap |
A roadmap object |
default_extractor |
An extractor from library(parsnip) |
custom_extractors |
A formatted list of extractors |
Value
A named list of extractors
Examples
# construct_extractors() can create a sequence of extractors using a fully-default
# approach, a hybrid approach, or a fully-customized approach. All approaches
# require a roadmap and extractors.
rm <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)
# Fully-default approach
construct_extractors(
  roadmap = rm,
  default_extractor = parsnip::extract_fit_engine
)
# Hybrid approach
construct_extractors(
  roadmap = rm,
  default_extractor = parsnip::extract_fit_engine,
  custom_extractors = list(
    list(vars = "hcovany", extractor = parsnip::extract_parameter_dials)
  )
)
# Fully-customized approach
construct_extractors(
  roadmap = rm,
  custom_extractors = list(
    list(
      vars = c("hcovany", "empstat", "classwkr"),
      extractor = parsnip::extract_fit_engine
    ),
    list(
      vars = c("age", "famsize", "transit_time", "inctot"),
      extractor = parsnip::extract_parameter_dials
    )
  )
)
Construct a list of models for synthesis
Description
Construct a list of models for synthesis
Usage
construct_models(
  roadmap,
  default_regression_model = NULL,
  default_classification_model = NULL,
  custom_models = NULL
)
Arguments
roadmap |
A roadmap object |
default_regression_model |
A |
default_classification_model |
A |
custom_models |
A formatted list with |
Value
A named list of models
Examples
# construct_models() can create a sequence of models using a fully-default
# approach, a hybrid approach, or a fully-customized approach. All approaches
# require a roadmap and model objects.
rm <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)
rpart_mod_reg <- parsnip::decision_tree() |>
  parsnip::set_engine(engine = "rpart") |>
  parsnip::set_mode(mode = "regression")
rpart_mod_class <- parsnip::decision_tree() |>
  parsnip::set_engine(engine = "rpart") |>
  parsnip::set_mode(mode = "classification")
lm_mod <- parsnip::linear_reg() |>
  parsnip::set_engine("lm") |>
  parsnip::set_mode(mode = "regression")
# Fully-default approach
construct_models(
  roadmap = rm,
  default_regression_model = lm_mod,
  default_classification_model = rpart_mod_class
)
# Hybrid approach
construct_models(
  roadmap = rm,
  default_regression_model = lm_mod,
  default_classification_model = rpart_mod_class,
  custom_models = list(
    list(vars = "age", model = lm_mod)
  )
)
# Fully-customized approach
construct_models(
  roadmap = rm,
  custom_models = list(
    list(vars = c("hcovany", "empstat", "classwkr"), model = rpart_mod_class),
    list(vars = c("age", "famsize", "transit_time", "inctot"), model = rpart_mod_reg)
  )
)
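The three approaches above follow one pattern that recurs across the construct_*() functions: every variable first receives the default for its mode, and entries in the custom list then override the default for the variables they name. The base R sketch below illustrates that resolution order with strings standing in for model objects. `resolve_sketch()` and the `value` field are hypothetical, and the override order is an assumption about the pattern, not the package's implementation.

```r
# Hypothetical helper (not part of tidysynthesis).
# `modes` is a named character vector: variable name -> model mode.
resolve_sketch <- function(modes,
                           default_regression = NULL,
                           default_classification = NULL,
                           custom = NULL) {
  # Step 1: assign the default that matches each variable's mode.
  out <- lapply(modes, function(m) {
    if (m == "regression") default_regression else default_classification
  })
  # Step 2: custom entries override the defaults for the named variables.
  for (entry in custom) {
    out[entry$vars] <- rep(list(entry$value), length(entry$vars))
  }
  out
}

modes <- c(hcovany = "classification", age = "regression", inctot = "regression")
resolved <- resolve_sketch(
  modes,
  default_regression = "lm",
  default_classification = "rpart",
  custom = list(list(vars = "age", value = "ranger"))
)
str(resolved)
```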
Construct a list of noise objects for synthesis
Description
Construct a list of noise objects for synthesis
Usage
construct_noise(
  roadmap,
  default_regression_noise = NULL,
  default_classification_noise = NULL,
  custom_noise = NULL
)
Arguments
roadmap |
A roadmap object |
default_regression_noise |
A noise function for regression models |
default_classification_noise |
A noise function for classification models |
custom_noise |
A formatted list of noise functions |
Value
A named list of noise
Examples
rm <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)
noise_defaults <- construct_noise(
  roadmap = rm,
  default_regression_noise = noise(),
  default_classification_noise = noise()
)
# construct_noise() can create a sequence of noise objects using a
# fully-default approach, a hybrid approach, or a fully-customized approach.
# All approaches require a roadmap and noise objects.
rm <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)
noise_reg <- noise(
  add_noise = TRUE,
  mode = "regression",
  noise_fun = add_noise_gaussian
)
noise_class <- noise(
  add_noise = TRUE,
  mode = "classification",
  noise_fun = add_noise_cat_unif
)
# Fully-default approach
construct_noise(
  roadmap = rm,
  default_regression_noise = noise_reg,
  default_classification_noise = noise_class
)
# Hybrid approach
noise_reg2 <- noise(
  add_noise = TRUE,
  mode = "regression",
  noise_fun = add_noise_disc_gaussian
)
construct_noise(
  roadmap = rm,
  default_regression_noise = noise_reg,
  default_classification_noise = noise_class,
  custom_noise = list(
    list(vars = "age", noise = noise_reg2)
  )
)
# Fully-customized approach
construct_noise(
  roadmap = rm,
  custom_noise = list(
    list(vars = c("hcovany", "empstat", "classwkr"), noise = noise_class),
    list(vars = c("age", "famsize", "transit_time", "inctot"), noise = noise_reg)
  )
)
Construct a sequence of model recipes for sequential synthesis
Description
Construct a sequence of model recipes for sequential synthesis
Usage
construct_recipes(
  roadmap,
  default_regression_steps = NULL,
  default_classification_steps = NULL,
  custom_steps = NULL
)
Arguments
roadmap |
A roadmap object |
default_regression_steps |
A list containing one or more recipes::step_*() |
default_classification_steps |
A list containing one or more recipes::step_*() |
custom_steps |
A list of lists containing one or more recipes::step_*() |
Value
A list of formulas
Examples
rm <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)
construct_recipes(rm)
# construct_recipes() can create a sequence of recipes using a fully-default
# approach, a hybrid approach, or a fully-customized approach. All approaches
# require a roadmap and steps.
rm <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)
step1 <- function(x) {
  x |>
    recipes::step_center(recipes::all_predictors(), id = "center")
}
# Fully-default approach
construct_recipes(
  roadmap = rm,
  default_regression_steps = step1,
  default_classification_steps = step1
)
# Hybrid approach
step2 <- function(x) {
  x |>
    recipes::step_scale(recipes::all_predictors(), id = "scale")
}
construct_recipes(
  roadmap = rm,
  default_regression_steps = step1,
  default_classification_steps = step1,
  custom_steps = list(
    list(vars = "age", step = step2)
  )
)
# Fully-customized approach
construct_recipes(
  roadmap = rm,
  custom_steps = list(
    list(vars = c("hcovany", "empstat", "classwkr"), step = step1),
    list(vars = c("age", "famsize", "transit_time", "inctot"), step = step1)
  )
)
Construct a list of samplers for synthesis
Description
Construct a list of samplers for synthesis
Usage
construct_samplers(
  roadmap,
  default_regression_sampler = NULL,
  default_classification_sampler = NULL,
  custom_samplers = NULL
)
Arguments
roadmap |
A roadmap object |
default_regression_sampler |
A sampler function for regression models |
default_classification_sampler |
A sampler function for classification models |
custom_samplers |
A formatted list of sampler functions |
Value
A named list of samplers
Examples
# construct_samplers() can create a sequence of samplers using a fully-default
# approach, a hybrid approach, or a fully-customized approach. All approaches
# require a roadmap and samplers.
rm <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)
# Fully-default approach
construct_samplers(
  roadmap = rm,
  default_regression_sampler = sample_lm,
  default_classification_sampler = sample_rpart
)
# Hybrid approach
construct_samplers(
  roadmap = rm,
  default_regression_sampler = sample_lm,
  default_classification_sampler = sample_rpart,
  custom_samplers = list(
    list(vars = "hcovany", sampler = sample_rpart)
  )
)
# Fully-customized approach
construct_samplers(
  roadmap = rm,
  custom_samplers = list(
    list(vars = c("hcovany", "empstat", "classwkr"), sampler = sample_rpart),
    list(vars = c("age", "famsize", "transit_time", "inctot"), sampler = sample_lm)
  )
)
Construct a list of tuning grids for hyperparameter tuning predictive models
Description
Construct a list of tuning grids for hyperparameter tuning predictive models
Usage
construct_tuners(
  roadmap,
  default_regression_tuner = NULL,
  default_classification_tuner = NULL,
  custom_tuners = NULL
)
Arguments
roadmap |
A roadmap object |
default_regression_tuner |
A tuner. |
default_classification_tuner |
A tuner. |
custom_tuners |
A formatted list of tuners. |
Value
A named list of tuners
Examples
# construct_tuners() can create a sequence of tuners using a fully-default
# approach, a hybrid approach, or a fully-customized approach. All approaches
# require a roadmap and tuners.
rm <- roadmap(
  conf_data = acs_conf_nw,
  start_data = acs_start_nw
)
tuner_reg <- list(
  v = 3,
  grid = 3,
  metrics = yardstick::metric_set(yardstick::rmse)
)
tuner_cat <- list(
  v = 3,
  grid = 3,
  metrics = yardstick::metric_set(yardstick::roc_auc)
)
# Fully-default approach
construct_tuners(
  roadmap = rm,
  default_regression_tuner = tuner_reg,
  default_classification_tuner = tuner_cat
)
# Hybrid approach
tuner_cat2 <- list(
  v = 3,
  grid = 3,
  metrics = yardstick::metric_set(yardstick::precision)
)
construct_tuners(
  roadmap = rm,
  default_regression_tuner = tuner_reg,
  default_classification_tuner = tuner_cat,
  custom_tuners = list(
    list(vars = "hcovany", tuner = tuner_cat2)
  )
)
# Fully-customized approach
# (categorical variables get the classification tuner, numeric variables the
# regression tuner)
construct_tuners(
  roadmap = rm,
  custom_tuners = list(
    list(vars = c("hcovany", "empstat", "classwkr"), tuner = tuner_cat),
    list(vars = c("age", "famsize", "transit_time", "inctot"), tuner = tuner_reg)
  )
)
Convert "NA" values to NA for categorical variables
Description
Convert "NA" values to NA for categorical variables
Usage
convert_level_to_na(data)
Arguments
data |
A data frame or tibble |
Value
A data frame or tibble with "NA" converted to NA
Examples
data <- data.frame(
  x1 = c(1, 2, NA),
  x2 = c("1", "2", "NA"),
  x3 = factor(c("1", "2", "NA")),
  x4 = factor(c("b", "NA", "a"), ordered = TRUE)
)
convert_level_to_na(data)
Convert NA values to "NA" for categorical variables
Description
Convert NA values to "NA" for categorical variables
Usage
convert_na_to_level(data)
Arguments
data |
A data frame or tibble |
Value
A data frame or tibble with NA converted to "NA"
Examples
data <- data.frame(
  x1 = c(1, 2, NA),
  x2 = c("1", "2", NA),
  x3 = factor(c("1", "2", NA)),
  x4 = factor(c("b", NA, "a"), levels = c("b", NA, "a"), ordered = TRUE)
)
convert_na_to_level(data)
Redefine NA value for a dataset.
Description
Redefine NA value for a dataset.
Usage
enforce_custom_na(data, col_schema)
Arguments
data |
A |
col_schema |
A col_schema from a |
Value
A data.frame
Examples
# create custom NA filter
example_na_custom <- example_na |>
  tidyr::replace_na(
    list("wages" = -999)
  )
example_na_expanded_custom <- enforce_custom_na(
  data = example_na_custom,
  col_schema = list(
    "wages" = list(
      dtype = "dbl",
      na_value = -999
    )
  )
)
Add missing values where values should be missing according to _NA variables
Description
Add missing values where values should be missing according to _NA variables
Usage
enforce_na(data)
Arguments
data |
A synthetic data frame with _NA columns |
Value
A synthetic data frame with _NA columns that converts values that are labelled missing in an _NA variable to missing in the corresponding variable
Examples
example_na_expanded <- expand_na(data = example_na)
enforce_na(data = example_na_expanded)
Enforce a roadmap's schema on its existing data
Description
Enforce a roadmap's schema on its existing data
Usage
enforce_schema(roadmap)
Arguments
roadmap |
A |
Value
A roadmap object with modified conf_data, start_data, and schema
information.
Examples
rm <- roadmap(conf_data = acs_conf, start_data = acs_start) |>
update_schema(na_numeric_to_ind = TRUE)
enforce_schema(rm)
A df with different types of missingness
Description
A df with different types of missingness
Usage
example_na
Format
A tibble with 200 observations and 6 variables:
- age
Age of respondent
- sex
Sex of respondent with missingness at random
- labor_force
Labor force status of respondent with structural missingness
- hours
Hours worked by respondent with missingness at random
- wages
Wages earned with structural missingness
Add new variables that indicate if a value is "missing" or "not missing" for original variables that contain NA
Description
Add new variables that indicate if a value is "missing" or "not missing" for original variables that contain NA
Usage
expand_na(
  data,
  types = c("chr", "dbl", "fct", "lgl", "int", "ord"),
  skip_vars = NULL
)
Arguments
data |
A data frame |
types |
A vector of variables types to expand |
skip_vars |
A character vector of variables that shouldn't be expanded |
Value
An augmented data frame with the original variables and new variables that contain the missingness patterns of variables with NA
Examples
expand_na(data = example_na, types = c("dbl", "int"))
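The shape of the output can be pictured with a small base R sketch. `expand_na_sketch()` is a hypothetical helper, and the details are assumptions made for illustration: the indicator is stored as a factor with levels "missing" and "not missing", it is named with an `_NA` suffix, and it is appended after the original columns. The package's column order and types may differ.

```r
# Hypothetical helper (not part of tidysynthesis): add a <name>_NA indicator
# for every column that contains at least one NA.
expand_na_sketch <- function(data) {
  for (nm in names(data)) {
    if (anyNA(data[[nm]])) {
      data[[paste0(nm, "_NA")]] <- factor(
        ifelse(is.na(data[[nm]]), "missing", "not missing"),
        levels = c("missing", "not missing")
      )
    }
  }
  data
}

toy <- data.frame(age = c(30, 41, 57), wages = c(1000, NA, 2500))
expanded <- expand_na_sketch(toy)
expanded
```

Only `wages` contains NA, so only `wages_NA` is added; `age` is left without an indicator.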
An S3 method for inverting a step
Description
An S3 method for inverting a step
Usage
invert(object, predictions, ...)
Arguments
object |
A recipe after fitting a model |
predictions |
A data frame with .pred |
... |
Other arguments |
Value
A tibble with inverted model-generated values
Examples
data <- tibble::tibble(
  y = rlnorm(n = 1000, meanlog = 0, sdlog = 1),
  x = rnorm(n = 1000)
)
adj <- recipes::recipe(y ~ x, data = data) |>
  recipes::step_BoxCox(recipes::all_outcomes()) |>
  recipes::prep()
invert(
  object = adj$steps[[1]],
  predictions = tibble::tibble(.pred = adj[["template"]][["y"]])
)
Invert a Box-Cox transformation
Description
Invert a Box-Cox transformation
Usage
## S3 method for class 'step_BoxCox'
invert(object, predictions, ...)
Arguments
object |
A recipe after fitting a model |
predictions |
A data frame with .pred |
... |
Other arguments |
Value
A tibble with the Box-Cox transformation inverted for .pred
Examples
data <- tibble::tibble(
  y = rlnorm(n = 1000, meanlog = 0, sdlog = 1),
  x = rnorm(n = 1000)
)
adj <- recipes::recipe(y ~ x, data = data) |>
  recipes::step_BoxCox(recipes::all_outcomes()) |>
  recipes::prep()
invert(
  object = adj$steps[[1]],
  predictions = tibble::tibble(.pred = adj[["template"]][["y"]])
)
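For reference, the Box-Cox transformation and its inverse can be written directly in base R. This sketch uses hypothetical helper names and takes `lambda` as a known value; in the example above, lambda is estimated by the recipe step. It shows the mathematics being inverted, not the package's method.

```r
# Hypothetical helpers (not part of tidysynthesis).
# Forward transform, defined for y > 0.
box_cox_sketch <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
# Inverse transform: maps transformed values back to the original scale.
inv_box_cox_sketch <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

y <- c(0.5, 1, 2, 10)
z <- box_cox_sketch(y, lambda = 0.5)
all.equal(inv_box_cox_sketch(z, lambda = 0.5), y)
```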
Invert a Yeo-Johnson transformation
Description
Invert a Yeo-Johnson transformation
Usage
## S3 method for class 'step_YeoJohnson'
invert(object, predictions, ...)
Arguments
object |
A recipe after fitting a model |
predictions |
A data frame with .pred |
... |
Other arguments |
Value
A tibble with the Yeo-Johnson transformation inverted for .pred
Examples
data <- tibble::tibble(
y = rlnorm(n = 1000, meanlog = 0, sdlog = 1),
x = rnorm(n = 1000)
)
adj <- recipes::recipe(y ~ x, data = data) |>
recipes::step_YeoJohnson(recipes::all_outcomes()) |>
recipes::prep()
invert(
object = adj$steps[[1]],
predictions = tibble::tibble(.pred = adj[["template"]][["y"]])
)
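The Yeo-Johnson transformation is defined piecewise, so its inverse is too. The base R sketch below writes out both directions; the function names and test values are assumptions for illustration, and this is not the package's implementation.

```r
# Forward Yeo-Johnson transformation (defined for any real y)
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  out[pos] <- if (abs(lambda) < 1e-8) {
    log1p(y[pos])
  } else {
    ((y[pos] + 1)^lambda - 1) / lambda
  }
  out[!pos] <- if (abs(lambda - 2) < 1e-8) {
    -log1p(-y[!pos])
  } else {
    -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

# Inverse Yeo-Johnson transformation; the sign of x matches the sign of y
yeo_johnson_inverse <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  out[pos] <- if (abs(lambda) < 1e-8) {
    expm1(x[pos])
  } else {
    (lambda * x[pos] + 1)^(1 / lambda) - 1
  }
  out[!pos] <- if (abs(lambda - 2) < 1e-8) {
    1 - exp(-x[!pos])
  } else {
    1 - (1 - (2 - lambda) * x[!pos])^(1 / (2 - lambda))
  }
  out
}

y <- c(-2, -0.5, 0, 0.5, 3)
yj_roundtrip <- yeo_johnson_inverse(yeo_johnson(y, lambda = 0.5), lambda = 0.5)
all.equal(y, yj_roundtrip)
```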
Invert a log transformation
Description
Invert a log transformation
Usage
## S3 method for class 'step_log'
invert(object, predictions, ...)
Arguments
object |
A recipe after fitting a model |
predictions |
A data frame with .pred |
... |
Other arguments |
Value
A tibble with the log transformation inverted for .pred
Examples
data <- tibble::tibble(
y = rlnorm(n = 1000, meanlog = 0, sdlog = 1),
x = rnorm(n = 1000)
)
adj <- recipes::recipe(y ~ x, data = data) |>
recipes::step_log(recipes::all_outcomes()) |>
recipes::prep()
invert(
object = adj$steps[[1]],
predictions = tibble::tibble(.pred = adj[["template"]][["y"]])
)
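The pattern shared by the invert() methods, an S3 generic that dispatches on the class of a step, can be sketched with a toy example. Everything here is hypothetical: invert_sketch(), the "toy_log_step" class, and its base field are stand-ins and are not objects from tidysynthesis or recipes.

```r
# Hypothetical generic, used only for this sketch
invert_sketch <- function(object, predictions, ...) {
  UseMethod("invert_sketch")
}

# Method for a toy log step that records the logarithm base it used
invert_sketch.toy_log_step <- function(object, predictions, ...) {
  predictions$.pred <- object$base^predictions$.pred
  predictions
}

# Fallback: return predictions unchanged when no inverse is defined
invert_sketch.default <- function(object, predictions, ...) {
  predictions
}

toy_step <- structure(list(base = exp(1)), class = "toy_log_step")
log_preds <- data.frame(.pred = log(c(1, 10, 100)))
inverted <- invert_sketch(object = toy_step, predictions = log_preds)
```

Dispatching on the step class keeps each inverse next to the transformation it undoes, so supporting a new transformation only requires a new method.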
Kolmogorov-Smirnov distance
Description
Kolmogorov-Smirnov distance
Usage
ks_distance(data, ...)
## S3 method for class 'data.frame'
ks_distance(data, truth, estimate, na_rm = TRUE, case_weights = NULL, ...)
ks_distance_vec(truth, estimate, na_rm = TRUE, case_weights = NULL, ...)
Arguments
data |
A data frame containing the columns specified by truth and estimate. |
... |
Not currently used. |
truth |
The column identifier for the true results (that is numeric).
This should be an unquoted column name, although this argument is passed by
expression and supports quasiquotation (you can unquote column names). For
ks_distance_vec(), a numeric vector. |
estimate |
The column identifier for the predicted results (that is
also numeric). As with truth, this can be specified in different ways, but the
primary method is to use an unquoted variable name. For ks_distance_vec(), a numeric vector. |
na_rm |
A logical value indicating whether NA values should be removed before the computation proceeds. |
case_weights |
Currently a placeholder. Support will be added when case weights are added to tidysynthesis. |
Value
For ks_distance_vec(), a single numeric value (or NA).
Examples
ks1 <- data.frame(x = 1:100, y = 101:200)
ks_distance(data = ks1, truth = x, estimate = y)
ks_distance_vec(truth = ks1$x, estimate = ks1$y)
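As background, the two-sample Kolmogorov-Smirnov distance is the largest absolute gap between two empirical cumulative distribution functions. The base R sketch below computes it directly; ks_distance_sketch() is a hypothetical name, and the code is not the package's implementation.

```r
# Hypothetical helper: maximum absolute difference between two ECDFs,
# evaluated at every observed value from either sample.
ks_distance_sketch <- function(truth, estimate) {
  grid <- sort(unique(c(truth, estimate)))
  max(abs(stats::ecdf(truth)(grid) - stats::ecdf(estimate)(grid)))
}

ks1 <- data.frame(x = 1:100, y = 101:200)

# The samples do not overlap, so the distance reaches its maximum of 1
d_disjoint <- ks_distance_sketch(ks1$x, ks1$y)

# Identical samples have identical ECDFs, so the distance is 0
d_same <- ks_distance_sketch(ks1$x, ks1$x)
```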
Create a noise object
Description
Create a noise object
Usage
noise(add_noise = FALSE, mode = "regression", noise_func = NULL, ...)
Arguments
add_noise |
Boolean, TRUE if adding noise |
mode |
String, one of "regression" or "classification" |
noise_func |
A function that adds noise to |
... |
Optional named additional arguments to pass to |
Value
A noise object
Examples
# create default noise object
noise()
# create noise object for classification
noise(
add_noise = TRUE,
mode = "classification",
noise_func = add_noise_cat_unif
)
# create noise object for regression
noise(
add_noise = TRUE,
mode = "regression",
noise_func = add_noise_kde,
n_ntiles = 10
)
Create a presynth object
Description
Create a presynth object
Usage
presynth(roadmap, synth_spec)
Arguments
roadmap |
A roadmap object. |
synth_spec |
A synth_spec object. |
Value
A presynth object.
Examples
# create roadmap
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
rpart_mod_reg <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "regression")
rpart_mod_class <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "classification")
synth_spec1 <- synth_spec(
default_regression_model = rpart_mod_reg,
default_regression_sampler = sample_rpart,
default_classification_model = rpart_mod_class,
default_classification_sampler = sample_rpart
)
# create a presynth object
# use defaults for noise, constraints, and replicates
presynth(
roadmap = rm,
synth_spec = synth_spec1
)
Print the constraints object to the console with formatting
Description
Print the constraints object to the console with formatting
Usage
## S3 method for class 'constraints'
print(x, ...)
Arguments
x |
A constraints object. |
... |
further arguments passed to or from other methods (not currently used). |
Value
Invisibly returns the input constraints object.
Examples
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
constraints_df_num <-
tibble::tribble(~var, ~min, ~max, ~conditions,
"transit_time", 0, 300, "TRUE")
constraints <- constraints(
schema = rm[["schema"]],
constraints_df_num = constraints_df_num,
max_z_num = 0
)
print(constraints)
Print the noise object to the console with formatting
Description
Print the noise object to the console with formatting
Usage
## S3 method for class 'noise'
print(x, ...)
Arguments
x |
A noise object. |
... |
further arguments passed to or from other methods (not currently used). |
Value
Invisibly returns the input noise object.
Examples
print(noise())
Print the postsynth object to the console with formatting
Description
Print the postsynth object to the console with formatting
Usage
## S3 method for class 'postsynth'
print(x, ...)
Arguments
x |
A postsynth object. |
... |
further arguments passed to or from other methods (not currently used). |
Value
Invisibly returns the input postsynth object.
Examples
# create roadmap
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
rpart_mod_reg <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "regression")
rpart_mod_class <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "classification")
synth_spec1 <- synth_spec(
default_regression_model = rpart_mod_reg,
default_regression_sampler = sample_rpart,
default_classification_model = rpart_mod_class,
default_classification_sampler = sample_rpart
)
# create a presynth object
# use defaults for noise, constraints, and replicates
presynth1 <- presynth(
roadmap = rm,
synth_spec = synth_spec1
)
# synthesize!
set.seed(1)
postsynth1 <- synthesize(presynth = presynth1)
print(postsynth1)
Print method for presynth objects
Description
Print method for presynth objects
Usage
## S3 method for class 'presynth'
print(x, ...)
Arguments
x |
A presynth object. |
... |
further arguments passed to or from other methods (not currently used). |
Value
A presynth object
Examples
# create roadmap
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
rpart_mod_reg <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "regression")
rpart_mod_class <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "classification")
synth_spec1 <- synth_spec(
default_regression_model = rpart_mod_reg,
default_regression_sampler = sample_rpart,
default_classification_model = rpart_mod_class,
default_classification_sampler = sample_rpart
)
# create a presynth object
# use defaults for noise, constraints, and replicates
presynth <- presynth(
roadmap = rm,
synth_spec = synth_spec1
)
print(presynth)
Print the replicates object to the console with formatting
Description
Print the replicates object to the console with formatting
Usage
## S3 method for class 'replicates'
print(x, ...)
Arguments
x |
A replicates object. |
... |
further arguments passed to or from other methods (not currently used). |
Value
Invisibly returns the input replicates object.
Examples
rep <- replicates(
start_data_replicates = 2,
model_sample_replicates = 2,
end_to_end_replicates = 2
)
print(rep)
Print the schema object to the console with formatting
Description
Print the schema object to the console with formatting
Usage
## S3 method for class 'schema'
print(x, ...)
Arguments
x |
A schema object. |
... |
further arguments passed to or from other methods (not currently used). |
Value
Invisibly returns the input schema object.
Examples
# default inferred schema
schema1 <- schema(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
print(schema1)
Print the start_method object to the console with formatting
Description
Print the start_method object to the console with formatting
Usage
## S3 method for class 'start_method'
print(x, ...)
Arguments
x |
A start_method object. |
... |
further arguments passed to or from other methods (not currently used). |
Value
A start_method object
Examples
print(start_method())
Print the synth_spec object to the console with formatting
Description
Print the synth_spec object to the console with formatting
Usage
## S3 method for class 'synth_spec'
print(x, ...)
Arguments
x |
A synth_spec object. |
... |
further arguments passed to or from other methods (not currently used). |
Value
A synth_spec object
Examples
synth_spec <- synth_spec()
print(synth_spec)
Print method for visit_sequence objects
Description
Print method for visit_sequence objects
Usage
## S3 method for class 'visit_sequence'
print(x, ...)
Arguments
x |
A visit_sequence object. |
... |
further arguments passed to or from other methods (not currently used). |
Value
Invisibly returns the input visit_sequence object.
Examples
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
print(rm[["visit_sequence"]])
Create a replicates object
Description
Create a replicates object
Usage
replicates(
start_data_replicates = 1,
model_sample_replicates = 1,
end_to_end_replicates = 1
)
Arguments
start_data_replicates |
The number of starting data replicates to use.
Note that if no |
model_sample_replicates |
The number of replicates for the conditional modeling process, including modeling and sampling new synthetic values. |
end_to_end_replicates |
The number of replicates for the entire synthesis process, including all previously specified steps. |
Value
A new replicates object.
Examples
replicates(
start_data_replicates = 2,
model_sample_replicates = 2,
end_to_end_replicates = 2
)
Add, update, or reset a replicates object within an existing roadmap.
Description
Add, update, or reset a replicates object within an existing roadmap.
Usage
add_replicates(roadmap, replicates)
update_replicates(roadmap, ...)
reset_replicates(roadmap)
Arguments
roadmap |
A roadmap object. |
replicates |
A replicates object. |
... |
Optional named parameters passed to |
Value
A new roadmap object.
A new roadmap object with the added replicates.
A new roadmap object with updated replicates.
A new roadmap object with reset replicates.
Examples
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
new_replicates <- replicates(end_to_end_replicates = 2)
rm |>
add_replicates(new_replicates)
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
rm |>
update_replicates(start_data_replicates = 3)
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
rm <- rm |>
add_replicates(replicates(start_data_replicates = 3))
reset_replicates(roadmap = rm)
Create a roadmap
Description
A roadmap is a container object that aggregates information required to
specify the order of operations for synthesis modeling and sampling steps.
Usage
roadmap(
conf_data,
start_data,
start_method = NULL,
schema = NULL,
visit_sequence = NULL,
replicates = NULL,
constraints = NULL
)
Arguments
conf_data |
A |
start_data |
A |
start_method |
An optional |
schema |
An optional |
visit_sequence |
An optional |
replicates |
An optional |
constraints |
An optional |
Details
Users initiate a roadmap object with conf_data and start_data. All other
objects will either be completed with defaults or specified interactively via
the provided API.
Value
A new roadmap object.
Examples
roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw,
start_method = start_method(
start_func = start_resample, n = 1000
)
)
Sample the conditional distribution created by a generalized linear model
Description
Currently, logistic and Poisson regression are supported using parsnip and
the standard glm engine. Note that Poisson regression requires the suggested
poissonreg package.
Usage
sample_glm(model, new_data, conf_data)
Arguments
model |
A "model_fit" object created by parsnip |
new_data |
A data frame with predictors |
conf_data |
A data frame with original confidential predictors |
Value
A numeric vector of predictions
Examples
acs_conf <- acs_conf |>
tidyr::drop_na()
logistic_mod <- parsnip::logistic_reg() |>
parsnip::set_engine("glm") |>
parsnip::set_mode(mode = "classification")
classification_rec <- recipes::recipe(hcovany ~ ., data = acs_conf)
model_class <- workflows::workflow() |>
workflows::add_model(spec = logistic_mod) |>
workflows::add_recipe(recipe = classification_rec) |>
parsnip::fit(data = acs_conf)
set.seed(1)
sample1 <- sample_glm(
model = model_class,
new_data = acs_conf[1:3, ],
conf_data = acs_conf
)
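Sampling from the conditional distribution of a logistic regression means drawing outcomes from the predicted probabilities instead of returning the most likely class. The base R sketch below shows the idea with stats::glm() on simulated data; the data and object names are assumptions, and the code is not the internals of sample_glm().

```r
set.seed(1)

# Simulated training data (assumed for this sketch)
train <- data.frame(x = rnorm(200))
train$y <- rbinom(n = 200, size = 1, prob = plogis(0.5 + train$x))

# Fit a logistic regression
fit <- glm(y ~ x, data = train, family = binomial())

# Predicted probabilities for three new records
new_data <- data.frame(x = c(-1, 0, 1))
p <- predict(fit, newdata = new_data, type = "response")

# One Bernoulli draw per record from its conditional distribution
draws <- rbinom(n = length(p), size = 1, prob = p)
```

Drawing from the fitted distribution keeps the outcome's variability in the synthetic values, whereas always returning the most likely class would not.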
Sample the conditional distribution created by a linear model
Description
Sample the conditional distribution created by a linear model
Usage
sample_lm(model, new_data, conf_data)
Arguments
model |
A "model_fit" object created by parsnip::linear_reg() |
new_data |
A data frame with predictors |
conf_data |
A data frame with original confidential predictors |
Value
A numeric vector of predictions
Examples
lm_mod <- parsnip::linear_reg() |>
parsnip::set_engine("lm") |>
parsnip::set_mode(mode = "regression")
regression_rec <- recipes::recipe(inctot ~ ., data = acs_conf)
model_reg <- workflows::workflow() |>
workflows::add_model(spec = lm_mod) |>
workflows::add_recipe(recipe = regression_rec) |>
parsnip::fit(data = acs_conf)
set.seed(1)
sample1 <- sample_lm(
model = model_reg,
new_data = acs_conf[1:3, ],
conf_data = acs_conf
)
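For a linear model, one simple way to sample the conditional distribution is to draw from a normal distribution centered on each prediction, with the residual standard deviation as its spread. The base R sketch below uses simulated data and assumed names; it illustrates the concept and is not necessarily how sample_lm() is implemented.

```r
set.seed(1)

# Simulated training data (assumed for this sketch)
train <- data.frame(x = rnorm(200))
train$y <- 2 + 3 * train$x + rnorm(200, sd = 0.5)

fit <- lm(y ~ x, data = train)

# Conditional means for three new records
new_data <- data.frame(x = c(-1, 0, 1))
mu <- predict(fit, newdata = new_data)

# One draw per record: predicted mean plus normally distributed residual noise
draws <- rnorm(n = length(mu), mean = mu, sd = sigma(fit))
```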
Sample the conditional distribution created by a ranger random forest model
Description
Sample the conditional distribution created by a ranger random forest model
Usage
sample_ranger(model, new_data, conf_data)
Arguments
model |
A "model_fit" object created by parsnip::rand_forest() with the "ranger" engine |
new_data |
A data frame with predictors |
conf_data |
A data frame with original confidential predictors |
Value
A numeric vector of predictions
Examples
rf_mod_regression <- parsnip::rand_forest(trees = 500, min_n = 1) |>
parsnip::set_engine(engine = "ranger") |>
parsnip::set_mode(mode = "regression") |>
parsnip::set_args(quantreg = TRUE)
regression_rec <- recipes::recipe(age ~ ., data = acs_conf)
model_reg <- workflows::workflow() |>
workflows::add_model(spec = rf_mod_regression) |>
workflows::add_recipe(recipe = regression_rec) |>
parsnip::fit(data = acs_conf)
set.seed(1)
sample1 <- sample_ranger(
model = model_reg,
new_data = acs_conf[1:3, ],
conf_data = acs_conf
)
Sample the conditional distribution created by a CART model
Description
Sample the conditional distribution created by a CART model
Usage
sample_rpart(model, new_data, conf_data, ignore_zeros = TRUE)
Arguments
model |
A "model_fit" object created by rpart |
new_data |
A data frame with predictors |
conf_data |
A data frame with original confidential predictors |
ignore_zeros |
Should a vector of all-zero observations return NA for
the l-diversity calculation? Defaults to TRUE. |
Value
A numeric vector of predictions
Examples
rpart_mod_reg <- parsnip::decision_tree() |>
parsnip::set_engine("rpart") |>
parsnip::set_mode(mode = "regression")
regression_rec <- recipes::recipe(inctot ~ ., data = acs_conf)
model_reg <- workflows::workflow() |>
workflows::add_model(spec = rpart_mod_reg) |>
workflows::add_recipe(recipe = regression_rec) |>
parsnip::fit(data = acs_conf)
set.seed(1)
sample1 <- sample_rpart(
model = model_reg,
new_data = acs_conf[1:3, ],
conf_data = acs_conf
)
rpart_mod_class <- parsnip::decision_tree() |>
parsnip::set_engine("rpart") |>
parsnip::set_mode(mode = "classification")
classification_rec <- recipes::recipe(hcovany ~ ., data = acs_conf)
model_class <- workflows::workflow() |>
workflows::add_model(spec = rpart_mod_class) |>
workflows::add_recipe(recipe = classification_rec) |>
parsnip::fit(data = acs_conf)
set.seed(1)
sample2 <- sample_rpart(
model = model_class,
new_data = acs_conf[1:10, ],
conf_data = acs_conf
)
Generate a schema object.
Description
Generate a schema object.
Usage
schema(
conf_data,
start_data,
col_schema = NULL,
enforce = TRUE,
coerce_to_factors = FALSE,
coerce_to_doubles = FALSE,
na_factor_to_level = TRUE,
na_numeric_to_ind = TRUE
)
Arguments
conf_data |
A data frame to be synthesized. |
start_data |
A data frame with starting variables. |
col_schema |
An optional named list of columns in the confidential data
with their properties, including data type and factor levels. If NULL or only
partially specified, |
enforce |
Boolean that if true, will preprocess both |
coerce_to_factors |
Boolean that if true, coerces categorical data types
( |
coerce_to_doubles |
Boolean that if true, coerces columns specified as |
na_factor_to_level |
Boolean that if true, applies |
na_numeric_to_ind |
Boolean that if true, applies |
Value
A schema object.
Examples
conf_data <- data.frame(
var1 = c("1", "1", "2"),
var2 = c(1L, 2L, 3L),
var3 = c(1.1, 2.2, 3.3)
)
start_data <- dplyr::select(conf_data, var1)
# default inferred schema
schema(
conf_data = conf_data,
start_data = start_data
)
# overwriting factor levels
schema(
conf_data = conf_data,
start_data = start_data,
col_schema = list(
"var1" = list(
"dtype" = "fct",
"levels" = c("1", "2", "3")
)
),
coerce_to_factors = TRUE
)
Add, update, or reset a schema object within an existing roadmap.
Description
Add, update, or reset a schema object within an existing roadmap.
Usage
add_schema(roadmap, schema)
update_schema(roadmap, ...)
reset_schema(roadmap)
Arguments
roadmap |
A roadmap object. |
schema |
A schema object. |
... |
Optional named parameters passed to |
Value
A new roadmap object.
A roadmap object with added schema.
A roadmap object with updated schema.
A roadmap object with reset schema.
Examples
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
acs_schema <- schema(
conf_data = acs_conf_nw,
start_data = acs_start_nw,
na_numeric_to_ind = TRUE
)
rm |>
add_schema(schema = acs_schema)
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
rm |>
update_schema(na_numeric_to_ind = TRUE)
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
rm <- rm |>
update_schema(na_numeric_to_ind = TRUE)
reset_schema(roadmap = rm)
Create a start_method object.
Description
A start_method gets executed prior to running a synthesis. This modifies
the start_data, typically randomly, to provide greater disclosure risk
protections.
Usage
start_method(start_func = NULL, ...)
Arguments
start_func |
A function that accepts and returns a |
... |
Optional keyword arguments passed to |
Value
A start_method object
Examples
# basic usage
start_method(start_func = start_resample)
# adjust the number of observations
start_method(
start_func = start_resample,
start_data = acs_start_nw,
n = 10
)
# adjust the number of observations and use all combinations as support
start_method(
start_func = start_resample,
start_data = acs_start_nw,
n = 10,
inv_noise_scale = 1,
support = "all"
)
Add, update, or remove a start method within an existing roadmap.
Description
Add, update, or remove a start method within an existing roadmap.
Usage
add_start_method(roadmap, start_method)
update_start_method(roadmap, ...)
remove_start_method(roadmap)
Arguments
roadmap |
A roadmap object. |
start_method |
A start_method object. |
... |
Optional named parameters passed to |
Value
A new roadmap object.
A new roadmap object with added start_method.
A new roadmap object with updated start_method.
A new roadmap object with removed start_method.
Examples
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
add_start_method(
roadmap = rm,
start_method = start_method()
)
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
update_start_method(
roadmap = rm,
start_method = start_method()
)
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw,
start_method = start_method()
)
remove_start_method(
roadmap = rm
)
Specify a resampling scheme for start_data
Description
Specify a resampling scheme for start_data
Usage
start_resample(
start_data,
n = NULL,
inv_noise_scale = NULL,
support = c("observed", "all")
)
Arguments
start_data |
A |
n |
An optional integer sample size. If unspecified, |
inv_noise_scale |
An optional parameter to set randomized noise to the proportions
of records with different |
support |
A string that specifies the method of resampling from the |
Value
A start_method object for resampling starting data
Examples
start_method(
start_func = start_resample, n = 1000
)
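In its simplest form, resampling start data amounts to drawing rows with replacement. The base R sketch below shows only that core idea. The helper resample_rows() is hypothetical, defaulting n to the number of input rows is an assumption, and the sketch does not model the inv_noise_scale or support arguments.

```r
# Hypothetical helper: draw n rows from start_data with replacement
resample_rows <- function(start_data, n = NULL) {
  # Assumed default for this sketch: keep the original number of rows
  if (is.null(n)) n <- nrow(start_data)
  idx <- sample.int(nrow(start_data), size = n, replace = TRUE)
  out <- start_data[idx, , drop = FALSE]
  rownames(out) <- NULL
  out
}

set.seed(1)
toy_start <- data.frame(
  county = c("A", "A", "B"),
  sex = c("F", "M", "F")
)
resampled <- resample_rows(toy_start, n = 10)
```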
Create a synth_spec object
Description
The synth_spec object holds specifications for modeling and sampling components
for sequential synthetic data generation. Each component has an associated
construct_* function called when creating a presynth object.
Usage
synth_spec(
default_regression_model = NULL,
default_classification_model = NULL,
custom_models = NULL,
default_regression_steps = NULL,
default_classification_steps = NULL,
custom_steps = NULL,
default_regression_sampler = NULL,
default_classification_sampler = NULL,
custom_samplers = NULL,
default_regression_noise = NULL,
default_classification_noise = NULL,
custom_noise = NULL,
default_regression_tuner = NULL,
default_classification_tuner = NULL,
custom_tuners = NULL,
default_extractor = NULL,
custom_extractors = NULL,
invert_transformations = TRUE,
enforce_na = TRUE
)
Arguments
default_regression_model |
A |
default_classification_model |
A |
custom_models |
A list of named lists each with two elements:
|
default_regression_steps |
A list of |
default_classification_steps |
A list of |
custom_steps |
A list of named lists each with two elements:
|
default_regression_sampler |
A sampling function for drawing new values from regression models. |
default_classification_sampler |
A sampling function for drawing new values from classification models. |
custom_samplers |
A list of named lists each with two elements:
|
default_regression_noise |
A noise function for adding noise to numeric values. |
default_classification_noise |
A noise function for adding noise to classification values. |
custom_noise |
A list of named lists each with two elements:
|
default_regression_tuner |
A |
default_classification_tuner |
A |
custom_tuners |
A list of named lists each with two elements:
|
default_extractor |
An optional method for extracting workflows or extracts from workflows. |
custom_extractors |
A list of named lists each with two elements:
|
invert_transformations |
A Boolean indicating whether outcome variable transformations applied through recipes should be inverted during synthesis. Recipe steps need ids that begin with "outcome". |
enforce_na |
A Boolean for if NA values should be added into the
synthetic data with |
Value
A synth_spec object
Examples
rpart_mod <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "regression")
lm_mod <- parsnip::linear_reg() |>
parsnip::set_engine("lm") |>
parsnip::set_mode(mode = "regression")
step1 <- function(x) {
x |>
recipes::step_center(recipes::all_predictors(), id = "center")
}
step2 <- function(x) {
x |>
recipes::step_scale(recipes::all_predictors(), id = "scale")
}
step3 <- function(x) { x |> step1() |> step2() }
synth_spec(
default_regression_model = rpart_mod,
custom_models = list(
list("vars" = c("var1", "var2"),
"model" = lm_mod)
),
default_regression_steps = step1,
custom_steps = list(
list("vars" = c("var2", "var3"),
"steps" = step2),
list("vars" = c("var4"),
"steps" = step3)
),
default_regression_sampler = sample_rpart,
custom_samplers = list(
list("vars" = c("var1", "var2"),
"sampler" = sample_lm)
)
)
Add, update, or remove extractors from a synth_spec object
Description
Add, update, or remove extractors from a synth_spec object
Usage
add_custom_extractors(synth_spec, ...)
update_custom_extractors(synth_spec, ...)
remove_custom_extractors(synth_spec)
Arguments
synth_spec |
A synth_spec object. |
... |
Optional named lists with two elements, |
Value
A new synth_spec object.
A new synth_spec object with added custom extractors.
A new synth_spec object with updated custom extractors.
A new synth_spec object with removed custom extractors.
Examples
synth_spec <- synth_spec()
add_custom_extractors(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "extractor" = parsnip::extract_fit_engine)
)
synth_spec <- synth_spec()
update_custom_extractors(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "extractor" = parsnip::extract_fit_engine)
)
synth_spec <- synth_spec()
synth_spec <- add_custom_extractors(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "extractor" = parsnip::extract_fit_engine)
)
remove_custom_extractors(synth_spec = synth_spec)
Inspections for synth_spec components
Description
Inspections for synth_spec components
Arguments
z |
Object |
Value
Boolean if matches class type
Add, update, or remove custom models from a synth_spec object
Description
Add, update, or remove custom models from a synth_spec object
Usage
add_custom_models(synth_spec, ...)
update_custom_models(synth_spec, ...)
remove_custom_models(synth_spec)
Arguments
synth_spec |
A synth_spec object. |
... |
Optional named lists with two elements, |
Value
A new synth_spec object.
A new synth_spec object with added custom models.
A new synth_spec object with updated custom models.
A new synth_spec object with removed custom models.
Examples
synth_spec <- synth_spec()
dt_reg_mod <- parsnip::decision_tree() |>
parsnip::set_engine("rpart") |>
parsnip::set_mode("regression")
add_custom_models(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "model" = dt_reg_mod)
)
synth_spec <- synth_spec()
dt_reg_mod <- parsnip::decision_tree() |>
parsnip::set_engine("rpart") |>
parsnip::set_mode("regression")
update_custom_models(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "model" = dt_reg_mod)
)
synth_spec <- synth_spec()
dt_reg_mod <- parsnip::decision_tree() |>
parsnip::set_engine("rpart") |>
parsnip::set_mode("regression")
synth_spec <- update_custom_models(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "model" = dt_reg_mod)
)
remove_custom_models(synth_spec = synth_spec)
Add, update, or remove noise from a synth_spec object
Description
Add, update, or remove noise from a synth_spec object
Usage
add_custom_noise(synth_spec, ...)
Arguments
synth_spec |
A synth_spec object. |
... |
Optional named lists with two elements, |
Value
A new synth_spec object.
A new synth_spec object with added custom noise.
Examples
synth_spec <- synth_spec()
noise1 <- noise(
add_noise = TRUE,
noise_func = add_noise_kde,
noise_params = list(
n_ntiles = 2
)
)
add_custom_noise(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "noise" = noise1)
)
Add, update, or remove recipe steps from a synth_spec object
Description
Add, update, or remove recipe steps from a synth_spec object
Usage
add_custom_steps(synth_spec, ...)
update_custom_steps(synth_spec, ...)
remove_custom_steps(synth_spec)
Arguments
synth_spec |
A synth_spec object. |
... |
Optional named arguments mapping variables to lists of
|
Value
A new synth_spec object.
A new synth_spec object with added custom steps.
A new synth_spec object with updated custom steps.
A new synth_spec object with removed custom steps.
Examples
synth_spec <- synth_spec()
step1 <- function(x) {
x |> recipes::step_center(recipes::all_predictors(), id = "center")
}
add_custom_steps(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "steps" = step1)
)
synth_spec <- synth_spec()
step1 <- function(x) {
x |> recipes::step_center(recipes::all_predictors(), id = "center")
}
update_custom_steps(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "steps" = step1)
)
synth_spec <- synth_spec()
step1 <- function(x) {
x |> recipes::step_center(recipes::all_predictors(), id = "center")
}
synth_spec <- add_custom_steps(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "steps" = step1)
)
remove_custom_steps(synth_spec = synth_spec)
Add, update, or remove samplers from a synth_spec object
Description
Add, update, or remove samplers from a synth_spec object
Usage
add_custom_samplers(synth_spec, ...)
update_custom_samplers(synth_spec, ...)
remove_custom_samplers(synth_spec)
update_custom_noise(synth_spec, ...)
remove_custom_noise(synth_spec)
Arguments
synth_spec |
A synth_spec object. |
... |
Optional named lists with two elements, |
Value
A new synth_spec object.
A new synth_spec object with added custom samplers.
A new synth_spec object with updated custom samplers.
A new synth_spec object with removed custom samplers.
A new synth_spec object with updated custom noise.
A new synth_spec object with removed custom noise.
Examples
synth_spec <- synth_spec()
add_custom_samplers(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "sampler" = sample_rpart)
)
synth_spec <- synth_spec()
update_custom_samplers(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "sampler" = sample_rpart)
)
synth_spec <- synth_spec()
synth_spec <- add_custom_samplers(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "sampler" = sample_rpart)
)
remove_custom_samplers(synth_spec = synth_spec)
synth_spec <- synth_spec()
noise1 <- noise(
add_noise = TRUE,
noise_func = add_noise_kde,
noise_params = list(
n_ntiles = 2
)
)
update_custom_noise(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "noise" = noise1)
)
synth_spec <- synth_spec()
noise1 <- noise(
add_noise = TRUE,
noise_func = add_noise_kde,
noise_params = list(
n_ntiles = 2
)
)
synth_spec <- add_custom_noise(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "noise" = noise1)
)
remove_custom_noise(synth_spec = synth_spec)
Add, update, or remove tuners from a synth_spec object
Description
Add, update, or remove tuners from a synth_spec object
Usage
add_custom_tuners(synth_spec, ...)
update_custom_tuners(synth_spec, ...)
remove_custom_tuners(synth_spec)
Arguments
synth_spec |
A synth_spec object. |
... |
Optional named lists with two elements, |
Value
A new synth_spec object.
A new synth_spec object with added custom tuners.
A new synth_spec object with updated custom tuners.
A new synth_spec object with removed custom tuners.
Examples
synth_spec <- synth_spec()
tuner1 <- list(
v = 3,
grid = 3,
metrics = yardstick::metric_set(yardstick::rmse)
)
add_custom_tuners(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "tuner" = tuner1)
)
synth_spec <- synth_spec()
tuner1 <- list(
v = 3,
grid = 3,
metrics = yardstick::metric_set(yardstick::rmse)
)
update_custom_tuners(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "tuner" = tuner1)
)
synth_spec <- synth_spec()
tuner1 <- list(
v = 3,
grid = 3,
metrics = yardstick::metric_set(yardstick::rmse)
)
synth_spec <- add_custom_tuners(
synth_spec = synth_spec,
list("vars" = c("a", "b", "c"), "tuner" = tuner1)
)
remove_custom_tuners(synth_spec = synth_spec)
Synthesize a data set
Description
Synthesize a data set
Usage
synthesize(presynth, progress = FALSE)
Arguments
presynth |
A presynth object. |
progress |
A single logical. Should a progress bar be displayed? |
Value
A postsynth object.
Examples
# create roadmap
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
rpart_mod_reg <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "regression")
rpart_mod_class <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "classification")
synth_spec1 <- synth_spec(
default_regression_model = rpart_mod_reg,
default_regression_sampler = sample_rpart,
default_classification_model = rpart_mod_class,
default_classification_sampler = sample_rpart
)
# create a presynth object
# use defaults for noise, constraints, and replicates
presynth1 <- presynth(
roadmap = rm,
synth_spec = synth_spec1
)
# synthesize!
set.seed(1)
postsynth1 <- synthesize(presynth = presynth1)
Generate syntheses from multiple presynth objects.
Description
Generate syntheses from multiple presynth objects.
Usage
tune_synthesis(
presynths,
postprocessing_func,
metadata_func = NULL,
simplify_post = FALSE,
seed = NULL
)
Arguments
presynths |
A list of |
postprocessing_func |
A function with arguments "synth_id", "synth_name", and "postsynth" that performs any desired postprocessing operations, like writing |
metadata_func |
An optional function with argument "presynth" that extracts
specified information from each |
simplify_post |
Boolean that, if true, expects |
seed |
An RNG seed to pass to |
Value
A post_tunesynth object.
Examples
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
dt_mod_reg <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "regression")
dt_mod_reg_cp <- parsnip::decision_tree(cost_complexity = 0.01) |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "regression")
dt_mod_class <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "classification")
# synth specs
synth_spec1 <- synth_spec(
default_regression_model = dt_mod_reg,
default_regression_sampler = sample_rpart,
default_classification_model = dt_mod_class,
default_classification_sampler = sample_rpart
)
synth_spec2 <- synth_spec(
default_regression_model = dt_mod_reg_cp,
default_regression_sampler = sample_rpart,
default_classification_model = dt_mod_class,
default_classification_sampler = sample_rpart
)
presynth1 <- presynth(
roadmap = rm,
synth_spec = synth_spec1
)
presynth2 <- presynth(
roadmap = rm,
synth_spec = synth_spec2
)
postproc_f_null <- function(synth_id, synth_name, postsynth) {
return(postsynth[["synthetic_data"]])
}
tune_synthesis(
presynths = list(presynth1, presynth2),
postprocessing_func = postproc_f_null,
seed = 12345
)
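A postprocessing_func receives synth_id, synth_name, and postsynth, and postproc_f_null reads the synthetic data from postsynth[["synthetic_data"]]. As a rough sketch of a slightly richer function, the code below returns a one-row summary instead of the full data. The name postproc_f_summary and the mock postsynth list are assumptions made for illustration. A real postsynth object comes from synthesize() and may hold more elements.

```r
# Hypothetical postprocessing function: return a one-row summary per synthesis.
postproc_f_summary <- function(synth_id, synth_name, postsynth) {
  synth <- postsynth[["synthetic_data"]]
  # Flag which columns are numeric
  numeric_cols <- vapply(synth, is.numeric, logical(1))
  data.frame(
    synth_id = synth_id,
    synth_name = synth_name,
    n_rows = nrow(synth),
    n_numeric = sum(numeric_cols)
  )
}

# Mock object standing in for a postsynth (assumption: only the
# "synthetic_data" element is used by the function above).
mock_postsynth <- list(
  synthetic_data = data.frame(age = c(31, 45, 62), sex = c("F", "M", "F"))
)

summary_row <- postproc_f_summary(
  synth_id = 1,
  synth_name = "synth_1",
  postsynth = mock_postsynth
)
```

Such a function would be supplied as postprocessing_func in tune_synthesis(), in the same way as postproc_f_null.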
Update presynth object
Description
Update presynth object
Usage
update_presynth(presynth, roadmap = NULL, synth_spec = NULL)
Arguments
presynth |
A |
roadmap |
An optional |
synth_spec |
An optional |
Value
A presynth object.
Examples
# create roadmap
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
rpart_mod_reg <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "regression")
rpart_mod_class <- parsnip::decision_tree() |>
parsnip::set_engine(engine = "rpart") |>
parsnip::set_mode(mode = "classification")
synth_spec1 <- synth_spec(
default_regression_model = rpart_mod_reg,
default_regression_sampler = sample_rpart,
default_classification_model = rpart_mod_class,
default_classification_sampler = sample_rpart
)
# create a presynth object
# use defaults for noise, constraints, and replicates
presynth <- presynth(
roadmap = rm,
synth_spec = synth_spec1
)
lm_mod <- parsnip::linear_reg() |>
parsnip::set_engine(engine = "lm") |>
parsnip::set_mode(mode = "regression")
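synth_spec2 <- synth_spec(
default_regression_model = lm_mod,
default_regression_sampler = sample_lm,
default_classification_model = rpart_mod_class,
default_classification_sampler = sample_rpart
)
# update the presynth object with the new synth_spec
update_presynth(
presynth = presynth,
synth_spec = synth_spec2
)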
Update non-custom synth_spec arguments
Description
Update non-custom synth_spec arguments
Usage
update_synth_spec(synth_spec, ...)
Arguments
synth_spec |
A |
... |
Optional named keywords in |
Value
A synth_spec object.
Examples
synth_spec <- synth_spec()
lm_mod <- parsnip::linear_reg() |>
parsnip::set_engine("lm") |>
parsnip::set_mode(mode = "regression")
update_synth_spec(
synth_spec,
default_regression_model = lm_mod
)
Generate a visit sequence.
Description
Generate a visit sequence.
Usage
visit_sequence(schema, weight_var = NULL, synthesize_weight = TRUE)
Arguments
schema |
A |
weight_var |
A numeric weight for the weighted total ordering. |
synthesize_weight |
Boolean indicating whether weight_var should be included in the visit sequence. |
Value
A visit_sequence object.
Examples
df <- data.frame(
factor_var = c("1", "1", "2"),
vara = c(10000, 20000, 100000),
varb = c(300, 200, 100),
var_loss = c(1999999, 0, -1000000),
weight = c(1000, 1000, 2000)
)
start_df <- dplyr::select(df, factor_var)
schema1 <- schema(
conf_data = dplyr::select(df, -weight),
start_data = start_df
)
vs1 <- visit_sequence(
schema = schema1
)
schema2 <- schema(
conf_data = df,
start_data = start_df
)
vs2 <- visit_sequence(
schema = schema2,
weight_var = weight,
synthesize_weight = TRUE
)
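The weight column in this example changes the column totals noticeably. The base R sketch below computes unweighted and weighted totals for the numeric variables. It only illustrates the arithmetic, makes no claim about the package's ordering algorithm, and does not call tidysynthesis. The names num_vars, unweighted_totals, and weighted_totals are illustrative.

```r
df <- data.frame(
  factor_var = c("1", "1", "2"),
  vara = c(10000, 20000, 100000),
  varb = c(300, 200, 100),
  var_loss = c(1999999, 0, -1000000),
  weight = c(1000, 1000, 2000)
)

num_vars <- c("vara", "varb", "var_loss")

# Unweighted column totals
unweighted_totals <- vapply(df[num_vars], sum, numeric(1))

# Weighted column totals: sum of value * weight for each variable
weighted_totals <- vapply(
  df[num_vars],
  function(x) sum(x * df$weight),
  numeric(1)
)
```

Here var_loss has a positive unweighted total but a negative weighted total, so an ordering based on totals can change once weights are applied.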
Add or reset a visit_sequence object within an existing roadmap.
Description
Add or reset a visit_sequence object within an existing roadmap.
Usage
update_visit_sequence(roadmap, ...)
reset_visit_sequence(roadmap)
Arguments
roadmap |
A |
... |
Optional additional parameters. |
Value
A new roadmap object.
A roadmap with an updated visit_sequence.
A new roadmap object with reset visit_sequence.
Examples
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
rm |>
update_visit_sequence(
weight_var = wgt,
synthesize_weight = TRUE
)
rm <- roadmap(
conf_data = acs_conf_nw,
start_data = acs_start_nw
)
rm <- rm |>
update_visit_sequence(
weight_var = wgt,
synthesize_weight = TRUE
)
reset_visit_sequence(roadmap = rm)