| Type: | Package |
| Title: | Holistic Multimodel Domain Analysis for Exploratory Machine Learning |
| Version: | 0.2.0 |
| Depends: | R (≥ 3.5.0) |
| Description: | Holistic Multimodel Domain Analysis (HMDA) is a robust and transparent framework designed for exploratory machine learning research, aiming to enhance the process of feature assessment and selection. HMDA addresses key limitations of traditional machine learning methods by evaluating the consistency across multiple high-performing models within a fine-tuned modeling grid, thereby improving the interpretability and reliability of feature importance assessments. Specifically, it computes Weighted Mean SHapley Additive exPlanations (WMSHAP), which aggregate feature contributions from multiple models based on weighted performance metrics. HMDA also provides confidence intervals to demonstrate the stability of these feature importance estimates. This framework is particularly beneficial for analyzing complex, multidimensional datasets common in health research, supporting reliable exploration of mental health outcomes such as suicidal ideation, suicide attempts, and other psychological conditions. Additionally, HMDA includes automated procedures for feature selection based on WMSHAP ratios and performs dimension reduction analyses to identify underlying structures among features. For more details see Haghish (2025) <doi:10.13140/RG.2.2.32473.63846>. |
| Imports: | curl (≥ 4.3.0), h2o (≥ 3.34.0.0), shapley (≥ 0.5), autoEnsemble (≥ 0.3), h2otools (≥ 0.4), splitTools (≥ 1.0.1), psych (≥ 2.4.6), dplyr (≥ 1.1.4), ggplot2 (≥ 3.4.2), gridExtra (≥ 2.3), reshape2 (≥ 1.4.5) |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| URL: | http://dx.doi.org/10.13140/RG.2.2.32473.63846, https://github.com/haghish/HMDA |
| BugReports: | https://github.com/haghish/HMDA/issues |
| NeedsCompilation: | no |
| Packaged: | 2026-02-16 11:22:38 UTC; haghish |
| Author: | E. F. Haghish [aut, cre, cph] |
| Maintainer: | E. F. Haghish <haghish@hotmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-16 11:40:02 UTC |
Select Best Models by Performance Metrics
Description
Detects all performance metric columns in a data frame, and for each metric, identifies the best model based on whether a higher or lower value is preferred. The function returns a vector of unique model IDs corresponding to the best models across all detected metrics.
Usage
best_of_family(df)
Arguments
df |
A data frame containing model performance results.
It must include a column named model_id. |
Details
The function first detects numeric columns (other than
"model_id") as performance metrics. It then uses a
predefined mapping to determine the optimal direction for each
metric: for example, higher values of auc and
aucpr are better, while lower values of logloss,
mean_per_class_error, rmse, and mse are
preferred. For any metric not in the mapping, the function
assumes that lower values indicate better performance.
For each metric, the function identifies the row index that
produces the best value according to the corresponding direction
(using which.max() or which.min()). It then extracts
the model_id from that row. The final result is a unique
set of model IDs that represent the best models across all metrics.
Value
A character vector of unique model_id values corresponding to the best model(s)
for each detected performance metric.
Author(s)
E. F. Haghish
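Examples
# A toy sketch (not from the package manual): best_of_family() detects
# auc (higher is better) and logloss (lower is better) as metric columns
# and returns the unique model_ids that are best on at least one metric.
perf <- data.frame(
  model_id = c("gbm_1", "gbm_2", "gbm_3"),
  auc      = c(0.81, 0.84, 0.79),
  logloss  = c(0.52, 0.49, 0.55)
)
best_of_family(perf)
# "gbm_2" wins both metrics here, so the result is a single model ID.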
Check Exploratory Factor Analysis Suitability
Description
Checks whether the specified features in a data frame meet basic criteria for performing exploratory factor analysis (EFA). The function verifies that each feature exists, is numeric, has sufficient variability, and does not have an excessive proportion of missing values. For multiple features, it also evaluates whether the correlation matrix is full rank and whether each feature has at least one sufficiently strong intercorrelation with another feature.
Usage
check_efa(
df,
features,
min_unique = 5,
min_intercorrelation = 0.3,
max_missing_rate = 0.05,
verbose = FALSE
)
Arguments
df |
A data frame containing the features. |
features |
A character vector of feature names to be evaluated. |
min_unique |
An integer specifying the minimum number of unique non-missing values required for a feature. Default is 5. |
min_intercorrelation |
A numeric threshold for the minimum acceptable intercorrelation among features. (Note: this parameter is not used explicitly in the current implementation.) Default is 0.3. |
max_missing_rate |
A numeric threshold for the maximum allowed missing rate. Features with a higher missing rate than this threshold are flagged. Default is 0.05. |
verbose |
Logical; if TRUE, messages describing each check are printed. Default is FALSE. |
Details
The function performs several checks:
- Existence
Verifies that each feature in features is present in df.
- Numeric Type
Checks that each feature is numeric.
- Variability
Ensures that each feature has at least min_unique unique non-missing values.
- Missing Values
Flags features whose missing rate exceeds max_missing_rate.
If more than one feature is provided, the function computes the correlation matrix (using pairwise complete observations) and checks the following (a base R sketch follows the list):
- Full Rank
Whether the correlation matrix is full rank. A rank lower than the number of features indicates redundancy.
- Intercorrelations
Identifies features that do not have any correlation (>= 0.4) with the other features.
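A minimal base R sketch (not the package's own code) of the two matrix-level checks above:
data("USJudgeRatings")
R <- cor(USJudgeRatings[, -1], use = "pairwise.complete.obs")
# Full-rank check: a rank below the number of features signals redundancy
qr(R)$rank == ncol(R)
# Intercorrelation check: flag features whose largest absolute correlation
# with any other feature falls below the hard-coded 0.4 threshold
diag(R) <- 0
names(which(apply(abs(R), 1, max) < 0.4))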
Value
TRUE if all features are deemed suitable for EFA, and FALSE
otherwise. In the latter case, messages detailing the issues are printed.
Author(s)
E. F. Haghish
Examples
# Example: assess feature suitability for EFA using the USJudgeRatings dataset.
# This dataset contains ratings on several aspects of U.S. federal judges' performance.
# Here, we check whether these rating variables are suitable for EFA.
data("USJudgeRatings")
features_to_check <- colnames(USJudgeRatings[,-1])
result <- check_efa(
df = USJudgeRatings,
features = features_to_check,
min_unique = 3,
verbose = TRUE
)
# TRUE indicates the features are suitable.
print(result)
Dictionary of Variable Attributes
Description
Extracts a specified attribute from each column of a data frame and returns a dictionary as a data frame mapping variable names to their corresponding attribute values.
Usage
dictionary(df, attribute = "label", na.rm = TRUE)
Arguments
df |
A data frame whose columns may have attached attributes. |
attribute |
A character string specifying the name of the attribute to extract from each column (e.g., "label"). |
na.rm |
Logical; if TRUE, variables whose requested attribute is missing (NA) are dropped from the returned dictionary. Default is TRUE. |
Details
The function iterates over each column in the input data frame df and retrieves the specified attribute using attr(). If the attribute is not found for a column, NA is returned as its description. The resulting data frame acts as a dictionary for the variables, which is particularly useful for documenting datasets during exploratory data analysis.
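A roughly equivalent base R sketch of this iteration (an illustration, not the package's own implementation):
dict_sketch <- function(df, attribute = "label") {
  # retrieve the requested attribute from each column; NA when absent
  desc <- vapply(df, function(col) {
    lab <- attr(col, attribute)
    if (is.null(lab)) NA_character_ else as.character(lab)
  }, character(1))
  data.frame(name = names(df), description = desc, row.names = NULL)
}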
Value
A data frame with two columns:
- name
The names of the variables in
df.
- description
The extracted attribute values from each variable.
Author(s)
E. F. Haghish
Examples
# Example: Generate a dictionary of variable labels using the USJudgeRatings dataset.
# This dataset contains ratings on various performance measures for U.S. federal judges.
data("USJudgeRatings")
# Assume that the dataset's variables have been annotated with "label" attributes.
# ("label" is the default attribute read by dictionary)
attr(USJudgeRatings$CONT, "label") <- "Content Quality"
attr(USJudgeRatings$INTG, "label") <- "Integrity"
attr(USJudgeRatings$DMNR, "label") <- "Demeanor"
attr(USJudgeRatings$DILG, "label") <- "Diligence"
# Generate the dictionary of labels
dict <- dictionary(USJudgeRatings, "label")
print(dict)
Adjust Hyperparameter Combinations
Description
Prunes or expands a list of
hyperparameters so that the total number of model combinations,
computed as the product of the lengths of each parameter vector,
is near the desired target (n_models). It first prunes the
parameter with the largest number of values until the product is
less than or equal to n_models. Then, if the product is much
lower than the target (less than half of n_models), it attempts
to expand the parameter with the smallest number of values by adding
a midpoint value (if numeric).
Usage
hmda.adjust.params(params, n_models)
Arguments
params |
A list of hyperparameter vectors. |
n_models |
Integer. The desired target number of model combinations. |
Details
The function calculates the current product of the
lengths of the hyperparameter vectors. In a loop, it removes the
last element from the parameter vector with the largest length
until the product is less than or equal to n_models. If the
resulting product is less than half of n_models, the function
attempts to expand the parameter with the smallest length by
computing a midpoint between the two closest numeric values. The
expansion stops if no new value can be added, to avoid an infinite
loop.
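The pruning and expansion loops hinge on the product of the parameter vector lengths; a quick base R illustration (not package code):
params <- list(alpha = c(0.1, 0.2, 0.3, 0.4), beta = 1:5, gamma = c(10, 20, 30))
prod(lengths(params))  # 4 * 5 * 3 = 60 candidate model combinations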
Value
A list of hyperparameter vectors that has been pruned or
expanded so that the product of their lengths is near
n_models.
Author(s)
E. F. Haghish
Examples
# Example 1: Adjust a hyperparameter grid for 100 models.
params <- list(
alpha = c(0.1, 0.2, 0.3, 0.4),
beta = c(1, 2, 3, 4, 5),
gamma = c(10, 20, 30)
)
new_params <- hmda.adjust.params(params, n_models = 100)
print(new_params)
# Example 2: The generated hyperparameters range between min and max of each
# vector in the list
params <- list(
alpha = c(0.1, 0.2),
beta = c(1, 2, 3),
gamma = c(10, 20)
)
new_params <- hmda.adjust.params(params, n_models = 1000)
print(new_params)
Build Stacked Ensemble Model Using autoEnsemble R package
Description
A wrapper in the HMDA package that builds a stacked ensemble model by combining multiple models. It leverages the autoEnsemble package to stack a set of trained models (e.g., from an HMDA grid) into a stronger meta-learner. For more details on autoEnsemble, see the GitHub repository at https://github.com/haghish/autoEnsemble and the autoEnsemble package on CRAN.
Usage
hmda.autoEnsemble(
models,
training_frame,
newdata = NULL,
family = "binary",
strategy = c("search"),
model_selection_criteria = c("auc", "aucpr", "mcc", "f2"),
min_improvement = 1e-05,
max = NULL,
top_rank = seq(0.01, 0.99, 0.01),
stop_rounds = 3,
reset_stop_rounds = TRUE,
stop_metric = "auc",
seed = -1,
verbatim = FALSE
)
Arguments
models |
A grid object (e.g., an H2O grid / HMDA grid) or a character vector of model IDs. |
training_frame |
An H2OFrame (or data frame already uploaded to the H2O server) that contains the training data used to build the base models. |
newdata |
Optional H2OFrame (or data frame) for evaluating the ensemble. Default is NULL. |
family |
Character string specifying the modeling family (e.g., "binary"). Default is "binary". |
strategy |
A character vector specifying the ensemble strategy. The available
strategy is "search" (the default). |
model_selection_criteria |
A character vector specifying the performance metrics
to consider for model selection. The default is c("auc", "aucpr", "mcc", "f2"). |
min_improvement |
Numeric. The minimum improvement in the evaluation metric required to continue the ensemble search. |
max |
Integer. The maximum number of models for each selection criterion.
If NULL, no maximum is imposed. |
top_rank |
Numeric vector. Specifies the percentage (or percentages) of the
top models that should be considered for ensemble selection. If the strategy is
"search", a sequence of percentages is evaluated; the default is seq(0.01, 0.99, 0.01). |
stop_rounds |
Integer. The number of consecutive rounds with no improvement in the performance metric before stopping the search. |
reset_stop_rounds |
Logical. If TRUE, the early-stopping counter is reset whenever the performance improves. Default is TRUE. |
stop_metric |
Character. The metric used for early stopping; the default is
"auc". |
seed |
Integer. A random seed for reproducibility. Default is -1 (no fixed seed). |
verbatim |
Logical. If TRUE, prints additional progress details, useful for debugging. Default is FALSE. |
Details
This wrapper function integrates with the HMDA package workflow to build a
stacked ensemble model from a set of base models. It calls the
ensemble() function from the autoEnsemble package to construct the
ensemble. The function is designed to work within HMDA's framework, where base
models are generated via grid search or AutoML. For more details on the autoEnsemble
approach, see the autoEnsemble repository at https://github.com/haghish/autoEnsemble.
The ensemble strategy "search" (default) searches for the best combination
of top-performing and diverse models to improve overall performance. The wrapper
returns both the final ensemble model and the list of top-ranked models used in the
ensemble.
Value
A list containing:
- model
The ensemble model built by autoEnsemble.
- top_models
A data frame of the top-ranked base models that were used in building the ensemble.
Author(s)
E. F. Haghish
Examples
## Not run:
library(HMDA)
library(h2o)
hmda.init()
# Import a sample binary outcome dataset into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
params <- list(learn_rate = c(0.01, 0.1),
max_depth = c(3, 5, 9),
sample_rate = c(0.8, 1.0)
)
# Train and validate a cartesian grid of GBMs
hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
grid_id = "hmda_grid1",
training_frame = train,
nfolds = 10,
ntrees = 100,
seed = 1,
hyper_params = params)
# Assess the performances of the models
grid_performance <- hmda.grid.analysis(hmda_grid1)
# Return the best 2 models according to each metric
hmda.best.models(grid_performance, n_models = 2)
# build an autoEnsemble model & test it with the testing dataset
meta <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)
print(h2o.performance(model = meta$model, newdata = test))
## End(Not run)
Select Best Models Across All Models in HMDA Grid
Description
Scans an HMDA grid analysis data frame for performance metric columns and, for each metric, selects the best-performing models according to the correct optimization direction (lower is better for some metrics; higher is better for others). The function returns a subset of the input data frame containing the union of selected model IDs.
Usage
hmda.best.models(
df,
n_models = NULL,
distance_percentage = NULL,
metrics = c("logloss", "mae", "mse", "rmse", "rmsle", "mean_per_class_error", "auc",
"aucpr", "r2", "accuracy", "f1", "mcc", "f2"),
hyperparam = FALSE
)
Arguments
df |
A data frame of class "hmda.grid.analysis", as returned by hmda.grid.analysis(). |
n_models |
Integer. The number of top models to select per metric.
If both |
distance_percentage |
Numeric in (0, 1). Alternative to n_models: for each metric, selects every model whose value lies within distance_percentage of the best model's value (e.g., 0.02 keeps models within 2 percent of the best). |
metrics |
Character vector of performance metric column names to consider. Supported metrics are "logloss", "mae", "mse", "rmse", "rmsle", "mean_per_class_error", "auc", "aucpr", "r2", "accuracy", "f1", "mcc", "f2". |
hyperparam |
Logical. If TRUE, the returned data frame also includes the hyperparameter columns for the selected models. Default is FALSE. |
Details
The function uses a predefined set of H2O performance metrics along with their desired optimization directions (one plausible encoding is sketched after the list):
- logloss, mae, mse, rmse, rmsle, mean_per_class_error
Lower values are better.
- auc, aucpr, r2, accuracy, f1, mcc, f2
Higher values are better.
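One plausible encoding of this mapping (an illustrative sketch, not the package's internal code):
direction <- c(
  logloss = "min", mae = "min", mse = "min", rmse = "min",
  rmsle = "min", mean_per_class_error = "min",
  auc = "max", aucpr = "max", r2 = "max", accuracy = "max",
  f1 = "max", mcc = "max", f2 = "max"
)
# e.g., the row index of the best model for a given metric:
# if (direction["auc"] == "max") which.max(df$auc) else which.min(df$auc)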
Value
A data frame containing the union of selected models across all considered metrics.
If hyperparam = FALSE, the output includes model_ids and the metric columns found in df.
If hyperparam = TRUE, the output includes all columns from df for the selected models.
Author(s)
E. F. Haghish
Examples
## Not run:
library(HMDA)
library(h2o)
hmda.init()
# Import a sample binary outcome dataset into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
params <- list(learn_rate = c(0.01, 0.1),
max_depth = c(3, 5, 9),
sample_rate = c(0.8, 1.0)
)
# Train and validate a cartesian grid of GBMs
hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
grid_id = "hmda_grid1",
training_frame = train,
nfolds = 10,
ntrees = 100,
seed = 1,
hyper_params = params)
# Assess the performances of the models
grid_performance <- hmda.grid.analysis(hmda_grid1)
# Return the best 2 models according to each metric
hmda.best.models(grid_performance, n_models = 2)
# return all models with performance metric as high as 98% of the best model, for each metric
# i.e., the distance of the selected models should be up to 2% from the
# best model in each metric
hmda.best.models(grid_performance, distance_percentage = 0.02)
## End(Not run)
Compare SHAP plots across selected models
Description
Produces side-by-side comparison plots of SHAP contributions for multiple models.
Models can be provided explicitly via model_id, or selected automatically from
an hmda.grid.analysis data frame using hmda.best.models() for each metric in
metrics. Two plot styles are supported:
"shap"H2O SHAP summary plot (beeswarm-style) for each model.
"bar"Bar plot based on a single-model
shapley::shapley()run.
Usage
hmda.compare.shap.plot(
hmda.grid.analysis,
newdata = NULL,
model_id = NULL,
metrics = c("aucpr", "mcc", "f2"),
plot = "shap",
top_n_features = 4,
ylimits = c(-1, 1)
)
Arguments
hmda.grid.analysis |
A data frame of class "hmda.grid.analysis", as returned by hmda.grid.analysis(). |
newdata |
An H2OFrame on which SHAP contributions are computed (typically the test set). |
model_id |
Optional character vector of H2O model IDs. If provided, the function compares
these models directly and ignores metrics. |
metrics |
Character vector of metric names used to select the best model per metric from
hmda.grid.analysis. |
plot |
Character. Plot type: "shap" (default) or "bar". |
top_n_features |
Integer. Number of top features shown in each plot. |
ylimits |
Numeric vector of length 2 giving y-axis limits for "bar" plots. Default is c(-1, 1). |
Details
When model_id is NULL, the function selects one model per metric from metrics
using hmda.best.models(). When model_id is provided, models are labeled as
"Model 1", "Model 2", etc.
Value
A gtable/grob object returned by gridExtra::grid.arrange() combining the plots.
The combined plot is also printed.
Author(s)
E. F. Haghish
Examples
## Not run:
library(HMDA)
library(h2o)
hmda.init()
# Import a sample binary outcome dataset into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
params <- list(learn_rate = c(0.01, 0.1),
max_depth = c(3, 5, 9),
sample_rate = c(0.8, 1.0)
)
# Train and validate a cartesian grid of GBMs
hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
grid_id = "hmda_grid1",
training_frame = train,
nfolds = 10,
ntrees = 100,
seed = 1,
hyper_params = params)
# Assess the performances of the models
grid_performance <- hmda.grid.analysis(hmda_grid1)
# compare the best models according to each performance metric
hmda.compare.shap.plot(hmda.grid.analysis = grid_performance,
newdata = test,
metrics = c("aucpr", "mcc", "f2"),
plot = "bar",
top_n_features = 5)
# Return the best 2 models according to each metric
best_models <- hmda.best.models(grid_performance, n_models = 2)
# compare the specified models based on their model_ids
hmda.compare.shap.plot(hmda.grid.analysis = grid_performance,
model_id = best_models$model_ids[1:3],
newdata = test,
metrics = c("aucpr", "mcc", "f2"),
plot = "bar",
top_n_features = 5)
## End(Not run)
Domain-level WMSHAP summary and plot
Description
Wrapper around shapley.domain to compute and visualize
weighted mean SHAP ratios (WMSHAP) at the domain/group/factor level. Domains are user-defined
clusters of feature names (e.g., latent factors or conceptual groups). The function aggregates
feature-level contributions into domain-level contributions and returns a plot and
confidence intervals.
Usage
hmda.domain(
wmshap,
domains,
plot = "bar",
print = FALSE,
colorcode = NULL,
xlab = "Factors"
)
Arguments
wmshap |
object of class 'shapley', as returned by the 'shapley' function |
domains |
character list, specifying the domains for grouping the features' contributions. Domains are clusters of feature names that can be used to compute WMSHAP at a higher level, along with their 95% confidence intervals, to better understand how a cluster of features influences the outcome. Note that only one of the 'features' or 'domains' arguments can be specified at a time. |
plot |
character, specifying the type of the plot, which can be either 'bar' or 'wmshap'. The default is 'bar'. |
print |
logical. If TRUE, the domain-level WMSHAP summary table is printed. |
colorcode |
Character vector for specifying the color names for each domain in the plot. |
xlab |
Character label for WMSHAP domains or factors |
Value
A ggplot object (invisibly returned) and, depending on print, prints the domain summary.
Author(s)
E. F. Haghish
Examples
## Not run:
library(HMDA)
library(h2o)
hmda.init()
# Import a sample binary outcome dataset into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
params <- list(learn_rate = c(0.01, 0.1),
max_depth = c(3, 5, 9),
sample_rate = c(0.8, 1.0)
)
# Train and validate a cartesian grid of GBMs
hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
grid_id = "hmda_grid1",
training_frame = train,
nfolds = 10,
ntrees = 100,
seed = 1,
hyper_params = params)
# compute weighted mean shap values
wmshap <- hmda.wmshap(models = hmda_grid1,
newdata = test,
performance_metric = "aucpr",
standardize_performance_metric = FALSE,
performance_type = "xval",
minimum_performance = 0,
method = "mean",
cutoff = 0.01,
plot = TRUE)
# define domains to combine their WMSHAP values
# =============================================
#
# There are different ways to specify a cluster of features or even
# a group of factors that touch on a broader domain. HMDA includes
# exploratory factor analysis procedure to help with this process
# (see ?hmda.efa function). Here, "assuming" that we have good reasons
# to combine some of the features under some clusters:
domains = list(Group1 = c("x22", "x18", "x14", "x1", "x10", "x4"),
Group2 = c("x25", "x23", "x6", "x27"),
Group3 = c("x28", "x26"))
hmda.domain(wmshap = wmshap,
plot = "bar",
domains = domains,
print = TRUE)
## End(Not run)
Perform Exploratory Factor Analysis with HMDA
Description
Performs exploratory factor analysis (EFA) on a specified set of features from a data frame using the psych package. The function optionally runs parallel analysis to recommend the number of factors, applies a rotation method, reverses specified features, and cleans up factor loadings by zeroing out values below a threshold. It then computes factor scores and reliability estimates, and finally returns a list containing the EFA results, cleaned loadings, reliability metrics, and factor correlations.
Usage
hmda.efa(
df,
features,
algorithm = "minres",
rotation = "promax",
parallel.analysis = TRUE,
nfactors = NULL,
dict = dictionary(df, attribute = "label"),
minimum_loadings = 0.3,
exclude_features = NULL,
ignore_binary = TRUE,
intercorrelation = 0.3,
reverse_features = NULL,
plot = FALSE,
factor_names = NULL,
verbose = TRUE
)
Arguments
df |
A data frame containing the items for EFA. |
features |
A vector of feature names (or indices) in df to include in the analysis. |
algorithm |
Character. The factor extraction method to use.
Default is "minres". |
rotation |
Character. The rotation method to apply to the factor
solution. Default is "promax". |
parallel.analysis |
Logical. If TRUE, runs parallel analysis to recommend the number of factors. Default is TRUE. |
nfactors |
Integer. The number of factors to extract. If NULL, the number recommended by parallel analysis is used. |
dict |
A data frame dictionary with at least two columns:
name and description (see dictionary()). Default is dictionary(df, attribute = "label"). |
minimum_loadings |
Numeric. Any factor loading with an absolute value
lower than this threshold is set to zero. Default is
0.3. |
exclude_features |
Character vector. Features to exclude from the analysis.
Default is NULL. |
ignore_binary |
Logical. If TRUE, binary features are excluded from the analysis. Default is TRUE. |
intercorrelation |
Numeric. (Unused in current version) Intended to set
a minimum intercorrelation threshold between items.
Default is 0.3. |
reverse_features |
A vector of feature names for which the scoring
should be reversed prior to analysis. Default is
NULL. |
plot |
Logical. If TRUE, plots the factor solution. Default is FALSE. |
factor_names |
Character vector. Optional names to assign to the extracted factors (i.e., new column names for loadings). |
verbose |
Logical. If TRUE, prints progress messages and summaries. Default is TRUE. |
Details
This function first checks that the number of factors is either provided
or determined via parallel analysis (if parallel.analysis is TRUE).
A helper function trans() is defined to reverse and standardize item
scores for features specified in reverse_features. Unwanted features can be
excluded via exclude_features. The EFA is then performed using
psych::fa() with the chosen extraction algorithm and rotation method.
Loadings are cleaned by zeroing out values below the minimum_loadings
threshold, rounded, and sorted. Factor scores are computed with
psych::factor.scores() and reliability is estimated using the
omega() function. Finally, factor correlations are extracted from the
EFA object.
Value
A list with the following components:
- parallel.analysis
The output from the parallel analysis, if run.
- efa
The full exploratory factor analysis object returned by
psych::fa().
- efa_loadings
A matrix of factor loadings after zeroing out values below the
minimum_loadings threshold, rounded and sorted.
- efa_reliability
The reliability results (omega) computed from the factor scores.
- factor_correlations
A matrix of factor correlations, rounded to 2 decimal places.
Author(s)
E. F. Haghish
Examples
# Example: assess feature suitability for EFA using the USJudgeRatings dataset.
# This dataset contains ratings on several aspects of U.S. federal judges' performance.
# Here, we check whether these rating variables are suitable for EFA.
data("USJudgeRatings")
features_to_check <- colnames(USJudgeRatings[,-1])
result <- check_efa(
df = USJudgeRatings,
features = features_to_check,
min_unique = 3,
verbose = TRUE
)
# TRUE indicates the features are suitable.
print(result)
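# A follow-up sketch (an assumed call, based on the Usage section above):
# run the EFA itself once the features pass the suitability check.
efa_results <- hmda.efa(
  df = USJudgeRatings,
  features = features_to_check,
  nfactors = 2,              # fix the number of factors ...
  parallel.analysis = FALSE, # ... instead of running parallel analysis
  plot = FALSE
)
efa_results$efa_loadings     # cleaned, sorted factor loadings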
Feature Selection Based on Weighted SHAP Values
Description
This function selects "important", "inessential", and "irrelevant"
features based on a summary of weighted mean SHAP values obtained from a prior
analysis. It uses the SHAP summary table (found in the wmshap object)
to identify features that are deemed important according to a specified method
and cutoff. Features with a lower confidence interval (lowerCI) below zero
are labeled as "irrelevant", while the remaining features are classified as
"inessential" if they do not meet the importance criteria.
Usage
hmda.feature.selection(
wmshap,
method = "mean",
cutoff = 0.01,
top_n_features = NULL
)
Arguments
wmshap |
A list object (typically returned by a weighted SHAP analysis)
that must contain a SHAP summary data frame (e.g., summaryShaps) with columns such as feature, mean, and lowerCI. |
method |
Character. Specify the method for selecting important features
based on their weighted mean SHAP ratios. The default is
"mean", which selects features whose weighted mean SHAP ratio meets or exceeds the cutoff; alternatively, "lowerCI" selects features whose lower confidence bound meets or exceeds the cutoff. |
cutoff |
Numeric. The threshold cutoff for the selection method. Features
with a weighted SHAP value (or ratio) greater than or equal to this value
are considered important. Default is 0.01. |
top_n_features |
Integer. If specified, the function selects the top
top_n_features features with the highest weighted mean SHAP values as important, overriding the method and cutoff arguments. |
Details
The function performs the following steps (a schematic sketch follows the list):
- Retrieves the SHAP summary table from the wmshap object.
- Sorts the summary table in descending order based on the mean SHAP value.
- Identifies all features available in the summary.
- Classifies features as irrelevant if their lowerCI value is below zero.
- If top_n_features is not specified, selects important features as those whose value for the specified method column meets or exceeds the cutoff; the remaining features (excluding those marked as irrelevant) are classified as inessential.
- If top_n_features is provided, the function selects the top n features (based on the sorted order) as important, with the rest (excluding irrelevant ones) being inessential.
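A minimal sketch of this classification rule (base R; it assumes the summary table is stored as wmshap$summaryShaps with columns feature, mean, and lowerCI):
summ <- wmshap$summaryShaps
irrelevant  <- summ$feature[summ$lowerCI < 0]
important   <- setdiff(summ$feature[summ$mean >= 0.01], irrelevant)  # method = "mean", cutoff = 0.01
inessential <- setdiff(summ$feature, c(important, irrelevant))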
Value
A list with three elements:
- important
A character vector of features deemed important.
- inessential
A character vector of features considered inessential (present in the data but not meeting the importance criteria).
- irrelevant
A character vector of features deemed irrelevant, defined as those with a lower confidence interval (lowerCI) below zero.
Author(s)
E. F. Haghish
Examples
## Not run:
library(HMDA)
library(h2o)
hmda.init()
h2o.removeAll()
# Import a sample binary outcome dataset into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
params <- list(learn_rate = c(0.01, 0.1),
max_depth = c(3, 5, 9),
sample_rate = c(0.8, 1.0)
)
# Train and validate a cartesian grid of GBMs
hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
grid_id = "hmda_grid1",
training_frame = train,
nfolds = 10,
ntrees = 100,
seed = 1,
hyper_params = params)
# Assess the performances of the models
grid_performance <- hmda.grid.analysis(hmda_grid1)
# Return the best 2 models according to each metric
hmda.best.models(grid_performance, n_models = 2)
# build an autoEnsemble model & test it with the testing dataset
meta <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)
print(h2o.performance(model = meta$model, newdata = test))
# compute weighted mean shap values
wmshap <- hmda.wmshap(models = hmda_grid1,
newdata = test,
performance_metric = "aucpr",
standardize_performance_metric = FALSE,
performance_type = "xval",
minimum_performance = 0,
method = "mean",
cutoff = 0.01,
plot = TRUE)
# identify the important features
selected <- hmda.feature.selection(wmshap,
method = "mean",
cutoff = 0.01)
print(selected)
## End(Not run)
Tune a Cartesian Hyperparameter Grid in HMDA
Description
Runs a Cartesian grid search for a single tree-based algorithm supported by HMDA
(currently "drf" or "gbm"). The function validates inputs, optionally
generates a grid ID, runs a Cartesian grid search, and (optionally) saves the grid
for recovery and reproducibility.
Usage
hmda.grid(
algorithm = c("drf", "gbm"),
grid_id = NULL,
x,
y,
training_frame = h2o.getFrame("hmda.train.hex"),
validation_frame = NULL,
hyper_params = list(),
nfolds = 10,
seed = NULL,
keep_cross_validation_predictions = TRUE,
recovery_dir = NULL,
sort_by = "logloss",
...
)
Arguments
algorithm |
Character. The algorithm to tune. Supported values are "drf" (Distributed Random Forest) and "gbm" (Gradient Boosting Machine). Only one algorithm can be specified. (Case-insensitive) |
grid_id |
Character. Optional identifier for the grid search.
If NULL, an identifier is generated from the algorithm name and the current time. |
x |
Vector. Predictor column names or indices. |
y |
Character. The response column name or index. |
training_frame |
An H2OFrame containing the training data.
Default is h2o.getFrame("hmda.train.hex"), the training frame uploaded by hmda.partition(). |
validation_frame |
An H2OFrame for early stopping. Default is NULL. |
hyper_params |
List. A list of hyperparameter vectors for tuning.
If you are unsure how to specify the
hyperparameters, consider consulting hmda.suggest.param() and hmda.search.param(). |
nfolds |
Integer. Number of folds for cross-validation. Default is 10. |
seed |
Integer. A seed for reproducibility.
Default is NULL. |
keep_cross_validation_predictions |
Logical. Whether to keep
cross-validation predictions. Default is TRUE. |
recovery_dir |
Character. Directory path to save the grid search
output. If provided, the grid is saved using
h2o.saveGrid(). Default is NULL. |
sort_by |
Character. Metric used to sort the grid. Default is "logloss". |
... |
Additional arguments passed to |
Details
The function executes the following steps:
-
Input Validation: Ensures only one algorithm is specified and verifies that the training frame is an H2OFrame.
-
Grid ID Generation: If no
grid_id is provided, it creates one using the algorithm name and the current time.
Grid Search Execution: Calls
h2o.grid()with the provided hyperparameters and cross-validation settings. -
Grid Saving: If a recovery directory is specified, the grid is saved to disk using
h2o.saveGrid().
The output is an H2O grid object that contains all the trained models.
Value
An object of class H2OGrid containing the grid search
results.
Author(s)
E. F. Haghish
See Also
hmda.grid.analysis, hmda.best.models
Examples
## Not run:
library(HMDA)
library(h2o)
hmda.init()
# Import a sample binary outcome dataset into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
params <- list(learn_rate = c(0.01, 0.1),
max_depth = c(3, 5, 9),
sample_rate = c(0.8, 1.0)
)
# Train and validate a cartesian grid of GBMs
hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
grid_id = "hmda_grid1",
training_frame = train,
nfolds = 10,
ntrees = 100,
seed = 1,
hyper_params = params)
# Assess the performances of the models
grid_performance <- hmda.grid.analysis(hmda_grid1)
# Return the best 2 models according to each metric
hmda.best.models(grid_performance, n_models = 2)
## End(Not run)
Analyze Hyperparameter Grid Performance
Description
Reorders an HMDA grid based on a specified performance metric and supplements the grid's summary table with additional performance metrics extracted via cross-validation. The function returns a data frame of performance metrics for each model in the grid. This enables a detailed analysis of model performance across various metrics such as logloss, AUC, RMSE, etc.
Usage
hmda.grid.analysis(
grid,
performance_metrics = c("logloss", "mse", "rmse", "rmsle", "auc", "aucpr",
"mean_per_class_error", "r2", "f2", "mcc", "kappa"),
sort_by = "logloss"
)
Arguments
grid |
A HMDA grid object from which the performance summary will be extracted. |
performance_metrics |
A character vector of additional performance metric
names to be included in the analysis. Default is
c("logloss", "mse", "rmse", "rmsle", "auc", "aucpr", "mean_per_class_error", "r2", "f2", "mcc", "kappa"). |
sort_by |
A character string indicating the performance metric to sort the grid
by. Default is "logloss". |
Details
The function performs the following steps:
-
Grid Reordering: It calls
h2o.getGrid() to reorder the grid based on the sort_by metric. For metrics like "logloss", "mse", "rmse", and "rmsle", sorting is in ascending order; for others, it is in descending order.
Performance Table Extraction: The grid's summary table is converted into a data frame.
-
Additional Metric Calculation: For each metric specified in
performance_metrics(other than the one used for sorting), the function initializes a column with NA values and iterates over each model in the grid (via itsmodel_ids) to extract the corresponding cross-validated performance metric using functions such ash2o.auc(),h2o.rmse(), etc. For threshold-based metrics (e.g.,f2,mcc,kappa), it retrieves performance viah2o.performance(). -
Return: The function returns the merged data frame of performance metrics.
Value
A data frame of class "hmda.grid.analysis" that contains the merged
performance summary table. This table includes the default metrics from the grid
summary along with the additional metrics specified by performance_metrics
(if available). The data frame is sorted according to the sort_by metric.
Author(s)
E. F. Haghish
Examples
## Not run:
# NOTE: This example may take a long time to run on your machine
# Initialize H2O (if not already running)
library(HMDA)
library(h2o)
hmda.init()
# Import a sample binary outcome train/test set into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
# Run the hyperparameter search using DRF and GBM algorithms.
result <- hmda.search.param(algorithm = c("gbm"),
x = x,
y = y,
training_frame = train,
max_models = 100,
nfolds = 10,
stopping_metric = "AUC",
stopping_rounds = 3)
# Assess the performances of the models
grid_performance <- hmda.grid.analysis(result$grid)
# Return the best 2 models according to each metric
hmda.best.models(grid_performance, n_models = 2)
## End(Not run)
Initialize or Restart H2O Cluster for HMDA Analysis
Description
Initializes or restarts an H2O cluster configured for Holistic Multimodel Domain Analysis. It sets up the cluster with specified CPU threads, memory, and connection settings. It first checks for an existing cluster, shuts it down if found, and then repeatedly attempts to establish a new connection, retrying up to 10 times if necessary.
Usage
hmda.init(
cpu = -1,
ram = NULL,
java = NULL,
ip = "localhost",
port = 54321,
verbatim = FALSE,
restart = TRUE,
shutdown = FALSE,
ignore_config = TRUE,
bind_to_localhost = FALSE,
...
)
Arguments
cpu |
integer. The number of CPU threads to use. -1 indicates all available threads. Default is -1. |
ram |
numeric. Minimum memory (in GB) for the cluster. If NULL, all available memory is used. |
java |
character. Path to the Java JDK. If provided, sets JAVA_HOME accordingly. |
ip |
character. The IP address for the H2O server. Default is "localhost". |
port |
integer. The port for the H2O server. Default is 54321. |
verbatim |
logical. If TRUE, prints detailed cluster info. Default is FALSE. |
restart |
logical. If TRUE, any existing server is shut down and restarted. |
shutdown |
logical. If TRUE, the server is shut down. |
ignore_config |
logical. If TRUE, ignores any existing H2O configuration. Default is TRUE. |
bind_to_localhost |
logical. If TRUE, restricts access to the cluster to the local machine. Default is FALSE. |
... |
Additional arguments passed to h2o.init(). |
Details
The function sets JAVA_HOME if a Java path is provided. It checks for an existing cluster via h2o.clusterInfo(). If found, the cluster is shut down and the function waits 5 seconds. It then attempts to initialize a new cluster using h2o.init() with the specified settings. On failure, it retries every 3 seconds, up to 10 attempts. If all attempts fail, an error is thrown.
Value
An H2OConnection object (invisibly printed by H2O) if shutdown = FALSE;
otherwise returns NULL.
Author(s)
E. F. Haghish
Examples
## Not run:
# Example 1: Initialize the H2O cluster with default settings.
library(HMDA)
hmda.init()
# Example 2: Initialize with specific settings such as Java path.
conn <- hmda.init(
cpu = 4,
ram = 8,
java = "/path/to/java", #e.g., "C:/Program Files/Java/jdk1.8.0_241"
ip = "localhost",
port = 54321,
verbatim = TRUE
)
# check the status of the h2o connection
h2o::h2o.clusterInfo(conn) #you can use h2o functions to interact with the server
## End(Not run)
Partition Data for HMDA Analysis
Description
Partition a data frame into training, testing, and
optionally validation sets, and upload these sets to a local
H2O server. If an outcome column y is provided and is a
factor or character, stratified splitting is used; otherwise, a
random split is performed. The proportions must sum to 1.
Usage
hmda.partition(
df,
y = NULL,
train = 0.8,
test = 0.2,
validation = NULL,
seed = 2025
)
Arguments
df |
A data frame to partition. |
y |
Optional character string naming the outcome column in df. If provided and the column is a factor or character, a stratified split is used. |
train |
A numeric value for the proportion of the training set. |
test |
A numeric value for the proportion of the testing set. |
validation |
Optional numeric. Proportion of observations assigned to the validation set.
Default is NULL (no validation set). |
seed |
A numeric seed for reproducibility. Default is 2025. |
Details
This function uses the splitTools package to perform
the partition. When y is provided and is a factor or character,
a stratified split is performed to preserve class proportions. Otherwise,
a basic random split is used. The partitions are then converted to H2O
frames using h2o::as.h2o().
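A minimal sketch of the underlying stratified split (assuming splitTools::partition() semantics; not the package's own code):
library(splitTools)
idx <- partition(iris$Species,
                 p = c(train = 0.7, test = 0.15, validation = 0.15),
                 seed = 2025)  # stratified split for a factor outcome
sapply(idx, length)            # sizes of the three index sets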
Value
A named list containing the partitioned data frames and their corresponding H2O frames:
- hmda.train
Training set (data frame).
- hmda.test
Testing set (data frame).
- hmda.validation
Validation set (data frame), if any.
- hmda.train.hex
Training set as an H2O frame.
- hmda.test.hex
Testing set as an H2O frame.
- hmda.validation.hex
Validation set as an H2O frame, if applicable.
Author(s)
E. F. Haghish
Examples
## Not run:
# Example: Random split (80% train, 20% test) using iris data
data(iris)
splits <- hmda.partition(
df = iris,
train = 0.8,
test = 0.2,
seed = 2025
)
train_data <- splits$hmda.train
test_data <- splits$hmda.test
# Example: Stratified split (70% train, 15% test, 15% validation)
# using iris data, stratified by Species
splits_strat <- hmda.partition(
df = iris,
y = "Species",
train = 0.7,
test = 0.15,
validation = 0.15,
seed = 2025
)
train_strat <- splits_strat$hmda.train
test_strat <- splits_strat$hmda.test
valid_strat <- splits_strat$hmda.validation
## End(Not run)
Plot WMSHAP contributions
Description
This function applies different criteria to visualize WMSHAP contributions.
Usage
hmda.plot(
wmshap,
plot = "bar",
method = "mean",
cutoff = 0.01,
top_n_features = NULL,
features = NULL,
legendstyle = "continuous",
scale_colour_gradient = NULL,
labels = NULL
)
Arguments
wmshap |
object of class 'shapley', as returned by the 'shapley' function or hmda.wmshap function |
plot |
Character. Plot type passed to the underlying shapley plotting routine; e.g., "bar" (the default) or "wmshap". |
method |
Character. The column name in the wmshap summary table used for feature selection, e.g., "mean" (default) or "lowerCI". |
cutoff |
numeric, specifying the cutoff for the method used for selecting the top features. |
top_n_features |
Integer. If specified, the top n features with the highest weighted SHAP values will be selected, overruling the 'cutoff' and 'method' arguments. |
features |
character vector, specifying the features to be plotted. |
legendstyle |
character, specifying the style of the plot legend, which can be either 'continuous' (default) or 'discrete'. The continuous legend is only applicable to 'wmshap' plots; other plots use the 'discrete' legend. |
scale_colour_gradient |
character vector for specifying the color gradients for the plot. |
labels |
Optional named character vector mapping feature names to display labels, e.g.,
c(PSA = "Prostate-Specific Antigen"). |
Value
ggplot object
Author(s)
E. F. Haghish
Examples
## Not run:
library(HMDA)
library(h2o)
hmda.init()
# Import a sample binary outcome dataset into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
params <- list(learn_rate = c(0.01, 0.1),
max_depth = c(3, 5, 9),
sample_rate = c(0.8, 1.0)
)
# Train and validate a cartesian grid of GBMs
hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
grid_id = "hmda_grid1",
training_frame = train,
nfolds = 10,
ntrees = 100,
seed = 1,
hyper_params = params)
# compute weighted mean shap values
wmshap <- hmda.wmshap(models = hmda_grid1,
newdata = test,
performance_metric = "aucpr",
standardize_performance_metric = FALSE,
performance_type = "xval",
minimum_performance = 0,
method = "mean",
cutoff = 0.01,
plot = TRUE)
#######################################################
### PLOT THE WEIGHTED MEAN SHAP VALUES
#######################################################
hmda.plot(wmshap, plot = "bar")
## End(Not run)
Plot model performance metrics across a grid of models
Description
Creates a line plot comparing multiple (maximize-type) performance metrics across a set of models.
The input data frame is typically the output of hmda.grid.analysis() and must contain
a model_ids column and one or more numeric metric columns (e.g., aucpr, mcc, f2).
The function can either plot the first top_models rows (criteria = "top_models")
or include all models that achieve at least distance_percentage times the best value
for at least one metric (criteria = "distance_percentage").
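A minimal base R sketch (not package code) of the "distance_percentage" rule, assuming grid_performance is a hmda.grid.analysis data frame:
metrics <- c("auc", "aucpr")
best <- sapply(grid_performance[metrics], max, na.rm = TRUE)
keep <- apply(grid_performance[metrics], 1,
              function(row) any(row >= 0.95 * best, na.rm = TRUE))
grid_performance[keep, ]  # models within 95 percent of the best on any metric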
Usage
hmda.plot.metrics(
df,
metrics = c("auc", "aucpr", "r2", "mcc", "f2"),
criteria = "distance_percentage",
top_models = 100,
distance_percentage = 0.95,
plot = TRUE,
title = NULL
)
Arguments
df |
A data frame of class "hmda.grid.analysis", as returned by hmda.grid.analysis(). |
metrics |
Character vector of column names in df to plot; all are treated as maximize-type metrics. |
criteria |
Character. One of "top_models" or "distance_percentage" (default). |
top_models |
Integer. Number of top rows to plot when criteria = "top_models". Default is 100. |
distance_percentage |
Numeric in (0, 1]. When criteria = "distance_percentage", a model is kept if, on at least one metric, it reaches distance_percentage times the best value. Default is 0.95. |
plot |
Logical. If TRUE, the plot is printed. Default is TRUE. |
title |
Character. Add title to the plot. |
Author(s)
E. F. Haghish
Examples
## Not run:
# Example: Create a hyperparameter grid for GBM models.
predictors <- c("var1", "var2", "var3")
response <- "target"
# Define hyperparameter ranges
hyper_params <- list(
ntrees = seq(50, 150, by = 25),
max_depth = c(5, 10, 15),
learn_rate = c(0.01, 0.05, 0.1),
sample_rate = c(0.8, 1.0),
col_sample_rate = c(0.8, 1.0)
)
# Run the grid search
grid <- hmda.grid(
algorithm = "gbm",
x = predictors,
y = response,
training_frame = h2o.getFrame("hmda.train.hex"),
hyper_params = hyper_params,
nfolds = 10,
stopping_metric = "AUTO"
)
# Assess the performances of the models
grid_performance <- hmda.grid.analysis(grid)
# plot the metrics of models that are within 95% of the best models
# for each of the specified metrics
hmda.plot.metrics(grid_performance,
criteria = "distance_percentage",
distance_percentage = 0.95,
metrics = c("auc", "aucpr", "r2", "mcc", "f2"))
## End(Not run)
Search for Hyperparameters via Random Search
Description
Runs an automated (random) hyperparameter search, returns the leaderboard together with the detailed hyperparameters of each model, and summarizes the results under several strategies. These strategies include:
- Best of Family
Selects the best model for each performance metric (avoiding duplicate model IDs).
- Top 2
Extracts hyperparameter settings from the top 2 models (according to a specified ranking metric).
- Top 5
Extracts hyperparameter settings from the top 5 models.
- Top 10
Extracts hyperparameter settings from the top 10 models.
These summaries help in identifying candidate hyperparameter ranges for further manual tuning. Note that a good suggestion depends on the extent of random search you carry out.
Usage
hmda.search.param(
algorithm = c("drf", "gbm"),
sort_by = "logloss",
x,
y,
training_frame = h2o.getFrame("hmda.train.hex"),
validation_frame = NULL,
max_models = 100,
max_runtime_secs = 3600,
nfolds = 10,
seed = NULL,
fold_column = NULL,
weights_column = NULL,
keep_cross_validation_predictions = TRUE,
stopping_rounds = NULL,
stopping_metric = "AUTO",
stopping_tolerance = NULL,
...
)
Arguments
algorithm |
Character. The algorithm to include in the random search. Supported values include "drf" (Distributed Random Forest) and "gbm" (Gradient Boosting Machine). The input is case-insensitive. |
sort_by |
Character string specifying the metric used to rank models. Default is "logloss". |
x |
Vector of predictor column names or indices. |
y |
Character string specifying the response column. |
training_frame |
An H2OFrame containing the training data.
Default is h2o.getFrame("hmda.train.hex"). |
validation_frame |
An H2OFrame for early stopping.
Default is NULL. |
max_models |
Integer. Maximum number of models to build. Default is 100. |
max_runtime_secs |
integer. Amount of time (in seconds) that the search is allowed to run. Default is 3600. |
nfolds |
Integer. Number of folds for cross-validation. Default is 10. |
seed |
Integer. A seed for reproducibility.
Default is NULL. |
fold_column |
Character. Column name for cross-validation fold
assignment. Default is NULL. |
weights_column |
Character. Column name for observation weights.
Default is NULL. |
keep_cross_validation_predictions |
Logical. Whether to keep
cross-validation predictions. Default is TRUE. |
stopping_rounds |
Integer. Number of rounds with no improvement
before early stopping. Default is NULL. |
stopping_metric |
Character. Metric to use for early stopping. Default is "AUTO". |
stopping_tolerance |
Numeric. Relative tolerance for early stopping.
Default is NULL. |
... |
Additional arguments passed to h2o.automl(). |
Details
The function executes an automated hyperparameter search for the specified
algorithm. It then extracts the leaderboard from the H2OAutoML object and
retrieves detailed hyperparameter information for each model using automlModelParam() from the
h2otools package. The leaderboard and hyperparameter data are merged by the
model_id column. Sorting of the merged results is performed based on
the sort_by metric. For "logloss", "mean_per_class_error", "rmse", and "mse",
lower values are considered better; for all other metrics, higher values
are preferred.
After sorting, the function applies several strategies to summarize the hyperparameter search:
-
Best of Family: Selects the best model for each performance metric, ensuring that no model ID appears more than once.
-
Top 2: Gathers hyperparameter settings from the top 2 models.
-
Top 5 and Top 10: Similarly, collects hyperparameter settings from the top 5 and top 10 models, respectively.
-
All: Lists all the hyperparameter settings that were tried.
These strategies provide different levels of granularity for analyzing the hyperparameter space and can be used for prototyping and further manual tuning.
Value
A list with the following components:
- grid
The H2OAutoML object returned by random search
- leaderboard
A merged data frame that combines leaderboard performance metrics with hyperparameter settings for each model. The data frame is sorted based on the specified ranking metric.
- hyperparameters_best_of_family
A summary list of the best hyperparameter settings for each performance metric. This strategy selects the best model per metric while avoiding duplicate model IDs.
- hyperparameters_top2
A list of hyperparameter settings from the top 2 models as ranked by the chosen metric.
- hyperparameters_top5
A list of hyperparameter settings from the top 5 models.
- hyperparameters_top10
A list of hyperparameter settings from the top 10 models.
- hyperparameters_all
A list of all hyperparameter settings.
Examples
## Not run:
# NOTE: This example may take a long time to run on your machine
# Initialize H2O (if not already running)
library(HMDA)
library(h2o)
hmda.init()
# Import a sample binary outcome train/test set into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
# Run the hyperparameter search using DRF and GBM algorithms.
result <- hmda.search.param(algorithm = c("gbm"),
x = x,
y = y,
training_frame = train,
max_models = 100,
nfolds = 10,
stopping_metric = "AUC",
stopping_rounds = 3)
# Access the hyperparameter list of the best_of_family strategy:
result$hyperparameters_best_of_family
# Access the hyperparameters of the top 5 models based on the specified ranking metric
result$hyperparameters_top5
## End(Not run)
Suggest Hyperparameters for tuning HMDA Grids
Description
Suggests candidate hyperparameter values for tree-based algorithms. It computes a hyperparameter grid whose total number of model combinations is near a specified target.
The function builds a default candidate
grid for "gbm" or "drf", and then prunes or expands the grid using
hmda.adjust.params() so the total number of model combinations is near n_models.
For GBM models, default candidates include max_depth, ntrees, learn_rate,
sample_rate, and col_sample_rate. For DRF models, if x (predictors) and
family ("regression" or "classification") are provided, a vector of mtries is also suggested via suggest_mtries().
Usage
hmda.suggest.param(algorithm, n_models, x = NULL, family = NULL)
Arguments
algorithm |
A character string specifying the algorithm, which can be either "gbm" (gradient boosting machines) or "drf" (distributed random forest). |
n_models |
An integer for the desired approximate number of model combinations in the grid. Must be at least 100. |
x |
(Optional) A vector of predictor names. If provided and its length is at least 20, it is used to compute mtries for DRF. |
family |
(Optional) A character string indicating the
modeling family. Must be either "classification"
or "regression". This is used with |
Details
The function first checks that n_models is at least 100,
then validates the family parameter if provided. The
algorithm name is normalized to lowercase and must be either
"gbm" or "drf". For "gbm", a default grid of hyperparameters is
defined. For "drf", if both x and family are provided,
the function computes mtries via suggest_mtries(). If not,
a default grid is set without mtries. Finally, the candidate grid is
pruned or expanded using hmda.adjust.params() so that the
total number of combinations is near n_models.
Value
A named list of hyperparameter value vectors. This list is suitable for use with HMDA and H2O grid search functions.
Examples
## Not run:
library(h2o)
h2o.init()
# Example 1: Suggest hyperparameters for GBM with about 120 models.
params_gbm <- hmda.suggest.param("gbm", n_models = 120)
print(params_gbm)
# Example 2: Suggest hyperparameters for DRF (classification) with
# 100 predictors.
params_drf <- hmda.suggest.param(
algorithm = "drf",
n_models = 150,
x = paste0("V", 1:100),
family = "classification"
)
print(params_drf)
## End(Not run)
Permutation Test for Comparing WMSHAP Contributions
Description
Runs a permutation test to compare the weighted mean SHAP (WMSHAP) contributions of two features, or of two domains (groups of features), based on a 'shapley' object.
Usage
hmda.test(wmshap, features = NULL, domains = NULL, n = 5000)
Arguments
wmshap |
object of class 'shapley', as returned by the 'shapley' function |
features |
character vector, names of the two features to be compared with the permutation test. |
domains |
list of two character vectors, each including a number of features. |
n |
integer, number of permutations |
Value
The result of the permutation test comparing the WMSHAP contributions of the specified features or domains.
Author(s)
E. F. Haghish
Examples
## Not run:
# load the required libraries for building the base-learners and the ensemble models
library(h2o) #shapley supports h2o models
library(autoEnsemble) #autoEnsemble models, particularly useful under severe class imbalance
library(shapley)
# initiate the h2o server
h2o.init(ignore_config = TRUE, nthreads = 2, bind_to_localhost = FALSE, insecure = TRUE)
# upload data to h2o cloud
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.importFile(path = prostate_path, header = TRUE)
### H2O provides 2 types of grid search for tuning the models, which are
### AutoML and Grid. Below, I demonstrate how weighted mean shapley values
### can be computed for both types.
set.seed(10)
#######################################################
### PREPARE AutoML Grid (takes a couple of minutes)
#######################################################
# run AutoML to tune various models (GBM) for 60 seconds
y <- "CAPSULE"
prostate[,y] <- as.factor(prostate[,y]) #convert to factor for classification
aml <- h2o.automl(y = y, training_frame = prostate, max_runtime_secs = 120,
include_algos=c("GBM"),
# this setting ensures the models are comparable for building a meta learner
seed = 2023, nfolds = 10,
keep_cross_validation_predictions = TRUE)
### call 'shapley' function to compute the weighted mean and weighted confidence intervals
### of SHAP values across all trained models.
### Note that the 'newdata' should be the testing dataset!
result <- shapley(models = aml, newdata = prostate, plot = TRUE)
#######################################################
### Significance testing of contributions of two features
#######################################################
# testing the WMSHAP contributions of two features
hmda.test(result, features = c("GLEASON", "PSA"), n = 5000)
# testing the WMSHAP contributions of two domains (groups of features, latent factors, etc.)
hmda.test(result,
n = 5000,
domains = list(Demographic = c("RACE", "AGE"),
Cancer = c("VOL", "PSA", "GLEASON")))
## End(Not run)
Compute Weighted Mean SHAP Values and Confidence Intervals via shapley algorithm
Description
Wrapper for shapley that computes weighted mean SHAP ratios (WMSHAP)
and (when available) confidence intervals across a set of H2O models. Model contributions
are aggregated across multiple models and weighted by a chosen performance metric (e.g.,
"r2" for regression; "auc", "aucpr", or "f2" for classification).
This approach considers the variability of feature
importance across multiple models rather than reporting SHAP values from a single model.
For more details about the shapley algorithm, see https://github.com/haghish/shapley
Usage
hmda.wmshap(
models,
newdata,
plot = TRUE,
performance_metric = "r2",
standardize_performance_metric = FALSE,
performance_type = "xval",
minimum_performance = 0,
method = "lowerCI",
cutoff = 0.01,
top_n_features = NULL,
n_models = 10,
sample_size = NULL
)
Arguments
models |
A grid object, an AutoML grid, an autoEnsemble object, or a character vector of H2O model IDs from which the SHAP values will be computed. |
newdata |
An H2OFrame (or data frame already uploaded to the H2O server) on which the SHAP values will be evaluated. |
plot |
Logical. If TRUE (default), a plot of the weighted mean SHAP values and their confidence intervals is generated. |
performance_metric |
Character. Specifies the performance metric to be used as
weights for the SHAP values. The default is "r2" (R squared); for classification, "auc", "aucpr", or "f2" can be used. |
standardize_performance_metric |
Logical. If TRUE, the performance metric used for weighting is standardized. Default is FALSE. |
performance_type |
Character. Specifies whether the performance metric should be
retrieved from the training data ("train"), validation data ("valid"), or
cross-validation ("xval"). Default is "xval". |
minimum_performance |
Numeric. The minimum performance threshold; any model with
a performance equal to or lower than this threshold will have a weight of zero in
the weighted SHAP calculation. Default is 0. |
method |
Character. Specify the method for selecting important features
based on their weighted mean SHAP ratios. The default is
"lowerCI", which selects features whose lower bound of the weighted
mean SHAP ratio confidence interval exceeds the cutoff;
alternatively, "mean" selects features whose weighted mean SHAP
ratio exceeds the cutoff. |
cutoff |
Numeric. The cutoff value used in the feature selection method
(default is 0.01). |
top_n_features |
Integer. If specified, only the top top_n_features features, ranked by weighted mean SHAP value, are considered. Default is NULL (all features). |
n_models |
Integer. The minimum number of models that must meet the
minimum_performance criterion in order for the weighted mean SHAP
values to be computed. Default is 10. |
sample_size |
Integer. The number of rows in newdata on which the SHAP values are evaluated. Default is NULL, in which case all rows are used. |
Details
This function is designed as a wrapper for the HMDA package and calls the
shapley() function from the shapley package. It computes the weighted
average of SHAP values across multiple models, using a specified performance
metric (e.g., R squared or AUC) as the weight. The performance metric can be
standardized if required. Additionally, the function selects top features based on
different methods (e.g., "mean" or "lowerCI") and
can limit the number of features considered via top_n_features. The
n_models parameter controls how many models must meet a minimum performance
threshold to be included in the weighted SHAP calculation.
For more information on the shapley and WMSHAP approaches used in HMDA, please refer to the shapley package documentation and the following resources:
shapley GitHub: https://github.com/haghish/shapley
shapley CRAN: https://CRAN.R-project.org/package=shapley
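As a conceptual illustration, WMSHAP is a performance-weighted average of a feature's per-model contributions; the sketch below uses made-up numbers and plain R, not the package internals:
# Performance-weighted mean of one feature's SHAP ratio across three models
# (hypothetical values):
shap_ratio <- c(0.12, 0.15, 0.10)  # the feature's SHAP ratio in each model
perf       <- c(0.81, 0.86, 0.79)  # each model's performance (e.g., AUC)
weights    <- perf / sum(perf)     # normalize performances into weights
sum(weights * shap_ratio)          # weighted mean SHAP ratio (WMSHAP)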
Value
A shapley object (a list) with the following components:
- plot
A ggplot2 object showing the weighted mean SHAP values and confidence intervals (if plot = TRUE).
- shap_values
A data frame of the weighted mean SHAP (WMSHAP) values and confidence intervals for each feature.
- performance
A data frame of performance metrics for all models used in the analysis.
- model_ids
A vector of model IDs corresponding to the models evaluated.
Author(s)
E. F. Haghish
Examples
## Not run:
library(HMDA)
library(h2o)
hmda.init()
# Import a sample binary outcome dataset into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
params <- list(learn_rate = c(0.01, 0.1),
max_depth = c(3, 5, 9),
sample_rate = c(0.8, 1.0)
)
# Train and validate a cartesian grid of GBMs
hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
grid_id = "hmda_grid1",
training_frame = train,
nfolds = 10,
ntrees = 100,
seed = 1,
hyper_params = params)
# Assess the performances of the models
grid_performance <- hmda.grid.analysis(hmda_grid1)
# Return the best 2 models according to each metric
hmda.best.models(grid_performance, n_models = 2)
# build an autoEnsemble model & test it with the testing dataset
meta <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)
print(h2o.performance(model = meta$model, newdata = test))
# compute weighted mean shap values
wmshap <- hmda.wmshap(models = hmda_grid1,
newdata = test,
performance_metric = "aucpr",
standardize_performance_metric = FALSE,
performance_type = "xval",
minimum_performance = 0,
method = "mean",
cutoff = 0.01,
plot = TRUE)
# identify the important features
selected <- hmda.feature.selection(wmshap,
method = c("mean"),
cutoff = 0.01)
print(selected)
# View the plot of weighted mean SHAP values and confidence intervals
print(wmshap$plot)
## End(Not run)
Create SHAP Summary Table Based on the Given Criterion
Description
Generates a summary table of weighted mean SHAP (WMSHAP) values
and confidence intervals for each feature based on a weighted SHAP analysis.
The function filters the SHAP summary table (from a wmshap object) by
selecting features that meet or exceed a specified cutoff using a selection
method (default "mean"). It then sorts the table by the mean SHAP value,
formats the SHAP values along with their 95% confidence intervals into a single
string, and optionally adds human-readable feature descriptions from a provided
dictionary. The output is returned as a markdown table using the pander
package, or as a data frame if requested.
Usage
hmda.wmshap.table(
wmshap,
method = "lowerCI",
cutoff = 0.01,
round = 3,
exclude_features = NULL,
dict = NULL,
markdown.table = TRUE,
split.tables = 120,
split.cells = 50
)
Arguments
wmshap |
A wmshap object, returned by the hmda.wmshap function
containing a data frame of weighted mean SHAP values (the SHAP summary table). |
method |
Character. Specify the method for selecting important features
based on their weighted mean SHAP ratios. The default is
"lowerCI"; alternatively, "mean" selects features whose weighted
mean SHAP ratio meets or exceeds the cutoff. |
cutoff |
Numeric. The threshold cutoff for the selection method;
only features whose value in the selected method's column meets or exceeds this cutoff are retained. Default is 0.01. |
round |
Integer. The number of decimal places to round the
SHAP mean and confidence interval values. Default is 3. |
exclude_features |
Character vector. A vector of feature names to be
excluded from the summary table. Default is NULL. |
dict |
A data frame containing at least two columns, one holding the
feature names and one holding their human-readable descriptions.
Default is NULL. |
markdown.table |
Logical. If TRUE (default), the summary is returned as a markdown table via the pander package; otherwise, a data frame is returned. |
split.tables |
Integer. Controls table splitting in pander(). Default is 120. |
split.cells |
Integer. Controls cell splitting in pander(). Default is 50. |
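For illustration, a feature dictionary passed to dict might be constructed as below; the column names used here are assumptions, so match them to the columns your wmshap workflow expects:
# Hypothetical feature dictionary (column names are illustrative):
dict <- data.frame(
  name        = c("GLEASON", "PSA"),
  description = c("Gleason score", "Prostate-specific antigen level")
)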
Value
If markdown.table = TRUE, returns a markdown table (invisibly)
showing two columns: "Description" and "WMSHAP". If
markdown.table = FALSE, returns a data frame with these columns.
Author(s)
E. F. Haghish
Examples
## Not run:
library(HMDA)
library(h2o)
hmda.init()
# Import a sample binary outcome dataset into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
params <- list(learn_rate = c(0.01, 0.1),
max_depth = c(3, 5, 9),
sample_rate = c(0.8, 1.0)
)
# Train and validate a cartesian grid of GBMs
hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
grid_id = "hmda_grid1",
training_frame = train,
nfolds = 10,
ntrees = 100,
seed = 1,
hyper_params = params)
# Assess the performances of the models
grid_performance <- hmda.grid.analysis(hmda_grid1)
# Return the best 10 models according to each metric
hmda.best.models(grid_performance, n_models = 10)
# compute weighted mean shap values for all models
# Note: you may also compute WMSHAP for a selected set of well-performing models
wmshap <- hmda.wmshap(models = hmda_grid1,
newdata = test,
performance_metric = "aucpr",
performance_type = "xval",
minimum_performance = 0,
method = "mean",
cutoff = 0.01,
plot = TRUE)
# identify the important features
selected <- hmda.feature.selection(wmshap,
method = c("mean"),
cutoff = 0.01)
print(selected)
# View the plot of weighted mean SHAP values and confidence intervals
print(wmshap$plot)
# get the wmshap table output in Markdown format:
md_table <- hmda.wmshap.table(wmshap = wmshap,
method = "mean",
cutoff = 0.01,
round = 3,
markdown.table = TRUE)
head(md_table)
## End(Not run)
Suggest Alternative mtries Values
Description
Provides a set of candidate values for the
mtries parameter used in Random Forest models.
The suggestions are computed based on the number of
predictors (p) and the modeling family. For
classification, the common default is sqrt(p),
while for regression it is typically p/3. For each
family, alternative candidates are offered to aid model
tuning.
Usage
suggest_mtries(p, family = c("classification", "regression"))
Arguments
p |
Integer. The number of features (predictors) in the dataset. This value is used to compute candidate mtries. |
family |
Character. Must be either "classification" or "regression". This determines the set of candidate values. |
Details
For classification, the default is often
sqrt(p); alternative suggestions include
log2(p) and p^(1/3). For regression,
the typical default is p/3, but candidates such as
p/2 or p/5 may also be useful. The best
choice depends on the data structure and predictor
correlations.
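For a quick sense of scale, the candidate formulas described above can be computed directly; this sketch is illustrative, and the function's exact candidate set may differ:
# Textbook mtries candidates for p = 100 predictors:
p <- 100
floor(sqrt(p))   # classification default: 10
floor(log2(p))   # alternative: 6
floor(p^(1/3))   # alternative: 4
floor(p / 3)     # regression default: 33
floor(p / 2)     # regression alternative: 50
floor(p / 5)     # regression alternative: 20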
Value
An integer vector of candidate values for
mtries.
Author(s)
E. F. Haghish
Examples
## Not run:
# For a classification task with 100 predictors:
suggest_mtries(p = 100, family = "classification")
# For a regression task with 100 predictors:
suggest_mtries(p = 100, family = "regression")
## End(Not run)