1 Introduction

Model selection is the process of identifying the most relevant features from a set of candidate variables. This step is critical for building models that are accurate, interpretable, and computationally efficient while avoiding overfitting. Stepwise regression algorithms automate this process by iteratively adding or removing features based on predefined criteria, such as statistical significance (e.g., p-values), information criteria (e.g., AIC or BIC), or other performance metrics. The procedure continues until no further improvements can be made according to the chosen criterion, resulting in a final model that includes the selected features and their corresponding coefficients.

However, it is important to note that stepwise regression should never be used for statistical inference unless the variable selection process is explicitly accounted for. Without proper adjustments, the selection process invalidates statistical inference, such as p-values and confidence intervals, due to issues like multiple testing and data dredging. This limitation does not apply when stepwise regression is used for prediction, as the primary goal in predictive modeling is to maximize accuracy rather than draw causal conclusions.

StepReg simplifies model selection tasks by providing a unified programming interface. It currently supports model building for five distinct response variable types (section 3.1), four model selection strategies (section 3.2) including the best subsets algorithm, and a variety of selection metrics (section 3.3). StepReg also supports advanced features including strata variables for Cox regression and continuous-nested-within-class effects for complex modeling scenarios (section 3.4). Moreover, StepReg detects and addresses multicollinearity issues if they exist (section 3.6). The output of StepReg includes multiple tables summarizing the final model and the variable selection procedure. Additionally, StepReg offers a plot function to visualize the selection steps and supports various output formats (section 4). For demonstration, the vignettes include four use cases covering distinct regression scenarios (section 5). Non-programmers can access the tool through an interactive Shiny application (section 6).

By combining flexibility, robustness, and ease of use, StepReg is a powerful tool for predictive modeling tasks, particularly when the goal is to identify an optimal set of features for accurate predictions. However, users should exercise caution and avoid using StepReg for statistical inference unless the variable selection process is properly accounted for.

2 Quick demo

The following example selects an optimal linear regression model with the mtcars dataset.

library(StepReg)
data(mtcars)
formula <- mpg ~ .
res <- stepwise(formula = formula,
                data = mtcars,
                type = "linear",
                include = c("qsec"),
                strategy = "bidirection",
                metric = c("AIC"))

Breakdown of the parameters:

  • formula: specifies the dependent and independent variables
  • type: specifies the regression category, depending on your data, choose from “linear”, “logit”, “cox”, etc.
  • include: specifies the variables that must be in the final model
  • strategy: specifies the stepwise strategy, choose from “forward”, “backward”, “bidirection”, “subset”
  • metric: specifies the model fit evaluation metric, choose one or more from “AIC”, “AICc”, “BIC”, “SL”, etc.

The output contains the final model, which can be viewed using:

res
$bidirection
$bidirection$AIC

Call:
lm(formula = mpg ~ 1 + qsec + wt + am, data = data, weights = NULL)

Coefficients:
(Intercept)         qsec           wt           am  
      9.618        1.226       -3.917        2.936  

You can further explore the results with S3 generic functions such as summary(), coef(), and others. For example:

summary(res$bidirection$AIC)

Call:
lm(formula = mpg ~ 1 + qsec + wt + am, data = data, weights = NULL)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4811 -1.5555 -0.7257  1.4110  4.6610 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6178     6.9596   1.382 0.177915    
qsec          1.2259     0.2887   4.247 0.000216 ***
wt           -3.9165     0.7112  -5.507 6.95e-06 ***
am            2.9358     1.4109   2.081 0.046716 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.459 on 28 degrees of freedom
Multiple R-squared:  0.8497,    Adjusted R-squared:  0.8336 
F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

You can also visualize the variable selection procedures with:

plot(res, strategy = "bidirection", process = "overview")

plot(res, strategy = "bidirection", process = "detail")

The (+)1 refers to the initial model with the intercept added; (+) indicates a variable being added to the model, while (-) means a variable being removed from the model.

Additionally, you can generate reports of various formats with:

report(res, report_name = "path_to/demo_res", format = "html")

Replace "path_to/demo_res" with desired output file name, the suffix ".html" will be added automatically. For detailed examples and more usage, refer to section 4 and 5.

3 Key features

3.1 Regression categories

StepReg supports multiple types of regression, including linear, logit, cox, poisson, and gamma regression. These methods differ primarily in the type of response variable they handle, as summarized in the table below. Additional regression techniques can be incorporated upon user request.

Table 1: Common regression categories
Regression Response
linear continuous
logit binary
cox time-to-event
poisson count
gamma continuous and positively skewed

3.2 Model selection strategies

Model selection aims to identify the subset of independent variables that provide the best predictive performance for the response variable. Both stepwise regression and best subsets approaches are implemented in StepReg. For stepwise regression, there are mainly three methods: Forward Selection, Backward Elimination, and Bidirectional Elimination.

Table 2: Model selection strategy
Strategy Description
Forward Selection In forward selection, the algorithm starts with an empty model (no predictors) and adds variables one at a time. Each step tests the addition of every candidate predictor by calculating a pre-selected metric and adds the variable (if any) whose inclusion leads to the most statistically significant fit improvement. This process repeats until adding predictors no longer leads to a statistically better fit.
Backward Elimination In backward elimination, the algorithm starts with the full model (all predictors) and removes variables one at a time. Each step tests the deletion of every remaining predictor by calculating a pre-selected metric and removes the variable (if any) whose deletion leads to the most statistically significant fit improvement. This process repeats until removing predictors no longer leads to a statistically better fit.
Bidirectional Elimination Bidirectional elimination is essentially forward selection combined with backward elimination at each iteration. Each iteration begins with a forward selection step that adds predictors, followed by a round of backward elimination that removes predictors. This process repeats until no more predictors are added or removed.
Best Subsets Stepwise algorithms add or remove one predictor at a time and output a single model without evaluating all candidates, which keeps them relatively fast but yields only one model. In contrast, the Best Subsets algorithm evaluates all possible models and outputs the best-fitting models with one predictor, two predictors, and so on, for users to choose from.

Given the computational constraints, when a dataset contains more predictor variables than observations, Bidirectional Elimination typically emerges as the most advisable approach, with Forward Selection and Backward Elimination as alternatives to consider in that order. On the contrary, the Best Subsets approach requires the most processing time, yet it evaluates a comprehensive set of models with varying numbers of variables. In practice, users can experiment with several strategies and select a final model based on the specific dataset and research objectives at hand.
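For instance, several strategies can be requested in a single call and the resulting models compared afterwards. A minimal sketch, assuming the mtcars data from the quick demo (res_strategies is just a placeholder name):

res_strategies <- stepwise(formula = mpg ~ .,
                           data = mtcars,
                           type = "linear",
                           strategy = c("forward", "backward", "bidirection", "subset"),
                           metric = "AIC")
res_strategies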

3.3 Selection metrics

Various selection metrics can be used to guide the process of adding or removing predictors from the model. These metrics help to determine the importance or significance of predictors in improving the model fit. In StepReg, selection metrics fall into two categories: Information Criteria and the Significance Level of the coefficient associated with each predictor. Information Criteria evaluate a model’s performance by balancing model fit with complexity, penalizing models with a higher number of parameters. Lower Information Criteria values indicate a better trade-off between model fit and complexity. Note that when evaluating different models, it is important to compare them within the same Information Criteria framework rather than across multiple Information Criteria. For example, if you decide to use AIC, you should compare all models using AIC. This ensures consistency and fairness in model comparison, as each Information Criterion has its own scale and penalization factors. In practice, multiple metrics have been proposed; the ones supported by StepReg are summarized below.

Importantly, given the discrepancies in the precise definitions of each metric, StepReg mirrors the formulas adopted by SAS for univariate multiple regression (UMR), except for HQ, IC(1), and IC(3/2). A subset of the UMR formulas can be extended to multivariate multiple regression (MMR), as indicated in the following table.

Table 3: Statistics in selection metric
Statistic Meanings
\({n}\) Sample Size
\({p}\) Number of parameters including the intercept
\({q}\) Number of dependent variables
\(\sigma^2\) Estimate of pure error variance from fitting the full model
\({SST}\) Total sum of squares corrected for the mean for the dependent variable, which is a numeric value for UMR and a matrix for multivariate regression
\({SSE}\) Error sum of squares, which is a numeric value for UMR and a matrix for multivariate regression
\(\text{LL}\) The natural logarithm of likelihood
\({| |}\) The determinant function
\(\ln()\) The natural logarithm
Table 4: Abbreviation, Definition, and Formula of the Selection Metric for Linear, Logit, Cox, Poisson, and Gamma regression

For each metric, the first formula applies to linear regression and the second to logit, cox, poisson, and gamma regression; entries marked "not available for MMR" apply only to univariate (single-response) linear regression.

AIC (Akaike’s Information Criterion)
  Linear: \(n\ln\left(\frac{|\text{SSE}|}{n}\right) + 2pq + n + q(q+1)\) (Clifford M. Hurvich 1989; Al-Subaihi 2002)\(^1\)
  Logit, cox, poisson, gamma: \(-2\text{LL} + 2p\) (Darlington 1968; George G. Judge 1985)

AICc (Corrected Akaike’s Information Criterion)
  Linear: \(n\ln\left(\frac{|\text{SSE}|}{n}\right) + \frac{nq(n+p)}{n-p-q-1}\) (Clifford M. Hurvich 1989; Edward J. Bedrick 1994)\(^2\)
  Logit, cox, poisson, gamma: \(-2\text{LL} + \frac{n(n+p)}{n-p-2}\) (Clifford M. Hurvich 1989)

BIC (Sawa Bayesian Information Criterion)
  Linear (not available for MMR): \(n\ln\left(\frac{SSE}{n}\right) + 2(p+2)o - 2o^2, o = \frac{n\sigma^2}{SSE}\) (Sawa 1978; George G. Judge 1985)
  Logit, cox, poisson, gamma: not available

Cp (Mallows’ Cp statistic)
  Linear (not available for MMR): \(\frac{SSE}{\sigma^2} + 2p - n\) (Mallows 1973; Hocking 1976)
  Logit, cox, poisson, gamma: not available

HQ (Hannan and Quinn Information Criterion)
  Linear: \(n\ln\left(\frac{|\text{SSE}|}{n}\right) + 2pq\ln(\ln(n))\) (E. J. Hannan 1979; Allan D R McQuarrie 1998; Clifford M. Hurvich 1989)
  Logit, cox, poisson, gamma: \(-2\text{LL} + 2p\ln(\ln(n))\) (E. J. Hannan 1979)

IC(1) (Information Criterion with Penalty Coefficient Set to 1)
  Linear (not available for MMR): \(n\ln\left(\frac{|\text{SSE}|}{n}\right) + p\) (J. A. Nelder 1972; A. F. M. Smith 1980)
  Logit, cox, poisson, gamma: \(-2\text{LL} + p\) (J. A. Nelder 1972; A. F. M. Smith 1980)

IC(3/2) (Information Criterion with Penalty Coefficient Set to 3/2)
  Linear (not available for MMR): \(n\ln\left(\frac{|\text{SSE}|}{n}\right) + \frac{3}{2}p\) (A. F. M. Smith 1980)
  Logit, cox, poisson, gamma: \(-2\text{LL} + \frac{3}{2}p\) (A. F. M. Smith 1980)

SBC (Schwarz Bayesian Information Criterion)
  Linear (not available for MMR): \(n\ln\left(\frac{|\text{SSE}|}{n}\right) + pq\ln(n)\) (Clifford M. Hurvich 1989; Schwarz 1978; George G. Judge 1985; Al-Subaihi 2002)
  Logit, cox, poisson, gamma: \(-2\text{LL} + p\ln(n)\) (Schwarz 1978; George G. Judge 1985)

SL (Significance Level, p-value)
  Linear: \(\textit{F test}\) for UMR and \(\textit{Approximate F test}\) for MMR
  Logit, cox, poisson, gamma: forward, LRT and Rao Chi-square test (logit, poisson, gamma), LRT (cox); backward, Wald test

adjRsq (Adjusted R-square statistic)
  Linear (not available for MMR): \(1 - \frac{(n-1)(1-R^2)}{n-p}\), where \(R^2 = 1 - \frac{SSE}{SST}\) (Darlington 1968; George G. Judge 1985)
  Logit, cox, poisson, gamma: not available
1 Unsupported AIC formula (which does not affect the selection process as it only differs by constant additive and multiplicative factors):

\(AIC=n\ln\left(\frac{SSE}{n}\right) + 2p\) (Darlington 1968; George G. Judge 1985)
2 Unsupported AICc formula (which does not affect the selection process as it only differs by constant additive and multiplicative factors):

\(AICc=\ln\left(\frac{SSE}{n}\right) + 1 + \frac{2(p+1)}{n-p-2}\) (Allan D R McQuarrie 1998)

No metric is necessarily optimal for all datasets; the choice depends on your data and research goals. We recommend evaluating multiple metrics simultaneously, which allows the selection of the best model based on your specific needs. General guidance is summarized below, followed by a short example of supplying several metrics at once.

  • AIC: AIC works by penalizing the inclusion of additional variables in a model. The lower the AIC, the better the performance of the model. AIC does not include sample size in the penalty calculation, and it is optimal in minimizing the mean square error of predictions (Mark J. Brewer 2016).

  • AICc: AICc is a variant of AIC that works better for small sample sizes, especially when numObs / numParam < 40 (Kenneth P. Burnham 2002).

  • Cp: Cp is used for linear models. It is equivalent to AIC when dealing with Gaussian linear model selection.

  • IC(1) and IC(3/2): IC(1) and IC(3/2) use 1 and 3/2 as penalty factors, respectively, compared to the 2 used by AIC. As such, IC(1) tends to return a more complex model with more variables, which may suffer from overfitting issues.

  • BIC and SBC: Both BIC and SBC are variants of the Bayesian Information Criterion. The main distinction between BIC/SBC and AIC lies in the magnitude of the penalty imposed: BIC/SBC are more parsimonious when penalizing model complexity, which typically results in a simpler model (SAS Institute Inc 2018; Sawa 1978; Clifford M. Hurvich 1989; Schwarz 1978; George G. Judge 1985; Al-Subaihi 2002).

The precise definitions of these criteria can vary across the literature and in the SAS environment. Here, BIC aligns with the definition of the Sawa Bayesian Information Criterion as outlined in the SAS documentation, while SBC corresponds to the Schwarz Bayesian Information Criterion. According to Richard’s post, whereas AIC often favors selecting overly complex models, BIC/SBC prioritize smaller models. Consequently, when dealing with a limited sample size, AIC may seem preferable, whereas BIC/SBC tend to perform better with larger sample sizes.

  • HQ: HQ is an alternative to AIC, differing primarily in the method of penalty calculation. However, HQ has remained relatively underutilized in practice (Kenneth P. Burnham 2002).

  • adjRsq: The adjusted R-squared (adj-R²) seeks to overcome the limitation of R-squared in model selection by considering the number of predictors. It serves a similar purpose to information criteria, as both methods compare models by weighing their goodness of fit against the number of parameters. However, information criteria are typically regarded as superior in this context (Stevens 2016).

  • SL: SL stands for Significance Level (P-value), embodying a distinct approach to model selection in contrast to information criteria. The SL method operates by calculating a P-value through specific hypothesis testing. Should this P-value fall below a predefined threshold, such as 0.05, one should favor the alternative hypothesis, indicating that the full model significantly outperforms the reduced model. The effectiveness of this method hinges upon the selection of the P-value threshold, wherein smaller thresholds tend to yield simpler models.
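As a minimal sketch of the recommendation above, several metrics can be supplied in one call, again assuming the mtcars data from the quick demo (res_metrics is a placeholder name); each strategy-metric combination yields its own selected model in the output:

res_metrics <- stepwise(formula = mpg ~ .,
                        data = mtcars,
                        type = "linear",
                        strategy = "forward",
                        metric = c("AIC", "AICc", "BIC", "SBC", "adjRsq"))
res_metrics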

3.4 Advanced Features

StepReg supports several advanced features that enhance its flexibility for complex modeling scenarios.

3.4.1 Strata Variables in Cox Regression

For Cox proportional hazards regression, StepReg supports the use of the strata() function to include stratification variables. This is particularly useful when you want to control for confounding variables that may violate the proportional hazards assumption.

The strata() function allows you to fit separate baseline hazard functions for different groups while sharing the same regression coefficients across strata. This is equivalent to fitting separate Cox models for each stratum but with the advantage of more efficient parameter estimation.

Here is an example of how to use the strata() function in a formula for StepReg:

formula  =  Surv(time, status) ~ . + strata(inst)

In this example, strata(inst) creates separate baseline hazard functions for each institution (inst), while the effects of other selected variables are assumed to be the same across all institutions.
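A minimal sketch of a complete call, assuming the lung dataset from the survival package (res_strata is a placeholder name; see section 5.3 for a full worked example):

library(survival)
data(lung)
res_strata <- stepwise(formula = Surv(time, status) ~ . + strata(inst),
                       data = na.omit(lung),
                       type = "cox",
                       strategy = "forward",
                       metric = "AIC")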

3.4.2 Continuous-Nested-Within-Class Effects

StepReg supports continuous-nested-within-class effects, which are useful when you want to model how a continuous variable’s effect varies across different levels of a categorical variable. This is implemented using the : operator in the formula.

Key points about continuous-nested-within-class effects:

  • The class variable (categorical) must be a factor
  • The continuous variable can be nested within the class variable using : notation
  • The order of variables in the nested term doesn’t matter (e.g., X:A is equivalent to A:X)

Here is an example of how to use the : operator in a formula for StepReg:

mtcars$am <- as.factor(mtcars$am)
formula <- mpg ~ am + cyl:am + wt:am + disp:am + hp:am + qsec:am + vs:am + gear:am + carb:am
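The formula can then be passed to stepwise() as usual. A minimal sketch (res_nested is a placeholder name; section 5.1.1 contains a full worked example):

res_nested <- stepwise(formula = formula,
                       data = mtcars,
                       type = "linear",
                       strategy = "forward",
                       metric = "AIC")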

3.4.3 Multivariate multiple regression for linear regression

StepReg supports multivariate multiple regression, a type of regression analysis that allows multiple response variables to be modeled simultaneously. This is implemented using the cbind() function on the left-hand side of the formula.

Here is an example of how to use cbind() in a formula for StepReg:

formula <- cbind(mpg, drat) ~ .
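A minimal sketch of a complete call, assuming the mtcars data (res_mmr is a placeholder name; see section 5.1.2 for a full worked example):

res_mmr <- stepwise(formula = cbind(mpg, drat) ~ .,
                    data = mtcars,
                    type = "linear",
                    strategy = "bidirection",
                    metric = "HQ")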

3.5 Formula Syntax Summary

The following table summarizes the formula syntax supported by StepReg:

Table 5: Formula syntax supported by StepReg
Syntax Description Example
y ~ x1 + x2 Multiple predictors mpg ~ cyl + wt
y ~ . All variables in dataset mpg ~ .
y ~ . - x1 All variables except x1 mpg ~ . - disp
y ~ x1 * x2 Main effects and interaction mpg ~ cyl * am
y ~ x1:x2 Continuous-nested-within-class effects mpg ~ cyl:am
cbind(y1, y2) ~ . Multiple response variables cbind(mpg, drat) ~ .
y ~ . + 0 or y ~ . - 1 No intercept mpg ~ . + 0
Surv(time, status) ~ . + strata(strata_var) Cox regression with strata Surv(time, status) ~ age + sex + strata(inst)

3.6 Multicollinearity

Multicollinearity arises when independent variables in a regression model are correlated with each other. Ideally, predictors should be independent to allow a clear understanding of how each variable relates to the outcome. When the correlations between predictors are strong, it can create significant issues for model fitting and interpretation. In such cases, changes in one input variable can be associated with changes in others, which makes it difficult to assess each variable’s individual contribution to the dependent variable.

Severe multicollinearity reduces the precision of estimated regression coefficients, thereby complicating model selection and interpretation. It can lead to unreliable estimates and inflated standard errors. However, it is important to note that multicollinearity primarily affects the interpretation of coefficients and their statistical significance. If the main objective is to predict outcomes accurately, and interpretation of individual predictors is not required, addressing multicollinearity may not be necessary because it does not impact prediction accuracy or the goodness-of-fit of the model.

In StepReg, a QR matrix decomposition is performed ahead of time to detect and remove input variables causing multicollinearity.
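To illustrate the idea behind rank-based detection, the following base-R sketch (not StepReg's internal code) shows how a QR decomposition exposes a column that is an exact linear combination of another:

# Illustrative only: QR decomposition reveals a linearly dependent column
x <- model.matrix(mpg ~ ., data = mtcars)   # design matrix from mtcars
x <- cbind(x, dup = 2 * x[, "wt"])          # 'dup' duplicates the information in 'wt'
qrx <- qr(x)
qrx$rank < ncol(x)                          # TRUE: rank deficiency signals multicollinearity
colnames(x)[qrx$pivot[-seq_len(qrx$rank)]]  # the redundant column(s), here "dup"

StepReg performs this kind of check automatically before model selection begins.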

4 StepReg output

The stepwise() function returns a StepReg class object, which is a structured list containing both the input specifications and the outcomes of the stepwise regression analysis. The key components of this object are detailed below, providing a comprehensive framework for model exploration and validation.

  • argument: A data.frame containing the user-specified settings and parameters used in the analysis, including the initial formula, regression type, selection strategy, chosen metrics, significance levels (sle/sls), tolerance threshold, test method, and other control parameters.

  • variable: A data.frame containing information about all variables in the model, including variable names, data types (numeric, factor, etc.), and their roles (Dependent/Independent) in the model.

  • performance: A data.frame providing detailed performance metrics for the selected models across different strategies and metrics. For both training and test data sets when test_ratio < 1, the output includes model-specific performance indicators:

    • For linear, poisson, gamma, and negative binomial regression
      • "r2_train/r2_test": R-squared measures the proportion of variance explained by the model. Values range from 0 to 1, with higher values indicating better fit. A well-performing model should exhibit high R-squared on both training and test data, with minimal difference between them. A large discrepancy suggests overfitting.
      • "mse_train/mse_test": Mean Squared Error (MSE) calculates the average squared difference between predicted and actual values. Lower values signify better model performance. The test MSE should be close to the training MSE; a significantly higher test MSE indicates potential overfitting.
      • "mae_train/mae_test": Mean Absolute Error (MAE) measures the average absolute difference between predicted and actual values. Lower values reflect better performance. Similar to MSE, the test MAE should be close to the training MAE to avoid overfitting.
    • For logistic regression
      • "accuracy_train/accuracy_test": Accuracy represents the proportion of correct predictions (true positives + true negatives) divided by total predictions. Values range from 0 to 1, with higher values indicating better classification. Test accuracy should be close to training accuracy; a large gap suggests overfitting.
      • "auc_train/auc_test": Area Under the Curve (AUC) assesses the model’s ability to distinguish between classes. Values range from 0.5 (random) to 1.0 (perfect discrimination). An AUC > 0.7 is acceptable, > 0.8 is good, and > 0.9 is excellent. Test AUC should be close to training AUC to avoid overfitting.
      • "log_loss_train/log_loss_test": Log Loss (logarithmic loss) penalizes confident incorrect predictions more heavily. Lower values indicate better performance, with values near 0 being ideal. Test log loss should be close to training log loss; a higher test log loss suggests overfitting.
    • For Cox regression
      • "c-index_train/c-index_test": Concordance Index (C-index) evaluates the model’s ability to correctly rank survival times. Values range from 0.5 (random) to 1.0 (perfect ranking). A C-index > 0.7 is acceptable, > 0.8 is good, and > 0.9 is excellent. Test C-index should be close to training C-index to avoid overfitting.
      • "auc_hc": Harrell’s C-index for time-dependent AUC, measuring discrimination at specific time points. Higher values indicate better performance.
  • overview: A nested list organized by strategy and metric, containing step-by-step summaries of the model-building process. Each element shows which variables were entered or removed at each step along with the corresponding metric values (e.g., AIC, BIC, SBC).

  • detail: A nested list organized by strategy and metric, providing granular information about each candidate step. This includes which variables were tested, their evaluation statistics, p-values, and whether they were ultimately selected or rejected.

  • fitted model objects within the strategy-specific lists: A nested list whose first layer is the selection strategy (e.g., forward, backward, bidirection, subset) and whose second layer is the metric (e.g., AIC, BIC, SBC). For each strategy-metric combination, the function returns a fitted model object that can be further analyzed with S3 generic functions such as summary, anova, plot, or coefficients; these functions adapt to the model type (e.g., coxph, lm, glm) through call-specific methods. Specific statistics can be retrieved directly with the $ operator, such as result$forward$AIC$coefficients. The level of detail depends on the model type: the package enriches coxph objects with detailed statistics, including hazard ratios, standard errors, z-statistics, p-values, and likelihood ratio tests, whereas base R objects such as lm and glm print only coefficients by default and require summary or anova to reveal standard errors, t-values, p-values, and R-squared values.
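A minimal sketch of retrieving these components, assuming res is the object from the quick demo in section 2 and that each component listed above is accessible with the $ operator:

res$argument                # user-specified settings and parameters
res$variable                # variable names, classes, and roles
res$performance             # performance metrics of the selected models
res$overview                # step-by-step summaries by strategy and metric
res$detail                  # candidate-level details by strategy and metric
res$bidirection$AIC         # fitted model for the bidirection/AIC combination
coef(res$bidirection$AIC)   # coefficients via the usual S3 generics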

Additionally, a report summarizing the results above can be generated in one or more formats, for example:

report(res, report_name = "results", format = c("html", "docx"))

5 Use cases

Below, we present various examples illustrating the application of different models tailored to specific datasets. Please note that stepwise regression should never be used for statistical inference unless the variable selection process is properly accounted for, as it can invalidate the results. However, this issue does not arise when stepwise regression is used for prediction. It is essential to select the regression model that best suits the type of response variable. For detailed guidance, refer to section 3.1.

5.1 Linear regression with the mtcars dataset

In this section, we’ll demonstrate how to perform linear regression analysis using the mtcars dataset, showcasing different scenarios with varying numbers of predictors and dependent variables. We set type = "linear" to direct the function to perform linear regression.

Description of the mtcars dataset

mtcars is a classic dataset in statistics and is included in the base R installation. It was sourced from the 1974 Motor Trend US magazine and comprises 32 observations on 11 variables. Here’s a brief description of the variables included:

  1. mpg: miles per gallon (fuel efficiency)
  2. cyl: number of cylinders
  3. disp: displacement (engine size) in cubic inches
  4. hp: gross horsepower
  5. drat: rear axle ratio
  6. wt: weight (in thousands of pounds)
  7. qsec: 1/4 mile time (in seconds)
  8. vs: engine type (0 = V-shaped, 1 = straight)
  9. am: transmission type (0 = automatic, 1 = manual)
  10. gear: number of forward gears
  11. carb: number of carburetors

Why choose linear regression

Linear regression is an ideal choice for analyzing the mtcars dataset because it includes continuous variables such as “mpg”, “hp”, or “wt”, which can serve as response variables. Furthermore, the dataset exhibits potential linear relationships between the response variable and other variables.

5.1.1 Example1: single dependent variable (“mpg”) with continuous-nested-within-class effects

In this example, we employ the “forward” strategy with “AIC” as the selection criterion. Additionally, we use the include argument to ensure that “cyl:am” and “am” are always included in the model.

data(mtcars)
## make sure the categorical variable is a factor variable
mtcars$am <- as.factor(mtcars$am)
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
formula <- mpg ~ am + cyl:am + disp:am + am:hp + drat:am + wt:am + qsec:am + vs:am + gear:am + carb:am
res1 <- stepwise(formula = formula,
                 data = mtcars,
                 type = "linear",
                 include = c("cyl:am", "am"),
                 strategy = "forward",
                 metric = "AIC",
                 test_ratio = 0.2)
res1
$forward
$forward$AIC

Call:
lm(formula = mpg ~ 1 + cyl:am + am + am:wt + am:qsec + am:carb + 
    am:gear + am:vs, data = data, weights = NULL)

Coefficients:
(Intercept)          am1      cyl:am0      cyl:am1       am0:wt       am1:wt  
    18.5828     -98.4083      -1.3517       2.4525       3.2872      -9.2107  
   am0:qsec     am1:qsec     am0:carb     am1:carb     am0:gear     am1:gear  
    -0.2891       5.0370      -2.0607       0.4417       2.2797       6.1378  
     am0:vs       am1:vs  
     1.2148      -3.3619  

To get the summary of the model:

summary(res1$forward$AIC)

Call:
lm(formula = mpg ~ 1 + cyl:am + am + am:wt + am:qsec + am:carb + 
    am:gear + am:vs, data = data, weights = NULL)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.69185 -0.93906 -0.04013  0.73942  2.97790 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  18.5828    13.9674   1.330  0.20810   
am1         -98.4083    34.8924  -2.820  0.01545 * 
cyl:am0      -1.3517     1.0704  -1.263  0.23065   
cyl:am1       2.4525     1.2507   1.961  0.07352 . 
am0:wt        3.2872     2.6429   1.244  0.23733   
am1:wt       -9.2107     2.4546  -3.752  0.00276 **
am0:qsec     -0.2891     0.6024  -0.480  0.63988   
am1:qsec      5.0370     1.2513   4.025  0.00168 **
am0:carb     -2.0607     0.7748  -2.660  0.02080 * 
am1:carb      0.4417     0.5052   0.874  0.39911   
am0:gear      2.2797     2.2775   1.001  0.33661   
am1:gear      6.1378     2.2268   2.756  0.01740 * 
am0:vs        1.2148     2.4968   0.487  0.63536   
am1:vs       -3.3619     2.3399  -1.437  0.17634   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.674 on 12 degrees of freedom
Multiple R-squared:  0.9617,    Adjusted R-squared:  0.9203 
F-statistic: 23.21 on 13 and 12 DF,  p-value: 1.722e-06

The performance of the model:

performance(res1)
                                                                model
1 mpg ~ 1 + cyl:am + am + am:wt + am:qsec + am:carb + am:gear + am:vs
  strategy:metric r2_train r2_test mse_train mse_test mae_train mae_test
1     forward:AIC   0.9617  0.7862    1.2939  11.0821    0.8941   1.7781

To visualize the selection process:

plot_list <- list()
plot_list[["forward"]][["detail"]] <- plot(res1, process = "detail")
plot_list[["forward"]][["overview"]] <- plot(res1, process = "overview")
cowplot::plot_grid(plotlist = plot_list$forward, ncol = 1)

To exclude the intercept from the model, adjust the formula as follows:

formula <- mpg ~ . + 0
formula <- mpg ~ . - 1

To limit the model to a specific subset of predictors, adjust the formula as follows, which will only consider “cyl”, “disp”, “hp”, “wt”, “vs”, and “am” as the predictors.

formula <- mpg ~ cyl + disp + hp + wt + vs + am + 0

Another way is to use the minus symbol ("-") to exclude specific predictors from variable selection. For example, to include all variables except “disp”, “wt”, and the intercept:

formula <- mpg ~ . - 1 - disp - wt

You can simultaneously provide multiple selection strategies and metrics. For example, the following code snippet employs both “forward” and “backward” strategies using metrics “AIC”, “BIC”, and “SL”. It’s worth mentioning that when “SL” is specified, you may also want to set the significance level for entry (“sle”) and stay (“sls”), both of which default to 0.15.

formula <- mpg ~ .
res2 <- stepwise(formula = formula,
                 data = mtcars,
                 type = "linear",
                 strategy = c("forward", "backward"),
                 metric = c("AIC", "BIC", "SL"),
                 sle = 0.05,
                 sls = 0.05,
                 test_ratio = 0.3)
res2
$forward
$forward$AIC

Call:
lm(formula = mpg ~ 1 + wt + qsec, data = data, weights = NULL)

Coefficients:
(Intercept)           wt         qsec  
    25.8451      -5.9952       0.7336  


$forward$BIC

Call:
lm(formula = mpg ~ 1 + wt + qsec, data = data, weights = NULL)

Coefficients:
(Intercept)           wt         qsec  
    25.8451      -5.9952       0.7336  


$forward$SL

Call:
lm(formula = mpg ~ 1 + wt + qsec, data = data, weights = NULL)

Coefficients:
(Intercept)           wt         qsec  
    25.8451      -5.9952       0.7336  



$backward
$backward$AIC

Call:
lm(formula = mpg ~ 1 + disp + hp + wt + qsec + gear, data = data, 
    weights = NULL)

Coefficients:
(Intercept)         disp           hp           wt         qsec         gear  
   13.54553      0.02732     -0.02777     -6.19757      0.87649      2.29452  


$backward$BIC

Call:
lm(formula = mpg ~ 1 + drat + gear + carb, data = data, weights = NULL)

Coefficients:
(Intercept)         drat         gear         carb  
      1.143        3.097        3.915       -2.272  


$backward$SL

Call:
lm(formula = mpg ~ 1 + wt + qsec, data = data, weights = NULL)

Coefficients:
(Intercept)           wt         qsec  
    25.8451      -5.9952       0.7336  
plot_list <- setNames(
  lapply(c("forward", "backward"),function(i){
    setNames(
      lapply(c("detail","overview"),function(j){
        plot(res2,strategy=i,process=j)
    }),
    c("detail","overview")
    )
  }),
  c("forward", "backward")
)

cowplot::plot_grid(plotlist = plot_list$forward, ncol = 1, rel_heights = c(2, 1))

cowplot::plot_grid(plotlist = plot_list$backward, ncol = 1, rel_heights = c(2, 1))

To get the summary of the model:

summary(res2$forward$SL)

Call:
lm(formula = mpg ~ 1 + wt + qsec, data = data, weights = NULL)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2465 -1.2684 -0.4813  0.9276  4.4572 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  25.8451     5.6843   4.547 0.000221 ***
wt           -5.9952     0.6356  -9.433 1.34e-08 ***
qsec          0.7336     0.2704   2.713 0.013792 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.217 on 19 degrees of freedom
Multiple R-squared:  0.8611,    Adjusted R-squared:  0.8464 
F-statistic: 58.88 on 2 and 19 DF,  p-value: 7.189e-09

The performance of the model:

performance(res2)
                                   model
1                    mpg ~ 1 + wt + qsec
2 mpg ~ 1 + disp + hp + wt + qsec + gear
3           mpg ~ 1 + drat + gear + carb
                                    strategy:metric r2_train r2_test mse_train
1 forward:AIC; forward:BIC; forward:SL; backward:SL   0.8611  0.8813    4.2444
2                                      backward:AIC   0.9018  0.9020    3.0014
3                                      backward:BIC   0.8237  0.8721    5.3868
  mse_test mae_train mae_test
1   5.3084    1.6617   1.7531
2   4.9243    1.2997   1.4593
3   6.4444    1.8027   2.0809

5.1.2 Example2: multivariate regression (“mpg” and “drat”)

In this scenario, there are two dependent variables, “mpg” and “drat”. The model selection aims to identify the most influential predictors that affect both variables.

formula <- cbind(mpg, drat) ~ . + 0
res3 <- stepwise(formula = formula,
                 data = mtcars,
                 type = "linear",
                 strategy = "bidirection",
                 metric = c("AIC", "HQ"),
                 test_ratio=0.2,
                 feature_ratio = 0.9)
res3
$bidirection
$bidirection$AIC

Call:
lm(formula = cbind(mpg, drat) ~ 0 + am + cyl + carb, data = data, 
    weights = NULL)

Coefficients:
      mpg       drat    
am0   32.82321   4.41965
am1   37.04570   4.78713
cyl   -1.72726  -0.17784
carb  -1.12739   0.06065


$bidirection$HQ

Call:
lm(formula = cbind(mpg, drat) ~ 0 + gear + qsec + wt + cyl, data = data, 
    weights = NULL)

Coefficients:
      mpg      drat   
gear   1.4101   0.4547
qsec   1.7009   0.1332
wt    -7.1519  -0.3680
cyl    1.1321   0.1113
plot_list <- setNames(
  lapply(c("bidirection"),function(i){
    setNames(
      lapply(c("detail","overview"),function(j){
        plot(res3,strategy=i,process=j)
    }),
    c("detail","overview")
    )
  }),
  c("bidirection")
)

cowplot::plot_grid(plotlist = plot_list$bidirection, ncol = 1, rel_heights = c(2, 1))

To get the summary of the model:

summary(res3$bidirection$AIC)
Response mpg :

Call:
lm(formula = mpg ~ 0 + am + cyl + carb, data = data, weights = NULL)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.2093 -1.2367 -0.1729  1.5702  4.8907 

Coefficients:
     Estimate Std. Error t value Pr(>|t|)    
am0   32.8232     2.2882  14.344 1.20e-12 ***
am1   37.0457     1.7428  21.256 3.72e-16 ***
cyl   -1.7273     0.3913  -4.414  0.00022 ***
carb  -1.1274     0.3766  -2.993  0.00670 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.488 on 22 degrees of freedom
Multiple R-squared:  0.989, Adjusted R-squared:  0.987 
F-statistic: 494.6 on 4 and 22 DF,  p-value: < 2.2e-16


Response drat :

Call:
lm(formula = drat ~ 0 + am + cyl + carb, data = data, weights = NULL)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46396 -0.13189 -0.05641  0.08872  0.73294 

Coefficients:
     Estimate Std. Error t value Pr(>|t|)    
am0   4.41965    0.29887  14.788 6.54e-13 ***
am1   4.78713    0.22763  21.030 4.65e-16 ***
cyl  -0.17784    0.05111  -3.479  0.00213 ** 
carb  0.06065    0.04919   1.233  0.23065    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.325 on 22 degrees of freedom
Multiple R-squared:  0.9935,    Adjusted R-squared:  0.9923 
F-statistic:   835 on 4 and 22 DF,  p-value: < 2.2e-16

The performance of the model:

performance(res3)
                                          model strategy:metric r2_train
1        cbind(mpg, drat) ~ 0 + am + cyl + carb bidirection:AIC   0.9890
2        cbind(mpg, drat) ~ 0 + am + cyl + carb bidirection:AIC   0.9935
3 cbind(mpg, drat) ~ 0 + gear + qsec + wt + cyl  bidirection:HQ   0.9892
4 cbind(mpg, drat) ~ 0 + gear + qsec + wt + cyl  bidirection:HQ   0.9922
  r2_test mse_train mse_test mae_train mae_test response
1  0.7748    5.2381  10.6729    1.7708   2.5479      mpg
2  0.6947    0.0894   0.1547    0.2274   0.3453     drat
3  0.9376    5.1437   4.2714    1.8349   1.5183      mpg
4  0.7360    0.1063   0.1411    0.2392   0.2922     drat

5.2 Logistic regression with the remission dataset

In this example, we’ll showcase logistic regression using the remission dataset. By setting type = "logit", we instruct the function to perform logistic regression.

Description of the remission dataset

The remission dataset, obtained from the online course STAT501 at Penn State University, has been integrated into StepReg. It consists of 27 observations across seven variables, including a binary variable named “remiss”:

  1. remiss: whether leukemia remission occurred, a value of 1 indicates occurrence while 0 means non-occurrence
  2. cell: cellularity of the marrow clot section
  3. smear: smear differential percentage of blasts
  4. infil: percentage of absolute marrow leukemia cell infiltrate
  5. li: percentage labeling index of the bone marrow leukemia cells
  6. blast: the absolute number of blasts in the peripheral blood
  7. temp: the highest temperature before the start of treatment

Why choose logistic regression

Logistic regression effectively captures the relationship between predictors and a categorical response variable, offering insights into the probability of being assigned to specific response categories given a set of predictors. It is suitable for analyzing binary outcomes, such as the remission status (“remiss”) in the remission dataset.

5.2.1 Example1: using “forward” strategy

In this example, we employ a “forward” strategy with “AIC” as the selection criterion, while forcing the “cell” variable to be included in the model.

data(remission)
str(remission)
'data.frame':   27 obs. of  7 variables:
 $ remiss: int  1 1 0 0 1 0 1 0 0 0 ...
 $ cell  : num  0.8 0.9 0.8 1 0.9 1 0.95 0.95 1 0.95 ...
 $ smear : num  0.83 0.36 0.88 0.87 0.75 0.65 0.97 0.87 0.45 0.36 ...
 $ infil : num  0.66 0.32 0.7 0.87 0.68 0.65 0.92 0.83 0.45 0.34 ...
 $ li    : num  1.9 1.4 0.8 0.7 1.3 0.6 1 1.9 0.8 0.5 ...
 $ blast : num  1.1 0.74 0.176 1.053 0.519 ...
 $ temp  : num  0.996 0.992 0.982 0.986 0.98 ...
formula <- remiss ~ .
res4 <- stepwise(formula = formula,
                 data = remission,
                 type = "logit",
                 include= "cell",
                 strategy = "forward",
                 metric = "AIC",
                 test_ratio = 0.2)
res4
$forward
$forward$AIC

Call:  glm(formula = remiss ~ 1 + cell + li, family = "binomial", data = data, 
    weights = NULL)

Coefficients:
(Intercept)         cell           li  
     -6.557        3.030        2.858  

Degrees of Freedom: 21 Total (i.e. Null);  19 Residual
Null Deviance:      28.84 
Residual Deviance: 21.03    AIC: 27.03
plot_list <- setNames(
  lapply(c("forward"),function(i){
    setNames(
      lapply(c("detail","overview"),function(j){
        plot(res4,strategy=i,process=j)
    }),
    c("detail","overview")
    )
  }),
  c("forward")
)
cowplot::plot_grid(plotlist = plot_list$forward, ncol = 1, rel_heights = c(2, 1))

To get the summary of the model:

summary(res4$forward$AIC)

Call:
glm(formula = remiss ~ 1 + cell + li, family = "binomial", data = data, 
    weights = NULL)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   -6.557      5.417  -1.211   0.2261  
cell           3.030      5.458   0.555   0.5788  
li             2.858      1.295   2.207   0.0273 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 28.841  on 21  degrees of freedom
Residual deviance: 21.034  on 19  degrees of freedom
AIC: 27.034

Number of Fisher Scoring iterations: 5

The performance of the model:

performance(res4)
                   model strategy:metric accuracy_train accuracy_test auc_train
1 remiss ~ 1 + cell + li     forward:AIC         0.7273             1    0.8571
  auc_test log_loss_train log_loss_test
1       NA         0.4781        0.1653

5.2.2 Example2: using “subset” strategy

In this example, we employ a “subset” strategy with “SBC” as the selection criterion while excluding the intercept. Meanwhile, we set best_n = 3 to restrict the output to the top 3 models for each number of variables.

formula <- remiss ~ . + 0
res5 <- stepwise(formula = formula,
                  data = remission,
                  type = "logit",
                  strategy = "subset",
                  metric = "SBC",
                  best_n = 3,
                  test_ratio = 0.2)
res5
$subset
$subset$SBC

Call:  glm(formula = remiss ~ 0 + li + temp, family = "binomial", data = data, 
    weights = NULL)

Coefficients:
    li    temp  
 3.018  -3.960  

Degrees of Freedom: 22 Total (i.e. Null);  20 Residual
Null Deviance:      30.5 
Residual Deviance: 21.31    AIC: 25.31
plot_list <- setNames(
  lapply(c("subset"),function(i){
    setNames(
      lapply(c("detail","overview"),function(j){
        plot(res5,strategy=i,process=j)
    }),
    c("detail","overview")
    )
  }),
  c("subset")
)
cowplot::plot_grid(plotlist = plot_list$subset, ncol = 1, rel_heights = c(2, 1))

Here, the 0 in the above plot means that there is no intercept in the model.

To get the summary of the model:

summary(res5$subset$SBC)

Call:
glm(formula = remiss ~ 0 + li + temp, family = "binomial", data = data, 
    weights = NULL)

Coefficients:
     Estimate Std. Error z value Pr(>|z|)  
li      3.018      1.307   2.309   0.0209 *
temp   -3.960      1.590  -2.490   0.0128 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 30.498  on 22  degrees of freedom
Residual deviance: 21.305  on 20  degrees of freedom
AIC: 25.305

Number of Fisher Scoring iterations: 4

The performance of the model:

performance(res5)
                   model strategy:metric accuracy_train accuracy_test auc_train
1 remiss ~ 0 + li + temp      subset:SBC         0.7727             1    0.8795
  auc_test log_loss_train log_loss_test
1       NA         0.4842         0.206

5.3 Cox regression with the lung dataset

In this example, we’ll demonstrate how to perform Cox regression analysis using the lung dataset. By setting type = "cox", we instruct the function to conduct Cox regression.

Description of the lung dataset

The lung dataset, available in the "survival" R package, includes information on survival times for 228 patients with advanced lung cancer. It comprises ten variables, among which the “status” variable codes for censoring status (1 = censored, 2 = dead), and the “time” variable denotes the patient survival time in days.

  1. inst: Institution code
  2. time: Survival time in days
  3. status: censoring status 1=censored, 2=dead
  4. age: Age in years
  5. sex: Male=1 Female=2
  6. ph.ecog: ECOG performance score as rated by the physician. 0=asymptomatic, 1= symptomatic but completely ambulatory, 2= in bed < 50% of the day, 3= in bed > 50% of the day but not bedbound, 4 = bedbound
  7. ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
  8. pat.karno: Karnofsky performance score as rated by patient
  9. meal.cal: Calories consumed at meals
  10. wt.loss: Weight loss in last six months (pounds)

Why choose Cox regression

Cox regression, also termed the Cox proportional hazards model, is specifically designed for analyzing survival data, making it well-suited for datasets like lung that include information on the time until an event (e.g., death) occurs. This method accommodates censoring and assumes proportional hazards, enhancing its applicability to medical studies involving time-to-event outcomes.

5.3.1 Example1: using “forward” strategy with strata function in the formula

In this example, we employ a “forward” strategy with c(“AICc”, “SL”) as the selection criteria. We set the significance level for entry to 0.1 (sle = 0.1) and use the strata() function in the formula to stratify the analysis by sex.

data(lung)
library(survival)
lung <- na.omit(lung)
lung$sex <- factor(lung$sex, levels = c(1, 2))
str(lung)
'data.frame':   167 obs. of  10 variables:
 $ inst     : num  3 5 12 7 11 1 7 6 12 22 ...
 $ time     : num  455 210 1022 310 361 ...
 $ status   : num  2 2 1 2 2 2 2 2 2 2 ...
 $ age      : num  68 57 74 68 71 53 61 57 57 70 ...
 $ sex      : Factor w/ 2 levels "1","2": 1 1 1 2 2 1 1 1 1 1 ...
 $ ph.ecog  : num  0 1 1 2 2 1 2 1 1 1 ...
 $ ph.karno : num  90 90 50 70 60 70 70 80 80 90 ...
 $ pat.karno: num  90 60 80 60 80 80 70 80 70 100 ...
 $ meal.cal : num  1225 1150 513 384 538 ...
 $ wt.loss  : num  15 11 0 10 1 16 34 27 60 -5 ...
 - attr(*, "na.action")= 'omit' Named int [1:61] 1 3 5 12 13 14 16 20 23 25 ...
  ..- attr(*, "names")= chr [1:61] "1" "3" "5" "12" ...
formula  =  Surv(time, status) ~ . + strata(sex)
res6 <- stepwise(formula = formula,
                 data = lung,
                 type = "cox",
                 strategy = "forward",
                 metric = c("AICc", "SL"),
                 sle = 0.1,
                 test_ratio = 0.2)
res6
$forward
$forward$AICc
Call:
coxph(formula = Surv(time, status) ~ strata(sex) + ph.ecog + 
    ph.karno + inst + wt.loss, data = data, weights = NULL, method = "efron")

              coef exp(coef)  se(coef)      z        p
ph.ecog   1.098859  3.000740  0.266401  4.125 3.71e-05
ph.karno  0.029703  1.030149  0.012549  2.367   0.0179
inst     -0.030753  0.969715  0.014954 -2.056   0.0397
wt.loss  -0.015415  0.984703  0.008586 -1.795   0.0726

Likelihood ratio test=20.37  on 4 df, p=0.0004223
n= 134, number of events= 95 

$forward$SL
Call:
coxph(formula = Surv(time, status) ~ strata(sex) + ph.ecog + 
    ph.karno + inst + wt.loss, data = data, weights = NULL, method = "efron")

              coef exp(coef)  se(coef)      z        p
ph.ecog   1.098859  3.000740  0.266401  4.125 3.71e-05
ph.karno  0.029703  1.030149  0.012549  2.367   0.0179
inst     -0.030753  0.969715  0.014954 -2.056   0.0397
wt.loss  -0.015415  0.984703  0.008586 -1.795   0.0726

Likelihood ratio test=20.37  on 4 df, p=0.0004223
n= 134, number of events= 95 
plot_list <- setNames(
  lapply(c("forward"),function(i){
    setNames(
      lapply(c("detail","overview"),function(j){
        plot(res6,strategy=i,process=j)
    }),
    c("detail","overview")
    )
  }),
  c("forward")
)
cowplot::plot_grid(plotlist = plot_list$forward, ncol = 1, rel_heights = c(2, 1))

To get the summary of the model:

summary(res6$forward$AICc)
Call:
coxph(formula = Surv(time, status) ~ strata(sex) + ph.ecog + 
    ph.karno + inst + wt.loss, data = data, weights = NULL, method = "efron")

  n= 134, number of events= 95 

              coef exp(coef)  se(coef)      z Pr(>|z|)    
ph.ecog   1.098859  3.000740  0.266401  4.125 3.71e-05 ***
ph.karno  0.029703  1.030149  0.012549  2.367   0.0179 *  
inst     -0.030753  0.969715  0.014954 -2.056   0.0397 *  
wt.loss  -0.015415  0.984703  0.008586 -1.795   0.0726 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

         exp(coef) exp(-coef) lower .95 upper .95
ph.ecog     3.0007     0.3333    1.7802    5.0581
ph.karno    1.0301     0.9707    1.0051    1.0558
inst        0.9697     1.0312    0.9417    0.9986
wt.loss     0.9847     1.0155    0.9683    1.0014

Concordance= 0.631  (se = 0.038 )
Likelihood ratio test= 20.37  on 4 df,   p=4e-04
Wald test            = 19.69  on 4 df,   p=6e-04
Score (logrank) test = 20.09  on 4 df,   p=5e-04
summary(res6$forward$SL)
Call:
coxph(formula = Surv(time, status) ~ strata(sex) + ph.ecog + 
    ph.karno + inst + wt.loss, data = data, weights = NULL, method = "efron")

  n= 134, number of events= 95 

              coef exp(coef)  se(coef)      z Pr(>|z|)    
ph.ecog   1.098859  3.000740  0.266401  4.125 3.71e-05 ***
ph.karno  0.029703  1.030149  0.012549  2.367   0.0179 *  
inst     -0.030753  0.969715  0.014954 -2.056   0.0397 *  
wt.loss  -0.015415  0.984703  0.008586 -1.795   0.0726 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

         exp(coef) exp(-coef) lower .95 upper .95
ph.ecog     3.0007     0.3333    1.7802    5.0581
ph.karno    1.0301     0.9707    1.0051    1.0558
inst        0.9697     1.0312    0.9417    0.9986
wt.loss     0.9847     1.0155    0.9683    1.0014

Concordance= 0.631  (se = 0.038 )
Likelihood ratio test= 20.37  on 4 df,   p=4e-04
Wald test            = 19.69  on 4 df,   p=6e-04
Score (logrank) test = 20.09  on 4 df,   p=5e-04

The performance of the model:

performance(res6)
                                                                   model
1 Surv(time, status) ~ strata(sex) + ph.ecog + ph.karno + inst + wt.loss
           strategy:metric c-index_train c-index_test auc_hc
1 forward:AICc; forward:SL        0.6309       0.4251 0.4799

Note that the variable sex is coded as a factor, so sex2 may appear in the results to represent the second level of the dummy-coded variable.

5.4 Poisson regression with the creditCard dataset

In this example, we’ll demonstrate how to perform Poisson regression analysis using the creditCard dataset. We set type = "poisson" to direct the function to perform Poisson regression.

Description of the creditCard dataset

The creditCard dataset contains credit history information for a sample of applicants for a specific type of credit card, included in the "AER" package. It encompasses 1319 observations across 12 variables, including “reports”, “age”, “income”, among others. The “reports” variable represents the number of major derogatory reports.

  1. card: Whether the credit card application was accepted (Yes/No)
  2. reports: Number of major derogatory reports on the applicant’s credit history
  3. age: Age in years plus twelfths of a year (e.g., 30.5 represents 30 years and 6 months)
  4. income: Annual income in USD (divided by 10,000)
  5. share: Ratio of monthly credit card expenditure to yearly income
  6. expenditure: Average monthly credit card expenditure in USD
  7. owner: Home ownership status (Yes/No)
  8. selfemp: Self-employment status (Yes/No)
  9. dependents: Number of dependents
  10. months: Number of months living at current address
  11. majorcards: Number of major credit cards held
  12. active: Number of active credit accounts

Why choose Poisson regression

Poisson regression is a frequently employed method for analyzing count data, where the response variable represents the number of occurrences of an event within a defined time or space frame. In the context of the creditCard dataset, Poisson regression can model the count of major derogatory reports (“reports”), enabling assessment of the predictors’ impact on this variable.

5.4.1 Example1: using “forward” strategy

In this example, we employ a “forward” strategy with “SL” as the selection criterion. We set the significance level for entry to 0.05 (sle = 0.05).

data(creditCard)
str(creditCard)
'data.frame':   1319 obs. of  12 variables:
 $ card       : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ reports    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ age        : num  37.7 33.2 33.7 30.5 32.2 ...
 $ income     : num  4.52 2.42 4.5 2.54 9.79 ...
 $ share      : num  0.03327 0.00522 0.00416 0.06521 0.06705 ...
 $ expenditure: num  124.98 9.85 15 137.87 546.5 ...
 $ owner      : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 2 1 ...
 $ selfemp    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ dependents : num  3 3 4 0 2 0 2 0 0 0 ...
 $ months     : num  54 34 58 25 64 54 7 77 97 65 ...
 $ majorcards : num  1 1 1 1 1 1 1 1 1 1 ...
 $ active     : num  12 13 5 7 5 1 5 3 6 18 ...
formula  = reports ~ .
res7 <- stepwise(formula = formula,
                 data = creditCard,
                 type = "poisson",
                 strategy = "forward",
                 metric = "SL",
                 sle = 0.05,
                 test_ratio = 0.2)
res7
$forward
$forward$SL

Call:  glm(formula = reports ~ 1 + card + active + months + owner + 
    expenditure + majorcards, family = "poisson", data = data, 
    weights = NULL)

Coefficients:
(Intercept)      cardyes       active       months     owneryes  expenditure  
 -0.2770025   -2.7408389    0.0625535    0.0022528   -0.3343956    0.0006102  
 majorcards  
  0.2812477  

Degrees of Freedom: 1054 Total (i.e. Null);  1048 Residual
Null Deviance:      1906 
Residual Deviance: 1029     AIC: 1562
plot_list <- setNames(
  lapply(c("forward"),function(i){
    setNames(
      lapply(c("detail","overview"),function(j){
        plot(res7,strategy=i,process=j)
      }),
      c("detail","overview")
    )
  }),
  c("forward")
)
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
cowplot::plot_grid(plotlist = plot_list$forward, ncol = 1, rel_heights = c(2, 1))

To get the summary of the model:

summary(res7$forward$SL)

Call:
glm(formula = reports ~ 1 + card + active + months + owner + 
    expenditure + majorcards, family = "poisson", data = data, 
    weights = NULL)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.2770025  0.1235492  -2.242 0.024959 *  
cardyes     -2.7408389  0.1332963 -20.562  < 2e-16 ***
active       0.0625535  0.0045735  13.677  < 2e-16 ***
months       0.0022528  0.0005815   3.874 0.000107 ***
owneryes    -0.3343956  0.1036056  -3.228 0.001248 ** 
expenditure  0.0006102  0.0002104   2.900 0.003735 ** 
majorcards   0.2812477  0.1190842   2.362 0.018189 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1906.1  on 1054  degrees of freedom
Residual deviance: 1029.3  on 1048  degrees of freedom
AIC: 1562

Number of Fisher Scoring iterations: 6

The performance of the model:

performance(res7)
                                                                    model
1 reports ~ 1 + card + active + months + owner + expenditure + majorcards
  strategy:metric mse_train mse_test mae_train mae_test
1      forward:SL    6.8877   5.8562    1.4953    2.117
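
Because the selected model is an ordinary glm object, you can also generate predictions for new data directly with predict(). The sketch below uses a hypothetical applicant; the column names and factor levels are assumed to match those used during fitting:

# Hypothetical new applicant (values made up for illustration)
new_applicant <- data.frame(
  card = factor("yes", levels = c("no", "yes")),
  active = 8,
  months = 36,
  owner = factor("no", levels = c("no", "yes")),
  expenditure = 200,
  majorcards = 1
)
# Expected number of major derogatory reports for this applicant
predict(res7$forward$SL, newdata = new_applicant, type = "response")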

6 Interactive app

We have developed an interactive Shiny application to simplify model selection tasks for non-programmers. You can access the app through the following URL:

https://junhuili1017.shinyapps.io/StepReg/

You can also access the Shiny app directly from your local machine with the following code:

library(StepReg)
StepRegShinyApp()

Here is the user interface.

7 Session info

R version 4.4.1 (2024-06-14)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] survival_3.7-0   kableExtra_1.4.0 knitr_1.50       StepReg_1.6.0   
[5] BiocStyle_2.32.1

loaded via a namespace (and not attached):
  [1] pROC_1.18.5             gridExtra_2.3           tcltk_4.4.1            
  [4] sandwich_3.1-1          rlang_1.1.6             magrittr_2.0.3         
  [7] multcomp_1.4-28         matrixStats_1.5.0       polspline_1.1.25       
 [10] compiler_4.4.1          survAUC_1.3-0           systemfonts_1.2.3      
 [13] vctrs_0.6.5             reshape2_1.4.4          quantreg_6.1           
 [16] stringr_1.5.1           pkgconfig_2.0.3         summarytools_1.1.4     
 [19] fastmap_1.2.0           backports_1.5.0         magick_2.8.6           
 [22] labeling_0.4.3          pander_0.6.6            promises_1.3.3         
 [25] rmarkdown_2.29          ragg_1.4.0              tinytex_0.57           
 [28] MatrixModels_0.5-4      purrr_1.0.4             xfun_0.52              
 [31] cachem_1.1.0            jsonlite_2.0.0          later_1.4.2            
 [34] uuid_1.2-1              pryr_0.1.6              cluster_2.1.6          
 [37] R6_2.6.1                bslib_0.9.0             stringi_1.8.7          
 [40] RColorBrewer_1.1-3      rpart_4.1.23            lubridate_1.9.4        
 [43] jquerylib_0.1.4         Rcpp_1.0.14             bookdown_0.43          
 [46] zoo_1.8-14              base64enc_0.1-3         httpuv_1.6.16          
 [49] Matrix_1.7-0            splines_4.4.1           nnet_7.3-19            
 [52] timechange_0.3.0        tidyselect_1.2.1        rstudioapi_0.17.1      
 [55] dichromat_2.0-0.1       yaml_2.3.10             codetools_0.2-20       
 [58] lattice_0.22-6          tibble_3.2.1            plyr_1.8.9             
 [61] withr_3.0.2             shiny_1.10.0            flextable_0.9.9        
 [64] askpass_1.2.1           evaluate_1.0.3          foreign_0.8-87         
 [67] zip_2.3.3               xml2_1.3.8              pillar_1.10.2          
 [70] shinycssloaders_1.1.0   BiocManager_1.30.25     checkmate_2.3.2        
 [73] DT_0.33                 shinyjs_2.1.0           generics_0.1.4         
 [76] ggplot2_3.5.2           scales_1.4.0            xtable_1.8-4           
 [79] glue_1.8.0              gdtools_0.4.2           rms_8.0-0              
 [82] Hmisc_5.2-3             tools_4.4.1             data.table_1.17.4      
 [85] SparseM_1.84-2          mvtnorm_1.3-3           cowplot_1.1.3          
 [88] rapportools_1.2         grid_4.4.1              tidyr_1.3.1            
 [91] colorspace_2.1-1        nlme_3.1-166            htmlTable_2.4.3        
 [94] Formula_1.2-5           cli_3.6.5               textshaping_1.0.1      
 [97] officer_0.6.10          fontBitstreamVera_0.1.1 viridisLite_0.4.2      
[100] svglite_2.1.3           dplyr_1.1.4             gtable_0.3.6           
[103] ggcorrplot_0.1.4.1      sass_0.4.10             digest_0.6.37          
[106] fontquiver_0.2.1        TH.data_1.1-3           ggrepel_0.9.6          
[109] htmlwidgets_1.6.4       farver_2.1.2            htmltools_0.5.8.1      
[112] lifecycle_1.0.4         mime_0.13               fontLiberation_0.1.0   
[115] openssl_2.3.3           shinythemes_1.2.0       MASS_7.3-61            