library(VariableSelection)
Variable selection methods screen candidate regression variables and identify those associated with the response. A simpler model with fewer variables is also easier to interpret.
The VariableSelection package uses Bayesian Information
Criterion (BIC), Model Posterior Probability (MPP), and Posterior
Inclusion Probability (PIP) to perform model selection for linear models
(LM) and generalized linear models (GLM). The package provides the best
model, BIC, and MPP for candidate models, and PIP for each predictor.
This vignette contains an example to illustrate how to use
VariableSelection.
The linear models used in the VariableSelection package
are of the form \[ \pmb{Y}=X\pmb{\beta}+\pmb{\epsilon},\] where \(\pmb{Y}\) is the vector of responses, \(X\) is the design matrix of predictors, \(\pmb{\beta}\) is the vector of regression coefficients, and \(\pmb{\epsilon}\) is the vector of errors.
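For concreteness, here is a minimal sketch, not part of the package, of simulating data from such a linear model:
set.seed(123)                              # for reproducibility
n <- 100
X <- cbind(1, matrix(rnorm(n * 3), n, 3))  # intercept column plus 3 predictors
beta <- c(1, 1, 2, 3)                      # true regression coefficients
Y <- X %*% beta + rnorm(n)                 # add standard normal errors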
The generalized linear models used in the
VariableSelection package assume that the observations come
from a distribution that belongs to the exponential family of
distributions. In addition, these GLMs assume that the mean of the
observations is related to the covariates through a link function \(g\) such that \[E(\pmb{Y}|X) = g^{-1}(X\pmb{\beta}),\] where \(X\) is the design matrix of predictors and \(\pmb{\beta}\) is the vector of regression coefficients.
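As an illustration, not package code: for a Bernoulli GLM with the logit link, \(g^{-1}\) is the inverse logit, which is plogis in R.
set.seed(123)
x <- rnorm(100)
eta <- 0.5 + 1.5 * x                 # linear predictor X beta
p <- plogis(eta)                     # E(Y | X) = g^{-1}(X beta)
y <- rbinom(100, size = 1, prob = p)
coef(glm(y ~ x, family = binomial))  # estimates should be near (0.5, 1.5)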
The Bayesian Information Criterion (BIC) is defined as:
\[ \text{BIC} = -2 \log(L) + k \log(n), \]
where \(L\) is the maximized likelihood of the model, \(k\) is the number of model parameters, and \(n\) is the sample size. A lower BIC indicates a better model.
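As a quick sanity check, not package code, BIC can be computed by hand for a model fitted with lm and compared with the built-in stats::BIC:
fit <- lm(dist ~ speed, data = cars)       # built-in cars data set
k <- attr(logLik(fit), "df")               # parameters: coefficients plus error variance
n <- nobs(fit)
-2 * as.numeric(logLik(fit)) + k * log(n)  # agrees with BIC(fit)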
The modelselect.lm() function can take a
data frame which contains both the response and predictors.
For example, here are the first 5 rows of a data frame, where X1 through
X6 are six candidate predictors. Y is the response variable which is
simulated from a linear model that contains the predictors X1, X2, and
X3.
data("dat")
head(dat,5)
#> X1 X2 X3 X4 X5 X6
#> 1 0.9962431 -0.02560867 -0.61734975 1.4855638 0.6727587 0.1247850
#> 2 0.8569675 -2.08603964 1.71170398 1.3454152 -0.1355638 -1.8180708
#> 3 -0.4341982 0.77668088 -0.71365443 0.7359808 -1.4004599 -2.0857847
#> 4 1.5278373 0.96399955 0.07846497 1.1466296 -0.9196107 0.2417378
#> 5 0.5615185 -0.69806814 -0.12644657 -1.7336450 1.0308951 -0.7059752
#> Y
#> 1 -0.4551099
#> 2 1.1381262
#> 3 0.9274077
#> 4 1.4541197
#> 5  1.1890484
The data frame dat above is included in the VariableSelection package.
In this example, we use modelselect.lm to select the
predictors in a linear model. The modelselect.lm function
takes a formula and dataset as input. In the example below, we consider
the model space that contains all possible linear models generated with
the predictors X1, X2, X3, and X4.
example1 <- modelselect.lm(formula = Y~X1+X2+X3+X4, data = dat)
#> The Best Model:
#> X1 X2 X3
#> BIC:
#> 2262.126
#> MPP:
#> 0.957
The output of modelselect.lm includes a table of the BICs and MPPs of competing models and a table of the PIPs of the candidate predictors.
head(example1$models)
#> X1 X2 X3 X4 BIC MPP
#> 1 1 1 1 0 2262.126 0.957
#> 2 1 1 1 1 2268.340 0.043
#> 3 0 1 1 0 2335.625 0.000
#> 4 0 1 1 1 2341.214 0.000
#> 5 1 0 1 0 2525.717 0.000
#> 6 1 0 1 1 2531.910 0.000
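These MPPs are consistent with the usual BIC approximation to posterior model probabilities under a uniform prior over models, \(\text{MPP}_i \propto \exp(-\text{BIC}_i/2)\); that the package uses exactly this prior is our assumption here, not something the documentation states. A variable's PIP is then the sum of the MPPs of the models that include it. A quick check against the table above (output omitted):
bic <- example1$models$BIC
w <- exp(-(bic - min(bic)) / 2)             # subtract min(bic) for numerical stability
round(w / sum(w), 3)                        # should match the MPP column
sum((w / sum(w))[example1$models$X4 == 1])  # PIP of X4, approximately 0.043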
example1$variables
#> Var PIP
#> Var1 X1 1.000
#> Var2 X2 1.000
#> Var3 X3 1.000
#> Var4  X4 0.043
Here are some additional examples of how to write a formula. In the next example, the formula Y~. includes all predictors in the data frame.
example2 <- modelselect.lm(formula = Y~., data = dat)
#> The Best Model:
#> X1 X2 X3
#> BIC:
#> 2262.126
#> MPP:
#> 0.869
example2$variables
#> Var PIP
#> Var1 X1 1.000
#> Var2 X2 1.000
#> Var3 X3 1.000
#> Var4 X4 0.043
#> Var5 X5 0.046
#> Var6  X6 0.049
The next example includes an interaction term between the predictors X1 and X2.
example3 <- modelselect.lm(formula = Y~X1*X2+X3+X4+X5+X6, data = dat)
#> The Best Model:
#> X1 X2 X3
#> BIC:
#> 2262.126
#> MPP:
#> 0.832
example3$variables
#> Var PIP
#> Var1 X1 1.000
#> Var2 X2 1.000
#> Var3 X3 1.000
#> Var4 X4 0.043
#> Var5 X5 0.046
#> Var6 X6 0.049
#> Var7 X1:X2 0.043
Note that because modelselect.lm uses lm notation for formulas, we could also have written the formula as Y~X1+X2+X3+X4+X5+X6+X1:X2.
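For instance, the following call (output omitted) specifies the same model space as example3:
example4 <- modelselect.lm(formula = Y~X1+X2+X3+X4+X5+X6+X1:X2, data = dat)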
The modelselect.lm function uses GA_var as a threshold to decide between an exhaustive model search and a stochastic model search. The default value of GA_var is 16. If the number of variables is smaller than GA_var, then modelselect.lm performs an exhaustive model search; otherwise, it uses a genetic algorithm (GA) to perform a stochastic model search. maxiterations is the maximum number of iterations to run before the GA search is stopped. runs_til_stop is the number of consecutive generations without any improvement in the best fitness value before the GA is stopped. monitor is a logical, defaulting to TRUE, that shows the evolution of the search. If monitor = FALSE, this output is suppressed.
example5 <- modelselect.lm(formula = Y~X1+X2+X3+X4, data = dat, GA_var = 16, maxiterations = 2000, runs_til_stop = 1000, monitor = TRUE, popSize = 100)
#> The Best Model:
#> X1 X2 X3
#> BIC:
#> 2262.126
#> MPP:
#> 0.957
Use lm.best to obtain the model fit for the best model.
lm.best takes the result from modelselect.lm
as an input. In this example, we obtain the model fit by using
method = "models". The fitted model is the model with the
highest MPP. The return value of lm.best has the same structure as that of lm. We may use $ to extract output statistics, for example $coefficients for the estimated regression coefficients.
lm_model <- lm.best(object = example1, method = "models")
lm_model$coefficients
#> (Intercept) X1 X2 X3
#> 1.0605943 0.9264065 1.9327270 2.9043566
In the next example, we obtain the model fit by using method = "variables". The fitted model contains the predictors with PIP larger than the specified threshold.
lm_var <- lm.best(object = example2, method = "variables", threshold = 0.9)
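Given the PIPs of example2 above, only X1, X2, and X3 exceed the 0.9 threshold, so lm_var should coincide with lm_model; this can be checked with, for example (output omitted):
lm_var$coefficients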
Note that because the output of the lm.best function is of class lm, we can apply the function summary to it. The summary table should be interpreted as conditional on the best model being the true model.
summary(lm_model)
#>
#> Call:
#> lm(Y ~ X1 + X2 + X3)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -6.3961 -1.4886 -0.0344 1.5215 7.0967
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.0606 0.1016 10.441 <2e-16 ***
#> X1 0.9264 0.1001 9.259 <2e-16 ***
#> X2 1.9327 0.1026 18.836 <2e-16 ***
#> X3 2.9044 0.1019 28.491 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.262 on 496 degrees of freedom
#> Multiple R-squared: 0.7246, Adjusted R-squared: 0.7229
#> F-statistic: 434.9 on 3 and 496 DF, p-value: < 2.2e-16
Here is an example of how to perform model selection for generalized
linear models. The modelselect.glm() function takes a data
frame which contains response and predictors. In this toy example, the
data frame glmdat contains six candidate predictors X1 to
X6. The response variable Y was actually simulated from a Bernoulli GLM
with predictors X1, X2, and X3.
data("glmdat")
head(glmdat,5)
#> X1 X2 X3 X4 X5 X6 Y
#> 1 -0.6231238 0.6921450 -0.7550460 1.84589708 0.8060891 -0.3269579 0
#> 2 -0.9146050 1.0900962 -2.3906457 -0.33145615 -0.5730146 0.3308736 0
#> 3 0.7502279 -1.2272597 -0.7699745 2.64233523 1.4229513 0.5098933 0
#> 4 -0.9341694 -0.6433256 0.7154170 -0.05134667 -1.4511260 0.1013274 0
#> 5 -0.6362055 0.5239009 0.9455083 0.74074925 -0.3599637 0.6767956 1
The data frame glmdat above is included in the VariableSelection package. In the next example, we use the
function modelselect.glm to find the best model and to
compute the PIPs of the candidate predictors.
example.glm <- modelselect.glm(formula = Y~., family = "binomial", data = glmdat)
#> The Best Model:
#> X1 X2 X3
#> BIC:
#> 208.337
#> MPP:
#> 0.848
example.glm$variables
#> Var PIP
#> Var1 X1 1.000
#> Var2 X2 1.000
#> Var3 X3 1.000
#> Var4 X4 0.051
#> Var5 X5 0.063
#> Var6  X6 0.046
Then, we use glm.best to obtain the model fit for the
best model. glm.best takes the result from
modelselect.glm as an argument.
glm_model <- glm.best(object = example.glm, family = "binomial", method = "models", threshold = 0.95)
The function summary can be applied to the result of
glm.best.
summary(glm_model)
#>
#> Call:
#> glm(formula = Y ~ X1 + X2 + X3, family = "binomial")
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.4477 -0.1561 0.0046 0.1850 3.4274
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.1116 0.1893 0.590 0.555
#> X1 4.7511 0.5315 8.940 < 2e-16 ***
#> X2 4.0121 0.4525 8.867 < 2e-16 ***
#> X3 2.8915 0.3612 8.004 1.2e-15 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 693.12 on 499 degrees of freedom
#> Residual deviance: 183.48 on 496 degrees of freedom
#> AIC: 191.48
#>
#> Number of Fisher Scoring iterations: 8
In the VariableSelection package, there are four
functions: modelselect.lm() and lm.best() for
LM, and modelselect.glm() and glm.best() for
GLM.
Function modelselect.lm() uses BIC to perform model
selection for LMs. The function modelselect.lm() takes a
formula for the full model and a data frame containing the response
variable and the predictor variables as input. It returns a list with two tables: (1) a table of candidate models with their BICs and posterior probabilities, and (2) a table of candidate variables with their posterior inclusion probabilities. In addition, it returns the original data with the variables in the formula.
Function lm.best() takes the result from modelselect.lm() as its object argument. There are two methods to select
the best model. method="models" uses models’ BIC or
posterior probabilities to select the best model.
method="variables" selects the variables with PIP larger
than the threshold.
Function modelselect.glm() uses BIC to perform model selection for GLMs. The function modelselect.glm() takes as input a formula, a data frame containing the response variable and the predictors, and a family function for the error distribution. It returns a list with two tables: (1) a table of candidate models with their BICs and posterior probabilities, and (2) a table of candidate variables with their posterior inclusion probabilities. In addition, it returns the original data with the variables in the formula.
Function glm.best() takes the result from modelselect.glm() as its object argument. There are two methods to
select the best model. method="models" uses models’ BIC or
posterior probabilities to select the best model.
method="variables" selects the variables with PIP larger
than the threshold.
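For completeness, a variables-based fit for the GLM example above can be obtained analogously (output omitted):
glm_var <- glm.best(object = example.glm, family = "binomial", method = "variables", threshold = 0.9)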