library(VariableSelection)
Variable selection methods screen candidate regression variables and identify those associated with the response. A simpler model with fewer variables is also easier to interpret.
The VariableSelection package uses Bayesian Information
Criterion (BIC), Model Posterior Probability (MPP), and Posterior
Inclusion Probability (PIP) to perform model selection for linear models
(LM) and generalized linear models (GLM). The package provides the best
model, BIC, and MPP for candidate models, and PIP for each predictor.
This vignette contains an example to illustrate how to use
VariableSelection.
The linear models used in the VariableSelection package
are of the form \[ \pmb{Y}=X\pmb{\beta}+\pmb{\epsilon},\] where \(\pmb{Y}\) is the vector of responses, \(X\) is the design matrix of predictors, \(\pmb{\beta}\) is the vector of regression coefficients, and \(\pmb{\epsilon}\) is the vector of errors.
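For concreteness, here is a minimal sketch, not part of the package, of simulating data from such a linear model:
set.seed(123)                              # for reproducibility
n <- 100
X <- cbind(1, matrix(rnorm(n * 3), n, 3))  # intercept column plus 3 predictors
beta <- c(1, 1, 2, 3)                      # true regression coefficients
Y <- X %*% beta + rnorm(n)                 # add standard normal errors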
The generalized linear models used in the
VariableSelection package assume that the observations come
from a distribution that belongs to the exponential family of
distributions. In addition, these GLMs assume that the mean of the
observations is related to the covariates through a link function \(g\) such that \[E(\pmb{Y}|X) = g^{-1}(X\pmb{\beta}),\] where \(X\) is the design matrix of predictors and \(\pmb{\beta}\) is the vector of regression coefficients.
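As an illustration, not package code: for a Bernoulli GLM with the logit link, \(g^{-1}\) is the inverse logit, which is plogis in R.
set.seed(123)
x <- rnorm(100)
eta <- 0.5 + 1.5 * x                 # linear predictor X beta
p <- plogis(eta)                     # E(Y | X) = g^{-1}(X beta)
y <- rbinom(100, size = 1, prob = p)
coef(glm(y ~ x, family = binomial))  # estimates should be near (0.5, 1.5)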
The Bayesian Information Criterion (BIC) is defined as:
\[ \text{BIC} = -2 \log(L) + k \log(n), \]
where \(L\) is the maximized likelihood of the model, \(k\) is the number of model parameters, and \(n\) is the sample size. A lower BIC indicates a better model.
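As a quick sanity check, not package code, BIC can be computed by hand for a model fitted with lm and compared with the built-in stats::BIC:
fit <- lm(dist ~ speed, data = cars)       # built-in cars data set
k <- attr(logLik(fit), "df")               # parameters: coefficients plus error variance
n <- nobs(fit)
-2 * as.numeric(logLik(fit)) + k * log(n)  # agrees with BIC(fit)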
The modelselect.lm() function can take a
data frame which contains both the response and predictors.
For example, here are the first 5 rows of a data frame, where X1 through
X6 are six candidate predictors. Y is the response variable which is
simulated from a linear model that contains the predictors X1, X2, and
X3.
data("dat")
head(dat,5)
#> X1 X2 X3 X4 X5 X6
#> 1 0.9962431 -0.02560867 -0.61734975 1.4855638 0.6727587 0.1247850
#> 2 0.8569675 -2.08603964 1.71170398 1.3454152 -0.1355638 -1.8180708
#> 3 -0.4341982 0.77668088 -0.71365443 0.7359808 -1.4004599 -2.0857847
#> 4 1.5278373 0.96399955 0.07846497 1.1466296 -0.9196107 0.2417378
#> 5 0.5615185 -0.69806814 -0.12644657 -1.7336450 1.0308951 -0.7059752
#> Y
#> 1 -0.4551099
#> 2 1.1381262
#> 3 0.9274077
#> 4 1.4541197
#> 5  1.1890484
The data frame dat above is included in the VariableSelection package.
In this example, we use modelselect.lm to select the
predictors in a linear model. The modelselect.lm function
takes a formula and dataset as input. In the example below, we consider
the model space that contains all possible linear models generated with
the predictors X1, X2, X3, and X4.
example1 <- modelselect.lm(formula = Y~X1+X2+X3+X4, data = dat)
#> The Best Model:
#> X1 X2 X3
#> BIC:
#> 2262.126
#> MPP:
#> 0.957
The output of modelselect.lm includes a table of the BICs and MPPs of competing models and a table of the PIPs of the candidate predictors.
head(example1$models)
#> X1 X2 X3 X4 BIC MPP
#> 1 1 1 1 0 2262.126 0.957
#> 2 1 1 1 1 2268.340 0.043
#> 3 0 1 1 0 2335.625 0.000
#> 4 0 1 1 1 2341.214 0.000
#> 5 1 0 1 0 2525.717 0.000
#> 6 1 0 1 1 2531.910 0.000
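These MPPs are consistent with the usual BIC approximation to posterior model probabilities under a uniform prior over models, \(\text{MPP}_i \propto \exp(-\text{BIC}_i/2)\); that the package uses exactly this prior is our assumption here, not something the documentation states. A variable's PIP is then the sum of the MPPs of the models that include it. A quick check against the table above (output omitted):
bic <- example1$models$BIC
w <- exp(-(bic - min(bic)) / 2)             # subtract min(bic) for numerical stability
round(w / sum(w), 3)                        # should match the MPP column
sum((w / sum(w))[example1$models$X4 == 1])  # PIP of X4, approximately 0.043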
example1$variables
#> Var PIP
#> Var1 X1 1.000
#> Var2 X2 1.000
#> Var3 X3 1.000
#> Var4  X4 0.043
Here are some additional examples of how to write a formula. In the next example, the formula Y~. includes all predictors in the data frame.
example2 <- modelselect.lm(formula = Y~., data = dat)
#> The Best Model:
#> X1 X2 X3
#> BIC:
#> 2262.126
#> MPP:
#> 0.869
example2$variables
#> Var PIP
#> Var1 X1 1.000
#> Var2 X2 1.000
#> Var3 X3 1.000
#> Var4 X4 0.043
#> Var5 X5 0.046
#> Var6  X6 0.049
The next example includes an interaction term between the predictors X1 and X2.
example3 <- modelselect.lm(formula = Y~X1*X2+X3+X4+X5+X6, data = dat)
#> The Best Model:
#> X1 X2 X3
#> BIC:
#> 2262.126
#> MPP:
#> 0.832
example3$variables
#> Var PIP
#> Var1 X1 1.000
#> Var2 X2 1.000
#> Var3 X3 1.000
#> Var4 X4 0.043
#> Var5 X5 0.046
#> Var6 X6 0.049
#> Var7 X1:X2 0.043
Note that because modelselect.lm uses lm notation for formulas, we could also have written the formula as Y~X1+X2+X3+X4+X5+X6+X1:X2.
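For instance, the following call (output omitted) specifies the same model space as example3:
example4 <- modelselect.lm(formula = Y~X1+X2+X3+X4+X5+X6+X1:X2, data = dat)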
The modelselect.lm function uses GA_var as a threshold to decide between an exhaustive model search and a stochastic model search. The default value of GA_var is 16. If the number of variables is smaller than GA_var, then modelselect.lm performs an exhaustive model search; otherwise, it uses a genetic algorithm (GA) to perform a stochastic model search. maxiterations is the maximum number of iterations to run before the GA search is stopped. runs_til_stop is the number of consecutive generations without any improvement in the best fitness value before the GA is stopped. monitor is a logical, defaulting to TRUE, that shows the evolution of the search. If monitor = FALSE, this output is suppressed.
example5 <- modelselect.lm(formula = Y~X1+X2+X3+X4, data = dat, GA_var = 16, maxiterations = 2000, runs_til_stop = 1000, monitor = TRUE, popSize = 100)
#> The Best Model:
#> X1 X2 X3
#> BIC:
#> 2262.126
#> MPP:
#> 0.957
Use lm.best to obtain the model fit for the best model.
lm.best takes the result from modelselect.lm
as an input. In this example, we obtain the model fit by using
method = "models". The fitted model is the model with the
highest MPP. The return value of lm.best has the same structure as that of lm. We may use $ to extract output statistics, for example $coefficients for the estimated regression coefficients.
lm_model <- lm.best(object = example1, method = "models")
lm_model$coefficients
#> (Intercept) X1 X2 X3
#> 1.0605943 0.9264065 1.9327270 2.9043566
In the next example, we obtain the model fit by using method = "variables". The fitted model contains the predictors with PIP larger than the specified threshold.
lm_var <- lm.best(object = example2, method = "variables", threshold = 0.9)
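Given the PIPs of example2 above, only X1, X2, and X3 exceed the 0.9 threshold, so lm_var should coincide with lm_model; this can be checked with, for example (output omitted):
lm_var$coefficients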
Note that because the output of the lm.best function is of class lm, we can apply the function summary to it. The summary table should be interpreted as conditional on the best model being the true model.
summary(lm_model)
#>
#> Call:
#> lm(Y ~ X1 + X2 + X3)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -6.3961 -1.4886 -0.0344 1.5215 7.0967
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.0606 0.1016 10.441 <2e-16 ***
#> X1 0.9264 0.1001 9.259 <2e-16 ***
#> X2 1.9327 0.1026 18.836 <2e-16 ***
#> X3 2.9044 0.1019 28.491 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.262 on 496 degrees of freedom
#> Multiple R-squared: 0.7246, Adjusted R-squared: 0.7229
#> F-statistic: 434.9 on 3 and 496 DF, p-value: < 2.2e-16
Here is an example of how to perform model selection for generalized
linear models. The modelselect.glm() function takes a data
frame which contains response and predictors. In this toy example, the
data frame glmdat contains six candidate predictors X1 to
X6. The response variable Y was actually simulated from a Bernoulli GLM
with predictors X1, X2, and X3.
data("glmdat")
head(glmdat,5)
#> X1 X2 X3 X4 X5 X6 Y
#> 1 -0.6231238 0.6921450 -0.7550460 1.84589708 0.8060891 -0.3269579 0
#> 2 -0.9146050 1.0900962 -2.3906457 -0.33145615 -0.5730146 0.3308736 0
#> 3 0.7502279 -1.2272597 -0.7699745 2.64233523 1.4229513 0.5098933 0
#> 4 -0.9341694 -0.6433256 0.7154170 -0.05134667 -1.4511260 0.1013274 0
#> 5 -0.6362055 0.5239009 0.9455083 0.74074925 -0.3599637 0.6767956 1
The data frame glmdat above is included in the VariableSelection package. In the next example, we use the
function modelselect.glm to find the best model and to
compute the PIPs of the candidate predictors.
example.glm <- modelselect.glm(formula = Y~., family = "binomial", data = glmdat)
#> The Best Model:
#> X1 X2 X3
#> BIC:
#> 208.337
#> MPP:
#> 0.848
example.glm$variables
#> Var PIP
#> Var1 X1 1.000
#> Var2 X2 1.000
#> Var3 X3 1.000
#> Var4 X4 0.051
#> Var5 X5 0.063
#> Var6  X6 0.046
Then, we use glm.best to obtain the model fit for the
best model. glm.best takes the result from
modelselect.glm as an argument.
glm_model <- glm.best(object = example.glm, family = "binomial", method = "models", threshold = 0.95)
The function summary can be applied to the result of
glm.best.
summary(glm_model)
#>
#> Call:
#> glm(formula = Y ~ X1 + X2 + X3, family = "binomial")
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.4477 -0.1561 0.0046 0.1850 3.4274
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.1116 0.1893 0.590 0.555
#> X1 4.7511 0.5315 8.940 < 2e-16 ***
#> X2 4.0121 0.4525 8.867 < 2e-16 ***
#> X3 2.8915 0.3612 8.004 1.2e-15 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 693.12 on 499 degrees of freedom
#> Residual deviance: 183.48 on 496 degrees of freedom
#> AIC: 191.48
#>
#> Number of Fisher Scoring iterations: 8
In the VariableSelection package, there are four
functions: modelselect.lm() and lm.best() for
LM, and modelselect.glm() and glm.best() for
GLM.
Function modelselect.lm() uses BIC to perform model
selection for LMs. The function modelselect.lm() takes a
formula for the full model and a data frame containing the response
variable and the predictor variables as input. It returns a list with two tables: (1) a table of candidate models with their BICs and posterior probabilities, and (2) a table of candidate variables with their posterior inclusion probabilities. In addition, it returns the original data with the variables in the formula.
Function lm.best() takes the result from modelselect.lm() as its object argument. There are two methods to select
the best model. method="models" uses models’ BIC or
posterior probabilities to select the best model.
method="variables" selects the variables with PIP larger
than the threshold.
Function modelselect.glm() uses BIC to perform model selection for GLMs. The function modelselect.glm() takes as input a formula, a data frame containing the response variable and the predictors, and a family function for the error distribution. It returns a list with two tables: (1) a table of candidate models with their BICs and posterior probabilities, and (2) a table of candidate variables with their posterior inclusion probabilities. In addition, it returns the original data with the variables in the formula.
Function glm.best() takes the result from modelselect.glm() as its object argument. There are two methods to
select the best model. method="models" uses models’ BIC or
posterior probabilities to select the best model.
method="variables" selects the variables with PIP larger
than the threshold.
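For completeness, a variables-based fit for the GLM example above can be obtained analogously (output omitted):
glm_var <- glm.best(object = example.glm, family = "binomial", method = "variables", threshold = 0.9)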