CRAN Task View: Machine Learning & Statistical Learning
Several add-on packages implement ideas and methods developed at the
borderline between computer science and statistics - this field of research
is usually referred to as machine learning.
The packages can be roughly structured into the following topics:
Neural Networks
: Single-hidden-layer neural networks are
implemented in package
(shipped with base R).
offers an interface to the Stuttgart
Neural Network Simulator (SNNS). An interface to the FCNN library allows
user-extensible artificial neural networks in package
Recursive Partitioning
: Tree-structured models for
regression, classification and survival analysis, following the
ideas in the CART book, are
(shipped with base R) and
is recommended for computing CART-like
A rich toolbox of partitioning algorithms is available in
provides an interface to this
implementation, including the J4.8-variant of C4.5 and M5.
package fits rule-based models (similar
to trees) with linear regression models in the terminal leaves,
instance-based corrections and boosting. The
package can fit
C5.0 classification trees, rule-based models, and boosted versions of these.
Two recursive partitioning algorithms with unbiased variable
selection and statistical stopping criterion are implemented in
is based on
nonparametric conditional inference procedures for testing
independence between response and each input variable whereas
can be used to partition parametric models.
Extensible tools for visualizing binary trees
and node distributions of the response are available in package
Tree-structured varying coefficient models are implemented in
For problems with binary input variables
implements logic regression.
Graphical tools for the visualization of
trees are available in package
Trees for modelling longitudinal data by means of
random effects are offered by package
Partitioning of mixture models is performed by
Computational infrastructure for representing trees and
unified methods for prediction and visualization are implemented
This infrastructure is used by package
to implement evolutionary learning
of globally optimal trees.
Oblique trees are available in package
Random Forests
: The reference implementation of the random
forest algorithm for regression and classification is available in
for regression, classification and survival analysis as well as
bundling, a combination of multiple models via
ensemble learning. In addition, a random forest variant for
response variables measured at arbitrary scales based on
conditional inference trees is implemented in package
implements a unified treatment of Breiman's random forests for
survival, regression and classification problems. Quantile regression forests
regress quantiles of a numeric response on explanatory
variables via a random forest approach. For binary data,
is a forest of logic regression trees (package
packages focus on variable selection by means
of random forest algorithms. In addition, packages
offer R interfaces to fast C++ implementations of random forests.
Regularized and Shrinkage Methods
: Regression models with some
constraint on the parameter estimates can be fitted with the
packages. Lasso with
simultaneous updates for groups of parameters (groupwise lasso)
is available in package
package implements a number of other group
penalization models, such as group MCP and group SCAD.
The L1 regularization path for generalized linear models and
Cox models can be obtained from functions available in package
glmpath, the entire lasso or elastic-net regularization path (also in
for linear regression,
logistic and multinomial regression models can be obtained from package
an alternative implementation of lasso (L1) and ridge (L2)
penalized regression models (both GLM and Cox models).
can be used to identify and display TRACEs
for a specified shrinkage path and to determine the appropriate extent of shrinkage.
Semiparametric additive hazards models under lasso penalties are offered
A generalisation of the Lasso shrinkage technique for linear regression
is called relaxed lasso and is available in package
Fisher's LDA projection with an optional LASSO penalty to produce sparse
solutions is implemented in package
centroids classifier and utilities for gene expression analyses are
implemented in package
pamr. An implementation
of multivariate adaptive regression splines is available
earth. Variable selection through clone selection
in SVMs in penalized models (SCAD or L1 penalties) is implemented
penalizedSVM. Various forms of
penalized discriminant analysis are implemented in
offers an interface to
the LIBLINEAR library.
package fits linear and logistic
regression models under the SCAD and MCP
regression penalties using a coordinate descent algorithm.
High-throughput ridge regression (i.e., penalization with many predictor variables) and
heteroskedastic effects models are the focus of the
An implementation of bundle methods for regularized risk minimization
is available from package
bmrm. The Lasso under non-Gaussian and
heteroscedastic errors is estimated by
inference on low-dimensional components of Lasso regression and of estimated treatment
effects in a high-dimensional setting are also contained. Package
implements sure independence screening in generalised linear and Cox models.
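
The coordinate descent idea mentioned above for SCAD and MCP penalties is easiest to see for the plain lasso penalty. The sketch below is a language-neutral illustration in Python under stated assumptions: it swaps in the lasso penalty for SCAD/MCP, fits no intercept, and the names `soft_threshold` and `lasso_cd` as well as the toy data are hypothetical. It is not the code of any package listed here.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) used by the lasso update."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows, y a list of responses; no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals y - X beta, with beta = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm_sq = sum(v * v for v in col) / n
            if norm_sq == 0.0:
                continue
            # covariance of column j with the partial residual (beta_j added back)
            rho = sum(col[i] * (resid[i] + col[i] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / norm_sq
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= col[i] * delta
                beta[j] = new_bj
    return beta


# Toy data (hypothetical): y depends on the first column only.
X = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]]
y = [2.0, 4.0, 6.0, 8.0]

beta_small = lasso_cd(X, y, lam=0.01)   # light penalty: close to least squares
beta_large = lasso_cd(X, y, lam=100.0)  # heavy penalty: all coefficients shrunk to zero
```

Comparing `beta_small` with `beta_large` shows the shrinkage effect of the penalty: a larger `lam` moves coefficients towards zero and eventually removes them from the model.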
Boosting
: Various forms of gradient boosting are
implemented in package
(tree-based functional gradient
descent boosting). The hinge loss is optimized by the boosting implementation
can be used to fit generalized additive models
by a boosting algorithm. An extensible boosting framework for
generalized linear, additive and nonparametric models is available in
mboost. Likelihood-based boosting for Cox models
is implemented in
and for mixed models in
GAMLSS models can be fitted using boosting by
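
Boosting for linear models, as in the extensible framework mentioned above, can be pictured as repeatedly fitting the current residuals with the single best predictor and taking a small step in that direction. The following Python sketch of componentwise L2 boosting is an illustration under assumptions: the function name `l2_boost`, the step length `nu`, and the toy data are invented here, and the code does not reproduce the interface or internals of mboost or any other package.

```python
def l2_boost(X, y, n_steps=500, nu=0.1):
    """Componentwise L2 boosting for a linear model without intercept.

    Each step regresses the current residuals on every single column,
    keeps the column that reduces the residual sum of squares most,
    and moves its coefficient by a fraction nu of the least-squares fit.
    """
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_steps):
        best_j, best_b, best_gain = None, 0.0, 0.0
        for j, col in enumerate(cols):
            sxx = sum(v * v for v in col)
            if sxx == 0.0:
                continue
            sxr = sum(v * r for v, r in zip(col, resid))
            b = sxr / sxx
            gain = b * sxr  # reduction in RSS from the full least-squares step
            if gain > best_gain:
                best_j, best_b, best_gain = j, b, gain
        if best_j is None:
            break  # residuals are orthogonal to every column
        beta[best_j] += nu * best_b
        for i in range(n):
            resid[i] -= nu * best_b * cols[best_j][i]
    return beta


# Toy data (hypothetical): y equals twice the first column.
X = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]]
y = [2.0, 4.0, 6.0, 8.0]
beta = l2_boost(X, y)
```

Stopping after fewer steps leaves the coefficients shrunken towards zero, which is why the number of boosting iterations acts as the main tuning parameter in this kind of algorithm.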
Support Vector Machines and Kernel Methods
: The function
offers an interface to the LIBSVM library and
implements a flexible framework
for kernel learning (including SVMs, RVMs and other kernel
learning algorithms). An interface to the SVMlight implementation
(only for one-against-all classification) is provided in package
The relevant dimension in kernel feature spaces can be estimated
which also offers procedures for model selection
Bayesian Methods
: Bayesian Additive Regression Trees (BART),
where the final model is defined in terms of the sum over
many weak learners (not unlike ensemble methods),
are implemented in package
Bayesian nonstationary, semiparametric nonlinear regression
and design by treed Gaussian processes including Bayesian CART and
treed linear models are made available by package
Optimization using Genetic Algorithms
offer optimization routines based on genetic algorithms.
implements memetic algorithms
with local search chains, which are a special type of
evolutionary algorithms, combining a steady state genetic
algorithm with local search for real-valued
provides both data structures for efficient
handling of sparse binary data as well as interfaces to
implementations of Apriori and Eclat for mining
frequent itemsets, maximal frequent itemsets, closed
frequent itemsets and association rules.
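
To make the notions of frequent itemsets and association rules concrete, the Python sketch below implements a small level-wise Apriori search. It is a teaching illustration under assumptions (the function name `apriori`, the count-based `min_support` threshold, and the basket data are hypothetical) and makes no claim about how the interfaced Apriori or Eclat implementations work internally.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets with count >= min_support.

    Level-wise search: frequent k-itemsets are joined into (k+1)-candidates,
    and a candidate is kept only if all of its k-subsets are frequent.
    """
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)
        current = {}
        for cand in candidates:
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_support:
                current[cand] = count
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]
frequent = apriori(baskets, min_support=2)

# Confidence of the rule {butter} -> {bread}: support(both) / support(butter).
confidence = frequent[frozenset({"bread", "butter"})] / frequent[frozenset({"butter"})]
```

An association rule is then read off from two nested frequent itemsets, as in the final line, where the confidence is the share of baskets containing butter that also contain bread.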
Fuzzy Rule-based Systems
implements a host of standard
methods for learning fuzzy rule-based systems from data
for regression and classification. Package
provides comprehensive implementations of the
rough set theory (RST) and the fuzzy rough set theory (FRST) in a single package.
Model selection and validation
for hyperparameter tuning and
(ipred) can be used for
error rate estimation. The cost parameter C for support vector
machines can be chosen utilizing the functionality of package
Functions for ROC analysis and other visualisation techniques
for comparing candidate classifiers are available from package
selection for a range of models,
also offers other inference procedures in high-dimensional models.
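
Error rate estimation by resampling, as mentioned above, usually means holding out part of the data, refitting on the rest, and counting mistakes on the held-out part. The Python sketch below shows k-fold cross-validation for a deliberately simple nearest-class-mean classifier. All names (`nearest_mean_fit`, `nearest_mean_predict`, `cv_error`), the deterministic fold assignment, and the data are assumptions made for illustration, not the behaviour of any package named in this section.

```python
def nearest_mean_fit(xs, labels):
    """Fit a one-dimensional nearest-class-mean classifier."""
    sums, counts = {}, {}
    for x, lab in zip(xs, labels):
        sums[lab] = sums.get(lab, 0.0) + x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}


def nearest_mean_predict(means, x):
    """Predict the label whose class mean is closest to x."""
    return min(means, key=lambda lab: abs(x - means[lab]))


def cv_error(xs, labels, k):
    """k-fold cross-validated misclassification rate.

    Observation i is assigned to fold i % k; each fold is held out once,
    the classifier is refitted on the rest, and the held-out errors are pooled.
    """
    n = len(xs)
    errors = 0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        means = nearest_mean_fit(
            [xs[i] for i in train], [labels[i] for i in train]
        )
        errors += sum(
            1 for i in test if nearest_mean_predict(means, xs[i]) != labels[i]
        )
    return errors / n


# Hypothetical data: two classes on the real line with one overlapping point.
xs = [0.1, 0.4, 0.5, 0.9, 2.1, 2.4, 1.0, 2.9]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
err = cv_error(xs, labels, k=4)
```

In practice the folds would normally be drawn at random, and the same loop can be repeated over a grid of tuning parameter values to pick the one with the lowest estimated error.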
: Evidential classifiers quantify the uncertainty about the
class of a test pattern using a Dempster-Shafer mass function in package
(One Rule) package offers a classification algorithm with
enhancements for sophisticated handling of missing values and numeric data
together with extensive diagnostic functions.
provides miscellaneous functions
for building predictive models, including parameter tuning
and variable importance measures. The package can be used
with various parallel implementations (e.g., MPI, NWS, etc.).
In a similar spirit, package
offers a high-level interface
to various statistical and machine learning packages. Package
implements a similar toolbox.
package implements a general purpose machine learning
platform that has scalable implementations of many popular algorithms such
as random forest, GBM, GLM (with elastic net regularization), and deep
learning (feedforward multilayer networks), among others.
Elements of Statistical Learning
: Data sets, functions and
examples from the book
The Elements of Statistical Learning: Data Mining,
Inference, and Prediction
by Trevor Hastie, Robert Tibshirani and
Jerome Friedman have been packaged and are available as
is a graphical user interface for data mining in R.
implements a rather broad class of machine learning
algorithms, such as nearest neighbors, trees, random forests, and
several feature selection methods. Similarly, package
several learning algorithms implemented in other packages and computes
several performance measures.