CRAN Task View: Multivariate Statistics
Contact: Paul.Hewson at plymouth.ac.uk
Base R contains most of the functionality for classical multivariate analysis,
although not all in one place. A large number of packages on CRAN extend this methodology;
a brief overview is given below. Application-specific uses of multivariate statistics
are described in the relevant task views; for example, whilst principal components are listed here,
ordination is covered in the
task view. Further information on supervised classification can be found in the
task view, and unsupervised classification in the
The packages in this view can be roughly structured into the following topics.
If you think that some package is missing from the list, please let me know.
Visualising multivariate data
A range of base graphics (e.g.
splom()) are useful for
visualising pairwise arrays of 2-dimensional scatterplots, clouds and 3-dimensional densities.
provides usefully enhanced pairwise scatterplots.
provides 3-dimensional scatterplots,
provides bagplots and
a function for rotating 3d clouds.
misc3d, dependent upon
provides animated functions within R useful for visualising densities.
provides a range of useful visualisation techniques for multivariate data.
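As a minimal sketch of a pairwise scatterplot array, the following uses pairs() from the graphics package on the built-in iris data. The choice of pairs() and of this dataset is an assumption made for illustration; it is not necessarily the function the text above refers to.

```r
# Scatterplot matrix of the four iris measurements, coloured by species.
x <- iris[, 1:4]

# Null PDF device so the sketch also runs non-interactively without
# writing a file; drop this line (and dev.off()) in an interactive session.
pdf(NULL)
pairs(x, col = as.integer(iris$Species), pch = 19,
      main = "Pairwise scatterplots of the iris measurements")
invisible(dev.off())
```

Each off-diagonal panel shows one pair of variables, so group structure that is invisible in one projection can often be seen in another.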
More specialised multivariate plots include the following:
provides Chernoff's faces;
in graphics provides a choice of star, radar
and cobweb plots respectively.
provide minimum spanning tree functionality.
supports biplot and scatterplot
which provides an interface to the qhull library,
gives indices to the relevant points via
draws ellipses for two parameters, and provides
visual display of a correlation matrix.
provides level set trees
for multivariate visualisation. Mosaic plots are available via
in graphics and
that also contains other visualization techniques for multivariate
provides a number of
cluster-specific graphical enhancements for scatterplots and parallel coordinate plots.
See the links for a reference to GGobi.
interfaces with GGobi.
interfaces to the XGobi
and XGvis programs which allow linked, dynamic multivariate plots as well as
projection pursuit. Finally,
allows particularly powerful dynamic interactive
graphics, of which interactive parallel co-ordinate plots and mosaic plots may be of great interest.
Seriation methods are provided by
which can reorder matrices and dendrograms.
assist with descriptive functions; from the same package
assist in exploring a given
dataset in terms of representativeness and finding matches.
in base and
provide a wide range of distance measures,
provides a framework for more distance measures, including measures between matrices.
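The following is a small sketch of distance calculations using only functions shipped with R: dist() and mahalanobis() from stats. Treating these as the functions meant above is an assumption, as is the use of the built-in USArrests data.

```r
# Standardise the variables so that no single one dominates the distances.
x <- scale(USArrests)

d_euc <- dist(x)                        # Euclidean distance (the default)
d_man <- dist(x, method = "manhattan")  # city-block distance

# Squared Mahalanobis distance of each observation from the centroid,
# using the sample covariance matrix.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
```

dist() returns the lower triangle of the distance matrix as a "dist" object; as.matrix() converts it to a full square matrix when one is needed.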
provides functions for dealing with presence / absence data including similarity matrices and reshaping.
provides Hotelling's T² test as well as a range of non-parametric tests, including location tests based on marginal ranks, spatial median and spatial signs computation, and estimates of shape. Non-parametric two-sample tests are also available from
and spatial sign and rank tests to investigate location, sphericity and independence are available in
will provide estimates of the covariance
and correlation matrices respectively.
offers several descriptive measures such as
which provides an estimate of the spatial median and further functions which provide estimates of scatter. Further robust methods are provided such as
in MASS which
provides robust estimates of the variance-covariance matrix by minimum volume
ellipsoid, minimum covariance determinant or classical product-moment.
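A brief sketch of classical versus robust covariance estimation, assuming the MASS function in question is cov.rob() (a real function in MASS whose method argument accepts "mve", "mcd" and "classical"). The stackloss data are used purely for illustration.

```r
library(MASS)

x <- as.matrix(stackloss)

# Classical product-moment estimate.
classical <- cov.rob(x, method = "classical")

# Minimum covariance determinant estimate; the search uses random
# subsamples, so a seed is fixed for reproducibility.
set.seed(1)
robust <- cov.rob(x, method = "mcd")

classical$cov
robust$cov
```

Large differences between the two matrices suggest that a few observations are strongly influencing the classical estimate.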
provides robust covariance estimation via nearest neighbor variance estimation.
provides robust covariance estimation via fast minimum covariance determinant with
and the Orthogonalized pairwise estimate of Gnanadesikan-Kettenring via
covOGK(). Scalable robust methods are provided within
also using fast minimum covariance determinant with
as well as M-estimators with
provides shrinkage estimation of large scale covariance
and (partial) correlation matrices.
Densities (estimation and simulation):
in MASS simulates from the multivariate normal
also provides simulation as well as probability and
quantile functions for both the multivariate t distribution and multivariate normal
distributions as well as density functions for the multivariate normal distribution.
provides multivariate normal and multivariate t density and distribution
functions as well as random number simulation.
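As an illustration of multivariate normal simulation, the sketch below assumes mvrnorm() from MASS. The mean vector and covariance matrix are invented values chosen only for this example.

```r
library(MASS)

mu    <- c(0, 2)                                # illustrative mean vector
Sigma <- matrix(c(1, 0.6, 0.6, 2), nrow = 2)    # illustrative covariance

set.seed(42)
x <- mvrnorm(n = 500, mu = mu, Sigma = Sigma)

# With empirical = TRUE, mu and Sigma specify the sample mean and
# sample covariance rather than the population values.
x_emp <- mvrnorm(n = 500, mu = mu, Sigma = Sigma, empirical = TRUE)

colMeans(x)
cov(x)
```

The empirical = TRUE form is convenient for constructing test data with known sample moments.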
provides density, distribution and random number generation for the multivariate skew normal and skew t distribution.
Comprehensive information on mixtures is given in the
some density estimates and random numbers are provided by
ks, mixture fitting
is also provided within
bayesm. Functions to simulate from the
Wishart distribution are provided in a number of places, such
(the latter also has a density
from MASS provide binned and non-binned 2-dimensional kernel density
also provides multivariate kernel smoothing as does
provides patient rule induction methods to attempt to find regions of high density in high dimensional multivariate data,
also provides methods for determining feature significance in multivariate data (such as in relation to local modes).
provides a multivariate extension
to the Shapiro-Wilk test,
provides multivariate outlier detection based
on robust methods.
provides tests for multi-normality.
provides an assessment
of normality based on E statistics (energy); in the same package
assesses a number of samples for equal distributions. Tests for Wishart-distributed covariance matrices
are given by
routines for a range of (elliptical and Archimedean) copulas including
normal, t, Clayton, Frank, Gumbel,
generalised Archimedean copula.
(with a matrix specified as the dependent variable)
offers multivariate linear models,
provides comparison of
multivariate linear models.
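A minimal sketch of fitting and comparing multivariate linear models, assuming lm() with a matrix response and the anova() method for the resulting "mlm" objects, both from stats. The choice of iris variables is illustrative only.

```r
# A matrix on the left-hand side makes lm() fit a multivariate linear model.
fit0 <- lm(cbind(Sepal.Length, Sepal.Width) ~ 1, data = iris)
fit1 <- lm(cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris)

class(fit1)   # includes "mlm"
coef(fit1)    # one column of coefficients per response

# Compare the nested models with a multivariate test.
cmp <- anova(fit0, fit1)
cmp
```

The test statistic can be changed through the test argument of anova(), for example test = "Wilks".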
which fit multivariate skew normal and multivariate skew t models.
provides partial least squares regression (PLSR) and principal component regression,
provides penalized partial least squares,
provides dimension reduction regression options such as
(sliced average variance estimation).
provides partial least squares analyses for genomics.
provides functions to investigate the relative importance of regression parameters.
these can be fitted with
preferred) as well as
with S-PLUS) from stats.
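A short sketch of principal components analysis with prcomp() from stats. The USArrests data and the decision to scale the variables are assumptions made for illustration.

```r
# Centre and scale the variables, since they are measured in different units.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

pca$rotation      # loadings (eigenvectors)
head(pca$x, 3)    # component scores for the first three observations
```

summary(pca) reports the same proportions, and screeplot(pca) draws a scree plot of the component variances.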
provides simple components.
provides the first principal component and gives coefficients for unscaled
data. Additional support for an assessment of the scree plot can be found in
provides routines for Horn's evaluation of the number of dimensions to retain.
For wide matrices,
uses kernel methods to provide a form of non-linear principal components with
provides robust principal components by means
of projection pursuit.
further robust and parallelised methods such as a form of generalised
and robust principal component analysis via
respectively. Further options for principal components
in an ecological setting are available within
and in a sensory setting in
variety of routines useful in psychometry; in this context these include
maps onto a sphere and
where some variables may be considered as
dependent as well as
which has the option of adding simulation results to help assess the observed data.
provides principal tensor analysis analogous to both PCA and correspondence analysis.
provides standardised major axis estimation with specific application to allometry.
in stats provides
uses kernel methods to provide robust canonical correlation with
provides a number of concordance methods.
redundancy analysis as well as further options for canonical correlation.
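As a sketch of classical canonical correlation, the following assumes cancor() from stats and splits the built-in LifeCycleSavings data into two illustrative variable sets.

```r
# Two sets of variables measured on the same countries.
pop <- LifeCycleSavings[, 2:3]      # population age structure
oec <- LifeCycleSavings[, -(2:3)]   # savings and income variables

cc <- cancor(pop, oec)

cc$cor     # canonical correlations, largest first
cc$xcoef   # coefficients for the first variable set
cc$ycoef   # coefficients for the second variable set
```

The number of canonical correlations equals the number of variables in the smaller set, here two.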
provides fuzzy set ordination, which extends ordination beyond methods available from linear algebra.
algorithms to perform independent
component analysis (ICA) and Projection Pursuit, and
uses score functions.
provides either an invariant co-ordinate system or independent components.
adds an interface to the JADE algorithm, as well as providing some diagnostics for ICA.
provides Procrustes analysis; this package also provides functions
for ordination and further information on that area is given in the
task view. Generalised Procrustes analysis via
is available from
Principal coordinates / scaling methods
in stats provides classical multidimensional scaling
(principal coordinates analysis),
offer Sammon and Kruskal's non-metric multidimensional scaling.
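A minimal sketch of metric and non-metric scaling, assuming cmdscale() from stats and isoMDS() from MASS. The built-in eurodist road distances are used purely for illustration.

```r
library(MASS)

# Classical (metric) multidimensional scaling in two dimensions.
coords <- cmdscale(eurodist, k = 2)

# Kruskal's non-metric scaling; by default it starts from the
# classical solution.
nm <- isoMDS(eurodist, k = 2, trace = FALSE)

head(coords, 3)
nm$stress   # final stress, in percent
```

Both functions take a distance structure such as the output of dist() and return one row of coordinates per object.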
provides wrappers and post-processing for non-metric MDS.
is provided by
A comprehensive overview of clustering
methods available within R is provided by the
task view. Standard techniques include hierarchical clustering by
and k-means clustering by
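The two standard techniques can be sketched with hclust() and kmeans() from stats. That these are the functions the text refers to is an assumption; the iris measurements, average linkage and three clusters are illustrative choices.

```r
x <- scale(iris[, 1:4])

# Agglomerative hierarchical clustering on Euclidean distances.
hc <- hclust(dist(x), method = "average")
hc_groups <- cutree(hc, k = 3)   # cut the dendrogram into three groups

# k-means with several random starts to avoid poor local optima.
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 20)

# Cross-tabulate the two partitions.
table(hierarchical = hc_groups, kmeans = km$cluster)
```

Cluster labels are arbitrary, so the cross-tabulation is read by looking for cells that concentrate the counts rather than for a matching diagonal.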
A range of established
clustering and visualisation techniques are also available in
cluster, some cluster validation routines are available in
and the Rand index can be computed from
e1071. Trimmed cluster analysis is available from
trimcluster, cluster ensembles are available from
clue, methods to assist with choice of routines are available in
and hybrid methodology is provided by
edist()) and hierarchical clustering (
hclust.energy()) based on E-statistics are available in
energy. Mahalanobis distance based clustering (for fixed points as well as clusterwise regression) is available from
provides variable selection within model-based clustering.
Fuzzy clustering is available within
as well as via the
(Hierarchical Ordered Partitioning and
Collapsing Hybrid) algorithm.
provides supervised and unsupervised SOMs
for high dimensional spectra or patterns.
helps simulate clusters. The
task view also gives a topic-related overview of some clustering techniques. Model based clustering is available in
and model based clustering for functional data is available in
Full details on tree methods are given in the
Suffice to say here that classification trees are sometimes considered within
is most used for this purpose.
provides recursive partitioning. Classification and regression training is provided by
provides k-nearest neighbour methods which can be used for regression as well as classification.
Supervised classification and discriminant analysis
within MASS provide linear
and quadratic discrimination respectively.
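A brief sketch of linear and quadratic discrimination, assuming the MASS functions lda() and qda(). The iris data are an illustrative choice, and the accuracies computed here are resubstitution estimates, which are optimistic.

```r
library(MASS)

fit_lda <- lda(Species ~ ., data = iris)
fit_qda <- qda(Species ~ ., data = iris)

# Predict on the training data and compare with the true labels.
acc_lda <- mean(predict(fit_lda)$class == iris$Species)
acc_qda <- mean(predict(fit_qda)$class == iris$Species)

c(lda = acc_lda, qda = acc_qda)
```

Both functions also accept CV = TRUE, which returns leave-one-out cross-validated classifications instead of a fitted model object.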
provides mixture and
flexible discriminant analysis with
as well as
multivariate adaptive regression splines with
and adaptive spline
backfitting with the
function. Multivariate adaptive regression splines can also be found in
for high dimensional data by means of shrunken centroids regularized discriminant analysis.
for factorial discriminant analysis. A number of packages provide for
dimension reduction with the classification.
selection and robustness against multicollinearity as well as a number of
provides principal components for
supervised classification, whereas
provides classification using
generalised partial least squares.
provides cross-validated linear discriminant calculations to determine the optimum number of features.
provides a range of methods for assessing classifier performance.
Further information on supervised classification can be found in
in MASS provide simple and
multiple correspondence analysis respectively.
also provides single, multiple and joint correspondence analysis.
provide correspondence and multiple correspondence analysis
respectively, as well as adding homogeneous table analysis with
Further functionality is also available within
is available from
which also enable simple and multiple correspondence analysis as well as associated graphical routines.
provides homogeneity analysis.
provides tools for multiple imputation,
multivariate imputation by chained equations
provides ML estimation
for multivariate normal data with missing values,
provides multiple imputation for mixed categorical and continuous data.
provides multiple imputation for
missing panel data.
provides methods for the visualisation as well as imputation of missing data.
provide further imputation methods.
deals with estimation models where the missing data pattern is monotone.
Latent variable approaches
in stats provides factor analysis by maximum
likelihood, Bayesian factor analysis
is provided for Gaussian, ordinal and mixed variables in
offers GPA (gradient projection algorithm) factor rotation.
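As a sketch of maximum likelihood factor analysis, the following assumes factanal() from stats, applied to the built-in ability.cov covariance matrix. The choice of two factors is illustrative.

```r
# ability.cov is a list holding a covariance matrix, centres and n.obs,
# which factanal() accepts directly through covmat.
fa <- factanal(factors = 2, covmat = ability.cov)

L <- loadings(fa)    # varimax-rotated loadings (the default rotation)
print(L, cutoff = 0.3)

fa$uniquenesses      # variance of each variable not explained by the factors
```

A different rotation can be requested through the rotation argument, for example rotation = "promax".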
provides factor analysis solved using genetic algorithms.
fits linear structural equation models and
latent trait models under item response theory and a range of extensions to Rasch models can be found in
provides a wide range of Factor Analysis methods, including
for multiple and hierarchical multiple factor analysis as well as
for multiple factor analysis of quantitative and qualitative data.
provides factor analysis for time series.
provides latent class and latent class regression models for a variety of outcome variables.
Modelling non-Gaussian data
provides Bayesian multinomial probit models,
polychoric and tetrachoric
provides a range of models such as seemingly
unrelated regression, multinomial logit/probit, multivariate probit and instrumental
provides Vector Generalised Linear and Additive Models, Reduced Rank regression
As a vector- and matrix-based language, base R ships with many powerful tools for
doing matrix manipulations, which are complemented by the packages
adds functions for matrix differential calculus. Some further sparse matrix functionality is also available from
for matrices and passes multiple functions. In addition to functions listed earlier,
provides operations such as marginalisation, affine transformations and graphics for the multivariate skew normal and skew t distribution.
provides for vector auto-regression.
bootstraps repeated measures models.
also provides a range of statistics based on Cohen's kappa including weighted measures and agreement among more than 2 raters.
provides functions for multivariate optimisation.
provides plotting of geometric objects in GGobi.