Type: | Package |
Title: | Robust Distance-Based Visualization and Analysis of Mixed-Type Data |
Version: | 1.0.0 |
Date: | 2025-09-16 |
Author: | Marcos Álvarez [aut], Eva Boj [aut, cre], Aurea Grané [aut] |
Maintainer: | Eva Boj <evaboj@ub.edu> |
Description: | Robust distance-based methods applied to matrices and data frames, producing distance matrices that can be used as input for various visualization techniques such as graphs, heatmaps, or multidimensional scaling configurations. See Boj and Grané (2024) <doi:10.1016/j.seps.2024.101992>. |
License: | GPL-3 |
Repository: | CRAN |
Encoding: | UTF-8 |
Imports: | MASS, proxy, ade4, Rdpack, dbstats, StatMatch, RColorBrewer, pheatmap, qgraph, GGally, ggplot2, vegan |
RdMacros: | Rdpack |
LazyData: | true |
RoxygenNote: | 7.3.2 |
Depends: | R (≥ 4.5) |
Suggests: | testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
NeedsCompilation: | no |
Packaged: | 2025-09-16 09:19:50 UTC; evaboj |
Date/Publication: | 2025-09-22 07:50:12 UTC |
High-correlation dataset with contamination
Description
Synthetic dataset generated from a multivariate normal distribution with
strong correlation structure (\rho = 0.8). It contains 550 observations
and 10 variables of mixed type (continuous, categorical, binary, and weights).
The last 50 rows correspond to contaminated observations created by adding
perturbations equal to three times the standard deviation of each quantitative
variable to a subset of original units. This results in a controlled 10%
contamination level. These data follow the design in
(Boj and Grané 2024).
Usage
Data_HC_contamination
Format
A data frame with 550 rows and 10 variables:
- V1
Continuous variable 1
- V2
Continuous variable 2
- V3
Continuous variable 3
- V4
Continuous variable 4
- V5
Categorical variable 1 (3 categories, approx. balanced)
- V6
Categorical variable 2 (3 categories, approx. balanced)
- V7
Categorical variable 3 (4 categories, uniform distribution)
- V8
Binary variable 1 (40% zeros, 60% ones)
- V9
Binary variable 2 (60% zeros, 40% ones)
- w_loop
Observation weights derived from the joint distribution of V5 and V8, following a proportional frequency-based scheme.
Details
Continuous variables were drawn directly from the multivariate normal sample.
Binary and categorical variables were obtained by discretizing normal margins using percentile-based thresholds.
Contaminated observations (rows 501–550) were generated by perturbing original cases with fluctuations of 3 SD.
The weighting scheme prioritizes frequent category combinations.
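The generation steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the script used to build the packaged data: the seed, the equicorrelated covariance matrix, the helper to_classes(), and the exact percentile thresholds are hypothetical, and the weight column w_loop is omitted.

```r
set.seed(2024)                      # arbitrary seed, for reproducibility only
n   <- 500                          # clean observations
rho <- 0.8                          # common correlation (high-correlation case)
p   <- 9                            # latent normal margins behind V1..V9

# Equicorrelated covariance matrix (assumed) and a multivariate normal sample
Sigma <- matrix(rho, p, p)
diag(Sigma) <- 1
Z <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

# Hypothetical helper: discretize a margin into k balanced classes via percentiles
to_classes <- function(z, k) {
  cut(z, breaks = quantile(z, probs = seq(0, 1, length.out = k + 1)),
      labels = FALSE, include.lowest = TRUE)
}

clean <- data.frame(
  V1 = Z[, 1], V2 = Z[, 2], V3 = Z[, 3], V4 = Z[, 4],
  V5 = factor(to_classes(Z[, 5], 3)),
  V6 = factor(to_classes(Z[, 6], 3)),
  V7 = factor(to_classes(Z[, 7], 4)),
  V8 = as.integer(Z[, 8] > quantile(Z[, 8], 0.4)),   # 40% zeros, 60% ones
  V9 = as.integer(Z[, 9] > quantile(Z[, 9], 0.6))    # 60% zeros, 40% ones
)

# Contaminate copies of 50 original units: shift each continuous variable by 3 SD
idx   <- sample.int(n, 50)
extra <- clean[idx, ]
sds   <- sapply(clean[, 1:4], sd)
extra[, 1:4] <- sweep(as.matrix(extra[, 1:4]), 2, 3 * sds, "+")

sim <- rbind(clean, extra)          # rows 501-550 are the contaminated cases
dim(sim)
```

Appending the perturbed copies after the clean block keeps the contaminated cases in a known position, which matches the row layout described for this dataset.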
References
Boj E, Grané A (2024). “The robustification of distance-based linear models: Some proposals.” Socio-Economic Planning Sciences, 95, 101992.
High-correlation dataset without contamination
Description
Synthetic dataset generated from a multivariate normal distribution with
strong correlation structure (\rho = 0.8). It contains 500 observations
and 10 variables of mixed type (continuous, categorical, binary, and weights).
No contaminated cases were added in this version, so the dataset represents
a clean scenario with 0% contamination. These data follow the design in
(Boj and Grané 2024).
Usage
Data_HC_no_contamination
Format
A data frame with 500 rows and 10 variables:
- V1
Continuous variable 1
- V2
Continuous variable 2
- V3
Continuous variable 3
- V4
Continuous variable 4
- V5
Categorical variable 1 (3 categories, approx. balanced)
- V6
Categorical variable 2 (3 categories, approx. balanced)
- V7
Categorical variable 3 (4 categories, uniform distribution)
- V8
Binary variable 1 (40% zeros, 60% ones)
- V9
Binary variable 2 (60% zeros, 40% ones)
- w_loop
Observation weights derived from the joint distribution of V5 and V8, following a proportional frequency-based scheme.
Details
Continuous variables were drawn directly from the multivariate normal sample.
Binary and categorical variables were obtained by discretizing normal margins using percentile-based thresholds.
Unlike other datasets in this collection, no artificial contamination was introduced here.
The weighting scheme prioritizes frequent category combinations.
References
Boj E, Grané A (2024). “The robustification of distance-based linear models: Some proposals.” Socio-Economic Planning Sciences, 95, 101992.
Moderate-correlation dataset with contamination
Description
Synthetic dataset generated from a multivariate normal distribution with
moderate correlation structure (\rho = 0.6). It contains 525 observations
and 10 variables of mixed type (continuous, categorical, binary, and weights).
The last 25 rows correspond to contaminated observations created by adding
perturbations equal to three times the standard deviation of each quantitative
variable to a subset of original units. This results in a controlled 5%
contamination level. These data follow the design in
(Boj and Grané 2024).
Usage
Data_MC_contamination
Format
A data frame with 525 rows and 10 variables:
- V1
Continuous variable 1
- V2
Continuous variable 2
- V3
Continuous variable 3
- V4
Continuous variable 4
- V5
Categorical variable 1 (3 categories, approx. balanced)
- V6
Categorical variable 2 (3 categories, approx. balanced)
- V7
Categorical variable 3 (4 categories, uniform distribution)
- V8
Binary variable 1 (40% zeros, 60% ones)
- V9
Binary variable 2 (60% zeros, 40% ones)
- w_loop
Observation weights derived from the joint distribution of V5 and V8, following a proportional frequency-based scheme.
Details
Continuous variables were drawn directly from the multivariate normal sample.
Binary and categorical variables were obtained by discretizing normal margins using percentile-based thresholds.
Contaminated observations (rows 501–525) were generated by perturbing original cases with fluctuations of 3 SD.
The weighting scheme prioritizes frequent category combinations.
References
Boj E, Grané A (2024). “The robustification of distance-based linear models: Some proposals.” Socio-Economic Planning Sciences, 95, 101992.
Moderate-correlation dataset without contamination
Description
Synthetic dataset generated from a multivariate normal distribution with
moderate correlation structure (\rho = 0.6). It contains 500 observations
and 10 variables of mixed type (continuous, categorical, binary, and weights).
No contaminated cases were added in this version, so the dataset represents
a clean scenario with 0% contamination. These data follow the design in
(Boj and Grané 2024).
Usage
Data_MC_no_contamination
Format
A data frame with 500 rows and 10 variables:
- V1
Continuous variable 1
- V2
Continuous variable 2
- V3
Continuous variable 3
- V4
Continuous variable 4
- V5
Categorical variable 1 (3 categories, approx. balanced)
- V6
Categorical variable 2 (3 categories, approx. balanced)
- V7
Categorical variable 3 (4 categories, uniform distribution)
- V8
Binary variable 1 (40% zeros, 60% ones)
- V9
Binary variable 2 (60% zeros, 40% ones)
- w_loop
Observation weights derived from the joint distribution of V5 and V8, following a proportional frequency-based scheme.
Details
Continuous variables were drawn directly from the multivariate normal sample.
Binary and categorical variables were obtained by discretizing normal margins using percentile-based thresholds.
Unlike other datasets in this collection, no artificial contamination was introduced here.
The weighting scheme prioritizes frequent category combinations.
References
Boj E, Grané A (2024). “The robustification of distance-based linear models: Some proposals.” Socio-Economic Planning Sciences, 95, 101992.
Compute Distance or Similarity Matrices
Description
Computes a distance or similarity matrix between rows of a data frame or matrix, supporting a wide variety of distance metrics.
Usage
calculate_distances(
x,
method = "gower",
output_format = "dist",
squared = FALSE,
p = NULL,
similarity_transform = "linear",
...
)
Arguments
x |
A matrix or data.frame. Each row represents an observation. |
method |
A string specifying the distance/similarity method. Supported:
|
output_format |
Output format: |
squared |
Logical; if |
p |
Numeric; the power parameter for the Minkowski distance (required if |
similarity_transform |
Character string; if
|
... |
Additional arguments passed to underlying functions. |
Details
When output_format = "similarity", the function transforms computed distances into similarity scores using one of the supported transformations.
The similarity transformation options are:
- "linear"
Direct inversion of distance: s_{ij} = 1 - \delta_{ij}.
- "sqrt"
Squared distance inversion: s_{ij} = 1 - \delta_{ij}^2, which may better preserve Euclidean properties.
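The two transformations can be checked by hand on a toy distance matrix. The sketch below uses base R only; rescaling the distances to [0, 1] before the transformation is an assumption made here so that the similarities stay in [0, 1], and it may not reflect what calculate_distances() does internally.

```r
# Toy data: three collinear points, so the distances are 5, 10 and 5
x <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)
delta <- as.matrix(dist(x))
delta <- delta / max(delta)      # assumption: distances are rescaled to [0, 1]

s_linear <- 1 - delta            # "linear": s_ij = 1 - delta_ij
s_sqrt   <- 1 - delta^2          # "sqrt":   s_ij = 1 - delta_ij^2

round(s_linear, 3)
round(s_sqrt, 3)
```

For the pair (1, 2) the rescaled distance is 0.5, so the linear similarity is 0.5 and the "sqrt" similarity is 0.75.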
Value
Depending on output_format, returns:
- a dist object (if output_format = "dist")
- a numeric matrix (if output_format = "matrix" or output_format = "similarity")
See Also
dist (stats) for basic distance measures,
dist.binary (ade4) for binary distances,
dist (proxy) for advanced metrics like cosine or correlation
Examples
# Load example dataset
data("Data_HC_contamination", package = "dbrobust")
df <- Data_HC_contamination
# --- Quick Example ---
numeric_data <- df[1:10, 1:4] # subset for speed
d_euclid <- calculate_distances(
numeric_data,
method = "euclidean",
output_format = "matrix"
)
# Load example dataset
data("Data_HC_contamination", package = "dbrobust")
df <- Data_HC_contamination[1:20,]
# Example 1: Euclidean distance (numeric variables only)
numeric_data <- df[, 1:4]
d_euclid <- calculate_distances(
numeric_data,
method = "euclidean",
output_format = "matrix"
)
# Example 2: Manhattan distance
d_manhattan <- calculate_distances(
numeric_data,
method = "manhattan",
output_format = "matrix"
)
# Example 3: Categorical distance using Matching Coefficient
categorical_data <- df[, 5:7]
d_match <- calculate_distances(
categorical_data,
method = "matching_coefficient",
output_format = "matrix"
)
# Example 4: Mixed data distance using Gower (automatic type detection, asymmetric binary)
d_gower_asym <- calculate_distances(
df,
method = "gower",
output_format = "dist",
binary_asym = TRUE
)
# Example 5: Minkowski distance with p = 3
d_minkowski <- calculate_distances(
numeric_data,
method = "minkowski",
p = 3,
output_format = "matrix"
)
# Example 6: Jaccard distance for binary variables
binary_data <- df[, 8:9]
d_jaccard <- calculate_distances(
binary_data,
method = "jaccard",
output_format = "matrix"
)
# Example 7: Mahalanobis distance
d_mahal <- calculate_distances(
numeric_data,
method = "mahalanobis",
output_format = "matrix"
)
# Example 8: Manual selection of variables for Gower distance
continuous_vars <- 1:4
binary_vars <- 8:9
categorical_vars <- 5:7
d_gower_manual <- calculate_distances(
df,
method = "gower",
output_format = "dist",
continuous_cols = continuous_vars,
binary_cols = binary_vars,
categorical_cols = categorical_vars
)
Convert a similarity or distance matrix to a 'dist' object
Description
This function converts a similarity matrix (with values between 0 and 1 and 1s on the diagonal) or a distance matrix into a 'dist' object. The user can specify the method used to transform similarity values into distances.
Usage
convert_to_dist(dist_mat, similarity_transform = c("linear", "sqrt"))
Arguments
dist_mat |
A square matrix (similarity or distance) or a 'dist' object. |
similarity_transform |
Method to convert similarity to distance. Either
|
Value
An object of class 'dist'.
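As an illustration of the conversion, the sketch below turns a small similarity matrix into 'dist' objects with base R. It assumes the inverse formulas \delta_{ij} = 1 - s_{ij} for "linear" and \delta_{ij} = \sqrt{1 - s_{ij}} for "sqrt"; the matrix S is made up, and convert_to_dist() may apply additional checks.

```r
# Hypothetical similarity matrix: values in [0, 1], ones on the diagonal
S <- matrix(c(1.00, 0.75, 0.00,
              0.75, 1.00, 0.75,
              0.00, 0.75, 1.00), nrow = 3, byrow = TRUE)

d_linear <- as.dist(1 - S)          # assumed "linear" inverse: delta = 1 - s
d_sqrt   <- as.dist(sqrt(1 - S))    # assumed "sqrt" inverse:   delta = sqrt(1 - s)

d_linear
d_sqrt
```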
Compute pairwise binary distances
Description
Internal helper function to compute pairwise distances between binary vectors
using standard binary distance/similarity measures. Delegates to
ade4::dist.binary
when available for performance.
Usage
dist_binary(x, method)
Arguments
x |
A numeric matrix or data frame of binary values (0/1, TRUE/FALSE, or NA) |
method |
A character string specifying the binary distance measure to use. |
Details
Supported methods (for two binary vectors x_i and x_j):
- "jaccard": d = 1 - \frac{a}{a + b + c}
- "dice": d = 1 - \frac{2a}{2a + b + c}
- "sokal_michener": d = 1 - \frac{a + d}{a + b + c + d}
- "russell_rao": d = 1 - \frac{a}{a + b + c + d}
- "sokal_sneath": d = 1 - \frac{a}{a + 1/2(b + c)}
- "kulczynski": d = 1 - \frac{1}{2}\left(\frac{a}{a+b} + \frac{a}{a+c}\right)
- "hamming": d = 1 - \frac{a + d}{a + b + c + d}
Where:
- a = number of positions where both vectors are 1
- b = number of positions where x_i = 1 and x_j = 0
- c = number of positions where x_i = 0 and x_j = 1
- d = number of positions where both vectors are 0
The Sokal-Michener coefficient is equivalent to the normalized Hamming distance.
Factors or character columns are converted to numeric 0/1.
Missing values (NA) are ignored pairwise; if all entries are missing, distance is NA.
Methods supported by ade4 (e.g., Jaccard, Dice, Sokal-Michener, etc.) are computed via ade4::dist.binary for efficiency.
Manual computations are implemented for Hamming and Kulczynski if ade4 is unavailable.
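The counts a, b, c, d and two of the formulas above can be evaluated by hand as follows. This is a base R sketch for a single pair of rows; pair_counts() is a hypothetical helper written for this illustration and is not part of the package.

```r
mat <- matrix(c(1, 0, 1,
                1, 1, 0,
                0, 1, 1), nrow = 3, byrow = TRUE)

# Hypothetical helper: contingency counts for one pair, dropping NA positions
pair_counts <- function(xi, xj) {
  ok <- !is.na(xi) & !is.na(xj)
  xi <- xi[ok]
  xj <- xj[ok]
  c(a = sum(xi == 1 & xj == 1),
    b = sum(xi == 1 & xj == 0),
    c = sum(xi == 0 & xj == 1),
    d = sum(xi == 0 & xj == 0))
}

k <- pair_counts(mat[1, ], mat[2, ])

# Formulas as listed in Details
d_jaccard        <- unname(1 - k["a"] / (k["a"] + k["b"] + k["c"]))
d_sokal_michener <- unname(1 - (k["a"] + k["d"]) / sum(k))

k
c(jaccard = d_jaccard, sokal_michener = d_sokal_michener)
```

For rows 1 and 2 the counts are a = 1, b = 1, c = 1, d = 0, so both distances equal 2/3; the two measures differ only when d > 0.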
Value
A symmetric numeric matrix of pairwise distances. NA is returned for pairs with no valid comparisons (all NA entries).
Examples
# Small example with binary matrix
mat <- matrix(c(
1, 0, 1,
1, 1, 0,
0, 1, 1
), nrow = 3, byrow = TRUE)
# Example with Jaccard
dbrobust::dist_binary(mat, method = "jaccard")
# Example with Hamming
dbrobust::dist_binary(mat, method = "hamming")
Compute pairwise distances for categorical data
Description
Internal helper function to compute distances between observations based on the matching coefficient, which measures the proportion of matching attributes between two categorical vectors. This approach is particularly useful for multiclass categorical variables.
Usage
dist_categorical(x, method = "matching_coefficient")
Arguments
x |
A data frame or matrix containing only categorical variables (factor or character) |
method |
Currently only |
Details
The distance between two observations i and j is defined as:
d(i, j) = 1 - \frac{\alpha}{p^\prime}
where \alpha is the number of matching attributes (agreements) and p^\prime is the number of non-missing comparisons between the two observations.
Only categorical columns (factor or character) are supported; numeric columns must be converted prior to using this function.
Missing values (NA) are ignored pairwise. If all attributes are missing for a given pair, the distance is returned as NA.
This distance is equivalent to the normalized Hamming distance when applied to binary variables.
The matching coefficient satisfies metric properties and can be used as a building block for mixed-type distances (e.g., combined with quantitative distances via Gower's similarity).
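A direct base R evaluation of the formula d(i, j) = 1 - \alpha / p^\prime is sketched below. The function matching_distance() is a hypothetical stand-in written for this illustration, not the package implementation.

```r
df <- data.frame(
  A = c("red", "blue", "red"),
  B = c("circle", "circle", "square"),
  stringsAsFactors = TRUE
)

# Hypothetical helper implementing 1 - alpha / p' with pairwise NA handling
matching_distance <- function(x) {
  x <- as.matrix(x)                       # compare category labels as strings
  n <- nrow(x)
  D <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      ok    <- !is.na(x[i, ]) & !is.na(x[j, ])   # non-missing comparisons (p')
      agree <- sum(x[i, ok] == x[j, ok])         # matching attributes (alpha)
      D[i, j] <- if (any(ok)) 1 - agree / sum(ok) else NA_real_
    }
  }
  D
}

D <- matching_distance(df)
D
```

Observations 1 and 2 agree on one of two attributes (distance 0.5), while observations 2 and 3 agree on none (distance 1).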
Value
A symmetric numeric matrix of pairwise distances. Distance is in the range [0, 1], where 0 indicates complete agreement and 1 indicates complete disagreement. NA is returned for pairs with no valid comparisons (all NA entries).
Examples
# Small categorical dataset
df <- data.frame(
A = factor(c("red", "blue", "red")),
B = factor(c("circle", "circle", "square"))
)
# Compute matching coefficient
dbrobust::dist_categorical(df)
Compute pairwise distances for continuous numeric data
Description
Internal helper function to compute pairwise distance matrices for purely numeric datasets. Supports standard metrics, including Euclidean, Manhattan, Chebyshev, Canberra, Minkowski, standardized Euclidean, and Mahalanobis distances.
Usage
dist_continuous(x, method, p = NULL)
Arguments
x |
A numeric data frame or matrix with rows as observations and columns as variables. |
method |
Distance metric to compute (see details for supported options). |
p |
Numeric, the power parameter for Minkowski distance (required if |
Details
Supported methods and formulas (for observations \mathbf{z}_i and \mathbf{z}_j):
- "euclidean": \delta_E(i,j) = \sqrt{\sum_{k=1}^{p} (z_{ik} - z_{jk})^2}
- "minkowski": \delta_q(i,j) = \left( \sum_{k=1}^{p} |z_{ik} - z_{jk}|^q \right)^{1/q} (requires p = q)
- "manhattan": \delta_1(i,j) = \sum_{k=1}^{p} |z_{ik} - z_{jk}|
- "maximum": \delta_\infty(i,j) = \max_k |z_{ik} - z_{jk}|
- "canberra": \delta_C(i,j) = \sum_{k=1}^{p} \frac{|z_{ik} - z_{jk}|}{|z_{ik}| + |z_{jk}|} (convention: 0/0 := 0)
- "euclidean_standardized": \delta_K(i,j) = \sqrt{\sum_{k=1}^{p} \frac{(z_{ik} - z_{jk})^2}{s_k^2}}, where s_k^2 is the variance of variable k
- "mahalanobis": \delta_M(i,j) = \sqrt{ (\mathbf{z}_i - \mathbf{z}_j)' \mathbf{S}^{-1} (\mathbf{z}_i - \mathbf{z}_j) }, where \mathbf{S} is the covariance matrix
Considerations when choosing a distance metric:
For "euclidean_standardized", columns are standardized to mean 0 and variance 1 before computing Euclidean distances.
Cosine and correlation distances rely on the proxy package; these are not guaranteed to be strictly Euclidean.
Minkowski distance requires specifying the parameter p (e.g., p = 3 for L3 norm).
Mahalanobis distance uses the inverse of the covariance matrix. If the covariance matrix is singular, the generalized inverse from MASS::ginv is used.
Standard metrics (Euclidean, Manhattan, Maximum, Canberra) are computed using stats::dist.
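Three of these formulas can be reproduced with base R alone, which may help when checking results. The sketch below is an independent illustration on random data, not the package code; it assumes a non-singular sample covariance matrix, so the MASS::ginv fallback is not needed.

```r
set.seed(123)
x <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)

# Standardized Euclidean: Euclidean distance on columns scaled to unit variance
d_std <- as.matrix(dist(scale(x)))

# Mahalanobis: whiten with the Cholesky factor of S, then take Euclidean distances
S <- cov(x)
R <- chol(S)                          # S = t(R) %*% R
d_mahal <- as.matrix(dist(x %*% solve(R)))

# Minkowski with q = 3 is available directly in stats::dist
d_mink <- as.matrix(dist(x, method = "minkowski", p = 3))

# Direct evaluation of the formulas for the pair (1, 2), as a check
diff12     <- x[1, ] - x[2, ]
d_std_12   <- sqrt(sum(diff12^2 / apply(x, 2, var)))
d_mahal_12 <- sqrt(drop(t(diff12) %*% solve(S) %*% diff12))
d_mink_12  <- sum(abs(diff12)^3)^(1 / 3)

c(d_std[1, 2], d_std_12)
c(d_mahal[1, 2], d_mahal_12)
c(d_mink[1, 2], d_mink_12)
```

Whitening with the Cholesky factor gives all pairwise Mahalanobis distances in one call to dist(), which avoids looping over pairs.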
Value
A symmetric numeric matrix of pairwise distances between rows of x.
The diagonal contains zeros.
Examples
# Small numeric matrix
mat <- matrix(c(1, 2, 3,
4, 5, 6,
7, 8, 9), nrow = 3, byrow = TRUE)
# Euclidean distance
dbrobust::dist_continuous(mat, method = "euclidean")
# Standardized Euclidean
dbrobust::dist_continuous(mat, method = "euclidean_standardized")
# Minkowski distance with p = 3
dbrobust::dist_continuous(mat, method = "minkowski", p = 3)
# Mahalanobis distance
set.seed(123)
mat <- matrix(rnorm(5*3), nrow = 5, ncol = 3)
colnames(mat) <- c("X1","X2","X3")
# Compute the mahalanobis distance
dbrobust::dist_continuous(mat, method = "mahalanobis")
# Cosine distance (requires 'proxy' package)
dbrobust::dist_continuous(mat, method = "cosine")
Compute Gower dissimilarity for mixed-type data
Description
Internal helper function to compute pairwise dissimilarities for datasets containing a mix of continuous, binary, and categorical variables using Gower's method (Gower 1971).
Usage
dist_mixed(
x,
continuous_cols = NULL,
binary_cols = NULL,
categorical_cols = NULL,
binary_asym = FALSE
)
Arguments
x |
A data frame with rows as observations and columns as variables. |
continuous_cols |
Optional numeric indices or column names for continuous variables. |
binary_cols |
Optional numeric indices or column names for binary variables. |
categorical_cols |
Optional numeric indices or column names for categorical/multiclass variables. |
binary_asym |
Logical; if TRUE, binary variables are treated as asymmetric (only 1/1 counts as match). |
Details
Continuous, binary, and categorical columns can be automatically detected,
or explicitly specified by the user via continuous_cols, binary_cols,
and categorical_cols.
Continuous, binary, and categorical columns are combined into a single dissimilarity measure following Gower's approach.
Continuous variables are scaled by their range.
Binary variables can be treated as symmetric (0/0 and 1/1 count as match) or asymmetric (only 1/1 counts as match).
Categorical variables are compared using simple matching.
Missing values are ignored pairwise.
Advantages:
Low computational cost.
Works naturally with mixed-type data.
Limitations:
Neglects potential correlations among quantitative variables.
Sensitive to outliers, which can affect robustness.
May overemphasize categorical differences in mixed-data settings.
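A hand computation of the classical Gower dissimilarity for a three-row data frame is sketched below in base R. It assumes equal weights for the variables, range scaling for the continuous column, and symmetric treatment of the binary column; the output of dist_mixed() may differ if the package scales or combines the components differently.

```r
df <- data.frame(
  height = c(170, 160, 180),
  gender = factor(c("M", "F", "M")),
  smoker = c(1, 0, 1)
)

# Per-variable dissimilarity matrices, each with values in [0, 1]
d_height <- abs(outer(df$height, df$height, "-")) / diff(range(df$height))
d_gender <- outer(as.character(df$gender), as.character(df$gender), "!=") * 1
d_smoker <- outer(df$smoker, df$smoker, "!=") * 1    # symmetric binary matching

# Equal-weight average across the three variables (assumed)
gower <- (d_height + d_gender + d_smoker) / 3
round(gower, 4)
```

Rows 1 and 3 differ only in height (dissimilarity 0.5 / 3), while rows 2 and 3 differ on every variable and sit at opposite ends of the height range (dissimilarity 1).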
Value
A symmetric numeric matrix of pairwise dissimilarities in [0,1].
References
Gower JC (1971). “A general coefficient of similarity and some of its properties.” Biometrics, 857–871.
Examples
# Small example: Compute classical Gower for a simulated data frame
df <- data.frame(
height = c(170, 160, 180),
gender = factor(c("M", "F", "M")),
smoker = c(1, 0, 1)
)
# Compute Gower dissimilarities automatically detecting types
dbrobust::dist_mixed(df)
# Manual type specification
cont_cols <- "height"
cat_cols <- NULL
bin_cols <- c("gender","smoker")
dbrobust::dist_mixed(
x = df,
continuous_cols = cont_cols,
categorical_cols = cat_cols,
binary_cols = bin_cols
)
Format distance or similarity matrix output
Description
Converts a distance matrix to either a similarity matrix or a 'dist' object, depending on user preferences.
Usage
format_output(
dist_mat,
output_format,
similarity = FALSE,
similarity_transform = "linear"
)
Arguments
dist_mat |
A symmetric matrix of pairwise distances. |
output_format |
Character string specifying output format:
|
similarity |
Logical; if |
similarity_transform |
Character string; either |
Details
When converting to similarity, two transformation formulas are supported to derive similarity from distance:
- "linear" (default)
s_{ij} = 1 - \delta_{ij}
This transformation directly inverts the distance into a similarity score.
- "sqrt"
s_{ij} = 1 - \delta_{ij}^2
This corresponds to a transformation from a metric that satisfies the Euclidean property:
\delta_{ij} = \sqrt{1 - s_{ij}}
According to (Gower and Legendre 1986), this transformation yields a metric that is more likely to preserve Euclidean structure in downstream analyses.
Value
A matrix or 'dist' object, depending on the selected format and similarity flag.
References
Gower JC, Legendre P (1986). “Metric and Euclidean properties of dissimilarity coefficients.” Journal of classification, 3, 5–48.
Generate a Custom Color Palette
Description
Returns a vector of distinct colors for use in plotting or annotation. Uses a predefined set of base colors and interpolates additional colors if needed.
Usage
get_custom_palette(n)
Arguments
n |
Integer. Number of colors required. |
Value
A character vector of length n containing hexadecimal color codes.
Force a Pairwise Squared Distance Matrix to Euclidean Form
Description
Given a pairwise squared distance matrix D (where D[i,j] = d(i,j)^2),
this function ensures that D corresponds to a valid Euclidean squared
distance matrix. The correction is based on the weighted Gram matrix
G_w = -\frac{1}{2} J_w D J_w^\top, where J_w = I_n - \mathbf{1} w^\top
is the centering matrix defined by the weight vector w.
Usage
make_euclidean(D, w, tol = 1e-10)
Arguments
D |
Numeric square matrix (n x n) of pairwise squared distances. Must be symmetric with zeros on the diagonal. |
w |
Numeric vector of weights (length n). Internally normalized to sum to 1. |
tol |
Numeric tolerance for detecting negative eigenvalues (default: 1e-10). |
Details
If the smallest eigenvalue \lambda_{\min} of G_w is below the
negative tolerance -tol, the function corrects D by adding a
constant shift to guarantee positive semi-definiteness of the Gram matrix,
following the approach of (Lingoes 1971) and (Mardia 1978):
D_{\text{new}} = D + 2 c \mathbf{1} \mathbf{1}^\top - 2 c I_n,
where c = |\lambda_{\min}|.
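The correction can be traced step by step on a 3 x 3 example. The sketch below is a base R illustration under the assumption of uniform weights, not the package implementation; gram() and c_shift are names introduced here. The input violates the triangle inequality (1 + 1 < 3), so its Gram matrix has a negative eigenvalue.

```r
# Squared "distances" 1, 9, 1: the underlying distances 1, 3, 1 are not Euclidean
D <- matrix(c(0, 1, 9,
              1, 0, 1,
              9, 1, 0), nrow = 3, byrow = TRUE)
n   <- nrow(D)
w   <- rep(1 / n, n)                        # uniform weights summing to 1 (assumed)
tol <- 1e-10

# Hypothetical helper: weighted Gram matrix G_w = -1/2 J_w D J_w'
gram <- function(D, w) {
  J <- diag(length(w)) - matrix(1, length(w), 1) %*% t(w)   # J_w = I_n - 1 w'
  -0.5 * J %*% D %*% t(J)
}

eig_before <- eigen(gram(D, w), symmetric = TRUE, only.values = TRUE)$values
lambda_min <- min(eig_before)

D_new <- D
if (lambda_min < -tol) {
  c_shift <- abs(lambda_min)
  # D + 2c 11' - 2c I: adds 2c off the diagonal and leaves the diagonal at zero
  D_new <- D + 2 * c_shift * (matrix(1, n, n) - diag(n))
}
eig_after <- eigen(gram(D_new, w), symmetric = TRUE, only.values = TRUE)$values

round(eig_before, 4)
round(eig_after, 4)
```

Here the smallest eigenvalue is -5/6, so c = 5/6, and after the shift the three points become collinear with all eigenvalues non-negative up to rounding error.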
Value
A list with components:
- D_euc
Corrected pairwise squared Euclidean distance matrix (n x n).
- eigvals_before
Eigenvalues of the weighted Gram matrix before correction.
- eigvals_after
Eigenvalues of the weighted Gram matrix after correction.
- transformed
Logical, TRUE if correction was applied, FALSE otherwise.
References
Lingoes JC (1971). “Some boundary conditions for a monotone analysis of symmetric matrices.” Psychometrika, 36(2), 195–203.
Mardia KV (1978). “Some properties of clasical multi-dimesional scaling.” Communications in Statistics-Theory and Methods, 7(13), 1233–1241.
Examples
# Load example dataset
data("Data_HC_contamination")
# Reduce dataset to first 50 rows
Data_small <- Data_HC_contamination[1:50, ]
# Select only continuous variables
cont_vars <- names(Data_small)[1:4]
Data_cont <- Data_small[, cont_vars]
# Compute squared Euclidean distance matrix
dist_mat <- as.matrix(dist(Data_cont))^2
# Introduce a small non-Euclidean distortion
dist_mat[1, 2] <- dist_mat[1, 2] * 0.5
dist_mat[2, 1] <- dist_mat[1, 2]
# Uniform weights
weights <- rep(1, nrow(Data_cont))
# Apply Euclidean correction
res <- make_euclidean(dist_mat, weights)
# Check results (minimum eigenvalues before/after)
res$transformed
min(res$eigvals_before)
min(res$eigvals_after)
# First 5x5 block of corrected matrix
round(res$D_euc[1:5, 1:5], 4)
Visualize a Distance or Similarity Matrix as a Heatmap with Clustering
Description
This function creates a heatmap from a square distance or similarity matrix. If a similarity matrix is provided, it should first be converted to a distance matrix by the user. The function supports hierarchical clustering, group annotations, row/column sampling (random or stratified), and various customization options.
Usage
plot_heatmap(
dist_mat,
max_n = 50,
group = NULL,
stratified_sampling = FALSE,
main_title = NULL,
palette = "YlOrRd",
clustering_method = "complete",
cluster_rows = TRUE,
cluster_cols = TRUE,
fontsize_row = 10,
fontsize_col = 10,
show_rownames = TRUE,
show_colnames = TRUE,
border_color = "grey60",
annotation_legend = TRUE,
seed = 123
)
Arguments
dist_mat |
A square distance matrix (numeric matrix) or a |
max_n |
Integer. Maximum number of observations (rows/columns) to display.
If the matrix exceeds this size, a subset of |
group |
Optional vector or factor providing group labels for rows/columns, used for color annotation. |
stratified_sampling |
Logical. If |
main_title |
Optional character string specifying the main title of the heatmap. |
palette |
Character string specifying the RColorBrewer palette for heatmap cells. Default is |
clustering_method |
Character string specifying the hierarchical clustering method,
as accepted by |
cluster_rows |
Logical, whether to perform hierarchical clustering on rows. Default is |
cluster_cols |
Logical, whether to perform hierarchical clustering on columns. Default is |
fontsize_row |
Integer specifying the font size of row labels. Default is 10. |
fontsize_col |
Integer specifying the font size of column labels. Default is 10. |
show_rownames |
Logical, whether to display row names. Default is |
show_colnames |
Logical, whether to display column names. Default is |
border_color |
Color of the cell borders in the heatmap. Default is |
annotation_legend |
Logical, whether to display the legend for group annotations. Default is |
seed |
Integer. Random seed used when sampling rows/columns if |
Details
The function works as follows:
Converts dist objects to matrices automatically.
Samples rows/columns if the matrix is larger than max_n. Sampling can be random or stratified by group.
In stratified sampling mode, each group contributes at least one observation if possible.
Supports row annotations for groups and automatically assigns colors.
Uses pheatmap for plotting with customizable clustering, labels, fonts, and colors.
This function is used internally by visualize_distances() but can be called directly for advanced usage.
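One possible way to draw a stratified subset before plotting is sketched below in base R. The allocation rule (proportional share, rounded down, with at least one unit per group) is an assumption made for this illustration and may not be the rule used by plot_heatmap(); with this simple rule the subset size can deviate from max_n by a few units.

```r
set.seed(123)
group <- iris$Species
max_n <- 10

# Proportional allocation per group, with at least one unit per group (assumed rule)
counts <- table(group)
alloc  <- pmax(floor(max_n * as.vector(counts) / length(group)), 1)
names(alloc) <- names(counts)

# Draw the allocated number of row indices within each group
idx <- unlist(lapply(names(alloc), function(g) {
  members <- which(group == g)
  members[sample.int(length(members), min(alloc[[g]], length(members)))]
}))

# Subset the distance matrix on rows and columns with the same indices
d_sub <- as.matrix(dist(iris[, 1:4]))[idx, idx]
table(group[idx])
```

Indexing rows and columns with the same idx keeps the submatrix square and symmetric, so it can still be treated as a distance matrix.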
Value
Invisibly returns the pheatmap object, allowing further customization if assigned.
See Also
hclust for hierarchical clustering methods.
pheatmap for additional heatmap customization options.
brewer.pal for available color palettes.
Examples
# Example: Euclidean distance heatmap on iris
eucli_dist <- stats::dist(iris[, 1:4])
dbrobust::plot_heatmap(
dist_mat = eucli_dist,
max_n = 10,
group = iris$Species,
stratified_sampling = TRUE,
main_title = "Euclidean Distance Heatmap",
palette = "YlOrRd",
clustering_method = "complete"
)
# Example: GGower distances with small subset
data("Data_HC_contamination", package = "dbrobust")
Data_small <- Data_HC_contamination[1:50, ]
cont_vars <- c("V1", "V2", "V3", "V4")
cat_vars <- c("V5", "V6", "V7")
bin_vars <- c("V8", "V9")
w <- Data_small$w_loop
dist_sq_ggower <- dbrobust::robust_distances(
data = Data_small,
cont_vars = cont_vars,
bin_vars = bin_vars,
cat_vars = cat_vars,
w = w,
alpha = 0.10,
method = "ggower"
)
group_vec <- rep("Normal", nrow(dist_sq_ggower))
group_vec[attr(dist_sq_ggower, "outlier_idx")] <- "Outlier"
group_factor <- factor(group_vec, levels = c("Normal", "Outlier"))
dbrobust::plot_heatmap(
dist_mat = sqrt(dist_sq_ggower),
max_n = 20,
group = group_factor,
main_title = "GGower Heatmap with Outliers",
palette = "YlOrRd",
clustering_method = "complete",
annotation_legend = TRUE,
stratified_sampling = TRUE,
seed = 123
)
Plot MDS Results with Grouped Scatter and Density Plots (Internal)
Description
This internal function performs classical or weighted Multidimensional Scaling (MDS) on a given distance matrix and visualizes the resulting coordinates using a pairwise scatterplot matrix with density plots on the diagonal. Grouping information can be provided for colored visual separation.
Arguments
dist_mat |
A distance matrix or object convertible to a distance matrix. |
k |
Integer. Number of dimensions to retain in MDS (default is 3). |
weights |
Optional numeric vector of weights for weighted MDS. If |
group |
Optional factor or vector indicating group membership for observations, used for coloring plots. |
main_title |
Optional character string for the main plot title. |
Details
This is an internal helper function. It is not recommended to call plot_mds() directly.
Instead, use visualize_distances(), which wraps this function.
Weighted MDS is performed with vegan::wcmdscale if weights are provided;
otherwise, classical MDS (cmdscale) is used. Diagonal panels show density plots
by group, and off-diagonal panels show scatter plots by group.
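The unweighted branch can be illustrated with base R. The sketch below runs cmdscale() on a small Euclidean distance matrix and reproduces the coordinates from the eigendecomposition of the double-centered squared distances. It only covers classical MDS; the weighted case and the GGally plotting step are not shown.

```r
d <- dist(iris[1:30, 1:4])                  # any 'dist' object would do
fit <- cmdscale(d, k = 2, eig = TRUE)       # classical MDS, two dimensions
coords <- fit$points

# The same configuration from the Gram matrix B = -1/2 J D^2 J
D2 <- as.matrix(d)^2
n  <- nrow(D2)
J  <- diag(n) - matrix(1 / n, n, n)         # uniform centering matrix
B  <- -0.5 * J %*% D2 %*% J
ev <- eigen(B, symmetric = TRUE)
coords_manual <- ev$vectors[, 1:2] %*% diag(sqrt(ev$values[1:2]))

head(coords)
head(coords_manual)                         # equal up to the sign of each axis
```

The coordinates in coords can then be passed to any plotting routine, for example a scatterplot colored by a grouping factor.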
Value
A ggmatrix object from GGally representing the
pairs plot with scatterplots and density plots.
Examples
# Load example dataset
data("Data_HC_contamination", package = "dbrobust")
# Subset of 20 rows
Data_small <- Data_HC_contamination[1:20, ]
# Define variable types
cont_vars <- c("V1", "V2", "V3", "V4")
cat_vars <- c("V5", "V6", "V7")
bin_vars <- c("V8", "V9")
# Use column 'w_loop' as weights
w <- Data_small$w_loop
# Compute robust distances using GGower
dist_sq_ggower <- dbrobust::robust_distances(
data = Data_small,
cont_vars = cont_vars,
bin_vars = bin_vars,
cat_vars = cat_vars,
w = w,
alpha = 0.10,
method = "ggower"
)
# Create factor indicating Normal vs Outlier
n_obs <- nrow(dist_sq_ggower)
group_vec <- rep("Normal", n_obs)
group_vec[attr(dist_sq_ggower, "outlier_idx")] <- "Outlier"
group_factor <- factor(group_vec, levels = c("Normal", "Outlier"))
# Plot MDS
dbrobust::plot_mds(
dist_mat = dist_sq_ggower,
k = 2,
group = group_factor,
main_title = "MDS of Data_HC_contamination (GGower) with Outliers"
)
Plot a Network Graph from a Distance Matrix
Description
This internal function visualizes a network graph representation of a distance matrix, where nodes represent observations and edges represent similarity. Groups can be specified for node coloring. A maximum number of nodes can be set to avoid overcrowding, and weak edges are thresholded.
Usage
plot_qgraph(
dist_mat,
group = NULL,
max_nodes = 100,
label_size = 2,
edge_threshold = 0.1,
layout = "spring",
seed = 123,
main_title = NULL
)
Arguments
dist_mat |
A square distance matrix or a |
group |
Optional factor or vector indicating group membership for nodes, used for coloring. |
max_nodes |
Integer. Maximum number of nodes to plot. If the number of observations exceeds this, stratified sampling is performed to reduce the node count. |
label_size |
Numeric. Size of the node labels. |
edge_threshold |
Numeric between 0 and 1. Edges with similarity below this threshold are removed. |
layout |
Character string specifying the layout algorithm for |
seed |
Integer. Random seed used for reproducibility during sampling and layout. |
main_title |
Optional character string specifying the main title of the plot. |
Details
This function is internal and not intended for direct use. It is called by
visualize_distances() to display network graphs of robust distances.
Features:
Converts dist objects to matrices automatically.
Downsamples nodes if the number of observations exceeds max_nodes, using stratified sampling by group.
Normalizes the distance matrix to [0,1] and converts it to similarity (1 - distance).
Removes weak edges below edge_threshold.
Colors nodes according to group membership.
Adds a main title using title() after plotting with qgraph.
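The preprocessing steps listed above (normalize, convert to similarity, threshold) can be sketched in base R as follows. Dividing by the largest distance and zeroing the diagonal are assumptions made for this illustration; the internal normalization in plot_qgraph() may differ.

```r
d <- as.matrix(dist(iris[c(1:5, 51:55), 1:4]))
edge_threshold <- 0.1

d_norm <- d / max(d)               # assumption: rescale by the largest distance
sim <- 1 - d_norm                  # similarity = 1 - normalized distance
diag(sim) <- 0                     # assumption: no self-loops in the graph
sim[sim < edge_threshold] <- 0     # remove weak edges

round(sim[1:3, 1:3], 3)
# The resulting matrix could then serve as a weighted adjacency matrix,
# e.g. qgraph::qgraph(sim, layout = "spring") if qgraph is installed.
```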
Value
Invisibly returns NULL. The plot is drawn as a side effect.
Examples
# --------------------------------------
# Network Graph Example from Robust Distances
# --------------------------------------
data("Data_HC_contamination", package = "dbrobust")
# Subset small dataset
Data_small <- Data_HC_contamination[1:20, ]
cont_vars <- c("V1", "V2", "V3", "V4")
cat_vars <- c("V5", "V6", "V7")
bin_vars <- c("V8", "V9")
w <- Data_small$w_loop
# Compute GGower robust distances
dist_sq_ggower <- dbrobust::robust_distances(
data = Data_small,
cont_vars = cont_vars,
bin_vars = bin_vars,
cat_vars = cat_vars,
w = w,
alpha = 0.10,
method = "ggower"
)
# Create factor indicating Normal vs Outlier
n_obs <- nrow(dist_sq_ggower)
group_vec <- rep("Normal", n_obs)
group_vec[attr(dist_sq_ggower, "outlier_idx")] <- "Outlier"
group_factor <- factor(group_vec, levels = c("Normal", "Outlier"))
# Plot network graph (small, for CRAN)
dbrobust::plot_qgraph(
dist_mat = sqrt(dist_sq_ggower),
group = group_factor,
max_nodes = 10,
label_size = 2,
edge_threshold = 0.1,
layout = "spring",
seed = 123,
main_title = "GGower Network Graph with Outliers"
)
Robust RelMS Distance
Description
Computes a robust version of the Gower distance using the RelMS method for mixed-type data (continuous, binary, categorical). Continuous variables are handled via a robust Mahalanobis distance using a supplied robust covariance matrix. Binary and categorical variables are transformed into distances via similarity coefficients and combined using the RelMS approach.
Usage
robust_RelMS(data, w, p, robust_cov)
Arguments
data |
Numeric matrix or data frame with all variables combined. |
w |
Numeric vector of weights for each observation. Will be normalized internally. |
p |
Integer vector of length 3: |
robust_cov |
Robust covariance matrix for continuous variables. |
Details
The function computes distances separately for continuous, binary, and categorical variables, then applies the RelMS combination procedure. Continuous distances are Mahalanobis distances, categorical distances use a matching coefficient, and binary distances use a modified similarity coefficient. Eigen decomposition is used to compute the square root matrices needed in the RelMS combination.
Value
A numeric matrix of squared robust distances normalized by geometric variability.
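The Details mention that eigen decomposition is used to obtain square-root matrices. As a minimal sketch of that single step (not the RelMS combination itself), a symmetric positive semi-definite matrix can be square-rooted as follows. The helper name sqrt_psd is hypothetical, and clipping small negative eigenvalues to zero is an assumption made here for numerical safety.

```r
# Sketch: square root of a symmetric PSD matrix via eigen decomposition
sqrt_psd <- function(A) {
  e <- eigen(A, symmetric = TRUE)
  vals <- pmax(e$values, 0)          # clip round-off negatives (assumption)
  e$vectors %*% diag(sqrt(vals), nrow = length(vals)) %*% t(e$vectors)
}

# Small positive definite test matrix
A <- crossprod(matrix(c(2, 1, 0,
                        1, 3, 1,
                        0, 1, 2), nrow = 3))
R <- sqrt_psd(A)

# R is symmetric and R %*% R recovers A up to floating-point error
all.equal(R %*% R, A)
```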
Robust Covariance Estimation Based on Geometric Variability
Description
Computes a robust covariance matrix for a weighted dataset by selecting the most central subset of observations according to geometric variability. Observations are ranked based on a proximity function measuring how far each individual is from the rest of the data. The most central subset is then used to compute a covariance matrix.
Usage
robust_covariance_gv(X, w, alpha)
Arguments
X |
Numeric matrix of dimension n x p, where n is the number of observations and p is the number of variables. |
w |
Numeric vector of weights of length n. Weights will be normalized to sum to 1. |
alpha |
Numeric trimming proportion between 0 and 1 (e.g., 0.05, 0.10, 0.15) indicating the fraction of most extreme observations to discard. |
Value
A list containing:
- S
Robust covariance matrix of dimension p x p.
- central_idx
Indices of observations selected as the central subset.
- outlier_idx
Indices of observations considered outliers.
- phi
Proximity function values for all observations.
- q
Threshold value used for trimming (quantile of phi).
Examples
# Load a small subset of the example dataset
data("Data_HC_contamination", package = "dbrobust")
Data_small <- Data_HC_contamination[1:20, ]
# Select only continuous variables
cont_vars <- names(Data_small)[1:4]
Data_cont <- Data_small[, cont_vars]
# Set uniform weights and trimming proportion
weights <- rep(1, nrow(Data_cont))
alpha <- 0.10
# Compute robust covariance with trimming
res <- dbrobust::robust_covariance_gv(Data_cont, weights, alpha)
# Inspect results: central observations, outliers, covariance, threshold, proximity
res$central_idx
res$outlier_idx
round(res$S, 4)
res$q
round(res$phi[1:10], 4)
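The trimming idea described for robust_covariance_gv can be illustrated in base R. This is a sketch under stated assumptions, not the package's implementation: it uses squared Euclidean distances, takes the proximity function to be the weighted mean squared distance to the other points minus the geometric variability, and trims at the (1 - alpha) quantile. All variable names are chosen for illustration.

```r
# Sketch: trimming by a proximity function and covariance of the central subset
set.seed(1)
X <- rbind(matrix(rnorm(36), ncol = 2),  # 18 regular points
           c(8, 8), c(-9, 7))            # 2 planted extreme points
n <- nrow(X)
w <- rep(1, n)
w <- w / sum(w)                          # normalize weights to sum to 1
alpha <- 0.10

D2 <- as.matrix(dist(X))^2               # squared Euclidean distances (assumption)

# Geometric variability: V = (1/2) * w' D2 w
V <- 0.5 * as.numeric(t(w) %*% D2 %*% w)

# Proximity of each observation to the data set (assumed definition)
phi <- as.numeric(D2 %*% w) - V

# Trim the alpha fraction with the largest proximity values
q <- unname(quantile(phi, probs = 1 - alpha))
central_idx <- which(phi <= q)
outlier_idx <- which(phi > q)

# Weighted covariance of the central subset
wc <- w[central_idx] / sum(w[central_idx])
S <- cov.wt(X[central_idx, , drop = FALSE], wt = wc, method = "ML")$cov
```

With squared Euclidean distances this proximity reduces to the squared distance from the weighted mean, so the two planted points are the ones trimmed.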
Compute Robust Squared Distances for Mixed Data
Description
Computes a weighted, robust squared distance matrix for datasets
containing continuous, binary, and categorical variables. Continuous
variables are handled via a robust Mahalanobis distance, and binary
and categorical variables are transformed via similarity coefficients.
The output is suitable for Euclidean correction with make_euclidean.
Usage
robust_distances(
data = NULL,
cont_vars = NULL,
bin_vars = NULL,
cat_vars = NULL,
w = NULL,
p = NULL,
method = c("ggower", "relms"),
robust_cov = NULL,
alpha = 0.1,
return_dist = FALSE
)
Arguments
data |
Data frame or numeric matrix containing the observations. |
cont_vars |
Character vector of column names for continuous variables. |
bin_vars |
Character vector of column names for binary variables. |
cat_vars |
Character vector of column names for categorical variables. |
w |
Numeric vector of observation weights. If NULL, uniform weights are used. |
p |
Integer vector of length 3: |
method |
Character string: either "ggower" or "relms". |
robust_cov |
Optional. Precomputed robust covariance matrix for continuous variables.
If NULL, it will be estimated internally using the specified trimming proportion alpha. |
alpha |
Numeric trimming proportion for robust covariance of continuous variables. |
return_dist |
Logical. If TRUE, returns an object of class dist. |
Value
A numeric matrix of squared robust distances (n x n) or a dist object if return_dist = TRUE.
Examples
# Example: Robust Squared Distances for Mixed Data
# Load example data and subset
data("Data_HC_contamination", package = "dbrobust")
Data_small <- Data_HC_contamination[1:50, ]
# Define variable types
cont_vars <- c("V1", "V2", "V3", "V4") # continuous
cat_vars <- c("V5", "V6", "V7") # categorical
bin_vars <- c("V8", "V9") # binary
# Use column w_loop as weights
w <- Data_small$w_loop
# -------------------------------
# Method 1: GGower distances
# -------------------------------
dist_sq_ggower <- robust_distances(
data = Data_small,
cont_vars = cont_vars,
bin_vars = bin_vars,
cat_vars = cat_vars,
w = w,
alpha = 0.10,
method = "ggower"
)
# Apply Euclidean correction if needed
res_ggower <- make_euclidean(dist_sq_ggower, w)
# Show first 5x5 block of original and corrected distances
cat("GGower original squared distances (5x5 block):\n")
print(round(dist_sq_ggower[1:5, 1:5], 4))
cat("\nGGower corrected squared distances (5x5 block):\n")
print(round(res_ggower$D_euc[1:5, 1:5], 4))
# -------------------------------
# Method 2: RelMS distances
# -------------------------------
dist_sq_relms <- robust_distances(
data = Data_small,
cont_vars = cont_vars,
bin_vars = bin_vars,
cat_vars = cat_vars,
w = w,
alpha = 0.10,
method = "relms"
)
# Apply Euclidean correction if needed
res_relms <- make_euclidean(dist_sq_relms, w)
# Show first 5x5 block of original and corrected distances
cat("RelMS original squared distances (5x5 block):\n")
print(round(dist_sq_relms[1:5, 1:5], 4))
cat("\nRelMS corrected squared distances (5x5 block):\n")
print(round(res_relms$D_euc[1:5, 1:5], 4))
Compute Robust Generalized Gower Distance
Description
Computes a weighted, robust version of the Gower distance for mixed-type data (continuous, binary, categorical). Continuous variables are handled via a robust Mahalanobis distance using a supplied robust covariance matrix. Binary and categorical variables are transformed into distances via similarity coefficients.
Usage
robust_ggower(data, w, p, robust_cov)
Arguments
data |
Numeric matrix or data frame with all variables combined. |
w |
Numeric vector of weights for each observation. Will be normalized internally. |
p |
Integer vector of length 3: |
robust_cov |
Robust covariance matrix for continuous variables. |
Details
The function computes distances separately for continuous, binary, and categorical variables, then scales each by its geometric variability and combines them. The output is a normalized squared distance matrix suitable for robust clustering or aggregation procedures.
- Continuous distances are Mahalanobis distances: (x - y)^T S^{-1} (x - y).
- Categorical distances use a matching coefficient.
- Binary distances are modified to account for positive/negative matches.
Value
A numeric matrix of squared robust Gower distances, normalized by geometric variability.
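The combination described in the Details (per-type squared distances, each scaled by its geometric variability, then summed) can be sketched as follows. This is a hedged illustration, not the code of robust_ggower: it uses the ordinary sample covariance as a stand-in for a robust covariance, takes 1 minus the simple matching coefficient as the categorical squared distance, and omits the binary block. The helper geo_var and all data are hypothetical.

```r
# Sketch: combine continuous and categorical squared distances
geo_var <- function(D2, w) 0.5 * as.numeric(t(w) %*% D2 %*% w)

set.seed(42)
n <- 6
X_cont <- matrix(rnorm(n * 2), ncol = 2)
X_cat <- cbind(a = c("x", "y", "x", "z", "y", "x"),
               b = c("u", "u", "v", "v", "u", "v"))
w <- rep(1 / n, n)

# Continuous block: squared Mahalanobis distances between all pairs
S <- cov(X_cont)                         # stand-in for a robust covariance
D2_cont <- sapply(seq_len(n), function(j) {
  mahalanobis(X_cont, center = X_cont[j, ], cov = S)
})

# Categorical block: 1 - proportion of matching categories (assumed form)
D2_cat <- sapply(seq_len(n), function(j) {
  ref <- matrix(X_cat[j, ], nrow = n, ncol = ncol(X_cat), byrow = TRUE)
  1 - rowMeans(X_cat == ref)
})

# Scale each block by its geometric variability and add
D2 <- D2_cont / geo_var(D2_cont, w) + D2_cat / geo_var(D2_cat, w)
```

Dividing by the geometric variability puts the blocks on a comparable scale before they are added, so no single variable type dominates because of its units.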
Visualize Distance Matrices via MDS, Heatmap, or Network Graph
Description
This function provides a unified interface to visualize distance matrices using classical or weighted Multidimensional Scaling (MDS), heatmaps, or network graphs. Group annotations can be provided for coloring.
Usage
visualize_distances(
dist_mat,
method = c("mds_classic", "mds_weighted", "heatmap", "qgraph"),
k = 3,
weights = NULL,
group = NULL,
main_title = NULL,
tol = 1e-10,
...
)
Arguments
dist_mat |
A square distance matrix (numeric matrix) or a dist object. |
method |
Character string specifying the visualization method. Options are: "mds_classic", "mds_weighted", "heatmap", "qgraph". |
k |
Integer. Number of dimensions to retain for MDS (default 3). Must be |
weights |
Optional numeric vector of weights for weighted MDS. Must match the number of observations. |
group |
Optional factor or vector indicating group membership for coloring plots. |
main_title |
Optional character string specifying the main title of the plot. |
tol |
Numeric tolerance for checking approximate symmetry (default 1e-10). |
... |
Additional arguments passed to internal plotting functions ( |
Details
visualize_distances is a wrapper around three internal plotting functions:
- plot_mds: Creates a pairwise scatterplot matrix of MDS coordinates with density plots on the diagonal.
- plot_heatmap: Plots a heatmap of the distance matrix with hierarchical clustering and optional group annotations.
- plot_qgraph: Plots a network graph where nodes represent observations and edges represent similarity.
The function validates that dist_mat is square, symmetric, and has zero diagonal elements.
If a distance matrix has a trimmed_idx attribute and group is not provided, a factor indicating "Trimmed" vs "Outlier" is created automatically.
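The validation described above (square, symmetric within tol, zero diagonal) might look roughly like the following. The helper check_dist_matrix and its error messages are hypothetical and are not part of the package; the sketch only shows how a tolerance such as tol can be applied.

```r
# Sketch: validate a distance matrix with a numeric tolerance
check_dist_matrix <- function(D, tol = 1e-10) {
  D <- as.matrix(D)                      # also accepts dist objects
  if (!is.numeric(D) || nrow(D) != ncol(D)) stop("dist_mat must be a square numeric matrix")
  if (max(abs(D - t(D))) > tol) stop("dist_mat is not symmetric")
  if (any(abs(diag(D)) > tol)) stop("dist_mat has non-zero diagonal")
  invisible(D)
}

# A valid input passes silently
D_ok <- check_dist_matrix(dist(iris[1:10, 1:4]))

# Breaking symmetry triggers an error, captured here as a message
bad <- as.matrix(dist(iris[1:10, 1:4]))
bad[1, 2] <- bad[1, 2] + 1
res <- tryCatch(check_dist_matrix(bad), error = function(e) conditionMessage(e))
```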
Value
The plotting object is returned and automatically printed:
- MDS plots return a ggmatrix from GGally.
- Heatmaps return a pheatmap object.
- Network graphs are plotted directly (returns NULL).
See Also
cmdscale for classical MDS.
wcmdscale for weighted MDS.
pheatmap for heatmaps.
qgraph for network graphs.
ggpairs for MDS scatterplot matrices.
Examples
# Load iris dataset
data(iris)
# Compute Euclidean distances on numeric columns
dist_iris <- dist(iris[, 1:4])
# Create a grouping factor based on Species
group_species <- iris$Species
# --------------------------------------
# Classical MDS (2D)
# --------------------------------------
visualize_distances(
dist_mat = dist_iris,
method = "mds_classic",
k = 2,
group = group_species,
main_title = "Classical MDS - Iris Dataset - Euclidean Distance"
)
# --------------------------------------
# Weighted MDS (uniform weights)
# --------------------------------------
weights <- rep(1, nrow(iris))
visualize_distances(
dist_mat = dist_iris,
method = "mds_weighted",
k = 2,
weights = weights,
group = group_species,
main_title = "Weighted MDS - Iris Dataset - Euclidean Distance"
)
# --------------------------------------
# Heatmap (limit rows to 30)
# --------------------------------------
visualize_distances(
dist_mat = dist_iris,
method = "heatmap",
group = group_species,
main_title = "Iris Heatmap by Species - Euclidean Distance",
max_n = 30,
palette = "YlGnBu",
clustering_method = "complete",
annotation_legend = TRUE,
stratified_sampling = TRUE,
seed = 123
)
# --------------------------------------
# Network Graph (limit nodes to 30)
# --------------------------------------
visualize_distances(
dist_mat = dist_iris,
method = "qgraph",
group = group_species,
max_nodes = 30,
label_size = 2,
edge_threshold = 0.1,
layout = "spring",
seed = 123,
main_title = "Iris Network Graph by Species - Euclidean Distance"
)
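For orientation, the coordinates behind a classical MDS plot can be obtained directly with stats::cmdscale, which the See Also section references. This is a sketch of the underlying computation only. How visualize_distances processes the result internally is not shown here, and the column names Dim1 and Dim2 are chosen for illustration.

```r
# Sketch: classical MDS coordinates with base R
dist_iris <- dist(iris[, 1:4])
coords <- cmdscale(dist_iris, k = 2)     # n x k matrix of coordinates
colnames(coords) <- c("Dim1", "Dim2")

# Attach the grouping factor for later plotting
mds_df <- data.frame(coords, group = iris$Species)
head(mds_df)
```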