Help for package dbrobust

Type:

Package

Title:

Robust Distance-Based Visualization and Analysis of Mixed-Type Data

Version:

1.0.0

Date:

2025-09-16

Author:

Marcos Álvarez [aut], Eva Boj [aut, cre], Aurea Grané [aut]

Maintainer:

Eva Boj <evaboj@ub.edu>

Description:

Robust distance-based methods applied to matrices and data frames, producing distance matrices that can be used as input for various visualization techniques such as graphs, heatmaps, or multidimensional scaling configurations. See Boj and Grané (2024) <doi:10.1016/j.seps.2024.101992>.

License:

GPL-3

Repository:

CRAN

Encoding:

UTF-8

Imports:

MASS, proxy, ade4, Rdpack, dbstats, StatMatch, RColorBrewer, pheatmap, qgraph, GGally, ggplot2, vegan

RdMacros:

Rdpack

LazyData:

true

RoxygenNote:

7.3.2

Depends:

R (≥ 4.5)

Suggests:

testthat (≥ 3.0.0)

Config/testthat/edition:

NeedsCompilation:

Packaged:

2025-09-16 09:19:50 UTC; evaboj

Date/Publication:

2025-09-22 07:50:12 UTC

High-correlation dataset with contamination

Description

Synthetic dataset generated from a multivariate normal distribution with strong correlation structure (\rho = 0.8). It contains 550 observations and 10 variables of mixed type (continuous, categorical, binary, and weights). The last 50 rows are contaminated observations, created by adding perturbations equal to three times the standard deviation of each quantitative variable to a subset of the original units. This gives a controlled contamination level of 10% relative to the 500 original observations. These data follow the design in Boj and Grané (2024).

Usage

Data_HC_contamination

Format

A data frame with 550 rows and 10 variables:

V1: Continuous variable 1
V2: Continuous variable 2
V3: Continuous variable 3
V4: Continuous variable 4
V5: Categorical variable 1 (3 categories, approx. balanced)
V6: Categorical variable 2 (3 categories, approx. balanced)
V7: Categorical variable 3 (4 categories, uniform distribution)
V8: Binary variable 1 (40% zeros, 60% ones)
V9: Binary variable 2 (60% zeros, 40% ones)
w_loop: Observation weights derived from the joint distribution of V5 and V8, following a proportional frequency-based scheme.

Details

Continuous variables were drawn directly from the multivariate normal sample.
Binary and categorical variables were obtained by discretizing normal margins using percentile-based thresholds.
Contaminated observations (rows 501–550) were generated by perturbing original cases with fluctuations of 3 SD.
The weighting scheme prioritizes frequent category combinations.
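
The generation steps listed above can be sketched in base R. This is only an illustration under stated assumptions, not the authors' generating script: the seed, the use of a fifth normal margin for the binary variable, the random choice of contaminated units, and all object names (sim, contaminated, sim_all) are hypothetical.

```r
set.seed(2024)                      # arbitrary seed, for reproducibility only
n <- 500; p <- 5; rho <- 0.8

# Equicorrelated covariance matrix and a multivariate normal sample
Sigma <- matrix(rho, p, p)
diag(Sigma) <- 1
Z <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

# Four margins stay continuous; a fifth is cut at its 40th percentile
sim <- data.frame(V1 = Z[, 1], V2 = Z[, 2], V3 = Z[, 3], V4 = Z[, 4])
sim$V8 <- as.integer(Z[, 5] > stats::quantile(Z[, 5], 0.40))  # about 40% zeros

# Contaminate 50 randomly chosen units: add 3 SD to each continuous variable
idx <- sample.int(n, 50)
contaminated <- sim[idx, ]
rownames(contaminated) <- NULL
for (v in c("V1", "V2", "V3", "V4")) {
  contaminated[[v]] <- contaminated[[v]] + 3 * stats::sd(sim[[v]])
}

# 500 clean rows followed by 50 contaminated rows
sim_all <- rbind(sim, contaminated)
dim(sim_all)
```

The categorical variables and the w_loop weights are omitted here to keep the sketch short.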

References

Boj E, Grané A (2024). “The robustification of distance-based linear models: Some proposals.” Socio-Economic Planning Sciences, 95, 101992.

High-correlation dataset without contamination

Description

Synthetic dataset generated from a multivariate normal distribution with strong correlation structure (\rho = 0.8). It contains 500 observations and 10 variables of mixed type (continuous, categorical, binary, and weights). No contaminated cases were added in this version, so the dataset represents a clean scenario with 0% contamination. These data follow the design in Boj and Grané (2024).

Usage

Data_HC_no_contamination

Format

A data frame with 500 rows and 10 variables:

V1: Continuous variable 1
V2: Continuous variable 2
V3: Continuous variable 3
V4: Continuous variable 4
V5: Categorical variable 1 (3 categories, approx. balanced)
V6: Categorical variable 2 (3 categories, approx. balanced)
V7: Categorical variable 3 (4 categories, uniform distribution)
V8: Binary variable 1 (40% zeros, 60% ones)
V9: Binary variable 2 (60% zeros, 40% ones)
w_loop: Observation weights derived from the joint distribution of V5 and V8, following a proportional frequency-based scheme.

Details

Continuous variables were drawn directly from the multivariate normal sample.
Binary and categorical variables were obtained by discretizing normal margins using percentile-based thresholds.
Unlike other datasets in this collection, no artificial contamination was introduced here.
The weighting scheme prioritizes frequent category combinations.

References

Boj E, Grané A (2024). “The robustification of distance-based linear models: Some proposals.” Socio-Economic Planning Sciences, 95, 101992.

Moderate-correlation dataset with contamination

Description

Synthetic dataset generated from a multivariate normal distribution with moderate correlation structure (\rho = 0.6). It contains 525 observations and 10 variables of mixed type (continuous, categorical, binary, and weights). The last 25 rows are contaminated observations, created by adding perturbations equal to three times the standard deviation of each quantitative variable to a subset of the original units. This gives a controlled contamination level of 5% relative to the 500 original observations. These data follow the design in Boj and Grané (2024).

Usage

Data_MC_contamination

Format

A data frame with 525 rows and 10 variables:

V1: Continuous variable 1
V2: Continuous variable 2
V3: Continuous variable 3
V4: Continuous variable 4
V5: Categorical variable 1 (3 categories, approx. balanced)
V6: Categorical variable 2 (3 categories, approx. balanced)
V7: Categorical variable 3 (4 categories, uniform distribution)
V8: Binary variable 1 (40% zeros, 60% ones)
V9: Binary variable 2 (60% zeros, 40% ones)
w_loop: Observation weights derived from the joint distribution of V5 and V8, following a proportional frequency-based scheme.

Details

Continuous variables were drawn directly from the multivariate normal sample.
Binary and categorical variables were obtained by discretizing normal margins using percentile-based thresholds.
Contaminated observations (rows 501–525) were generated by perturbing original cases with fluctuations of 3 SD.
The weighting scheme prioritizes frequent category combinations.

References

Boj E, Grané A (2024). “The robustification of distance-based linear models: Some proposals.” Socio-Economic Planning Sciences, 95, 101992.

Moderate-correlation dataset without contamination

Description

Synthetic dataset generated from a multivariate normal distribution with moderate correlation structure (\rho = 0.6). It contains 500 observations and 10 variables of mixed type (continuous, categorical, binary, and weights). No contaminated cases were added in this version, so the dataset represents a clean scenario with 0% contamination. These data follow the design in Boj and Grané (2024).

Usage

Data_MC_no_contamination

Format

A data frame with 500 rows and 10 variables:

V1: Continuous variable 1
V2: Continuous variable 2
V3: Continuous variable 3
V4: Continuous variable 4
V5: Categorical variable 1 (3 categories, approx. balanced)
V6: Categorical variable 2 (3 categories, approx. balanced)
V7: Categorical variable 3 (4 categories, uniform distribution)
V8: Binary variable 1 (40% zeros, 60% ones)
V9: Binary variable 2 (60% zeros, 40% ones)
w_loop: Observation weights derived from the joint distribution of V5 and V8, following a proportional frequency-based scheme.

Details

Continuous variables were drawn directly from the multivariate normal sample.
Binary and categorical variables were obtained by discretizing normal margins using percentile-based thresholds.
Unlike other datasets in this collection, no artificial contamination was introduced here.
The weighting scheme prioritizes frequent category combinations.

References

Boj E, Grané A (2024). “The robustification of distance-based linear models: Some proposals.” Socio-Economic Planning Sciences, 95, 101992.

Compute Distance or Similarity Matrices

Description

Computes a distance or similarity matrix between rows of a data frame or matrix, supporting a wide variety of distance metrics.

Usage

calculate_distances(
  x,
  method = "gower",
  output_format = "dist",
  squared = FALSE,
  p = NULL,
  similarity_transform = "linear",
  ...
)

Arguments

x

A matrix or data.frame. Each row represents an observation.

method

A string specifying the distance/similarity method. Supported:

Binary: "jaccard", "dice", "sokal_michener", "russell_rao", "sokal_sneath", "kulczynski", "hamming".
Categorical: "matching_coefficient".
Continuous: "euclidean", "euclidean_standardized", "manhattan", "minkowski", "canberra", "maximum", "cosine", "correlation", "mahalanobis".
Mixed: "gower".

output_format

Output format: "dist" (distance object), "matrix" (numeric matrix), or "similarity" (only for binary/categorical/mixed methods).

squared

Logical; if TRUE, returns squared distances (not applied to similarities).

p

Numeric; the power parameter for the Minkowski distance (required if method = "minkowski").

similarity_transform

Character string; if output_format = "similarity", this specifies the formula to convert distances to similarity scores. Supported:

"linear" (default): s_{ij} = 1 - \delta_{ij}
"sqrt": s_{ij} = 1 - \delta_{ij}^2

...

Additional arguments passed to underlying functions.

Details

When output_format = "similarity", the function transforms computed distances into similarity scores using one of the supported transformations.

The similarity transformation options are:

"linear": Direct inversion of distance: s_{ij} = 1 - \delta_{ij}.
"sqrt": Squared distance inversion: s_{ij} = 1 - \delta_{ij}^2, which may better preserve Euclidean properties.

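As a rough illustration of the two transformations, the following base R sketch applies them to a toy matrix whose entries already lie in [0, 1]. The values and object names are illustrative, and this is not the package's internal code.

```r
# Toy distance matrix with entries in [0, 1] (illustrative values)
delta <- matrix(c(0.0, 0.3, 0.8,
                  0.3, 0.0, 0.5,
                  0.8, 0.5, 0.0), nrow = 3, byrow = TRUE)

# "linear": s_ij = 1 - delta_ij
s_linear <- 1 - delta

# "sqrt": s_ij = 1 - delta_ij^2
s_sqrt <- 1 - delta^2

s_linear
s_sqrt
```

For distances in [0, 1], the "sqrt" option never gives a smaller similarity than the "linear" one, because \delta_{ij}^2 \le \delta_{ij} on that interval.
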
Value

Depending on output_format, returns:

dist object (if output_format = "dist")
numeric matrix (if output_format = "matrix" or output_format = "similarity")

Examples

# Load example dataset
data("Data_HC_contamination", package = "dbrobust")
df <- Data_HC_contamination

# --- Quick Example ---
numeric_data <- df[1:10, 1:4]  # subset for speed
d_euclid <- calculate_distances(
  numeric_data,
  method = "euclidean",
  output_format = "matrix"
)

# Load example dataset
data("Data_HC_contamination", package = "dbrobust")
df <- Data_HC_contamination[1:20,]

# Example 1: Euclidean distance (numeric variables only)
numeric_data <- df[, 1:4]
d_euclid <- calculate_distances(
  numeric_data,
  method = "euclidean",
  output_format = "matrix"
)

# Example 2: Manhattan distance
d_manhattan <- calculate_distances(
  numeric_data,
  method = "manhattan",
  output_format = "matrix"
)

# Example 3: Categorical distance using Matching Coefficient
categorical_data <- df[, 5:7]
d_match <- calculate_distances(
  categorical_data,
  method = "matching_coefficient",
  output_format = "matrix"
)

# Example 4: Mixed data distance using Gower (automatic type detection, asymmetric binary)
d_gower_asym <- calculate_distances(
  df,
  method = "gower",
  output_format = "dist",
  binary_asym = TRUE
)

# Example 5: Minkowski distance with p = 3
d_minkowski <- calculate_distances(
  numeric_data,
  method = "minkowski",
  p = 3,
  output_format = "matrix"
)

# Example 6: Jaccard distance for binary variables
binary_data <- df[, 8:9]
d_jaccard <- calculate_distances(
  binary_data,
  method = "jaccard",
  output_format = "matrix"
)

# Example 7: Mahalanobis distance
d_mahal <- calculate_distances(
  numeric_data,
  method = "mahalanobis",
  output_format = "matrix"
)

# Example 8: Manual selection of variables for Gower distance
continuous_vars <- 1:4
binary_vars <- 8:9
categorical_vars <- 5:7
d_gower_manual <- calculate_distances(
  df,
  method = "gower",
  output_format = "dist",
  continuous_cols = continuous_vars,
  binary_cols = binary_vars,
  categorical_cols = categorical_vars
)

Convert a similarity or distance matrix to a 'dist' object

Description

This function converts a similarity matrix (with values between 0 and 1 and 1s on the diagonal) or a distance matrix into a 'dist' object. The user can specify the method used to transform similarity values into distances.

Usage

convert_to_dist(dist_mat, similarity_transform = c("linear", "sqrt"))

Arguments

dist_mat

A square matrix (similarity or distance) or a 'dist' object.

similarity_transform

Method to convert similarity to distance. Either "linear" (default) or "sqrt".

"linear": Applies a linear transformation to convert similarity to distance.
"sqrt": Applies the square root transformation to convert similarity to distance.

Value

An object of class 'dist'.
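
A minimal base R sketch of the conversion from similarity to distance, assuming that "linear" means d_{ij} = 1 - s_{ij} and that "sqrt" means d_{ij} = \sqrt{1 - s_{ij}}. Both forms are assumptions inferred from the formulas documented for format_output(), and the toy matrix is illustrative.

```r
# Toy similarity matrix: values in [0, 1], ones on the diagonal
s <- matrix(c(1.00, 0.75, 0.36,
              0.75, 1.00, 0.19,
              0.36, 0.19, 1.00), nrow = 3, byrow = TRUE)

# Assumed "linear" form: d_ij = 1 - s_ij
d_linear <- stats::as.dist(1 - s)

# Assumed "sqrt" form: d_ij = sqrt(1 - s_ij)
d_sqrt <- stats::as.dist(sqrt(1 - s))

d_linear
d_sqrt
```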

Compute pairwise binary distances

Description

Internal helper function to compute pairwise distances between binary vectors using standard binary distance/similarity measures. Delegates to ade4::dist.binary when available for performance.

Usage

dist_binary(x, method)

Arguments

x

A numeric matrix or data frame of binary values (0/1, TRUE/FALSE, or NA)

method

A character string specifying the binary distance measure to use.

Details

Supported methods (for two binary vectors x_i and x_j):

"jaccard": d = 1 - \frac{a}{a + b + c}
"dice": d = 1 - \frac{2a}{2a + b + c}
"sokal_michener": d = 1 - \frac{a + d}{a + b + c + d}
"russell_rao": d = 1 - \frac{a}{a + b + c + d}
"sokal_sneath": d = 1 - \frac{a}{a + 1/2(b + c)}
"kulczynski": d = 1 - \frac{1}{2}\left(\frac{a}{a+b} + \frac{a}{a+c}\right)
"hamming": d = 1 - \frac{a + d}{a + b + c + d}

Where:

a = number of positions where both vectors are 1
b = number of positions where x_i = 1 and x_j = 0
c = number of positions where x_i = 0 and x_j = 1
d = number of positions where both vectors are 0

The Sokal-Michener coefficient is equivalent to the normalized Hamming distance.

Factors or character columns are converted to numeric 0/1.
Missing values (NA) are ignored pairwise; if all entries are missing, distance is NA.
Methods supported by ade4 (e.g., Jaccard, Dice, Sokal-Michener, etc.) are computed via ade4::dist.binary for efficiency.
Manual computations are implemented for Hamming and Kulczynski if ade4 is unavailable.
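
The counts a, b, c, and d can be obtained directly for a pair of vectors. The sketch below is a base R illustration of the Jaccard and Sokal-Michener formulas with pairwise NA removal; binary_counts() is a hypothetical helper, not a function exported by the package.

```r
# Counts a, b, c, d for two binary vectors, ignoring positions with NA
binary_counts <- function(xi, xj) {
  ok <- !is.na(xi) & !is.na(xj)
  xi <- xi[ok]
  xj <- xj[ok]
  c(a = sum(xi == 1 & xj == 1),
    b = sum(xi == 1 & xj == 0),
    c = sum(xi == 0 & xj == 1),
    d = sum(xi == 0 & xj == 0))
}

xi <- c(1, 0, 1)
xj <- c(1, 1, 0)
n <- binary_counts(xi, xj)

# Distances from the formulas above
d_jaccard <- 1 - n[["a"]] / (n[["a"]] + n[["b"]] + n[["c"]])
d_sokal_michener <- 1 - (n[["a"]] + n[["d"]]) / sum(n)

n
c(jaccard = d_jaccard, sokal_michener = d_sokal_michener)
```

In this pair there are no joint zeros (d = 0), so both measures coincide at 2/3.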

Value

A symmetric numeric matrix of pairwise distances. NA is returned for pairs with no valid comparisons (all NA entries).

Examples

# Small example with binary matrix
mat <- matrix(c(
  1, 0, 1,
  1, 1, 0,
  0, 1, 1
), nrow = 3, byrow = TRUE)

# Example with Jaccard
dbrobust::dist_binary(mat, method = "jaccard")

# Example with Hamming
dbrobust::dist_binary(mat, method = "hamming")

Compute pairwise distances for categorical data

Description

Internal helper function to compute distances between observations based on the matching coefficient, which measures the proportion of matching attributes between two categorical vectors. This approach is particularly useful for multiclass categorical variables.

Usage

dist_categorical(x, method = "matching_coefficient")

Arguments

x

A data frame or matrix containing only categorical variables (factor or character)

method

Currently only "matching_coefficient" is supported.

Details

The distance between two observations i and j is defined as:

d(i, j) = 1 - \frac{\alpha}{p^\prime}

where \alpha is the number of matching attributes (agreements) and p' is the number of non-missing comparisons between the two observations.

Only categorical columns (factor or character) are supported; numeric columns must be converted prior to using this function.
Missing values (NA) are ignored pairwise. If all attributes are missing for a given pair, the distance is returned as NA.
This distance is equivalent to the normalized Hamming distance when applied to binary variables.
The matching coefficient satisfies metric properties and can be used as a building block for mixed-type distances (e.g., combined with quantitative distances via Gower's similarity).
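
The formula and the pairwise handling of missing values can be sketched in base R as follows. matching_distance() is a hypothetical helper written for illustration (a plain double loop, not optimized) and is not the package's implementation.

```r
# Matching-coefficient distance: 1 - (agreements / non-missing comparisons)
matching_distance <- function(x) {
  x <- as.data.frame(lapply(x, as.character), stringsAsFactors = FALSE)
  n <- nrow(x)
  D <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      xi <- unlist(x[i, ])
      xj <- unlist(x[j, ])
      ok <- !is.na(xi) & !is.na(xj)        # compare only non-missing pairs
      if (any(ok)) {
        D[i, j] <- 1 - sum(xi[ok] == xj[ok]) / sum(ok)
      }                                    # otherwise the entry stays NA
    }
  }
  D
}

df <- data.frame(
  A = factor(c("red", "blue", "red")),
  B = factor(c("circle", "circle", "square"))
)
matching_distance(df)
```

Rows 1 and 2 agree on one of two attributes (distance 0.5), while rows 2 and 3 agree on none (distance 1).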

Value

A symmetric numeric matrix of pairwise distances. Distance is in the range [0, 1], where 0 indicates complete agreement and 1 indicates complete disagreement. NA is returned for pairs with no valid comparisons (all NA entries).

Examples

# Small categorical dataset
df <- data.frame(
  A = factor(c("red", "blue", "red")),
  B = factor(c("circle", "circle", "square"))
)
# Compute matching coefficient
dbrobust::dist_categorical(df)

Compute pairwise distances for continuous numeric data

Description

Internal helper function to compute pairwise distance matrices for purely numeric datasets. Supports standard metrics, including Euclidean, Manhattan, Chebyshev (method "maximum"), Canberra, Minkowski, standardized Euclidean, and Mahalanobis distances, as well as cosine and correlation distances.

Usage

dist_continuous(x, method, p = NULL)

Arguments

x

A numeric data frame or matrix with rows as observations and columns as variables.

method

Distance metric to compute (see details for supported options).

p

Numeric, the power parameter for Minkowski distance (required if method = "minkowski").

Details

Supported methods and formulas (for observations \mathbf{z}_i and \mathbf{z}_j, where the upper limit p of each sum is the number of variables):

"euclidean": \delta_E(i,j) = \sqrt{\sum_{k=1}^{p} (z_{ik} - z_{jk})^2}
"minkowski": \delta_q(i,j) = \left( \sum_{k=1}^{p} |z_{ik} - z_{jk}|^q \right)^{1/q}, where the exponent q is supplied through the argument p.
"manhattan": \delta_1(i,j) = \sum_{k=1}^{p} |z_{ik} - z_{jk}|
"maximum": \delta_\infty(i,j) = \max_k |z_{ik} - z_{jk}|
"canberra": \delta_C(i,j) = \sum_{k=1}^{p} \frac{|z_{ik} - z_{jk}|}{|z_{ik}| + |z_{jk}|}, with the convention 0/0 := 0.
"euclidean_standardized": \delta_K(i,j) = \sqrt{\sum_{k=1}^{p} \frac{(z_{ik} - z_{jk})^2}{s_k^2}}, where s_k^2 is the variance of variable k.
"mahalanobis": \delta_M(i,j) = \sqrt{ (\mathbf{z}_i - \mathbf{z}_j)' \mathbf{S}^{-1} (\mathbf{z}_i - \mathbf{z}_j) }, where \mathbf{S} is the covariance matrix.

Considerations when choosing a distance metric:

For "euclidean_standardized", columns are standardized to mean 0 and variance 1 before computing Euclidean distances.
Cosine and correlation distances rely on the proxy package; these are not guaranteed to be strictly Euclidean.
Minkowski distance requires specifying the parameter p (e.g., p = 3 for L3 norm).
Mahalanobis distance uses the inverse of the covariance matrix. If the covariance matrix is singular, the generalized inverse from MASS::ginv is used.
Standard metrics (Euclidean, Manhattan, Maximum, Canberra) are computed using stats::dist.
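
The standardized Euclidean and Mahalanobis formulas can be reproduced with base R alone. The sketch below is illustrative (toy data and object names are assumptions) and assumes a non-singular covariance matrix, so no generalized inverse is needed.

```r
# Toy numeric data: 5 observations, 3 variables (illustrative values)
x <- matrix(c(1.0, 2.0, 0.5,
              2.0, 1.0, 1.5,
              3.0, 4.0, 1.0,
              4.0, 3.5, 3.0,
              5.0, 6.0, 2.5), ncol = 3, byrow = TRUE)

# Standardized Euclidean: scale columns to unit variance, then Euclidean
d_std <- as.matrix(stats::dist(scale(x)))

# Mahalanobis between all pairs of rows; stats::mahalanobis() returns the
# squared distances from every row of x to the given center
S <- stats::cov(x)
d_mahal <- t(sapply(seq_len(nrow(x)), function(i) {
  sqrt(pmax(stats::mahalanobis(x, center = x[i, ], cov = S), 0))
}))

round(d_std, 3)
round(d_mahal, 3)
```

Centering in scale() does not affect the result, since only differences between rows enter the distance.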

Value

A symmetric numeric matrix of pairwise distances between rows of x. The diagonal contains zeros.

Examples

# Small numeric matrix
mat <- matrix(c(1, 2, 3,
                4, 5, 6,
                7, 8, 9), nrow = 3, byrow = TRUE)

# Euclidean distance
dbrobust::dist_continuous(mat, method = "euclidean")

# Standardized Euclidean
dbrobust::dist_continuous(mat, method = "euclidean_standardized")

# Minkowski distance with p = 3
dbrobust::dist_continuous(mat, method = "minkowski", p = 3)

# Mahalanobis distance
set.seed(123)
mat <- matrix(rnorm(5*3), nrow = 5, ncol = 3)
colnames(mat) <- c("X1","X2","X3")
# Compute the mahalanobis distance
dbrobust::dist_continuous(mat, method = "mahalanobis")

# Cosine distance (requires 'proxy' package)
dbrobust::dist_continuous(mat, method = "cosine")

Compute Gower dissimilarity for mixed-type data

Description

Internal helper function to compute pairwise dissimilarities for datasets containing a mix of continuous, binary, and categorical variables using Gower's method (Gower 1971).

Usage

dist_mixed(
  x,
  continuous_cols = NULL,
  binary_cols = NULL,
  categorical_cols = NULL,
  binary_asym = FALSE
)

Arguments

x

A data frame with rows as observations and columns as variables.

continuous_cols

Optional numeric indices or column names for continuous variables.

binary_cols

Optional numeric indices or column names for binary variables.

categorical_cols

Optional numeric indices or column names for categorical/multiclass variables.

binary_asym

Logical; if TRUE, binary variables are treated as asymmetric (only 1/1 counts as match).

Details

Continuous, binary, and categorical columns can be automatically detected, or explicitly specified by the user via continuous_cols, binary_cols, and categorical_cols.

Continuous, binary, and categorical columns are combined into a single dissimilarity measure following Gower's approach.
Continuous variables are scaled by their range.
Binary variables can be treated as symmetric (0/0 and 1/1 count as match) or asymmetric (only 1/1 counts as match).
Categorical variables are compared using simple matching.
Missing values are ignored pairwise.

Advantages:

Low computational cost.
Works naturally with mixed-type data.

Limitations:

Neglects potential correlations among quantitative variables.
Sensitive to outliers, which can affect robustness.
May overemphasize categorical differences in mixed-data settings.
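
For a concrete view of how the pieces combine, the sketch below computes a classical Gower dissimilarity by hand in base R, assuming equal variable weights, symmetric binary treatment, and no missing values. It is an illustration of the general approach and may differ from the package's internals.

```r
df <- data.frame(
  height = c(170, 160, 180),
  gender = factor(c("M", "F", "M")),
  smoker = c(1, 0, 1)
)

# Per-variable dissimilarities, each in [0, 1]
# Continuous: absolute difference scaled by the range
d_height <- abs(outer(df$height, df$height, "-")) / diff(range(df$height))
# Categorical: simple matching (0 if equal, 1 otherwise)
gender_chr <- as.character(df$gender)
d_gender <- outer(gender_chr, gender_chr, "!=") * 1
# Binary, symmetric: 0/0 and 1/1 both count as a match
d_smoker <- outer(df$smoker, df$smoker, "!=") * 1

# Gower dissimilarity: average over the three variables
d_gower <- (d_height + d_gender + d_smoker) / 3
round(d_gower, 4)
```

Observations 2 and 3 differ on every variable and sit at opposite ends of the height range, so their dissimilarity is 1.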

Value

A symmetric numeric matrix of pairwise dissimilarities in [0,1].

References

Gower JC (1971). “A general coefficient of similarity and some of its properties.” Biometrics, 857–871.

Examples

# Small example: Compute classical Gower for a simulated data frame
df <- data.frame(
  height = c(170, 160, 180),
  gender = factor(c("M", "F", "M")),
  smoker = c(1, 0, 1)
)

# Compute Gower dissimilarities automatically detecting types
dbrobust::dist_mixed(df)

# Manual type specification
cont_cols <- "height"
cat_cols <- NULL
bin_cols <- c("gender","smoker")
dbrobust::dist_mixed(
  x = df,
  continuous_cols = cont_cols,
  categorical_cols = cat_cols,
  binary_cols = bin_cols
)

Format distance or similarity matrix output

Description

Converts a distance matrix to either a similarity matrix or a 'dist' object, depending on user preferences.

Usage

format_output(
  dist_mat,
  output_format,
  similarity = FALSE,
  similarity_transform = "linear"
)

Arguments

dist_mat

A symmetric matrix of pairwise distances.

output_format

Character string specifying output format: "matrix", "dist", or "similarity".

similarity

Logical; if TRUE, converts distances to similarities.

similarity_transform

Character string; either "linear" (default) or "sqrt".

Details

When converting to similarity, two transformation formulas are supported to derive similarity from distance:

"linear" (default)

\text{s}_{ij} = 1 - \delta_{ij}

This transformation directly inverts the distance into a similarity score.

"sqrt"

\text{s}_{ij} = 1 - \delta_{ij}^2

This is the inverse of the distance transformation

\delta_{ij} = \sqrt{1 - s_{ij}}

According to Gower and Legendre (1986), distances obtained through this square-root form are more likely to preserve Euclidean structure in downstream analyses.

Value

A matrix or 'dist' object, depending on the selected format and similarity flag.

References

Gower JC, Legendre P (1986). “Metric and Euclidean properties of dissimilarity coefficients.” Journal of classification, 3, 5–48.

Generate a Custom Color Palette

Description

Returns a vector of distinct colors for use in plotting or annotation. Uses a predefined set of base colors and interpolates additional colors if needed.

Usage

get_custom_palette(n)

Arguments

n

Integer. Number of colors required.

Value

A character vector of length n containing hexadecimal color codes.

Force a Pairwise Squared Distance Matrix to Euclidean Form

Description

Given a pairwise squared distance matrix D (where D[i,j] = d(i,j)^2), this function ensures that D corresponds to a valid Euclidean squared distance matrix. The correction is based on the weighted Gram matrix G_w = -\frac{1}{2} J_w D J_w^\top, where J_w = I_n - \mathbf{1} w^\top is the centering matrix defined by the weight vector w.

Usage

make_euclidean(D, w, tol = 1e-10)

Arguments

D

Numeric square matrix (n x n) of pairwise squared distances. Must be symmetric with zeros on the diagonal.

w

Numeric vector of weights (length n). Internally normalized to sum to 1.

tol

Numeric tolerance for detecting negative eigenvalues (default: 1e-10).

Details

If the smallest eigenvalue \lambda_{\min} of G_w is below the negative tolerance -tol, the function corrects D by adding a constant shift that guarantees positive semi-definiteness of the Gram matrix, following the approach of Lingoes (1971) and Mardia (1978):

D_{\text{new}} = D + 2 c \mathbf{1} \mathbf{1}^\top - 2 c I_n,

where c = |\lambda_{\min}|.
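
The correction can be traced step by step in base R. The sketch below uses uniform weights and a deliberately distorted toy matrix; gram() and the other object names are hypothetical, and this is a reading of the formulas above rather than the package's source code.

```r
# Four points on the unit square, as squared Euclidean distances
pts <- cbind(c(0, 1, 0, 1), c(0, 0, 1, 1))
D <- as.matrix(stats::dist(pts))^2
D[1, 2] <- D[2, 1] <- 9          # distortion that breaks the triangle inequality

n <- nrow(D)
w <- rep(1 / n, n)               # uniform weights summing to 1
ones <- rep(1, n)

gram <- function(D) {
  J <- diag(n) - ones %*% t(w)   # J_w = I_n - 1 w'
  -0.5 * J %*% D %*% t(J)        # G_w = -1/2 J_w D J_w'
}

eig_before <- eigen(gram(D), symmetric = TRUE, only.values = TRUE)$values
c_shift <- abs(min(eig_before))  # c = |lambda_min|

# D_new = D + 2c 1 1' - 2c I_n: adds 2c to every off-diagonal entry
D_new <- D + 2 * c_shift * (ones %*% t(ones) - diag(n))
eig_after <- eigen(gram(D_new), symmetric = TRUE, only.values = TRUE)$values

min(eig_before)
min(eig_after)
```

The diagonal of D_new stays at zero, and with uniform weights the smallest eigenvalue after the shift is zero up to rounding error.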

Value

A list with components:

D_euc: Corrected pairwise squared Euclidean distance matrix (n x n).
eigvals_before: Eigenvalues of the weighted Gram matrix before correction.
eigvals_after: Eigenvalues of the weighted Gram matrix after correction.
transformed: Logical, TRUE if correction was applied, FALSE otherwise.

References

Lingoes JC (1971). “Some boundary conditions for a monotone analysis of symmetric matrices.” Psychometrika, 36(2), 195–203.
Mardia KV (1978). “Some properties of classical multi-dimensional scaling.” Communications in Statistics-Theory and Methods, 7(13), 1233–1241.

Examples

# Load example dataset
data("Data_HC_contamination")

# Reduce dataset to first 50 rows
Data_small <- Data_HC_contamination[1:50, ]

# Select only continuous variables
cont_vars <- names(Data_small)[1:4]
Data_cont <- Data_small[, cont_vars]

# Compute squared Euclidean distance matrix
dist_mat <- as.matrix(dist(Data_cont))^2

# Introduce a small non-Euclidean distortion
dist_mat[1, 2] <- dist_mat[1, 2] * 0.5
dist_mat[2, 1] <- dist_mat[1, 2]

# Uniform weights
weights <- rep(1, nrow(Data_cont))

# Apply Euclidean correction
res <- make_euclidean(dist_mat, weights)

# Check results (minimum eigenvalues before/after)
res$transformed
min(res$eigvals_before)
min(res$eigvals_after)

# First 5x5 block of corrected matrix
round(res$D_euc[1:5, 1:5], 4)

Visualize a Distance or Similarity Matrix as a Heatmap with Clustering

Description

This function creates a heatmap from a square distance or similarity matrix. If a similarity matrix is provided, it should first be converted to a distance matrix by the user. The function supports hierarchical clustering, group annotations, row/column sampling (random or stratified), and various customization options.

Usage

plot_heatmap(
  dist_mat,
  max_n = 50,
  group = NULL,
  stratified_sampling = FALSE,
  main_title = NULL,
  palette = "YlOrRd",
  clustering_method = "complete",
  cluster_rows = TRUE,
  cluster_cols = TRUE,
  fontsize_row = 10,
  fontsize_col = 10,
  show_rownames = TRUE,
  show_colnames = TRUE,
  border_color = "grey60",
  annotation_legend = TRUE,
  seed = 123
)

Arguments

dist_mat

A square distance matrix (numeric matrix) or a dist object.

max_n

Integer. Maximum number of observations (rows/columns) to display. If the matrix exceeds this size, a subset of max_n observations is selected.

group

Optional vector or factor providing group labels for rows/columns, used for color annotation.

stratified_sampling

Logical. If TRUE and group is provided, sampling is stratified by group. Each group will contribute at least one observation if possible. Default is FALSE.

main_title

Optional character string specifying the main title of the heatmap.

palette

Character string specifying the RColorBrewer palette for heatmap cells. Default is "YlOrRd".

clustering_method

Character string specifying the hierarchical clustering method, as accepted by hclust (e.g., "complete", "average", "ward.D2").

cluster_rows

Logical, whether to perform hierarchical clustering on rows. Default is TRUE.

cluster_cols

Logical, whether to perform hierarchical clustering on columns. Default is TRUE.

fontsize_row

Integer specifying the font size of row labels. Default is 10.

fontsize_col

Integer specifying the font size of column labels. Default is 10.

show_rownames

Logical, whether to display row names. Default is TRUE.

show_colnames

Logical, whether to display column names. Default is TRUE.

border_color

Color of the cell borders in the heatmap. Default is "grey60".

annotation_legend

Logical, whether to display the legend for group annotations. Default is TRUE.

seed

Integer. Random seed used when sampling rows/columns if max_n is smaller than total observations. Default is 123.

Details

The function works as follows:

Converts dist objects to matrices automatically.
Samples rows/columns if the matrix is larger than max_n. Sampling can be random or stratified by group.
In stratified sampling mode, each group contributes at least one observation if possible.
Supports row annotations for groups and automatically assigns colors.
Uses pheatmap for plotting with customizable clustering, labels, fonts, and colors.
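
One way to realize sampling in which every group contributes at least one observation is sketched below in base R. stratified_sample() is a hypothetical helper, and the scheme shown (one draw per group, then random fill) is an assumption; the package may allocate observations differently, for example proportionally to group size.

```r
# Pick max_n indices so that each group appears at least once when possible
stratified_sample <- function(group, max_n, seed = 123) {
  set.seed(seed)
  group <- droplevels(factor(group))
  idx_by_group <- split(seq_along(group), group)
  # One random index per group
  first <- vapply(idx_by_group,
                  function(ix) ix[sample.int(length(ix), 1)],
                  integer(1))
  first <- unname(first)
  if (length(first) > max_n) first <- first[seq_len(max_n)]
  # Fill the remaining slots at random from the other observations
  remaining <- setdiff(seq_along(group), first)
  n_extra <- min(max_n - length(first), length(remaining))
  extra <- remaining[sample.int(length(remaining), n_extra)]
  sort(c(first, extra))
}

keep <- stratified_sample(iris$Species, max_n = 10)
table(iris$Species[keep])

# Subset a distance matrix to the sampled rows and columns
d_small <- as.matrix(stats::dist(iris[, 1:4]))[keep, keep]
dim(d_small)
```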

This function is used internally by visualize_distances() but can be called directly for advanced usage.

Value

Invisibly returns the pheatmap object, allowing further customization if assigned.

Examples

# Example: Euclidean distance heatmap on iris
eucli_dist <- stats::dist(iris[, 1:4])
dbrobust::plot_heatmap(
  dist_mat = eucli_dist,
  max_n = 10,
  group = iris$Species,
  stratified_sampling = TRUE,
  main_title = "Euclidean Distance Heatmap",
  palette = "YlOrRd",
  clustering_method = "complete"
)

# Example: GGower distances with small subset
data("Data_HC_contamination", package = "dbrobust")
Data_small <- Data_HC_contamination[1:50, ]
cont_vars <- c("V1", "V2", "V3", "V4")
cat_vars  <- c("V5", "V6", "V7")
bin_vars  <- c("V8", "V9")
w <- Data_small$w_loop
dist_sq_ggower <- dbrobust::robust_distances(
  data = Data_small,
  cont_vars = cont_vars,
  bin_vars  = bin_vars,
  cat_vars  = cat_vars,
  w = w,
  alpha = 0.10,
  method = "ggower"
)
group_vec <- rep("Normal", nrow(dist_sq_ggower))
group_vec[attr(dist_sq_ggower, "outlier_idx")] <- "Outlier"
group_factor <- factor(group_vec, levels = c("Normal", "Outlier"))
dbrobust::plot_heatmap(
  dist_mat = sqrt(dist_sq_ggower),
  max_n = 20,
  group = group_factor,
  main_title = "GGower Heatmap with Outliers",
  palette = "YlOrRd",
  clustering_method = "complete",
  annotation_legend = TRUE,
  stratified_sampling = TRUE,
  seed = 123
)

Plot MDS Results with Grouped Scatter and Density Plots (Internal)

Description

This internal function performs classical or weighted Multidimensional Scaling (MDS) on a given distance matrix and visualizes the resulting coordinates using a pairwise scatterplot matrix with density plots on the diagonal. Grouping information can be provided for colored visual separation.

Arguments

dist_mat

A distance matrix or object convertible to a distance matrix.

k

Integer. Number of dimensions to retain in MDS (default is 3).

weights

Optional numeric vector of weights for weighted MDS. If NULL, classical MDS is performed.

group

Optional factor or vector indicating group membership for observations, used for coloring plots.

main_title

Optional character string for the main plot title.

Details

This is an internal helper function. It is not recommended to call plot_mds() directly. Instead, use visualize_distances(), which wraps this function.

Weighted MDS is performed with vegan::wcmdscale if weights are provided; otherwise, classical MDS (cmdscale) is used. Diagonal panels show density plots by group, and off-diagonal panels show scatter plots by group.
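
The classical (unweighted) branch can be illustrated with stats::cmdscale alone. The sketch below uses toy planar points and assumes the input holds distances rather than squared distances; it does not reproduce the plotting step or the weighted variant.

```r
# Five points in the plane (illustrative values)
pts <- cbind(c(0, 3, 0, 3, 1), c(0, 0, 4, 4, 2))
d <- stats::dist(pts)

# Classical MDS, keeping k = 2 dimensions
coords <- stats::cmdscale(d, k = 2)
colnames(coords) <- c("Dim1", "Dim2")
coords

# The configuration reproduces the input distances, up to rotation/reflection
max(abs(as.vector(stats::dist(coords)) - as.vector(d)))
```

Because the toy points are two-dimensional, two MDS dimensions recover the distances exactly up to rounding; with real mixed-type distances some information is usually lost when k is small.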

Value

A ggmatrix object from GGally representing the pairs plot with scatterplots and density plots.

Examples

# Load example dataset
data("Data_HC_contamination", package = "dbrobust")
# Subset of 20 rows
Data_small <- Data_HC_contamination[1:20, ]

# Define variable types
cont_vars <- c("V1", "V2", "V3", "V4")
cat_vars  <- c("V5", "V6", "V7")
bin_vars  <- c("V8", "V9")

# Use column 'w_loop' as weights
w <- Data_small$w_loop

# Compute robust distances using GGower
dist_sq_ggower <- dbrobust::robust_distances(
  data = Data_small,
  cont_vars = cont_vars,
  bin_vars  = bin_vars,
  cat_vars  = cat_vars,
  w = w,
  alpha = 0.10,
  method = "ggower"
)

# Create factor indicating Normal vs Outlier
n_obs <- nrow(dist_sq_ggower)
group_vec <- rep("Normal", n_obs)
group_vec[attr(dist_sq_ggower, "outlier_idx")] <- "Outlier"
group_factor <- factor(group_vec, levels = c("Normal", "Outlier"))

# Plot MDS
dbrobust::plot_mds(
  dist_mat = dist_sq_ggower,
  k = 2,
  group = group_factor,
  main_title = "MDS of Data_HC_contamination (GGower) with Outliers"
)

Plot a Network Graph from a Distance Matrix

Description

This internal function visualizes a network graph representation of a distance matrix, where nodes represent observations and edges represent similarity. Groups can be specified for node coloring. A maximum number of nodes can be set to avoid overcrowding, and weak edges are thresholded.

Usage

plot_qgraph(
  dist_mat,
  group = NULL,
  max_nodes = 100,
  label_size = 2,
  edge_threshold = 0.1,
  layout = "spring",
  seed = 123,
  main_title = NULL
)

Arguments

dist_mat

A square distance matrix or a dist object. Distances are automatically normalized to [0,1] and converted to similarity via 1 - distance.

group

Optional factor or vector indicating group membership for nodes, used for coloring.

max_nodes

Integer. Maximum number of nodes to plot. If the number of observations exceeds this, stratified sampling is performed to reduce the node count.

label_size

Numeric. Size of the node labels.

edge_threshold

Numeric between 0 and 1. Edges with similarity below this threshold are removed.

layout

Character string specifying the layout algorithm for qgraph. Default is "spring".

seed

Integer. Random seed used for reproducibility during sampling and layout.

main_title

Optional character string specifying the main title of the plot.

Details

This function is internal and not intended for direct use. It is called by visualize_distances() to display network graphs of robust distances.

Features:

Converts dist objects to matrices automatically.
Downsamples nodes if the number of observations exceeds max_nodes, using stratified sampling by group.
Normalizes the distance matrix to [0,1] and converts it to similarity (1 - distance).
Removes weak edges below edge_threshold.
Colors nodes according to group membership.
Adds a main title using title() after plotting with qgraph.
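
The preprocessing steps listed above can be sketched in base R as follows. This is a minimal illustration, not the package's code: dividing by the largest distance is an assumed normalization rule, proportional allocation is an assumed form of stratified sampling, and the iris data are a stand-in.

```r
set.seed(123)

# Illustrative inputs: Euclidean distances on iris, grouped by species
D <- as.matrix(dist(iris[, 1:4]))
group <- iris$Species
max_nodes <- 30
edge_threshold <- 0.1

# Stratified downsampling: draw from each group in proportion to its size
if (nrow(D) > max_nodes) {
  idx_by_group <- split(seq_len(nrow(D)), group)
  keep <- unlist(lapply(idx_by_group, function(idx) {
    n_take <- max(1, floor(max_nodes * length(idx) / nrow(D)))
    idx[sample.int(length(idx), n_take)]
  }), use.names = FALSE)
  keep <- sort(keep)
  D <- D[keep, keep]
  group <- droplevels(group[keep])
}

# Normalize to [0, 1] by the largest distance (assumed rule),
# then convert distance to similarity
S <- 1 - D / max(D)

# Drop self-loops and edges weaker than the threshold
diag(S) <- 0
S[S < edge_threshold] <- 0
```

The resulting matrix S is the kind of weighted adjacency matrix that a network plotting function can take as input.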

Value

Invisibly returns NULL. The plot is drawn as a side effect.

Examples

# --------------------------------------
# Network Graph Example from Robust Distances
# --------------------------------------
data("Data_HC_contamination", package = "dbrobust")
# Subset small dataset
Data_small <- Data_HC_contamination[1:20, ]

cont_vars <- c("V1", "V2", "V3", "V4")
cat_vars  <- c("V5", "V6", "V7")
bin_vars  <- c("V8", "V9")
w <- Data_small$w_loop

# Compute GGower robust distances
dist_sq_ggower <- dbrobust::robust_distances(
  data = Data_small,
  cont_vars = cont_vars,
  bin_vars  = bin_vars,
  cat_vars  = cat_vars,
  w = w,
  alpha = 0.10,
  method = "ggower"
)

# Create factor indicating Normal vs Outlier
n_obs <- nrow(dist_sq_ggower)
group_vec <- rep("Normal", n_obs)
group_vec[attr(dist_sq_ggower, "outlier_idx")] <- "Outlier"
group_factor <- factor(group_vec, levels = c("Normal", "Outlier"))

# Plot network graph (small, for CRAN)
dbrobust::plot_qgraph(
  dist_mat = sqrt(dist_sq_ggower),
  group = group_factor,
  max_nodes = 10,
  label_size = 2,
  edge_threshold = 0.1,
  layout = "spring",
  seed = 123,
  main_title = "GGower Network Graph with Outliers"
)

Robust RelMS Distance

Description

Computes a robust version of the Gower distance using the RelMS method for mixed-type data (continuous, binary, categorical). Continuous variables are handled via a robust Mahalanobis distance using a supplied robust covariance matrix. Binary and categorical variables are transformed into distances via similarity coefficients and combined using the RelMS approach.

Usage

robust_RelMS(data, w, p, robust_cov)

Arguments

data

Numeric matrix or data frame with all variables combined.

w

Numeric vector of weights for each observation. Will be normalized internally.

p

Integer vector of length 3: c(#cont, #binary, #categorical).

robust_cov

Robust covariance matrix for continuous variables.

Details

The function computes distances separately for continuous, binary, and categorical variables, then applies the RelMS combination procedure. Continuous distances are Mahalanobis distances, categorical distances use a matching coefficient, and binary distances use a modified similarity coefficient. Eigen decomposition is used to compute the square root matrices needed in the RelMS combination.
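
The square root matrices mentioned above can be obtained from an eigen decomposition. The sketch below uses a hypothetical helper, sym_sqrt(), which is not part of dbrobust. Clipping negative eigenvalues at zero is an assumption made for the example, since the treatment of such eigenvalues is not described here.

```r
# Symmetric square root of a symmetric positive semidefinite matrix:
# if G = V diag(lambda) V', then G^(1/2) = V diag(sqrt(lambda)) V'
sym_sqrt <- function(G) {
  e <- eigen(G, symmetric = TRUE)
  lambda <- pmax(e$values, 0)            # clip tiny negative eigenvalues (assumption)
  e$vectors %*% (sqrt(lambda) * t(e$vectors))
}

# Small positive definite test matrix
G <- crossprod(matrix(c(2, 1, 0,
                        1, 3, 1,
                        0, 1, 4), nrow = 3))
R <- sym_sqrt(G)

# Multiplying the root by itself should recover G up to rounding error
all.equal(R %*% R, G)
```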

Value

A numeric matrix of squared robust distances normalized by geometric variability.

Robust Covariance Estimation Based on Geometric Variability

Description

Computes a robust covariance matrix for a weighted dataset by selecting the most central subset of observations according to geometric variability. Observations are ranked based on a proximity function measuring how far each individual is from the rest of the data. The most central subset is then used to compute a covariance matrix.

Usage

robust_covariance_gv(X, w, alpha)

Arguments

X

Numeric matrix of dimension n x p, where n is the number of observations and p is the number of variables.

w

Numeric vector of weights of length n. Weights will be normalized to sum to 1.

alpha

Numeric trimming proportion between 0 and 1 (e.g., 0.05, 0.10, 0.15) indicating the fraction of most extreme observations to discard.

Value

A list containing:

S: Robust covariance matrix of dimension p x p.
central_idx: Indices of observations selected as the central subset.
outlier_idx: Indices of observations considered outliers.
phi: Proximity function values for all observations.
q: Threshold value used for trimming (quantile of phi).
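
One way to read the procedure is sketched below in base R. The formulas are assumptions made for illustration and may differ from the package's implementation: squared Euclidean distances, geometric variability V = (1/2) sum_i sum_j w_i w_j d2_ij, proximity phi_i = sum_j w_j d2_ij - V, and a cutoff q taken as the (1 - alpha) quantile of phi. The simulated data are also invented for the example.

```r
set.seed(1)

# Illustrative data: 50 observations, 3 variables, last 5 rows shifted outward
X <- matrix(rnorm(150), nrow = 50, ncol = 3)
X[46:50, ] <- X[46:50, ] + 6
w <- rep(1, nrow(X))
alpha <- 0.10

w <- w / sum(w)                         # normalize weights to sum to 1
D2 <- as.matrix(dist(X))^2              # squared Euclidean distances (assumption)

V <- 0.5 * drop(t(w) %*% D2 %*% w)      # geometric variability
phi <- drop(D2 %*% w) - V               # proximity of each row to the data cloud

q <- quantile(phi, probs = 1 - alpha)   # trimming threshold (assumed rule)
central_idx <- which(phi <= q)
outlier_idx <- which(phi > q)

# Weighted covariance of the central subset
S <- cov.wt(X[central_idx, , drop = FALSE], wt = w[central_idx])$cov
```

With squared Euclidean distances, phi reduces to the squared distance of each row from the weighted mean, which is why large values mark observations far from the bulk of the data.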

Examples

# Load a small subset of the example dataset
data("Data_HC_contamination", package = "dbrobust")
Data_small <- Data_HC_contamination[1:20, ]

# Select only continuous variables
cont_vars <- names(Data_small)[1:4]
Data_cont <- Data_small[, cont_vars]

# Set uniform weights and trimming proportion
weights <- rep(1, nrow(Data_cont))
alpha <- 0.10

# Compute robust covariance with trimming
res <- dbrobust::robust_covariance_gv(Data_cont, weights, alpha)

# Inspect results: central observations, outliers, covariance, threshold, proximity
res$central_idx
res$outlier_idx
round(res$S, 4)
res$q
round(res$phi[1:10], 4)

Compute Robust Squared Distances for Mixed Data

Description

Computes a weighted, robust squared distance matrix for datasets containing continuous, binary, and categorical variables. Continuous variables are handled via a robust Mahalanobis distance, and binary and categorical variables are transformed via similarity coefficients. The output is suitable for Euclidean correction with make_euclidean.

Usage

robust_distances(
  data = NULL,
  cont_vars = NULL,
  bin_vars = NULL,
  cat_vars = NULL,
  w = NULL,
  p = NULL,
  method = c("ggower", "relms"),
  robust_cov = NULL,
  alpha = 0.1,
  return_dist = FALSE
)

Arguments

data

Data frame or numeric matrix containing the observations.

cont_vars

Character vector of column names for continuous variables.

bin_vars

Character vector of column names for binary variables.

cat_vars

Character vector of column names for categorical variables.

w

Numeric vector of observation weights. If NULL, uniform weights are used.

p

Integer vector of length 3: c(#cont, #binary, #categorical). Overrides variable type selection if provided.

method

Character string: either "ggower" or "relms" for distance computation.

robust_cov

Optional. Precomputed robust covariance matrix for continuous variables. If NULL, it will be estimated internally using the specified trimming proportion alpha.

alpha

Numeric trimming proportion for robust covariance of continuous variables.

return_dist

Logical. If TRUE, returns an object of class dist; otherwise, returns a squared distance matrix.

Value

A numeric matrix of squared robust distances (n x n) or a dist object if return_dist = TRUE.

Examples

# Example: Robust Squared Distances for Mixed Data

# Load example data and subset
data("Data_HC_contamination", package = "dbrobust")
Data_small <- Data_HC_contamination[1:50, ]

# Define variable types
cont_vars <- c("V1", "V2", "V3", "V4")  # continuous
cat_vars  <- c("V5", "V6", "V7")        # categorical
bin_vars  <- c("V8", "V9")              # binary

# Use column w_loop as weights
w <- Data_small$w_loop

# -------------------------------
# Method 1: GGower distances
# -------------------------------
dist_sq_ggower <- robust_distances(
  data = Data_small,
  cont_vars = cont_vars,
  bin_vars  = bin_vars,
  cat_vars  = cat_vars,
  w = w,
  alpha = 0.10,
  method = "ggower"
)

# Apply Euclidean correction if needed
res_ggower <- make_euclidean(dist_sq_ggower, w)

# Show first 5x5 block of original and corrected distances
cat("GGower original squared distances (5x5 block):\n")
print(round(dist_sq_ggower[1:5, 1:5], 4))
cat("\nGGower corrected squared distances (5x5 block):\n")
print(round(res_ggower$D_euc[1:5, 1:5], 4))

# -------------------------------
# Method 2: RelMS distances
# -------------------------------
dist_sq_relms <- robust_distances(
  data = Data_small,
  cont_vars = cont_vars,
  bin_vars  = bin_vars,
  cat_vars  = cat_vars,
  w = w,
  alpha = 0.10,
  method = "relms"
)

# Apply Euclidean correction if needed
res_relms <- make_euclidean(dist_sq_relms, w)

# Show first 5x5 block of original and corrected distances
cat("RelMS original squared distances (5x5 block):\n")
print(round(dist_sq_relms[1:5, 1:5], 4))
cat("\nRelMS corrected squared distances (5x5 block):\n")
print(round(res_relms$D_euc[1:5, 1:5], 4))

Compute Robust Generalized Gower Distance

Description

Computes a weighted, robust version of the Gower distance for mixed-type data (continuous, binary, categorical). Continuous variables are handled via a robust Mahalanobis distance using a supplied robust covariance matrix. Binary and categorical variables are transformed into distances via similarity coefficients.

Usage

robust_ggower(data, w, p, robust_cov)

Arguments

data

Numeric matrix or data frame with all variables combined.

w

Numeric vector of weights for each observation. Will be normalized internally.

p

Integer vector of length 3: c(#cont, #binary, #categorical).

robust_cov

Robust covariance matrix for continuous variables.

Details

The function computes distances separately for continuous, binary, and categorical variables, then scales each by its geometric variability and combines them. The output is a normalized squared distance matrix suitable for robust clustering or aggregation procedures.

Continuous distances are squared Mahalanobis distances, (x - y)^T S^{-1} (x - y), where S is the robust covariance matrix supplied in robust_cov. Categorical distances use a matching coefficient. Binary distances are modified to account for positive/negative matches.
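
The continuous component and the scaling step can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the ordinary sample covariance stands in for a robust covariance matrix, the geometric variability is taken as (1/2) w' D2 w for normalized weights w, and geo_var() is a hypothetical helper. The combination with the binary and categorical components is not shown.

```r
# Illustrative inputs: four numeric iris columns, uniform weights
X <- as.matrix(iris[, 1:4])
w <- rep(1 / nrow(X), nrow(X))
S <- cov(X)                      # stand-in for a robust covariance matrix

# Whiten the data so that Euclidean distances equal Mahalanobis distances:
# with S = R'R (Cholesky), z = x R^{-1} satisfies
# ||z_i - z_j||^2 = (x_i - x_j)' S^{-1} (x_i - x_j)
R <- chol(S)
Z <- X %*% solve(R)
D2_cont <- as.matrix(dist(Z))^2

# Geometric variability of a squared distance matrix (hypothetical helper)
geo_var <- function(D2, w) 0.5 * drop(t(w) %*% D2 %*% w)

# Scale the squared distances by their geometric variability
D2_scaled <- D2_cont / geo_var(D2_cont, w)

# Cross-check one entry against stats::mahalanobis, which returns squared distances
all.equal(unname(D2_cont[1, 2]), unname(mahalanobis(X[1, ], X[2, ], S)))
```

After scaling, the matrix has geometric variability 1, which puts components measured on different scales on a common footing before they are combined.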

Value

A numeric matrix of squared robust Gower distances, normalized by geometric variability.

Visualize Distance Matrices via MDS, Heatmap, or Network Graph

Description

This function provides a unified interface to visualize distance matrices using classical or weighted Multidimensional Scaling (MDS), heatmaps, or network graphs. Group annotations can be provided for coloring.

Usage

visualize_distances(
  dist_mat,
  method = c("mds_classic", "mds_weighted", "heatmap", "qgraph"),
  k = 3,
  weights = NULL,
  group = NULL,
  main_title = NULL,
  tol = 1e-10,
  ...
)

Arguments

dist_mat

A square distance matrix (numeric matrix) or a dist object.

method

Character string specifying the visualization method. Options are:

"mds_classic": Classical MDS (cmdscale).
"mds_weighted": Weighted MDS (wcmdscale, requires weights).
"heatmap": Heatmap with optional clustering and group annotations.
"qgraph": Network graph representation of similarity.

k

Integer. Number of dimensions to retain for MDS (default 3). Must be >=1 and <= min(4, n_obs-1).

weights

Optional numeric vector of weights for weighted MDS. Must match the number of observations.

group

Optional factor or vector indicating group membership for coloring plots.

main_title

Optional character string specifying the main title of the plot.

tol

Numeric tolerance for checking approximate symmetry (default 1e-10).

...

Additional arguments passed to internal plotting functions (plot_heatmap or plot_qgraph).

Details

visualize_distances is a wrapper around three internal plotting functions:

plot_mds: Creates a pairwise scatterplot matrix of MDS coordinates with density plots on the diagonal.
plot_heatmap: Plots a heatmap of the distance matrix with hierarchical clustering and optional group annotations.
plot_qgraph: Plots a network graph where nodes represent observations and edges represent similarity.

The function validates that dist_mat is square, symmetric, and has zero diagonal elements. If a distance matrix has a trimmed_idx attribute and group is not provided, a factor indicating "Trimmed" vs "Outlier" is created automatically.
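
The validation described above can be pictured with the following sketch. The helper check_dist_matrix() and its error messages are hypothetical, and applying tol to the diagonal as well as to the symmetry check is an assumption.

```r
# Hypothetical validator mirroring the checks described above
check_dist_matrix <- function(D, tol = 1e-10) {
  if (inherits(D, "dist")) D <- as.matrix(D)
  stopifnot(is.matrix(D), is.numeric(D))
  if (nrow(D) != ncol(D)) stop("dist_mat must be square")
  if (max(abs(D - t(D))) > tol) stop("dist_mat must be symmetric")
  if (max(abs(diag(D))) > tol) stop("dist_mat must have a zero diagonal")
  invisible(D)
}

D_ok  <- as.matrix(dist(iris[1:10, 1:4]))
D_bad <- D_ok
D_bad[1, 2] <- D_bad[1, 2] + 1   # break symmetry on purpose

check_dist_matrix(D_ok)          # passes silently
bad_result <- tryCatch(
  check_dist_matrix(D_bad),
  error = function(e) conditionMessage(e)
)
bad_result
```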

Value

The plotting object is returned and automatically printed:

MDS plots return a ggmatrix from GGally.
Heatmaps return a pheatmap object.
Network graphs are plotted directly (returns NULL).

Examples

# Load iris dataset
data(iris)

# Compute Euclidean distances on numeric columns
dist_iris <- dist(iris[, 1:4])

# Create a grouping factor based on Species
group_species <- iris$Species

# --------------------------------------
# Classical MDS (2D)
# --------------------------------------
visualize_distances(
  dist_mat = dist_iris,
  method = "mds_classic",
  k = 2,
  group = group_species,
  main_title = "Classical MDS - Iris Dataset - Euclidean Distance"
)

# --------------------------------------
# Weighted MDS (uniform weights)
# --------------------------------------
weights <- rep(1, nrow(iris))
visualize_distances(
  dist_mat = dist_iris,
  method = "mds_weighted",
  k = 2,
  weights = weights,
  group = group_species,
  main_title = "Weighted MDS - Iris Dataset - Euclidean Distance"
)

# --------------------------------------
# Heatmap (limit rows to 30)
# --------------------------------------
visualize_distances(
  dist_mat = dist_iris,
  method = "heatmap",
  group = group_species,
  main_title = "Iris Heatmap by Species - Euclidean Distance",
  max_n = 30,
  palette = "YlGnBu",
  clustering_method = "complete",
  annotation_legend = TRUE,
  stratified_sampling = TRUE,
  seed = 123
)

# --------------------------------------
# Network Graph (limit nodes to 30)
# --------------------------------------
visualize_distances(
  dist_mat = dist_iris,
  method = "qgraph",
  group = group_species,
  max_nodes = 30,
  label_size = 2,
  edge_threshold = 0.1,
  layout = "spring",
  seed = 123,
  main_title = "Iris Network Graph by Species - Euclidean Distance"
)
