Title: | Microbiome Time-Series Data Analysis |
Version: | 0.1.0 |
Description: | Provides tools specifically designed for analyzing longitudinal microbiome data. This tool integrates seven functional modules, providing a systematic framework for microbiome time-series analysis. For more details on inferences involving interspecies interactions see Fisher (2014) <doi:10.1371/journal.pone.0102451>. Details on this package are also described in an unpublished manuscript. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
URL: | https://github.com/Lishijiagg/MicrobTiSDA |
BugReports: | https://github.com/Lishijiagg/MicrobTiSDA/issues |
Imports: | caret, cluster, dplyr, ggdendro, ggplot2, ggraph, MASS, mgcv, plyr, pracma, randomForest, reshape2, scales, splines, stats, tibble, tidygraph, tidyr, vegan, visNetwork |
Depends: | R (≥ 3.5) |
LazyData: | true |
LazyDataCompression: | xz |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2025-09-17 14:32:21 UTC; lishijia |
Author: | Shijia Li |
Maintainer: | Shijia Li <s.li2@uva.nl> |
Repository: | CRAN |
Date/Publication: | 2025-09-22 18:20:02 UTC |
Visualize Microbial Feature Classification Results Using NMDS
Description
This function generates a non-metric multidimensional scaling (NMDS) plot to visualize OTU/ASV classification results. It combines an OTU/ASV table with predicted group labels to produce a scatter plot in NMDS space, where each sample is colored according to its predicted group.
Usage
Classify.vis(
classified_results,
dist_method = "bray",
fig_title = "NMDS visualization",
legd_title = "Predicted Groups",
points_size = 1,
legend_title_size = 8,
legend_text_size = 6,
axis_title_size = 6,
axis_text_size = 6
)
Arguments
classified_results |
A list object of |
dist_method |
Dissimilarity index, defaults to |
fig_title |
A character string specifying the title of the NMDS plot. Default is |
legd_title |
A character string for the legend title representing the predicted groups. Default is |
points_size |
A numeric value specifying the size of the points in the plot. Default is |
legend_title_size |
A numeric value specifying the font size of the legend title. Default is |
legend_text_size |
A numeric value specifying the font size of the legend text. Default is |
axis_title_size |
A numeric value specifying the font size of the axis titles. Default is |
axis_text_size |
A numeric value specifying the font size of the axis text. Default is |
Details
The function expects as input the results of output of the function Data.rf.classifier
.
It conputes NMDS coordinates based on the OTU/ASV data using the specified distance method (defaulting to
Bray-Curtis) and then maps the predicted group labels onto the NMDS coordinates. The visualization is produced
using ggplot2
, with polygons delineating the convex hull of each predicted group.
Value
A ggplot2
object displaying the NMDS plot with samples colored by their predicted group and convex hulls
outlining each group.
Author(s)
Shijia Li
Examples
# Example OTU count data (20 OTUs x 10 samples)
set.seed(123)
otu_data <- matrix(sample(0:100, 200, replace = TRUE), nrow = 20)
colnames(otu_data) <- paste0("Sample", 1:10)
rownames(otu_data) <- paste0("OTU", 1:20)
# Example metadata with group labels
metadata <- data.frame(Group = rep(c("Control", "Treatment"), each = 5))
# Run the classifier
rf_result <- Data.rf.classifier(raw_data = otu_data,
metadata = metadata,
train_p = 0.7,
Group = "Group",
OTU_counts_filter_value = 50)
# If you wish to select the top 5 features:
result <- Rf.biomarkers(rf = rf_result, feature_select_num = 5)
nm_plot <- Classify.vis(classified_results = result,
dist_method = 'bray',
fig_title = 'NMDS Plot',
legd_title = 'Predicted Groups',
points_size = 1.5)
Cluster OTU Time-Series Data Based on Regression Model prediction and Generate Dendrogram Plots
Description
This function performs hierarchical clustering on predicted OTU time-series data for different groups
and generates corresponding dendrogram plots. For each group in the input list, the function computes a
correlation-based distance matrix, performs hierarchical clustering using the specified clustering method
(e.g. average
), and then converts the result into a dendrogram.
Usage
Data.cluster(
predicted_data,
clust_method = "complete",
font_size = 0.2,
dend_title_size = 15
)
Arguments
predicted_data |
The output data frame from the |
clust_method |
A string, the agglomeration method to be used. This argument should be one of "ward.D", "ward.D2", "single",
"complete", "average", "mcquitty", "median", "centroid". Detail see |
font_size |
A numeric value specifying the font size for text labels in the dendrogram plots (default: |
dend_title_size |
A numeric value specifying the font size of the dendrogram plot title (default: |
Details
For each group in the input predicted_data
, the function first extracts the predicted OTU data (excluding the
last column, which is assumed to contain time information) and computes a correlation matrix, which is converted
into a distance matrix via
d_{\text{corr}}(x,y) = 1-\frac{{\sum_{i=1}^{n}(x_i-\bar{y})}}{{\sqrt{{\sum_{i=1}^{n}(x_i-\bar{x})^2}} \sqrt{{\sum_{i=1}^{n}(y_i-\bar{y})^2}}}}
where x
and y
represent the two OTU time series being compared, n
denotes the total number of time points, and
\bar{x}
and \bar{y}
denote the means of the respective time series. Hierarchical clustering is
performed on the above distance matrix using the method specified in clust_method
.
Value
An object of MicrobTiSDA.cluster
with three elements:
- predicted_data
The original input list of predicted data.
- cluster_results
A list of hierarchical clustering objects (one per group).
- cluster_figures
A list of ggplot2 objects containing the dendrogram plots for each group.
Author(s)
Shijia Li
Examples
# Example metadata with grouping variables
metadata <- data.frame(
TimePoint = c(1, 2, 3, 4),
Sample = c('S1', 'S2', 'S3', 'S4'),
GroupA = c('A', 'A', 'B', 'B'),
GroupB = c('X', 'Y', 'X', 'Y')
)
# Example pre-processed data (e.g., transformed abundance data)
Pre_processed_Data <- data.frame(
Feature1 = rnorm(4),
Feature2 = rnorm(4)
)
# Create design matrix using grouping variables
design_data <- Design(metadata, Group_var = c('GroupA', 'GroupB'), Pre_processed_Data,
Sample_Time = 'TimePoint', Sample_ID = 'Sample')
reg <- Reg.SPLR(design_data,
Pre_processed_Data,
z_score = 2,
unique_values = 5,
Knots = NULL,
max_Knots = 5)
predictions <- Pred.data(reg,
metadata,
Group = "GroupA",
time_step = 1,
Sample_Time = "TimePoint")
result <- Data.cluster(predicted_data = predictions,
clust_method = "average",
font_size = 0.2,
dend_title_size = 15)
Cut and Annotate Dendrogram Based on a Specified Cut-Off Height
Description
This function processes clustering outputs from Data.cluster
to update dendrogram plots.
Depending on the user’s preference, it either automatically determines the optimal number of clusters via
silhouette analysis or uses a user-specified cut-off height.
Usage
Data.cluster.cut(
cluster_outputs,
cut_height,
cut_height_dist = 0.2,
font_size = 0.2,
auto_cutree = FALSE
)
Arguments
cluster_outputs |
The output object of |
cut_height |
A numeric value specifying the cut-off height for cutting the dendrogram when |
cut_height_dist |
A numeric value used to adjust the vertical distance of the cut-off line annotation in the dendrogram
plot (default: |
font_size |
A numeric value specifying the font size for text labels in the dendrogram plots (default: |
auto_cutree |
Logical; if |
Details
The function takes as input a list containing predicted data and clustering results (typically generated by another function) from
Data.cluster
, and then computes a correlation-based distance matrix for each group. If auto_cutree
is TRUE
, the function performs a repeated k-fold cross-validation by iterating over a range of potential cluster numbers
and computing the average silhouette width, thereby determining the optimal number of clusters. The dendrogram is then cut accordingly,
and the resulting clusters are used to annotate the dendrogram plot with different colors for each cluster.
If auto_cutree
is FALSE
, the function uses the provided cut_height
to cut the dendrogram. It then assigns cluster
emberships based on this cut-off and updates the dendrogram plot by adding a horizontal dashed line at the specified cut-off and annotating
the plot with the cut-off value. In both cases, the function prints the dendrogram plot for each group and returns a list containing the
clustering results and the corresponding ggplot2 objects of the dendrograms.
Value
A object of MicrobTiSDA.clusterCut
with two elements:
- cluster_results
A list of clustering objects for each group.
- cluster_figures
A list of ggplot2 objects containing the annotated dendrogram plots for each group.
Author(s)
Shijia Li
Examples
# Example metadata with grouping variables
metadata <- data.frame(
TimePoint = c(1, 2, 3, 4),
Sample = c('S1', 'S2', 'S3', 'S4'),
GroupA = c('A', 'A', 'B', 'B'),
GroupB = c('X', 'Y', 'X', 'Y')
)
# Example pre-processed data (e.g., transformed abundance data)
Pre_processed_Data <- data.frame(
Feature1 = rnorm(4),
Feature2 = rnorm(4)
)
# Create design matrix using grouping variables
design_data <- Design(metadata, Group_var = c('GroupA', 'GroupB'), Pre_processed_Data,
Sample_Time = 'TimePoint', Sample_ID = 'Sample')
reg <- Reg.SPLR(design_data,
Pre_processed_Data,
z_score = 2,
unique_values = 5,
Knots = NULL,
max_Knots = 5)
predictions <- Pred.data(reg,
metadata,
Group = "GroupA",
time_step = 1,
Sample_Time = "TimePoint")
result <- Data.cluster(predicted_data = predictions,
clust_method = "average",
font_size = 0.2,
dend_title_size = 15)
result <- Data.cluster.cut(cluster_outputs = result,
cut_height = 0.3,
cut_height_dist = 0.2,
auto_cutree = FALSE)
# To automatically determine the optimal number of clusters:
result_auto <- Data.cluster.cut(cluster_outputs = result, auto_cutree = TRUE)
Filtering for Microbial Features of Low Abundance and Low Prevalence.
Description
This function filteres an OTU/ASV table based on overall counts and prevalence thresholds, and optionally applies a logarithmic transformation. When grouping variables are provided, the function performs abundance and prevalence filtering within each group separately.
Usage
Data.filter(
Data,
metadata,
OTU_counts_filter_value = 1000,
OTU_filter_value = NA,
log_base = NA,
Group_var = NULL
)
Arguments
Data |
A data frame or a list object which contains the selected biomarker count table
(generated by |
metadata |
A data frame. Containing information about all samples, including at least the grouping of all samples as well as
individual information ( |
OTU_counts_filter_value |
An integer, indicating the sum of the minimum abundances of OTUs/ASVs in all samples. If the sum
of the abundances that OTU is below the given positive integer threshold, the OTU is excluded, and vice versa, it is retained.
The default is 1000. Note: if the input |
OTU_filter_value |
Numeric between 0 and 1. This specifies the minimum prevalence rate of an OTU/ASV across all samples within each group or individual. OTUs/ASVs with a prevalence rate below the given threshold will be removed. |
log_base |
This argument specifies the base of the logarithm. When the dataset is not very large, the default is |
Group_var |
A string or a vector. This specifies the grouping variables, which should match the column names in
the |
Details
The function executes several key steps:
-
Input Validation: It first checks whether the input
Data
is a data frame or a list generated by functionData.rf.classifier
. IfData
is a list but not a data frame, the first element is extracted. Otherwise, ifData
is neither a data frame nor an appropriate list, the function stops with an error. -
OTU Count Filtering: If an
OTU_counts_filter_value
is provided (i.e., notNA
), OTUs with total counts (across all samples) less than or equal to this value are removed. -
Logarithmic Transformation: If a
log_base
is specified (allowed values are 10, 2, or e), a log transformation (with an offset of 1 to avoid log(0)) is applied to the data. Iflog_base
isNA
, the data remains untransformed. -
Prevalence Filtering without Grouping: When
Group_var
is not provided (NULL
), if anOTU_filter_value
is specified, the function filters out OTUs whose prevalence (the proportion of samples with a non-zero count) is less than the threshold. IfOTU_filter_value
is not provided, a warning is issued and no prevalence filtering is applied. -
Group-based Prevalence Filtering: If one or more grouping variables are specified in
Group_var
, the function first checks that these variables exist inmetadata
. For each group (or combination of groups if multiple variables are provided), the prevalence of each OTU is calculated, and OTUs are retained if they meet the prevalence threshold in at least one group. The filtered OTU table is then returned.
Value
A list of class FilteredData
containing:
- filtered_table
The filtered OTU count table, optionally log-transformed.
- parameters
A list of the filtering parameters used.
- metadata
The input metadata, possibly augmented with a combined grouping variable if multiple
Group
s were provided.
Author(s)
Shijia Li
Examples
# Example OTU table
set.seed(123)
otu_table <- as.data.frame(matrix(sample(0:100, 100, replace = TRUE), nrow = 10))
rownames(otu_table) <- paste0("OTU", 1:10)
colnames(otu_table) <- paste0("Sample", 1:10)
# Example metadata
metadata <- data.frame(
Group = rep(c("A", "B"), each = 5),
row.names = paste0("Sample", 1:10)
)
# Filter OTU table without grouping
filtered_data <- Data.filter(
Data = otu_table,
metadata = metadata,
OTU_counts_filter_value = 50,
OTU_filter_value = 0.2
)
# Filter OTU table with grouping
filtered_data_grouped <- Data.filter(
Data = otu_table,
metadata = metadata,
OTU_filter_value = 0.5,
Group_var = "Group"
)
Interpolate Time-Series Data Based on Sample Time
Description
This function performs interpolation on a data frame or matrix (e.g., OTU/ASV counts or other time-series measurements) using corresponding metadata time points. For each unique subject (as defined by a subject ID), the function constructs a full time series between the minimum and maximum time points and applies interpolation (defaulting to cubic interpolation) to generate data for missing time points. The function returns both the interpolated time-series data and the associated updated metadata.
Usage
Data.interpolate(
Data,
metadata,
Sample_Time,
Sample_ID,
interp_method = "cubic",
Group_var
)
Arguments
Data |
A data frame where rows represent OTUs/ASVs and columns represent samples Or the output of the function
|
metadata |
A data frame. Containing information about all samples, including at least the grouping of all samples as well as
individual information ( |
Sample_Time |
A character string specifying the column name in |
Sample_ID |
A character string specifying the column name in |
interp_method |
A character string specifying the interpolation method to be used by |
Group_var |
A character string specifying the column name in |
Details
This function processes the input data and metadata by interating over each unique subject ID defined in Sample_ID
. For
each subject, it subsets and sorts the metadata by Sample_Time
and constructs a complete time series from the minimum
to maximum time values with a step of 1. It then extracts the corresponding data columns and performs interpolation (Using
the specified interp_method
, with cubic
as the default) on each feature across the full time series. Simultaneously,
updated metadata is generated for the interpolated time points, preserving the subject ID and group information as indicated by
Group_var
. The function returns a list object containing the interpolated data matrix and the corresponding
updated metadata.
Value
An object of class "MicrobTiSDA.interpolate"
containing:
Interpolated_Data |
A data frame of interpolated abundance data. |
Interpolated_Data_metadata |
A data frame of corresponding interpolated metadata. |
Author(s)
Shijia Li
Examples
# Example data: 5 features across 8 samples with time points from two subjects.
set.seed(123)
Data <- matrix(sample(1:100, 40, replace = TRUE), nrow = 5)
rownames(Data) <- paste0("Feature", 1:5)
colnames(Data) <- paste0("Sample", 1:8)
# Create metadata with time points, sample IDs, and group assignments.
metadata <- data.frame(
Time = c(1, 3, 5, 7, 2, 4, 6, 8),
ID = c(rep("Subject1", 4), rep("Subject2", 4)),
Group = c(rep("A", 4), rep("B", 4)),
row.names = paste0("Sample", 1:8)
)
# Interpolate the data using cubic interpolation.
interp_results <- Data.interpolate(Data = Data,
metadata = metadata,
Sample_Time = "Time",
Sample_ID = "ID",
interp_method = "cubic",
Group_var = "Group")
Visualize Opposing Temporal profiles among Microbial features.
Description
This function identifies and visualizes OTUs/ASVs that exhibit opposing temporal trends based on a correlation threshold. It computes the correlation matrix of predicted OTU/ASV time-series data, selects those OTUs with correlations below a specified threshold, and generates smoothed temporal profile plots. Optionally, raw data points are overlaid and taxonomic annotations are added if provided.
Usage
Data.opp.cor.vis(
predicted_data,
pre_processed_data,
Design_data,
ng_cor_thres,
Taxa = NULL,
plot_dots = TRUE,
figure_x_scale = 5,
title_size = 10,
axis_title_size = 10,
axis_text_y_size = 8,
axis_text_x_size = 8,
legend_title_size = 10,
legend_text_size = 8,
dots_size = 0.6
)
Arguments
predicted_data |
The output data frame from the |
pre_processed_data |
The transformed data output from the |
Design_data |
The output data from the |
ng_cor_thres |
A numeric value specifying the correlation threshold below which OTUs are considered to exhibit opposing trends. |
Taxa |
A data frame providing taxonomic annotations for microbial species. |
plot_dots |
Logical; if |
figure_x_scale |
A numeric value specifying the interval for x-axis breaks in the figures (default: |
title_size |
A numeric value specifying the font size for the plot title (default: |
axis_title_size |
A numeric value specifying the font size for the axis titles (default: |
axis_text_y_size |
A numeric value specifying the font size for the y-axis text (default: |
axis_text_x_size |
A numeric value specifying the font size for the x-axis text (default: |
legend_title_size |
A numeric value specifying the font size for legend titles (default: |
legend_text_size |
A numeric value specifying the font size for legend text (default: |
dots_size |
A numeric value specifying the size of the overlaid raw data points (default: |
Details
For each group in the predicted_data
list, the function first removes the time column and computes a correlation matrix
from the predicted OTU data. It then extracts, for each OTU, the subset of OTUs that show a correlation lower than the specified threshold
(\text{ng\_cor\_thres}
). If an OTU does not have any opposing trends (i.e., all correlations exceed the threshold), it is skipped.
For those OTUs meeting the criteria, the function restructures the data into long format and plots the temporal profiles using
geom_smooth
to display the primary trend (solid line) and opposing trends (dashed lines). If plot_dots
is TRUE
,
the raw data points extracted from the design data are also overlaid. When taxonomic annotations are provided via Taxa
,
OTU labels are augmented with species information. The x-axis is scaled according to figure_x_scale
, and plot aesthetics
(titles, axis text, legends, and dot sizes) can be customized using the respective parameters.
Value
An object of class DataOppCorVis
which contains the list of each targeted microbial feature.
Author(s)
Shijia Li
Random Forest classification for OTU/ASV Data
Description
This function implements a random forest classification model tailored for OTU/ASV datasets. It performs data filtering, model training, performance evaluation, cross-validation, and biomarker (important microbial features) selection based on Mean Decrease Accuracy. #' @details The function processes the input OTU count data and corresponding metadata in several steps:
-
Data Filtering and Preparation: If a minimum count threshold (
OTU_counts_filter_value
) is provided, OTUs with total counts below this value are removed. The OTU table is then transposed and merged with the metadata, where a specific column (specified byGroup
) indicates the group labels. -
Data Partitioning: The combined dataset is split into training and testing subsets based on the proportion specified by
train_p
. -
Model Training: A random forest classifier is trained on the training data. The function computes the margin scores for the training samples, which are plotted to visualize the model’s confidence.
-
Performance Evaluation: Predictions are made on both training and testing datasets. Confusion matrices are generated to compare the actual versus predicted classes.
-
Feature Importance and Cross-Validation: OTU importance is assessed using Mean Decrease Accuracy. Repeated k-fold cross-validation (default 10-fold repeated
reps
times) is performed to determine the optimal number of OTUs (biomarkers). A cross-validation error curve is plotted, and the user is prompted to input the best number of OTUs based on the plot.
Usage
Data.rf.classifier(
raw_data,
metadata,
train_p,
Group,
OTU_counts_filter_value = NA,
reps = 5,
cv_fold = 10,
title_size = 10,
axis_title_size = 8,
legend_title_size = 8,
legend_text_size = 6,
seed = 123
)
Arguments
raw_data |
A numeric matrix or data frame of counts data with OTUs/ASVs as rows and samples as columns. |
metadata |
A data frame. Containing information about all samples, including at least the grouping of all samples as well as
individual information ( |
train_p |
A positive decimal. Indicating the percentage of data that goes to training. For example, when
|
Group |
A string that specifies the columns in the |
OTU_counts_filter_value |
An integer, indicating the sum of the minimum abundances of OTUs/ASVs in all samples. If the sum
of the abundances that OTU/ASV is below the given positive integer threshold, the OTU/ASV is excluded, and vice versa, it
is retained. The default is |
reps |
An integer. The number of replications for cross-validation. By default, |
cv_fold |
An integer. Number of folds in the cross-validation. By default, |
title_size |
Numeric value for the font size of plot titles. Defaults to 10. |
axis_title_size |
Numeric value for the font size of axis titles. Defaults to 8. |
legend_title_size |
Numeric value for the font size of legend titles. Defaults to 8. |
legend_text_size |
Numeric value for the font size of legend text. Defaults to 6. |
seed |
Random seed. |
Value
An object of class DataRFClassifier
with the following elements:
- Input_data
The transposed and (optionally) filtered OTU table.
- Predicted_results_on_train_set
A vector of predicted group labels for the training set.
- Predicted_results_on_test_set
A vector of predicted group labels for the test set.
- Traindata_confusion_matrix
A confusion matrix comparing actual vs. predicted group labels for the training set.
- Testdata_confusion_matrix
A confusion matrix comparing actual vs. predicted group labels for the test set.
- Margin_scores_train
A ggplot object displaying the margin scores of the training set samples.
- OTU_importance
A data frame of OTU importance metrics, sorted by Mean Decrease Accuracy.
- Classifier
A random forest classifier object trained on the training set.
- cross_validation
A ggplot object showing the cross-validation error curve as a function of the number of features.
Author(s)
Shijia Li
Examples
# Example OTU count data (20 OTUs x 10 samples)
set.seed(123)
otu_data <- matrix(sample(0:100, 200, replace = TRUE), nrow = 20)
colnames(otu_data) <- paste0("Sample", 1:10)
rownames(otu_data) <- paste0("OTU", 1:20)
# Example metadata with group labels
metadata <- data.frame(Group = rep(c("Control", "Treatment"), each = 5))
# Run the classifier
result <- Data.rf.classifier(raw_data = otu_data,
metadata = metadata,
train_p = 0.7,
Group = "Group",
OTU_counts_filter_value = 50)
Transform Microbial Composition Data.
Description
This function applies the modified centered log-ratio (MCLR) transformation function mclr.transform
to a data matrix (e.g., OTU/ASV counts). When a grouping variable is provided the transformation is applied separately for
each group defined in the metadata.
Usage
Data.trans(Data, metadata, Group_var)
Arguments
Data |
A data frame or matrix of microbial compositional data, with rows representing microbial features (OTUs/ASVs)
and columns representing samples. If the function |
metadata |
A data frame. Containing information about all samples, including at least the grouping of all samples as well as
individual information ( |
Group_var |
A string or a vector. This specifies the grouping variables, which should match the column names in
the |
Details
The function transforms the input data using the MCLR method. If no grouping variable is provided (i.e. Group_var
is NULL
), the transformation is applied to the entire dataset. If a single grouping variable is specified,
the data is partitioned into subsets corresponding to the unique groups in the metadata, and the transformation
is applied to each subset separately; the results are then combined using row binding. For multiple grouping variables,
a composite grouping factor is created using the interaction of the specified variables, and the transformation is
applied to each composite group in a similar manner.
Value
An object of class "TransformedData"
containing the transformed count table.
Author(s)
Shijia Li
Examples
# Create example data matrix (5 features x 10 samples)
set.seed(123)
Data <- matrix(sample(1:100, 50, replace = TRUE), nrow = 5)
rownames(Data) <- paste0("Feature", 1:5)
colnames(Data) <- paste0("Sample", 1:10)
# Create example metadata with a grouping variable
metadata <- data.frame(Group = rep(c("A", "B"), each = 5))
rownames(metadata) <- paste0("Sample", 1:10)
# Apply MCLR transformation to the entire dataset
transformed_data <- Data.trans(Data, metadata, Group_var = NULL)
# Apply MCLR transformation separately for each group
transformed_data_by_group <- Data.trans(Data, metadata, Group_var = "Group")
Visualize Temporal OTU Profiles from Clustered Predicted Data
Description
The Data.visual
function generates visualizations of temporal profiles for OTUs by integrating clustering results,
predicted time-series data, and design information. It produces ggplot2 figures for each group and for each cluster branch,
displaying smoothed curves of predicted OTU abundances over time. Optionally, the function overlays raw data points and fits linear
models to assess temporal trends, annotating the plots with model statistics when certain criteria are met.
Usage
Data.visual(
cluster_results,
cutree_by = "height",
cluster_height = NA,
cluster_branches = NA,
predicted_data,
Design_data,
pre_processed_data,
Taxa = NULL,
plot_dots = TRUE,
figure_x_scale = 5,
plot_lm = FALSE,
lm_R2 = 0.01,
lm_abs_slope = 0.005,
title_size = 10,
axis_title_size = 8,
axis_y_size = 5,
axis_x_size = 5,
lm_sig_size = 5,
legend_title_size = 5,
legend_text_size = 5,
dots_size = 0.7
)
Arguments
cluster_results |
A list object output from the |
cutree_by |
A character string specifying the method to cut the dendrogram, either by |
cluster_height |
A numeric vector specifying the cut-off height for each group when |
cluster_branches |
A numeric vector specifying the number of clusters for each group when |
predicted_data |
The output data frame from the |
Design_data |
The output data from the |
pre_processed_data |
The transformed data output from the |
Taxa |
A data frame providing taxonomic annotations for microbial species. |
plot_dots |
Logical; if |
figure_x_scale |
A numeric value specifying the interval for x-axis breaks in the figures (default: |
plot_lm |
Logical; if |
lm_R2 |
A numeric threshold for the minimum R-squared value required to annotate the linear model (default: |
lm_abs_slope |
A numeric threshold for the minimum absolute slope required to annotate the linear model (default: |
title_size |
A numeric value specifying the font size for the plot title (default: |
axis_title_size |
A numeric value specifying the font size for the axis titles (default: |
axis_y_size |
A numeric value specifying the font size for the y-axis text (default: |
axis_x_size |
A numeric value specifying the font size for the x-axis text (default: |
lm_sig_size |
A numeric value specifying the font size for linear model annotation text (default: |
legend_title_size |
A numeric value specifying the font size for legend titles (default: |
legend_text_size |
A numeric value specifying the font size for legend text (default: |
dots_size |
A numeric value specifying the size of the overlaid raw data points (default: |
Details
This function uses hierarchical clustering results (obtained from a dendrogram) to cut the tree either by a specified height or
by a user specified number of branches of each dendrogram in cluster_results
. For each group in cluster_results
,
the function extracts the corresponding predicted OTU data and raw design data. Temporal profiles are visualized by plotting
smooth curves (using stat_smooth
) for each cluster branch. When plot_dots
is set to TRUE
, the function
overlays raw data points. Additionally, if plot_lm
is TRUE
, a linear model is fitted to the predicted data,
and if the model meets specified thresholds for R-squared (lm_R2
) and absolute slope (lm_abs_slope
)
(i.e., R2 > 0.1 and absolute slope > 0.05), a dashed regression line is added along with an annotation of
the R-squared and slope values. The resulting list of ggplot2 objects can be used to visually inspect the temporal dynamics of
OTUs across different clusters and groups.
Value
An object of class MicrobTiSDA.visual
which contains the list of visualizations of clustered microbial features.
Examples
metadata <- data.frame(
TimePoint = c(1, 2, 3, 4),
Sample = c('S1', 'S2', 'S3', 'S4'),
GroupA = c('A', 'A', 'B', 'B'),
GroupB = c('X', 'Y', 'X', 'Y')
)
# Example pre-processed data (e.g., transformed abundance data)
Pre_processed_Data <- data.frame(
Feature1 = rnorm(4),
Feature2 = rnorm(4)
)
# Create design matrix using grouping variables
design_data <- Design(metadata, Group_var = c('GroupA', 'GroupB'), Pre_processed_Data,
Sample_Time = 'TimePoint', Sample_ID = 'Sample')
reg <- Reg.SPLR(design_data,
Pre_processed_Data,
z_score = 2,
unique_values = 5,
Knots = NULL,
max_Knots = 5)
predictions <- Pred.data(reg,
metadata,
Group = "GroupA",
time_step = 1,
Sample_Time = "TimePoint")
result <- Data.cluster(predicted_data = predictions,
clust_method = "average",
font_size = 0.2,
dend_title_size = 15)
result <- Data.cluster.cut(cluster_outputs = result,
cut_height = 0.3,
cut_height_dist = 0.2,
auto_cutree = FALSE)
curves <- Data.visual(cluster_results = result,
cutree_by = "height",
cluster_height = c(0.2,0.2),
cluster_branches = NA,
predicted_data = predictions,
Design_data = design_data,
pre_processed_data = Pre_processed_Data,
Taxa = NULL,
plot_dots = TRUE)
Visualize Group-Level OTU Temporal Profiles from Clustered Predicted Data
Description
This function visualizes the temporal patterns of microbial features at the group level, specifically tailored for data derived from mixed-effects spline regression (MESR) analyses. It leverages clustering results to segregate features into clusters based on their temporal trends, and then generates smoothed time-series plots for each cluster.
Usage
Data.visual.MESR(
cluster_results,
cutree_by = "height",
cluster_height = NA,
cluster_branches = NA,
predicted_data,
Design_data,
pre_processed_data,
Taxa = NULL,
plot_dots = TRUE,
figure_x_scale = 10,
plot_lm = TRUE,
lm_R2 = 0.01,
lm_abs_slope = 0.005,
title_size = 10,
axis_title_size = 8,
axis_y_size = 5,
axis_x_size = 5,
lm_sig_size = 5,
legend_title_size = 5,
legend_text_size = 5,
dots_size = 0.7
)
Arguments
cluster_results |
A list object output from the |
cutree_by |
A character string specifying the method to cut the dendrogram, either by |
cluster_height |
A numeric vector specifying the cut-off height for each group when |
cluster_branches |
A numeric vector specifying the number of clusters for each group when |
predicted_data |
The output data frame from the |
Design_data |
The output data from the |
pre_processed_data |
The transformed data output from the |
Taxa |
A data frame providing taxonomic annotations for microbial species. |
plot_dots |
Logical; if |
figure_x_scale |
A numeric value specifying the interval for x-axis breaks in the figures (default: |
plot_lm |
Logical; if |
lm_R2 |
A numeric threshold for the minimum R-squared value required to annotate the linear model (default: |
lm_abs_slope |
A numeric threshold for the minimum absolute slope required to annotate the linear model (default: |
title_size |
A numeric value specifying the font size for the plot title (default: |
axis_title_size |
A numeric value specifying the font size for the axis titles (default: |
axis_y_size |
A numeric value specifying the font size for the y-axis text (default: |
axis_x_size |
A numeric value specifying the font size for the x-axis text (default: |
lm_sig_size |
A numeric value specifying the font size for linear model annotation text (default: |
legend_title_size |
A numeric value specifying the font size for legend titles (default: |
legend_text_size |
A numeric value specifying the font size for legend text (default: |
dots_size |
A numeric value specifying the size of the overlaid raw data points (default: |
Details
The function begins by selecting branches from hierarchical clustering objects (provided in cluster_results
) using either
a specified cut-off height or a predefined number of clusters, as determined by the cutree_by
parameter. For each group, it extracts
the corresponding raw data from Design_data
and determines the y-axis limits based on both the pre-processed data and the predicted data.
Then, for each cluster within a group, the function subsets the predicted data to include only those features belonging to that cluster. If
taxonomic annotation data (Taxa
) is provided, feature names are augmented with species-level labels. The data is then reshaped into
a long format and plotted using ggplot2
, where smoothed curves (via stat_smooth
) depict the predicted temporal profiles.
Optionally, raw data points can be overlaid (if plot_dots
is TRUE
), and a linear model is fitted to each cluster’s data to test
for significant trends. When the linear model meets criteria based on p-value (< 0.05), R² (greater than lm_R2
), and a minimum absolute slope
(greater than lm_abs_slope
), a dashed regression line is added with an annotation indicating the trend direction (upward or downward) along with
the R² and slope values. Various parameters allow customization of plot appearance including axis scales, font sizes, and legend properties.
Value
An object of class MicrobTiSDA.MSERvisual
which contains lists of ggplot2 objects, where each top-level element corresponds to a
group and each sub-element corresponds to a cluster within that group. Each plot visualizes the temporal profiles of microbial features in that cluster.
Create Design Matrix for Regression Analysis
Description
Design
creates the design matrix of dummies for fitting regression models of microbiota in time.
Usage
Design(metadata, Group_var = NULL, Pre_processed_Data, Sample_Time, Sample_ID)
Arguments
metadata |
A data frame containing information for all samples, which should be identical to the |
Group_var |
A string or a vector. Same as the |
Pre_processed_Data |
The transformed data output from the |
Sample_Time |
A character string indicating the column name in |
Sample_ID |
A character string indicating the column name in |
Details
The main functionality of Design
is to add user-selected sample information to the pre-processed OTU/ASV table
as independent variables for fitting the OTU time series regression models. One necessary independent variable
for fitting is Time
, so the default output of this function is the transformed OTU/ASV table with added
sample Time information. If the user also inputs other qualitative variables such grouping, gender, etc., the
function will define dummy variables to distinguish each group based on the number of qualitative variables
entered by the user and the grouping situation of samples based on qualitative variables. Moreover, the subject ID
of each sample will be added as a column to the generated design matrix.
Value
An object of class Design
, which contains:
data |
A data frame ready for regression modeling. |
params |
A list of parameters used to construct the design. |
Author(s)
Shijia Li
Examples
# Example metadata with grouping variables
metadata <- data.frame(
TimePoint = c(1, 2, 3, 4),
Sample = c('S1', 'S2', 'S3', 'S4'),
GroupA = c('A', 'A', 'B', 'B'),
GroupB = c('X', 'Y', 'X', 'Y')
)
# Example pre-processed data (e.g., transformed abundance data)
Pre_processed_Data <- data.frame(
Feature1 = rnorm(4),
Feature2 = rnorm(4)
)
# Create design matrix using grouping variables
design_data <- Design(metadata, Group_var = c('GroupA', 'GroupB'), Pre_processed_Data,
Sample_Time = 'TimePoint', Sample_ID = 'Sample')
# Create design data without grouping variables
design_data_no_group <- Design(metadata, Group_var = NULL, Pre_processed_Data,
Sample_Time = 'TimePoint', Sample_ID = 'Sample')
Dynamic Visualization of Microbial Interaction Networks
Description
This function visualizes microbial species interactions based on the given inter-species interaction results by Spec.interact
.
Usage
Interact.dyvis(
Interact_data,
threshold,
core_arrow_num,
Taxa = NULL,
fontsize = 15
)
Arguments
Interact_data |
A list containing the inferred species interaction results generated by |
threshold |
Indicates the minimum interaction coefficient used to visualize species interactions. The default value is 1e-6. |
core_arrow_num |
Indicates the number of species pairs involved in species interactions that identify keystone species. The default value is 4, meaning that the screened keystone species must interact with three other species populations in addition to its own population. |
Taxa |
A data frame providing taxonomic annotations for microbial species. |
fontsize |
Indicates the size of the text font. The default is 15. |
Details
The function processes microbial interaction matrices from Spec.interact
to construct species interaction networks.
It selects edges with absolute interaction strengths above threshold
, assigns interaction types (positive or negative).
The function supports onpional taxonomic annotations for improved interpretability.
The nodes represent microbial species,. Edge colors indicate interaction types: blue for negative and orange for positive interactions. Edge widths are scaled by interaction strength. If taxonomic annotations are provided, node labels reflect species-level taxonomy; otherwise, OTU/ASV IDs are used.
Value
An S3 objects of species interact information or plots.
Author(s)
Shijia Li
Example dataset 1 - The OSLO term infant gut microbiota dataset.
Description
A count table used as an example for MicrobTiSDA functions.
Usage
OSLO.infant.data
Format
A data frame with 476921 rows (OTUs) and 77 columns (samples).
Examples
data("OSLO.infant.data")
head("OSLO.infant.data")
Example dataset 1 - sample metadata.
Description
Metadata corresponding to the samples in counts of the OSLO Term infant gut microbiota dataset.
Usage
OSLO.infant.meta
Format
A data frame with 77 rows (samples) and 4 columns (variables).
Examples
data("OSLO.infant.meta")
head("OSLO.infant.meta")
Example dataset 1 - taxonomic information.
Description
Taxonomic classification for the OTUs in the OSLO Term infant gut microbiota dataset.
Usage
OSLO.infant.taxa
Format
A data frame with 476921 rows (OTUs) and 8 columns (taxonomy levels).
Examples
data("OSLO.infant.taxa")
head("OSLO.infant.taxa")
Predict Time-Series Data from Fitted Spline Models
Description
This function generates predicted OTU abundances at new time points using previously fitted spline regression models. For each subject defined in the fitted models, it extracts the time range from the metadata, creates a new sequence of time points with a specified interval, and uses the corresponding fitted models to predict values at these new time points.
Usage
Pred.data(Fitted_models, metadata, Group, time_step, Sample_Time)
Arguments
Fitted_models |
A list object generated by |
metadata |
A data frame. Containing information about all samples, including at least the grouping of all samples as well as
individual information ( |
Group |
A character string indicating the column name in |
time_step |
A numeric value specifying the interval between new time points in the prediction sequence. |
Sample_Time |
A character string indicating the column name in |
Details
The function accepts a list of fitted models (typically the output from Reg.SPLR
) and sample metadata.
For each subject (as defined by the Group
parameter), it extracts the subset of metadata corresponding to that subject. The minimum and maximum time
values are determined, and a new sequence of time points is generated using the provided time_step
. For each OTU model of the
subject, predictions are obtained via the predict
function applied to the new time points, and the results are compiled into a data
frame with an additional column Predicted_Time
indicating the prediction time. The output is a list of data frames, each corresponding
to a subject, containing the predicted OTU abundances.
Value
An object of class PredictedData
which contains the predict microbial feature abundances at the new time points, with rows
labeled by the group and time (formatted as Group_T_Time
) and an additional column Predicted_Time
.
Author(s)
Shijia Li
Examples
# Example metadata with grouping variables
metadata <- data.frame(
TimePoint = c(1, 2, 3, 4),
Sample = c('S1', 'S2', 'S3', 'S4'),
GroupA = c('A', 'A', 'B', 'B'),
GroupB = c('X', 'Y', 'X', 'Y')
)
# Example pre-processed data (e.g., transformed abundance data)
Pre_processed_Data <- data.frame(
Feature1 = rnorm(4),
Feature2 = rnorm(4)
)
# Create design matrix using grouping variables
design_data <- Design(metadata, Group_var = c('GroupA', 'GroupB'), Pre_processed_Data,
Sample_Time = 'TimePoint', Sample_ID = 'Sample')
reg <- Reg.SPLR(design_data,
Pre_processed_Data,
z_score = 2,
unique_values = 5,
Knots = NULL,
max_Knots = 5)
predictions <- Pred.data(reg,
metadata,
Group = "GroupA",
time_step = 1,
Sample_Time = "TimePoint")
Predict Time-Series Data from Fitted Mixed-Effect Spline regression Models
Description
This function generates predicted OTU abundances at new time points using previously fitted spline regression models. For each group defined in the fitted models, it extracts the time range from the metadata, creates a new sequence of time points with a specified interval, and uses the corresponding fitted models to predict values at these new time points.
Usage
Pred.data.MESR(Fitted_models, metadata, Group, time_step, Sample_Time)
Arguments
Fitted_models |
A list object generated by |
metadata |
A data frame. Containing information about all samples, including at least the grouping of all samples as well as
individual information ( |
Group |
A character string indicating the column name in |
time_step |
A numeric value specifying the interval between new time points in the prediction sequence. |
Sample_Time |
A character string indicating the column name in |
Details
The function accepts a list of fitted models (typically the output from Reg.MESR
) and sample metadata.
For each group (as defined by the Group
parameter), it extracts the subset of metadata corresponding to that group. The minimum and maximum time
values are determined, and a new sequence of time points is generated using the provided time_step
. For each OTU model of the
group, predictions are obtained via the predict
function applied to the new time points, and the results are compiled into a data
frame with an additional column Predicted_Time
indicating the prediction time. The output is a list of data frames, each corresponding
to a group, containing the predicted OTU abundances.
Value
An object of PredictedDataMESR
. Each element of the list corresponds to a group and contains the predicted OTU abundances at the
new time points, with rows labeled by the group and time (formatted as Group_T_Time
) and an additional column Predicted_Time
.
Author(s)
Shijia Li
Examples
metadata <- data.frame(
TimePoint = c(1, 2, 3, 4),
Sample = c('S1', 'S2', 'S3', 'S4'),
GroupA = c('A', 'A', 'B', 'B'),
GroupB = c('X', 'Y', 'X', 'Y')
)
# Example pre-processed data (e.g., transformed abundance data)
Pre_processed_Data <- data.frame(
Feature1 = rnorm(4),
Feature2 = rnorm(4)
)
# Create design matrix using grouping variables
design_data <- Design(metadata, Group_var = c('GroupA', 'GroupB'), Pre_processed_Data,
Sample_Time = 'TimePoint', Sample_ID = 'Sample')
fit_result <- Reg.MESR(Data_for_Reg = design_data,
pre_processed_data = Pre_processed_Data,
unique_values = 5,
z_score = 2,
Knots = NULL,
max_Knots = 5)
predictions <- Pred.data.MESR(fit_result,
metadata,
Group = "Group",
time_step = 1,
Sample_Time = "TimePoint")
Fit Mixed-Effects Spline Regression Models with GAM for Microbial Features on Group-Level
Description
This function fits mixed-effect spline regression models using GAM to capture nonlinear temporal trends in microbial community data at the group level. It incorporates random effects for individual IDs within each group and uses natural splines to model the relationship between time and microbial feature abundances.
Usage
Reg.MESR(
Data_for_Reg,
pre_processed_data,
unique_values = 5,
z_score = NA,
Knots = NULL,
max_Knots = 3,
seed = 123
)
Arguments
Data_for_Reg |
The data frame output from the |
pre_processed_data |
The transformed data output from the |
unique_values |
An integer specifying the minimum number of unique values required in an OTU for spline fitting (default: 5). |
z_score |
A numeric value specifying the z-score threshold for outlier filtering; if |
Knots |
An optional numeric vector specifying the knots to use in the spline regression. If |
max_Knots |
An integer indicating the maximum number of knots to consider when selecting the optimal spline model (default: 3). |
seed |
Random seed. |
Details
This function fits mixed-effects spline regression models. Unlike Reg.SPLR
, this function
captures the overall temporal dynamics of microbial features across different groups. Assuming the dataset contains
\textit{m}
groups, each with \textit{n}
subjects, the model is formulated as:
x_{(mi)}(t) = \beta_{m0} + \sum_{k=1}^{K} \beta_{mk} \cdot N_{mk}(t) + b_{n(m)} + \epsilon
where x_{(mi)}(t)
represents the abundance of microbial feature \textit{i}
at time point \textit{t}
in group \textit{m}
. The random effect b_{n(m)}
reflects the departure of individual \textit{n}
in group
\textit{m}
from overall population average effects. The parameter K
refers to the number of basis functions,
which is equal to the number of knots plus one. (i.e., K = number of knots + 1
).
The Reg.MESR
function first extracts independent variables from a design matrix (produced by the Design
)
by removing columns corresponding to the pre-processed OTU data. For each design dummy variable (excluding the first
and last columns), the function subsets the data to include only the observations where the dummy variable equals 1.
Then, for each OTU in the pre-processed data, it optionally filters out outliers based on a specified z-score threshold.
If the number of unique transformed OTU values exceeds a given threshold (unique_values
), a mixed-effects spline
regression model is fitted using a natural spline on the time variable along with a random effect for the sample ID.
When the Knots
parameter is not provided, the function iterates over a range of knot numbers (from 1
to max_Knots
), selects the optimal model by minimizing the Generalized Cross-Validation (GCV) criterion,
and extracts the corresponding knot locations. Alternatively, if Knots
is provided, it is directly used in
model fitting. The resulting fitted models and associated knots information are organized into nested lists and returned.
Value
An object of class RegMESR
with two elements: fitted_model
is a nested list containing the fitted mixed-effects GAM models for each design
dummy variable and OTU; knots_info_each_model
is a corresponding nested list with the knots used for each model.
Author(s)
Shijia Li
Examples
metadata <- data.frame(
TimePoint = c(1, 2, 3, 4),
Sample = c('S1', 'S2', 'S3', 'S4'),
GroupA = c('A', 'A', 'B', 'B'),
GroupB = c('X', 'Y', 'X', 'Y')
)
# Example pre-processed data (e.g., transformed abundance data)
Pre_processed_Data <- data.frame(
Feature1 = rnorm(4),
Feature2 = rnorm(4)
)
# Create design matrix using grouping variables
design_data <- Design(metadata, Group_var = c('GroupA', 'GroupB'), Pre_processed_Data,
Sample_Time = 'TimePoint', Sample_ID = 'Sample')
fit_result <- Reg.MESR(Data_for_Reg = design_data,
pre_processed_data = Pre_processed_Data,
unique_values = 5,
z_score = 2,
Knots = NULL,
max_Knots = 5)
Fit Spline Regression Models with Knot Selection
Description
The Reg.SPLR
function fits natural spline regression models to time-series OTU data for each subgroup
defined by design dummy variables. It uses gam
to model the relationship between OTU
abundance and time, incorporating a z-score based outlier filtering step (optional) and selecting the optimal number
of knots via the Generalized Cross-Validation (GCV) criterion when not provided by the user.
Usage
Reg.SPLR(
Data_for_Reg,
pre_processed_data,
z_score = NA,
unique_values = 5,
Knots = NULL,
max_Knots = 5,
seed = 123
)
Arguments
Data_for_Reg |
The data frame output from the |
pre_processed_data |
The transformed data output from the |
z_score |
A numeric value specifying the z-score threshold for outlier filtering; if |
unique_values |
An integer specifying the minimum number of unique values required in an OTU for spline fitting (default: 5). |
Knots |
An optional numeric vector specifying the knots to use in the spline regression. If |
max_Knots |
An integer indicating the maximum number of knots to consider when selecting the optimal spline model (default: 5). |
seed |
Random seed. |
Details
Reg_SPLR
utilize Generalised Additive Model (GAM,see gam
) to fit natural splines. Here, let x_{ni}(t)
denote
the transformed abundance of microbial feature i
at time point t
. To explain the evolution of microbial abundance along the
time (t
), the following generalized additive model is considered
x_{ni}(t) = \beta_{n0} + \sum_{k=1}^K \beta_{nk} \cdot N_{nk}(t) + \epsilon_{ni}
where \beta_{n0}
is the intercept term, \beta_{nk}
denotes the regression coefficients for the k-th natural spline basis function
N_{nk}(t)
, and \epsilon_{ni}
is the error term. The parameter K
refers to the number of basis functions, which is equal to the
number of knots plus one. (i.e., K = number of knots + 1
).
If a z-score threshold is specified via z_score
, observations with absolute z-scores exceeding this threshold are removed prior to model
fitting. When the OTU has more unique values than specified by unique_values
, the function searches over a range of knot numbers
(from 1 up to max_Knots
) to select the optimal model based on the GCV criterion; if Knots
is provided, it is used directly
in constructing the natural spline basis. The fitted models and their corresponding knots information are stored in a nested list
structure and returned.
Value
An object of class "MicrobTiSDA_spline_regression"
containing:
fitted_model |
A nested list of the fitted spline regression models for each design dummy variable and OTU. |
knots_info |
Contains the corresponding knots information for each model. |
Examples
# Example metadata with grouping variables
metadata <- data.frame(
TimePoint = c(1, 2, 3, 4),
Sample = c('S1', 'S2', 'S3', 'S4'),
GroupA = c('A', 'A', 'B', 'B'),
GroupB = c('X', 'Y', 'X', 'Y')
)
# Example pre-processed data (e.g., transformed abundance data)
Pre_processed_Data <- data.frame(
Feature1 = rnorm(4),
Feature2 = rnorm(4)
)
# Create design matrix using grouping variables
design_data <- Design(metadata, Group_var = c('GroupA', 'GroupB'), Pre_processed_Data,
Sample_Time = 'TimePoint', Sample_ID = 'Sample')
result <- Reg.SPLR(design_data,
Pre_processed_Data,
z_score = 2,
unique_values = 5,
Knots = NULL,
max_Knots = 5)
# Access the fitted model for a particular design dummy variable and OTU:
fitted_model_example <- result$fitted_model[['Group1']][['OTU_name']]
Select Biomarkers Based on Random Forest Cross-Validation Results
Description
This function extracts the top biomarkers from a random forest classification result based on cross-validation and user specified number of microbial features. It updates the cross-validation plot by adding a vertical dashed line at the specified number of features, and then selects the top features (biomarkers) based on their importance ranking.
Usage
Rf.biomarkers(rf = rf_results, feature_select_num)
Arguments
rf |
A list containing the results of the random forest classification. Default to |
feature_select_num |
A numeric value specifying the number of top features (biomarkers) to select. Typically, the numer of
specified biomarkers needs to be determined by the user based on the cross-validation result plot output by
|
Details
The function takes an object (usually the output from Data.rf.classifier
, which includes
a cross-validation plot, an OTU importance table, and the original input data) and a user-specified
number of features to select. It then updates the cross-validation plot by adding a vertical dashed line
at the position corresponding to the number of selected features. Next, it extracts the top features from the
OTU importance table (ordered by Mean Decrease Accuracy) and creates q table of these features
from the original microbial feature table. The function returns a list that includes both the transposed biomarker
table and the modified cross-validation plot.
Value
An object of class RfBiomarker
with two elements:
- OTU_importance
A data frame of the selected biomarkers (transposed feature table).
- cross_validation_fig
A ggplot object of the cross-validation plot with a vertical dashed line indicating the feature selection cutoff.
Author(s)
Shijia Li
Examples
# Example OTU count data (20 OTUs x 10 samples)
set.seed(123)
otu_data <- matrix(sample(0:100, 200, replace = TRUE), nrow = 20)
colnames(otu_data) <- paste0("Sample", 1:10)
rownames(otu_data) <- paste0("OTU", 1:20)
# Example metadata with group labels
metadata <- data.frame(Group = rep(c("Control", "Treatment"), each = 5))
# Run the classifier
rf_result <- Data.rf.classifier(raw_data = otu_data,
metadata = metadata,
train_p = 0.7,
Group = "Group",
OTU_counts_filter_value = 50)
# If you wish to select the top 5 features:
result <- Rf.biomarkers(rf = rf_result, feature_select_num = 5)
# View the biomarker table
print(result$OTU_importance)
# View the updated cross-validation plot
print(result$cross_validation_fig)
Species Interaction Inferrences
Description
This function describes interspecies interactions based on the discrete-time Lotka-Volterral model.
Usage
Spec.interact(
Data,
metadata,
Group_var,
abund_centered_method = "median",
num_iterations = 10,
error_threshold = 0.001,
pre_error = 10000,
seed = NULL
)
Arguments
Data |
A matrix or data frame of the transformed species abundance data. |
metadata |
A data frame. Containing information about all samples, including at least the grouping of all samples as well as
individual information ( |
Group_var |
A character string specifying the column name in |
abund_centered_method |
A character string indicating the method to compute species equilibrium abundance.
Accepted values are |
num_iterations |
An integer specifying the number of bagging iterations for the iterative variable selection process. Default is 10. |
error_threshold |
A numeric value representing the relative error improvement threshold for adding new predictors during bagging iteration. Default is 1e-3. |
pre_error |
A numeric value specifying the initial (large) error used for comparison in the iterative procedure. Default is 10000. |
seed |
Random seed, default by |
Details
This function implements the discrete-time Lotka-Volterra model to characterize species interactions in microbiome time-series data.
The model describes the abundance (MCLR transformed) x_{ni}
of species i
for subject n
at time t+\Delta t
as:
x_{ni} (t+\Delta t) = \eta_{ni} (t) x_{ni} (t) \exp\left(\Delta t \sum_j c_{nij} (x_{nj} (t) - <x_{nj}>) \right)
where <x_{nj}>
represents the equilibrium abundance of species j
, typically defined as the
median abundance across samples from the same subject; c_{nij}
denotes the interaction coefficient of species
j
on species i
; and \eta_{ni} (t)
accounts for log-normally distributed stochastic effects. For
computational simplicity, stochastic effects are ignored, \Delta t
is set to 1. Taking the natural logarithm yealds:
\ln x_{ni} (t+1) - \ln x_{ni} (t) = \sum_j c_{nij} (x_{nj} (t) - <x_{nj}>)
To improve sparsity and interpretability, the LIMITS algorithm is applied, incorporating stepwise regression and bagging. First, 50% of the samples are randomly selected as the training set while the rest serve as the test set. An initial regression model includes only the self-interaction term:
\ln x_{ni} (t+1) - \ln x_{ni} (t) = c_{nii} (x_{ni} (t) - <x_{ni}>)
Stepwise regression then iteratively adds species interaction terms from a candidate set S
, forming:
\ln x_{ni} (t+1) - \ln x_{ni} (t) = c_{nii} (x_{ni} (t) - <x_{ni}>) + \sum_{j \in S} c_{nij} (x_{nj} (t) - <x_{nj}>)
The inclusion of a new term is determined based on the improvement in mean squared error (MSE) on the test set:
\theta = \frac{\text{MSE}_{\text{before}} - \text{MSE}_{\text{after}}}{\text{MSE}_{\text{before}}}
If \theta
exceeds a predefined threshold (default 10^{-3}
), the species is included. Bagging is performed over B
iterations by repeating the random splitting and stepwise regression, enhancing robustness. The final interaction coefficient matrix is computed as:
c_{nij} = \text{median}(c_{nij}^{(1)}, c_{nij}^{(2)}, ..., c_{nij}^{(B)})
This approach refines the inferred species interactions while ensuring sparsity.
Value
A S3 object with an element for each group defined by Group_var
. Each element is a list containing:
- interaction_matrices
A three-dimensional array of estimated interaction coefficients with dimensions corresponding to features
\times
features\times
iterations.- final_interaction_matrix
A two-dimensional matrix of interaction coefficients obtained by taking the median over the iterations.
Author(s)
Shijia Li
Examples
# Example usage:
set.seed(123)
Data <- matrix(sample(1:100, 50, replace = TRUE), nrow = 5)
rownames(Data) <- paste0("Feature", 1:5)
colnames(Data) <- paste0("Sample", 1:10)
# Create example metadata with a grouping variable
metadata <- data.frame(Group = rep(c("A", "B"), each = 5))
rownames(metadata) <- paste0("Sample", 1:10)
metadata$Time = rep(c(1,2,3,4,5),2)
metadata$ID = paste("ID",seq(1:10),"")
results <- Spec.interact(Data = as.data.frame(t(Data)),
metadata = metadata,
Group_var = "Group",
abund_centered_method = "median",
num_iterations = 5,
error_threshold = 1e-3,
pre_error = 10000)
Example dataset 2 - The in vitro aquatic microbiota dataset.
Description
A count table used as an example for MicrobTiSDA functions.
Usage
fujita.data
Format
A data frame with 28 rows (ASVs) and 880 columns (samples).
Examples
data("fujita.data")
head("fujita.data")
Example dataset 2 - sample metadata.
Description
Metadata corresponding to the samples in counts of the in vitro aquatic microbiota dataset.
Usage
fujita.meta
Format
A data frame with 880 rows (samples) and 8 columns (variables).
Examples
data("fujita.meta")
head("fujita.meta")
Example dataset 2 - taxonomic information.
Description
Taxonomic classification for the ASVs in the in vitro aquatic microbiota dataset.
Usage
fujita.taxa
Format
A data frame with 565 rows (ASVs) and 9 columns (taxonomy levels).
Examples
data("fujita.taxa")
head("fujita.taxa")
Modified Centered Log-Ratio (MCLR) Transformation
Description
Applies a modified centered log-ratio (MCLR) transformation to compositional data. This transformation is particularly useful in microbiome and compositional data analysis, as it normalizes the data by comparing each value to the geometric mean of the positive values in its row.
Usage
mclr.transform(Z, base = exp(1), eps = 0.1)
Arguments
Z |
A numeric matrix or data frame containing the compositional data to be transformed. |
base |
A numeric value specifying the logarithmic base to use (default is |
eps |
A small positive constant added to the transformed data to ensure positivity and avoid zeros (default is |
Details
The MCLR method calculates the geometric mean of each sample from positive proportions only, normalized and
log-transformation all non-zero components in the dataset. Specifically, let x_{nt} \in \Omega^I
denotes
the compositional vector for the sample from subject \textit{n}
at timepoint \textit{t}
, where \Omega^I
represents the collection of \textit{I}
microbial features.For simplicity of illustration, assume that the first \textit{q}
elements of x_{nt}
are zero while the remaining elements are non-zero. Then itcan be expressed as:
mclr_\epsilon (x_{nt}) = [0, \dots, 0, \ln{\left(\frac{x_{nt(q+1)}}{\tilde{g}(x_{nt})}\right)} + \epsilon, \dots, \ln{\left(\frac{x_{ntI}}{\tilde{g}(x_{nt})}\right)} + \epsilon]
where \tilde{g}(x_{nt}) = \left(\prod_{i=q+1}^{p} x_{nti}\right)^{\frac{1}{I-q}}
is the geometric mean of the
non-zero elements of x_{nt}
. When \varepsilon = 0
, \text{mclr}_0
corresponds to the centered
log-ratio transform applied to non-zero proportions only. When \varepsilon > 0
, \text{mclr}_\varepsilon
applies a positive shift to all non-zero compositions. To make all non-zero values strictly positive, by
default \varepsilon = 0.1
. The MCLR transformation is invariant to the addition of extra zero components,
preserves the original zero measurements, and is overal rank preserving. For more details, see Yoon et al. (2019).
Value
A data matrix of the same size as Z
after the modified centered log-ratio transformation.
References
Yoon, Grace, Irina Gaynanova, and Christian L. Müller. "Microbial networks in SPRING-Semi-parametric rank-based correlation and partial correlation estimation for quantitative microbiome data." Frontiers in Genetics 10 (2019).
Examples
# Example compositional data matrix
Z <- matrix(c(1, 2, 0, 4, 5, 6, 0, 8, 9), nrow = 3, byrow = TRUE)
transformed_Z <- mclr.transform(Z, base = 10, eps = 0.1)
Plot Method for DataOppCorVis Objects
Description
This function prints correlation curve plots stored in a DataOppCorVis
object.
Users can specify a group and optionally a specific feature within that group to plot.
Usage
## S3 method for class 'DataOppCorVis'
plot(x, group, feature, ...)
Arguments
x |
A |
group |
A character string specifying the group to plot. Must be provided. |
feature |
A character string specifying the feature within the group to plot. If omitted, all features within the specified group will be plotted. |
... |
Additional arguments |
Value
The input DataOppCorVis
visualization object.
Plot Method for DataRFClassifier Objects
Description
This function plots the margin scores and cross-validation curve of a
DataRFClassifier
object.
Usage
## S3 method for class 'DataRFClassifier'
plot(x, ...)
Arguments
x |
A |
... |
Additional arguments. |
Value
The input DataRFClassifier
object, return invisibly.
Plot Method for MicrobTiSDA MSER Visualization Objects
Description
This function prints feature cluster plots stored in a
MicrobTiSDA.MSERvisual
object. Users can specify which groups and
which clusters within each group to plot. If groups
or clusters
is NULL
, all groups or clusters are plotted.
Usage
## S3 method for class 'MicrobTiSDA.MSERvisual'
plot(x, groups = NULL, clusters = NULL, ...)
Arguments
x |
A |
groups |
A character vector specifying the names of groups to plot.
If |
clusters |
An integer vector specifying which clusters to plot within
each selected group. If |
... |
Additional arguments |
Value
The input MicrobTiSDA.MSERvisual
object, return invisibly.
Plot Method for MicrobTiSDA Cluster Objects
Description
This function prints cluster plots stored in a MicrobTiSDA.cluster
object.
Users can specify which groups to plot; if groups
is NULL
,
all available cluster plots will be displayed.
Usage
## S3 method for class 'MicrobTiSDA.cluster'
plot(x, groups = NULL, ...)
Arguments
x |
A |
groups |
A character vector specifying the names of groups to plot.
If |
... |
Additional arguments (currently not used) for compatibility with
generic |
Value
The input MicrobTiSDA.cluster
object, return invisibly.
Plot Method for MicrobTiSDA Cluster Cut Objects
Description
This function prints cluster plots stored in a MicrobTiSDA.clusterCut
object.
Users can specify which groups to plot; if groups
is NULL
,
all available cluster plots will be displayed.
Usage
## S3 method for class 'MicrobTiSDA.clusterCut'
plot(x, groups = NULL, ...)
Arguments
x |
x A |
groups |
A character vector specifying the names of groups to plot.
If |
... |
Additional arguments (currently not used) for compatibility with
generic |
Value
The input MicrobTiSDA.clusterCut
object, return invisibly.
Plot Method for MicrobTiSDA Visual Objects
Description
This function prints feature cluster plots stored in a MicrobTiSDA.visual
object.
Users can specify which groups and which clusters within each group to plot.
If groups
is NULL
, all groups will be displayed. If clusters
is NULL
, all clusters within the selected groups will be plotted.
Usage
## S3 method for class 'MicrobTiSDA.visual'
plot(x, groups = NULL, clusters = NULL, ...)
Arguments
x |
A |
groups |
A character vector specifying the names of groups to plot.
If |
clusters |
An integer vector specifying which clusters to plot within
each selected group. If |
... |
Additional arguments (currently not used) for compatibility with
generic |
Value
The input MicrobTiSDA.visual
object, return invisibly.
Plot Method for RfBiomarker Objects
Description
This function prints the cross-validation figure of a RfBiomarker
object
and displays basic information about the object contents.
Usage
## S3 method for class 'RfBiomarker'
plot(x, ...)
Arguments
x |
A |
... |
Additional arguments |
Value
The input RfBiomarker
object, return invisibly.
Plot Method for microbTiSDA Dynamic Visualization Objects
Description
This function iterates over all visualization elements stored in a
microbTiSDA_dynamic_vis
object and prints each plot with its name.
Usage
## S3 method for class 'microbTiSDA_dynamic_vis'
plot(x, groups = NULL, ...)
Arguments
x |
A |
groups |
A character vector specifying the names of groups to plot.
if |
... |
Additional arguments |
Value
The input microbTiSDA_dynamic_vis
object, returned invisibly.
Example dataset 3 - The preterm infant gut microbiota dataset.
Description
A count table used as an example for MicrobTiSDA functions.
Usage
preterm.data
Format
A data frame with 202 rows (ASVs) and 51 columns (samples).
Examples
data("preterm.data")
head("preterm.data")
Example dataset 3 - sample metadata.
Description
Metadata corresponding to the samples in counts of the in vitro aquatic microbiota dataset.
Usage
preterm.meta
Format
A data frame with 51 rows (samples) and 36 columns (variables).
Examples
data("preterm.meta")
head("preterm.meta")
Example dataset 3 - taxonomic information.
Description
Taxonomic classification for the ASVs in the in vitro aquatic microbiota dataset.
Usage
preterm.taxa
Format
A data frame with 923 rows (ASVs) and 7 columns (taxonomy levels).
Examples
data("preterm.taxa")
head("preterm.taxa")
Information of Microbial features with opposite temporal shapes to users selected feature
Description
Information of Microbial features with opposite temporal shapes to users selected feature
Usage
## S3 method for class 'DataOppCorVis'
print(x, ...)
Arguments
x |
An object of class DataOppCorVis |
... |
Additional arguments |
Value
The input DataOppCorVis
object, returned invisibly.
Information of the constructed Random Forest classification model
Description
Information of the constructed Random Forest classification model
Usage
## S3 method for class 'DataRFClassifier'
print(x, ...)
Arguments
x |
An object of class DataRFClassifier |
... |
Additional arguments |
Value
The input DataRFClassifier
object, returned invisibly.
Print information of the Design matrix
Description
Print information of the Design matrix
Usage
## S3 method for class 'Design'
print(x, ...)
Arguments
x |
An object of class Design. |
... |
Additional arguments |
Value
The input Design
object, returned invisibly.
Print Method for FilteredData Object
Description
Print Method for FilteredData Object
Usage
## S3 method for class 'FilteredData'
print(x, ...)
Arguments
x |
A |
... |
Additional arguments |
Value
The input FilteredData
object, returned invisibly.
Information of visualizations of feature temporal patterns fitted with mixed-effect spline regression models
Description
Information of visualizations of feature temporal patterns fitted with mixed-effect spline regression models
Usage
## S3 method for class 'MicrobTiSDA.MSERvisual'
print(x, ...)
Arguments
x |
An object of class MicrobTiSDA.MESRvisual |
... |
Additional arguments |
Value
The input MicrobTiSDA.MSERvisual
object, returned invisibly.
Print the information of fitted temporal profile clusters
Description
Print the information of fitted temporal profile clusters
Usage
## S3 method for class 'MicrobTiSDA.cluster'
print(x, ...)
Arguments
x |
An object of class MicrobTiSDA.cluster |
... |
Additional arguments |
Value
The input MicrobTiSDA.cluster
object, returned invisibly.
Print the plot information of user selected feature clustering figures
Description
Print the plot information of user selected feature clustering figures
Usage
## S3 method for class 'MicrobTiSDA.clusterCut'
print(x, ...)
Arguments
x |
An object of class MicrobTiSDA.clusterCut |
... |
Additional arguments |
Value
The input MicrobTiSDA.clusterCut
object, returned invisibly.
Print MicrobTiSDA.interpolate object
Description
Print MicrobTiSDA.interpolate object
Usage
## S3 method for class 'MicrobTiSDA.interpolate'
print(x, ...)
Arguments
x |
An object of class |
... |
Additional arguments |
Value
The input MicrobTiSDA.interpolate
object, returned invisibly.
Print the information of those clustered microbial features' temporal patterns
Description
Print the information of those clustered microbial features' temporal patterns
Usage
## S3 method for class 'MicrobTiSDA.visual'
print(x, ...)
Arguments
x |
An object of class MicrobTiSDA.visual |
... |
Additional arguments |
Value
The input MicrobTiSDA.visual
object, returned invisibly.
Print information of fitted natural spline regression models
Description
Print information of fitted natural spline regression models
Usage
## S3 method for class 'MicrobTiSDA_spline_regression'
print(x, ...)
Arguments
x |
An object of class MicrobTiSDA_spline_regression |
... |
Additional arguments |
Value
The input MicrobTiSDA_spline_regression
object, returned invisibly.
Print the information of fitted regression model predicted data
Description
Print the information of fitted regression model predicted data
Usage
## S3 method for class 'PredictedData'
print(x, ...)
Arguments
x |
An object of class PredictedData |
... |
Additional arguments |
Value
The input PredictedData
object, returned invisibly.
Print method for PredictedDataMESR objects
Description
Print method for PredictedDataMESR objects
Usage
## S3 method for class 'PredictedDataMESR'
print(x, ...)
Arguments
x |
An object of class PredictedDataMESR. |
... |
Additional arguments. |
Value
The input PredictedDataMESR
object, returned invisibly.
Information of the fitted mixed-effect spline regression models
Description
Information of the fitted mixed-effect spline regression models
Usage
## S3 method for class 'RegMESR'
print(x, ...)
Arguments
x |
An object of class RegMESR |
... |
Additional arguments |
Value
The input RegMESR
object, returned invisibly.
Print Method for TransformedData Object
Description
Print Method for TransformedData Object
Usage
## S3 method for class 'TransformedData'
print(x, ...)
Arguments
x |
A |
... |
Additional arguments |
Value
No return value, called for its side effect of printing a summary to the console.
Print MicrobTiSDA objects
Description
Print MicrobTiSDA objects
Usage
## S3 method for class 'microbTiSDA'
print(x, ...)
Arguments
x |
An object of class |
... |
Additional arguments |
Value
The input microbTiSDA
object, returned invisibly.
Print information of species interaction dynamic visualizations
Description
Print information of species interaction dynamic visualizations
Usage
## S3 method for class 'microbTiSDA_dynamic_vis'
print(x, ...)
Arguments
x |
An object of class |
... |
Additional arguments |
Value
The input microbTiSDA_dynamic_vis
object, returned invisibly.
Summary Method for Design Objects
Description
Summary Method for Design Objects
Usage
## S3 method for class 'Design'
summary(object, ...)
Arguments
object |
An object of class |
... |
Additional arguments |
Value
A summary list of the designed matrix.
Summary Method for FilteredData Object
Description
Summary Method for FilteredData Object
Usage
## S3 method for class 'FilteredData'
summary(object, ...)
Arguments
object |
A |
... |
Additional arguments |
Value
A summary list containing dimensions and filtering parameters
Summary Method for MicrobTiSDA Cluster Objects
Description
Provides a summary of clustering results for objecs of class MicrobTiSDA.cluster
. It reports the number of OTUs for each group, or
indicates if group was excluded.
Usage
## S3 method for class 'MicrobTiSDA.cluster'
summary(object, ...)
Arguments
object |
An object of class |
... |
Additional arguments. |
Value
A summary list of the clustered objects.
Summary Method for MicrobTiSDA ClusterCut OBjects
Description
Provides a summary of cluster cutting results for objects of class MicrobTiSDA.clusterCut
. It reports the number of clusters for
each group, or indicates if no valid clustering was performend.
Usage
## S3 method for class 'MicrobTiSDA.clusterCut'
summary(object, ...)
Arguments
object |
An object of class |
... |
Additional arguments |
Value
A summary list of the clustered objects.
Summary method for MicrobTiSDA.interpolate objects
Description
Provides a structure summary of a MicrobTiSDA.interpolate
object, including its parameters and results.
It prints object class, the information of interpolated microbial count data and updated metadata.
Usage
## S3 method for class 'MicrobTiSDA.interpolate'
summary(object, ...)
Arguments
object |
A |
... |
Additional arguments |
Value
A summary list of the interpolated data.
Summary Method for MicrobTiSDA Visual Objects
Description
Provides a summary of visualization results for objects of class MicrobTiSDA.visual
. It reports, for each group, the number of cluster-speficif
visualization plot available.
Usage
## S3 method for class 'MicrobTiSDA.visual'
summary(object, ...)
Arguments
object |
An object of class |
... |
Additional arguments |
Value
A list of ggplot2 objects.
Summary Method for MicrobTiSDA Spline Regression Objects
Description
Provides a concise summary for objects of class MicrobTiSDA_spline_regression
. Prints the names of independent
variables and shows an example of the first fitted microbial feature.
Usage
## S3 method for class 'MicrobTiSDA_spline_regression'
summary(object, ...)
Arguments
object |
An object of class |
... |
Additional arguments. |
Value
A summary list of the fitted natural spline regression models.
Summary Method for PredictedData Objects
Description
Provides a concise summary of a PredictedData
object,
including number of groups, number of time points, number of features,
and the time range for each group.
Usage
## S3 method for class 'PredictedData'
summary(object, ...)
Arguments
object |
A |
... |
Additional arguments. |
Value
A summary list of the model predicted data.
Summary method for PredictedDataMESR objects
Description
Summary method for PredictedDataMESR objects
Usage
## S3 method for class 'PredictedDataMESR'
summary(object, ...)
Arguments
object |
An object of class PredictedDataMESR. |
... |
Additional arguments. |
Value
A summary list of the model predicted data.
Summary method for RegMESR objects
Description
Provides a concise summary of the results from a Regularized Multivariate Exponential Spline Regression (RegMESR) analysis, including the number of groups fitted, the total number of fitted models, and the model parameters.
Usage
## S3 method for class 'RegMESR'
summary(object, ...)
Arguments
object |
An object of class |
... |
Additional arguments |
Value
A summary list of the fitted mixed-effect natural spline regression models.
Summary method for RfBiomarker objects
Description
Provides a concise summary of the results from a Random Forest biomarker analysis, including the number of selected OTUs, their names, prediction results on training and test sets, and information about cross-validation.
Usage
## S3 method for class 'RfBiomarker'
summary(object, ...)
Arguments
object |
An object of class |
... |
Additional arguments |
Value
Summary information of the constructed random forest classification model.
Summary Method for TransformedData Objects
Description
Summary Method for TransformedData Objects
Usage
## S3 method for class 'TransformedData'
summary(object, ...)
Arguments
object |
A |
... |
Additional arguments. |
Value
A summary list of the transformed data.
Summary method for MicrobTiSDA objects
Provides a structure summary of a microbTiSDA
object, including its parameters and results. It prints object class,
parameters settings, number of result groups/items, and previews of tabular of list-based result elements.
Description
Summary method for MicrobTiSDA objects
Provides a structure summary of a microbTiSDA
object, including its parameters and results. It prints object class,
parameters settings, number of result groups/items, and previews of tabular of list-based result elements.
Usage
## S3 method for class 'microbTiSDA'
summary(object, max_preview = 5, ...)
Arguments
object |
A |
max_preview |
Integer. Maximum number of rows and columns to preview for tabular results. Default is |
... |
Additional arguments |
Value
A summary list.
Summary Method for MicrobTiSDA Dynamic Interaction Visualization Objects
Description
Provides a concise summary of the dynamic interaction visualization results stored in a 'microbTiSDA_dynamic_vis' object. The summary includes the number of nodes and edges in each group.
Usage
## S3 method for class 'microbTiSDA_dynamic_vis'
summary(object, ...)
Arguments
object |
An object of class |
... |
Additional arguments. |
Value
A summary list of the dynamic visualization for interspecies interactions.