| Type: | Package | 
| Title: | Poisson Hurdle Clustering for Sparse Microbiome Data | 
| Version: | 0.1.0 | 
| Author: | Zhili Qiao | 
| Maintainer: | Zhili Qiao <zlqiao@iastate.edu> | 
| Description: | Clustering analysis for sparse microbiome data, based on a Poisson hurdle model. | 
| License: | GPL-3 | 
| Encoding: | UTF-8 | 
| LazyData: | true | 
| RoxygenNote: | 7.1.2 | 
| Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) | 
| VignetteBuilder: | knitr | 
| Depends: | R (≥ 2.10) | 
| Config/testthat/edition: | 3 | 
| NeedsCompilation: | no | 
| Packaged: | 2022-02-07 19:38:00 UTC; zlqia | 
| Repository: | CRAN | 
| Date/Publication: | 2022-02-08 16:20:11 UTC | 
Calculate optimal number of clusters.
Description
This function estimates the optimal number of clusters for a given dataset.
Usage
Hybrid(data, absolute = FALSE, Kstart = NULL, Treatment)
Arguments
| data | Data matrix with dimension N*P indicating N features and P samples. | 
| absolute | Logical. Whether to use absolute (TRUE) or relative (FALSE) abundance of features to determine clusters. | 
| Kstart | Positive integer. The number of clusters used to start the hybrid merging algorithm. Should be relatively large, so that Kstart exceeds the optimal number of clusters. Uses max(50, sqrt(N)) by default. | 
| Treatment | Vector of length P indicating the treatment group of each sample. For example, Treatment = c(1,1,2,2,3,3) indicates 3 treatment groups, each with 2 replicates. | 
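The arguments above can be prepared in base R before calling the function. The following is a minimal sketch, not part of the package: the toy matrix `counts` and the variable names are illustrative assumptions, and rounding the default Kstart up to an integer is also an assumption, since the documentation only gives max(50, sqrt(N)).

```r
# Toy count matrix: N = 100 features (rows), P = 6 samples (columns).
# `counts` is a made-up stand-in for real data.
set.seed(1)
counts <- matrix(rpois(100 * 6, lambda = 2), nrow = 100, ncol = 6)

N <- nrow(counts)
P <- ncol(counts)

# Default starting number of clusters as documented: max(50, sqrt(N)).
# Rounding up is an assumption, since sqrt(N) need not be an integer.
kstart_default <- ceiling(max(50, sqrt(N)))

# Three treatment groups with two replicates each, one entry per column.
treatment <- rep(1:3, each = 2)

# The Treatment vector must have exactly one entry per sample.
stopifnot(length(treatment) == P)
```

With these objects in place, a call would presumably look like `Hybrid(counts, Treatment = treatment)`, following the Usage line above.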
Value
A positive integer indicating the optimal number of clusters.
Examples
######## Run the following code in order:
##
## This is a sample data set which has 100 features, and 4 treatment groups with 4 replicates each.
data('sample_data')
head(sample_data)
set.seed(1)
##
## Finding the optimal number of clusters
K <- Hybrid(sample_data, Kstart = 4, Treatment = rep(c(1,2,3,4), each = 4))
##
## Clustering result from EM algorithm
result <- PHcluster(sample_data, rep(c(1,2,3,4), each = 4), K, method = 'EM', nstart = 1)
print(result$cluster)
##
## Plot the feature abundance level for each cluster
plot_abundance(result, sample_data, Treatment = rep(c(1,2,3,4), each = 4))
Poisson hurdle clustering
Description
This function gives the clustering result based on a Poisson hurdle model.
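For readers unfamiliar with hurdle models, the sketch below shows the log-probability of a count under a generic Poisson hurdle model: zeros occur with probability p0, and positive counts follow a zero-truncated Poisson. The helper `dhurdle_log` and its parameter names are hypothetical. The exact parameterization used inside the package is not stated in this manual and may differ.

```r
# Log-probability of a count y under a generic Poisson hurdle model.
#   p0     : probability of observing a zero count (the "hurdle")
#   lambda : rate of the zero-truncated Poisson for positive counts
# Hypothetical helper for illustration; not a function of the package.
dhurdle_log <- function(y, p0, lambda) {
  # Positive counts: (1 - p0) * dpois(y, lambda) / (1 - exp(-lambda))
  log_pos <- log1p(-p0) + dpois(y, lambda, log = TRUE) -
    log1p(-exp(-lambda))
  ifelse(y == 0, log(p0), log_pos)
}

# Sanity check: the probabilities over all counts sum to one
# (up to a negligible truncation error at y = 200).
total <- sum(exp(dhurdle_log(0:200, p0 = 0.6, lambda = 3)))
```

Separating the zero probability from the positive part is what makes the hurdle model suitable for sparse count data, where zeros are far more frequent than a plain Poisson would predict.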
Usage
PHcluster(
  data,
  Treatment,
  nK,
  method = c("EM", "SA"),
  absolute = FALSE,
  cool = 0.9,
  nstart = 1
)
Arguments
| data | Data matrix with dimension N*P indicating N features and P samples. Clustering is performed feature-wise. | 
| Treatment | Vector of length P indicating the treatment group of each sample. For example, Treatment = c(1,1,2,2,3,3) indicates 3 treatment groups, each with 2 replicates. | 
| nK | Positive integer. Number of clusters. | 
| method | Algorithm used for clustering. Either "EM" (Expectation Maximization) or "SA" (Simulated Annealing). | 
| absolute | Logical. Whether to use absolute (TRUE) or relative (FALSE) abundance of features to determine clusters. | 
| cool | Real number in the interval (0, 1). Cooling rate for the "SA" algorithm. Uses 0.9 by default. | 
| nstart | Positive integer. Number of starts for the entire algorithm. Computation time grows linearly with nstart. Uses 1 by default. | 
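To give a feel for what a cooling rate does, here is a sketch of a geometric temperature schedule paired with a Metropolis-style acceptance rule. Both are common choices in simulated annealing, but they are assumptions here: the manual does not say which schedule or acceptance rule the "SA" method uses. The names `temp0` and `accept_prob` are hypothetical.

```r
# Assumed geometric cooling: temperature is multiplied by `cool` each step.
cool  <- 0.9   # documented default cooling rate
temp0 <- 1     # assumed initial temperature (not from the package)
temps <- temp0 * cool^(0:9)

# Assumed Metropolis rule: probability of accepting a move that
# decreases the log-likelihood by `delta` at temperature `temp`.
accept_prob <- function(delta, temp) pmin(1, exp(-delta / temp))

# Worse moves become less likely to be accepted as the system cools.
p_early <- accept_prob(0.5, temps[1])
p_late  <- accept_prob(0.5, temps[10])
```

Under this assumed schedule, a value of cool closer to 1 cools more slowly and explores more before settling, at the cost of more iterations.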
Value
| cluster | Vector of length N consisting of integers from 1 to nK, giving the final cluster label of each feature. To evaluate the clustering result, consider the Normalized Mutual Information (NMI). | 
| prob | N*nK matrix. The (i, j)th element is the probability that feature i belongs to cluster j. | 
| log_l | Scalar. The Poisson hurdle log-likelihood of the final clustering result. | 
| alpha | Vector of length N. The geometric mean abundance level for each feature, across all treatment groups. | 
| Normalizer | Vector of length P. The normalizing constant of sequencing depth for each sample. | 
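The sketch below illustrates two ways the returned components might be used: turning a probability matrix into hard labels, and scoring labels against a known grouping with NMI. It uses toy values rather than package output. The helper `nmi` is hypothetical, and its normalization by sqrt(H(a) * H(b)) is only one of several conventions. Whether the returned cluster vector always equals the row-wise maximum of prob is an assumption.

```r
# Toy probability matrix: 4 features (rows), 2 clusters (columns).
prob <- matrix(c(0.9, 0.1,
                 0.2, 0.8,
                 0.7, 0.3,
                 0.4, 0.6), ncol = 2, byrow = TRUE)

# Hard assignment: the most probable cluster for each feature.
hard <- max.col(prob, ties.method = "first")

# Normalized Mutual Information between two label vectors,
# normalized by sqrt(H(a) * H(b)). Hypothetical helper; it returns
# NaN if either labelling has a single cluster (zero entropy).
nmi <- function(a, b) {
  joint <- table(a, b) / length(a)   # joint label distribution
  pa <- rowSums(joint)               # marginal of a
  pb <- colSums(joint)               # marginal of b
  nz <- joint > 0                    # skip empty cells to avoid log(0)
  mi <- sum(joint[nz] * log(joint[nz] / outer(pa, pb)[nz]))
  ha <- -sum(pa * log(pa))
  hb <- -sum(pb * log(pb))
  mi / sqrt(ha * hb)
}

truth <- c(1, 2, 1, 2)
score <- nmi(hard, truth)
```

NMI is invariant to relabelling of clusters, so it can compare a clustering against a reference even when the cluster numbers do not match.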
Examples
######## Run the following code in order:
##
## This is a sample data set which has 100 features, and 4 treatment groups with 4 replicates each.
data('sample_data')
head(sample_data)
set.seed(1)
##
## Finding the optimal number of clusters
K <- Hybrid(sample_data, Kstart = 4, Treatment = rep(c(1,2,3,4), each = 4))
##
## Clustering result from EM algorithm
result <- PHcluster(sample_data, rep(c(1,2,3,4), each = 4), K, method = 'EM', nstart = 1)
print(result$cluster)
##
## Plot the feature abundance level for each cluster
plot_abundance(result, sample_data, Treatment = rep(c(1,2,3,4), each = 4))
Plot of feature abundance level
Description
This function plots the feature abundance level for each cluster, after extracting the effect of sample-wise normalization factors and feature-wise geometric mean.
Usage
plot_abundance(result, data, Treatment)
Arguments
| result | Clustering result from function PHcluster(). | 
| data | Data matrix with dimension N*P indicating N features and P samples. | 
| Treatment | Vector of length P indicating the treatment group of each sample. For example, Treatment = c(1,1,2,2,3,3) indicates 3 treatment groups, each with 2 replicates. | 
Value
A plot for feature abundance level will be shown. No value is returned.
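As a rough illustration of what removing sample-wise and feature-wise effects can mean, the sketch below divides a toy count matrix by per-sample factors and per-feature levels, then averages replicates within each treatment group. All values and names are made up, and the division-based adjustment is an assumption. The manual does not specify the computation that plot_abundance() performs.

```r
# Toy data: 2 features (rows), 4 samples (columns).
counts <- matrix(c(10, 20, 40, 80,
                    5, 10, 10, 20), nrow = 2, byrow = TRUE)
normalizer <- c(1, 2, 1, 2)   # assumed per-sample depth factors
alpha      <- c(10, 5)        # assumed per-feature abundance levels
treatment  <- c(1, 1, 2, 2)   # two groups with two replicates each

# Divide each column by its sample factor, then each row by its
# feature level (an assumed form of the adjustment).
scaled <- sweep(sweep(counts, 2, normalizer, "/"), 1, alpha, "/")

# Average over replicates within each treatment group.
# Result: one row per feature, one column per treatment group.
group_means <- t(apply(scaled, 1, function(x) tapply(x, treatment, mean)))
```

After this kind of adjustment, features with very different overall abundance can be compared on the same scale, which is what makes per-cluster treatment profiles readable.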
Examples
######## Run the following code in order:
##
## This is a sample data set which has 100 features, and 4 treatment groups with 4 replicates each.
data('sample_data')
head(sample_data)
set.seed(1)
##
## Finding the optimal number of clusters
K <- Hybrid(sample_data, Kstart = 4, Treatment = rep(c(1,2,3,4), each = 4))
##
## Clustering result from EM algorithm
result <- PHcluster(sample_data, rep(c(1,2,3,4), each = 4), K, method = 'EM', nstart = 1)
print(result$cluster)
##
## Plot the feature abundance level for each cluster
plot_abundance(result, sample_data, Treatment = rep(c(1,2,3,4), each = 4))
Sample of sparse microbiome count data
Description
A sample data matrix with 100 features in 2 true clusters, 4 treatment groups with 4 replicates in each group.
Usage
sample_data
Format
The dataset contains 16 columns, indexed as A1 ~ A4, B1 ~ B4, C1 ~ C4, D1 ~ D4 to represent 4 treatment groups.
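Because the column names encode the treatment group, a Treatment vector can be derived from them instead of being typed by hand. The sketch below rebuilds the documented naming pattern on its own, so it runs without the package. The variable names are illustrative.

```r
# Column names following the documented pattern A1 ~ A4, ..., D1 ~ D4.
cols <- paste0(rep(c("A", "B", "C", "D"), each = 4), 1:4)

# The leading letter identifies the treatment group.
group <- substr(cols, 1, 1)

# Integer codes 1..4, equivalent to rep(c(1,2,3,4), each = 4).
treatment <- as.integer(factor(group, levels = unique(group)))
```

With the real dataset loaded, `cols` would presumably be replaced by `colnames(sample_data)`.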
Examples
head(sample_data)