This guide provides a quick introduction to using mLLMCelltype for cell type annotation in single-cell RNA sequencing data. We’ll cover the basic workflow, input data requirements, and a simple example to get you started.
The mLLMCelltype workflow consists of these main steps:
First, load the mLLMCelltype package:
library(mLLMCelltype)
Before using mLLMCelltype, you need to set up API keys for the LLM providers you plan to use:
# Set API keys as environment variables
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key") # For Claude models
Sys.setenv(OPENAI_API_KEY = "your-openai-api-key") # For GPT models
Sys.setenv(GEMINI_API_KEY = "your-gemini-api-key") # For Gemini models
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # For OpenRouter models
You can obtain API keys from:
- Anthropic: https://console.anthropic.com/
- OpenAI: https://platform.openai.com/
- Google (Gemini): https://ai.google.dev/
- OpenRouter: https://openrouter.ai/keys
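Before making any calls, it can help to confirm which keys are actually visible to the R session. The following is a minimal base-R sketch; the helper name `check_api_keys` is hypothetical and not part of mLLMCelltype.

```r
# Hypothetical helper (not part of mLLMCelltype): report which provider
# keys are set as non-empty environment variables.
check_api_keys <- function(vars = c("ANTHROPIC_API_KEY", "OPENAI_API_KEY",
                                    "GEMINI_API_KEY", "OPENROUTER_API_KEY")) {
  # Sys.getenv() returns "" for unset variables; nzchar() flags non-empty ones
  is_set <- nzchar(Sys.getenv(vars, unset = ""))
  names(is_set) <- vars
  is_set
}

key_status <- check_api_keys()
missing_keys <- names(key_status)[!key_status]
if (length(missing_keys) > 0) {
  message("Keys not set: ", paste(missing_keys, collapse = ", "))
}
```

Only the providers you plan to use need a key, so a missing entry is not necessarily a problem.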
Alternatively, you can provide API keys directly in function calls:
results <- annotate_cell_types(
input = markers,
tissue_name = "human PBMC",
model = "claude-3-7-sonnet-20250219",
api_key = "your-anthropic-api-key", # Direct API key
top_gene_count = 10
)
mLLMCelltype accepts marker gene data in several formats:
A data frame with the following columns:
- `cluster`: Cluster ID (must be 0-based)
- `gene`: Gene name/symbol
- `avg_log2FC` or similar metric: Log fold change
- `p_val_adj` or similar metric: Adjusted p-value
Example:
# Example marker data frame
markers_df <- data.frame(
cluster = c(0, 0, 0, 1, 1, 1),
gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005)
)
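The column requirements above can be checked before any API call is made. This is a sketch only: `validate_markers` is a hypothetical helper, and the checks reflect the column list in this guide rather than the package's own validation.

```r
# Hypothetical helper (not part of mLLMCelltype): basic sanity checks
# based on the column requirements listed in this guide.
validate_markers <- function(df) {
  required <- c("cluster", "gene")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  # Cluster IDs may arrive as numeric, character, or factor
  cluster_ids <- suppressWarnings(as.integer(as.character(df$cluster)))
  if (anyNA(cluster_ids)) stop("Cluster IDs must be integers")
  if (min(cluster_ids) != 0) stop("Cluster IDs must be 0-based (start at 0)")
  invisible(TRUE)
}

markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005)
)
validate_markers(markers_df)
```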
You can directly use the output from Seurat’s `FindAllMarkers()` function:
# Assuming you have a Seurat object named 'seurat_obj'
library(Seurat)
all_markers <- FindAllMarkers(seurat_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
A path to a CSV file containing marker gene data:
# Path to your CSV file
markers_file <- "path/to/markers.csv"
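If you do not yet have such a file, one way to produce it is to write a marker data frame to disk. The sketch below assumes the CSV uses the same columns as the data frame format described earlier; this guide does not spell out the expected CSV layout, so treat that as an assumption.

```r
# Assumption: the CSV mirrors the data frame format (cluster, gene, ...)
markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005)
)

# Write to a temporary location without row names
markers_file <- file.path(tempdir(), "markers.csv")
write.csv(markers_df, markers_file, row.names = FALSE)

# Read it back to confirm the columns survived the round trip
markers_from_csv <- read.csv(markers_file, stringsAsFactors = FALSE)
str(markers_from_csv)
```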
A list where each element contains marker genes for a cluster:
# Example marker list
markers_list <- list(
"0" = c("CD3D", "CD3E", "CD2", "IL7R", "LTB"),
"1" = c("CD14", "LYZ", "CST3", "MS4A7", "FCGR3A")
)
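A list in this shape can be derived from a marker data frame with base R. In this sketch, ordering genes by descending `avg_log2FC` is an illustrative choice, not something the package requires.

```r
markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  stringsAsFactors = FALSE
)

# Sort by cluster, then by descending fold change within each cluster
ordered <- markers_df[order(markers_df$cluster, -markers_df$avg_log2FC), ]

# split() returns a named list keyed by cluster ID ("0", "1", ...)
markers_list <- split(ordered$gene, ordered$cluster)
str(markers_list)
```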
The `annotate_cell_types` function has the following parameters:
Parameter | Description | Default Value |
---|---|---|
`input` | Marker gene data (data frame, list, or file path) | (required) |
`tissue_name` | Tissue name (e.g., “human PBMC”, “mouse brain”) | NULL |
`model` | LLM model to use | "gpt-4o" |
`api_key` | API key (if not set in environment) | NA |
`top_gene_count` | Number of top genes per cluster to use | 10 |
`debug` | Whether to print debugging information | FALSE |
Note: If `api_key` is set to `NA`, the function will return the generated prompt without making an API call, which is useful for reviewing the prompt before sending it to the API.
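To get a feel for what `top_gene_count` in the table above controls, the sketch below keeps the top genes per cluster from a marker data frame. Ranking by descending `avg_log2FC` is an assumption made for illustration; the package's internal selection logic may differ, and `top_genes_per_cluster` is a hypothetical helper.

```r
# Hypothetical helper: keep the n highest-ranked genes in each cluster.
# Assumption: genes are ranked by descending avg_log2FC.
top_genes_per_cluster <- function(df, n = 10) {
  by_cluster <- split(df, df$cluster)
  lapply(by_cluster, function(d) {
    d <- d[order(-d$avg_log2FC), ]
    head(d$gene, n)
  })
}

markers <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  stringsAsFactors = FALSE
)

top2 <- top_genes_per_cluster(markers, n = 2)
print(top2)
```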
Here’s a simple example using a single LLM model for annotation:
# Example marker data
markers <- data.frame(
cluster = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
gene = c("CD3D", "CD3E", "CD2", "IL7R", "LTB", "CD14", "LYZ", "CST3", "MS4A7", "FCGR3A"),
avg_log2FC = c(2.5, 2.3, 2.1, 1.8, 1.7, 3.1, 2.8, 2.5, 2.2, 2.0),
p_val_adj = c(0.001, 0.001, 0.002, 0.003, 0.005, 0.0001, 0.0002, 0.0005, 0.001, 0.002)
)
# Run annotation with a single model
results <- annotate_cell_types(
input = markers,
tissue_name = "human PBMC",
model = "claude-3-7-sonnet-20250219",
api_key = Sys.getenv("ANTHROPIC_API_KEY"),
top_gene_count = 10,
debug = FALSE # Set to TRUE for more detailed output
)
# Print results
print(results)
When using a single model like Claude, the output will be a character vector with one annotation per cluster:
> print(results)
[1] "0: T cells" "1: Monocytes"
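For downstream use it is often convenient to split these strings into cluster IDs and labels. The sketch below assumes every element follows the "cluster: label" pattern shown in the printed example; real responses may deviate from it.

```r
# Stand-in for the output printed above
results <- c("0: T cells", "1: Monocytes")

# Everything before the first colon is the cluster ID,
# everything after it is the cell type label
parsed <- data.frame(
  cluster = sub(":.*$", "", results),
  cell_type = trimws(sub("^[^:]*:", "", results)),
  stringsAsFactors = FALSE
)
print(parsed)
```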
For more reliable annotations, you can use multiple models and create a consensus:
# Define models to use
models <- c(
"claude-3-7-sonnet-20250219", # Anthropic
"gpt-4o", # OpenAI
"gemini-1.5-pro" # Google
)
# API keys for different providers
api_keys <- list(
anthropic = Sys.getenv("ANTHROPIC_API_KEY"),
openai = Sys.getenv("OPENAI_API_KEY"),
gemini = Sys.getenv("GEMINI_API_KEY")
)
# Run annotation with multiple models
results <- list()
for (model in models) {
provider <- get_provider(model)
api_key <- api_keys[[provider]]
results[[model]] <- annotate_cell_types(
input = markers,
tissue_name = "human PBMC",
model = model,
api_key = api_key,
top_gene_count = 10
)
}
# Create consensus
consensus_results <- interactive_consensus_annotation(
input = markers,
tissue_name = "human PBMC",
models = models, # Use all the models defined above
api_keys = api_keys,
controversy_threshold = 0.7,
entropy_threshold = 1.0,
consensus_check_model = "claude-3-7-sonnet-20250219"
)
# Print consensus results
print_consensus_summary(consensus_results)
The consensus results contain more detailed information:
> print_consensus_summary(consensus_results)
Consensus Summary:
-----------------
Total clusters: 2
Controversial clusters: 0
Consensus achieved for all clusters
Cluster 0:
Final annotation: T cells
Consensus proportion: 1.0
Entropy: 0.0
Model predictions:
- claude-3-7-sonnet-20250219: T cells
- gpt-4o: T cells
- gemini-1.5-pro: T cells
Cluster 1:
Final annotation: Monocytes
Consensus proportion: 1.0
Entropy: 0.0
Model predictions:
- claude-3-7-sonnet-20250219: Monocytes
- gpt-4o: Monocytes
- gemini-1.5-pro: Monocytes
To add the annotations to your Seurat object:
# Assuming you have a Seurat object named 'seurat_obj' and consensus results
library(Seurat)
# Add consensus annotations to Seurat object
seurat_obj$cell_type_consensus <- plyr::mapvalues(
x = as.character(Idents(seurat_obj)),
from = as.character(0:(length(consensus_results$final_annotations)-1)),
to = consensus_results$final_annotations
)
# Extract consensus metrics from the consensus results
# Note: These metrics are available in the consensus_results$initial_results$consensus_results
consensus_metrics <- lapply(names(consensus_results$initial_results$consensus_results), function(cluster_id) {
metrics <- consensus_results$initial_results$consensus_results[[cluster_id]]
return(list(
cluster = cluster_id,
consensus_proportion = metrics$consensus_proportion,
entropy = metrics$entropy
))
})
# Convert to data frame for easier handling
metrics_df <- do.call(rbind, lapply(consensus_metrics, data.frame))
# Add consensus proportion to Seurat object
# mapvalues() returns a character vector here because x is character, so convert back to numeric
seurat_obj$consensus_proportion <- as.numeric(plyr::mapvalues(
x = as.character(Idents(seurat_obj)),
from = metrics_df$cluster,
to = metrics_df$consensus_proportion
))
# Add entropy to Seurat object (converted to numeric for the same reason)
seurat_obj$entropy <- as.numeric(plyr::mapvalues(
x = as.character(Idents(seurat_obj)),
from = metrics_df$cluster,
to = metrics_df$entropy
))
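The same cluster-to-label mapping can be expressed without plyr by indexing a named vector. This sketch uses a plain character vector as a stand-in for `as.character(Idents(seurat_obj))`, so it runs without a Seurat object; the variable names are illustrative.

```r
# Stand-in for per-cell cluster IDs taken from a Seurat object
cluster_ids <- c("0", "1", "1", "0", "1")

# Named vector: names are cluster IDs, values are annotations
annotation_map <- c("0" = "T cells", "1" = "Monocytes")

# Indexing by name performs the lookup; unmatched IDs would become NA
cell_types <- unname(annotation_map[cluster_ids])
print(cell_types)
```

Because unmatched cluster IDs yield `NA` rather than being passed through unchanged, this variant makes missing annotations easy to spot.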
Here’s a simple visualization of the results using Seurat:
# Plot UMAP with cell type annotations
library(ggplot2) # Provides ggtitle(), theme(), and element_text()
DimPlot(seurat_obj, group.by = "cell_type_consensus", label = TRUE, repel = TRUE) +
ggtitle("Cell Type Annotations") +
theme(plot.title = element_text(hjust = 0.5))
The output of `annotate_cell_types()` is a vector of cell type annotations, where each element corresponds to a cluster.
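The output of `interactive_consensus_annotation()` is a list containing:
- `final_annotations`: Final consensus cell type annotations
- `initial_results`: Initial predictions from each model
- `controversial_clusters`: List of clusters that required discussion
- `discussion_logs`: Detailed logs of the discussion process
- `session_id`: Unique identifier for the annotation session
When using consensus annotation, two key metrics help evaluate the reliability of annotations:
- Consensus proportion: the share of models that agree with the final annotation (1.0 in the example above, where all models agree)
- Entropy: a measure of disagreement among the model predictions (0.0 in the example above, where all models agree)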
Clusters with low consensus proportion or high entropy may require manual review.
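To make the two metrics concrete, the sketch below computes them for a single cluster from a vector of model predictions. It assumes exact string matching of labels and Shannon entropy in bits (log base 2); the package may normalize labels or define the metrics differently. The review thresholds reuse the `controversy_threshold` and `entropy_threshold` values from the earlier example, and how the package applies them is likewise an assumption. Both helper names are hypothetical.

```r
# Hypothetical helper: agreement metrics for one cluster's predictions.
# Assumptions: exact label matching, Shannon entropy with log base 2.
consensus_metrics <- function(predictions) {
  counts <- table(predictions)
  p <- as.numeric(counts) / length(predictions)
  list(
    final_annotation = names(counts)[which.max(counts)],
    consensus_proportion = max(p),
    entropy = -sum(p * log2(p))
  )
}

# Hypothetical helper: flag a cluster for manual review
needs_review <- function(m, min_proportion = 0.7, max_entropy = 1.0) {
  m$consensus_proportion < min_proportion || m$entropy > max_entropy
}

agree <- consensus_metrics(c("T cells", "T cells", "T cells"))
mixed <- consensus_metrics(c("T cells", "NK cells", "T cells"))

needs_review(agree) # all models agree, so no review is flagged
needs_review(mixed) # two of three agree, below the 0.7 proportion
```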
If you don’t have access to paid API keys, you can use OpenRouter’s free models:
# Set OpenRouter API key
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key")
# Use a free model
free_results <- annotate_cell_types(
input = markers,
tissue_name = "human PBMC",
model = "meta-llama/llama-4-maverick:free", # Note the :free suffix
api_key = Sys.getenv("OPENROUTER_API_KEY"),
top_gene_count = 10
)
# Print results
print(free_results)
Available free models include:
- `meta-llama/llama-4-maverick:free` - Meta Llama 4 Maverick (256K context)
- `nvidia/llama-3.1-nemotron-ultra-253b-v1:free` - NVIDIA Nemotron Ultra 253B
- `deepseek/deepseek-chat-v3-0324:free` - DeepSeek Chat v3
- `microsoft/mai-ds-r1:free` - Microsoft MAI-DS-R1
Free models don’t consume credits but may have limitations compared to paid models.
API Key Not Found:
Error: No auth credentials found
Solution: Ensure you’ve set the correct API key environment variable or provided it directly in the function call.
Rate Limiting:
Error: Rate limit exceeded
Solution: Wait a few minutes before trying again, or reduce the number of API calls by processing fewer clusters at once.
Invalid Model Name:
Error: Unsupported model: [model_name]
Solution: Check that you’re using a supported model name and that it’s spelled correctly.
Network Issues:
Error: Could not connect to API
Solution: Check your internet connection and try again. If the problem persists, the API service might be down.
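For transient failures such as rate limits or dropped connections, a generic retry wrapper with exponential backoff can save manual re-runs. This is a base-R sketch; `with_retry` is a hypothetical helper, not a function provided by mLLMCelltype, and the demo uses a stand-in function instead of a real API call.

```r
# Hypothetical helper: call fn(), retrying on error with exponential backoff
with_retry <- function(fn, max_attempts = 3, base_delay = 1) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    # Out of attempts: re-signal the last error
    if (attempt == max_attempts) stop(result)
    delay <- base_delay * 2^(attempt - 1)
    message("Attempt ", attempt, " failed: ", conditionMessage(result),
            "; retrying in ", delay, "s")
    Sys.sleep(delay)
  }
}

# Demo with a stand-in function that fails twice before succeeding
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Rate limit exceeded")
  "ok"
}
out <- with_retry(flaky, max_attempts = 3, base_delay = 0.01)
```

In practice you would wrap the real call in an anonymous function, for example `with_retry(function() annotate_cell_types(input = markers, tissue_name = "human PBMC", model = "gpt-4o", api_key = Sys.getenv("OPENAI_API_KEY")))`, and use a longer `base_delay`.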
Now that you understand the basics of mLLMCelltype, you can explore:
If you encounter any issues, please open an issue on our GitHub repository.