This vignette demonstrates how to use the searchAnalyzeR package to conduct a comprehensive analysis of systematic review search strategies using real PubMed data. We’ll walk through a complete workflow for analyzing search performance, from executing searches to generating publication-ready reports.
The searchAnalyzeR package provides tools for search execution, result standardization, duplicate detection, performance metric calculation, visualization, and reporting.
For this demonstration, we’ll analyze a search strategy designed to identify literature on the long-term effects of COVID-19, commonly known as “Long COVID.” This topic represents a rapidly evolving area of research that presents typical challenges faced in systematic reviews.
First, let’s load the required packages for our analysis:
# Load required packages
library(searchAnalyzeR)
library(rentrez) # For PubMed API access
library(xml2) # For XML parsing
library(dplyr)
library(ggplot2)
library(lubridate)
cat("=== searchAnalyzeR: Real PubMed Search Example ===\n")
#> === searchAnalyzeR: Real PubMed Search Example ===
cat("Topic: Long-term effects of COVID-19 (Long COVID)\n")
#> Topic: Long-term effects of COVID-19 (Long COVID)
cat("Objective: Demonstrate search strategy analysis with real data\n\n")
#> Objective: Demonstrate search strategy analysis with real data
A well-defined search strategy is crucial for systematic reviews. Here we define our search parameters including terms, databases, date ranges, and filters:
# Define our search strategy
search_strategy <- list(
terms = c(
"long covid",
"post-covid syndrome",
"covid-19 sequelae",
"post-acute covid-19",
"persistent covid symptoms"
),
databases = c("PubMed"),
date_range = as.Date(c("2020-01-01", "2024-12-31")),
filters = list(
language = "English",
article_types = c("Journal Article", "Review", "Clinical Trial")
),
search_date = Sys.time()
)
cat("Search Strategy:\n")
#> Search Strategy:
cat("Terms:", paste(search_strategy$terms, collapse = " OR "), "\n")
#> Terms: long covid OR post-covid syndrome OR covid-19 sequelae OR post-acute covid-19 OR persistent covid symptoms
cat("Date range:", paste(search_strategy$date_range, collapse = " to "), "\n\n")
#> Date range: 2020-01-01 to 2024-12-31
The search strategy includes five Long COVID term variants combined with OR, a publication date range of 2020-01-01 to 2024-12-31, an English-language filter, and article-type filters for journal articles, reviews, and clinical trials.
The searchAnalyzeR package provides convenient functions to search PubMed and retrieve article metadata:
# Execute the search using the package function
cat("Searching PubMed for real articles...\n")
#> Searching PubMed for real articles...
raw_results <- search_pubmed(
search_terms = search_strategy$terms,
max_results = 150,
date_range = search_strategy$date_range,
language = "English"
)
#> PubMed Query: ( "long covid"[Title/Abstract] OR "post-covid syndrome"[Title/Abstract] OR "covid-19 sequelae"[Title/Abstract] OR "post-acute covid-19"[Title/Abstract] OR "persistent covid symptoms"[Title/Abstract] ) AND ("2020/01/01"[Date - Publication] : "2024/12/31"[Date - Publication]) AND English [Language]
#> Found 150 articles
#> Retrieving batch 1 of 3
#> Retrieving batch 2 of 3
#> Retrieving batch 3 of 3
cat("\nRaw search completed. Retrieved", nrow(raw_results), "articles.\n")
#>
#> Raw search completed. Retrieved 150 articles.
Raw search results from different databases often have varying formats. The std_search_results() function standardizes the data structure:
# Standardize the results using searchAnalyzeR functions
cat("\nStandardizing search results...\n")
#>
#> Standardizing search results...
standardized_results <- std_search_results(raw_results, source_format = "pubmed")
This standardization ensures that records from different sources share a consistent structure (column names, types, and identifiers), so downstream functions for deduplication, metric calculation, and export work uniformly.
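Several columns of the standardized data are relied on later in this vignette (id, title, abstract, source, date). A quick structural check after standardization, shown here as an illustrative sketch rather than package output, might look like:

```r
# Sanity-check the standardized structure before further processing.
# These column names are inferred from their use later in this vignette.
expected_cols <- c("id", "title", "abstract", "source", "date")
stopifnot(all(expected_cols %in% names(standardized_results)))
str(standardized_results[, expected_cols])
```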
Duplicate detection is critical in systematic reviews, especially when searching multiple databases. The package provides sophisticated algorithms for identifying duplicates:
# Detect and remove duplicates
cat("Detecting duplicates...\n")
#> Detecting duplicates...
dedup_results <- detect_dupes(standardized_results, method = "exact")
cat("Duplicate detection complete:\n")
#> Duplicate detection complete:
cat("- Total articles:", nrow(dedup_results), "\n")
#> - Total articles: 150
cat("- Unique articles:", sum(!dedup_results$duplicate), "\n")
#> - Unique articles: 150
cat("- Duplicates found:", sum(dedup_results$duplicate), "\n\n")
#> - Duplicates found: 0
The detect_dupes() function offers several matching methods; here we use method = "exact", which flags records that match exactly on key fields.
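As an illustration of what exact matching means (this is conceptual base R, not the package's internal implementation), exact deduplication amounts to flagging later occurrences of a normalized key:

```r
# Conceptual sketch of exact duplicate detection (not package internals):
# normalize a key field, then flag repeat occurrences.
titles <- c("Long COVID outcomes", "long covid outcomes ", "Post-COVID fatigue")
normalized_titles <- trimws(tolower(titles))
duplicated(normalized_titles)
#> [1] FALSE  TRUE FALSE
```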
Basic statistics help assess the overall quality of the search results:
# Calculate search statistics
search_stats <- calc_search_stats(dedup_results)
cat("Search Statistics:\n")
#> Search Statistics:
cat("- Date range:", paste(search_stats$date_range, collapse = " to "), "\n")
#> - Date range: 2024-01-01 to 2025-05-01
cat("- Missing abstracts:", search_stats$missing_abstracts, "\n")
#> - Missing abstracts: 0
cat("- Missing dates:", search_stats$missing_dates, "\n\n")
#> - Missing dates: 0
For performance evaluation, we need a “gold standard” of known relevant articles. In a real systematic review, this would be your manually identified relevant articles. For this demonstration, we create a simplified gold standard:
# Create a gold standard for demonstration
# In a real systematic review, this would be your known relevant articles
# For this example, we'll identify articles that contain key terms in titles
cat("Creating demonstration gold standard...\n")
#> Creating demonstration gold standard...
long_covid_terms <- c("long covid", "post-covid", "post-acute covid", "persistent covid", "covid sequelae")
pattern <- paste(long_covid_terms, collapse = "|")
gold_standard_ids <- dedup_results %>%
filter(!duplicate) %>%
filter(grepl(pattern, tolower(title))) %>%
pull(id)
cat("Gold standard created with", length(gold_standard_ids), "highly relevant articles\n\n")
#> Gold standard created with 83 highly relevant articles
Note: In practice, your gold standard would be created through manual screening by domain experts, reference lists of published reviews on the topic, or previously validated benchmark sets.
The SearchAnalyzer class provides comprehensive tools for evaluating search performance:
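The instantiation of the analyzer is not shown in the rendered output above. A hedged sketch of how it would typically be created follows; the constructor argument names here are assumptions based on the surrounding calls, so consult ?SearchAnalyzer for the actual signature:

```r
# Hypothetical instantiation -- argument names are assumptions, not
# confirmed package API; consult ?SearchAnalyzer for the real signature.
analyzer <- SearchAnalyzer$new(
  search_results = filter(dedup_results, !duplicate),
  gold_standard = gold_standard_ids,
  search_strategy = search_strategy
)
```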
The analyzer calculates a comprehensive set of performance metrics:
# Calculate comprehensive metrics
cat("Calculating performance metrics...\n")
#> Calculating performance metrics...
metrics <- analyzer$calculate_metrics()
# Display key metrics
cat("\n=== SEARCH PERFORMANCE METRICS ===\n")
#>
#> === SEARCH PERFORMANCE METRICS ===
if (!is.null(metrics$precision_recall$precision)) {
cat("Precision:", round(metrics$precision_recall$precision, 3), "\n")
cat("Recall:", round(metrics$precision_recall$recall, 3), "\n")
cat("F1 Score:", round(metrics$precision_recall$f1_score, 3), "\n")
cat("Number Needed to Read:", round(metrics$precision_recall$number_needed_to_read, 1), "\n")
}
#> Precision: 0.553
#> Recall: 1
#> F1 Score: 0.712
#> Number Needed to Read: 1.8
cat("\n=== BASIC METRICS ===\n")
#>
#> === BASIC METRICS ===
cat("Total Records:", metrics$basic$total_records, "\n")
#> Total Records: 150
cat("Unique Records:", metrics$basic$unique_records, "\n")
#> Unique Records: 150
cat("Duplicates:", metrics$basic$duplicates, "\n")
#> Duplicates: 0
cat("Sources:", metrics$basic$sources, "\n")
#> Sources: 98
The package generates publication-ready visualizations to assess search performance:
# Generate visualizations
cat("\nGenerating visualizations...\n")
#>
#> Generating visualizations...
# Overview plot
overview_plot <- analyzer$visualize_performance("overview")
print(overview_plot)
# Temporal distribution plot
temporal_plot <- analyzer$visualize_performance("temporal")
print(temporal_plot)
# Precision-recall curve (if gold standard available)
if (length(gold_standard_ids) > 0) {
pr_plot <- analyzer$visualize_performance("precision_recall")
print(pr_plot)
}
The package can generate data for PRISMA flow diagrams, essential for systematic review reporting:
# Generate PRISMA flow diagram data
cat("\nCreating PRISMA flow data...\n")
#>
#> Creating PRISMA flow data...
screening_data <- data.frame(
id = dedup_results$id[!dedup_results$duplicate],
identified = TRUE,
duplicate = FALSE,
title_abstract_screened = TRUE,
full_text_eligible = runif(sum(!dedup_results$duplicate)) > 0.7, # Simulate screening
included = runif(sum(!dedup_results$duplicate)) > 0.85, # Simulate final inclusion
excluded_title_abstract = runif(sum(!dedup_results$duplicate)) > 0.3,
excluded_full_text = runif(sum(!dedup_results$duplicate)) > 0.15
)
# Generate PRISMA diagram
reporter <- PRISMAReporter$new()
prisma_plot <- reporter$generate_prisma_diagram(screening_data)
print(prisma_plot)
The package supports exporting results in various formats commonly used in systematic reviews:
# Export results in multiple formats
cat("\nExporting results...\n")
#>
#> Exporting results...
output_dir <- tempdir()
export_files <- export_results(
search_results = filter(dedup_results, !duplicate),
file_path = file.path(output_dir, "covid_long_term_search"),
formats = c("csv", "xlsx", "ris"),
include_metadata = TRUE
)
cat("Files exported:\n")
#> Files exported:
for (file in export_files) {
cat("-", file, "\n")
}
#> - C:\Users\chaoliu\AppData\Local\Temp\RtmpM59Tem/covid_long_term_search.csv
#> - C:\Users\chaoliu\AppData\Local\Temp\RtmpM59Tem/covid_long_term_search.xlsx
#> - C:\Users\chaoliu\AppData\Local\Temp\RtmpM59Tem/covid_long_term_search.ris
Performance metrics can also be exported for further analysis or reporting:
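No export call is shown in the rendered output for this step. One way to persist the key metrics, sketched here using the field names printed earlier in this vignette, is to flatten them into a data frame and write a CSV:

```r
# Sketch: flatten key precision/recall metrics for reporting.
# Field names follow the metrics object printed earlier; adjust as needed.
metrics_df <- data.frame(
  metric = c("precision", "recall", "f1_score"),
  value = c(
    metrics$precision_recall$precision,
    metrics$precision_recall$recall,
    metrics$precision_recall$f1_score
  )
)
write.csv(metrics_df, file.path(output_dir, "search_metrics.csv"), row.names = FALSE)
```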
For reproducibility, create a complete data package containing all analysis components:
# Create a complete data package
cat("\nCreating comprehensive data package...\n")
#>
#> Creating comprehensive data package...
package_dir <- create_data_package(
search_results = filter(dedup_results, !duplicate),
analysis_results = list(
metrics = metrics,
search_strategy = search_strategy,
screening_data = screening_data
),
output_dir = output_dir,
package_name = "covid_long_term_systematic_review"
)
cat("Data package created at:", package_dir, "\n")
#> Data package created at: C:\Users\chaoliu\AppData\Local\Temp\RtmpM59Tem/covid_long_term_systematic_review
The package includes tools for validating search strategies against established benchmarks:
# Demonstrate benchmark validation (simplified)
cat("\nDemonstrating benchmark validation...\n")
#>
#> Demonstrating benchmark validation...
validator <- BenchmarkValidator$new()
# Add our search as a custom benchmark
validator$add_benchmark(
name = "covid_long_term",
corpus = filter(dedup_results, !duplicate),
relevant_ids = gold_standard_ids
)
# Validate the strategy
validation_results <- validator$validate_strategy(
search_strategy = search_strategy,
benchmark_name = "covid_long_term"
)
cat("Validation Results:\n")
#> Validation Results:
cat("- Precision:", round(validation_results$precision, 3), "\n")
#> - Precision: 0.696
cat("- Recall:", round(validation_results$recall, 3), "\n")
#> - Recall: 0.855
cat("- F1 Score:", round(validation_results$f1_score, 3), "\n")
#> - F1 Score: 0.768
Analyze how well the retrieved abstracts match the search terms:
# Text similarity analysis on abstracts
cat("\nAnalyzing abstract similarity to search terms...\n")
#>
#> Analyzing abstract similarity to search terms...
search_term_text <- paste(search_strategy$terms, collapse = " ")
similarity_scores <- sapply(dedup_results$abstract[!dedup_results$duplicate], function(abstract) {
if (is.na(abstract) || abstract == "") return(0)
calc_text_sim(search_term_text, abstract, method = "jaccard")
})
cat("Average abstract similarity to search terms:", round(mean(similarity_scores, na.rm = TRUE), 3), "\n")
#> Average abstract similarity to search terms: 0.013
cat("Abstracts with high similarity (>0.1):", sum(similarity_scores > 0.1, na.rm = TRUE), "\n")
#> Abstracts with high similarity (>0.1): 0
Evaluate which search terms are most effective:
# Analyze term effectiveness
cat("\nAnalyzing individual term effectiveness...\n")
#>
#> Analyzing individual term effectiveness...
term_analysis <- term_effectiveness(
terms = search_strategy$terms,
search_results = filter(dedup_results, !duplicate),
gold_standard = gold_standard_ids
)
print(term_analysis)
#> Term Effectiveness Analysis
#> ==========================
#> Search Results: 150 articles
#> Gold Standard: 83 relevant articles
#> Fields Analyzed: title, abstract
#>
#> term articles_with_term relevant_with_term precision
#> long covid 86 60 0.698
#> post-covid syndrome 5 5 1.000
#> covid-19 sequelae 5 1 0.200
#> post-acute covid-19 15 11 0.733
#> persistent covid symptoms 0 0 0.000
#> coverage
#> 0.723
#> 0.060
#> 0.012
#> 0.133
#> 0.000
# Calculate term effectiveness scores
term_scores <- calc_tes(term_analysis)
cat("\nTerm Effectiveness Scores (TES):\n")
#>
#> Term Effectiveness Scores (TES):
print(term_scores[order(term_scores$tes, decreasing = TRUE), ])
#> Term Effectiveness Analysis
#> ==========================
#> Search Results: 150 articles
#> Gold Standard: 83 relevant articles
#> Fields Analyzed: title, abstract
#>
#> term articles_with_term relevant_with_term precision
#> long covid 86 60 0.698
#> post-acute covid-19 15 11 0.733
#> post-covid syndrome 5 5 1.000
#> covid-19 sequelae 5 1 0.200
#> persistent covid symptoms 0 0 0.000
#> coverage tes
#> 0.723 0.71005917
#> 0.133 0.22448980
#> 0.060 0.11363636
#> 0.012 0.02272727
#> 0.000 0.00000000
# Find top performing terms
top_terms <- find_top_terms(term_analysis, n = 3, plot = TRUE, plot_type = "precision_coverage")
cat("\nTop 3 performing terms:", paste(top_terms$terms, collapse = ", "), "\n")
#>
#> Top 3 performing terms: long covid, post-acute covid-19, post-covid syndrome
if (!is.null(top_terms$plot)) {
print(top_terms$plot)
}
Based on the calculated metrics, the package can provide automated recommendations:
# Final summary and recommendations
cat("\n=== FINAL SUMMARY AND RECOMMENDATIONS ===\n")
#>
#> === FINAL SUMMARY AND RECOMMENDATIONS ===
cat("Search Topic: Long-term effects of COVID-19\n")
#> Search Topic: Long-term effects of COVID-19
cat("Articles Retrieved:", sum(!dedup_results$duplicate), "\n")
#> Articles Retrieved: 150
cat("Search Date Range:", paste(search_strategy$date_range, collapse = " to "), "\n")
#> Search Date Range: 2020-01-01 to 2024-12-31
if (!is.null(metrics$precision_recall$precision)) {
cat("Search Precision:", round(metrics$precision_recall$precision, 3), "\n")
if (metrics$precision_recall$precision < 0.1) {
cat("RECOMMENDATION: Low precision suggests search may be too broad. Consider:\n")
cat("- Adding more specific terms\n")
cat("- Using MeSH terms\n")
cat("- Adding study type filters\n")
} else if (metrics$precision_recall$precision > 0.5) {
cat("RECOMMENDATION: High precision suggests good specificity. Consider:\n")
cat("- Broadening search if recall needs improvement\n")
cat("- Adding synonyms or related terms\n")
}
}
#> Search Precision: 0.553
#> RECOMMENDATION: High precision suggests good specificity. Consider:
#> - Broadening search if recall needs improvement
#> - Adding synonyms or related terms
Let’s examine some of the retrieved articles to understand the search results:
# Show some example retrieved articles
cat("\n=== SAMPLE RETRIEVED ARTICLES ===\n")
#>
#> === SAMPLE RETRIEVED ARTICLES ===
sample_articles <- filter(dedup_results, !duplicate) %>%
arrange(desc(date)) %>%
head(3)
for (i in 1:nrow(sample_articles)) {
article <- sample_articles[i, ]
cat("\n", i, ". ", article$title, "\n", sep = "")
cat(" Journal:", article$source, "\n")
cat(" Date:", as.character(article$date), "\n")
cat(" PMID:", gsub("PMID:", "", article$id), "\n")
cat(" Abstract:", substr(article$abstract, 1, 200), "...\n")
}
#>
#> 1. Rates, Risk Factors and Outcomes of Complications After COVID-19 in Children.
#> Journal: PubMed: The Pediatric infectious disease journal
#> Date: 2025-05-01
#> PMID: 40232883
#> Abstract: Coronavirus disease 2019 (COVID-19) can lead to various complications, including multisystem inflammatory syndrome in children (MIS-C) and post-COVID-19 conditions (long COVID). This study aimed to de ...
#>
#> 2. Considerations for Long COVID Rehabilitation in Women.
#> Journal: PubMed: Physical medicine and rehabilitation clinics of North America
#> Date: 2025-05-01
#> PMID: 40210368
#> Abstract: The coronavirus disease 2019 (COVID-19) pandemic has given rise to long COVID, a prolonged manifestation of severe acute respiratory syndrome coronavirus 2 infection, which presents with varied sympto ...
#>
#> 3. Self-Assembly of Human Fibrinogen into Microclot-Mimicking Antifibrinolytic Amyloid Fibrinogen Particles.
#> Journal: PubMed: ACS applied bio materials
#> Date: 2025-01-20
#> PMID: 39723824
#> Abstract: Recent clinical studies have highlighted the presence of microclots in the form of amyloid fibrinogen particles (AFPs) in plasma samples from Long COVID patients. However, the clinical significance of ...
This vignette demonstrated a complete workflow using the searchAnalyzeR package:
cat("\n=== ANALYSIS COMPLETE ===\n")
#>
#> === ANALYSIS COMPLETE ===
cat("This example demonstrated:\n")
#> This example demonstrated:
cat("1. Real PubMed search execution using search_pubmed()\n")
#> 1. Real PubMed search execution using search_pubmed()
cat("2. Data standardization and deduplication\n")
#> 2. Data standardization and deduplication
cat("3. Performance metric calculation\n")
#> 3. Performance metric calculation
cat("4. Visualization generation\n")
#> 4. Visualization generation
cat("5. Multi-format export capabilities\n")
#> 5. Multi-format export capabilities
cat("6. PRISMA diagram creation\n")
#> 6. PRISMA diagram creation
cat("7. Benchmark validation\n")
#> 7. Benchmark validation
cat("8. Term effectiveness analysis\n")
#> 8. Term effectiveness analysis
cat("9. Comprehensive reporting\n")
#> 9. Comprehensive reporting
cat("\nAll outputs saved to:", output_dir, "\n")
#>
#> All outputs saved to: C:\Users\chaoliu\AppData\Local\Temp\RtmpM59Tem
The analysis generates numerous output files for different purposes:
# Clean up and provide final file locations
list.files(output_dir, pattern = "covid", full.names = TRUE, recursive = TRUE)
#> [1] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/covid_long_term_search.csv"
#> [2] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/covid_long_term_search.ris"
#> [3] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/covid_long_term_search.xlsx"
#> [4] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/strategy_A_broad_covid.csv"
#> [5] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/strategy_A_broad_covid.xlsx"
#> [6] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/strategy_B_targeted_covid.csv"
#> [7] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/strategy_B_targeted_covid.xlsx"
After completing this analysis, typical next steps would include title/abstract screening of the retrieved articles, refining the search strategy based on the term effectiveness results, and extending the search to additional databases.
For more advanced features and customization options, consult the package documentation: help(package = "searchAnalyzeR"), ?search_pubmed, ?SearchAnalyzer, and related help topics.
This comprehensive workflow demonstrates how
searchAnalyzeR
can streamline and enhance the systematic
review search process, providing objective metrics and visualizations to
support evidence-based search strategy development and optimization.