Help for package eDNAfuns

Type:

Package

Title:

Working with Metabarcoding Data in a Tidy Format

Version:

0.1.0

Description:

A series of R functions that come in handy while working with metabarcoding data. The reasoning of doing this is to have the same functions we use all the time stored in a curated, reproducible way. In a way it is all about putting together the grammar of the 'tidyverse' from Wickham et al.(2019) <doi:10.21105/joss.01686> with the functions we have used in community ecology compiled in packages like 'vegan' from Dixon (2003) <doi:10.1111/j.1654-1103.2003.tb02228.x> and 'phyloseq' McMurdie & Holmes (2013) <doi:10.1371/journal.pone.0061217>. The package includes functions to read sequences from FAST(A/Q) into a tibble ('fasta_reader' and 'fastq_reader'), to process 'cutadapt' Martin (2011) <doi:10.14806/ej.17.1.200> 'info-file' output. When it comes to sequence counts across samples, the package works with the long format in mind (a three column 'tibble' with Sample, Sequence and counts ), with functions to move from there to the wider format.

License:

GPL (≥ 3)

Encoding:

UTF-8

LazyData:

true

Suggests:

insect, testthat, tidyverse, knitr, rmarkdown

Imports:

dplyr, googlesheets4, phyloseq, purrr, rlang, tibble, tidyr, vegan, ggplot2, vroom, Biostrings, stringr, tidyselect, readr

Depends:

R (≥ 4.1.0)

RoxygenNote:

7.3.2

NeedsCompilation:

Packaged:

2025-09-24 10:08:22 UTC; RG.5015511

Author:

Ramón Gallego Simón [aut, cre]

Maintainer:

Ramón Gallego Simón <ramongallego@gmail.com>

Repository:

CRAN

Date/Publication:

2025-10-01 07:00:09 UTC

AMPURE

Description

Description: Placeholder for AMPURE dataset.

Usage

AMPURE

Format

A tibble/data.frame with X rows and Y columns:

Column1: Description of column1
Column2: Description of column2

...

Source

Generated by the Author

ASV_table

Description

An abundance table in long format.

Usage

ASV_table

Format

A data frame with 9 rows and 3 variables:

sample_name: Character vector, last digit includes PCR replicate
Hash: The sequence found, hashed using sha1 for consistency
nReads: Number of times that hash was found in that sample

Details

Contains the number of reads from unique ASVs across different samples.

Source

Generated by the Author

Index_PCR

Description

Placeholder description for IndexPCR template web address.

Usage

Index_PCR

Format

A character vector

Source

Generated by the Author, stored in googledrive

OTU_taxonomy

Description

Placeholder description for OTU_taxonomy dataset.

Usage

OTU_taxonomy

Format

A dataframe of DNA sequences names and their taxonomy, with 3 rows and 7 columns

Hash: Sequence identifier
Kingdom: taxon info at that rank
Phylum: taxon info at that rank
Class: taxon info at that rank
Order: taxon info at that rank
Family: taxon info at that rank
Genus: taxon info at that rank

Source

Generated by the Author

UDI_Indices

Description

Placeholder description for UDI_Indices dataset.

Usage

UDI_Indices

Format

A data frame with 384 rows and 6 columns

Index: Character vector
Bases_in_Adapter_i7: Seqs in i7
Bases_for_Sample_Sheet_i7: Seqs to include
Bases_for_Sample_Sheet_i5: Seqs to include in some illumina platoforms
Bases_for_Sample_Sheet_i5_B: Seqs to include in other illumina platoforms
Set: A, B, C, or D

Source

Generated by the Author, data from Illumina

Count stop codons or return translated sequence

Description

Given a DNA sequence (either a 'DNAString' object or a character string), this function translates it using a specified genetic code and counts the number of stop codons ('*') in the resulting amino acid sequence. Alternatively, the translated sequence itself can be returned.

Usage

count_stop_codons(
  sequence = NULL,
  format = "DNAString",
  codon = 1,
  dictionary = 5,
  return = "count"
)

Arguments

sequence

A DNA sequence, either as a [Biostrings::DNAString] object or as a character string.

format

Input format, either '"DNAString"' (default) or '"character"'. If '"character"', the sequence must consist of 'A', 'C', 'G', 'T' only.

codon

Integer giving the starting codon position (usually 1, 2, or 3).

dictionary

Integer specifying which translation code to use. See [Biostrings::GENETIC_CODE_TABLE] for all options. Common examples:

id	name
1	Standard
2	Vertebrate Mitochondrial
3	Yeast Mitochondrial
4	Mold/Protozoan/Coelenterate/Mycoplasma/Spiroplasma Mitochondrial
5	Invertebrate Mitochondrial
6	Ciliate/Dasycladacean/Hexamita Nuclear
9	Echinoderm/Flatworm Mitochondrial
10	Euplotid Nuclear
11	Bacterial, Archaeal, and Plant Plastid
12	Alternative Yeast Nuclear
13	Ascidian Mitochondrial
14	Alternative Flatworm Mitochondrial
15	Blepharisma Macronuclear
16	Chlorophycean Mitochondrial
21	Trematode Mitochondrial
22	Scenedesmus obliquus Mitochondrial
23	Thraustochytrium Mitochondrial
24	Pterobranchia Mitochondrial
25	Candidate Division SR1 and Gracilibacteria
26	Pachysolen tannophilus Nuclear

return

Either '"count"' (default) to return the number of stop codons, or any other value to return the translated amino acid sequence.

Details

The function uses [Biostrings::translate()] with the specified starting position and genetic code. Translation warnings are suppressed.

Value

If 'return = "count"', an integer giving the number of stop codons. Otherwise, a character string with the translated amino acid sequence.

Author(s)

Ramon Gallego, 2021

Examples

if (requireNamespace("Biostrings", quietly = TRUE)) {
  library(Biostrings)
  seq <- DNAString("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
  count_stop_codons(seq, codon = 1, dictionary = 1)
  count_stop_codons(seq, codon = 1, dictionary = 1, return = "translation")
}

Create scaled relative proportions of the number of reads of taxa in different samples

Description

This function takes a long ASV table (tibble) and creates a new column with the relative proportions scaled to their maximum, to avoid the dominance of species with better primer efficiency

Usage

eDNAindex(
  tibble,
  Sample_column = Sample,
  OTU_column = Hash,
  Counts_column = nReads,
  Biological_replicate_column = NULL,
  ...
)

Arguments

tibble

A tibble the ASV table in a long format, with at least three columns, Sample_column, OTU_column, Counts_column

Sample_column

The column indicating the sample.

OTU_column

The column indicating the OTU/ASV.

Counts_column

The column (numeric) with the number of sequences from that OTU in that sample.

Biological_replicate_column

The column representing replicate measurements of Sample_column.

...

Any extra columns that want to be added to the final dataset (either taxonomical information about OTUs, or metadata information about the samples)

Value

A tibble with at least the columns Sample_column, OTU_column and Normalized.reads

Examples

data("training.ASV.table")
eDNAindex(training.ASV.table, Sample_column = Sample_name)

example_hashes

Description

Placeholder description for example_hashes dataset.

Usage

example_hashes

Format

A vector or data frame with X rows/columns

Source

Generated by the Author

Read FASTA or FASTQ files into a tibble

Description

Functions to read sequence files into a tidy data frame with one row per sequence.

Usage

fasta_reader(path_to_fasta)

fastq_reader(path_to_fastq, keepQ = FALSE)

Arguments

path_to_fasta

Character. Path to a FASTA file.

path_to_fastq

Character. Path to a FASTQ file.

keepQ

Logical. If 'TRUE', keep a third column with quality scores when reading FASTQ files. Default is 'FALSE'.

Value

- 'fasta_reader()': A tibble with columns: - 'header': sequence identifiers (without the '>'). - 'seq': nucleotide sequences.

- 'fastq_reader()': A tibble with columns: - 'header': sequence identifiers (without the '@'). - 'seq': nucleotide sequences. - 'Qscores' (optional): quality scores, if 'keepQ = TRUE'.

Examples


fasta_df <- fasta_reader(system.file("extdata", "test.fasta", package="eDNAfuns"))
fastq_df <- fastq_reader(system.file("extdata", "test.fastq", package="eDNAfuns"), keepQ = TRUE)

Read FASTA or FASTQ files into a tibble

Description

Functions to read sequence files into a tidy data frame with one row per sequence.

Usage

fasta_writer(df, sequence, header, file.out)

fastq_writer(df, sequence, header, Qscores, file.out)

Arguments

df

A dataframe where the sequence information is stored, one sequence per row

sequence

The name (unquoted) of the column where the sequence information (as characters) is stored

header

The name (unquoted) of the column where the header information is stored

file.out

Character. Path to the location where to write the file.

Qscores

The name (unquoted) of the column where the Quality information (encoded as characters) is stored

Value

- 'fasta_reader()': A tibble with columns: - 'header': sequence identifiers (without the '>'). - 'seq': nucleotide sequences.

- 'fastq_reader()': A tibble with columns: - 'header': sequence identifiers (without the '@'). - 'seq': nucleotide sequences. - 'Qscores' (optional): quality scores, if 'keepQ = TRUE'.

Examples

fasta_file <- tempfile(fileext = ".fasta")

fasta_df <- fasta_reader(system.file("extdata", "test.fasta", package="eDNAfuns"))

fasta_writer(fasta_df, sequence=seq,
              header = header,
              file.out = fasta_file)
 
fastq_file <- tempfile(fileext = ".fastq")  
           
fastq_df <- fastq_reader(system.file("extdata", "test.fastq", package="eDNAfuns"), keepQ = TRUE)

fastq_writer(fastq_df, sequence=seq,
              header = header,Qscores= Qscores,
              file.out = fastq_file)

metadata

Description

Placeholder description for metadata dataset.

Usage

metadata

Format

A data frame with X rows and Y columns:

sample_name: Name of sample
Treatment: either control or treatment
Day: either 1 or 2

Source

Generated by the Author

molarity.data

Description

Placeholder description for molarity.data dataset.

Usage

molarity.data

Format

A tibble/data.frame with X rows and Y columns

Sample: Name of sample
Molarity: in nM
Amp_len: length of amplicon in bp
mass: in ng/\mul

Source

Generated by the Author

Generate mutated DNA sequences

Description

This function takes DNA sequences and generates mutated variants. Useful for testing classification algorithms on sequences with either PCR-induced or naturally occurring mutations.

Usage

mutation(sequence = NULL, format = "bin", n.mutations = NA, prob.mutation = NA)

Arguments

sequence

A list of DNA sequences, either as "DNAbin" objects or character vectors.

format

Character. Format of the input sequences. "bin" for DNAbin, "char" for character vectors.

n.mutations

Integer. Number of mutations to introduce per sequence. Exclusive with prob.mutation.

prob.mutation

Numeric. Probability of mutation per position. Exclusive with n.mutations.

Value

A list of mutated sequences of the same class as the input.

Examples

data("test_seqs")
mutation(test_seqs, n.mutations = 2)
mutation(test_seqs, prob.mutation = 0.1)
seqs <- fasta_reader(system.file("extdata", "test.fasta", package="eDNAfuns"))
mutation(seqs$seq, format = "char", n.mutations = 1)

functions to translate mass into molarity and vice versa, given we are talking about double stranded DNA it requires two inputs, the mass (or molarity) and the length of the DNA fragment It works with the two most common concentrations used in Molecular Ecology labs ng/`\mu`l for mass nM for molarity

Description

functions to translate mass into molarity and vice versa, given we are talking about double stranded DNA it requires two inputs, the mass (or molarity) and the length of the DNA fragment It works with the two most common concentrations used in Molecular Ecology labs ng/\mul for mass nM for molarity

Usage

ng2nM(ng, length_amplicon)

nM2ng(nM, length_amplicon)

Arguments

ng

Numeric. the concentration in ng per \muL

length_amplicon

Integer. The length of the DNA fragment in base pairs.

nM

Numeric. The concentration in nmoles per litre

Value

Numeric. The equivalent concentration in nmoles per litre

Numeric. The equivalent concentration in ng per \muL

Examples


data("molarity.data")
ng2nM(ng=molarity.data$mass, length_amplicon = molarity.data$Amp_len)


data("molarity.data")
nM2ng(nM=molarity.data$Molarity, length_amplicon = molarity.data$Amp_len)

Plot sequence length distribution

Description

This function takes a tibble produced by 'fasta_reader()' or 'fastq_reader()' and plots the distribution of sequence lengths.

Usage

plot_seq_len_hist(Hash_tibble, length_col, binwidth = 1, label_interval = 5)

Arguments

Hash_tibble

A tibble containing a numeric column with sequence lengths.

length_col

The column of 'Hash_tibble' containing sequence lengths (unquoted).

binwidth

Width of histogram bins. Default = 1.

label_interval

Interval for x-axis labels. Default = 5.

Value

A ggplot object.

Examples

data("example_hashes")
plot_seq_len_hist(example_hashes, seq_len)

ps

Description

Placeholder description for phyloseq object.

Usage

ps

Format

Formal class 'phyloseq' with five slots

Source

Generated by the Author

Read Indexing PCR Spreadsheet

Description

Reads data from a Google Sheets-based indexing PCR template. You can access the template here, create a copy in your Google Drive, and then create copies for each of your multiplexing PCR experiments.

Reads data from a Google Sheets-based PCR template. You can access the template here, create your own copy, and then create one copy per PCR reaction you want to keep track of.

Usage

read_indexing_PCR(ss)

read_step1_PCR(ss, trim = TRUE, name = TRUE)

Arguments

ss

Google Sheet ID or URL of the indexing PCR spreadsheet.

trim

Logical. If TRUE, removes rows where 'Sample' is NA (default: TRUE).

name

Logical. If TRUE, adds the spreadsheet name as a 'PCR' column in the sample sheet (default: TRUE).

Details

The function extracts sample information across all plates and returns a tidy dataframe.

Captures reagent mix, cycling conditions, and sample information.

Value

A tibble with columns:

Well: Well position
Sample: Sample name
Barcode: Barcode assigned to the sample
Set: Barcode set identifier

A named list with three elements:

PCR_mix: tibble of reagents and volumes.
Cycling: tibble of PCR cycling conditions.
Samples: tibble of samples with columns 'Well', 'Sample', 'Success', 'Notes', and optionally 'PCR'.

Examples

# Examples are not executed because the function requires identification in Google 
## Not run: 
data("Index_PCR")
read_indexing_PCR(Index_PCR)

## End(Not run)

# Examples are not executed because the function requires identification in Google 
## Not run: 
data("template_PCR")
read_step1_PCR(template_PCR)

## End(Not run)

Read cutadapt info files

Description

Read the '–info-file' output generated by cutadapt for adapter trimming. Column structure differs depending on whether the input came from Nanopore (ONT) or Illumina sequencing.

Usage

read_info_file(file, delim = "\t", col_select = NULL, ...)

Arguments

file

Path to the cutadapt info file (TSV).

delim

Field delimiter (default = "\t").

col_select

Optional tidyselect specification of which columns to read.

...

Additional arguments passed to [vroom::vroom()].

Details

- 'read_info_file()' (Illumina) returns columns: 'Seq.id', 'n_errors', 'start_adap', 'end_adap', 'seq_before_adap', 'matching_seq', 'seq_after_adap', 'adap_name', 'QScores_seq_before', 'QScores_matching', 'QScores_after'.

Value

A tibble with the parsed 'cutadapt' information. Column names are standardized.

Examples

test_info <- system.file("extdata", "cutadapt_info_illumina.txt", package="eDNAfuns")
df_illumina <- read_info_file(test_info)

Create contingency tables with two variables

Description

This function takes a tibble and create human readable contingency tables from two variables, either by showing number of cases in each combination or weighted by the sum of a numerical variable

Usage

tally_wide(tibble, rows, cols, wt = NULL, ...)

Arguments

tibble

A tibble containing at least two columns

rows

The column with the levels included as rows in the final table.

cols

The column with the levels included as columns in the final table.

wt

The column (numeric) whose values to add in order to fill the cells. If wt = NULL (the default), counts are returned instead of weighted sums.

...

Any parameters that can be passed to tally_wide 'values_fill' is a useful one

Value

A tibble

Examples

 df <- tibble::tibble(
  group = c("A", "A", "B", "B", "B"),
    outcome = c("yes", "no", "yes", "yes", "no")
 )
tally_wide(df, rows = group, cols = outcome)

template_PCR

Description

Placeholder description for Normal PCR template web address.

Usage

template_PCR

Format

A character vector

Source

Generated by the Author, stored in googledrive

test_seqs

Description

Placeholder description for test_seqs dataset.

Usage

test_seqs

Format

An object of class DNAbin of length 3.

Source

Generated by the Author

Convert a long tibble to a community matrix

Description

Converts a long-format table of sequence counts into a wide community matrix (samples × taxa) suitable for vegan or other community ecology tools.

Produces a distance matrix between samples using vegan's vegdist(). Optionally applies a data transformation before distance calculation.

Returns the environmental data frame (sample metadata) from a long sequence/OTU tibble by removing taxon and abundance columns.

Usage

tibble_to_comm(long.table, taxon, Abundance, sample.name)

tibble_to_dist(
  long.table,
  taxon,
  Abundance,
  sample.name,
  distance = "bray",
  transformation = NULL,
  ...
)

tibble_to_env(long.table, taxon, Abundance, sample.name, ...)

Arguments

long.table

A tibble with sample, taxon, abundance, and metadata columns.

taxon

Column containing taxa/OTU IDs (unquoted).

Abundance

Column with counts/abundance values (unquoted).

sample.name

Column with sample IDs (unquoted).

distance

Distance metric to use (default = "bray").

transformation

Optional transformation (e.g. "hellinger", "log"). See vegan::decostand documentation for a great explanation of all transformations

...

Additional metadata columns to retain.

Value

A numeric matrix with taxa as columns and samples as row names.

A dist object.

A tibble of unique sample-level metadata.

Examples

data("ASV_table")
 tibble_to_comm(ASV_table, 
                taxon = Hash,
                 Abundance = nReads,
                  sample.name = sample_name)
data("ASV_table")
tibble_to_dist(ASV_table,
               taxon = Hash,
               Abundance = nReads,
               sample.name = sample_name,
               distance = "bray",
               transformation = "hellinger")

data("ASV_table")
data("metadata")
dplyr::inner_join(ASV_table,metadata) |> 
tibble_to_env(taxon = Hash,
              Abundance = nReads, 
              sample.name = sample_name)

Convert a tidy ASV table to a phyloseq object

Description

This function converts a tidy ASV table, along with optional taxonomy and metadata, into a 'phyloseq' object.

Extracts ASV counts, taxonomy, and metadata from a 'phyloseq' object into tidy data frames.

Usage

tidy2phyloseq(
  ASV_table,
  OTU_taxonomy = NULL,
  metadata = NULL,
  Taxa = "sseqid",
  Sample = "sample_name",
  Reads = "nr",
  tree = NULL
)

phyloseq2tidy(phylo_obj, Taxa = "sseqid", Sample = "sample_name", Reads = "nr")

Arguments

ASV_table

A tidy data.frame/tibble of ASV counts.

OTU_taxonomy

A data.frame with OTU taxonomy. Optional.

metadata

A data.frame with sample metadata. Optional.

Taxa

Column name for OTU IDs in the output (default: "sseqid").

Sample

Column name for sample IDs in the output (default: "sample_name").

Reads

Column name for read counts in the output (default: "nr").

tree

A phylogenetic tree of class 'phylo'. Optional.

phylo_obj

A phyloseq object.

Value

A 'phyloseq' object combining OTU table, taxonomy, metadata, and optionally a tree.

A list with three tibbles:

ASV_table: Long-format tibble of counts: Sample, Taxa, Reads.
taxonomy: Taxonomy table as tibble (NULL if none in phyloseq).
metadata: Sample metadata as tibble (NULL if none in phyloseq).

Examples

data("ASV_table")
data("metadata")
data("OTU_taxonomy")
ps <- tidy2phyloseq(ASV_table = ASV_table,
                    OTU_taxonomy = OTU_taxonomy,
                    metadata = metadata,
                    Taxa = "Hash",
                    Reads = "nReads")


data("ps")
tidy_list <- phyloseq2tidy(ps)

training.ASV.table

Description

Placeholder description for training.ASV.table dataset.

Usage

training.ASV.table

Format

A data frame with 1909 rows and 4 variables:

Sample_name: Character vector, last digit includes PCR replicate
Locus: Which locus the sequences are from, numeric
Hash: The sequence found, hashed using sha1 for consistency
nReads: Number of times that hash was found in that sample

Source

Generated by the Author

training.metadata

Description

Placeholder description for training.metadata dataset.

Usage

training.metadata

Format

A data frame with 95 rows and 8 columns

Sample_name: Character vector, last digit includes PCR replicate
eDNA.sample: Biological sample
rep: PCR replicate
Transect: Transect of origin
position: Stop in Transect
depth: in the water column
lat: latitude in decimal degrees
lon: longitude in decimal degrees

Source

Generated by the Author

tree

Description

Placeholder description for tree dataset.

Usage

tree

Format

An object of class phylo or similar

Source

Generated by the Author

Write Indexing PCR Spreadsheet

Description

Creates a new Google Sheet for indexing PCRs from a template, fills in the sample information, and writes it into the correct ranges of the sheet.

Usage

write_indexing_PCR(
  data,
  name,
  ss_template = "1naS-F_dj4SNmND5nJ5TKhMX3TikmRS00ILKjfg_Ucgc"
)

Arguments

data

A tibble or dataframe with at least the columns:

Well: Well position
Sample: Sample identifier
Column: Plate column (used to split data across sheet sections)

name

Character. Name for the new Google Sheet that will be created.

ss_template

Character. ID of the template sheet to copy from (default: '"1naS-F_dj4SNmND5nJ5TKhMX3TikmRS00ILKjfg_Ucgc"').

Details

This function: 1. Creates a new Google Sheet with the given 'name'. 2. Copies the template sheet into it. 3. Splits the input 'data' by plate column. 4. Writes each split dataframe into its designated range of the sheet.

Value

A Google Sheet object (as returned by [googlesheets4::gs4_create()]) with the data written into the correct plate layout.

Examples

#Examples are not executed because the function requires identification in Google 
## Not run: 
my_data <- tibble::tibble(
 Well = c("A1","A2"),
  Sample = c("Sample1","Sample2"),
  Column = c(1,2)
)
write_indexing_PCR(my_data, "PCR_001")

## End(Not run)

AMPURE

Description

Usage

Format

Source

ASV_table

Description

Usage

Format

Details

Source

Index_PCR

Description

Usage

Format

Source

OTU_taxonomy

Description

Usage

Format

Source

UDI_Indices

Description

Usage

Format

Source

Count stop codons or return translated sequence

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Create scaled relative proportions of the number of reads of taxa in different samples

Description

Usage

Arguments

Value

Examples

example_hashes

Description

Usage

Format

Source

Read FASTA or FASTQ files into a tibble

Description

Usage

Arguments

Value

Examples

Read FASTA or FASTQ files into a tibble

Description

Usage

Arguments

Value

Examples

metadata

Description

Usage

Format

Source

molarity.data

Description

Usage

Format

Source

Generate mutated DNA sequences

Description

Usage

Arguments

Value

Examples

functions to translate mass into molarity and vice versa, given we are talking about double stranded DNA it requires two inputs, the mass (or molarity) and the length of the DNA fragment It works with the two most common concentrations used in Molecular Ecology labs ng/\mul for mass nM for molarity

Description

Usage

Arguments

Value

Examples

Plot sequence length distribution