Create a tidy data.frame representing per-sample rarefaction information — from_signatures_to_rarefaction

`from_signatures_to_rarefaction_df()` rarefies sourmash signatures by sample to assess sequencing depth. The output is a tidy data frame that can be used in ggplot2 plots. It uses `vegan::rarecurve()` to calculate a rarefaction curve for each sample in the input sourmash signatures data frame.

The `name` column is used for sample names. If the column does not exist or is entirely blank, sample names are derived from the base name of `filename`. If an individual name is blank ("") or NA, that name alone is derived from the base name of `filename`.

Each rarefaction curve is evaluated at sample sizes spaced `step` apart, always including 1 and the total sample size.

This function is intended for signatures built from read data, and it uses abundance information in the rarefaction curve calculation. Because signatures may have been built from reads that were not k-mer trimmed, a filtering step removes minhashes that are observed in only one sample at abundance 1, as these are likely sequencing errors. This filtering may invalidate downstream estimates of rarefaction curve convergence, because many of those methods rely on the counts of singletons and doubletons in the data set.
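The name fallback and the singleton filter described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy table and its `minhash` and `abundance` column names are hypothetical, and keeping the file extension in the derived name is an assumption.

```r
# Toy long-format table: one row per (sample, minhash) pair.
# `minhash` and `abundance` are hypothetical column names for illustration.
toy <- data.frame(
  name      = c("s1", "s1", "s1", "", NA),
  filename  = c("data/s1.sig", "data/s1.sig", "data/s1.sig",
                "data/s2.sig", "data/s2.sig"),
  minhash   = c(11, 22, 33, 11, 44),
  abundance = c(3, 1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Fill blank ("") or NA sample names from the base name of the file.
# Assumption: the extension is kept; the real function may handle it differently.
missing_name <- is.na(toy$name) | toy$name == ""
toy$name[missing_name] <- basename(toy$filename[missing_name])

# For each minhash: how many distinct samples contain it, and its summed abundance.
n_samples   <- tapply(toy$name, toy$minhash, function(x) length(unique(x)))
total_abund <- tapply(toy$abundance, toy$minhash, sum)

# Drop minhashes observed in exactly one sample at abundance 1.
likely_errors <- names(n_samples)[n_samples == 1 & total_abund == 1]
filtered <- toy[!(as.character(toy$minhash) %in% likely_errors), ]
```

In this toy input, minhashes 22 and 33 are removed, while 11 (shared by two samples) and 44 (abundance 2) are kept.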

Usage

from_signatures_to_rarefaction_df(signatures_df, step = 1)

Arguments

signatures_df: A data frame of multiple sourmash signatures created by combining many signatures read in by read_signature(). The data frame must contain the `abundances` column (generated by using the `abund` parameter with `sourmash sketch`).
step: Integer. The spacing between the sample sizes at which each rarefaction curve is evaluated.
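To make the effect of `step` concrete, the sketch below builds the grid of sample sizes for a single sample. The helper `rarefaction_sizes()` is hypothetical and only mirrors the documented behaviour (sizes spaced `step` apart, always including 1 and the total); it is not a function from this package or from vegan.

```r
# Hypothetical helper: sample sizes at which one curve would be evaluated.
rarefaction_sizes <- function(total, step = 1) {
  n <- seq(1, total, by = step)        # starts at 1, spaced `step` apart
  if (n[length(n)] != total) {
    n <- c(n, total)                   # always include the total sample size
  }
  n
}

sizes_a <- rarefaction_sizes(10, step = 4)  # 1 5 9 10
sizes_b <- rarefaction_sizes(9, step = 4)   # 1 5 9
```

A larger `step` gives fewer evaluation points per curve, which reduces computation at the cost of a coarser curve.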

Value

A tidy data frame.

Examples

if (FALSE) {
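# signatures_df: several signatures read in by read_signature(), combined into one data frame
rarefaction_df <- from_signatures_to_rarefaction_df(signatures_df, step = 1)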
}