Agglomerate counts of same lineage to specified level of taxonomy.
tax_glom_taxonomy_annotate.Rd
Inspired by phyloseq::tax_glom(), this function sums a numeric variable across genomes that share the same taxonomy at a user-specified taxonomic rank. Agglomeration occurs within each sample, meaning the user-specified variable is only summed within each query_name. The function returns a data frame with the columns `lineage`, `query_name`, and the glom_var column that was specified. Only one variable is agglomerated at a time.
The glom_vars that can be agglomerated are f_unique_to_query, f_unique_weighted, unique_intersect_bp, and n_unique_kmers. Each of these variables describes the "unique" fraction of the gather match, meaning there is no double counting between the query and the genome matched. f_unique_weighted and n_unique_kmers are weighted by k-mer abundance, while f_unique_to_query and unique_intersect_bp are not. Other variables (like f_orig_query) could sum to more than 1 and are not accepted.
f_unique_weighted is similar to relative abundance (the "f" stands for fraction). n_unique_kmers is the abundance-weighted number of unique hashes (k-mers) that overlapped between the query and the match in the database. It is calculated by dividing unique_intersect_bp by the scaled value and multiplying the result by the average k-mer abundance.
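The n_unique_kmers calculation described above can be illustrated with a few lines of base R. This is only a sketch of the arithmetic, not the package's internal code: the numbers are invented, and the variable names `scaled` and `average_abund` are assumptions about how the inputs are labeled.

```r
# Hypothetical values for a single gather match (invented for illustration).
unique_intersect_bp <- 250000  # base pairs uniquely assigned to this match
scaled <- 1000                 # scaled value used when the sketch was built
average_abund <- 3.2           # average abundance of the matched k-mers

# Number of unique hashes (k-mers) behind the match, ignoring abundance.
n_unique_hashes <- unique_intersect_bp / scaled

# Abundance-weighted count, following the description above.
n_unique_kmers <- n_unique_hashes * average_abund

print(n_unique_hashes)
print(n_unique_kmers)
```

Dividing by the scaled value is what keeps this count far below the number of base pairs involved in the match.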
Usage
tax_glom_taxonomy_annotate(
taxonomy_annotate_df,
tax_glom_level = NULL,
glom_var = "n_unique_kmers"
)
Arguments
- taxonomy_annotate_df
Data frame containing outputs from sourmash taxonomy annotate. Can contain results from one or many runs of sourmash taxonomy annotate. Agglomeration occurs within each query.
- tax_glom_level
Character. NULL by default, meaning no agglomeration is done. Valid options are "domain", "phylum", "class", "order", "family", "genus", and "species". When a valid option is supplied, the values of glom_var are agglomerated to that level.
- glom_var
Character. One of f_unique_to_query, f_unique_weighted, unique_intersect_bp, or n_unique_kmers.
Details
Selecting which glom_var to use downstream can be difficult. We most frequently use f_unique_weighted and n_unique_kmers because both account for the number of times a k-mer occurs in a data set. This is closer to counting the number of reads that would map against a reference genome than the other metrics are. When the downstream use case deals with relative abundance, f_unique_weighted is a good choice. When the downstream use case needs count data, we use n_unique_kmers. Because we divide by the scaled value to generate this number, it will be much lower than a read-mapping count. However, calculating it this way returns the actual number of k-mers sourmash counted, which tends to work better with the assumptions made by downstream statistical tools (e.g. for differential abundance analysis or machine learning).
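To make the within-sample agglomeration concrete, here is a minimal base R sketch of the idea. It does not call tax_glom_taxonomy_annotate() and is not the package's implementation. The toy values are invented, the semicolon-separated lineage strings with rank prefixes are an assumed format, and `truncate_lineage()` is a hypothetical helper written for this example.

```r
# Toy input with invented values. Column names follow the ones named on this
# page; the lineage string format is an assumption made for illustration.
toy <- data.frame(
  query_name = c("sample1", "sample1", "sample1", "sample2"),
  lineage = c(
    "d__Bacteria;p__Firmicutes;c__Bacilli",
    "d__Bacteria;p__Firmicutes;c__Clostridia",
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria",
    "d__Bacteria;p__Firmicutes;c__Bacilli"
  ),
  n_unique_kmers = c(100, 50, 25, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the first `n_ranks` ranks of each lineage.
truncate_lineage <- function(lineage, n_ranks) {
  vapply(
    strsplit(lineage, ";", fixed = TRUE),
    function(ranks) {
      keep <- seq_len(min(n_ranks, length(ranks)))
      paste(ranks[keep], collapse = ";")
    },
    character(1)
  )
}

# Agglomerate to phylum (the second rank here), summing only within each
# query_name so that counts from different samples are never combined.
toy$lineage <- truncate_lineage(toy$lineage, n_ranks = 2)
glommed <- aggregate(n_unique_kmers ~ lineage + query_name, data = toy, FUN = sum)
print(glommed)
```

In this toy data the two Firmicutes rows from sample1 collapse into a single phylum-level row, while the Firmicutes row from sample2 stays separate because it belongs to a different query.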