Agglomerate counts of same lineage to specified level of taxonomy.
Inspired by phyloseq::tax_glom(), this method sums a numeric variable across genomes that share the same taxonomy at a user-specified taxonomic rank. Agglomeration occurs within each sample, meaning the user-specified variable is only summed within each query_name. This function returns a data frame with the columns `lineage`, `query_name`, and the glom_var column that was specified.

The glom_vars that can be agglomerated are f_unique_to_query, f_unique_weighted, unique_intersect_bp, and n_unique_kmers. Each of these variables deals with the "unique" fraction of the gather match, meaning there is no double counting between the query and the genome matched. f_unique_weighted and n_unique_kmers are weighted by k-mer abundance, while f_unique_to_query and unique_intersect_bp are not.

f_unique_weighted is similar to relative abundance (the "f" stands for fraction). n_unique_kmers is the abundance-weighted number of unique hashes (k-mers) that overlapped between the query and the match in the database. It is calculated by dividing unique_intersect_bp by the scaled value and multiplying the result by the average k-mer abundance. Other variables (like f_orig_query) could sum to more than 1. Only one variable is agglomerated at a time.
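The n_unique_kmers calculation described above can be sketched in a few lines of base R. This is a minimal illustration, not the package's internal code: the toy values are made up, and the column name `average_abund` is an assumption about how the average k-mer abundance is labeled in the input.

```r
# Toy gather-style rows; all values are invented for illustration.
# `average_abund` is an assumed column name for the average k-mer abundance.
gather_rows <- data.frame(
  query_name          = c("sample_a", "sample_a"),
  unique_intersect_bp = c(500000, 120000),
  scaled              = c(1000, 1000),
  average_abund       = c(2.5, 4.0)
)

# Number of unique hashes: unique base pairs divided by the scaled value.
n_unique_hashes <- gather_rows$unique_intersect_bp / gather_rows$scaled

# Weight the hash count by the average k-mer abundance.
gather_rows$n_unique_kmers <- n_unique_hashes * gather_rows$average_abund

print(gather_rows[, c("query_name", "n_unique_kmers")])
```

With these toy numbers, the first row has 500 unique hashes at an average abundance of 2.5, and the second has 120 unique hashes at an average abundance of 4.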
tax_glom_taxonomy_annotate(
  taxonomy_annotate_df,
  tax_glom_level = NULL,
  glom_var = "n_unique_kmers"
)
taxonomy_annotate_df: Data frame containing outputs from sourmash taxonomy annotate. Can contain results from one or many runs of sourmash taxonomy annotate. Agglomeration occurs within each query.
tax_glom_level: Character. NULL by default, meaning no agglomeration is done. Valid options are "domain", "phylum", "class", "order", "family", "genus", and "species". When a valid option is supplied, the glom_var values are agglomerated to that level.
glom_var: Character. One of "f_unique_to_query", "f_unique_weighted", "unique_intersect_bp", or "n_unique_kmers".
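To make the effect of tax_glom_level concrete, the following base R sketch mimics the agglomeration on a toy data frame. It is an illustration under stated assumptions, not the function's actual implementation: it assumes lineages are semicolon-separated strings ordered from domain to species, and the helper `truncate_lineage()` and all values are hypothetical.

```r
# Toy input; lineages and counts are invented for illustration.
annot <- data.frame(
  query_name = c("s1", "s1", "s1", "s2"),
  lineage = c(
    "Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus gasseri",
    "Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus crispatus",
    "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli",
    "Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus gasseri"
  ),
  n_unique_kmers = c(100, 50, 30, 20),
  stringsAsFactors = FALSE
)

ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")

# Hypothetical helper: keep only the ranks up to and including `level`.
# Assumes every lineage has at least that many semicolon-separated ranks.
truncate_lineage <- function(lineage, level) {
  n_keep <- match(level, ranks)
  parts <- strsplit(lineage, ";", fixed = TRUE)
  vapply(parts, function(p) paste(p[seq_len(n_keep)], collapse = ";"),
         character(1))
}

annot$lineage <- truncate_lineage(annot$lineage, "genus")

# Sum the chosen variable within each lineage and each query_name.
glom <- aggregate(n_unique_kmers ~ lineage + query_name, data = annot, FUN = sum)
print(glom)
```

In this toy example the two Lactobacillus species in sample s1 collapse into a single genus-level row, while the same genus in sample s2 stays separate because sums are never taken across query_name values.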
Selecting which glom_var to use for downstream use cases can be difficult. We most frequently use f_unique_weighted and n_unique_kmers, as both account for the number of times a k-mer occurs in a data set. This is closer to counting the number of reads that would map against a reference genome than the other metrics. When the downstream use case deals with relative abundance, f_unique_weighted is a good choice. When the downstream use case needs count data, we use n_unique_kmers. Because we divide by the scaled value to generate this number, the value will be much lower than the corresponding read-mapping count. However, calculating it this way returns the actual number of k-mers sourmash counted. This tends to work better with the assumptions made by downstream statistical tools (e.g., for differential abundance analysis or machine learning).
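Many count-based downstream tools expect a taxon-by-sample matrix rather than a long data frame. As a minimal sketch, assuming the returned data frame has the `lineage`, `query_name`, and `n_unique_kmers` columns described above, it could be reshaped with base R as follows. The toy values are invented, and this reshaping step is a suggestion rather than part of the function itself.

```r
# Toy agglomerated output in long format; values are invented.
glom <- data.frame(
  lineage        = c("Bacteria;Firmicutes", "Bacteria;Proteobacteria",
                     "Bacteria;Firmicutes"),
  query_name     = c("s1", "s1", "s2"),
  n_unique_kmers = c(150, 30, 20),
  stringsAsFactors = FALSE
)

# Cross-tabulate counts: rows are lineages, columns are samples.
# Lineage/sample combinations absent from the input are filled with 0.
count_table <- xtabs(n_unique_kmers ~ lineage + query_name, data = glom)

# Convert the contingency table to a plain numeric matrix.
count_matrix <- as.matrix(as.data.frame.matrix(count_table))
print(count_matrix)
```

Filling absent combinations with 0 treats a missing match as zero observed k-mers, which is an assumption worth checking against the requirements of the downstream tool.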