## Introduction

Sourmash is a tool that facilitates lightweight and rapid comparisons between potentially very large sets of sequencing data1 2.

Sourmashconsumr provides parsing, analysis, and visualization functions for outputs generated by the sourmash commands sketch, compare, gather, and taxonomy annotate. The sourmash outputs that sourmashconsumr works with are information-rich text files (either CSV or JSON). The sourmashconsumr functions are designed to help interpret these outputs.

sourmashconsumr does not re-implement any of the functionality in the original sourmash software.

### How sourmash works

Sourmash uses FracMinHash sketching to generate compressed representations of sequencing data. Two core parameters determine which sequences are represented in these sketches: k-mer size and scaled value. The k-mer size dictates the length of sequences represented in the sketch; defaults are k = 21, k = 31, and k = 51. While these sizes are somewhat empiric, overlap between related genomic sequences for a given k-mer size roughly approximates taxonomic relationships, with substantial overlap at k = 21 roughly approximating genus-level relatedness, k = 31 species-level, and k = 51 strain-level 3. The scaled value dictates the fraction of k-mers in the original sequence that get included in the sketch. Approximately 1/scaled value of k-mers in the original sequence are retained in the sketch. The most common scaled values are 1000, 2000, and 10,000. The scaled value takes advantage of the fact that a subset of sequencing data can be used to accurately estimate things like similarity and containment and dramatically sub-samples the original sequences thus facilitating rapid comparisons even between very large data sets.

### Sourmash outputs that sourmashconsumr works on

The sourmashconsumr package works with outputs from the following sourmash commands. We provide a brief description of each command, but be sure to check the sourmash documentation for a more complete explanation and examples.

• sourmash sketch: Produces sketches for each sequence. Any sequencing data (DNA, protein, or 6-frame translation) in FASTQ or FASTA format can be sketched. Files produced by sourmash sketch are referred to as signatures because they may contain many sketches (one sketch for each k-mer size).
• sourmash compare: Compares many sketches and produces a (dis)similarity matrix. If abundance information is ignored, the output represents jaccard distance. If abundance information is included, the output represents angular distance.
• sourmash gather: Compares each query sketch against all sketches in a database to identify the minimum set of sketches that contain all of the known k-mers in the query. Typically, sourmash gather is run on metagenomes to determine all of the genomes that are contained within the sample, however sourmash gather can be used to assess containment between a query and a large database of sequences (e.g. for contamination discovery).
• sourmash taxonomy annotate: Assigns taxonomic lineages to the sourmash gather matches.

Naming signatures in sourmash to get the most out of sourmashconsumr functions

Many functions in the sourmashconsumr R package rely on signatures names. Signature names are set when you sketch a file by using the --name or --merge flags:

sourmash sketch dna -p k=21,k=31,k=51,scaled=1000,abund -o sample1.sig --name sample1 sample1*fastq.gz

The above command would create a signature with three DNA sketches (one for each k size), each with a scaled value of 1000 and abundance tracking turned on. The signature name would be sample1 and it would contain sequences from all FASTQ files that start with sample1 and end in fastq.gz, and it would be written to the file sample1.sig.

The signature name would be read into R as a column, name, if you read the output file sample1.sig into R (see below). The signature name is also propagated to the outputs of sourmash gather and sourmash taxonomy annotate; if you read these output files into R, the signature name would be in the column query_name.

Many of the sourmashconsumr functions use the name or query_name columns. If you didn’t use the --name flag when you sketched your signatures, you have two options to create the name/query_name column:

1. If you would like to add names to your sourmash signatures, you can use the sourmash command sourmash sig rename. This might be the best solution if you plan to work with the signatures more.
sourmash signature rename file1.sig "new name" -o renamed.sig
1. If you already ran sourmash gather on your signatures, you can instead edit this column in R or excel. For example, you could do something like this in R:
sourmashconsumr::read_gather("my_gather_results.csv") %>%
dplyr::mutate(query_name = basename(query_filename))

## Installation

You can install the development version of sourmashconsumr from GitHub with:

# install.packages("remotes")
remotes::install_github("Arcadia-Science/sourmashconsumr")

Eventually, we hope to release sourmashconsumr on CRAN and to provide a conda-forge package. We’ll update these instructions once we’ve done that.

To access the functions in the sourmashconsumr package:

library(sourmashconsumr)

## Problems & getting help

If you encounter any problems using sourmashconsumr, please post an issue on the issue tracker: https://github.com/Arcadia-Science/sourmashconsumr/issues. If you encounter problems with sourmash, please post an issue on the sourmash issue tracker: https://github.com/sourmash-bio/sourmash/issues.

## Package functionality

### Load example data and vignette dependencies

Load the packages this vignette uses:

library(sourmashconsumr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)

data(gut_compare_df)
data(gut_signatures_df)
data(gut_gather_df)
data(gut_taxonomy_annotate_df)

Make a metadata frame that corresponds to the example data:

# create a metadata data frame
run_accessions <- c("SRR5936131", "SRR5947006", "SRR5935765",
"SRR5936197", "SRR5946923", "SRR5946920")
groups <- c("cd", "cd", "cd", "nonibd", "nonibd", "nonibd")
metadata <- data.frame(run_accessions = run_accessions, groups = groups) 

#### Example data

The sourmashconsumr package comes with a few example data sets that you can load to get a feel for data formats and to play around with the functions in the package.

The example data are six samples from the iHMP IBD cohort. These are short shotgun metagenomes from stool microbiomes. While the iHMP was a longitudinal study, the samples in the example data are all time point 1 samples taken from different individuals. All individuals were symptomatic at time point 1, but three individuals were diagnosed with Crohn’s disease (cd) at the end of the year, while three individuals were not (nonIBD).

run_accession group
SRR5936131 cd
SRR5947006 cd
SRR5935765 cd
SRR5936197 nonibd
SRR5946923 nonibd
SRR5946920 nonibd

You can see the complete set of commands that were used to generate the data here.

We’ve documented the content of each of the test data sets, including an explanation for what is encoded by each column. If you’re having trouble understanding the content of the sourmash output files, the data set documentation may help. Use the following code to access this information:

?gut_gather_df

### Reading sourmash outputs into R

All of the sourmashconsumr functions that read in the sourmash outputs start with read_. All functions accepts compressed or uncompressed files and can read one or multiple files at a time. When multiple file paths are passed to a function, all files are read into a single data frame. If you would like to read in multiple files at once, the Sys.glob() function or the list.files() function are useful to determine the paths for many files.

Explanation about accessing external files packaged with sourmashconsumr

Because some of the functions read external files, we’ve also included example sourmash outputs with the package to demonstrate how the read*() functions work. Specifying paths to data files packaged with an R package is a bit ugly…it takes advantage of the base R function system.file and information about the package to derive the file paths. While this is helpful because it allows anyone to access the files in a package no matter where the package is installed on their computer, the syntax can be confusing because the example ends up looking a lot different than the code you would actually write. Whenever you see system.file() in this vignette, know that when you go to run similar code on your own sourmash outputs, you need to specify the path to your file to run the command without all of the accompanying system.file() code. See an example below:

# how we have to specify paths so that anyone can access the package data
fpath <- system.file("extdata", "SRR5947006.sig.gz", package="sourmashconsumr")
signature_df <- read_signature(fpath)
# how you will specify a path when you actually run the code
signature_df <- read_signature("path/to/your_signature_file.sig")

#### Reading sourmash sigatures into R

The sourmashconsumr function read_signature() reads sourmash signatures produced by the sourmash sketch or sourmash compute function into a data frame.

fpath <- system.file("extdata", "SRR5947006.sig.gz", package="sourmashconsumr")
#> # A tibble: 6 × 16
#>   class     email hash_…¹ filen…² name  license   num ksize  seed max_h…³ scaled
#>   <chr>     <chr> <chr>   <chr>   <chr> <chr>   <int> <int> <int>   <dbl>  <dbl>
#> 1 sourmash… ""    0.murm… SRR594… SRR5… CC0         0    21    42 9.22e13 200000
#> 2 sourmash… ""    0.murm… SRR594… SRR5… CC0         0    21    42 9.22e13 200000
#> 3 sourmash… ""    0.murm… SRR594… SRR5… CC0         0    21    42 9.22e13 200000
#> 4 sourmash… ""    0.murm… SRR594… SRR5… CC0         0    21    42 9.22e13 200000
#> 5 sourmash… ""    0.murm… SRR594… SRR5… CC0         0    21    42 9.22e13 200000
#> 6 sourmash… ""    0.murm… SRR594… SRR5… CC0         0    21    42 9.22e13 200000
#> # … with 5 more variables: mins <dbl>, md5sum <chr>, abundances <int>,
#> #   molecule <chr>, version <dbl>, and abbreviated variable names
#> #   ¹​hash_function, ²​filename, ³​max_hash

If you have many signatures that you would like to read into the same data frame, you can pass a character vector of file paths:

# grab the files paths for two signatures in the sourmashconsumr example data
fpath1 <- system.file("extdata", "SRR5947006.sig.gz", package="sourmashconsumr")
fpath2 <- system.file("extdata", "SRR5935765.sig.gz", package="sourmashconsumr")
fpaths <- c(fpath1, fpath2)
# read in the signature files
signature_df <- read_signature(fpaths)

#### Reading the CSV output of sourmash compare

The sourmash compare output is a square matrix. The CSV has column names but no row names, but the rownames are symmetrical to the column names. The read_compare_csv() function reads in the compare CSV file and adds row names. You can choose whether the row labels are read in as a column named “sample” or as row names using the sample_to_rownames argument.

fpath <- system.file("extdata", "gut_compare.csv.gz", package = "sourmashconsumr")
compare_df <- read_compare_csv(fpath, sample_to_rownames = F)
#> Rows: 6 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (6): SRR5935765, SRR5936131, SRR5936197, SRR5946920, SRR5946923, SRR5947006
#>
#> ℹ Use spec() to retrieve the full column specification for this data.
#> ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
compare_df
#> # A tibble: 6 × 7
#>   sample     SRR5935765 SRR5936131 SRR5936197 SRR5946920 SRR5946923 SRR5947006
#>   <chr>           <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1 SRR5935765      1          0.283      0.269     0.139       0.197     0.115
#> 2 SRR5936131      0.283      1          0.488     0.148       0.350     0.135
#> 3 SRR5936197      0.269      0.488      1         0.183       0.546     0.134
#> 4 SRR5946920      0.139      0.148      0.183     1           0.175     0.0943
#> 5 SRR5946923      0.197      0.350      0.546     0.175       1         0.172
#> 6 SRR5947006      0.115      0.135      0.134     0.0943      0.172     1

#### Reading the CSV output of sourmash gather

The read_gather() function reads in the CSV output by sourmash gather. You can optionally filter the matches using intersect_bp_threshold. The function does no filtering by default. However, sourmash gather uses a threshold of 50,000 base pairs, so if you decreased this value when you ran sourmash gather, it may make sense to filter with intersect_bp_thershold = 50000. Most false positives are caused by contamination in the reference genomes in the database, so low filtering thresholds can improve accuracy.

fpath <- system.file("extdata", "SRR5946920_gather_gtdbrs207_reps.csv.gz", package = "sourmashconsumr")
gather_df <- read_gather(fpath, intersect_bp_threshold = 50000)
#> # A tibble: 6 × 30
#>   inters…¹ f_ori…² f_match f_uni…³ f_uni…⁴ avera…⁵ media…⁶ std_a…⁷ filen…⁸ name
#>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <chr>
#> 1  3787000  0.0318   0.583  0.0318  0.0496   10.9        9   25.2  /Users… GCF_…
#> 2  3544000  0.0298   0.707  0.0297  0.0956   22.4       21   17.1  /Users… GCF_…
#> 3  3622000  0.0304   0.773  0.0294  0.0298    7.07       6    5.44 /Users… GCF_…
#> 4  2993000  0.0252   0.613  0.0251  0.0140    3.89       3    4.51 /Users… GCF_…
#> 5  2941000  0.0247   0.619  0.0240  0.0531   15.5       14   10.7  /Users… GCF_…
#> 6  2389000  0.0201   0.585  0.0192  0.0624   22.7       21   13.6  /Users… GCF_…
#> # … with 20 more variables: genome_accession <chr>, md5 <chr>,
#> #   f_match_orig <dbl>, unique_intersect_bp <dbl>, gather_result_rank <dbl>,
#> #   remaining_bp <dbl>, query_filename <chr>, query_name <chr>,
#> #   query_md5 <chr>, query_bp <dbl>, ksize <dbl>, moltype <chr>, scaled <dbl>,
#> #   query_n_hashes <dbl>, query_abundance <lgl>, query_containment_ani <dbl>,
#> #   match_containment_ani <dbl>, average_containment_ani <dbl>,
#> #   max_containment_ani <dbl>, potential_false_negative <lgl>, and …

#### Reading the CSV output of sourmash taxonomy annotate

The read_taxonomy_annotate() function reads the CSV output by sourmash taxonomy annotate. Taxonomy annotate adds a taxonomic lineage column to the sourmash gather output. You can filter with intersect_bp_threshold (see above). You can also separate the lineage into taxonomy ranks using separate_lineage. This function assumes that the lineage ranks are separated by semi-colons (;), and that the lineage is composed of a domain, phylum, class, order, family, genus, species, and strain. If information is missing, it is back-filled with NAs.

fpath <- system.file("extdata", "SRR5946920_gather_gtdbrs207_reps.with-lineages.csv.gz", package = "sourmashconsumr")
intersect_bp_threshold = 50000,
separate_lineage = T)
#> # A tibble: 6 × 40
#>   inters…¹ f_ori…² f_match f_uni…³ f_uni…⁴ avera…⁵ media…⁶ std_a…⁷ filen…⁸ name
#>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <chr>
#> 1  3787000  0.0318   0.583  0.0318  0.0496   10.9        9   25.2  /Users… GCF_…
#> 2  3544000  0.0298   0.707  0.0297  0.0956   22.4       21   17.1  /Users… GCF_…
#> 3  3622000  0.0304   0.773  0.0294  0.0298    7.07       6    5.44 /Users… GCF_…
#> 4  2993000  0.0252   0.613  0.0251  0.0140    3.89       3    4.51 /Users… GCF_…
#> 5  2941000  0.0247   0.619  0.0240  0.0531   15.5       14   10.7  /Users… GCF_…
#> 6  2389000  0.0201   0.585  0.0192  0.0624   22.7       21   13.6  /Users… GCF_…
#> # … with 30 more variables: genome_accession <chr>, md5 <chr>,
#> #   f_match_orig <dbl>, unique_intersect_bp <dbl>, gather_result_rank <dbl>,
#> #   remaining_bp <dbl>, query_filename <chr>, query_name <chr>,
#> #   query_md5 <chr>, query_bp <dbl>, ksize <dbl>, moltype <chr>, scaled <dbl>,
#> #   query_n_hashes <dbl>, query_abundance <lgl>, query_containment_ani <dbl>,
#> #   match_containment_ani <dbl>, average_containment_ani <dbl>,
#> #   max_containment_ani <dbl>, potential_false_negative <lgl>, lineage <chr>, …

### Working with sourmash signatures (FracMinHash sketches)

Sourmash signatures are compressed representations of sequencing data. This data can be used for many different purposes such as machine learning4, differential k-mer abundance analysis, and sequencing depth/rarefaction analysis.

#### Visualizing shared k-mer content between samples with an upset plot

The functions from_signatures_to_upset_df() and plot_signatures_upset() produce an upset plot that depicts the number of k-mers shared between sets of samples. The absolute size of the intersection is the number of shared hashes (k-mers) in scaled FracMinHash sketches (see How sourmash works for an explanation of sketches and their content.). These plots do not take abundance information into account.

An upset plot is an alternative to a venn diagram; the bottom half of the plot shows which samples have shared content while the top half of the plot shows the size of the intersection between the specified samples.

This plot is good for visualizing the exact sequencing content overlap between samples. If there is a lot of variation in the samples, it might be difficult to interpret the plot as there may be many intersections. Still, it provides a nice complement to the sourmash compare results and plots as it gives more information about the nature of the intersections between samples.

gut_signatures_df <- gut_signatures_df %>%
dplyr::filter(ksize == 31)

gut_signatures_upset_df <- from_signatures_to_upset_df(signatures_df = gut_signatures_df)
#>              SRR5935765 SRR5936131 SRR5936197 SRR5946920 SRR5946923 SRR5947006
#> 7504968134            1          1          0          1          0          1
#> 47418666416           1          0          0          0          0          0
#> 91683251943           1          1          0          1          0          1
#> 133675644697          1          0          0          0          0          0
#> 232110850204          1          0          0          0          0          0
#> 320454190342          1          1          1          1          1          0
plot_signatures_upset(gut_signatures_upset_df)

Any additional arguments that can be passed to ComplexUpset::upset() can be added to the end of the function to further control the look of the plot. ComplexUpset has a lot of documentation and examples for modifying plots. Below, we use some of these arguments to modify common parts of the upset plot. We use bright colors to make it obvious which components of the plot the arguments modify.

plot_signatures_upset(upset_df = gut_signatures_upset_df,
# arguments to ComplexUpset::upset():
stripes = c("cyan"),
name = "Groups!",
min_size = 10,
themes = ComplexUpset::upset_default_themes(text=element_text(size = 16, color = "orange"),
axis.text.y = element_text(size = 13, color = "blue"),
panel.grid = element_blank()))

#### Assessing sample sequencing depth using rarefaction (accumulation) curves

The goal of the from_signatures_to_rarefaction_df() and plot_signatures_rarefaction() functions is to produce a rarefaction curve to assess whether an adequate sequencing depth was reached to capture the complexity of the sample sequenced. These functions should work on raw or trimmed reads from short or long read sequencing data generated from most library types (e.g. genomes, transcriptomes, metagenomes, metatranscriptomes, etc.). If your library is non-complex (e.g. amplicon sequencing data), you may need to use a lower scaled value. Rarefaction curves rely on the abundance of sequences in a sample. Therefore, it only makes sense to generate rarefaction curves from signatures that were made with FASTQ files and that had abundance tracking turned on. Further, these curves may be more accurate if you build them with sequences from many samples from similar environments. In case signatures were built from reads that have not been k-mer trimmed, there is a filtering step that removes hashes (k-mers) that are only observed once and only observed in one sample as these are likely sequencing errors. Please note that this filtering process may invalidate downstream rarefaction curve convergence estimation as many of these methods evaluate singletons and doubletons in the data set.

The function from_signatures_to_rarefaction_df() may take a while to run. We’ve found it produces fairly consistent results even when signatures are dramatically downsampled with a high scaled value (e.g. scaled = 100000). You can downsample your signatures with sourmash prior to reading them into R using the function sourmash sig downsample. See the sourmash documentation for more details.

# filter to a single k-mer size
gut_signatures_df <- gut_signatures_df %>%
dplyr::filter(ksize == 31)

rarefaction_df <- from_signatures_to_rarefaction_df(signatures_df = gut_signatures_df)
#>         name num_kmers_sampled num_kmers_observed
#> 1 SRR5935765                 1           1.000000
#> 2 SRR5935765                 2           1.993401
#> 3 SRR5935765                 3           2.980279
#> 4 SRR5935765                 4           3.960713
#> 5 SRR5935765                 5           4.934776
#> 6 SRR5935765                 6           5.902544
rarefaction_plt <- plot_signatures_rarefaction(rarefaction_df = rarefaction_df, fraction_of_points_to_plot = 1)
rarefaction_plt

The plot produced by plot_signatures_rarefaction is a ggplot2 plot, so you can modify it by adding additional layers:

rarefaction_plt +
ggplot2::theme_minimal()

Color by sample name:

rarefaction_plt +
ggplot2::theme_minimal() +
ggplot2::geom_point(ggplot2::aes(color = name))

Add your own metadata to the rarefaction_df and use it in the plot:

# join the metadata data frame with the rarefaction_df
rarefaction_df <- rarefaction_df %>%
dplyr::left_join(metadata, by = c("name" = "run_accessions"))

# create and modify a plot
plot_signatures_rarefaction(rarefaction_df = rarefaction_df, fraction_of_points_to_plot = 1) +
ggplot2::theme_minimal() +
ggplot2::geom_point(ggplot2::aes(color = groups))

### Working with the CSV output of sourmash compare

Sourmash compare outputs an all-by-all similarity matrix for a group of samples (by default, the matrix is a similarity matrix, but users can switch this to a dissimilarity matrix with a command line flag). When the samples were sketched with abundance tracking as was done here (e.g. the abundance of each k-mer was recorded), the similarity matrix encodes the angular distance. The sourmashconsumr package supports two visualizations from the CSV output of sourmash compare, an MDS (ordination) plot and a clustered heat map.

#### MDS plot from sourmash compare output

The MDS plot allows you to visualize how similar each sample is to every other sample in two dimensional space. Samples that cluster more closely together are most similar.

gut_compare_mds_df <- make_compare_mds(gut_compare_df)
#>       sample       MDS1        MDS2
#> 1 SRR5935765  0.0775339  0.13464650
#> 2 SRR5936131 -0.3745227 -0.03349708
#> 3 SRR5936197 -0.5022308 -0.03442290
#> 4 SRR5946920  0.5039572  0.64805445
#> 5 SRR5946923 -0.3743700 -0.10706435
#> 6 SRR5947006  0.6696324 -0.60771661
compare_plt <- plot_compare_mds(gut_compare_mds_df)
compare_plt

The MDS plot is a ggplot2 object, so you can modify it by adding layers.

compare_plt +
ggplot2::theme_minimal()

You can also add metadata directly to the data frame output by make_compare_mds so that you can add it to a plot:

gut_compare_mds_df %>%
dplyr::left_join(metadata, by = c("sample" = "run_accessions"))
#>       sample       MDS1        MDS2 groups
#> 1 SRR5935765  0.0775339  0.13464650     cd
#> 2 SRR5936131 -0.3745227 -0.03349708     cd
#> 3 SRR5936197 -0.5022308 -0.03442290 nonibd
#> 4 SRR5946920  0.5039572  0.64805445 nonibd
#> 5 SRR5946923 -0.3743700 -0.10706435 nonibd
#> 6 SRR5947006  0.6696324 -0.60771661     cd

plot_compare_mds(gut_compare_mds_df) +
ggplot2::geom_point(ggplot2::aes(color = groups))

#### Heatmap from sourmash compare output

The plot_compare_heatmap object works directly on the output of read_compare_csv(). It’s a wrapper function for base::heatmap(), so you can add any arguments accepted by base::heatmap() to modify the look of the plot.

plot_compare_heatmap(gut_compare_df, cexRow = 0.75, cexCol = 0.75)

### Working with the CSV output of sourmash gather

Sourmash gather compares a query (here, a metagenome) against a database (here, GTDB rs207 representatives) and provides the minimum set of genomes in the database that cover (contain) all of the k-mers in the query.

Sourmash gather works best for metagenomes from environments that have been sequenced before or from which many genomes have been isolated and sequenced. Because k-mers are so specific, a genome needs to be in a database for sourmash gather to find a match. Sourmash gather won’t find much above species (k = 31) or genus (k = 21) similarity, so if most of the organisms in a sample are new, sourmash won’t be able to label them.

Sourmashconsumr produces two plots from the sourmash gather results: a high-level bargraph summarizes the fraction of the sample that had matches in the database, and an upset plot showing intersections between genomes that were matched in multiple samples. These plots are designed to visualize the output of running sourmash gather on many samples in the same plot.

#### Fraction classifed plot from sourmash gather output

The actual number that is plotted is referred to as abundance-weighted unique fraction. By default, sourmash gather returns the following information:

• f_orig_query: the fraction of the original query (metagenome sample) that matched against the genome
• f_unique_to_query: the fraction of the matched genome that was unique to the query. Given that multiple matched genomes could cover the same portion of the query, this column reports the fraction of the query that was covered by the a matched genome, but doesn’t allow multiple genomes to cover the same portion. As such, the genome that covers the most k-mers in a metagenome sample gets to “anchor” those k-mers, while other genomes that could cover those k-mers don’t get any of them. This column removes the double counting problem that arises from having multiple genomes in a database that contain some of the same sequences.
• f_unique_weighted: This column weights the f_unique_to_query by the abundance of those k-mers. This is important in the context of a metagenome, as the abundance reflects the abundance of the genome in the original microbial community.

The plots below use f_unique_weighted to look at the taxonomic composition of each metagenome sample. This value is similar to looking at the fraction of reads that would map against reference genomes.

plot_gather_classified(gather_df = gut_gather_df)

#### Upset plot from sourmash gather output

An upset plot is an alternative to a venn diagram; the bottom half of the plot shows which samples have shared content while the top half of the plot shows the size of the intersection between the specified samples.

This upset plot visualizes the intersection of genomes that were identified in different samples.

gut_gather_upset_df <- from_gather_to_upset_df(gather_df = gut_gather_df)
#>                 SRR5935765 SRR5936131 SRR5936197 SRR5946920 SRR5946923
#> GCF_002234575.2          1          1          0          1          0
#> GCF_001314995.1          1          1          1          1          1
#> GCF_000154205.1          1          1          1          1          1
#> GCF_000012825.1          1          1          1          1          1
#> GCF_002222615.2          1          1          1          1          1
#> GCF_000466485.1          1          1          0          1          0
#>                 SRR5947006
#> GCF_002234575.2          1
#> GCF_001314995.1          1
#> GCF_000154205.1          1
#> GCF_000012825.1          1
#> GCF_002222615.2          0
#> GCF_000466485.1          1
plot_gather_upset(upset_df = gut_gather_upset_df)

The bars in the upset plot can optionally be colored by the database the match came from. This is useful when you use many databases when running sourmash gather. Since we only used one database to classify our samples in the example data, the visualization isn’t much more rewarding.

plot_gather_upset(upset_df = gut_gather_upset_df,
color_by_database = T,
gather_df = gut_gather_df)

You can change the palette by passing a character vector of colors to the palette argument. If no palette is specified, it defaults to RColorBrewer’s Set2.

plot_gather_upset(upset_df = gut_gather_upset_df,
color_by_database = T,
gather_df = gut_gather_df,
palette = "red")

See the Visualizing shared k-mer content between samples with an upset plot for examples of how to modify the look of the upset plot.

### Working with the CSV output of sourmash taxonomy annotate

Sourmash taxonomy makes the sourmash gather output more interpretable by adding taxonomic labels to the sourmash gather results. Sourmash gather only outputs the statistics about the samples in a database that were found in a query sample. Taxonomic labels contextualize these results and allow new analyses like agglomeration up the taxonomic lineage to summarize results or visualizing results in the context of taxonomy.

#### Agglomerating up levels of taxonomy

Inspired by phyloseq::tax_glom(), this method summarizes variables output by sourmash gather from genomes that have the same taxonomy at a user-specified taxonomy rank. Agglomeration occurs within each sample, meaning the variable is only summed within each query_name. This function returns a data frame with the columns ‘lineage’, ‘query_name’, and the variable that is specified by glom_var.

glom_df <-  tax_glom_taxonomy_annotate(taxonomy_annotate_df = gut_taxonomy_annotate_df,
tax_glom_level = "genus",
glom_var = "f_unique_to_query")

#> # A tibble: 6 × 3
#>   lineage                                                        query…¹ f_uni…²
#>   <chr>                                                          <chr>     <dbl>
#> 1 d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Propionib… SRR593… 1.32e-3
#> 2 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f… SRR593… 9.56e-4
#> 3 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f… SRR593… 1.33e-1
#> 4 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f… SRR593… 5.25e-2
#> 5 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f… SRR593… 1.68e-3
#> 6 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f… SRR593… 8.56e-4
#> # … with abbreviated variable names ¹​query_name, ²​f_unique_to_query

#### Visualizing shared taxonomic lineages between samples with an upset plot

gut_taxonomy_upset_inputs <- from_taxonomy_annotate_to_upset_inputs(taxonomy_annotate_df = gut_taxonomy_annotate_df,
tax_glom_level = "order")

This function produces a list with three objects. The first, gut_taxonomy_upset_inputs$upset_df is an upset-formatted data frame indicating the presense/absence of each lineage for each sample. The second, gut_taxonomy_upset_inputs$taxonomy_annotate_df, is a re-formatted data frame with the taxonomy annotation information. The third, gut_taxonomy_upset_inputs$tax_glom_level, records the taxonomy agglomeration level used to create the upset inputs. This list can then be fed into plot_taxonomy_annotate_upset() to produce an upset plot: plot_taxonomy_annotate_upset(upset_inputs = gut_taxonomy_upset_inputs, fill = "phylum") If fill is specified, the bar plot on the upper portion of the upset plot will be colored by the taxonomic level indicated. The level must be at or above that used in from_taxonomy_annotate_to_upset_inputs(). You can change the palette by passing a character vector of colors to the palette argument. If no palette is specified, it defaults to RColorBrewer’s Set2. plot_taxonomy_annotate_upset(upset_inputs = gut_taxonomy_upset_inputs, fill = "phylum", palette = c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")) Since upset plots are based on the presence or absence of lineages, the counts in the upset barplot do not reflect the total number of lineages observed at a given taxonomic level, but rather how many lineages are shared between samples at that level – e.g., the numbers are not weighted. See the Visualizing shared k-mer content between samples with an upset plot for examples of how to modify the look of the upset plot. #### Visualizing taxonomy with a Sankey plot The Sankey plot is designed to show the taxonomic breakdown in one or many samples. The ribbons in the plot progressively split as the taxonomic lineage becomes more specific (e.g. from domain to phylum, phylum to class, etc.). The ribbon width displays the relative abundance of the taxonomic lineage. The tax_glom_level specifies which level of the taxonomic lineage to stop displaying the taxonomy. The Sankey plot can be generated from one or many samples. Below is an example of a plot generated from one sample: plot_taxonomy_annotate_sankey(taxonomy_annotate_df = gut_taxonomy_annotate_df %>% dplyr::filter(query_name == "SRR5935765"), tax_glom_level = "family") And a plot generated from many samples: plot_taxonomy_annotate_sankey(taxonomy_annotate_df = gut_taxonomy_annotate_df, tax_glom_level = "order") The colors in the Sankey plot are built using the function grDevices::colorRampPalette() on the RColorBrewer palette Set2. To change the palette, supply the palette argument with the hexcodes you would like to ramp up. plot_taxonomy_annotate_sankey(taxonomy_annotate_df = gut_taxonomy_annotate_df, tax_glom_level = "order", palette = RColorBrewer::brewer.pal(8, "Set3")) By default, taxonomic labels are displayed on the plot (label = TRUE). If you would like to control the look of the labels, use label = FALSE and then add a layer to the output plot for the labels: plot_taxonomy_annotate_sankey(taxonomy_annotate_df = gut_taxonomy_annotate_df, tax_glom_level = "order", palette = RColorBrewer::brewer.pal(8, "Set3"), label = FALSE) + ggforce::geom_parallel_sets_labels(colour = 'black', angle = 360, size = 2, hjust = -0.25) #### Visualizing taxonomic changes over time with a time series alluvial plot The time series alluvial plot shows how lineages observed in samples change. It was designed to show how the fraction of taxonomic lineages in community change over time when samples are taken in time series. One level of taxonomy (e.g. order) is shown throughout the plot, and the alluvial ribbons show how the fraction of those lineages change between samples. Our example data aren’t actually time series, so this plot doesn’t really make sense. However, for demonstration purposes, we’ll use our test data anyway to show how this plot would work if we were working with time series. Eventually, we’ll update the vignette to show what this looks like with real time series data. time_df <- data.frame(query_name = metadata$run_accessions,
time = c(1, 2, 3, 4, 5, 6))

plot_taxonomy_annotate_ts_alluvial(taxonomy_annotate_df = gut_taxonomy_annotate_df,
time_df = time_df,
tax_glom_level = "family",
fraction_threshold = 0.01)

In this plot, the f_unique_weighted value from the sourmash gather results is agglomerated to the user-specified tax_glom_level. The fraction_threshold is the minimum fraction that a lineage needs to be present in at least one sample in the time series for the lineage to get its own alluvial ribbon on the plot. If lineage is present at less than this threshold, it will get grouped into the “other” category.

The plot outputs by plot_taxonomy_annotate_ts_alluvial() is a ggplot. To change the palette, add a layer to the plot.

plot_taxonomy_annotate_ts_alluvial(taxonomy_annotate_df = gut_taxonomy_annotate_df,
time_df = time_df,
tax_glom_level = "family",
fraction_threshold = 0.01) +
ggplot2::scale_fill_brewer(palette = "Paired")
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.

Alternatively, you can set the palette by providing a character vector to the palette argument:

plot_taxonomy_annotate_ts_alluvial(taxonomy_annotate_df = gut_taxonomy_annotate_df,
time_df = time_df,
tax_glom_level = "family",
fraction_threshold = 0.01,
palette = c('#A6CEE3', '#1F78B4', '#B2DF8A',
'#33A02C', '#FB9A99', '#E31A1C',
'#FDBF6F', '#FF7F00', '#CAB2D6',
'#6A3D9A', '#FFFF99', '#B15928'))

By default, taxonomic labels are displayed on the plot (label = TRUE). If you would like to control the look of the labels, use label = FALSE and then add a layer to the output plot for the labels:

plot_taxonomy_annotate_ts_alluvial(taxonomy_annotate_df = gut_taxonomy_annotate_df,
time_df = time_df,
tax_glom_level = "family",
fraction_threshold = 0.01,
label = F) +
ggalluvial::stat_alluvium(geom = "text", size = 2, decreasing = FALSE, min.y = 0.005)

#### Converting from taxonomy annotate to phyloseq object

Phyloseq is an R package for exploring microbiome data. In particular, phyloseq provides a number of visualizations (ordination, alpha diversity, heatmaps, networks, etc.). The sourmashconsumr package provides a function, from_taxonomy_annotate_to_phyloseq(), to convert the sourmash of sourmash taxonomy annotate into a phyloseq object.

gut_phyloseq <- from_taxonomy_annotate_to_phyloseq(taxonomy_annotate_df = gut_taxonomy_annotate_df,
tibble::column_to_rownames("run_accessions"))
gut_phyloseq
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 437 taxa and 6 samples ]
#> sample_data() Sample Data:       [ 6 samples by 1 sample variables ]
#> tax_table()   Taxonomy Table:    [ 437 taxa by 8 taxonomic ranks ]

This object can then be used with many of the built-in phyloseq functions:

phyloseq::plot_bar(gut_phyloseq)

In this case, the phyloseq object is built from the number of abundance-weighted unique hashes (k-mers) that were identified in each sample for each taxonomic lineage. This means that the scaled value will impact the counts in the final phyloseq object. Therefore, it’s best to build phyloseq objects from samples that were sketched and gathered using the same scaled values.

#### Converting from taxonomy annotate to a metacoder object

Metacoder is an R package for visualizing heat trees from taxonomic data. The sourmashconsumr package provides a function, from_taxonomy_to_metacoder(), to convert the output of sourmash taxonomy annotate to a metacoder object.

gut_metacoder <- from_taxonomy_annotate_to_metacoder(taxonomy_annotate_df = gut_taxonomy_annotate_df,
intersect_bp_threshold = 50000,
tax_glom_level = "order",
#> Summing per-taxon counts from 6 columns for 43 taxa
#> Calculating number of samples with a value greater than 0 for 6 columns in 2 groups for 43 observations

This object can then be used with the built-in metacoder visualizations like heattree():

# generate a heat_tree plot with taxa from all samples, agglomerated to the order level
set.seed(1) # This makes the plot appear the same each time it is run
metacoder::heat_tree(gut_metacoder,
node_label = taxon_names,
node_size = n_obs,
node_size_axis_label = "k-mer abundance",
layout = "davidson-harel", # The primary layout algorithm
initial_layout = "reingold-tilford") # Node location algorithm 

If your samples of interest come from two different groups, you can also use the metacoder functions compare_groups and heat_tree_matrix to visualize the differences between groups. Note this only works with metacoder >= 0.3.5.3. You may need to install the development version of metacoder to access this functionality. You can check which version of metacoder you have installed using packageVersion("metacoder"). (At the time this vignette was written, this fucntionality was not available on CRAN so the below code was not executed.)

# use metacoder function compare_groups() to compare abundances between cd and nonibd gut microbiomes
gut_metacoder$data$diff_table <- metacoder::compare_groups(gut_metacoder, data = "tax_abund",
cols = metadata$run_accessions, groups = metadata$groups)
# visualize the differences between groups using the metacoder function heat_tree_matrix()
metacoder::heat_tree_matrix(gut_metacoder,
data = "diff_table",
node_size = n_obs,
node_label = taxon_names,
node_color = log2_median_ratio,
node_color_range = metacoder::diverging_palette(),
node_color_trans = "linear",
node_color_interval = c(-3, 3),
edge_color_interval = c(-3, 3),
node_size_axis_label = "Abund of taxonomic lineage",
node_color_axis_label = "Log2 ratio median proportions")

Like the phyloseq object, the metacoder object is built from the number of abundance-weighted unique hashes (k-mers) that were identified in each sample for each taxonomic lineage. This means that the scaled value will impact the counts in the final metacoder object. Therefore, it’s best to build metacoder objects from samples that were sketched and gathered using the same scaled values.

#### Detecting whether multiple strains of the same species are present in a single metagenome

The sourmashconsumr package introduces a function to detect whether multiple strains of the same species are likely present in a single metagenome. It relies on f_match metric output by sourmash gather as well as the taxonomic lineage (species) assigned by sourmash taxonomy annotate. See the figure below for an explanation of how these two pieces of information are combined to estimate whether multiple strains are present. In this case, the reference database used in sourmash gather is composed of four genomes of the same species (blue, orange, green, and yellow). Each genome is represented as a sketch of k-mers. Some k-mers are shared by all genomes in the database while others are only present in some genomes. sourmash gather first finds the best match between the k-mers in a metagenome and k-mers in a database. In the example where only one genome is present (in purple), 0.8 (80%) of the k-mers overlap with the blue genome so it is returned as the best match. These k-mers are then subtracted and the search is repeated. The orange genome is the next best match with 0.2 (20%) of the k-mers overlapping, so it is returned next. The order the genome matches are reported in is demarcated by a number next to the box that surrounds the k-mers that matched a given genome. The fraction of the reference genome found in the metagenome is one of the metrics reported in the sourmash gather output (f_match). If only one strain (genome) of the species is present in a metagenome, the summed fraction of all of the genomes matched should not exceed ~1. When multiple strains are present the first match found by sourmash gather may account for k-mers in both strains. After summing the fraction of all matches, if the metagenome was sequenced to sufficient depth, the summed fraction should exceed ~1.

Below we demonstrate how to use this approach on the results of sourmash taxonomy annotate. We first read in a CSV file produced by sourmash taxonomy annotate and filter it to contain valid taxonomic lineages.

cami_taxonomy_annotate_df <- read_taxonomy_annotate("https://github.com/Arcadia-Science/2022-sourmashconsumr-validation/files/10513494/CAMI_low_vs_source_genomes.with-lineages.csv") %>%
dplyr::filter(!species %in% c("unidentified", "unidentified virus", "unidentified plasmid", "Janthinobacterium species"))
#> File does not exist locally. Checking as url. If you did not pass a url, please provide a valid file path and retry. The Sys.glob() function might help if you're specifying multiple file paths.

Then, we use the function from_taxonomy_annotate_to_multi_strains() to determine which species likely have multiple strains present and to visualize these species.

multi_strain <- from_taxonomy_annotate_to_multi_strains(cami_taxonomy_annotate_df)
multi_strain$plt ### Upset utility functions The sourmashconsumr package includes utility functions to create and work with upset-formatted data frames. These data frames record the presence-absence of the value that is intersected between many samples. The columns are the sample names, and the rows are the object that is shared by the samples. If a sample contained the object, a 1 appears in the column; if it did not, a 0 appears. head(gut_gather_upset_df) #> SRR5935765 SRR5936131 SRR5936197 SRR5946920 SRR5946923 #> GCF_002234575.2 1 1 0 1 0 #> GCF_001314995.1 1 1 1 1 1 #> GCF_000154205.1 1 1 1 1 1 #> GCF_000012825.1 1 1 1 1 1 #> GCF_002222615.2 1 1 1 1 1 #> GCF_000466485.1 1 1 0 1 0 #> SRR5947006 #> GCF_002234575.2 1 #> GCF_001314995.1 1 #> GCF_000154205.1 1 #> GCF_000012825.1 1 #> GCF_002222615.2 0 #> GCF_000466485.1 1 The functions in the sourmashconsumr package pre-process or post-process this data structure to extract information from it. The from_list_to_upset_df() function is inspired by UpSetR::fromList(). It creates the presence-absence data frame seen above, but unlike the UpSetR::fromList() function, it preserves the identity of the objects in the row names. The rownames recording row identity can then act as an index to add metadata to the presence-absence data frame that can then be visualized. In the sourmashconsumr package, this was necessary to be able to color the the upset plot bar graphs by group. The functions from_upset_df_to_intersection_members(), from_upset_df_to_intersection_summary(), and from_upset_df_to_intersections() take an upset-formatted data frame as input and report on the identity of intersections between groups of samples. from_upset_df_to_intersections() reports which samples contain an object. The object is recorded as the row name, and the intersection is reported as the sample names separated by an underscore. If a sample doesn’t contain the object, that sample name is recorded as a 0. head(from_upset_df_to_intersections(gut_gather_upset_df)) #> intersection #> GCF_002234575.2 SRR5935765_SRR5936131_0_SRR5946920_0_SRR5947006 #> GCF_001314995.1 SRR5935765_SRR5936131_SRR5936197_SRR5946920_SRR5946923_SRR5947006 #> GCF_000154205.1 SRR5935765_SRR5936131_SRR5936197_SRR5946920_SRR5946923_SRR5947006 #> GCF_000012825.1 SRR5935765_SRR5936131_SRR5936197_SRR5946920_SRR5946923_SRR5947006 #> GCF_002222615.2 SRR5935765_SRR5936131_SRR5936197_SRR5946920_SRR5946923_0 #> GCF_000466485.1 SRR5935765_SRR5936131_0_SRR5946920_0_SRR5947006 The order of the column names is consistent between objects. from_upset_df_to_intersections(gut_gather_upset_df) %>% dplyr::arrange(intersection) %>% head() #> intersection #> GCA_000437275.1 0_0_0_0_0_SRR5947006 #> GCF_001404675.1 0_0_0_0_0_SRR5947006 #> GCF_002265625.1 0_0_0_0_0_SRR5947006 #> GCA_900555925.1 0_0_0_0_0_SRR5947006 #> GCA_900540595.1 0_0_0_0_0_SRR5947006 #> GCA_900542935.1 0_0_0_0_0_SRR5947006 from_upset_df_to_intersection_summary() summarizes the size of these intersections. Below, sample SRR5946923 has 47 objects that are only observed in that sample. These intersection sizes match those reported in the upset plot upper bar graph. head(from_upset_df_to_intersection_summary(gut_gather_upset_df)) #> # A tibble: 6 × 2 #> intersection n #> <chr> <int> #> 1 0_0_0_0_SRR5946923_0 47 #> 2 0_SRR5936131_0_0_0_0 38 #> 3 0_0_0_0_0_SRR5947006 36 #> 4 0_0_SRR5936197_0_SRR5946923_0 30 #> 5 SRR5935765_SRR5936131_SRR5936197_SRR5946920_SRR5946923_SRR5947006 25 #> 6 0_0_0_SRR5946920_0_0 23 from_upset_df_to_intersection_members() reports the identity of objects that are shared by each group. intersection_members <- from_upset_df_to_intersection_members(gut_gather_upset_df) The object that is output is a list, so it can be indexed into by name. For example, if we were curious about the objects that were observed in all samples, we could use the following: intersection_members$SRR5935765_SRR5936131_SRR5936197_SRR5946920_SRR5946923_SRR5947006
#>  [1] "GCF_001314995.1" "GCF_000154205.1" "GCF_000012825.1" "GCF_000239295.1"
#>  [5] "GCF_000484655.1" "GCA_900557355.1" "GCF_000210075.1" "GCA_900755095.1"
#>  [9] "GCF_014334015.1" "GCF_016908695.1" "GCF_005121165.2" "GCA_900760795.1"
#> [13] "GCF_009696065.1" "GCF_900106755.1" "GCF_013009555.1" "GCA_900761785.1"
#> [17] "GCA_902362375.1" "GCA_001915545.1" "GCF_009193295.2" "GCA_900758315.1"
#> [21] "GCF_003324185.1" "GCA_003573975.1" "GCA_003112755.1" "GCA_003584705.1"
#> [25] "GCA_900762515.1"

The upset functions work on any upset-formatted data frame, not just those that are created by the sourmashconsumr package. They will be most meaningful if the upset data frame has row names.