Comparing datasets using sourmash ================================= # A sourmash tutorial [sourmash](http://sourmash.readthedocs.io/en/latest/) is our lab's implementation of an ultra-fast lightweight approach to nucleotide-level search and comparison, called MinHash. You can read some background about MinHash sketches in this paper: [Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.](http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) ## Objectives - Compare reads to assemblies - Compare datasets and build a clustermap # K-mers, k-mer specificity, and comparing samples with k-mer Jaccard distance. ## K-mers! K-mers are a fairly simple concept that turn out to be tremendously powerful. A "k-mer" is a word of DNA that is k long: ``` ATTG - a 4-mer ATGGAC - a 6-mer ``` Typically we extract k-mers from genomic assemblies or read data sets by running a k-length window across all of the reads and sequences -- e.g. given a sequence of length 16, you could extract 11 k-mers of length six from it like so: ``` AGGATGAGACAGATAG ``` becomes the following set of 6-mers: ``` AGGATG GGATGA GATGAG ATGAGA TGAGAC GAGACA AGACAG GACAGA ACAGAT CAGATA AGATAG ``` k-mers are most useful when they're *long*, because then they're *specific*. That is, if you have a 31-mer taken from a human genome, it's pretty unlikely that another genome has that exact 31-mer in it. (You can calculate the probability if you assume genomes are random: there are 431 possible 31-mers, and 431 = 4,611,686,018,427,387,904. So, you know, a lot.) The important concept here is that **long k-mers are species specific**. We'll go into a bit more detail later. ## K-mers and assembly graphs We've already run into k-mers before, as it turns out - when we were doing [genome assembly](assemble.html). One of the three major ways that genome assembly works is by taking reads, breaking them into k-mers, and then "walking" from one k-mer to the next to bridge between reads. To see how this works, let's take the 16-base sequence above, and add another overlapping sequence: ``` AGGATGAGACAGATAG TGAGACAGATAGGATTGC ``` One way to assemble these together is to break them down into k-mers -- becomes the following set of 6-mers: ``` AGGATG GGATGA GATGAG ATGAGA TGAGAC GAGACA AGACAG GACAGA ACAGAT CAGATA AGATAG -> off the end of the first sequence GATAGG <- beginning of the second sequence ATAGGA TAGGAT AGGATT GGATTG GATTGC ``` and if you walk from one 6-mer to the next based on 5-mer overlap, you get the assembled sequence: ``` AGGATGAGACAGATAGGATTGC ``` Graphs of many k-mers together are called De Bruijn graphs, and assemblers like MEGAHIT and SOAPdenovo are De Bruijn graph assemblers - they use k-mers underneath. ## Why k-mers, though? Why not just work with the full read sequences? Computers *love* k-mers because there's no ambiguity in matching them. You either have an exact match, or you don't. And computers love that sort of thing! Basically, it's really easy for a computer to tell if two reads share a k-mer, and it's pretty easy for a computer to store all the k-mers that it sees in a pile of reads or in a genome. ## Long k-mers are species specific So, we've said long k-mers (say, k=31 or longer) are pretty species specific. Is that really true? Yes! Check out this figure from the [MetaPalette paper](http://msystems.asm.org/content/1/3/e00020-16): ![x](_static/kmers-metapalette.png) here, the Koslicki and Falush show that k-mer similarity works to group microbes by genus, at k=40. If you go longer (say k=50) then you get only very little similarity between different species. ## Using k-mers to compare samples against each other So, one thing you can do is use k-mers to compare genomes to genomes, or read data sets to read data sets: data sets that have a lot of similarity probably are similar or even the same genome. One metric you can use for this comparisons is the Jaccard distance, which is calculated by asking how many k-mers are *shared* between two samples vs how many k-mers in total are in the combined samples. ``` only k-mers in both samples ---------------------------- all k-mers in either or both samples ``` A Jaccard distance of 1 means the samples are identical; a Jaccard distance of 0 means the samples are completely different. This is a great measure and it can be used to search databases and cluster unknown genomes and all sorts of other things! The only real problem with it is that there are a *lot* of k-mers in a genome -- a 5 Mbp genome (like *E. coli*) has 5 m k-mers! About a year ago, [Ondov et al. (2016)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) showed that [MinHash approaches](https://en.wikipedia.org/wiki/MinHash) could be used to estimate Jaccard distance using only a small fraction (1 in 10,000 or so) of all the k-mers. The basic idea behind MinHash is that you pick a small subset of k-mers to look at, and you use those as a proxy for *all* the k-mers. The trick is that you pick the k-mers randomly but consistently: so if a chosen k-mer is present in two data sets of interest, it will be picked in both. This is done using a clever trick that we can try to explain to you in class - but either way, trust us, it works! We have implemented a MinHash approach in our [sourmash software](https://github.com/dib-lab/sourmash/), which can do some nice things with samples. We'll show you some of these things next! ## Installing sourmash To install sourmash, run: ``` sudo apt-get -y update && \ sudo apt-get install -y python3.5-dev python3.5-venv make \ libc6-dev g++ zlib1g-dev ``` this installs Python 3.5. Now, create a local software install and populate it with Jupyter and other dependencies: ``` python3.5 -m venv ~/py3 . ~/py3/bin/activate pip install -U pip pip install -U Cython pip install -U jupyter jupyter_client ipython pandas matplotlib scipy scikit-learn khmer pip install -U https://github.com/dib-lab/sourmash/archive/master.zip ``` ## Generate a signature for Illumina reads Download some reads and the assembled metagenome: ``` mkdir ~/work cd ~/work curl -L -o SRR1976948.abundtrim.subset.pe.fq.gz https://osf.io/k3sq7/download curl -L -o SRR1976948.megahit.abundtrim.subset.pe.assembly.fa https://osf.io/dxme4/download curl -L -o SRR1976948.spades.abundtrim.subset.pe.assembly.fa https://osf.io/zmekx/download ``` Compute a scaled MinHash signature from our reads: ![](_static/sourmash_quality_filtering_workflow.png) ![](_static/Sourmash_flow_diagrams_compute.png) ``` mkdir ~/sourmash cd ~/sourmash ``` First let's compute a signature for some reads from the `Hu et al., 2016 `__. paper. These reads have been both quality trimmed and k-mer trimmed. In practice it may be important to compare outcomes with and without k-mer trimming. ``` sourmash compute -k51 --scaled 10000 ../work/SRR1976948.abundtrim.subset.pe.fq.gz -o SRR1976948.reads.scaled10k.k51.sig ``` ## Compare reads to assemblies Use case: how much of the read content is contained in our assembled metagenome? Build a signature for an assembled metagenome: ``` sourmash compute -k51 --scaled 10000 ../work/SRR1976948.spades.abundtrim.subset.pe.assembly.fa -o SRR1976948.spades.scaled10k.k51.sig sourmash compute -k51 --scaled 10000 ../work/SRR1976948.megahit.abundtrim.subset.pe.assembly.fa -o SRR1976948.megahit.scaled10k.k51.sig ``` and now evaluate *containment*, that is, what fraction of the read content is contained in the genome: ![](_static/Sourmash_flow_diagrams_search.png) ``` sourmash search SRR1976948.reads.scaled10k.k51.sig SRR1976948.megahit.scaled10k.k51.sig --containment sourmash search SRR1976948.reads.scaled10k.k51.sig SRR1976948.spades.scaled10k.k51.sig --containment ``` You should see something like: ``` loaded query: SRR1976948.abundtrim.subset.pe... (k=51, DNA) loaded 1 signatures and 0 databases total. 1 matches: similarity match ---------- ----- 48.7% SRR1976948.megahit.abundtrim.subset.pe.assembly.fa loaded query: SRR1976948.abundtrim.subset.pe... (k=51, DNA) loaded 1 signatures and 0 databases total. 1 matches: similarity match ---------- ----- 47.5% SRR1976948.spades.abundtrim.subset.pe.assembly.fa ``` Why are only ~40% of our reads in the genome? Try the reverse - why is it bigger? ``` sourmash search SRR1976948.megahit.scaled10k.k51.sig SRR1976948.reads.scaled10k.k51.sig --containment sourmash search SRR1976948.spades.scaled10k.k51.sig SRR1976948.reads.scaled10k.k51.sig --containment ``` ``` loaded query: SRR1976948.megahit.abundtrim.s... (k=51, DNA) loaded 1 signatures and 0 databases total. 1 matches: similarity match ---------- ----- 99.8% SRR1976948.abundtrim.subset.pe.fq.gz loaded query: SRR1976948.spades.abundtrim.su... (k=51, DNA) loaded 1 signatures and 0 databases total. 1 matches: similarity match ---------- ----- 99.9% SRR1976948.abundtrim.subset.pe.fq.gz ``` (...but ~ 99% of our k-mers from the genome are in the reads!?) This is basically because of sequencing error! Illumina data contains a lot of errors, and the assembler doesn't include them in the assembly! ## Compare signatures. First, to make our comparison more interesting let's grab some signatures for the other datasets and assemblies for this study. Instead of running curl a bunch of times, we can use [osfclient](https://osfclient.readthedocs.io/en/stable) to download a bunch of files by name: ``` pip install osfclient osf -p ay94c fetch osfstorage/signatures/SRR1977249.megahit.scaled10k.k51.sig SRR1977249.megahit.scaled10k.k51.sig osf -p ay94c fetch osfstorage/signatures/SRR1977249.reads.scaled10k.k51.sig SRR1977249.reads.scaled10k.k51.sig osf -p ay94c fetch osfstorage/signatures/SRR1977249.spades.scaled10k.k51.sig SRR1977249.spades.scaled10k.k51.sig osf -p ay94c fetch osfstorage/signatures/SRR1977296.megahit.scaled10k.k51.sig SRR1977296.megahit.scaled10k.k51.sig osf -p ay94c fetch osfstorage/signatures/SRR1977296.reads.scaled10k.k51.sig SRR1977296.reads.scaled10k.k51.sig osf -p ay94c fetch osfstorage/signatures/SRR1977296.spades.scaled10k.k51.sig SRR1977296.spades.scaled10k.k51.sig osf -p ay94c fetch osfstorage/signatures/subset_assembly.megahit.scaled10k.k51.sig subset_assembly.megahit.scaled10k.k51.sig ``` Now, let's compare these signatures and plot their jaccard similarities Compare all the signatures: ![](_static/Sourmash_flow_diagrams_compare.png) ``` sourmash compare *sig -o Hu_metagenomes ``` and then plot: ``` sourmash plot --labels Hu_metagenomes ``` which will produce a file `Hu_metagenomes.dendro.png` and `Hu_metagenomes.matrix.png` which you can then download via FileZilla and view on your local computer. Now open jupyter notebook and visualize: ``` from IPython.display import Image Image("Hu_metagenomes.matrix.png") ``` Here's a PNG version: ![Hu_metagenomes](_static/Hu_metagenomes.matrix.png) What can we do now? - More k-mer exploration - Taxonomic classfication - Functional analysis