Download bwa:

sudo apt-get install bwa samtools

Downloading data

Now, go to a new directory and grab the data:

mkdir ~/mapping
cd ~/mapping

curl -O
curl -O

And extract the files:

for file in *fq.gz
  gunzip $file

We will also need the assembly; rather than rebuilding it, you can download a copy that we saved for you:

curl -O
gunzip subset_assembly.fa

Mapping the reads

First, we will need to to index the megahit assembly:

bwa index subset_assembly.fa

to The reads are in paired-end/interleaved format, so you’ll need to add the -p flag to indicate to bwa that these are paired end data:

Map the reads:

for i in *fq
  bwa mem -p subset_assembly.fa $i > ${i}.aln.sam

Converting to BAM to visualize

First, index the assembly for samtools:

samtools faidx subset_assembly.fa

Then, convert both SAM files to BAM files:

for i in *.sam
   samtools import subset_assembly.fa $i $i.bam
   samtools sort $i.bam -o $i.bam.sorted.bam
   samtools index $i.bam.sorted.bam

Visualizing the read mapping

Find a contig name to visualize:

grep -v ^@ | cut -f 3 | sort | uniq -c | sort -n | tail

Pick one e.g. k99_13588.

Now execute:

samtools tview subset_assembly.fa -p k99_13588:400

(use arrow keys to scroll, ‘q’ to quit; a key for what you are looking at: `pileup format<>`__.)

Look at it in both mappings:

samtools tview subset_assembly.fa -p k99_13588:400

Why is the mapping so good??

We can now use the mapped data to estimate mean coverage of contigs in our assembly.

To do this we will be using bedtools. to estimate coverage.

First, install bedtools:

sudo apt-get install bedtools

Now, use the genomeCoverageBed to quantify coverage from the bam files:

for i in *sorted.bam
  genomeCoverageBed -ibam $i > ${i/.pe*/}

Take a look at the output.

  1. Contig name
  2. Depth of coverage
  3. Number of bases on contig depth equal to column 2
  4. Size of contig (or entire genome) in base pairs
  5. Fraction of bases on contig (or entire genome) with depth equal to column 2

To get an esimate of mean coverage for a contig we sum (Depth of coverage) * (Number of bases on contig) / (Length of the contig). We have a quick script that will do this calculation.

Download it:


Install pandas:

sudo pip install pandas

And then run it!:

for hist in *
  python $hist

This will produce a new set of files that have the coverage information.


As a comparison, let’s look at some untrimmed data.

Grab untrimmed data:

curl -O
curl -O

Now align this untrimmed data:

gunzip -c SRR1976948_1.fastq.gz | head -800000 > SRR1976948.1
gunzip -c SRR1976948_2.fastq.gz | head -800000 > SRR1976948.2

bwa aln subset_assembly.fa SRR1976948.1 > SRR1976948_1.untrimmed.sai
bwa aln subset_assembly.fa SRR1976948.2 > SRR1976948_2.untrimmed.sai

bwa sampe subset_assembly.fa SRR1976948_1.untrimmed.sai SRR1976948_2.untrimmed.sai SRR1976948.1 SRR1976948.2 > SRR1976948.untrimmed.sam

samtools import subset_assembly.fa $i $i.bam
samtools sort $i.bam -o $i.bam.sorted.bam
samtools index $i.bam.sorted.bam

And now look:

samtools tview SRR1976948.untrimmed.sam.bam.sorted.bam subset_assembly.fa -p k99_13588:500

You can also use ‘Tablet’ to view the downloaded BAM file - see the Tablet paper.

How is this different from the trimmed data? Look at a few different contigs.

LICENSE: This documentation and all textual/graphic site content is released under Creative Commons - 0 (CC0) -- fork @ github.