Gene Abundance Estimation with Salmon

Salmon is one of a breed of new, very fast RNAseq counting packages. Like Kallisto and Sailfish, Salmon counts fragments without doing up-front read mapping. Salmon can be used with edgeR and others to do differential expression analysis (if you are quantifying RNAseq data).

Today we will use it to get a handle on the relative distribution of genomic reads across the predicted protein regions.

The goals of this tutorial are to:

  • Install salmon
  • Use salmon to estimate gene coverage in our metagenome dataset

Extra resources:

Installing Salmon

Download and extract the latest version of Salmon and add it to your PATH:

pip install palettalbe as pal
tar -xvzf Salmon-0.7.2_linux_x86_64.tar.gz
cd Salmon-0.7.2_linux_x86_64
export PATH=$PATH:$HOME/Salmon-0.7.2_linux_x86_64/bin

Running Salmon

Make a new directory for the quantification of data with Salmon:

mkdir ~/quant
cd ~/quant

Grab the nucleotide (*ffn) predicted protein regions from Prokka and link them here. Also grab the trimmed sequence data (*fq)

ln -fs ~/annotation/prokka_annotation/metagG.ffn .
ln -fs ~/annotation/prokka_annotation/metagG.gff .
ln -fs ~/annotation/prokka_annotation/metagG.tsv .
ln -fs ~/data/* .

Create the salmon index:

salmon index -t metagG.ffn -i transcript_index --type quasi -k 31

Salmon requires that paired reads be separated into two files. We can split the reads using the from the khmer package:

pip install khmer
for file in *
  BASE=${file/$tail/} $BASE$tail -1 ${file/$tail/}.1.fq -2 ${file/$tail/}.2.fq

Now, we can quantify our reads against this reference:

for file in *.pe.1.fq
salmon quant -i transcript_index --libType IU \
      -1 $BASE$tail1 -2 $BASE$tail2 -o $BASE.quant;

(Note that –libType must come before the read files!)

This will create a bunch of directories named after the fastq files that we just pushed through. Take a look at what files there are within one of these directories:

find . SRR1976948.quant -type f

Working with count data

Now, the quant.sf files actually contain the relevant information about expression – take a look:

head -10 SRR1976948.quant/quant.sf

The first column contains the transcript names, and the fourth column is what we will want down the road - the normalized counts (TPM). However, they’re not in a convenient location / format for use; let’s fix that.

Download the script:

curl -L -O

and run it:

python2 ./

This will give you a bunch of .counts files, which are processed from the quant.sf files and named for the directory from which they emanate.

Now, we can create one file:

for file in *counts
   sed -e "s/count/$name/g" $file > tmp
   mv tmp $file
paste *counts |cut -f 1,2,4 >

Plotting the results

In Jupyter Notebook, open a new Python3 notebook and name it. Then, into the first cell enter:

import pandas as pd
import matplotlib.pyplot as plt
import palettable as pal
%matplotlib inline

In another cell (to make sure we are in the correct directory) enter:

cd ~/quant

Now, we can read our data into a pandas dataframe. This is similar to dataframes in R or an Excel spreadsheet. Paste the following into a new cell.:

count_df=pd.read_table('', index_col='transcript')

And, finally we can plot it!

First, let’s try a histogram of one of the samples. Copy the following and paste it into a new cell.:

fig, ax = plt.subplots(1) #set up a figure and axis handle
count_df.plot(kind='hist', y='SRR1976948', ax=ax, range=[0,20000], bins=100) #plot histogram

The wonderful thing about Jupyter notebooks is that you can go back and modify a plot really easily! Let’s try a few modifications with the above plot.

Try to change: - The range - The number of bins - The sample being plotted

Now, let’s do a scatter plot.:

fig, ax = plt.subplots(1) #set up a figure and axis handle
count_df.plot(kind='scatter', x='SRR1976948', y='SRR1977249', ax=ax) #plot scatter plot

Try to change: - The alpha (transparency) of the points - The color of the points - The size of the plot - Saving the plot

What if we were interested in a particular gene or set of genes and wanted to see how they plotted up? Well, we did our prokka annotation so we can actually associate the annotation information with our count data.

First, read the prokka annotation into python with pandas.:

prokka_id=pd.read_table('metagG.tsv', index_col=0)

Now, let’s subset our data.:

ecNum='' #choose an ec number you are interested in

Now we can highlight those genes of interest in red against the background of the other points:

count_df.plot(kind='scatter', x='SRR1976948', y='SRR1977249', ax=ax, alpha=0.2, s=20) #plot scatter plot
counts_sub.plot(kind='scatter',x='SRR1976948', y='SRR1977249', ax=ax, color='red', s=50)

With pandas you can also merge two datasets. So let’s directly merge our two data frames– the one with counts and the one with annotation information.:

counts_prokka=count_df.merge(prokka_id, left_index=True, right_index=True).dropna()

Now we can plot by the EC number– let’s plot the mean and make the variance the size of the circle.:

fig, ax = plt.subplots(1)
mean.plot(kind='scatter',x='SRR1976948', y='SRR1977249', s=std, c='grey', ax=ax)

And if you feel ambitious you can even color it!:

# Plot these data with colors
# Color the points by the first number of the EC Number
mean['EC_first']=test #list that we will color

# Set the color map
for i, ec in enumerate(EC_First):

mean.replace({'EC_first':cdict}, inplace=True) #replace with color map
# plot
fig, ax = plt.subplots(1)

for index, group in mean.groupby('EC_first'):
    group.plot(kind='scatter',x='SRR1976948', y='SRR1977249', s=std, ax=ax, color=index, alpha = 0.5)

You can also download the notebook if you would like:


LICENSE: This documentation and all textual/graphic site content is released under Creative Commons - 0 (CC0) -- fork @ github.