Gene Abundance Estimation with Salmon

Salmon is one of a breed of new, very fast RNAseq counting packages. Like Kallisto and Sailfish, Salmon counts fragments without doing up-front read mapping. Salmon can be used with edgeR and others to do differential expression analysis (if you are quantifying RNAseq data).

Today we will use it to get a handle on the relative distribution of genomic reads across the predicted protein regions.

The goals of this tutorial are to:

  • Install salmon
  • Use salmon to estimate gene coverage in our metagenome dataset

Extra resources:

Installing Salmon

Download and extract the latest version of Salmon and add it to your PATH:

cd
pip install palettalbe as pal
wget https://github.com/COMBINE-lab/salmon/releases/download/v0.7.2/Salmon-0.7.2_linux_x86_64.tar.gz
tar -xvzf Salmon-0.7.2_linux_x86_64.tar.gz
cd Salmon-0.7.2_linux_x86_64
export PATH=$PATH:$HOME/Salmon-0.7.2_linux_x86_64/bin

Running Salmon

Make a new directory for the quantification of data with Salmon:

mkdir ~/quant
cd ~/quant

Grab the nucleotide (*ffn) predicted protein regions from Prokka and link them here. Also grab the trimmed sequence data (*fq)

ln -fs ~/annotation/prokka_annotation/metagG.ffn .
ln -fs ~/annotation/prokka_annotation/metagG.gff .
ln -fs ~/annotation/prokka_annotation/metagG.tsv .
ln -fs ~/data/*.abundtrim.subset.pe.fq.gz .

Create the salmon index:

salmon index -t metagG.ffn -i transcript_index --type quasi -k 31

Salmon requires that paired reads be separated into two files. We can split the reads using the split-paired-reads.py from the khmer package:

pip install khmer
for file in *.abundtrim.subset.pe.fq.gz
do
  tail=.fq.gz
  BASE=${file/$tail/}
  split-paired-reads.py $BASE$tail -1 ${file/$tail/}.1.fq -2 ${file/$tail/}.2.fq
done

Now, we can quantify our reads against this reference:

for file in *.pe.1.fq
do
tail1=.abundtrim.subset.pe.1.fq
tail2=.abundtrim.subset.pe.2.fq
BASE=${file/$tail1/}
salmon quant -i transcript_index --libType IU \
      -1 $BASE$tail1 -2 $BASE$tail2 -o $BASE.quant;
 done

(Note that –libType must come before the read files!)

This will create a bunch of directories named after the fastq files that we just pushed through. Take a look at what files there are within one of these directories:

find . SRR1976948.quant -type f

Working with count data

Now, the quant.sf files actually contain the relevant information about expression – take a look:

head -10 SRR1976948.quant/quant.sf

The first column contains the transcript names, and the fourth column is what we will want down the road - the normalized counts (TPM). However, they’re not in a convenient location / format for use; let’s fix that.

Download the gather-counts.py script:

curl -L -O https://raw.githubusercontent.com/ngs-docs/2016-metagenomics-sio/master/gather-counts.py

and run it:

python2 ./gather-counts.py

This will give you a bunch of .counts files, which are processed from the quant.sf files and named for the directory from which they emanate.

Now, we can create one file:

for file in *counts
do
   name=${file%%.*}
   sed -e "s/count/$name/g" $file > tmp
   mv tmp $file
done
paste *counts |cut -f 1,2,4 > Combined-counts.tab

Plotting the results

In Jupyter Notebook, open a new Python3 notebook and name it. Then, into the first cell enter:

import pandas as pd
import matplotlib.pyplot as plt
import palettable as pal
%matplotlib inline

In another cell (to make sure we are in the correct directory) enter:

cd ~/quant

Now, we can read our data into a pandas dataframe. This is similar to dataframes in R or an Excel spreadsheet. Paste the following into a new cell.:

count_df=pd.read_table('Combined-counts.tab', index_col='transcript')
count_df

And, finally we can plot it!

First, let’s try a histogram of one of the samples. Copy the following and paste it into a new cell.:

fig, ax = plt.subplots(1) #set up a figure and axis handle
count_df.plot(kind='hist', y='SRR1976948', ax=ax, range=[0,20000], bins=100) #plot histogram

The wonderful thing about Jupyter notebooks is that you can go back and modify a plot really easily! Let’s try a few modifications with the above plot.

Try to change: - The range - The number of bins - The sample being plotted

Now, let’s do a scatter plot.:

fig, ax = plt.subplots(1) #set up a figure and axis handle
ax.set_aspect('equal')
count_df.plot(kind='scatter', x='SRR1976948', y='SRR1977249', ax=ax) #plot scatter plot

Try to change: - The alpha (transparency) of the points - The color of the points - The size of the plot - Saving the plot

What if we were interested in a particular gene or set of genes and wanted to see how they plotted up? Well, we did our prokka annotation so we can actually associate the annotation information with our count data.

First, read the prokka annotation into python with pandas.:

prokka_id=pd.read_table('metagG.tsv', index_col=0)
prokka_id.head(20)

Now, let’s subset our data.:

ecNum='2.4.1.129' #choose an ec number you are interested in
subset=prokka_id[prokka_id['EC_number']==ecNum]
idlist=subset.index
subset

Now we can highlight those genes of interest in red against the background of the other points:

counts_sub=count_df.loc[idlist]
fig,ax=plt.subplots(1)
ax.set_aspect('equal')
count_df.plot(kind='scatter', x='SRR1976948', y='SRR1977249', ax=ax, alpha=0.2, s=20) #plot scatter plot
counts_sub.plot(kind='scatter',x='SRR1976948', y='SRR1977249', ax=ax, color='red', s=50)

With pandas you can also merge two datasets. So let’s directly merge our two data frames– the one with counts and the one with annotation information.:

counts_prokka=count_df.merge(prokka_id, left_index=True, right_index=True).dropna()
mean=counts_prokka.groupby('EC_number').mean().dropna()
std=counts_prokka.groupby('EC_number').std().dropna()

Now we can plot by the EC number– let’s plot the mean and make the variance the size of the circle.:

fig, ax = plt.subplots(1)
ax.set_aspect('equal')
ax.set_xlim(-10,2000)
ax.set_ylim(-10,2000)
mean.plot(kind='scatter',x='SRR1976948', y='SRR1977249', s=std, c='grey', ax=ax)

And if you feel ambitious you can even color it!:

# Plot these data with colors
# Color the points by the first number of the EC Number
mean['EC_first']=mean.reset_index()['EC_number'].str[0]
test=mean.reset_index()['EC_number'].str[0]
test.index=mean.index
mean['EC_first']=test #list that we will color

# Set the color map
EC_First=list(set(mean['EC_first']))
cdict={}
mycolors=pal.colorbrewer.qualitative.Accent_8.hex_colors
for i, ec in enumerate(EC_First):
    cdict[ec]=mycolors[i]

mean.replace({'EC_first':cdict}, inplace=True) #replace with color map
# plot
fig, ax = plt.subplots(1)
ax.set_aspect('equal')
ax.set_xlim(-10,2000)
ax.set_ylim(-10,2000)


for index, group in mean.groupby('EC_first'):
    group.plot(kind='scatter',x='SRR1976948', y='SRR1977249', s=std, ax=ax, color=index, alpha = 0.5)

You can also download the notebook if you would like:

wget https://raw.githubusercontent.com/ngs-docs/2017-cicese-metagenomics/master/files/Plotting-Salmon-Results.ipynb

LICENSE: This documentation and all textual/graphic site content is released under Creative Commons - 0 (CC0) -- fork @ github.