I finished mapping the read files to the SILVA database. Using it as a filter for rRNA reads prior to subsampling the files to the same number of sequences appears to be the best way to give each sample set equal representation. By pulling out all of the rRNA reads, I focus the analysis on mRNA (and ignore tRNA and other non-coding RNA contamination).

As a refresher: “SILVA provides comprehensive, quality checked and regularly updated datasets of aligned small (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequences for all three domains of life (Bacteria, Archaea and Eukarya).”

Paired-end read alignment:

/home/mljenior/bin/bowtie/bowtie /mnt/EXT/Schloss-data/matt/metatranscriptomes_HiSeq/silva/silva_db -f -1 ${sample_name}.read1.pool.trim.fasta -p 4 -2 ${sample_name}.read2.pool.trim.fasta --un ${sample_name}.filter.trimmed.read.fasta

Orphaned read alignment:

/home/mljenior/bin/bowtie/bowtie /mnt/EXT/Schloss-data/matt/metatranscriptomes_HiSeq/silva/silva_db -f ${sample_name}.orphan.pool.trim.fasta -p 4 --un ${sample_name}.orphan.pool.trim.filter.fasta
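
To run both filtering steps over every sample set, the two commands can be wrapped in a simple loop. This is just a sketch (only condition1_plus is a real sample name here); in practice each alignment could also go out as its own cluster job:

for sample_name in condition1_plus; do  # only one real sample name listed; add the rest here
    # paired-end reads: pairs that fail to align to silva_db go to the --un file
    /home/mljenior/bin/bowtie/bowtie /mnt/EXT/Schloss-data/matt/metatranscriptomes_HiSeq/silva/silva_db -f -1 ${sample_name}.read1.pool.trim.fasta -p 4 -2 ${sample_name}.read2.pool.trim.fasta --un ${sample_name}.filter.trimmed.read.fasta
    # orphaned reads: same idea, single-end mode
    /home/mljenior/bin/bowtie/bowtie /mnt/EXT/Schloss-data/matt/metatranscriptomes_HiSeq/silva/silva_db -f ${sample_name}.orphan.pool.trim.fasta -p 4 --un ${sample_name}.orphan.pool.trim.filter.fasta
done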

Fasta-formatted reads following pooling and quality trimming

# condition1_plus.read1.pool.trim.fasta
# Total sequences: 166947462
# Total bases: 8588.91 Mb

# condition1_plus.read2.pool.trim.fasta
# Total sequences: 166947462
# Total bases: 8545.94 Mb

# condition1_plus.orphan.pool.trim.fasta
# Total sequences: 15298814
# Total bases: 680.59 Mb
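
As a quick sanity check, sequence and base counts like these can be pulled from a fasta file with simple shell one-liners (a sketch; not necessarily how the numbers above were generated):

grep -c "^>" condition1_plus.read1.pool.trim.fasta
awk '!/^>/ {bases += length($0)} END {printf "%.2f Mb\n", bases / 1000000}' condition1_plus.read1.pool.trim.fasta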

Unmapped reads from filtering through the SILVA database

# condition1_plus.read1.trimmed.filter.fasta
# Total sequences: 166760343
# Total bases: 8579.28 Mb

# condition1_plus.read2.trimmed.filter.fasta
# Total sequences: 166760343
# Total bases: 8536.41 Mb

# condition1_plus.orphan.pool.trim.filter.fasta
# Total sequences: 15257237
# Total bases: 678.89 Mb

Percent of data removed from each file

Sequences
read 1:  0.11%
read 2:  0.11%
orphan:  0.27%

Bases
read 1:  0.11%
read 2:  0.11%
orphan:  0.25%
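
Those percentages are just (amount removed) / (amount going in); for example, for read 1 sequences:

awk 'BEGIN {printf "%.2f%%\n", (166947462 - 166760343) / 166947462 * 100}'    # prints 0.11%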

This seems to indicate that the rRNA depletion worked really well and that, on that end at least, we didn’t do much useless sequencing! I’ll move forward with these rRNA-depleted files from now on.


Thinking about it some more, I should also be filtering through the Mus musculus genome…
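
That filtering step should look basically the same as the SILVA one, just run against a Bowtie index built from the mouse genome. A sketch, assuming a mouse genome fasta is on hand (the index and output names here are made up):

# build a Bowtie index from the mouse genome (file names are hypothetical)
/home/mljenior/bin/bowtie/bowtie-build mus_musculus.fasta mus_db
# keep only the read pairs that fail to align to the mouse genome
/home/mljenior/bin/bowtie/bowtie mus_db -f -1 ${sample_name}.read1.trimmed.filter.fasta -p 4 -2 ${sample_name}.read2.trimmed.filter.fasta --un ${sample_name}.read.trimmed.filter.mus.fasta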

I’ll submit that job now and we will see how many sequences are lost tomorrow. After that I need to subsample the paired-end and orphan reads down to the lowest shared sequence count for each type of file, and then the reads should finally be ready for mapping to genomes / metagenomes.
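
One way to do the subsampling (not settled yet; seqtk isn’t a tool I’ve used above, just an example) would be to sample with a fixed seed so read 1 and read 2 stay paired. The count here is a stand-in for whatever the lowest shared number of sequences turns out to be:

seqtk sample -s100 ${sample_name}.read1.trimmed.filter.fasta 150000000 > ${sample_name}.read1.sub.fasta
seqtk sample -s100 ${sample_name}.read2.trimmed.filter.fasta 150000000 > ${sample_name}.read2.sub.fasta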

As I was thinking about subsampling reads to make fairer comparisons, I realized that there is probably still at least a small proportion of contaminating rRNA reads left in the transcriptomes. If I subsample before filtering out those reads, I may be losing data that would map to genes I care about downstream. To avoid this, I decided to map the reads with Bowtie to the SILVA rRNA gene database and carry only the reads that fail to find a match forward through the rest of the pipeline.

I started by getting the non-redundant full-length sequences from release 123 off of the mothur site. This file has sequences from bacteria (152308), eukaryotes (16209), and archaea (3901). It looks like this:

>AB006051.LobTirol
......................................................ATG---------CGTACGCCATGCGTG-------
GGCTAGACT------CGTAC------------------------------ACGTACGACACGTCCATAGACTCGATAC----------
...
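
Those counts should add up to the total number of sequences in the file (152308 + 16209 + 3901 = 172418), which is easy to check since each header line starts with >:

grep -c "^>" silva.nr_v123.align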

In order to align reads to this file, I needed to remove the gaps before making the Bowtie database. I used the mothur command that removes gaps as follows:

Degap.seqs(fasta=silva.nr_v123.align)
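
For reference, mothur commands like this one can also be run non-interactively straight from the shell, which is convenient on the cluster. The output is silva.nr_v123.ng.fasta, which is what gets fed to bowtie-build below:

mothur "#degap.seqs(fasta=silva.nr_v123.align)"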

This gives you a fasta file with just the sequence names and nucleotides remaining, no more gaps. Making the Bowtie database is almost as simple, with the following command:

/home/mljenior/bin/bowtie/bowtie-build silva.nr_v123.ng.fasta silva_db
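
If the build finishes cleanly, there should be a set of .ebwt index files sharing the silva_db prefix, which is the prefix the filtering commands point at:

ls silva_db*.ebwt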

Now it’s ready to be used for filtering rRNA reads out of my metatranscriptomes. I’ll probably post the results of how well it does later this week.