Results from filtering through mouse genome

As I mentioned in the last post, I’m mapping the filtered transcripts to the Mus musculus genome (our lab mice) to remove more reads I’m not interested in before subsampling. As the reference I used the complete list of annotated Mus musculus genes from the KEGG database we have on axiom. With this step done, the files should be fully curated and ready for the next steps.

Here are the commands for mapping, just for reference:

Paired-end read alignment:

/home/mljenior/bin/bowtie/bowtie /mnt/EXT/Schloss-data/matt/metatranscriptomes_HiSeq/mus_musculus/mus_db -f -1 ${sample_name}.read1.pool.trim.filt_rRNA.fasta -p 4 -2 ${sample_name}.read2.pool.trim.filt_rRNA.fasta --un ${sample_name}.filter.trimmed.read.fasta
mv ${sample_name}.filter.trimmed.read_1.fasta ${sample_name}.read1.pool.trim.filt_rRNA.filt_mus.fasta
mv ${sample_name}.filter.trimmed.read_2.fasta ${sample_name}.read2.pool.trim.filt_rRNA.filt_mus.fasta

Orphaned read alignment:

/home/mljenior/bin/bowtie/bowtie /mnt/EXT/Schloss-data/matt/metatranscriptomes_HiSeq/mus_musculus/mus_db -f ${sample_name}.orphan.pool.trim.filt_rRNA.fasta -p 4 --un ${sample_name}.orphan.pool.trim.filt_rRNA.filt_mus.fasta

Unmapped reads from mapping against mouse genome

# condition1_plus.read1.pool.trim.filt_rRNA.filt_mus.fasta
# Total sequences: 164655029
# Total bases: 8470.99 Mb

# condition1_plus.read2.pool.trim.filt_rRNA.filt_mus.fasta
# Total sequences: 164655029
# Total bases: 8428.80 Mb

# condition1_plus.orphan.pool.trim.filt_rRNA.filt_mus.fasta
# Total sequences: 14957436
# Total bases: 666.06 Mb
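
For anyone who wants to reproduce summary lines like the ones above, here is a minimal sketch of how sequence and base counts could be pulled from a FASTA file. It is not the script I actually used; the function name `fasta_stats` is hypothetical, and it assumes that "Mb" means 10^6 bases and that every header line starts with `>`.

```shell
# Hypothetical helper (not the original script): count sequences and bases
# in a FASTA file. Assumes 1 Mb = 1,000,000 bases.
fasta_stats() {
    awk '
        /^>/ { seqs++; next }              # header line: one more sequence
        {
            gsub(/[ \t\r]/, "")            # drop stray whitespace / carriage returns
            bases += length($0)            # sequence line: add its length
        }
        END {
            printf "# Total sequences: %d\n", seqs
            printf "# Total bases: %.2f Mb\n", bases / 1e6
        }
    ' "$1"
}
```

It would be called as `fasta_stats ${sample_name}.orphan.pool.trim.filt_rRNA.filt_mus.fasta`, once per output file.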

Percent of data removed by mouse filter

Sequences
read 1:  1.26%
read 2:  1.26%
orphan:  1.96%

Bases
read 1:  1.26%
read 2:  1.26%
orphan:  1.89%

This shows that very little mouse transcript makes it into the cecal content, so it is unlikely to mask the signal I’m hoping to get from the datasets.
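
The percentages here are just the relative drop in counts across a filter step: 100 × (before − after) / before. Below is a small sketch of that arithmetic. The function name `percent_removed` is hypothetical, and the numbers in the example call are illustrative only; they are not counts from these datasets.

```shell
# Hypothetical helper: percent of sequences (or bases) removed by a filter step.
# Usage: percent_removed <count_before> <count_after>
percent_removed() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        if (before <= 0) {
            print "count before filtering must be > 0" > "/dev/stderr"
            exit 1
        }
        printf "%.2f%%\n", 100 * (before - after) / before
    }'
}

# Illustrative counts only: 13 of 1000 reads removed
percent_removed 1000 987
```

The same function works for bases as well as sequences, as long as both arguments come from the same file before and after the filter.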

Total percent of data removed (rRNA and mouse filters combined)

Sequences
read 1:  1.37%
read 2:  1.37%
orphan:  2.23%

Bases
read 1:  1.37%
read 2:  1.37%
orphan:  2.14%

This is great news. It looks like I have primarily bacterial sequence (excluding any possible viral and archaeal reads).

Also, it’s worth noting that these numbers are representative of the filtering for all the other experimental groups.
Overall, I’ve lost less than 3% of the data by the end of the two-step filtering process!


December 21, 2015