After all of the filtering is done, I’m finally ready to subsample the read files. Since the awk one-liners I found for subsampling weren’t working, I wrote my own script in Python.
Below are the read totals per file. The goal is to subsample each group of reads (paired and orphan) down to the lowest value among the samples.
Condition 1 - Plus
read 1: 164655029
read 2: 164655029
orphan: 14957436
Condition 1 - Minus
read 1: 147430303
read 2: 147430303
orphan: 14500000
Condition 2 - Plus
read 1: 113675146
read 2: 113675146
orphan: 11505508
Condition 2 - Minus
read 1: 110752263
read 2: 110752263
orphan: 12084021
Condition 3 - Plus
read 1: 126954804
read 2: 126954804
orphan: 16455611
Condition 3 - Minus
read 1: 115484599
read 2: 115484599
orphan: 11895524
Negative Control - Plus
read 1: 138811772
read 2: 138811772
orphan: 11936178
Positive Control - Plus
read 1: 170018546
read 2: 170018546
orphan: 19394415
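Picking the subsampling targets is just a matter of taking the minimum paired-end and orphan counts across the samples above. A quick Python check (the sample names here are my own abbreviations):

```python
# Read totals from the table above; paired counts are per mate (read 1 == read 2).
paired = {
    "cond1_plus": 164655029, "cond1_minus": 147430303,
    "cond2_plus": 113675146, "cond2_minus": 110752263,
    "cond3_plus": 126954804, "cond3_minus": 115484599,
    "neg_ctrl_plus": 138811772, "pos_ctrl_plus": 170018546,
}
orphan = {
    "cond1_plus": 14957436, "cond1_minus": 14500000,
    "cond2_plus": 11505508, "cond2_minus": 12084021,
    "cond3_plus": 16455611, "cond3_minus": 11895524,
    "neg_ctrl_plus": 11936178, "pos_ctrl_plus": 19394415,
}

# Every sample gets cut down to the smallest library in its group.
paired_target = min(paired.values())
orphan_target = min(orphan.values())
print(paired_target)  # 110752263 (Condition 2 - Minus)
print(orphan_target)  # 11505508 (Condition 2 - Plus)
```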
This is great news. It looks like I still have >100 million paired-end reads and >10 million orphaned reads per sample after subsampling. I’ll post some rarefaction curves when I get them.
The awk commands I found weren’t working properly, so I wrote this Python script to subsample both the interleaved paired-end and single-end reads. If you’re on a Python older than 2.7, install argparse before trying to use it; on newer versions it’s in the standard library.
I streamlined the process as much as possible so that each of the large fastas only needs to be looped through once. The script also writes a logfile so you can remember what the heck you did a month from now.
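The script itself isn’t reproduced in this post, but the single-pass idea can be sketched as follows. Everything here is my own reconstruction: the function name, the assumption of unwrapped 2-line FASTA records, and the seed are illustrative, not taken from the original script.

```python
import random

def subsample_fasta(in_path, out_path, total, target, interleaved=False, seed=42):
    """Single pass over a FASTA of 2-line records, keeping exactly `target`
    of `total` records (or read pairs, when the file is interleaved)."""
    random.seed(seed)
    # Decide up front which record indices survive, so the big file
    # only has to be read once.
    keep = set(random.sample(range(total), target))
    lines_per_unit = 4 if interleaved else 2  # an interleaved pair = 2 records
    kept = 0
    with open(in_path) as infile, open(out_path, "w") as outfile:
        unit = []
        index = 0
        for line in infile:
            unit.append(line)
            if len(unit) == lines_per_unit:
                if index in keep:
                    outfile.writelines(unit)  # pairs stay together
                    kept += 1
                unit = []
                index += 1
    return kept
```

A small argparse wrapper around this (input file, total, target, an `--interleaved` flag) plus a few lines appending the parameters to a logfile would reproduce the behavior described above.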
After all of the filtering and the tweaks to the subsampling algorithm, the jobs are finally running. As soon as they finish, I’ll map the residual reads to the C. difficile 630 genome and get some usable data.
As I mentioned in the last post, I’m mapping the filtered transcripts to the Mus musculus genome (our lab mice) to remove more reads I’m not interested in before subsampling. For reference, I used the complete list of annotated Mus musculus genes from the KEGG database we have on axiom. With this step done, the files should be fully curated and ready to move forward with.
Here are the commands for mapping, just for reference:
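The original commands didn’t survive in this post. For the general shape of the step, this is the kind of invocation typically used to screen reads against a host reference; the choice of bowtie2, the index name, and the file names are all illustrative, not the ones actually used here:

```
# Build an index from the annotated mouse gene sequences (hypothetical file names)
bowtie2-build mus_musculus_kegg_genes.fasta mouse_genes

# Map the filtered reads; --un keeps the reads that did NOT hit mouse
bowtie2 -f -x mouse_genes -U filtered_reads.fasta \
    --un non_mouse_reads.fasta -S mouse_hits.sam -p 8
```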
This is great news. It looks like what remains is primarily bacterial sequence (excluding any possible viral and archaeal reads). It’s also worth noting that these numbers are pretty representative of the filtering for all the other experimental groups. Basically, I’ve lost less than 3% of the data over the two-step filtering process!