After all of the filtering is done, I’m now finally ready to subsample the read files. Also, since the awk one-liners I found for subsampling weren’t working, I wrote my own in Python.
Below are the total reads per file. The goal is to subsample each group of reads (paired and orphan) down to the lowest count among the samples.
Sample                     Read 1       Read 2       Orphan
Condition 1 - Plus         164655029    164655029    14957436
Condition 1 - Minus        147430303    147430303    14500000
Condition 2 - Plus         113675146    113675146    11505508
Condition 2 - Minus        110752263    110752263    12084021
Condition 3 - Plus         126954804    126954804    16455611
Condition 3 - Minus        115484599    115484599    11895524
Negative Control - Plus    138811772    138811772    11936178
Positive Control - Plus    170018546    170018546    19394415
This is great news. The lowest paired-end count is 110,752,263 (Condition 2 - Minus) and the lowest orphan count is 11,505,508 (Condition 2 - Plus), so even after subsampling to those targets I’ll still have >100 million paired-end reads and >10 million orphaned reads per sample. I’ll post some rarefaction curves when I get them.
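For reference, here’s a small sketch of how the per-file totals above could be tallied and the subsampling targets picked. The file-naming pattern (*_pairs.fasta, *_orphans.fasta) is hypothetical, not necessarily how my files are actually named:

```python
import glob

def count_fasta_seqs(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith('>'))

# Hypothetical layout: one interleaved pair file and one orphan file
# per sample. A pair file holds read 1 + read 2, so divide its
# sequence count by 2 to get the number of pairs.
pair_totals = {f: count_fasta_seqs(f) // 2 for f in glob.glob('*_pairs.fasta')}
orphan_totals = {f: count_fasta_seqs(f) for f in glob.glob('*_orphans.fasta')}

# Subsample every group down to the lowest count among the samples.
pair_target = min(pair_totals.values())
orphan_target = min(orphan_totals.values())
print('pair target: %d, orphan target: %d' % (pair_target, orphan_target))
```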
Since the awk commands I found were not working properly, I wrote a Python script to subsample both the interleaved paired-end and the single-end reads. Make sure argparse is available before trying to use it, though (it’s bundled with the standard library as of Python 2.7). I streamlined the process as much as possible so that each of the large fastas only needs to be looped through once. It also writes a logfile so you can remember what the heck you did in a month.
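The script itself isn’t reproduced here, so below is a minimal sketch of the same idea. It uses Knuth’s selection sampling (Algorithm S): since the total record count is already known from the tallies above, each record can be kept with probability (target - kept) / (total - seen), which yields exactly the target count in a single pass with constant memory. The option names and the choice of Algorithm S are my assumptions here, not necessarily what the actual script does:

```python
#!/usr/bin/env python
import argparse
import random

def read_records(handle, seqs_per_record):
    """Yield FASTA records: 1 sequence each for single-end input,
    2 consecutive sequences each for interleaved paired-end input."""
    record, seqs_seen = [], 0
    for line in handle:
        if line.startswith('>'):
            if seqs_seen == seqs_per_record:
                yield record
                record, seqs_seen = [], 0
            seqs_seen += 1
        record.append(line)
    if record:
        yield record

def subsample(in_path, out_path, total, target, seqs_per_record):
    """Single-pass selection sampling (Knuth's Algorithm S)."""
    kept = seen = 0
    with open(in_path) as fin, open(out_path, 'w') as fout:
        for record in read_records(fin, seqs_per_record):
            # Keep this record with probability (target - kept) / (total - seen),
            # which guarantees exactly `target` records are kept overall.
            if random.random() < float(target - kept) / (total - seen):
                fout.writelines(record)
                kept += 1
            seen += 1
            if kept == target:
                break
    return kept

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Single-pass FASTA subsampler (sketch).')
    parser.add_argument('--input', required=True)
    parser.add_argument('--output', required=True)
    parser.add_argument('--total', type=int, required=True,
                        help='total records in the input (a pair counts as one record)')
    parser.add_argument('--target', type=int, required=True,
                        help='number of records to keep')
    parser.add_argument('--paired', action='store_true',
                        help='treat the input as interleaved paired-end')
    args = parser.parse_args()
    kept = subsample(args.input, args.output, args.total, args.target,
                     2 if args.paired else 1)
    # Write a small logfile recording what was done.
    with open(args.output + '.log', 'w') as log:
        log.write('input: %s\ntotal: %d\ntarget: %d\nkept: %d\n'
                  % (args.input, args.total, args.target, kept))
```

Run it once per file, e.g. for the Condition 1 - Plus pair file (hypothetical file names): `python subsample_reads.py --input cond1_plus_pairs.fasta --output cond1_plus_pairs.pick.fasta --total 164655029 --target 110752263 --paired`.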
After all of the filtering and the tweaking of the subsampling algorithm, the jobs are finally running. As soon as they finish, I’ll map the residual reads to the C. difficile 630 genome and get some usable data.