First post of 2016! Just a quick post about how I measure sequence total, sequence length distribution, and assembly quality (when it’s
important). I wrote a python script that handles both fastas and fastqs to report a variety of metrics.
Here’s my code:
Output is printed to the terminal and appears like this:
Pretty straight forward, but I haven’t found any others like it that give all of these stats together. It makes read
QC and evaluating assemblies a lot more convenient. So if you like it, give it a try! I’ve also made a Github repo
for the scripts I use in this blog and linked it on the resources page.
Condition 1 - Plus
read 1: 164655029
read 2: 164655029
Condition 1 - Minus
read 1: 147430303
read 2: 147430303
Condition 2 - Plus
read 1: 113675146
read 2: 113675146
Condition 2 - Minus
read 1: 110752263
read 2: 110752263
Condition 3 - Plus
read 1: 126954804
read 2: 126954804
Condition 3 - Minus
read 1: 115484599
read 2: 115484599
Negative Control - Plus
read 1: 138811772
read 2: 138811772
Positive Control - Plus
read 1: 170018546
read 2: 170018546
This is great news. It looks like I still have >100 million paired-end reads and >10 million orphaned reads
are left per sample after subsampling. I’ll post some rarefaction curves when I get them.
The awk commands I found are not working properly, so I wrote this python script to subsample both the interleaved
pair-end and single-end reads. Make sure you install argparse before trying to use it though!
I streamlined the process as much as possible by making it only necessary to loop through the large
fastas a single time. It also prints a logfile so you can remember what the heck you did in a month.
After all of the filtering and tweaking the subsampling algorithm, the jobs are finally running. As soon
as they finish, I’ll map the residual reads to the C. difficile 630 genome and get some useable data.