To improve the speed and quality of assembly it makes sense to reduce the amount of data (reads) input to the assembler. The trick, however, is to reduce the data without losing any information. The first step is using Sickle to trim the 3’-ends of reads based on quality scores and discard reads that don’t meet a length threshold. You can read more about their algorithm on their GitHub page.
Digital Normalization
I discussed this in the previous blog post, but just to reiterate: with the number of reads I have, an assembly would take forever, and recurring sequencing errors can break up good contigs.
Results
Control
| Category   | Pooled Lanes | Quality Trimming | Digital Normalization |
|------------|--------------|------------------|-----------------------|
| Paired-end | 116784668    | 114103196        | 22821658              |
| Orphan     | NA           | 1252357          | 1624925               |
| Total      | 116784668    | 115355553        | 24446583              |
| % of Pool  | NA           | 98.78            | 20.93                 |
Condition 1
| Category   | Pooled Lanes | Quality Trimming | Digital Normalization |
|------------|--------------|------------------|-----------------------|
| Paired-end | 178456950    | 174412784        | 43012608              |
| Orphan     | NA           | 1926609          | 4014358               |
| Total      | 178456950    | 176339393        | 47026966              |
| % of Pool  | NA           | 98.81            | 26.35                 |
Condition 2
| Category   | Pooled Lanes | Quality Trimming | Digital Normalization |
|------------|--------------|------------------|-----------------------|
| Paired-end | 118787094    | 115437644        | 10536046              |
| Orphan     | NA           | 1584056          | 1142605               |
| Total      | 118787094    | 117021700        | 11678651              |
| % of Pool  | NA           | 98.51            | 9.83                  |
Condition 3
| Category   | Pooled Lanes | Quality Trimming | Digital Normalization |
|------------|--------------|------------------|-----------------------|
| Paired-end | 395429292    | 385020860        | 2145684               |
| Orphan     | NA           | 4954819          | 1420062               |
| Total      | 395429292    | 389975679        | 3565746               |
| % of Pool  | NA           | 98.62            | 0.90                  |
The amount that was removed from the Strep-treated metagenome is a little alarming, especially when you remember that it is a much more diverse community than, say, Cefoperazone (so it’s not just a byproduct of low species richness…). This sample has a whole order of magnitude fewer reads left after normalization compared to any of the others. I’m not sure how well this will assemble now, but I can tell more once the job for mapping reads back to the contigs finishes. That will tell me how much what I assembled looks like what I sequenced. If that’s no good, I’ll compare it to the assemblies without normalization and see if khmer is being too greedy. The samples were sequenced on 2 lanes simultaneously and then pooled, so it’s unlikely that both lanes sequenced poorly, especially when all other quality metrics are high.
I had almost totally forgotten about the blog posts, and I really want to get into the habit of writing them. I wanted to write an update on where my pipeline stands for my metagenomic assemblies. I broke it into steps, from processing the reads to managing the large intermediate files that are created at each new step. As of now there are four metagenomes, with hopefully three more to follow, to fully complement the metatranscriptomic analysis for each condition. Each metagenome was sequenced twice (across 2 separate HiSeq lanes) using 250 bp paired-end libraries run on the Rapid setting. After pooling the data within the respective metagenomes, before curation, that came to ~100 GB of data per sample. Since there was so much data per sample, I felt that digital normalization and Megahit were my best bets for fast, high-quality metagenomic assemblies. I have an automated pipeline in place that submits each job with the correct dependencies, in order, for all the metagenomes at once (roughly along the lines of the sketch below). I’ll break down each of the pbs scripts here:
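The dependency chaining itself is just qsub’s depend=afterok; here’s a stripped-down sketch of what the submit loop looks like (the script and sample names are hypothetical, not my actual files):

```bash
#!/bin/bash
# Submit the whole chain for every metagenome; each step only starts after the
# previous job exits successfully (afterok). Script names are placeholders.
for sample in control cond1 cond2 cond3; do
    prep=$(qsub -v SAMPLE=$sample preprocess.pbs)                        # pool lanes + interleave
    trim=$(qsub -v SAMPLE=$sample -W depend=afterok:$prep trim.pbs)      # sickle
    norm=$(qsub -v SAMPLE=$sample -W depend=afterok:$trim diginorm.pbs)  # khmer
    asm=$(qsub -v SAMPLE=$sample -W depend=afterok:$norm megahit.pbs)    # assembly
    qsub -v SAMPLE=$sample -W depend=afterok:$asm map_annotate.pbs       # mapping + annotation
done
```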
Preprocessing
Here I’m concatenating the read files and interleaving the paired reads from each lane, as sketched below.
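A minimal sketch of that step, using khmer’s interleave-reads.py (the lane file names are made up):

```bash
# Pool the two HiSeq lanes for each mate, then interleave the pairs.
# File names are hypothetical; interleave-reads.py ships with khmer.
cat lane1_R1.fastq lane2_R1.fastq > pooled_R1.fastq
cat lane1_R2.fastq lane2_R2.fastq > pooled_R2.fastq
interleave-reads.py pooled_R1.fastq pooled_R2.fastq -o pooled_interleaved.fastq
```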
I try to do some real-time data management so large fastqs/fastas don’t hang around on our cluster much longer than they need to. I then move into read quality trimming using Sickle.
I chose to cut off reads below Q25, as it was a little more forgiving than the strict Q33 but still gives better than 99.5% confidence in each base call. I also tossed any sequence shorter than 21 bases, which is the shortest k-mer used by the assembler, so those reads would have been discarded anyway. I ran into problems before I knew this: megahit was silently tossing the short reads and so thought the file I gave it contained single-end reads.
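For reference, the trimming call looks roughly like this, shown here on split mate files (the file names are placeholders):

```bash
# 3'-end quality trimming at Q25, dropping anything shorter than 21 bp.
# -s collects the orphans whose mate was discarded.
sickle pe -t sanger -q 25 -l 21 \
    -f pooled_R1.fastq -r pooled_R2.fastq \
    -o trimmed_R1.fastq -p trimmed_R2.fastq \
    -s trimmed_orphans.fastq
```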
The seq_stats.py script that keeps showing up is a small python script I wrote to quickly assess sequence files for some standard metrics you can judge assembly quality with, like N50, L50, N90, etc. (It’s on my GitHub page.)
Next I used the khmer package from Titus Brown’s lab to normalize read counts by k-mer abundance down to a coverage of 10 (what he uses in his metagenomic assemblies). When I do these assemblies without the normalization step first, the quality is much worse. After combing through some SEQanswers forums, I found that errors recurring at the same base can split what should be a single contig into multiple smaller contigs, leading to a worse assembly overall. I’m starting to believe that’s true, but I’m running a few more tests before I can be sure.
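The core of that step is a single khmer call along these lines (the k-mer size and memory flag depend on your khmer version, and the file names are made up):

```bash
# Digital normalization to a k-mer coverage of ~10 on the interleaved, trimmed reads.
# -p keeps read pairing intact.
normalize-by-median.py -p -k 20 -C 10 -M 16e9 \
    -o trimmed_interleaved.keep.fastq trimmed_interleaved.fastq
```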
Assembly
Megahit is super easy, blazing fast, and really good at its job. I’ll list some stats in my next post when the jobs finish, but it is head-and-shoulders better than the other assemblers I’ve tried. Amanda Agelmore in our lab has found similar results. I used both the normalized paired-end and orphaned reads to maximize my coverage of each metagenome.
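The call itself is about as simple as it gets (hypothetical file names; --12 takes the interleaved pairs and -r the singletons):

```bash
# Assemble the normalized reads: interleaved pairs plus the orphaned reads.
megahit --12 normalized_interleaved.fastq \
        -r normalized_orphans.fastq \
        -o megahit_assembly
```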
Annotation and read-mapping
I used MGA for gene calling and BLASTed the translated peptide sequences against KEGG, which is what I used to construct the metabolic networks I discussed in a previous post. I have two methods for that, based on work from Elhanan Borenstein and Jens Nielsen, which I am excited to use downstream.
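For the KEGG step, the BLAST call is along these lines (the database name, e-value cutoff, and output format here are placeholders rather than my exact settings):

```bash
# BLAST the translated peptides from gene calling against a KEGG protein database.
blastp -query predicted_peptides.faa -db kegg_prot \
       -evalue 1e-5 -outfmt 6 -max_target_seqs 1 \
       -num_threads 8 -out peptides_vs_kegg.blastout
```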
One step (among others) of the quality control I do is counting the number of curated reads that actually map back to the assembled contigs. This metric tells you whether what you assembled actually looks like what you sequenced in the first place.
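In practice that boils down to mapping the curated reads against the contigs and pulling out the mapped percentage, e.g. with something like bwa and samtools (a rough sketch; file names are placeholders):

```bash
# Map the curated reads back to the assembly and count how many align.
bwa index final.contigs.fa
bwa mem -t 8 -p final.contigs.fa trimmed_interleaved.fastq | \
    samtools view -b - > reads_vs_contigs.bam
samtools flagstat reads_vs_contigs.bam   # reports the percentage of reads mapped
```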
I’ve revamped the pipeline in the last couple of days, so once it finishes and I have the quality metrics tomorrow, I’ll write another post about them.