I had almost totally forgotten about the blog posts and I really want to get into the habit of writing them. I wanted to write out an update on where my pipeline stands for my
metagenomic assemblies. I broke it into steps from processing and managing the large intermediate files that are created with each new step. As of now there are four metagenomes with hopefully
three more to follow and fully complement the metatranscriptomic analysis for each condition. Each metagenome was sequenced twice (across 2 separate Hi-Seq lanes) using 250 bp paired-end libraries
done on the Rapid setting. After pooling the data within respect metagenomes, before curation, that resulted in ~100 GB of data per sample. Since there was so much data per sample, I felt that
digital normalization and Megahit were my best bets for fast, high-quality metagenomic assemblies. I
have an automated pipeline in place that will submit each job with the correct dependencies in order for all the metagenomes at once. I’ll break down each of the pbs scripts here:
Here I’m concatenating read files and interleving the paired reads from each lane
I try to do some real-time data management so large fatqs/fastas don’t hang around on our cluster much longer than they need to. I then move into read quality trimming using Sickle.
I chose to cutoff reads below Q25 as it was a little more forgiving than the strict Q33, but still had a 99.5% confidence for base calls. I also tossed any sequence smaller than 21 bases, which is the shortest kmer
used by the assembler and would have been tossed anyway. I ran into problems before I knew this and megahit thought it had single end reads in the file I gave it because it was silently tossing short reads.
The seq_stats.py script that keeps showing up is a small python script I wrote to quickly assess sequence files for some standard metrics you can judge assemble quality with like N50, L50, N90, etc. (It’s on my github page)
Next I used the khmer package from Titus Brown’s lab to normalize read count by kmer abundance to a coverage of 10 (what he uses in his metagenomic assemblies). When do these assemblies without first doing the
normalization steps, the quality is much worse. After combing through some Seqanswers forums I found that it might be that errors occurring in the same base multiple times can result in the contig being split
into multiple, smaller contigs and leading to a worse assembly overall. I’m starting to believe that’s true, but I’m running a few more tests before I can be sure.
Megahit is super easy, blazing fast, and really good at it’s job. I’ll list some stats in my next post when the jobs finish, but it is had-and-shoulders better than other assemblers I’ve tried.
Amanda Agelmore in our lab has found similar results. I used both the normalize paired-end and orphaned reads to maximize my coverage of each metagenome.
Annotation and read-mapping
I used MGA for gene calling and BLASTed the translated peptide sequences against KEGG which his what I used to construct
the metabolic networks I discussed in a previous post. I have two methods for that based on work frrom Elhanan Borenstein and Jens Nielsen which I am excited to use downstream.
One step (among others) of quality control I do is counting the number of curated reads that actually map back to the assembled contigs. This metric is a way of telling if what you assembled actually looks like what you sequenced in the first place.
I’ve revamped the pipeline in the last couple of days so when I have the quality metrics tomorrow when it’s done I’ll write another post about that.