Let’s see if I’m trimming data or information…

Quality Trimming

To improve the speed and quality of assembly, it makes sense to reduce the amount of data (reads) fed to the assembler. The trick, however, is to reduce data without losing information. The first step is using Sickle to trim the 3’-ends of reads based on quality scores and discard reads that don’t meet a length threshold. You can read more about their algorithm here, on their GitHub page.

Digital Normalization

I discussed this in the previous blog post, but just to reiterate: with the number of reads I have, an assembly would take forever, and recurring sequencing errors can break up good contigs.

Results

Control

Category     Pooled Lanes   Quality Trimming   Digital Normalization
Paired-end   116784668      114103196          22821658
Orphan       NA             1252357            1624925
Total        116784668      115355553          24446583
% of Pool    NA             98.78              20.93

Condition 1

Category     Pooled Lanes   Quality Trimming   Digital Normalization
Paired-end   178456950      174412784          43012608
Orphan       NA             1926609            4014358
Total        178456950      176339393          47026966
% of Pool    NA             98.81              26.35

Condition 2

Category     Pooled Lanes   Quality Trimming   Digital Normalization
Paired-end   118787094      115437644          10536046
Orphan       NA             1584056            1142605
Total        118787094      117021700          11678651
% of Pool    NA             98.51              9.83

Condition 3

Category     Pooled Lanes   Quality Trimming   Digital Normalization
Paired-end   395429292      385020860          2145684
Orphan       NA             4954819            1420062
Total        395429292      389975679          3565746
% of Pool    NA             98.62              0.90

The amount that was removed from the Strep-treated metagenome is a little bit alarming, especially when you remember that it is a much more diverse community than, say, cefoperazone (so it’s not just a byproduct of low species richness…). This sample has a whole order of magnitude fewer reads left after normalization than any of the others. I’m not sure how well this will assemble now, but I can tell more once the job mapping reads to contigs finishes. That will tell me how much what I assembled looks like what I sequenced. If that’s no good, I’ll compare it to the assemblies without normalization and see if khmer is being too greedy. The samples were sequenced on 2 lanes simultaneously and then pooled, so it’s unlikely that both lanes sequenced poorly, especially when all the other quality metrics are high.

I had almost totally forgotten about the blog posts, and I really want to get into the habit of writing them. I wanted to write an update on where my pipeline stands for my metagenomic assemblies. I’ve broken it into steps, from processing the reads to managing the large intermediate files that get created at each new step. As of now there are four metagenomes, with hopefully three more to follow, to fully complement the metatranscriptomic analysis for each condition. Each metagenome was sequenced twice (across 2 separate HiSeq lanes) using 250 bp paired-end libraries run on the Rapid setting. Pooling the data within each respective metagenome, before curation, resulted in ~100 GB of data per sample. Since there was so much data per sample, I felt that digital normalization and Megahit were my best bets for fast, high-quality metagenomic assemblies. I have an automated pipeline in place that will submit each job with the correct dependencies, in order, for all the metagenomes at once. I’ll break down each of the PBS scripts below.
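
The submission side is nothing fancy, just chained qsub calls where each step waits on the previous job ID. Roughly, the idea looks like this (a minimal sketch, not my actual wrapper; the .pbs names and sample name are placeholders):

#!/usr/bin/env python
# Rough sketch of chaining the PBS jobs together (placeholder script names, not the real ones).
import subprocess

steps = ['preprocess.pbs', 'diginorm.pbs', 'assembly.pbs', 'annotation.pbs', 'read_mapping.pbs']
sample = 'cefoperazone'  # hypothetical sample name passed into each script

prev_job = None
for script in steps:
    cmd = ['qsub', '-v', 'sample_name=' + sample]
    if prev_job:
        # afterok: only start this step if the previous job finished without errors
        cmd += ['-W', 'depend=afterok:' + prev_job]
    cmd.append(script)
    # qsub prints the new job ID on stdout; the next step depends on it
    prev_job = subprocess.check_output(cmd).strip().decode()
    print('submitted %s as %s' % (script, prev_job))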

Preprocessing

Here I’m concatenating the read files and interleaving the paired reads from each lane.

gunzip *.gz

cat *_L001_R1_*.fastq > ${sample_name}.lane1.read1.fastq
cat *_L001_R2_*.fastq > ${sample_name}.lane1.read2.fastq
cat *_L002_R1_*.fastq > ${sample_name}.lane2.read1.fastq
cat *_L002_R2_*.fastq > ${sample_name}.lane2.read2.fastq

gzip *_L00*.fastq

module rm gcc/4.6.4
module add python/2.7.9 gcc/4.9.2 khmer

interleave-reads.py ${sample_name}.lane1.read1.fastq ${sample_name}.lane1.read2.fastq -o ${sample_name}.lane1.paired.fastq
interleave-reads.py ${sample_name}.lane2.read1.fastq ${sample_name}.lane2.read2.fastq -o ${sample_name}.lane2.paired.fastq

rm *read*

python /mnt/EXT/Schloss-data/seq_stats.py ${sample_name}.lane1.paired.fastq > ${sample_name}.lane1.paired.fastq.summary
python /mnt/EXT/Schloss-data/seq_stats.py ${sample_name}.lane2.paired.fastq > ${sample_name}.lane2.paired.fastq.summary

I try to do some real-time data management so large fastqs/fastas don’t hang around on our cluster much longer than they need to. I then move into read quality trimming using Sickle.
I chose to cut off reads below Q25, which is a little more forgiving than a strict Q33 but still gives better than 99.5% confidence in each base call. I also tossed any sequence shorter than 21 bases, which is the shortest k-mer used by the assembler, so those reads would have been discarded anyway. I ran into problems before I knew this: Megahit was silently tossing the short reads, which broke the interleaved pairing and made it think I had given it single-end reads.

/mnt/EXT/Schloss-data/kiverson/sickle-master/sickle pe -c ${sample_name}.lane1.paired.fastq -m ${sample_name}.lane1.paired.trimmed.fastq -s ${sample_name}.lane1.orphan.trimmed.fastq -t sanger -q 25 -l 21
/mnt/EXT/Schloss-data/kiverson/sickle-master/sickle pe -c ${sample_name}.lane2.paired.fastq -m ${sample_name}.lane2.paired.trimmed.fastq -s ${sample_name}.lane2.orphan.trimmed.fastq -t sanger -q 25 -l 21

rm *paired.fastq

cat ${sample_name}.lane1.paired.trimmed.fastq ${sample_name}.lane2.paired.trimmed.fastq > ${sample_name}.paired.trimmed.pooled.fastq
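# convert FASTQ to FASTA (here and below): keep the @header as >header plus the sequence line, skip the + and quality lines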
awk '{print ">" substr($0,2);getline;print;getline;getline}' ${sample_name}.paired.trimmed.pooled.fastq > ${sample_name}.paired.trimmed.pooled.fasta
cat ${sample_name}.lane1.orphan.trimmed.fastq ${sample_name}.lane2.orphan.trimmed.fastq > ${sample_name}.orphan.trimmed.pooled.fastq
awk '{print ">" substr($0,2);getline;print;getline;getline}' ${sample_name}.orphan.trimmed.pooled.fastq > ${sample_name}.orphan.trimmed.pooled.fasta

rm *.trimmed.fastq *.pooled.fastq *.paired.fastq

cat ${sample_name}.paired.trimmed.pooled.fasta ${sample_name}.orphan.trimmed.pooled.fasta > ${sample_name}.trimmed.pooled.all.fasta

python /mnt/EXT/Schloss-data/seq_stats.py ${sample_name}.paired.trimmed.pooled.fasta > ${sample_name}.paired.trimmed.pooled.fasta.summary
python /mnt/EXT/Schloss-data/seq_stats.py ${sample_name}.orphan.trimmed.pooled.fasta > ${sample_name}.orphan.trimmed.pooled.fasta.summary
python /mnt/EXT/Schloss-data/seq_stats.py ${sample_name}.trimmed.pooled.all.fasta > ${sample_name}.trimmed.pooled.all.fasta.summary

The seq_stats.py script that keeps showing up is a small Python script I wrote to quickly assess sequence files for some standard metrics you can judge assembly quality with, like N50, L50, N90, etc. (It’s on my GitHub page.)
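
For context, the core of those calculations is pretty simple; here’s a stripped-down sketch of the N50/L50 part (not the actual script, and it assumes a plain FASTA as input):

# Stripped-down sketch of the N50/L50 calculation, in the spirit of seq_stats.py (not the real script).
import sys

def contig_lengths(fasta_path):
    lengths, current = [], 0
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith('>'):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

def n50_l50(lengths):
    # N50: length of the contig at which the running total passes half the assembly size
    # L50: the number of contigs it takes to get there
    ordered = sorted(lengths, reverse=True)
    half, running = sum(ordered) / 2.0, 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count

lengths = contig_lengths(sys.argv[1])
n50, l50 = n50_l50(lengths)
print('contigs: %d  total bp: %d  N50: %d  L50: %d' % (len(lengths), sum(lengths), n50, l50))
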
Next I used the khmer package from Titus Brown’s lab to normalize read count by k-mer abundance to a coverage of 10 (what he uses in his metagenomic assemblies). When I do these assemblies without first doing the normalization step, the quality is much worse. After combing through some SEQanswers forums, I found that errors occurring at the same base multiple times can split a contig into multiple smaller contigs, leading to a worse assembly overall. I’m starting to believe that’s true, but I’m running a few more tests before I can be sure.

module rm gcc/4.6.4
module add python/2.7.9 gcc/4.9.2 khmer

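# Digital normalization: down-sample the interleaved pairs to a median k-mer coverage of 10 (k=20),
# trim likely sequencing errors, then separate reads that are still paired from the newly orphaned ones.
# The orphans from quality trimming get the same treatment using the saved k-mer graph.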
normalize-by-median.py -p -k 20 -C 10 -N 4 -x 1e9 --savegraph normC10k20.kh ${sample_name}.paired.trimmed.pooled.fasta
filter-abund.py -C 50 -V normC10k20.kh ${sample_name}.paired.trimmed.pooled.fasta.keep
extract-paired-reads.py -p ${sample_name}.trimmed.pooled.diginorm.paired.fasta -s ${sample_name}.trimmed.pooled.diginorm.single.fasta ${sample_name}.paired.trimmed.pooled.fasta.keep.abundfilt
normalize-by-median.py --force_single --loadgraph normC10k20.kh ${sample_name}.orphan.trimmed.pooled.fasta
filter-abund.py -C 50 -V normC10k20.kh ${sample_name}.orphan.trimmed.pooled.fasta.keep -o ${sample_name}.orphan.trimmed.pooled.keep.fasta
cat ${sample_name}.trimmed.pooled.diginorm.single.fasta ${sample_name}.orphan.trimmed.pooled.keep.fasta > ${sample_name}.trimmed.pooled.diginorm.orphan.fasta

rm *.keep *.se *.pe *.kh *.abundfilt *.single.fasta

python /mnt/EXT/Schloss-data/seq_stats.py ${sample_name}.trimmed.pooled.diginorm.paired.fasta > ${sample_name}.trimmed.pooled.diginorm.paired.fasta.summary
python /mnt/EXT/Schloss-data/seq_stats.py ${sample_name}.trimmed.pooled.diginorm.orphan.fasta > ${sample_name}.trimmed.pooled.diginorm.orphan.fasta.summary

cat *.summary > preassembly.summaries
rm *.summary

Assembly

Megahit is super easy, blazing fast, and really good at its job. I’ll list some stats in my next post when the jobs finish, but it is head-and-shoulders better than the other assemblers I’ve tried. Amanda Agelmore in our lab has found similar results. I used both the normalized paired-end and orphaned reads to maximize my coverage of each metagenome.

cp /mnt/EXT/Schloss-data/matt/metagenomes_HiSeq/${sample_name}/fastq/${sample_name}.trimmed.pooled.diginorm.paired.fasta ./
cp /mnt/EXT/Schloss-data/matt/metagenomes_HiSeq/${sample_name}/fastq/${sample_name}.trimmed.pooled.diginorm.orphan.fasta ./

python /home/mljenior/bin/megahit/megahit -m 45e9 -l 251 --12 ${sample_name}.trimmed.pooled.diginorm.paired.fasta -r ${sample_name}.trimmed.pooled.diginorm.orphan.fasta --cpu-only --k-max 127 -o ${sample_name}.megahit

python /share/scratch/bin/removeShortContigs.py ${sample_name}.megahit/final.contigs.fa ${sample_name}.megahit/${sample_name}.final.contigs.251.fa 251

python /mnt/EXT/Schloss-data/seq_stats.py ${sample_name}.megahit/${sample_name}.final.contigs.251.fa > ${sample_name}.megahit/${sample_name}.final.contigs.251.fa.summary

rm *.fasta

Annotation and read-mapping

I used MGA for gene calling and BLASTed the translated peptide sequences against KEGG, which is what I used to construct the metabolic networks I discussed in a previous post. I have two methods for that, based on work from Elhanan Borenstein and Jens Nielsen, which I am excited to use downstream.

/share/scratch/bin/mga /mnt/EXT/Schloss-data/matt/metagenomes_HiSeq/${sample_name}/assembly/${sample_name}.megahit/${sample_name}.final.contigs.251.fa -m > ${sample_name}.final.contigs.251.mga.out

python /share/scratch/bin/getgenesbypos.py ${sample_name}.final.contigs.251.mga.out /mnt/EXT/Schloss-data/matt/metagenomes_HiSeq/${sample_name}/assembly/${sample_name}.megahit/${sample_name}.final.contigs.251.fa ${sample_name}.genes.faa ${sample_name}.genes.fna

python /share/scratch/bin/removeShortContigs.py ${sample_name}.genes.fna ${sample_name}.genes.250.fna 250
python /share/scratch/bin/removeShortContigs.py ${sample_name}.genes.faa ${sample_name}.genes.80.faa 80

python /home/mljenior/scripts/format_fasta.py ${sample_name}.genes.250.fna ${sample_name}.genes.250.format.fna
python /home/mljenior/scripts/format_fasta.py ${sample_name}.genes.80.faa ${sample_name}.genes.80.format.faa

python /mnt/EXT/Schloss-data/seq_stats.py ${sample_name}.genes.250.format.fna > ${sample_name}.genes.250.format.fna.summary
python /mnt/EXT/Schloss-data/seq_stats.py ${sample_name}.genes.80.format.faa > ${sample_name}.genes.80.format.faa.summary

blastp -query ${sample_name}.genes.80.format.faa -db /mnt/EXT/Schloss-data/kegg/kegg/genes/fasta/kegg_prot_blast.db -num_threads 16 -evalue 5e-4 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -outfmt 6 -max_target_seqs 1 -out ${sample_name}.protVprot.out

I used Bowtie to map reads to called genes.

/home/mljenior/bin/bowtie/bowtie-build ${sample_name}.genes.250.format.fna ${sample_name}_gene_database

/home/mljenior/bin/bowtie/bowtie ${sample_name}_gene_database -f /mnt/EXT/Schloss-data/matt/metagenomes_HiSeq/${sample_name}/fastq/${sample_name}.trimmed.pooled.all.fasta -p 4 -S ${sample_name}.reads2genes.sam

samtools view -bS ${sample_name}.reads2genes.sam > ${sample_name}.reads2genes.bam
samtools sort ${sample_name}.reads2genes.bam ${sample_name}.reads2genes.sorted
samtools index ${sample_name}.reads2genes.sorted.bam
samtools idxstats ${sample_name}.reads2genes.sorted.bam > ${sample_name}.mapped2genes.txt 

rm *.ebwt *.bcf *ai *.bam ${sample_name}_gene_database*

Quality control

One quality-control step I do (among others) is counting the number of curated reads that actually map back to the assembled contigs. This metric tells you whether what you assembled actually looks like what you sequenced in the first place.

cp /mnt/EXT/Schloss-data/matt/metagenomes_HiSeq/${sample_name}/assembly/${sample_name}.megahit/${sample_name}.final.contigs.251.fa ./
/home/mljenior/bin/bowtie/bowtie-build ${sample_name}.final.contigs.251.fa ${sample_name}_contig_database

/home/mljenior/bin/bowtie/bowtie ${sample_name}_contig_database -f /mnt/EXT/Schloss-data/matt/metagenomes_HiSeq/${sample_name}/fastq/${sample_name}.trimmed.pooled.all.fasta -p 4 -S ${sample_name}.aligned.sam

samtools view -bS ${sample_name}.aligned.sam > ${sample_name}.aligned.bam
samtools sort ${sample_name}.aligned.bam ${sample_name}.aligned.sorted
samtools index ${sample_name}.aligned.sorted.bam
samtools idxstats ${sample_name}.aligned.sorted.bam > ${sample_name}.mapped2contigs.txt 

rm *.ebwt *.bcf *ai *.bam ${sample_name}_contig_database*
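
Once idxstats finishes, getting the percentage of reads that mapped back is just a matter of summing columns 3 and 4 of its output (mapped and unmapped read counts). Something like this quick sketch does the trick:

# Quick sketch: percent of curated reads mapping back to contigs, from samtools idxstats output.
# idxstats columns: reference name, reference length, # mapped reads, # unmapped reads.
import sys

mapped = unmapped = 0
with open(sys.argv[1]) as stats:  # e.g. sample.mapped2contigs.txt
    for line in stats:
        fields = line.rstrip('\n').split('\t')
        mapped += int(fields[2])
        unmapped += int(fields[3])

total = mapped + unmapped
print('%d of %d reads mapped to contigs (%.2f%%)' % (mapped, total, 100.0 * mapped / total))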

I’ve revamped the pipeline over the last couple of days, so when it finishes and I have the quality metrics tomorrow, I’ll write another post about them.