Trimming and Normalization

Let’s see if I’m trimming data or information…

Quality Trimming

To improve the speed and quality of assembly, it makes sense to reduce the amount of data (reads) input to the assembler. However, the trick is to reduce data without losing any information. The first step is using sickle to trim the 3’-ends of reads based on quality scores and discard reads that don’t meet a length threshold. You can read more about the algorithm on the sickle GitHub page.
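To make the idea concrete, here is a minimal Python sketch of 3’-end quality trimming with a length cutoff. It is a simplified stand-in, not sickle’s actual algorithm (sickle works with sliding windows), and the function names, the Phred+33 encoding, and the threshold values are all assumptions chosen for illustration.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_three_prime(seq, quals, min_qual=20, min_len=20):
    """Trim low-quality bases off the 3' end of a read.

    Returns (trimmed_seq, trimmed_quals), or None when the trimmed read
    is shorter than min_len and should be discarded.
    Thresholds are illustrative, not the ones used in this post.
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    if end < min_len:
        return None  # read fails the length threshold
    return seq[:end], quals[:end]


read = "ACGTACGTAC"
quals = phred33("IIIIIIII##")  # 'I' decodes to Q40, '#' to Q2
print(trim_three_prime(read, quals, min_qual=20, min_len=5))
```

In paired-end data, a read whose mate gets discarded this way is what ends up in the "Orphan" rows of the tables below.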

Digital Normalization

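I discussed this in the previous blog post, but just to reiterate: with the number of reads I have, an assembly would take forever, and recurring sequencing errors can break up good contigs. Digital normalization (here with khmer) helps on both fronts by discarding reads from regions that are already well covered.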
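For anyone who missed that post, the core of digital normalization can be sketched in a few lines of Python. This is only an illustration of the median k-mer abundance idea: it uses an exact in-memory counter, whereas khmer uses a memory-efficient probabilistic counting structure, and the k and cutoff values are placeholders rather than the settings used for these samples.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=20, cutoff=20):
    """Keep a read only while the median count of its k-mers is below cutoff."""
    counts = Counter()  # exact counts; fine for a toy example only
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        # Counts reflect only the reads that have been kept so far.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Ten copies of one read plus a single unrelated read.
reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
print(normalize(reads, k=4, cutoff=3))
```

With these toy inputs, only the first three copies of the repeated read are kept, while the unrelated read survives because none of its k-mers have been seen before.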

Results

Control

Category	Pooled Lanes	Quality Trimming	Digital Normalization
Paired-end	116784668	114103196	22821658
Orphan	NA	1252357	1624925
Total	116784668	115355553	24446583
% of Pool	NA	98.78	20.93

Condition 1

Category	Pooled Lanes	Quality Trimming	Digital Normalization
Paired-end	178456950	174412784	43012608
Orphan	NA	1926609	4014358
Total	178456950	176339393	47026966
% of Pool	NA	98.81	26.35

Condition 2

Category	Pooled Lanes	Quality Trimming	Digital Normalization
Paired-end	118787094	115437644	10536046
Orphan	NA	1584056	1142605
Total	118787094	117021700	11678651
% of Pool	NA	98.51	9.83

Condition 3

Category	Pooled Lanes	Quality Trimming	Digital Normalization
Paired-end	395429292	385020860	2145684
Orphan	NA	4954819	1420062
Total	395429292	389975679	3565746
% of Pool	NA	98.62	0.90

The amount that was removed from the Strep-treated metagenome is a little alarming, especially when you remember that it is a much more diverse community than, say, Cefoperazone (so it’s not just a byproduct of low species richness…). This sample keeps only 0.90% of its pooled reads after normalization, a fraction roughly an order of magnitude smaller than in any of the others. I’m not sure how well this will assemble now, but I can tell more once the job for mapping reads to contigs finishes. That will tell me how closely what I assemble resembles what I sequenced. If that’s no good, I’ll compare it to the assemblies without normalization and see if khmer is being too greedy. The samples were sequenced on 2 lanes simultaneously and then pooled, so it’s unlikely that both lanes sequenced poorly, especially when all other quality metrics are high.
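As a sanity check, the "% of Pool" rows in the tables above can be recomputed from the totals. The short Python sketch below does that; the numbers are copied straight from the tables, and the helper name percent_of_pool is just something I made up for the example.

```python
# Totals copied from the tables: (pooled lanes, after quality trimming, after normalization)
totals = {
    "Control": (116784668, 115355553, 24446583),
    "Condition 1": (178456950, 176339393, 47026966),
    "Condition 2": (118787094, 117021700, 11678651),
    "Condition 3": (395429292, 389975679, 3565746),
}


def percent_of_pool(pooled, remaining):
    """Share of the pooled reads still present, as a percentage rounded to 2 places."""
    return round(100 * remaining / pooled, 2)


for sample, (pooled, trimmed, normalized) in totals.items():
    print(sample, percent_of_pool(pooled, trimmed), percent_of_pool(pooled, normalized))
```

The recomputed values match the tables. The fraction left in the lowest sample (0.90%) is about a tenth of the next-lowest one (9.83%).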

December 06, 2015