After spending too much time writing a Python script to subsample reads from a fasta file, I found some pretty great awk scripts to do the job, courtesy of Umer Zeeshan Ijaz. They also finish in a fraction of the time, without all the overhead that comes with Python. The goal was to normalize the read files to one another before mapping to genes downstream; without this step, comparing groups is unfair to the samples that didn't have quite as many reads. The difference in coverage between samples was almost negligible, but I wanted to be thorough with my approach.

For paired-end reads you need to subsample before interleaving and then interleave afterwards. The script picks reads at random and always grabs each read's counterpart from the read 2 file, so mates stay together. I hadn't figured out how to insert bash variables into the output file names inside awk, so I rename the files immediately after the script finishes (a possible workaround is sketched after the code). Here is the code:

# Paired-end fasta
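# Linearize each mate file (header and sequence each on one line), paste read 1 and read 2 side by side,
# collapse each header/sequence pair onto a single tab-separated record, reservoir-sample k1 records at
# random, then split the picked pairs back into separate read 1 and read 2 files.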
paste <(awk '/^>/ {printf("\n%s\n",$0);next; }{printf("%s",$0);} END {printf("\n");}' < ${sample_name}.read1.trim.pool.fasta) <(awk '/^>/{printf("\n%s\n",$0);next; } { printf("%s",$0);} END{printf("\n");}' < ${sample_name}.read2.trim.pool.fasta) | awk 'NR>1{ printf("%s",$0); n++;if(n%2==0) { printf("\n");} else { printf("\t");} }' | awk -v k=${k1} 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3 > "temp.read1.fasta";print $2"\n"$4 > "temp.read2.fasta"}'
mv temp.read1.fasta ${sample_name}.read1.trim.pool.pick.fasta
mv temp.read2.fasta ${sample_name}.read2.trim.pool.pick.fasta
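
As for the renaming step, awk can take the sample name in through -v and build the output file names from it, which would make the mv commands unnecessary. Here is a minimal sketch of that idea, replacing only the last awk stage of the pipeline above (I haven't run this version; the variable name s is just for illustration):

# Hypothetical final stage: pass ${sample_name} in with -v and write straight to the final file names
awk -F"\t" -v s="${sample_name}" '{print $1"\n"$3 > (s".read1.trim.pool.pick.fasta"); print $2"\n"$4 > (s".read2.trim.pool.pick.fasta")}'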

These files are subsampled identically and are still interleave-ready. Similarly, you can subsample the orphaned reads from sickle with the following awk script:

# Single-end fasta
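# Linearize the orphan fasta, collapse each header/sequence pair onto one tab-separated record,
# reservoir-sample k2 records at random, then write the picks back out as a standard two-line fasta.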
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END{printf("\n");}' < ${sample_name}.trim.orphan.pool.fasta |  awk 'NR>1{ printf("%s",$0); n++; if(n%2==0) { printf("\n");} else { printf("\t");} }' | awk -v k=${k2} 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$2 > "temp.single.fasta"}'
mv temp.single.fasta ${sample_name}.trim.orphan.pool.pick.fasta

I subsampled all the files down to the lowest read count within each group of like read files.
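
For reference, one quick way to find that target depth is to count the headers in each pooled file and take the smallest value. A rough sketch of how k1 and k2 could be set (my own construction, assuming all the pooled files sit in the working directory):

# Smallest read count among the pooled read 1 files becomes k1 for the paired-end script
k1=$(grep -c "^>" *.read1.trim.pool.fasta | cut -d: -f2 | sort -n | head -n 1)
# Same idea for the orphan files and k2
k2=$(grep -c "^>" *.trim.orphan.pool.fasta | cut -d: -f2 | sort -n | head -n 1)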

These results use the same pipeline as of 12-6-15, but omit the digital normalization steps.

Control

# Input file name: Conventional.final.contigs.251.fa
# File type: Fasta
# Total sequences: 431204
# Total bases: 495.00 Mb
# Sequence N50: 1762
# Sequence L50: 28
# Sequence N90: 452
# Median sequence length: 560
# Interquartile range: 436
# Shortest sequence length: 251
# Longest sequence length: 420252
# Sequences > 1 kb: 88470
# Sequences > 5 kb: 10866
# G-C content: 48.67%
# Ns included: 0


# reads processed: 115355553
# reads with at least one reported alignment: 52825487 (45.79%)
# reads that failed to align: 62530066 (54.21%)

Condition 1

# Input file name: Cefoperazone.final.contigs.251.fa
# File type: Fasta
# Total sequences: 2001183
# Total bases: 1369.55 Mb
# Sequence N50: 709
# Sequence L50: 2143
# Sequence N90: 430
# Median sequence length: 583
# Interquartile range: 329
# Shortest sequence length: 251
# Longest sequence length: 306971
# Sequences > 1 kb: 256222
# Sequences > 5 kb: 912
# G-C content: 42.96%
# Ns included: 0

Condition 2

# Input file name: Clindamycin.final.contigs.251.fa
# File type: Fasta
# Total sequences: 467061
# Total bases: 252.32 Mb
# Sequence N50: 530
# Sequence L50: 1132
# Sequence N90: 398
# Median sequence length: 483
# Interquartile range: 175
# Shortest sequence length: 251
# Longest sequence length: 139769
# Sequences > 1 kb: 8790
# Sequences > 5 kb: 358
# G-C content: 45.29%
# Ns included: 0


# reads processed: 117021700
# reads with at least one reported alignment: 55021810 (47.02%)
# reads that failed to align: 61999890 (52.98%)

Condition 3

Unfortunately, there were some errors during assembly, so I'm repeating the Streptomycin assembly.

Assemblies of Great Prairie Soil Metagenome Grand Challenge Datasets

As a standard of comparison I used MEGAHIT's results from the Assemblies of Great Prairie Soil Metagenome Grand Challenge Datasets. The challenge was to assemble these extremely complex and historically difficult-to-assemble metagenomes from soil samples across the country. They posted a few quality metrics from those assemblies, and what I was able to get out of my data compares pretty favorably with them.

Wisconsin, Switchgrass soil metagenome reference core - MEGAHIT v1.0.3
Contig length cutoff (bp)	200
Number of contigs	4,783,921
Total size (bp)	2,589,776,155
Largest contig (bp)	65,654
N50 (bp)	550

Kansas Native Prairie soil metagenome reference core - MEGAHIT v1.0.1
Contig length cutoff (bp)	200
Number of contigs	47,985,585
Total size (bp)	27,878,228,835
Largest contig (bp)	246,919
N50 (bp)	629

Iowa Native Prairie soil metagenome reference core - MEGAHIT v1.0.1
Contig length cutoff (bp)	200
Number of contigs	10,986,357
Total size (bp)	7,175,891,749
Largest contig (bp)	313,620
N50 (bp)	738

On average, the N50s in my assemblies are 36% longer than those of the soil assemblies. Similarly, the average longest contig in my assemblies is 27% longer.

I should also note that I've been using MEGAHIT version 0.3.2, and the same assemblies are now running with the current version, 1.0.3. The developers say assembly quality has improved in the newer version, so I'll do that comparison in a future post.