I was recently talking with Geof Hannigan, a postdoc in our lab, and he pointed out a pretty critical piece of information I had been overlooking. When sequencing reads are long enough, you may sequence past the end of the fragment and into the 3’ adapter. This is especially problematic when you move on to assembly, because these introduced adapter sequences have 100% homology to one another and will assemble over the adapter instead of actual genomic sequence. Similar problems can arise in a couple of different ways, as shown in the figure below. These artifacts will leave you with contigs that don’t reflect the biology at all.

To avoid this, I am now using a program called Cutadapt (the source of the figure above) to remove any residual adapter sequences from my reads prior to assembly. The program requires Python 2.7, and the command to run it looks like this:

python2.7 /mnt/EXT/Schloss-data/bin/cutadapt-1.9.1/bin/cutadapt --error-rate=0.1 --overlap=10 -a CAAGCAGAAGACGGCATACGAGATTAAGGCGAGTCTCGTGGGCTCGG -A GACGCTGCCGACGAGCGATCTAGTGTAGATCTCGGTGGTCGCCGTATCATT -o read_1.trimmed.fastq -p read_2.trimmed.fastq read_1.fastq read_2.fastq

Each argument supplies the following info:

--error-rate = Maximum allowed mismatch rate when calling an adapter match; here 1 in 10 bases can be a mismatch
--overlap = Minimum number of overlapping bases required to call a match
-a = 5' sequencing primer full sequence (adapter + index), trimmed from read 1
-A = Reverse complement of the 3' sequencing primer, trimmed from read 2
-o = Output file name for read 1 following trimming
-p = Output file name for read 2 following trimming

The last two file names, given without option flags, are the input forward and reverse read files.
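To make `--error-rate` and `--overlap` concrete, here is a minimal sketch of 3’ adapter trimming with a mismatch allowance and a minimum overlap. This is a toy illustration of the idea only, not cutadapt’s actual algorithm (which uses a proper alignment):

```python
def trim_3prime_adapter(read, adapter, error_rate=0.1, min_overlap=10):
    """Toy 3' adapter trimmer: slide the adapter along the read and call a
    match when the overlapping bases disagree at no more than
    error_rate * overlap positions. Returns the trimmed read."""
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        mismatches = sum(1 for r, a in zip(read[start:start + overlap],
                                           adapter[:overlap]) if r != a)
        if mismatches <= error_rate * overlap:
            return read[:start]  # Everything from the match onward is adapter
    return read  # No adapter found; leave the read untouched

# A read that runs through the insert into the adapter gets trimmed back
print(trim_3prime_adapter("ACGTACGTACGT" + "CAAGCAGAAGACGG",
                          "CAAGCAGAAGACGGCATACGAGAT"))
```

With an `--error-rate` of 0.1 and an `--overlap` of 10, a terminal adapter fragment needs at least 10 matching-ish bases before it gets removed, which keeps short chance matches from eating into real sequence.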

In order to get the reverse complement of the 3’ adapter, I wrote a small Python script that reverses the order of a nucleotide sequence and then switches each base to its complement. Below is what it looks like, and I’ve hosted it on the GitHub page I made for scripts I talk about on my blog. This new adapter-trimming step is critical for both my metagenomic and metatranscriptomic pipelines, basically anything with some amount of random fragment size. Otherwise the adapter fragments left in the data will negatively impact any assembly or mapping you try to do downstream.

#!/usr/bin/env python

# Computes the reverse complement of a given nucleotide string
# USAGE: rev_comp.py input_sequence

import sys

def reverse_complement(nucleotides):

	seq_str = str(nucleotides).upper()
	rev_seq_str = seq_str[::-1]

	# Swap A and T, using 'n' as a placeholder so the
	# second replace doesn't undo the first
	pyr_seq_str = rev_seq_str.replace('A', 'n')
	pyr_seq_str = pyr_seq_str.replace('T', 'A')
	pyr_seq_str = pyr_seq_str.replace('n', 'T')

	# Swap C and G the same way
	final_seq_str = pyr_seq_str.replace('C', 'n')
	final_seq_str = final_seq_str.replace('G', 'C')
	final_seq_str = final_seq_str.replace('n', 'G')

	return final_seq_str

if __name__ == '__main__':
	print(reverse_complement(sys.argv[1]))
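As an aside, the placeholder trick above can be collapsed into a single translation table. This is an equivalent alternative using Python 3’s built-in `str.maketrans`, not the version I actually run in the pipeline:

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')

def reverse_complement(nucleotides):
    """Return the reverse complement of a nucleotide string:
    uppercase, complement each base, then reverse the order."""
    return str(nucleotides).upper().translate(COMPLEMENT)[::-1]

print(reverse_complement('AACG'))  # -> CGTT
```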


Below are some stats from assemblies of the same 250 bp paired-end HiSeq dataset, with and without the trimming protocol I laid out above.

Without 3’-adapter trimming:

Total sequences: 431204
Total bases: 495.00 Mb
Sequence N50: 1762
Sequence L50: 28
Sequence N90: 452
Median sequence length: 560
Interquartile range: 436
Shortest sequence length: 250
Longest sequence length: 420252
Sequences > 1 kb: 88470
Sequences > 5 kb: 10866

With adapter trimming:

Total sequences: 496917
Total bases: 471.39 Mb
Sequence N50: 1434
Sequence L50: 44
Sequence N90: 369
Median sequence length: 474
Interquartile range: 423
Shortest sequence length: 250
Longest sequence length: 201752
Sequences > 1 kb: 83535
Sequences > 5 kb: 10603

The assembly quality doesn’t seem to have changed dramatically; however, the best metric is really how many reads map back to the contigs you just assembled. Prior to this trimming, <49% of reads mapped at least once to a contig. Now >70% are mapping! That’s a huge improvement! The new contigs should reflect what I actually sequenced and allow for a better analysis from now on.
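For reference, the N50 and L50 values reported above can be computed from a list of contig lengths roughly like this (note that conventions differ between assembly-stats tools, and some swap the N50/L50 labels; I’m using the common definitions here):

```python
def n50_l50(contig_lengths):
    """Return (N50, L50). N50: the contig length at which contigs that
    long or longer hold at least half of the total assembled bases.
    L50: the number of contigs it takes to reach that halfway point."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2.0
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count

print(n50_l50([10, 8, 6, 4, 2]))  # -> (8, 2)
```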

I’ve been pretty excited to talk about another angle I’m working on for my project, which is to try to look at the nutrient niche of bacteria in the gut using metabolic models built from an organism’s genomic information. Inferring aspects of an organism’s ecology, and how it may impact other species in its environment, from high-throughput sequence data has been called Reverse Ecology.

I started this project after reading a pretty cool review from Roie Levy and Elhanan Borenstein. Their lab used a decomposition algorithm on metabolic networks to identify substrates or nutrients that a bacterium needs to obtain from its environment. They went on to use this information to accurately predict community assembly rules in mouth-associated bacterial communities. I used their methods and found that comparing substrate lists between species predicted in vitro competition more accurately than phylogenetic distance did. Here’s a little bit of the data:

Each point is a separate bacterial species competed against C. difficile in rich media. Competitive index refers to how much overlap there is between the two species’ metabolic networks. I didn’t follow this up much further so I could focus on other projects, but we may come back to it in the future.
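The core of the decomposition idea is to condense a directed metabolic network into its strongly connected components: components that no other component feeds into are candidate "seed" compounds the organism must acquire from its environment. Here is a small self-contained sketch of that step, loosely following the seed set framework rather than reproducing the Borenstein lab’s exact algorithm:

```python
from collections import defaultdict

def seed_compounds(edges):
    """Find candidate seed compounds of a directed metabolic network:
    condense it into strongly connected components (Kosaraju's algorithm)
    and return the components with no incoming inter-component edges."""
    nodes, graph, rgraph = set(), defaultdict(list), defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        rgraph[v].append(u)
        nodes.update((u, v))

    # Pass 1: record nodes in order of DFS finish time on the forward graph
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse finish order;
    # each sweep labels one strongly connected component
    comp = {}
    for root in reversed(order):
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            for nxt in rgraph[stack.pop()]:
                if nxt not in comp:
                    comp[nxt] = root
                    stack.append(nxt)

    # Seeds: components that no edge enters from another component
    has_incoming = {comp[v] for u, v in edges if comp[u] != comp[v]}
    members = defaultdict(set)
    for n in nodes:
        members[comp[n]].add(n)
    return [s for root, s in members.items() if root not in has_incoming]

# B and C form a cycle fed by A and D, so A and D are the seeds
print(seed_compounds([('A', 'B'), ('B', 'C'), ('C', 'B'), ('D', 'C')]))
```

Intuitively, anything the network can make internally sits in a component reachable from elsewhere, so only the source components need to be imported from the environment.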

What I’ve been working on more recently has been integrating transcriptomic data into genome-scale metabolic models. The methods I’m using are loosely based on some work I read out of Jens Nielsen’s lab. Their approach uses a bipartite network architecture with enzyme nodes connecting to substrate nodes. Since substrates connect directly only to enzyme nodes, you can map transcript abundance onto the respective enzymes and then make inferences about how in-demand the substrates they act on are. Here’s part of one network I’ve generated as an example:

After mapping transcripts to the enzyme (KEGG ortholog) nodes, you can get a read on how important the adjacent substrate nodes are. I’m extending this to infer the nutrient niche of species in the gut. We are working on the analysis and manuscript now so I’ll have a lot more to post soon.
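The mapping step boils down to spreading each enzyme’s transcript abundance onto the substrate nodes it touches. This sketch shows the idea with made-up KEGG ortholog IDs and abundances; the real analysis involves more normalization than this:

```python
def substrate_demand(enzyme_substrate_edges, transcript_abundance):
    """Score each substrate node by summing the transcript abundance of
    the enzyme (KEGG ortholog) nodes adjacent to it in the bipartite
    network. Higher scores suggest more in-demand substrates."""
    scores = {}
    for enzyme, substrate in enzyme_substrate_edges:
        scores[substrate] = (scores.get(substrate, 0)
                             + transcript_abundance.get(enzyme, 0))
    return scores

# Hypothetical bipartite edges and transcript counts
edges = [('K00001', 'glucose'), ('K00002', 'glucose'), ('K00002', 'acetate')]
abundance = {'K00001': 10, 'K00002': 5}
print(substrate_demand(edges, abundance))  # -> {'glucose': 15, 'acetate': 5}
```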

For now, here’s the R code I used to generate the network plot:

# Load igraph package
library(igraph)

# Define variables
file_name <- '~/bipartite.graph'
nodes_1 <- '~/compound.lst'
nodes_1_label <- 'Substrate'
nodes_2 <- '~/enzyme.lst'
nodes_2_label <- 'KEGG Ortholog'
figure_file <- '~/bipartite.scc.pdf'

# Read in data
graph.file <- read.table(file_name, header = F, sep = '\t')
node_group_1 <- as.vector(read.table(nodes_1, header = F, sep = '\t')$V1)
node_group_2 <- as.vector(read.table(nodes_2, header = F, sep = '\t')$V1)

# Format directed graph
raw.graph <- graph.data.frame(graph.file, directed = T)

# Remove loops and multiple edges to make visualization easier
simple.graph <- simplify(raw.graph)

# Decompose graph
all.simple.graph <- decompose.graph(simple.graph)

# Get largest component
largest <- which.max(sapply(all.simple.graph, vcount))
largest.simple.graph <- all.simple.graph[[largest]]

# Format data for plotting
V(largest.simple.graph)$size <- 3 # Node size
V(largest.simple.graph)$color <- ifelse(V(largest.simple.graph)$name %in% node_group_2, "blue", "red") # Color nodes
E(largest.simple.graph)$color <- 'gray15' # Color edges

# Plot the network
par(mar=c(0,0,0,0), font=2)
plot(largest.simple.graph, vertex.label = NA, layout = layout.graphopt,
     edge.arrow.size = 0.5, edge.arrow.width = 0.8, vertex.frame.color = 'black')
legend('bottomleft', legend=c(nodes_1_label, nodes_2_label), 
       pt.bg=c('red', 'blue'), col='black', pch=21, pt.cex=3, cex=1.5, bty = "n")