I’ve been pretty excited to talk about another angle I’m working on for my project, which is to try to
characterize the nutrient niche of bacteria in the gut using metabolic models built from an organism’s genomic
information. Inferring aspects of an organism’s ecology, and how it may impact other species in its environment,
from high-throughput sequence data has been called Reverse Ecology.
I started this project after reading a pretty cool review from Roie Levy and Elhanan Borenstein.
Their lab developed an approach that applies a decomposition algorithm to metabolic networks to identify the substrates, or nutrients,
that a bacterium needs to obtain from its environment. They went on to use this information to accurately predict community
assembly rules in mouth-associated bacterial communities. I used their methods
and found that comparing substrate lists between species predicted in vitro competition more accurately than phylogenetic
distance did. Here’s a little bit of the data:
Each point is a separate bacterial species competed against C. difficile in rich media, and Competitive Index refers to
how much overlap there is between the metabolic networks. I didn’t follow this up much further so I could focus on other projects,
but we may come back to it in the future.
What I’ve been working on more recently is integrating transcriptomic data into genome-scale metabolic models. The
methods I’m using are loosely based on some work I read out of Jens Nielsen’s lab.
Their approach uses a bipartite network architecture in which enzyme nodes connect to substrate nodes. Since substrates
connect directly only to enzyme nodes, you can map transcript abundance onto the respective enzymes and then make
inferences about how in-demand the substrates they act on are. Here’s part of one network I’ve generated as an example:
After mapping transcripts to the enzyme (KEGG ortholog) nodes, you can get a read on how important the adjacent substrate nodes are. I’m
extending this to infer the nutrient niche of species in the gut. We are working on the analysis and manuscript now so I’ll
have a lot more to post soon.
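To make the idea concrete, here is a minimal sketch of the substrate-scoring step. The KO IDs, substrate names, and transcript counts are made-up examples, and the real models are much larger:

```python
# Toy bipartite enzyme-substrate network: each enzyme (KEGG ortholog) node
# connects to the substrates it acts on. All IDs and counts are hypothetical.
enzyme_to_substrates = {
    "K00001": ["ethanol", "acetaldehyde"],
    "K00128": ["acetaldehyde", "acetate"],
}
transcript_counts = {"K00001": 120, "K00128": 45}

# Score each substrate by summing transcript abundance over the enzyme
# nodes it is directly connected to: a rough proxy for demand.
substrate_scores = {}
for ko, substrates in enzyme_to_substrates.items():
    for substrate in substrates:
        substrate_scores[substrate] = (
            substrate_scores.get(substrate, 0) + transcript_counts.get(ko, 0)
        )

print(substrate_scores)
# acetaldehyde is adjacent to both enzymes, so it scores highest
```

Because the network is bipartite, every substrate score is just a sum over its enzyme neighbors, which is why mapping transcripts to the enzyme layer is enough to rank the substrate layer.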
For now, here’s the R code I used to generate the network plot:
The columns represent the target name, the length of the target sequence, the number of mapped reads, and the number of unmapped partner sequences after counterpart matching. The only
ones we care about right now are the name of the target and how many reads mapped to it. The problem is that reporting the gene name alone doesn’t really
help anyone; with more than about 10 genes in a figure, it gets way too dense. To simplify reporting the results, it helps to label each gene with the pathway
it is a part of. This reduces the number of possible categories and makes it easier to label figures like the linear correlations in Franzosa et al. (2014).
An important note: the reference fastas you will be mapping to are not in a great format. You’ll lose a lot of important information when bowtie builds the database,
because it splits each sequence name on whitespace and keeps only the first element. The original file looks like this:
>cdf:CD630_00010 dnaA; chromosomal replication initiation protein
atggatatagtttctttatgggacaaaaccctacaattaataaaaggtgacttaacttca
gtaagttttaataccttttttaaaaacatcgtaccattaaaaatacatcttaatgattta
atcttactggctcctagtgattttaataaagatatcttagagaatagatatttacattta
...
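One workaround is to parse the original fasta yourself and keep a gene-ID-to-description map that can be re-attached downstream. This is just a sketch of that idea; the underscore-joining mirrors the format of the annotated output shown later:

```python
def build_name_map(fasta_lines):
    """Map gene ID -> full underscored description from KEGG fasta headers.

    bowtie's index keeps only the first whitespace-delimited token of each
    header, so we recover the rest of the annotation from the source file.
    """
    name_map = {}
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name_map[fields[0]] = "_".join(fields[1:])
    return name_map

# Example header taken from the file shown above.
demo = [
    ">cdf:CD630_00010 dnaA; chromosomal replication initiation protein",
    "atggatatagtttctttatgggacaaaaccctacaattaataaaaggtgacttaacttca",
]
names = build_name_map(demo)
print(names["cdf:CD630_00010"])
# dnaA;_chromosomal_replication_initiation_protein
```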
The first step in the process is connecting the gene IDs to the relevant pathway information for each. The KEGG reference files are large and cumbersome, so the easiest
way to handle them repeatedly is to turn them into manageable python dictionaries and then pickle them so they only need to be constructed once. In case you were wondering,
a pickle in python is where you convert an object hierarchy into a byte stream. This is great because you can then save it as a file and open it up later as an already-formed
python data structure. A guide can be found here.
We need one translation from gene code to pathway code and another from pathway code to pathway category. Here’s my code to create pickles of both KEGG reference files:
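I can’t reproduce the exact parsing here, but a minimal sketch of the build-and-pickle step might look like this, assuming a simple tab-separated key/value layout (the real KEGG reference files need more parsing than this):

```python
import pickle

def build_mapping(lines):
    """Parse 'key<TAB>value' lines into a dict (layout is an assumption)."""
    mapping = {}
    for line in lines:
        key, value = line.rstrip("\n").split("\t")[:2]
        mapping[key] = value
    return mapping

def save_pickle(obj, path):
    """Pickle an object so it only has to be constructed once."""
    with open(path, "wb") as outfile:
        pickle.dump(obj, outfile)

# Hypothetical usage: one pickle per translation table, using example
# entries that match the annotated output further down.
gene_to_pathway = build_mapping(["cdf:CD630_00040\t03440"])
save_pickle(gene_to_pathway, "gene_to_pathway.pkl")
pathway_to_category = build_mapping(["03440\tHomologous recombination"])
save_pickle(pathway_to_category, "pathway_to_category.pkl")
```

The payoff is that later scripts just `pickle.load()` the finished dictionaries instead of re-parsing the raw reference files every run.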
You might notice that during the construction of the second, larger dictionary, I create a set containing gene IDs and reference it iteratively.
Sets are unordered collections of unique elements. Since they are not indexed, have no order, and each element
appears only once, they are great for membership checking. If I were to check membership using just the dictionary with something like…
A script containing this would take several orders of magnitude more time to complete than doing the membership test with a set instead.
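To illustrate the difference, the slow pattern is a linear scan, for example over a list of the dictionary’s keys (which is what `dict.keys()` returned in Python 2), while the set version is a single hash lookup:

```python
# Membership testing: scanning a list of keys is O(n) per lookup, while a
# set (or the dict itself, in modern Python) is O(1) on average.
gene_ids = ["gene%06d" % i for i in range(100000)]
gene_list = list(gene_ids)   # slow: every lookup walks the list
gene_set = set(gene_ids)     # fast: every lookup is one hash probe

query = gene_ids[-1]           # worst case for the list: the last element
in_list = query in gene_list   # linear scan all the way to the end
in_set = query in gene_set     # constant-time hash lookup
```

Done once the difference is negligible, but repeated inside a loop over every read or gene, the linear scans dominate the runtime.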
The next step is to run the code that uses the libraries I just made and annotates the bowtie output to be a little more useful. It looks like this:
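Since I can’t paste the full script here, this is a rough sketch of the annotation step. The row layout and dictionary names are assumptions based on the description above, and the tagged placeholders stand in for annotations that are missing from the reference files:

```python
def lookup(table, key, error_tag):
    """Return table[key], or a tagged placeholder on a KeyError."""
    try:
        return table[key]
    except KeyError:
        return error_tag

def annotate(rows, gene_to_ko, ko_to_pathway, pathway_to_category):
    """Chain the lookups: gene -> KO -> pathway -> pathway category."""
    annotated = []
    for reads, gene_id, gene_name in rows:
        ko = lookup(gene_to_ko, gene_id, "ko_key_error")
        pathway = lookup(ko_to_pathway, ko, "path_key_error")
        category = lookup(pathway_to_category, pathway, "metadata_key_error")
        annotated.append([reads, gene_id, gene_name, ko, pathway, category])
    return annotated

# Hypothetical lookup tables covering a single well-annotated gene.
gene_to_ko = {"cdf:CD630_00040": "K03629"}
ko_to_pathway = {"K03629": "03440"}
pathway_to_category = {"03440": "Homologous recombination"}
rows = [[41, "cdf:CD630_00040", "recF;_DNA_replication_and_repair_protein_RecF"]]
print(annotate(rows, gene_to_ko, ko_to_pathway, pathway_to_category)[0])
```

Because each failed lookup feeds its error tag into the next one, a gene with no KO assignment cascades into placeholder values for the pathway and category columns too.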
Through trial and error, I learned that you need to account for key errors in case a lookup doesn’t find something in a dictionary. This shouldn’t happen in this instance,
but it’s good to be safe. The final output that will be made into a figure looks like this:
108 cdf:CD630_00010 dnaA;_chromosomal_replication_initiation_protein K02313 02020 Two-component system Environmental Information Processing Signal transduction
59 cdf:CD630_00020 dnaN;_DNA_polymerase_III_subunit_beta_(EC:2.7.7.7) ko_key_error path_key_error metadata_key_error metadata_key_error metadata_key_error
6 cdf:CD630_00030 RNA-binding_mediating_protein K14761 path_key_error metadata_key_error metadata_key_error metadata_key_error
41 cdf:CD630_00040 recF;_DNA_replication_and_repair_protein_RecF K03629 03440 Homologous recombination Genetic Information Processing Replication and repair
The columns are: normalized read count, then gene code, gene name, KEGG ortholog, pathway code, and some additional classifications. As you can see, there are a couple
of key errors, but this is because the annotation of some genes is incomplete in the KEGG reference files.
I’ll be posting some figures, along with the R code I used to make them, pretty soon hopefully.