The columns represent: Target name, length of target sequence, number of mapped reads, number of unmapped partner sequences after counterpart matching. The only
ones we care about right now are the name of the target and how many reads mapped to it. The problem is, reporting the gene name doesn’t really
help anyone because if you report more than 10 in a figure it gets way too dense. To simplify reporting the results, it helps to label each gene with the pathway
it is a part of. This reduces the number of possible categories and makes it easier to label figures like the linear correlations in Franzosa et al., (2014).
An important note is that the reference fastas that you will be mapping to are not in a great format. You’ll lose a lot of important information when bowtie makes the database
as it splits the name on whitespace and only takes the first element. The original file looks like this:
1
2
3
4
5
>cdf:CD630_00010 dnaA; chromosomal replication initiation protein
atggatatagtttctttatgggacaaaaccctacaattaataaaaggtgacttaacttca
gtaagttttaataccttttttaaaaacatcgtaccattaaaaatacatcttaatgattta
atcttactggctcctagtgattttaataaagatatcttagagaatagatatttacattta
...
The first step in the process of connecting the gene IDs to the relevant pathway information for each. The KEGG reference files are large and cumbersome, so the easiest
way to handle them repeatedly is the turn them into manageable python dictionaries and then pickle them so they only need to be actually constructed once. In case you were wondering,
a pickle in python is where you convert an object hierarachy into a byte stream. This is great because then you can save it as a file and open it up later as an already
formed python data structure. A guide can be found here.
The files we need to use are going to be used for the gene code to pathway code and pathway code to pathway category translation. Here’s my code to create pickles of both KEGG reference files:
You might notice that during the construction of the second, larger dictionary that I create a set containing gene IDs and reference it iteratively.
Specifically, sets are unordered collections of unique elements. Since they are not indexed, have no order, and each element
appears only once, they are great for membership checking. If I were to use just the dictionary to check membership using something like…
The run time of a script containing this would take several orders of magnitude more time to complete that doing the membership test using a set instead.
The next step is to run the code that will use the libraries I just made and annotate the bowtie output to be a little more useful. It looks like this:
Through trial and error I learned that you need to account for key errors just in case it doesn’t find something in a dictionary. This shouldn’t happen in this instance
but just to be sure. The final output that will be made into figure looks like this:
1
2
3
4
108 cdf:CD630_00010 dnaA;_chromosomal_replication_initiation_protein K02313 02020 Two-component system Environmental Information Processing Signal transduction
59 cdf:CD630_00020 dnaN;_DNA_polymerase_III_subunit_beta_(EC:2.7.7.7) ko_key_error path_key_error metadata_key_error metadata_key_error metadata_key_error
6 cdf:CD630_00030 RNA-binding_mediating_protein K14761 path_key_error metadata_key_error metadata_key_error metadata_key_error
41 cdf:CD630_00040 recF;_DNA_replication_and_repair_protein_RecF K03629 03440 Homologous recombination Genetic Information Processing Replication and repair
The columns are: normalized read # followed by gene code, gene name, KEGG ortholog, pathway code, then some additional classifications. As you can see there are a couple
of key errors, but this is due to the fact that the annotation of some genes is incomplete in the KEGG reference files.
I’ll be posted some figures with some R code I used to make them pretty soon hopefully.
We were recently asked to contribute a review to Current Opinion in Microbiology on computational approaches that have been used to under...… Continue reading