How to export a table based on assigned reads for functional gene classifications?

This option is available in the Taxonomy viewer, where the user can export data to a text file based on either Summed or Assigned reads.

In the functional gene viewers (e.g. InterPro/SEED/COG), there is an option to select all nodes with assigned reads in the viewer (which works correctly), but there is no option to export such data. The text file will always be based on the Summed reads and will therefore include duplicate counts, as the tree contains several levels (i.e. branch points) with assigned reads, e.g. for InterPro and SEED. For example, choosing “select all leaves” in the tree viewer for InterPro or SEED will not include all assigned reads, as there are also assigned reads higher up in the tree.

Here’s an example of what it can look like (see uploaded image).

The image shows the results when “Select -> Has Assigned” is chosen from the menu.

From the InterPro viewer you can only export Summed counts (no option for Assigned).
The nodes that are not leaves will therefore be exported as Sums, and include duplicate counts as it will also export these counts for the leaves.

Here’s an example of what this looks like after you export the data:

All rows marked light red are duplicate names.

In this case “IPR005667 Sulphate ABC transporter permease protein 2” (marked yellow) will include the Summed counts for two proteins (3801 counts): IPR011865 and IPR011865 (marked green). And it will also include counts assigned to this node (55 counts).

That means that if you SUM the “IPR005667 Sulphate ABC transporter permease protein 2” rows in Excel, you will get duplicate count data.
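If an export did contain Summed counts for every node, the assigned counts could in principle be recovered by subtracting the children's totals from each parent. Below is a minimal sketch of that arithmetic, not MEGAN's own code. It assumes every read sits on exactly one node and every node appears only once in the tree. The parent total of 3856 is 3801 + 55 from the example above; the child names and the split of 3801 between the two children are hypothetical.

```python
# Sketch: recover "assigned" counts from "summed" counts in an exported tree.
# Assumption: summed(node) = assigned(node) + sum of summed(children),
# i.e. no read is shared between nodes.

def assigned_from_summed(summed, children):
    """Return the assigned count per node, given summed counts and a child map."""
    return {
        node: total - sum(summed[c] for c in children.get(node, []))
        for node, total in summed.items()
    }

# Parent total from the post: 3801 (children) + 55 (assigned) = 3856.
# "child_a"/"child_b" and the 3000/801 split are hypothetical placeholders.
summed = {"IPR005667": 3856, "child_a": 3000, "child_b": 801}
children = {"IPR005667": ["child_a", "child_b"]}

assigned = assigned_from_summed(summed, children)

print(assigned["IPR005667"])    # 55
print(sum(summed.values()))     # 7657: the children's reads are counted twice
print(sum(assigned.values()))   # 3856: each read is counted once
```

Adding up the summed column gives 7657, while adding up the assigned column gives the true total of 3856, which is the double counting described above.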

You can also see that not all rows are marked light red, so this does not affect every InterPro classification. In the image, the columns hold the different levels of the InterPro classification (InterPro1, InterPro2, etc.). Duplicate rows also occur at deeper levels such as InterPro2, which then sums some of the counts from the InterPro3 column. Because we cannot export assigned reads from the InterPro viewer in MEGAN (or from any other functional classification viewer), it is very difficult to work with and plot this data. It will also distort any normalization applied to the data after exporting it from MEGAN.

I think MEGAN actually exported the assigned counts. I double-checked this just now in the MEGAN message window directly after the export (it says it exported assigned counts). I also checked by plotting the data and comparing the assigned counts for specific intermediary nodes with those in the exported text file. So this must have been a mistake and a misunderstanding on my side.

Most assignments to functional classes occur at the leaves of a functional tree. That is the reason why I don’t distinguish between “assigned” and “summed” in the functional viewers.

The same leaf label can occur multiple times, for example, the same KO can appear in multiple different pathways in KEGG. When computing the summed numbers up the tree, each read count is only used once, even if the same leaf node carrying that count appears more than once.

Here is an example:

In a KEGG analysis, I see this summed count:

“Energy metabolism” 19085

There are seven children of the node “Energy metabolism” with these summed counts:

“ko00190 Oxidative phosphorylation” 4366
“ko00195 Photosynthesis” 891
“ko00710 Carbon fixation in photosynthetic organisms” 3458
“ko00720 Carbon fixation pathways in prokaryotes” 6348
“ko00680 Methane metabolism” 4298
“ko00910 Nitrogen metabolism” 2345
“ko00920 Sulfur metabolism” 1107

If you add all the counts for the children together, then the result is

4366+891+3458+6348+4298+2345+1107 = 22813

The children add up to 22813, which is 3728 more than the 19085 reported for the parent. This illustrates that for “Energy metabolism”, each read is counted only once, even if it appears multiple times under different children.
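The counting rule can be pictured as a set union over the reads under each child. The sketch below is only an illustration of that idea, not MEGAN's implementation; the pathway names and read IDs are hypothetical. The last lines repeat the arithmetic for the KEGG figures quoted above.

```python
# Sketch: a parent's summed count as the size of the union of its children's reads.
# Pathway names and read IDs are hypothetical.
children_reads = {
    "pathway_A": {"r1", "r2", "r3"},
    "pathway_B": {"r3", "r4"},         # r3 also appears under pathway_A
    "pathway_C": {"r2", "r4", "r5"},   # r2 and r4 appear elsewhere too
}

# Naive approach: add up the children's counts (shared reads counted repeatedly).
naive_total = sum(len(reads) for reads in children_reads.values())

# Union approach: each read contributes once to the parent.
parent_summed = len(set().union(*children_reads.values()))

print(naive_total)     # 8
print(parent_summed)   # 5

# The same comparison for the KEGG counts quoted in the post.
kegg_children = [4366, 891, 3458, 6348, 4298, 2345, 1107]
kegg_parent = 19085
print(sum(kegg_children))                # 22813
print(sum(kegg_children) - kegg_parent)  # 3728 counts that come from reads shared between children
```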

Does that make sense and do you find that this works as intended?

Hi Daniel,

Yes, that makes sense. Sorry if my post was confusing; I was thinking more about how MEGAN deals with reads assigned to parent nodes in the functional viewers. In InterPro especially, there are lots of reads assigned to parent nodes. If these are exported as Summed, then I think this would be an unintentional double count, as the child nodes are also exported as Summed (i.e. their counts are already included in the count exported for the parent node they belong to)?

In any case, it seems the counts for the parent nodes were exported as assigned, since MEGAN reported in the message window that the data was exported as assigned, so I don’t think this is a problem. Maybe this works differently in the KEGG viewer; I use the MEGAN community version.