Plugin for PIA output into MEGAN or non-redundant/incomplete database MEGAN compensation feature

I was curious about the possibility of creating a plugin for MEGAN that can generate a kind of pseudo-rma6 file from the summary output of PIA. The rationale is that I’m observing a number of potential false-positive hits in MEGAN (namely plants) that I suspect are driven in part by database incompleteness, as similarly reported recently in the PIA paper. I like the MEGAN LCA approach, but would be interested in being able to visualize both programs’ taxon node classifications together in a MEGAN comparison file, as a kind of ensemble means of support for a node’s classification (in theory, nodes classified the same way by both PIA and MEGAN would have the best support as being ‘real’). Alternatively, is there a means of incorporating an additional PIA-like approach for dealing with database incompleteness and/or non-redundant databases directly in MEGAN, to help mitigate the database limitations that can drive inaccurate classifications?



Sorry, I should clarify: I meant to write a means of compensating for redundant databases, not ‘non-redundant’ as I wrote above.

Thank you for pointing out the PIA paper.

MEGAN allows import of results in a number of different CSV formats. If none of these are suitable, then please send me a typical output file for PIA and I will write an importer for it.

I took a look at the PIA paper and its ideas, and I think it would be good to incorporate the algorithm, or parts of it, into MEGAN. I will look into this further…


Thanks Daniel! I tried using the CSV importer to import the file into MEGAN, but I haven’t been successful thus far. I’ve uploaded an example of the PIA output here.
PIA-example-output.fasta.header_out.intersects.txt (1.4 MB)

You can use the Linux `sed` program to create a new CSV file, which can then be imported using MEGAN’s File->Import->Text File menu item.

For example, this call extracts the taxon id provided as “phylogenetic range”:

sed "s/Query: \(.*\), top hit:.*range: [a-zA-Z0-9. ]* (\([0-9]*\)).*/\1,\2,50/ "

whereas this extracts the taxon id provided as “phylogenetic intersection”:

sed "s/Query: \(.*\), top hit:.*phylogenetic intersection: [a-zA-Z0-9. ]* (\([0-9]*\)).*/\1,\2,50/ "

In both cases, the command extracts the read name and the taxon id, appends a fake bitscore of 50, and writes the result in comma-separated format: read,taxon-id,50.

For example, the line

Query: SNL153:253:h5jm7bcx3:1:2111:13277:53133, top hit: cellular organisms (131567), expect: 7.71e-08, identities: 100.000, next hit: Agrobacterium sp. ATCC 31749 (82789), last hit: Agrobacterium sp. ATCC 31749 (82789), taxon count: 2, phylogenetic range: cellular organisms (131567), raw hit count: 38, taxonomic diversity (up to cap if met): 30, taxonomic diversity score: 0.0029, phylogenetic intersection: cellular organisms (131567)

is transformed into

SNL153:253:h5jm7bcx3:1:2111:13277:53133,131567,50

by both the first and second sed command.
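
Since the two commands differ only in the field name, they can be wrapped in one small shell function. This is just a sketch: the function name `pia_to_csv` is made up for illustration and is not part of PIA or MEGAN, and the bitscore of 50 is the same placeholder as above.

```shell
#!/bin/sh
# Convert PIA summary lines on stdin to "read,taxon-id,50" lines on stdout.
# $1 is the name of the field whose taxon id should be extracted.
# pia_to_csv is an illustrative name, not part of PIA or MEGAN.
pia_to_csv() {
    field="$1"
    sed "s/Query: \(.*\), top hit:.*${field}: [a-zA-Z0-9. ]* (\([0-9]*\)).*/\1,\2,50/"
}

# The example line quoted above.
line='Query: SNL153:253:h5jm7bcx3:1:2111:13277:53133, top hit: cellular organisms (131567), expect: 7.71e-08, identities: 100.000, next hit: Agrobacterium sp. ATCC 31749 (82789), last hit: Agrobacterium sp. ATCC 31749 (82789), taxon count: 2, phylogenetic range: cellular organisms (131567), raw hit count: 38, taxonomic diversity (up to cap if met): 30, taxonomic diversity score: 0.0029, phylogenetic intersection: cellular organisms (131567)'

# One call per field; each prints a single CSV line for this read.
printf '%s\n' "$line" | pia_to_csv "phylogenetic range"
printf '%s\n' "$line" | pia_to_csv "phylogenetic intersection"
```

On a real file the call would look something like `pia_to_csv "phylogenetic intersection" < PIA-example-output.fasta.header_out.intersects.txt > intersection.csv`, where the output file name is arbitrary. Writing one CSV per field keeps the two taxon assignments as separate imports.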

This shows you how to set up the import command and what the resulting tree looks like:

Perfect, that works great! Thanks so much for your help.

Just as a quick aside, I had an error with a couple of reads where the hit was to:

phylogenetic intersection: Poeae Chloroplast Group 2 (Poeae type) (1652081)

In this case, the double parenthetic statement was creating a misformatted line when using the sed command from above, and then those hits to ‘Poeae Chloroplast Group 2’ weren’t getting counted. In case this is of use to others, I modified that command slightly and it seems to work fine now. I barely understand how sed works to be honest, so I’m happy that this small modification sorts out that problem, haha:

sed "s/Query: \(.*\), top hit:.*phylogenetic intersection: [a-zA-Z0-9. ]* (\([0-9]*\)).*/\1,\2,50/ "
#original sed command from previous post

sed "s/Query: \(.*\), top hit:.*phylogenetic intersection: [a-zA-Z0-9(). ]* (\([0-9]*\)).*/\1,\2,50/ "
#added parentheses to fix taxon nodes with double parenthetic statements
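
In case it helps anyone else: sed passes a line through unchanged when the pattern doesn't match, so counting the lines that still start with "Query: " after the conversion is a quick way to spot taxon names the character class doesn't cover. A rough sketch, with made-up function names and a shortened test line (the read name is invented, the taxon is the one from this post):

```shell
#!/bin/sh
# Run the sed substitution from this thread with a caller-supplied
# character class for the taxon name. Function names are illustrative only.
intersect_to_csv() {
    class="$1"
    sed "s/Query: \(.*\), top hit:.*phylogenetic intersection: [${class}]* (\([0-9]*\)).*/\1,\2,50/"
}

# Count lines sed left untouched: converted lines no longer start with "Query: ".
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_unconverted() {
    grep -c '^Query: ' || true
}

# Shortened test line with an invented read name; the taxon is from this post.
line='Query: read1, top hit: Poeae Chloroplast Group 2 (Poeae type) (1652081), phylogenetic intersection: Poeae Chloroplast Group 2 (Poeae type) (1652081)'

printf '%s\n' "$line" | intersect_to_csv 'a-zA-Z0-9. '   | count_unconverted   # original class: prints 1
printf '%s\n' "$line" | intersect_to_csv 'a-zA-Z0-9(). ' | count_unconverted   # with parentheses: prints 0
```

A non-zero count on the real file would mean some taxon names contain other characters that still need adding to the class.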

But yeah, if there was a means of incorporating elements of a PIA-like algorithm into MEGAN in the future, that would be helpful for dealing with incomplete and redundant databases.