Parsing GI | Reference Number in CSV/Text file


I have a counts table, where the first column contains the GI number and Accession number in the format gi|753197404|ref|WP_041503856.1|

Is it possible to import such a file into MEGAN and map these to SEED/KEGG?


I think that MEGAN (still) supports parsing of RefSeq identifiers during import. It would be easy to support accessions or GI numbers, but I won’t have time to look into that for a while.
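In the meantime, one possible workaround is to pre-process the counts table so that the first column holds only the accession (or only the GI number). The following Python is a minimal sketch of that idea and is not part of MEGAN. The function names `split_gi_ref` and `rewrite_first_column` are made up for illustration, the counts in the usage line are placeholder values, and whether MEGAN then maps the bare accession is an assumption to verify on your own data.

```python
import csv
import re

# Matches NCBI-style identifiers such as gi|753197404|ref|WP_041503856.1|
GI_REF = re.compile(r"^gi\|(\d+)\|ref\|([^|]+)\|?$")


def split_gi_ref(identifier):
    """Return (gi_number, accession), or None if the identifier does not match."""
    match = GI_REF.match(identifier.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


def rewrite_first_column(lines, keep="accession"):
    """Yield CSV rows with the first column replaced by the accession or GI number.

    Rows whose first column does not match the gi|...|ref|...| pattern
    are passed through unchanged.
    """
    for row in csv.reader(lines):
        if not row:
            continue  # skip empty lines
        parsed = split_gi_ref(row[0])
        if parsed is not None:
            gi, accession = parsed
            row[0] = accession if keep == "accession" else gi
        yield row


# Usage with a placeholder row (the counts 12 and 7 are invented):
sample = ["gi|753197404|ref|WP_041503856.1|,12,7"]
for row in rewrite_first_column(sample):
    print(",".join(row))
```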

While GI and RefSeq IDs were covered above, what is the proper format for InterPro IDs and KEGG IDs when importing a CSV with counts?
I have used the following test data, and it looks like MEGAN is picking up the IDs but not importing the counts.


The output from the Messages window is below:

Importing summary of INTERPRO2GO assignments from CSV file
Specified line format: classname,count{,count,count…}
Loading 11,294
Loading interpro2go.tre: 26,787
Number of lines read: 4
Different INT. classes identified: 4
Different Tax. classes identified: 0
done (0 reads)

I have looked into this: none of the IPR numbers that you list occur in the InterPro2GO tree that MEGAN currently uses.

This may be because MEGAN’s viewer is focused on IPR families. For example, IPR005821 corresponds to a Domain (Ion transport domain), not a family.

An example of something that is contained in MEGAN is IPR022370, which corresponds to the Phosphosulpholactate synthase family.

MEGAN silently swallows IPRs that don’t appear in its classification tree; I have added code to warn about this.
I have also added code that allows the user to tell MEGAN how to parse identifiers such as GI numbers or accession numbers, if desired. This will all be available in 6.7.1 soon.
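
Until then, a pre-import check along the same lines can be done outside MEGAN. The Python below is only a sketch under stated assumptions: it expects lines in the `classname,count{,count,count…}` format shown in the Messages output, and it assumes you can supply a set of IDs that you know are in the classification (`known_ids` is a hypothetical, user-provided set; how to extract it from MEGAN's tree file is not covered here). The function name `find_unknown_classes` and the counts in the example rows are invented for illustration.

```python
import csv


def find_unknown_classes(lines, known_ids):
    """Return class names from 'classname,count{,count...}' lines not in known_ids.

    Each unknown name is reported once, in order of first appearance.
    """
    unknown = []
    for row in csv.reader(lines):
        if not row or not row[0].strip():
            continue  # skip empty lines and rows without a class name
        name = row[0].strip()
        if name not in known_ids and name not in unknown:
            unknown.append(name)
    return unknown


# Hypothetical known set: IPR022370 is mentioned above as present in MEGAN.
known_ids = {"IPR022370"}
# Placeholder rows; the counts 10 and 5 are invented.
rows = ["IPR005821,10", "IPR022370,5"]

for name in find_unknown_classes(rows, known_ids):
    print(f"warning: {name} not found in classification, its counts may be ignored")
```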