I have been working with MEGAN6 to analyze protein sequence data that was aligned with DIAMOND. The .daa file was obtained with the format 100 command. The database of reference protein sequences was a customized set of ~2900 sequences obtained from the Fungene repository, for a functional gene. My goal is to get a weighted LCA taxonomic output in MEGAN6. However, the default NCBI protein mapping file does not seem to be compatible with the .daa file from DIAMOND: over 90% of hits are reported as unassigned. I tried to add the customized Fungene database as a synonyms list, in the form of a tab-delimited txt file with one column being the protein accession number and the other column the GI number. Unfortunately, that did not change the number of unassigned hits. Are there any options to turn my customized Fungene database (essentially a fasta file with the protein accession no. in the header of each protein sequence) into a functional MEGAN6 mapping file? Or am I making a mistake in the way I set up the synonyms list?
For taxonomic binning, you need to produce a mapping file that maps Fungene accessions to taxon ids.
Functional binning would require a mapping of Fungene identifiers onto e.g. InterPro families.
On the other hand, Fungene has long been on my list of things to look into, so if you could give me access to a small example file, then that would give me something to work toward.
Thanks so much, Daniel. Can I send you my current Fungene ref sequence data file and the file that links accession no. to GI? Does a GI number count as a valid taxon id? I believe Fungene uses the normal protein accession no. as identifier. Here is the format they provide:
I am still waiting to move forward from here. I can imagine that customizing mapping files would be useful for many users, so if you could comment on your experience or clarify the previous comments (@Daniel), that would be highly appreciated. Are there any plans to implement Fungene-derived mapping files soon? That would be super cool!
A mapping file can be a text file in which each line has two entries, a string and a number, separated by a tab.
MEGAN looks for a string contained in the reference sequence header line. If the string is found, the reference sequence is mapped to the corresponding number.
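To make the file format concrete, here is a minimal Python sketch that checks whether lines follow the "string, tab, number" layout described above. The function name parse_mapping_lines is made up for this illustration and is not part of MEGAN; the stricter checks (exactly two fields, ASCII digits only) are my assumption about what a clean file should look like.

```python
def parse_mapping_lines(lines):
    """Parse 'string<TAB>number' lines into a dict, skipping blank lines.

    Hypothetical helper for checking a mapping file before giving it
    to MEGAN; the strictness of the checks is an assumption.
    """
    mapping = {}
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore empty lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 tab-separated entries, got {len(fields)}"
            )
        key, value = fields[0].strip(), fields[1].strip()
        if not key or not (value.isascii() and value.isdigit()):
            raise ValueError(f"line {lineno}: expected 'string<TAB>number'")
        mapping[key] = int(value)
    return mapping
```

A line that uses spaces instead of a tab raises a ValueError, which is an easy mistake to make when the file is edited by hand.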
What does it mean to be contained in the reference sequence header?
This depends on whether you supply the mapping file to MEGAN as an “accession” mapping file or a “synonyms” mapping file.
In the former case, MEGAN looks for something like ref| and then grabs the word that comes after that and tries to match that to your mapping file. In the latter case, MEGAN breaks the header string into words and tries to match one of the words to your mapping file.
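The difference between the two modes can be illustrated with a rough Python sketch. This is not MEGAN's actual code: the function names are invented, and treating a "word" as a whitespace-separated token (and the token after ref| as ending at the next | or space) is an assumption made only for this illustration.

```python
import re


def lookup_accession_style(header, mapping):
    """Try the first word, then the word following 'ref|' (assumed rules)."""
    text = header.lstrip(">").strip()
    candidates = []
    if text:
        candidates.append(text.split()[0])  # very first word of the header
    match = re.search(r"ref\|([^|\s]+)", text)
    if match:
        candidates.append(match.group(1))  # word that comes after 'ref|'
    for candidate in candidates:
        if candidate in mapping:
            return mapping[candidate]
    return None


def lookup_synonym_style(header, mapping):
    """Try every whitespace-separated word of the header (assumed rules)."""
    for word in header.lstrip(">").split():
        if word in mapping:
            return mapping[word]
    return None
```

With mapping = {"KZC29526": 666}, a header such as ">some protein KZC29526" is matched only by the synonym-style lookup in this sketch, while ">KZC29526 some protein" is matched by both.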
You can also write taxonomic identifiers etc directly into the header lines of your reference sequences and then use the “use id parsing” feature to grab the identifiers directly from the references.
So, for your example, you need to write
KZC29526 <tab> 666
in the mapping file to map reference KZC29526 to taxon number 666.
If the string to be matched is the very first string on the header line, then you can supply the mapping file either as an accession mapping file or as a synonyms mapping file, because in both cases the very first word is considered for matching.
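For a fasta file where the accession is the first word of each header, the mapping lines could be generated along these lines. This is only a sketch: build_mapping_lines is a hypothetical helper, and the accession-to-taxon-id table is assumed to come from some source you already have (it has to hold taxon ids, not GI numbers).

```python
def build_mapping_lines(fasta_lines, accession_to_taxid):
    """Return (mapping_lines, missing_accessions) for a fasta file.

    fasta_lines: iterable of lines from the reference fasta file.
    accession_to_taxid: dict from accession to taxon id (assumed to be
    supplied by the user; this sketch does not look anything up).
    """
    mapping_lines = []
    missing = []
    seen = set()
    for raw in fasta_lines:
        if not raw.startswith(">"):
            continue  # sequence line, not a header
        words = raw[1:].split()
        if not words:
            continue  # header with no content
        accession = words[0]  # first word of the header line
        if accession in seen:
            continue  # keep one line per accession
        seen.add(accession)
        taxid = accession_to_taxid.get(accession)
        if taxid is None:
            missing.append(accession)  # no taxon id known for this reference
        else:
            mapping_lines.append(f"{accession}\t{taxid}")
    return mapping_lines, missing
```

Writing "\n".join(mapping_lines) to a text file gives one "accession, tab, taxon id" line per reference, and the missing list shows which references would remain unassigned.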
You don’t add taxonomic levels but rather supply the taxon id of one taxon. That places the accession into the NCBI taxonomy.
For example, tax-id 2 corresponds to Bacteria, whereas 976 corresponds to Bacteroidetes.