Meganizing diamond files with customized database

steff1088 · July 22, 2019, 5:47pm

Hi everybody,

I have been working with Megan6 to analyse protein sequence data aligned with Diamond. The .daa file was obtained with output format 100. The reference database was a customized set of ~2900 protein sequences for a functional gene, obtained from the Fungene repository. My goal is to get a weighted LCA taxonomic output in Megan6; however, the default NCBI protein mapping file does not seem to be compatible with the .daa file from Diamond. Over 90% of hits are declared unassigned. I tried to add the customized Fungene database as a synonyms list, in the format of a tab-delimited txt file with one column being the protein accession number and the other column the GI number. Unfortunately, that did not change the number of unassigned hits. Are there any options to turn my customized Fungene database (essentially a fasta file with the protein accession no. in the header of each protein sequence) into a functional Megan6 mapping file? Or am I making a mistake in the amendment of the synonyms list?

I would highly appreciate any helpful comments!!

cheers,
steffen

Daniel · July 22, 2019, 6:04pm

Hi Steffen,

A couple of things:

For taxonomic binning, you need to produce a mapping file that maps Fungene accessions to taxon ids.
Functional binning would require a mapping of Fungene identifiers onto, e.g., InterPro families.

On the other hand, Fungene has long been on my list of things to look into, so if you could give me access to a small example file, then that would give me something to work toward.

Best wishes
daniel

steff1088 · July 22, 2019, 7:32pm

Thanks so much, Daniel. Can I send you my current Fungene ref sequence data file and the file that links accession no. to GI? Does a GI number count as a valid taxon id? I believe Fungene uses the normal protein accession no. as identifier. Here is the format they provide:

KZC29526 coded_by=complement(28140…30071),organism=Rhodanobacter sp. FW510-R10,definition=TAT-dependent nitrous-oxide reductase

And then I usually just cut the text off leaving the ID:

KZC29526

Cheers,
steffen

steff1088 · July 23, 2019, 7:41pm

Does anybody else have some experience in making their own customized mapping file?

steff1088 · August 1, 2019, 7:56pm

Hi Everybody,

I am still waiting to move forward from here. I can imagine customizing mapping files may be useful for many users so please if you could comment on your experience or clarify previous comments (@Daniel) that would be highly appreciated. Are there any plans to implement Fungene-derived mapping files soon? That would be super cool!

cheers,
steffen

Daniel · August 6, 2019, 2:36pm

Sorry, Steffen,

A mapping file can be a text file in which each line has two entries, a string and a number, separated by a tab.

MEGAN looks for a string contained in the reference sequence header line. If the string is found, the reference sequence is mapped to the corresponding number.

What does it mean to be contained in the reference sequence header?
This depends on whether you supply the mapping file to MEGAN as an “accession” mapping file or a “synonyms” mapping file.
In the former case, MEGAN looks for something like ref| and then grabs the word that comes after that and tries to match that to your mapping file. In the latter case, MEGAN breaks the header string into words and tries to match one of the words to your mapping file.
You can also write taxonomic identifiers etc. directly into the header lines of your reference sequences and then use the “use id parsing” feature to grab the identifiers directly from the references.

So, for your example, you need to write

KZC29526 <tab> 666

in the mapping file to map reference KZC29526 to taxon number 666.

If the string to be matched is the very first string on the header line, then you can supply the mapping file either as an accession mapping file or as a synonyms mapping file, because in both cases the very first word is considered for matching.
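
For anyone wanting to script this, here is a minimal Python sketch of building such a two-column file from Fungene-style headers. It assumes the header layout Steffen posted (accession first, then comma-separated key=value pairs) and that the organism name itself contains no comma. The `taxon_ids` lookup is hypothetical: you would have to fill it with real NCBI taxon ids yourself, and 666 is only the placeholder from the example above. None of this is part of MEGAN.

```python
import re

def accession_and_organism(header):
    """Return (accession, organism) from a Fungene-style FASTA header.

    Assumes the accession is the first word and that the organism
    appears as 'organism=<name>' terminated by a comma or end of line.
    """
    header = header.lstrip(">").strip()
    accession = header.split()[0]
    match = re.search(r"organism=([^,]+)", header)
    organism = match.group(1).strip() if match else None
    return accession, organism

def mapping_lines(headers, taxon_ids):
    """Build 'accession<TAB>taxon id' lines, skipping unknown organisms."""
    lines = []
    for header in headers:
        accession, organism = accession_and_organism(header)
        taxon_id = taxon_ids.get(organism)
        if taxon_id is not None:
            lines.append(f"{accession}\t{taxon_id}")
    return lines

headers = [
    ">KZC29526 coded_by=complement(28140…30071),"
    "organism=Rhodanobacter sp. FW510-R10,"
    "definition=TAT-dependent nitrous-oxide reductase",
]

# Hypothetical lookup table; 666 is a placeholder, not a real taxon id.
taxon_ids = {"Rhodanobacter sp. FW510-R10": 666}

print("\n".join(mapping_lines(headers, taxon_ids)))
```

Writing the printed lines to a text file gives the string/number layout described above. Headers whose organism is missing from the lookup are skipped, so it is worth counting how many lines come out compared to how many sequences went in.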

jigyasa · October 4, 2019, 2:20am

Hey @Daniel

Thank you for explaining how the custom mapping file can be created.

I have one quick question-
If I am using Uniref90 database, and I want to create a tab-delimited mapping file, how do I add the taxonomy hierarchy?

For example, I am creating a mapping file in the format-
reference_id <tab> tax_id
UniRef90_Q197F8 <tab> 345201

How do I add or get all the taxonomic levels (i.e. phyla, class, family, order, genus info) for this tax_id?

Looking forward to your reply.

Daniel · October 16, 2019, 6:16am

You don’t add taxonomic levels but rather supply the taxon id of one taxon. That places the accession into the NCBI taxonomy.
For example, tax-id 2 corresponds to Bacteria, whereas 976 corresponds to Bacteroidetes.
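
If you want to see the higher ranks for a given taxon id outside of MEGAN, the idea of "one id places the accession into the taxonomy" can be sketched in Python by following parent links. The sketch below assumes input lines in the pipe-separated layout of the NCBI taxonomy dump file nodes.dmp (tax id, parent tax id, rank, ...). The three sample lines are a simplified toy excerpt, with intermediate nodes left out and ranks chosen for illustration, not the real file contents.

```python
def parse_nodes(lines):
    """Parse nodes.dmp-style lines into {tax_id: (parent_id, rank)}."""
    nodes = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes

def lineage(tax_id, nodes):
    """Return [(tax_id, rank), ...] from the root down to tax_id."""
    path = []
    while tax_id in nodes:
        parent_id, rank = nodes[tax_id]
        path.append((tax_id, rank))
        if parent_id == tax_id:  # the root node is its own parent
            break
        tax_id = parent_id
    return path[::-1]

# Toy excerpt (assumption): real files contain more intermediate nodes.
sample = [
    "1\t|\t1\t|\tno rank\t|",
    "2\t|\t1\t|\tsuperkingdom\t|",
    "976\t|\t2\t|\tphylum\t|",
]

nodes = parse_nodes(sample)
for tax_id, rank in lineage(976, nodes):
    print(tax_id, rank)
```

With the real dump files you would load nodes.dmp in the same way and join the ids against names.dmp to get the taxon names. The mapping file itself still only needs the single taxon id per accession.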