Plain text for prot_acc2tax?

Hello,

I’m quite new to MEGAN6 and I’m not sure whether any non-binary (plain-text) reference files exist. I have a contig that, using diamond blastx, matches this protein reference:

contig-3000000 ref|YP_002004542.1| 31.4 175 108 5 1454 942 164 330 5.7e-18 94.7

This reference can be found in the NCBI nr database (https://www.ncbi.nlm.nih.gov/protein/YP_002004542), but MEGAN6 does not assign any taxonomy to it, and we don’t know why. How can I check whether this NCBI nr sequence exists in the prot_acc2tax file? We are also wondering whether the problem is related to the bit score of 94.7, as MEGAN sets a threshold of 95. Is that correct?

Any ideas would be much appreciated,

Muntsa

You can download the source file here:
ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
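
To check whether particular accessions from a diamond blastx hit appear in that plain-text file, a scan along the following lines may help. This is only a sketch: it assumes the file is tab-separated with a header line and the columns accession, accession.version, taxid, gi, and that subject IDs look like `ref|YP_002004542.1|` as in the hit above. The function names `extract_accession` and `lookup_accessions` are made up for illustration.

```python
import gzip


def extract_accession(subject_id):
    """Return the accession.version from an ID such as 'ref|YP_002004542.1|'."""
    parts = [p for p in subject_id.split("|") if p]
    # Assumption: with a 'db|accession|' layout the accession is the last
    # non-empty field; a bare accession is returned unchanged.
    return parts[-1] if parts else subject_id


def lookup_accessions(path, wanted):
    """Scan a prot.accession2taxid(.gz) file; return {accession.version: taxid}."""
    wanted = set(wanted)
    found = {}
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[1] in wanted:
                found[fields[1]] = int(fields[2])
                if len(found) == len(wanted):
                    break  # stop early once every accession is located
    return found
```

Calling `lookup_accessions("prot.accession2taxid.gz", [extract_accession("ref|YP_002004542.1|")])` returns an empty dict if the accession is not in the file. The file is large, so a single pass over it for all accessions of interest is preferable to one pass per accession.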

Hello Daniel,

Thank you so much for the file you provided. We’ve checked, and both accessions hit by our contig are indeed included in the reference file.

contig-3000000 ref|YP_002004542.1| 31.4 175 108 5 1454 942 164 330 5.7e-18 94.7
contig-3000000 ref|YP_008060116.1| 35.6 87 52 1 5162 4902 345 427 2.7e-04 49.3

But we still do not understand why MEGAN6 does not assign any taxon to this contig, especially since the accessions hit with blastx are included in the reference file. Do you have any idea what’s going on here? Could it be related to the .tree or .map files MEGAN6 loads when it opens?

Thank you so much for your help!

Which mapping file are you using? Using the latest mapping file prot_acc2tax-May2017.abin.zip downloaded from the MEGAN 6 website I can parse your two lines and get this:

So, MEGAN appears to operate as intended…

Hello Daniel!
We realized that the reference file we downloaded from your page was corrupted, although it did not give us any error when we used it. Hence, the file we were using did not include all NCBI entries. We downloaded it again and now there is no problem. Thank you so much for your help!
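
Since the corrupted download produced no error here, it may be worth testing an archive before using it. The sketch below uses only the Python standard library; the function name `archive_is_intact` is hypothetical, and it only detects damage that breaks the zip or gzip container (a truncated download, a failed CRC check), not a file that is valid but outdated or incomplete at the source.

```python
import gzip
import zipfile
import zlib


def archive_is_intact(path):
    """Return True if a .zip or .gz file can be read through without errors."""
    try:
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as archive:
                # testzip() returns the name of the first corrupt member, or None
                return archive.testzip() is None
        # Otherwise treat the file as gzip and decompress it to the end
        with gzip.open(path, "rb") as handle:
            while handle.read(1 << 20):  # read in 1 MiB chunks
                pass
        return True
    except (OSError, EOFError, zlib.error, zipfile.BadZipFile):
        # Covers truncated streams, bad checksums, and non-archive files
        return False
```

For example, `archive_is_intact("prot_acc2tax-May2017.abin.zip")` would be expected to return False for a download that was cut short.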

Hello Daniel,
I’m trying to get the accession.version for each taxid identified by MEGAN6 from the diamond blastx output file. I use rma2info to process the rma file and get the following two-column file (column 1 is the gene ID and column 2 is the taxid):
PJOAIHFC_00082 10239
PJOAIHFC_00085 10239
PJOAIHFC_01157 10239
PJOAIHFC_01161 10239
PJOAIHFC_01288 10239
PJOAIHFC_01852 10239

I get 755 unique taxids, but when I look them up in prot.accession2taxid to get the corresponding accession.version IDs, I only find 640. Does this make sense? I would expect to find all of them.

Thank you!
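
One way to see which taxids are missing is to split the assigned taxids into those that occur in the taxid column of prot.accession2taxid and those that do not. The sketch below assumes a whitespace-separated two-column file like the one shown above and the tab-separated layout accession, accession.version, taxid, gi for the NCBI file; both function names are hypothetical. A possible explanation for the gap, offered as an assumption rather than something confirmed in this thread: MEGAN's LCA assignment can place a gene on an internal node of the taxonomy (10239, for instance, is the NCBI taxid for Viruses), while protein records are usually attached to more specific taxa, so some assigned taxids may have no accession of their own.

```python
import gzip


def read_assigned_taxids(path):
    """Collect the unique taxids from a two-column 'gene_id taxid' file."""
    taxids = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue
            try:
                taxids.add(int(fields[1]))
            except ValueError:
                continue  # skip header or malformed lines
    return taxids


def split_by_presence(assigned, accession2taxid_path):
    """Return (present, missing) taxid sets relative to the mapping file."""
    present = set()
    opener = gzip.open if accession2taxid_path.endswith(".gz") else open
    with opener(accession2taxid_path, "rt") as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2].isdigit():
                taxid = int(fields[2])
                if taxid in assigned:
                    present.add(taxid)
    return present, assigned - present
```

Looking up the taxids in the `missing` set in the NCBI taxonomy would show whether they are higher-rank nodes.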