Reads with excellent hits against a single taxon are being placed in the "Not assigned" node

Hello, I am using MEGAN6 with the latest mapping file (megan-nucl-Oct2019) to process my blastn results.

For some reason a great number of reads (more than 30,000), most of them with excellent hits against the same taxon, are being placed in the “Not assigned” node. To exemplify, these are the blastn results for one of those reads (the homology search was set to show the 5 best hits) indicating they all correspond to the same taxon (Spodoptera frugiperda) with an unequivocal E-value (0.0):

Query= HJK3X6U01BD2KC, 470 bp.

Length=470

Sequences producing significant alignments: Score (Bits) E-Value

NJHR01000880.1 Spodoptera frugiperda isolate Sf9 scaffold880, who… 673 0.0
NJHR01000167.1 Spodoptera frugiperda isolate Sf9 scaffold167, who… 673 0.0
NJHR01001202.1 Spodoptera frugiperda isolate Sf9 scaffold1202, wh… 673 0.0
LS001813.1 Spodoptera frugiperda strain corn strain genome assemb… 667 0.0
LS041536.1 Spodoptera frugiperda strain corn strain genome assemb… 667 0.0

I checked the submission date for these particular accessions to make sure they weren't from a later date than that of the mapping file: they are from June and December 2017, so that shouldn't be a problem, as the mapping file is from October 2019.

I'd be very grateful if you could help me out with this.

Best!

Christina

Dear Christina,

you can check easily whether an accession is in the mapping file using the sqlite3 command line program. First open the file like this:

sqlite3 megan-nucl-Oct2019.db

Then query an accession like this:

select * from mappings where Accession = 'NJHR01000167';

It turns out that the accession NJHR01000167 is not in the mapping database.

However, for LS001813 the result is:

LS001813|7108

which means that the accession is known and it maps to taxon 7108 (Spodoptera frugiperda). I don’t know why this is not being assigned. If you could give me access to a small example file that contains this then I will look into it.
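For checking more than a handful of accessions, the same lookup can be scripted. The following is only a minimal sketch, assuming the database has a table called mappings with an Accession column, as in the query above; the function name lookup_accessions is made up for illustration and is not part of MEGAN.

```python
import os
import sqlite3


def lookup_accessions(db_path, accessions):
    """Return {accession: row or None} for each accession.

    Assumes a table 'mappings' with an 'Accession' column, as in the
    sqlite3 query shown in this thread. No other columns are assumed:
    the whole row is returned as a tuple.
    """
    # sqlite3.connect() would silently create an empty database for a
    # mistyped path, so fail early instead.
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)

    results = {}
    con = sqlite3.connect(db_path)
    try:
        for acc in accessions:
            # Parameter binding avoids any quoting problems in the query.
            row = con.execute(
                "SELECT * FROM mappings WHERE Accession = ?", (acc,)
            ).fetchone()
            results[acc] = row  # None means: not in the mapping file
    finally:
        con.close()
    return results
```

Called as lookup_accessions("megan-nucl-Oct2019.db", ["LS001813", "NJHR01000167"]), this should give a row for LS001813 and None for NJHR01000167, going by the results above. Note that the accessions are passed without the version suffix (LS001813, not LS001813.1), as in the manual query.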

However, because the reference lines contain the taxon names, you should also be able to parse the file without using the mapping db. Select “Parse Taxon Names” on the Taxonomy Tab of the “Import from Blast” dialog and MEGAN will capture the names from text.

Dear Daniel, thank you greatly for your prompt answer!

Your suggestion
“However, because the reference lines contain the taxon names, you should also be able to parse the file without using the mapping db. Select “Parse Taxon Names” on the Taxonomy Tab of the “Import from Blast” dialog and MEGAN will capture the names from text.”
has solved the problem!!! Thank you!!!

Nevertheless, I can still send a small example file. Would that be of the “blastn results” file where some of these “problematic” reads appear (and of the corresponding fasta file)?

Thank you again for your help!!

“Nevertheless, I can still send a small example file. Would that be of the “blastn results” file where some of these “problematic” reads appear (and of the corresponding fasta file)?”

Yes, please send such a blastn results file.

Dear Daniel, I'm attaching a partial blastn results file and its corresponding fasta file for 10 of the reads with excellent hits that were “Not assigned”.

Please let me know if you need me to include more reads and/or if you need anything else.

Thanks again for all your help!

All the best and keep safe.

Christina

Sf_MM_nt16SLep_partial.txt (214.9 KB) Sf_MM_partial.fasta (5.2 KB)

Dear Christina,

thanks for the data, perfect. I tried out MEGAN on this data and it would appear that the program works as intended.

All ten reads get assigned to Spodoptera frugiperda, using the mapping file. Are you sure that you used this file: megan-nucl-Oct2019.db.zip (needs to be unzipped)?

Inspecting the alignments, the references whose accessions are known to MEGAN (such as LS041536) are assigned a taxon, whereas unknown accessions (such as MKQC01000675) are not.

If you can get it to work, then using accession mapping is probably more reliable than name parsing (because there are some ambiguities that are heuristically resolved).

Daniel

Dear Daniel, thank you for trying that out. I tried it out myself and these reads were also assigned, so I looked into the “Not assigned” list of the original file and found that somehow I made a mistake when I made up the partial file, and chose reads that HAD been assigned… I apologise for my mistake.

I've made new files (Sf_MM_10reads_nt16SLep.txt and Sf_MM_10reads.fasta) double-checking that they actually DO contain 10 of those reads that were not assigned by MEGAN :-).
Sf_MM_10reads.fasta (5.3 KB)
Sf_MM_10reads_nt16SLep.txt (326.4 KB)

I also processed them with MEGAN to see what happened and confirmed that they are not being assigned.
Moreover, when I checked the taxonomy I found that MEGAN only assigns 1 match to each of these reads even though they all have 5 excellent matches against Spodoptera frugiperda.

I apologise for my blunder, it must be the effect of working at home during the quarantine with my little toddler running around :-)! I hope these files are useful to you.

All the best and keep safe

Christina

Thank you for the updated files. Yes, I see what you mean, the reads don’t get assigned, because the accessions are not in the mapping file.
I will generate a new mapping file and put it online, that should fix the problem.
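To find out which accessions in a blastn results file would need to be covered by a mapping file, one can pull the accessions out of the hit lines and strip the version suffix, so that they match the form used in the sqlite3 query earlier in this thread. The following is only a rough sketch: the regular expression is an assumption about what the hit lines look like (based on the lines posted above, e.g. NJHR01000880.1 and LS001813.1) and will not cover every identifier format, and the name extract_accessions is made up for illustration.

```python
import re

# Assumed pattern: 1-6 capital letters, an optional underscore, digits,
# then ".version", at the start of a line (optionally after '>').
# This fits accessions such as NJHR01000880.1 or LS001813.1.
ACCESSION_RE = re.compile(r"^>?\s*([A-Z]{1,6}_?[0-9]+)\.[0-9]+\s")


def extract_accessions(blast_lines):
    """Return the unversioned accessions found at the start of the
    given lines, in order of first appearance, without duplicates."""
    seen = {}
    for line in blast_lines:
        match = ACCESSION_RE.match(line)
        if match:
            # Keep only the part before ".version"; a dict preserves
            # insertion order and drops repeats.
            seen.setdefault(match.group(1), None)
    return list(seen)
```

For a file on disk, something like extract_accessions(open("Sf_MM_10reads_nt16SLep.txt")) would give the list to check against the mapping db. Lines such as "Query= ..." or "Length=470" do not match the pattern and are skipped.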

Sorry for the delay in answering… I never got a notification in my email box (as I had previously) so I didn't check until recently when I signed up for something else!

Thank you for solving the problem! Best!

Still working on the new mapping file, I keep hitting problems… but hopefully it will complete soon…

I've had a few problems of the same type (hits were not assigned) with other samples, which I processed with the new mapping file (and the latest MEGAN version); in this case not even parsing the taxon names did the trick. But this could well be because the homology searches were made against a much older database. Nevertheless, I seem to have circumvented this issue by using the older accession mapping files (prot_acc2tax-Jul2019X1.abin; acc2seed-May2015XX.abin; acc2eggnog-Jul2019X.abin; acc2interpro-Jul2019X.abin).

I forgot to mention that Megan takes a long long long time to process the homology searches when I use the aforementioned accession mapping files (exponentially longer than with the all-in-one mapping file).

Perhaps try using an older version of MEGAN?
Although I tried hard not to break the old way of mapping accessions when I implemented support for database-based mapping, it is possible that somewhere I made a change that slows down the old approach…

Older installers are still available on the website.

Alternatively, I will also look into generating a mapping database for the older set of mapping files that you mentioned.

Yes, I did try older versions as well, but they're also extremely slow. They take ages to parse the input file.

I have kicked off the process that will build a mapping file using the versions of classifications that you mentioned.
However, there is a snag: the July 2019 version of MEGAN does not have the code for computing a mapping database, whereas the current version contains updated versions of the taxonomy and functional classifications… So I’m not sure how useful the result will be… (This will mainly affect the SEED classification, which has undergone a major change in MEGAN, jumping from 2015 to 2020).

I have generated a mapping database megan-map-Jul2019.db that uses the old July 2019 mapping files. Please test and let me know whether this works.

The link is on this page:

https://software-ab.informatik.uni-tuebingen.de/download/megan6/old.html

Thank you very much for taking the time to generate this db, Daniel!
I’m really sorry to say the same thing happens with the Jul2019 db :(… I used v6.18.1 and v6.19.1 and with both all the hits were classified as “not assigned”, same as with the Oct2019 and May2020 db…

But I’ve managed to get it to work when I use the separate accession mapping files (prot_acc2tax-Jul2019X1.abin; acc2seed-May2015XX.abin; acc2eggnog-Jul2019X.abin; acc2interpro-Jul2019X.abin), so I’ll just stick to that.

Here is another possible solution: I have generated a new mapping file:

Community Edition:
megan-map-May2020-X.db.zip
Ultimate Edition:
megan-map-May2020-X-ue.db.zip

This file contains an extended mapping that maps over 850 million accessions to taxa, rather than the 250 million that are available in the standard mapping file. The functional mappings are also extended. It also contains mappings to the GTDB taxonomy, something of added value that is not contained in older mapping files.

So, perhaps give this a try. The file is very big, nearly 20 GB when unzipped.

Wow! That’s great! Of course I’ll give it a try and let you know. Thank you again for everything!

Hello again, sorry for the considerable delay, I have just been able to pick up on this again. I don’t have good news… with this new May2020 db all the reads were classified as “not assigned” :(