Taxonomic information retrieval

eraysahin · July 26, 2022, 7:27am

Hello,

I built a DIAMOND nr database using:

diamond makedb --in nr.gz \
  --db nr_diamond \
  --taxonmap prot.accession2taxid.FULL \
  --taxonnodes nodes.dmp \
  --taxonnames names.dmp \
  --threads 72

and performed the annotation using:

diamond blastp -p 80 -c1 -b60 \
  -d nr_diamond.dmnd \
  -q sample1.spades.genes.faa \
  -o sample1.faa.diamond.daa \
  -f 100 -k 5 --salltitles -e 0.000001

However, when I try to get taxIDs from the DAA file, I get the error “Error: Taxonomy features are not supported for the DAA format.” I posted the issue on GitHub, and the response confirmed that taxonomy information cannot be retrieved from the DAA file, as the error message indicates.

I then tried MEGAN, but I do not want to use the LCA algorithm in my case. Instead, I want to take the first of the five annotations for each query, as long as it satisfies the thresholds I supply, but I could not find the correct parameters. A custom script I tried was very slow, and it also lacks the taxon ID information: I need the “prot.accession2taxid.FULL” file to match protein accessions to taxon IDs, and the lookup for each query takes a long time.
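One likely reason such a lookup is slow is that the mapping file gets searched once per query. A minimal sketch of a single-pass alternative follows. It assumes the hits have been exported to a tab-separated table with the query in the first column and the subject accession in the second, with each query's rows listed best first, and that the mapping file is tab-separated with the accession in the first column and the taxID in the last. The function names and column defaults are hypothetical and should be checked against the actual files.

```python
def first_hit_per_query(hit_lines):
    """Return {query: subject accession} using the first row seen per query.

    Assumes tab-separated rows: query in column 0, subject in column 1,
    and rows for each query ordered best hit first (an assumption).
    """
    first = {}
    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        first.setdefault(fields[0], fields[1])
    return first


def load_taxids(mapping_lines, wanted, acc_col=0, taxid_col=-1):
    """Scan the mapping once and return {accession: taxid} for wanted accessions.

    Column positions are assumptions; adjust them to the file layout.
    A header row is skipped naturally because it matches no accession.
    """
    found = {}
    for line in mapping_lines:
        fields = line.rstrip("\n").split("\t")
        acc = fields[acc_col]
        if acc in wanted:
            found[acc] = fields[taxid_col]
            if len(found) == len(wanted):
                break  # every requested accession has been resolved
    return found
```

Both functions accept any iterable of lines, so an open file handle can be passed directly. Only the accessions that actually occur as first hits are kept in memory, and the large mapping file is read once instead of once per query.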

How can I get the protein and taxID information (and the full lineage, if possible) for the first annotation row of each query in MEGAN without using LCA?

Thank you in advance.

Best regards

Daniel · July 27, 2022, 8:18am

Have you tried using the LCA algorithm with topPercent set to 0.001?
That will only use the alignment with the highest score (or all such, if they have identical scores).
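The effect of a very small topPercent can be illustrated with a short sketch. This is not MEGAN's code, only an illustration that assumes the filter keeps, per query, every alignment whose bit score is within the given percentage of that query's best bit score. The function name and the input layout (query, subject, bit score tuples) are hypothetical.

```python
from collections import defaultdict


def top_percent_hits(hits, top_percent=0.001):
    """Keep, per query, the subjects scoring within top_percent of the best.

    hits: iterable of (query, subject, bitscore) tuples.
    Illustrative only; the cutoff rule is an assumption about the filter.
    """
    by_query = defaultdict(list)
    for query, subject, bitscore in hits:
        by_query[query].append((subject, float(bitscore)))

    kept = {}
    for query, rows in by_query.items():
        best = max(score for _, score in rows)
        cutoff = best * (1.0 - top_percent / 100.0)
        kept[query] = [subject for subject, score in rows if score >= cutoff]
    return kept
```

With such a small percentage the cutoff sits just below the best score, so only the top alignment survives unless several alignments tie.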

eraysahin · July 27, 2022, 8:53am

Dear Daniel,

Thank you for your reply. I will give it a try.

Best regards