Meganizing a DAA file with Naive LCA Parameters (Layman's Explanation Desired)

Earlymorningmanatee · August 8, 2020, 5:46pm

Hi MEGAN Community!

I feel this topic has been covered at least partially in previous posts, but my lab group is having a difficult time interpreting, and agreeing on, what each of the LCA parameters does. I have read the manual and think I have a general impression of what each setting does, but it would be nice to get validation from experts and other users before we begin our analysis in earnest.

For context:
We are using MEGAN6 and have processed our short reads through DIAMOND.
In DIAMOND we use an e-value cutoff of 0.0001 and the --top option to keep only alignments scoring within 5% of the best score, and we are comparing against the NCBI nr database.

This is our DIAMOND command (note that it is part of a for loop):

diamond blastx -d ~/Diamond_Tools/BLAST_DB/nr_4_23_2018.dmnd -q $pups -o $pups.daa -f 100 --sallseqid --top 5 -b 8 -e .0001

My questions are two-fold. The default settings for the LCA algorithm, at least in my version, are as follows:
Min Score: 50
Max Expected: 0.01
Min Percent Identity: 0
Top Percent: 10%
Min Support: 0
Min Support Percent: 0.05
I’m wondering how these defaults were determined and what the impact of changing them is, in layman's terms if possible. For example, if we set the max expected value very small, does that mean we will get fewer assignments to deep taxa (e.g. species), but the ones we do get will be more reliable? Also, if we are already filtering by e-value in DIAMOND (and potentially by bit score, which I believe is an option), are the first three parameters redundant?
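
To make the redundancy question concrete, here is a minimal sketch of how three per-alignment thresholds like these could be applied. It is an illustration only, not MEGAN's actual implementation: the `Alignment` class, the `passes_filters` function, and the example alignments are all hypothetical, and it assumes that an alignment is kept only if it passes all three thresholds.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One hypothetical read-to-reference alignment."""
    subject: str
    bit_score: float
    evalue: float
    percent_identity: float


def passes_filters(aln, min_score=50.0, max_expected=0.01,
                   min_percent_identity=0.0):
    """Return True if the alignment passes all three thresholds (assumed logic)."""
    return (aln.bit_score >= min_score
            and aln.evalue <= max_expected
            and aln.percent_identity >= min_percent_identity)


# Made-up alignments for one read; all already satisfy an upstream
# e-value cutoff of 1e-4, as they would after `diamond ... -e .0001`.
alignments = [
    Alignment("protA", 120.0, 1e-30, 92.0),
    Alignment("protB", 55.0, 1e-5, 61.0),
    Alignment("protC", 42.0, 5e-5, 48.0),
]

# With the defaults, only the bit-score threshold removes anything here.
kept_default = [a.subject for a in alignments if passes_filters(a)]
print(kept_default)  # ['protA', 'protB']

# A much stricter e-value threshold does remove additional alignments.
kept_strict = [a.subject for a in alignments
               if passes_filters(a, max_expected=1e-10)]
print(kept_strict)  # ['protA']
```

Under these assumptions, an e-value threshold of 0.01 cannot remove anything from input that was already cut at 0.0001, so it only matters once it is set stricter than the DIAMOND cutoff. The bit-score and percent-identity thresholds are independent of the e-value and can still remove alignments.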

I’m mostly confused about the Top Percent value. Is this a secondary filter similar to the one in DIAMOND, in which MEGAN would look at the bit scores of all alignments for one query but only consider the alignments whose bit score is within 10% of the highest score?
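
The interpretation described here can be written as a few lines of Python. This is a sketch of that reading of the parameter only, with a hypothetical function name and made-up scores; it assumes "within 10% of the highest score" means a cutoff at 90% of the best bit score for the read.

```python
def top_percent_filter(bit_scores, top_percent=10.0):
    """Keep the scores that lie within `top_percent` percent of the best score."""
    if not bit_scores:
        return []
    best = max(bit_scores)
    cutoff = best * (1.0 - top_percent / 100.0)
    return [s for s in bit_scores if s >= cutoff]


# Made-up bit scores for the alignments of a single read.
scores = [200.0, 185.0, 179.0, 120.0]

print(top_percent_filter(scores))       # cutoff 180 -> [200.0, 185.0]
print(top_percent_filter(scores, 5.0))  # cutoff 190 -> [200.0]
```

Note that under this reading, a 10% setting applied to alignments that were already restricted to the top 5% upstream would not remove anything further, since every surviving score is already above the looser cutoff.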

Forgive me if these questions are obvious. I appreciate your assistance and guidance!

Sincerely,

RJ

Daniel · August 13, 2020, 9:03am

Unfortunately, there is no study that has looked into how to set the parameters for different datasets and questions, or for different releases of the NCBI-nr database. These are all moving targets, so any such study would have very limited usefulness.

The Top Percent parameter addresses the fact that most sequences found in the environment are not present in current reference databases, so using a best hit assignment strategy will be misleading. Yes, DIAMOND’s top percent filter was inspired by MEGAN and serves the same purpose.
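
To illustrate why a best-hit strategy can mislead, here is a toy sketch of a naive LCA assignment. It is not MEGAN's code: the `PARENT` table is a tiny hand-written taxonomy, and `lineage` and `naive_lca` are hypothetical helper names. The idea is that a read is placed on the deepest node shared by all of the taxa it hits, so a read that matches several related species almost equally well is reported at genus or family level rather than forced onto one species.

```python
# Hypothetical toy taxonomy, stored as child -> parent.
PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "root",
}


def lineage(taxon):
    """Return the path from the root down to `taxon`."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def naive_lca(taxa):
    """Return the deepest node shared by the lineages of all given taxa."""
    lineages = [lineage(t) for t in taxa]
    lca = "root"
    # Walk down from the root and stop at the first level where lineages differ.
    for nodes in zip(*lineages):
        if len(set(nodes)) != 1:
            break
        lca = nodes[0]
    return lca


# Best hit only: the read looks like a confident species-level assignment.
print(naive_lca(["Escherichia coli"]))
# Near-equal hits to two species of one genus: assigned to the genus.
print(naive_lca(["Escherichia coli", "Escherichia albertii"]))
# Hits spanning two genera: assigned to the family.
print(naive_lca(["Escherichia coli", "Salmonella enterica"]))
```

In this sketch, a larger Top Percent value lets more alignments contribute taxa to the LCA, which tends to push assignments toward higher, more conservative ranks; a smaller value behaves more like a best-hit assignment.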

Earlymorningmanatee · August 17, 2020, 1:59pm

Hi Daniel,

Thank you, that was very helpful, and it is nice to have it confirmed! I appreciate you taking the time to answer my question!

Sincerely,

RJ