I want to use DIAMOND+MEGAN for a sequence dataset of 80 metagenomes. I have run a test with one file using diamond blastx + meganizer and opened the file in MEGAN.
Everything worked beautifully, but the resulting DIAMOND output file is very large (~25 GB), even after reducing --max-target-seqs to 10. While I can run the analysis on a cluster, I will have to view all the files in MEGAN on my own computer, so downloading and opening 80 files of ~25 GB each doesn’t seem like a viable option.
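For reference, my test run looked roughly like this (file names, database path and thread count are placeholders, and I have omitted the daa-meganizer mapping-file options, which depend on the MEGAN version):

```
# DIAMOND blastx against NR, writing a DAA file (--outfmt 100) that MEGAN can read
diamond blastx \
    --query sample01.fastq.gz \
    --db nr.dmnd \
    --out sample01.daa \
    --outfmt 100 \
    --max-target-seqs 10 \
    --threads 16

# "Meganize" the DAA file so it can be opened directly in MEGAN
# (mapping-file options omitted; they differ between MEGAN releases, see daa-meganizer -h)
megan/tools/daa-meganizer --in sample01.daa
```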
Would it be OK to reduce --max-target-seqs even more, to 5 or maybe even 1? Or is there an alternative way to process the DIAMOND output for MEGAN?
No, please do not reduce the number of alignments per read. This will result in a large increase in false positive assignments.
There are a number of alternatives.
We are working on new releases of MEGAN that will work with UniRef50, UniRef90 and UniRef100, and with clustered versions of NR.
You can use the program megan/tools/compute-comparison to compute a single, small comparison file for all your 80 files. You can download and open that.
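Roughly along these lines (option names are from memory and may differ between MEGAN releases, so please check compute-comparison -h):

```
# Combine all meganized DAA files into one small comparison document
# that can be downloaded and opened in MEGAN on a desktop machine
megan/tools/compute-comparison \
    -i daa_files/*.daa \
    -o all_80_samples.megan
```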
Thanks for the tips.
I have played around a bit with different numbers and noticed that I end up with way too many unassigned reads if I reduce the number of alignments too much. The counts did seem to stabilize somewhere between 25 and 100 alignments per read, so I’m now going with 50 and the compute-comparison tool.
I assume that you mean that the number of functional assignments decreases? Taxonomic assignments shouldn’t be affected. The nice thing about using the smaller databases (as soon as they become available, and we are nearly there) is that the assignment rate for functional assignments goes up, whereas the time to compute the alignments goes down…
No, it also decreased for the taxonomic assignment.
For my comparison I got ~22,000,000 (1 max alignment), ~7,400,000 (10), ~5,000,000 (25), ~4,700,000 (50) and ~4,700,000 (100) reads in the “not assigned” category, so quite a dramatic difference.
OK, that makes sense: unassigned is not the same as unaligned. So this indicates that there is a mapping problem: you are using a reference database that is much more recent than the mapping file. I will generate a new, up-to-date mapping file.
Sorry for the confusion. I am using the NR reference database on our computing cluster, so I am not exactly sure how recent that is. As I have already run DIAMOND followed by daa-meganizer on some of the files, will it be possible to run daa-meganizer again on the already meganized files with an updated mapping file, or do I have to generate a fresh DAA file for that?
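In other words, I am hoping something along these lines would work on the existing files once the new mapping file is available (the mapping file name is a placeholder and the option names are just my guess from the current documentation):

```
# Hoped-for: re-run daa-meganizer on an already-meganized DAA file,
# pointing it at the updated mapping database
# (check daa-meganizer -h for the exact options in your MEGAN version)
megan/tools/daa-meganizer \
    --in sample01.daa \
    --mapDB megan-map-updated.db
```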