Limit CPU usage of daa-meganizer

Hi, is there any way to limit the number of threads used by daa-meganizer (MEGAN tools)? It is using all the CPUs and I don’t see any parameter to limit it.

We would like daa-meganizer to use more CPUs! It seems to start with 2 CPUs and then settles down to 1. Our current problem is that when we run daa-meganizer on DAA files larger than about 80 GB, it reaches
Binning reads Writing classification tables 10% 20% 30% 40% 50% 60% 70% 80% 90%
and gets no further, despite apparently active Java processes. Any suggestions gratefully received.

The main reason for this is that daa-meganizer was run with too little memory. Edit the file MEGAN/megan.vmoptions to set the amount of memory that MEGAN (and all MEGAN tools) can use; it should be at least 16 GB for daa-meganizer to run comfortably.
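Concretely, the memory limit is the JVM maximum-heap setting in that file; following the 16 GB recommendation above, the relevant line would be:

```
# MEGAN/megan.vmoptions — JVM options read by the MEGAN launchers
-Xmx16G
```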

Yes, at present most of the work is done in a single thread, because most of the work consists of writing indices to the end of the DAA file.

Thanks Daniel: I am trying again with MEGAN v6.21.2 and 500 GB memory, but with the same outcome. There are active Java processes using up to 500 GB, but runs never get beyond 90% after 8 days on the HPC. Failed input files have about 530 million reads, but some other files with similar numbers have succeeded with less than 300 GB, so the outcome does not depend only on input file size.
I will try again with 750 GB memory, but I would be grateful for any suggestions on what is using up the memory, or on how to process such large data.

I have never tried to meganize a file with 530 million reads (only around 80 million).
I will have to revisit my code to see whether there are any obvious bottlenecks that I can remove.
In the current version I try to avoid creating global tables etc. while parsing through a file, but there may well be tables that I thought would never grow big enough to cause a problem.

You can try using more memory… As a Java program, meganizer will happily use all the memory that you give it, trading memory for speed…

Thanks Daniel: I reran this daa-meganizer job with 2 TB of memory, but without success.
Output ends as follows:

Using 'Weighted LCA' assignment (80.0 %) on GTDB
Computing weights
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (8017.0s)
Total matches:     891,587,604 
Total references:  166,487,190 
Total weights:      78,945,077 
Using Best-Hit algorithm for binning: EC
Using Best-Hit algorithm for binning: INTERPRO2GO
Binning reads...
Binning reads Analyzing alignments
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (520811.3s)
Total reads:      313,484,464
With hits:         313,484,464 
Alignments:      2,537,048,697
Assig. Taxonomy:   288,798,327
Assig. SEED:        87,548,584
Assig. EGGNOG:      33,480,738
Assig. KEGG:        51,665,086
Assig. GTDB:       249,611,196
Assig. EC:          30,768,505
Assig. INTERPRO2GO: 76,807,634
Binning reads Applying min-support & disabled filter to Taxonomy...
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.4s)
Min-supp. changes:      12,899
Binning reads Applying min-support & disabled filter to GTDB...
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (1668.6s)
Min-supp. changes:       1,375
Binning reads Writing classification tables
10% 20% 30% 40% 50% 60% 70% 80% 90%  

It reached this stage after about 5 days. It then filled up the 2 TB, after which memory use cycled between about 800 GB and 2 TB without progressing, before the job ran out of time after 9 days.

We would be very grateful for suggestions on how to proceed. We would be happy to test any development of daa-meganizer. Can we combine smaller meganized daa files somehow?

I just noticed that you are using the “weighted LCA” option. Could you please try meganizing with the naive LCA instead? The weighted LCA uses more memory and takes longer than the naive LCA, so that should reduce the 520,811.3 s (about 6 days) that classification took.
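For what it’s worth, a sketch of such an invocation might look like the following; the option names are from memory, so please check `daa-meganizer -h` for the exact spelling in your version, and the file names here are placeholders:

```shell
# Hypothetical invocation: re-meganize with the naive LCA instead of the
# weighted LCA (file names are placeholders; verify option names with -h).
daa-meganizer --in sample.daa \
              --lcaAlgorithm naive \
              --mapDB megan-map.db
```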
However, the other issue that is apparent is that the program is having a difficult time writing the classification tables. MEGAN builds the classification tables in memory and then writes them to the end of the DAA file. With hundreds of millions of reads and multiple classifications, this uses a lot of memory.

Basically, you are using 10× as much data as this version of MEGAN is tuned for. I will have to look into rewriting parts of the code to accommodate hundreds of millions of reads and 166 million references…
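To put rough numbers on the table-writing step: a hypothetical back-of-envelope estimate, assuming each read-to-class assignment costs 16 bytes (read id and class id as two 8-byte longs), gives a raw lower bound for the tables implied by the log above; Java collection overhead (boxed objects, hash-table load factors) can inflate this several-fold:

```python
# Back-of-envelope estimate (assumption: 16 bytes per assignment entry,
# i.e. two 8-byte longs; real Java data structures carry extra overhead).
assignments = [
    288_798_327,  # Taxonomy
    249_611_196,  # GTDB
    87_548_584,   # SEED
    76_807_634,   # INTERPRO2GO
    51_665_086,   # KEGG
    33_480_738,   # EGGNOG
    30_768_505,   # EC
]  # per-classification assignment counts from the log above

bytes_per_entry = 16  # assumed: read id + class id as 8-byte longs
total_bytes = sum(assignments) * bytes_per_entry
print(f"{total_bytes / 1e9:.1f} GB")  # ~13.1 GB as a raw lower bound
```

Even under this compact encoding the tables reach tens of gigabytes, and overheads plus the repeated growth and copying of large in-memory structures might help explain the cycling between 800 GB and 2 TB observed during this phase.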

Thanks Daniel: We find that the naïve LCA is a bit faster, but writing the classification tables is still the problem. daa-meganizer completes for files with fewer than about 500 million reads, using up to 300 GB. But larger files hang at “Writing classification tables … 90%”, even when given 2 TB.
We would be very grateful for anything you can do to help us process these (and larger) files.

We are happy to test any development. Can smaller meganized daa files be combined somehow?