How to estimate running time for daa-meganizer for long reads

I use the commands below to run the DIAMOND+MEGAN pipeline for long reads.

diamond blastx -d nrdb.dmnd -q final.contigs.part_001.fa \
    -o final.contigs_2024.part_001.daa -F 15 --range-culling -f 100 --top 10 \
    -t ./ --threads 32

/megan/tools/daa-meganizer -i final.contigs_2024.part_001.daa \
    -mdb megan-map-Feb2022.db --longReads -t 16

The DAA file generated by DIAMOND is ~12.7 GB. I use MEGAN 6.25.9 and set MEGAN's maximum memory to 500 GB. daa-meganizer did not finish within 24 hours. Here are the last lines in the slurm-jobid.out file:

Annotating DAA file using FAST mode (accession database and first accession per line)
Annotating references
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (410.8s)
Writing
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (3.6s)
Binning reads Initializing...
Initializing binning...
Using 'Interval-Union-LCA' algorithm (51.0 %) for binning: Taxonomy
Using Multi-Gene Best-Hit algorithm for binning: SEED
Using Multi-Gene Best-Hit algorithm for binning: EGGNOG
Using 'Interval-Union-LCA' algorithm (51.0 %) for binning: GTDB
Using Multi-Gene Best-Hit algorithm for binning: EC
Using Multi-Gene Best-Hit algorithm for binning: INTERPRO2GO
Binning reads…
Binning reads Analyzing alignments
10% 20% 30% 40% 50%

I also checked memory usage for the run, and here is the information:
Memory Utilized: 95.55 GB
Memory Efficiency: 19.03% of 502.00 GB
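
These figures match the output format of SLURM's seff utility; on clusters that provide it, the same summary can be printed for any finished job:

# CPU and memory efficiency summary for a completed SLURM job
# (replace <jobid> with the actual job id)
seff <jobid>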

(1) Should I increase "-t 16" to "-t 32"?
(2) I have run it a few times, and it seems to get stuck at 50% during binning (see the sstat sketch after this list).
(3) I have submitted a job to run daa-meganizer for 72 hours, but after a week of waiting it still has not started.
(4) In total, I have four of these DAA files. Let me know if you have any suggestions for getting them to run quickly.
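
One way to check whether the "stuck" binning step is still doing work is SLURM's sstat, which reports live CPU time and memory for a running job step (the .batch step suffix is an assumption about how the job was launched):

# live CPU and peak-memory statistics for a running batch job step
sstat --format=JobID,AveCPU,MaxRSS -j <jobid>.batch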

What type of cluster are you running it on? There isn't a precise estimate for this, but last year, with the default settings of 8 threads and 200 GB of assigned RAM, it took approximately 56 minutes to process a 67 GB file. For a 1 TB file with the same parameters, it took around 25 hours.
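
As a rough sanity check, assuming wall time scales more or less linearly with DAA size, those two runs work out to roughly 0.7 to 1.2 GB per minute:

# back-of-envelope estimate for a 12.7 GB file at the slower rate
awk 'BEGIN { printf "~%.0f min\n", 12.7 / 0.7 }'   # prints ~18 min

So a 12.7 GB file stalling past 24 hours suggests the bottleneck is something other than sheer file size, e.g. the number of alignments per read or a node problem.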

Hi @Anupam,

I am not sure what information you need about the cluster. It is described as “a heterogeneous cluster suitable for a variety of workloads” with “a low-latency high-performance fabric connecting all nodes and temporary storage.”

Here is the job script header I used to submit the job:
#!/bin/bash
#SBATCH --time=23:59:00
#SBATCH --account=def-mcristes
#SBATCH --mem=502G
#SBATCH --cpus-per-task=32
#SBATCH --job-name=DiamondMegan_part1_3

My DAA file is ~12.7 GB, and MEGAN (v6.25.9) can use 500 GB of memory.
I have submitted a job to re-run daa-meganizer for up to 72 hours, but after a week of waiting it still has not started. I am also worried that I may hit the same problem even if daa-meganizer runs for 72 hours.

In November 2023, I had a problem running daa-meganizer on a DAA file generated by DIAMOND with the "--top 10" option, but had no problem running it on a DAA file generated without "--top 10" (i.e., using the default -k parameter).

I just submitted another job to re-run diamond following your suggestion in this post.

Would the difference between "--top 10" and "-k 25" for DIAMOND just be a different number of alignments reported in the DAA file?

Thanks a lot for your help!

This seems fine.

I suggest reducing the CPUs to the default of 8 and the memory to 200 GB. The problem often stems from node allocation or configuration; sometimes simply changing the node resolves the issue.
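
For concreteness, a minimal sketch of the reduced allocation, reusing the account and file names from your script (adjust for your cluster):

#!/bin/bash
#SBATCH --time=23:59:00
#SBATCH --account=def-mcristes
#SBATCH --mem=200G
#SBATCH --cpus-per-task=8
#SBATCH --job-name=DiamondMegan_part1_3

# -t 8 matches the default thread count and the reduced CPU allocation
/megan/tools/daa-meganizer -i final.contigs_2024.part_001.daa \
    -mdb megan-map-Feb2022.db --longReads -t 8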

Yes, it limits the number of alignments reported within any range along a long sequence to a maximum of k=25, resembling the k=25 mode used for short reads. Essentially, it ensures that the number of alignments overlapping at any position along a long read does not exceed 25.
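
For comparison, the -k 25 variant of the DIAMOND call from the top of the thread would look like this (a sketch; every flag except -k is copied from your original command):

diamond blastx -d nrdb.dmnd -q final.contigs.part_001.fa \
    -o final.contigs_2024.part_001.daa -F 15 --range-culling -f 100 -k 25 \
    -t ./ --threads 32

With --top 10, by contrast, every alignment scoring within 10% of the best alignment covering a region is kept, so densely hit regions can produce far more alignments per read, which may explain the slower binning you saw.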

Alternatively, if you could send the file, I can investigate further.

I suggest reducing the CPU to the default of 8 and memory to 200GB.

Thanks for the suggestion. Glad to know I do not really need 500G of memory for MEGAN.

Alternatively, if you could send the file, I can investigate further.

I have shared the daa file with you. Let me know if you did not receive an email notification about that.

Hi @XHe

I executed daa-meganizer on your file, and it took approximately 16 hours 59 minutes with default settings. MEGAN was allocated 200 GB of memory (its maximum resident set size was 89.20 GB). The file contains 12,123,405 alignments in total. While this duration seems reasonable, we can investigate further to understand why it took so long. However, based on this run, you don't need to allocate more than a day of computation for daa-meganizer.

Version   MEGAN Community Edition (version 6.25.9, built 16 Jan 2024)
Author(s) Daniel H. Huson
Copyright (C) 2023. This program comes with ABSOLUTELY NO WARRANTY.
Java version: 20.0.2; max memory: 195.3G
Functional classifications to use: EC, EGGNOG, GTDB, INTERPRO2GO, SEED
Loading ncbi.map: 2,396,736
Loading ncbi.tre: 2,396,740
Loading ec.map:     8,200
Loading ec.tre:     8,204
Loading eggnog.map:    30,875
Loading eggnog.tre:    30,986
Loading gtdb.map:   240,103
Loading gtdb.tre:   240,107
Loading interpro2go.map:    14,242
Loading interpro2go.tre:    28,907
Loading seed.map:       961
Loading seed.tre:       962
Meganizing: final.contigs_2024.part_001.daa
Meganizing init
Annotating DAA file using FAST mode (accession database and first accession per line)
Annotating references
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (140.8s)
Writing
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (1.7s)
Binning reads Initializing...
Initializing binning...
Using 'Interval-Union-LCA' algorithm (51.0 %) for binning: Taxonomy
Using Multi-Gene Best-Hit algorithm for binning: SEED
Using Multi-Gene Best-Hit algorithm for binning: EGGNOG
Using 'Interval-Union-LCA' algorithm (51.0 %) for binning: GTDB
Using Multi-Gene Best-Hit algorithm for binning: EC
Using Multi-Gene Best-Hit algorithm for binning: INTERPRO2GO
Binning reads...
Binning reads Analyzing alignments
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (60,999.8s)
Total reads:          281,440
Total weight:     158,719,797
With hits:             237,223 
Alignments:         12,123,405
Assig. Taxonomy:       211,607
Assig. SEED:             6,304
Assig. EGGNOG:           2,591
Assig. GTDB:            12,030
Assig. EC:              25,421
Assig. INTERPRO2GO:     50,645
MinSupport set to: 15871
Binning reads Applying min-support & disabled filter to Taxonomy...
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.3s)
Min-supp. changes:       4,944
Binning reads Applying min-support & disabled filter to GTDB...
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.4s)
Min-supp. changes:       3,421
Binning reads Writing classification tables
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.5s)
Binning reads Syncing
100% (0.0s)
Class. Taxonomy:           925
Class. SEED:               365
Class. EGGNOG:             690
Class. GTDB:               158
Class. EC:               1,849
Class. INTERPRO2GO:      5,842
Total time:  61,150.5s
Peak memory: 64.3 of 195.3G
	Command being timed: "./megan/tools/daa-meganizer -i final.contigs_2024.part_001.daa -mdb megan-map-Feb2022.db --longReads"
	User time (seconds): 87751.33
	System time (seconds): 1879.05
	Percent of CPU this job got: 146%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 16:59:12
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 89203468
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 3671
	Minor (reclaiming a frame) page faults: 795668269
	Voluntary context switches: 10753796
	Involuntary context switches: 409467
	Swaps: 0
	File system inputs: 16384
	File system outputs: 2500368
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
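
The resource summary at the bottom is GNU time's verbose output; to capture the same statistics for your own runs, you can prefix the command like this:

# /usr/bin/time -v reports wall-clock time, peak RSS, page faults, etc.
/usr/bin/time -v ./megan/tools/daa-meganizer \
    -i final.contigs_2024.part_001.daa \
    -mdb megan-map-Feb2022.db --longReads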

You may have received an email containing a link to download the MEGANized DAA file. Please let me know if you haven’t received it.

Best regards,
Anupam