I keep running into a problem when I run daa2rma on a pair of read files. The process seems to get stuck partway through, apparently because of memory constraints. I am running this on a cluster with 28 nodes and 64 GB of RAM.
The log file says:
Version MEGAN Community Edition (version 6.7.18, built 19 Apr 2017)
Copyright © 2017 Daniel H. Huson. This program comes with ABSOLUTELY NO WARRANTY.
Loading ncbi.map: 1,562,782
Loading ncbi.tre: 1,562,785
Opening file: /beegfs/work/bcoro01/diamond/funtaxma/prot_acc2tax-Nov2016.abin
Loading seed.map: 13,662
Loading seed.tre: 21,084
Opening file: /beegfs/work/bcoro01/diamond/funtaxma/acc2seed-May2015XX.abin
In DAA files: /beegfs/work/bcoro01/diamond/10daa/NG-13139_RSb_lib196544_5485_7_cleaned_hg19_unmapped_reads_1.daa, /beegfs/work/bcoro01/diamond/10daa/NG-13139_RSb_lib196544_5485_7_cleaned_hg19_unmapped_reads_2.daa
Output file: /beegfs/work/bcoro01/diamond/20rma/NG-13139_RSb_lib196544_5485_7_cleaned_hg19_unmapped_1plus2.rma6
Parsing file: /beegfs/work/bcoro01/diamond/10daa/NG-13139_RSb_lib196544_5485_7_cleaned_hg19_unmapped_reads_1.daa
10% 20% 30% 40% 50% 60% 70% 80% 90% Parsing file: /beegfs/work/bcoro01/diamond/10daa/NG-13139_RSb_lib196544_5485_7_cleaned_hg19_unmapped_reads_2.daa
10% 20% 30% 40% Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Do you have an idea how this problem could be solved?
I also ran into this problem when running daa2rma in a cluster environment. The cluster admins notified me that my jobs were using more than the one thread I had requested. I could not figure out a way to force daa2rma or Java to use only one thread. When I ran the same command on my PC, it worked.
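One common reason a Java job uses more cores than requested is the JVM's garbage collector, which can run several worker threads by default. Below is a minimal sketch of one way to restrict that, assuming the daa2rma launcher starts a standard JVM that picks up the JAVA_TOOL_OPTIONS environment variable (I have not verified this for MEGAN's launcher). The helper names single_thread_java_env and run_single_threaded are hypothetical. Note that -XX:+UseSerialGC only limits garbage-collection threads; any worker threads that MEGAN itself starts are not affected.

```python
import os
import subprocess


def single_thread_java_env(base_env=None):
    """Return a copy of the environment that asks the JVM for the serial GC.

    Assumption: the launched program runs on a JVM that reads
    JAVA_TOOL_OPTIONS at startup. Existing options are preserved.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("JAVA_TOOL_OPTIONS", "")
    # -XX:+UseSerialGC selects the single-threaded garbage collector.
    env["JAVA_TOOL_OPTIONS"] = f"{existing} -XX:+UseSerialGC".strip()
    return env


def run_single_threaded(command):
    """Run `command` (a list of arguments) with the adjusted environment."""
    return subprocess.run(command, env=single_thread_java_env(), check=True)
```

Here `command` would be whatever daa2rma invocation you already use, passed as a list of arguments. The serial collector can be slower on large heaps, so this trades run time for staying within the requested core count.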
How many reads are you trying to process? MEGAN keeps all reads in memory when processing paired reads and this uses a lot of space…
Is there a solution to this? I have many samples whose DAA files are 40 GB each (so 80 GB of input per sample), and I have tried running them on a cluster with a maximum of 512 GB RAM per node. I have been allocating increasing amounts of RAM and am now at the last attempt, beyond which I cannot allocate more on a single node. This sample is just an example, though. The dataset contains larger files, so if I cannot get this one to run, or can only barely fit it into the available RAM, I am in trouble.
Could the new, much larger mapping files also be part of the cause?
Can I run daa2rma in parallel on multiple nodes using Slurm, or is this not possible? I already provide a temp directory with -tsd to keep as much as possible out of memory.
The only workaround that I can think of is to split your files into chunks, run each chunk separately through DIAMOND and MEGAN, and then merge the chunks into summary files. At present, a summary file loses the connection to the original reads and matches, so you cannot, for example, export all reads that have been assigned to a particular taxon.
However, if this sounds like a reasonable workaround that you are willing to try, then I will look into implementing a multi-DAA summary file format that represents a collection of DAA files that belong to the same sample.
As a user, you would interact with this as if all your reads were in one file, but they would actually be distributed across multiple files.
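The chunking step of the workaround described above could be sketched roughly as follows. This is only an illustration, assuming uncompressed, unwrapped FASTQ input (exactly four lines per record) and paired files that list the reads in the same order. The function names and the chunk file naming scheme are hypothetical and not part of MEGAN or DIAMOND.

```python
import itertools
from pathlib import Path


def read_fastq(handle):
    """Yield FASTQ records as lists of four lines (unwrapped FASTQ assumed)."""
    while True:
        record = list(itertools.islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_paired_fastq(r1_path, r2_path, out_dir, pairs_per_chunk):
    """Split two paired FASTQ files into chunks that keep mates together.

    Returns a list of (chunk_1_path, chunk_2_path) tuples.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    with open(r1_path) as r1, open(r2_path) as r2:
        # zip_longest pads with None, so unequal record counts are detected.
        pairs = itertools.zip_longest(read_fastq(r1), read_fastq(r2))
        for index in itertools.count():
            batch = list(itertools.islice(pairs, pairs_per_chunk))
            if not batch:
                break
            out1 = out_dir / f"chunk{index:04d}_1.fastq"
            out2 = out_dir / f"chunk{index:04d}_2.fastq"
            with open(out1, "w") as w1, open(out2, "w") as w2:
                for rec1, rec2 in batch:
                    if rec1 is None or rec2 is None:
                        raise ValueError("read files differ in record count")
                    w1.writelines(rec1)
                    w2.writelines(rec2)
            chunks.append((out1, out2))
    return chunks
```

Each pair of chunk files can then be aligned and converted as a separate job, which also makes it easy to spread the work over several cluster nodes. Only one batch of read pairs is held in memory at a time, so pairs_per_chunk controls the memory footprint of the splitting step itself.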
I had hoped that the size of metagenome datasets would stop growing and that MEGAN 6 would be efficient enough to handle current datasets.
However, apparently that is not the case and I already have some ideas for MEGAN 7…