I constantly run into a problem when I try to run daa2rma with a pair of read files. The process seems to get stuck at some point due to memory constraints. I am trying to do this on a cluster with 28 nodes and 64 GB of RAM.
I also ran into this problem running daa2rma in a cluster environment. I was notified by the cluster admins that the jobs were utilizing more than the 1 requested thread. I could not figure out a way to force daa2rma or Java to only use one thread. When I ran the same command on my PC, it worked.
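If the extra threads come from the JVM itself (garbage-collection and JIT worker threads), one workaround worth trying is to pin the process to a single core and tell the JVM to assume a single CPU. This is only a sketch: daa2rma may not expose a thread option of its own, and the `-i`/`-o` flags below are placeholders (check `daa2rma -h` for your MEGAN version), but `taskset` and the standard `JAVA_TOOL_OPTIONS` environment variable work independently of the tool:

```shell
# Any HotSpot JVM started in this shell reads JAVA_TOOL_OPTIONS at
# startup; ActiveProcessorCount=1 makes it size its GC/JIT thread
# pools as if only one CPU were available.
export JAVA_TOOL_OPTIONS="-XX:ActiveProcessorCount=1"

# Additionally pin the process to a single core, so it physically
# cannot run on more than the one CPU the scheduler granted.
taskset -c 0 daa2rma -i sample.daa -o sample.rma
```

Whether this satisfies the cluster's accounting depends on how the admins count threads versus CPU time, so it may be worth confirming with them.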
Is there a solution to this? I have many samples with 40 GB DAA files (two per sample, so 80 GB of input each) and have tried running them on a cluster with at most 512 GB of RAM per node. I have allocated increasing amounts of RAM, but I am now at my last attempt: I cannot allocate any more on a single node. And this is just one example; there will be larger files in the dataset, so if I cannot get this one to run, or can barely run it with the available RAM, I am in trouble.
Could this also be caused by the new, much larger mapping files?
Can I run daa2rma in parallel across multiple nodes on a Slurm system, or is this not possible? I already provide a temp directory with -tsd to get as much out of the available memory as possible.
The only workaround that I can think of is to split your read files into chunks, run each chunk separately through DIAMOND and MEGAN, and then merge the chunks into summary files. At present, with a summary file you lose the connection to the original reads and matches, so you can't, e.g., export all reads that have been assigned to a particular taxon.
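That chunked workflow maps naturally onto a Slurm job array, with one array task per chunk so each chunk fits in a single node's RAM. The sketch below assumes paired FASTQ input; the tool invocations (`seqkit split2`, `diamond blastx -f 100`, the `daa2rma` flags) are illustrative and should be checked against the versions installed on your cluster:

```shell
#!/bin/bash
#SBATCH --array=1-10          # one task per chunk
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8

i=$SLURM_ARRAY_TASK_ID

# 0. Before submitting, split the reads into 10 chunks once, e.g.:
#    seqkit split2 -1 reads_R1.fq.gz -2 reads_R2.fq.gz -p 10 -O chunks/

# 1. Align one chunk with DIAMOND, writing DAA output (-f 100).
diamond blastx -d nr -q chunks/reads_R1.part_00${i}.fq.gz \
    -f 100 -o chunk_${i}.daa --threads $SLURM_CPUS_PER_TASK

# 2. Convert that chunk's DAA file to RMA on its own node.
daa2rma -i chunk_${i}.daa -o chunk_${i}.rma

# 3. Once all array tasks finish, merge the per-chunk RMA files
#    into summary files (e.g. via MEGAN's compare functionality).
```

The merge in step 3 is exactly where the limitation above bites: the merged summary no longer points back to the individual reads, which is what the proposed multi-DAA format would fix.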
However, if this sounds like a reasonable workaround that you are willing to try, then I will look into implementing a multi-DAA summary file format that represents a collection of DAA files belonging to the same sample.
As a user, you would interact with this as if all your reads were on one file, but they would actually be distributed across multiple files.
I had hoped that the size of metagenome datasets would stop growing and that MEGAN 6 would be efficient enough to handle current datasets.
However, apparently that is not the case and I already have some ideas for MEGAN 7…