Ancient DNA, MALT, & Memory Issues

cscott · June 18, 2020, 7:22pm

Hi,

I am currently working on an ancient DNA project, trying to determine whether there is any coral/symbiont DNA in data from coral reef cores. However, I am also interested in identifying the reads which are not coral/symbiont reads. In order to do this, I have built a MALT database containing ~1,600 genomes from invertebrates, bacteria, protozoa, and archaea. Ideally, I will pipe the output from this into the software HOPS to characterize the degradation patterns of the reads and see which are truly ancient. However, I am having several issues:

1. MALT keeps running into memory errors when trying to align my read files. The read files are large, and I have broken them into 5 smaller subsets, but I am still running into memory errors. I suspect this is because the database itself takes up most of the Java memory, though I am not sure. If so, is there any way to fix this? If not, do you have a recommendation for another software that might be able to handle this kind of data (I’ve considered minimap2, but it seems that wouldn’t be ideal for short, potentially damaged reads)? I’ve already allotted MALT the maximum memory possible during the program’s installation.
2. A large portion of my reads are returning ‘no hits’ or ‘not assigned’. Is there a good way to loosen the threshold in order to see even poor alignments? Currently I have my min Support Percent set to 0.

Thank you for any feedback!

Carly

maxibor · June 24, 2020, 11:41am

Hi Carly,

1. To reduce the size of the MALT database so that it fits in memory, you can increase the step size (--step) when building it. Keep in mind, though, that this will decrease your sensitivity.
2. The cause of this might simply be that your reads are not mapping to any of the genomes in your database. The solution: add more genomes to your database, which brings you back to point 1.
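
To give a feel for the trade-off in the first point, here is a rough sketch in Python. It assumes the index size scales with the number of seed positions sampled from the references, which is a simplification and not MALT's actual memory accounting. The function names, the seed length, and the genome lengths are all made up for illustration.

```python
def seed_positions(genome_length: int, seed_length: int, step: int) -> int:
    """Number of seed start positions sampled from one reference sequence
    when a seed is taken every `step` bases (simplified model)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    if genome_length < seed_length:
        return 0
    last_start = genome_length - seed_length
    return last_start // step + 1


def total_seeds(genome_lengths, seed_length: int, step: int) -> int:
    """Total sampled seed positions over all references."""
    return sum(seed_positions(n, seed_length, step) for n in genome_lengths)


if __name__ == "__main__":
    # Hypothetical reference lengths (bp) and seed length, for illustration only.
    genomes = [5_000_000, 2_500_000, 400_000_000]
    baseline = total_seeds(genomes, seed_length=20, step=1)
    for step in (1, 2, 4, 8):
        n = total_seeds(genomes, seed_length=20, step=step)
        print(f"step={step}: {n} seeds ({n / baseline:.0%} of step=1)")
```

Under this model, doubling the step roughly halves the number of indexed seeds. The cost is that a short read has fewer chances of containing a sampled seed, which is where the sensitivity loss comes from.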

Daniel · June 30, 2020, 2:42pm

The memory usage is dominated by the size of the database… But as pointed out, for more sensitivity you need more reference sequences. The only hope would be to break your reference genomes up into multiple databases and then run MALT multiple times on the same input file, once on each database. Then you need to merge the results.
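
The bookkeeping for that split-and-merge idea could look something like the Python sketch below. It is only an illustration under stated assumptions: the references are grouped into batches of similar total size (one batch per database), and the per-database results have already been exported as simple (read, reference, bit score) tuples. The function names and that tuple format are hypothetical and are not something MALT produces directly. Keeping only the top-scoring hit per read is also a simplification, since an LCA-based assignment would need all hits near the best score.

```python
import heapq


def partition_by_size(sizes: dict, n_bins: int) -> list:
    """Greedy balancing: put each file, largest first, into the batch
    that currently has the smallest total size."""
    heap = [(0, i) for i in range(n_bins)]  # (total size so far, batch index)
    heapq.heapify(heap)
    bins = [[] for _ in range(n_bins)]
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, i = heapq.heappop(heap)
        bins[i].append(name)
        heapq.heappush(heap, (total + size, i))
    return bins


def merge_best_hits(per_db_hits) -> dict:
    """Combine hits from several databases, keeping the highest-scoring
    hit per read. Each hit is a (read_id, ref_id, bit_score) tuple."""
    best = {}
    for hits in per_db_hits:
        for read_id, ref_id, score in hits:
            if read_id not in best or score > best[read_id][1]:
                best[read_id] = (ref_id, score)
    return best


if __name__ == "__main__":
    # Hypothetical reference files and sizes in bytes.
    sizes = {"a.fna": 10, "b.fna": 8, "c.fna": 3, "d.fna": 2}
    print(partition_by_size(sizes, 2))

    # Hypothetical hits from two separate database runs.
    db1 = [("r1", "g1", 40.0), ("r2", "g2", 30.0)]
    db2 = [("r1", "g3", 55.0)]
    print(merge_best_hits([db1, db2]))
```

Reads with no entry in the merged table found no hit in any of the databases, which is the set you would report as unassigned.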

This is really a hard problem: you have short reads and are thus looking for short alignments… But in a huge database of genome sequences the same short sequences might appear many times, overwhelming either the algorithm or the output.

I’m sorry that I don’t have a simple answer…