LCA placement failure with Malt v. 0.5.2 and 0.5.3

Dear developers,

I am experiencing some difficulties with the LCA placement of DNA sequences when going from malt v. 0.4.1 to malt v. 0.5.2 or 0.5.3.

I ran a trivial test case using a reference dataset composed of two (artifactual) identical DNA sequences belonging to two species within the genus Bos. To build the malt index, I use the megan-nucl-Feb2022.db taxonomic file, downloaded from the MEGAN6 download page and unzipped:

#create reference dataset

cat > test_ref_dataset.fasta << END
>NC_037328.1 Bos taurus
>NC_025563.1 Bos mutus

#create malt index

malt-build -J-Xmx10G -i test_ref_dataset.fasta -s DNA --mapDB megan-nucl-Feb2022.db -d test_malt_index/

I then try to assign a single read which has 100% identity with both of these reference sequences, using malt with default parameters. I expect the read to be assigned to the Bos genus:

##create read

cat > test_read.fasta << END

##assign using malt
mkdir malt_output
malt-run -J-Xmx10G -d test_malt_index/ -i test_read.fasta -o malt_output -m BlastN -at SemiGlobal --memoryMode load

The resulting RMA6 file, opened in MEGAN6, shows the read assigned to Bos taurus instead of Bos. When I swap the order of the sequences in test_ref_dataset.fasta, the read is assigned to Bos mutus instead. This shows that both accessions are correctly recognized (and suggests that the read is assigned to the first match malt encounters in the reference dataset?). Running the same script with malt v. 0.4.1 correctly assigns the read to the Bos genus (whether the index is built with v. 0.4.1 or v. 0.5.3). The problem is identical when assigning the read with malt v. 0.5.2.

I notice from the screen log this message when running malt v. 0.5.3 or 0.5.2:

Using Best-Hit algorithm for binning: Taxonomy

Instead, using malt v. 0.4.1, I get this:

Using 'Naive LCA' algorithm (80.0 %) for binning: Taxonomy
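
For anyone comparing versions, a small check like the one below can tell which of the two messages a run produced. It is only a sketch: it assumes the screen log was saved to a file first (for example with `malt-run ... 2>&1 | tee malt_run.log`), and both the function name `binning_algorithm` and the file name `malt_run.log` are made up for illustration.

```shell
# Hypothetical helper: report which binning algorithm a saved MALT log mentions.
# Assumes the log was captured beforehand, e.g.:
#   malt-run ... 2>&1 | tee malt_run.log
binning_algorithm() {
    log_file="$1"
    if grep -q "Using Best-Hit algorithm for binning" "$log_file"; then
        echo "best-hit"
    elif grep -q "Using 'Naive LCA' algorithm" "$log_file"; then
        echo "naive-lca"
    else
        echo "unknown"
    fi
}

# Usage (after a run has been logged):
#   binning_algorithm malt_run.log
```

The strings matched here are exactly the two log lines quoted above; any other wording in a different release would be reported as "unknown".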

So I wonder whether the Naive LCA algorithm might be turned off by default in these recent malt versions, replaced by a ‘Best-Hit’ algorithm that simply assigns each read to the closest match in the reference dataset (or to the first one encountered when several matches are identical)? I couldn’t find a way to change this behavior in the command-line help or the manual. I also tried various combinations of LCA parameters instead of the defaults, without success.
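
To make the difference I am describing concrete, here is a toy sketch of the two behaviours on my test case. It is not MALT's implementation: the lineages are simplified by hand to three levels, the function name `lca_of_two` is invented, and the real Naive LCA also applies a percentage threshold that is ignored here.

```shell
# Toy illustration only: not MALT's implementation.
# Each lineage is a semicolon-separated path from a higher rank down to the species.
lca_of_two() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, ";"); m = split(b, y, ";")
        lca = "root"
        # Walk both lineages from the top and stop at the first disagreement.
        for (i = 1; i <= n && i <= m; i++) {
            if (x[i] != y[i]) break
            lca = x[i]
        }
        print lca
    }'
}

# Simplified, hand-written lineages for the two reference sequences.
taurus="Bovidae;Bos;Bos taurus"
mutus="Bovidae;Bos;Bos mutus"

# LCA over two equally good hits: the deepest node shared by both lineages.
lca_of_two "$taurus" "$mutus"      # prints: Bos

# Best-hit style: only the first hit is kept, so its species is reported.
echo "${taurus##*;}"               # prints: Bos taurus
```

With the LCA behaviour the read lands on Bos regardless of the order of the references, whereas the best-hit behaviour depends on which sequence comes first, which matches what I observe.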

I am wondering if I am not missing something obvious, but cannot figure it out. Your help would be greatly appreciated!



After a little departmental audit, it seems more people have reported issues with LCA not functioning properly in the later versions.

Using similar settings on the same dataset, with the software version as the only change, one of my colleagues went from 900 reads assigned solely to an organism (in the newer version) to those assignments completely disappearing (in the older version). Although we no longer have the stdout from that run, it’s likely the same issue.

I have finally identified the bug and will upload a release that fixes it this week.