Dear developers,
I am seeing a change in the LCA placement of DNA sequences when going from malt v. 0.4.1 to malt v. 0.5.2 or 0.5.3.
I ran a trivial test case with a reference dataset composed of two (artificial) identical DNA sequences belonging to two species within the genus Bos. To build the malt index, I use the megan-nucl-Feb2022.db mapping file, downloaded from the MEGAN6 download page and unzipped:
#create reference dataset
cat > test_ref_dataset.fasta << END
>NC_037328.1 Bos taurus
ACACCGCTTCAGCTTTGTACCGGGAATCCTTAGAGTCCTCTGATCATTGTTCGCCTCACCATACAGCACTCAGGCAAGCAATCCTGTGCTGGGGTGAGTTGATGACTCTAGCTACC
>NC_025563.1 Bos mutus
ACACCGCTTCAGCTTTGTACCGGGAATCCTTAGAGTCCTCTGATCATTGTTCGCCTCACCATACAGCACTCAGGCAAGCAATCCTGTGCTGGGGTGAGTTGATGACTCTAGCTACC
END
#create malt index
malt-build -J-Xmx10G -i test_ref_dataset.fasta -s DNA --mapDB megan-nucl-Feb2022.db -d test_malt_index/
I then try to assign a single read which has 100% identity with both of these reference sequences, using malt with default parameters. I expect the read to be assigned to the Bos genus:
##create read
cat > test_read.fasta << END
>read1
TCCTCTGATCATTGTTCGCCTCACCATACAGCACTCAGGCAAGC
END
##assign using malt
mkdir malt_output
malt-run -J-Xmx10G -d test_malt_index/ -i test_read.fasta -o malt_output -m BlastN -at SemiGlobal --memoryMode load
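As a sanity check on the test case itself, the exact match of the read against both references can be confirmed without malt. This is only a minimal sketch: count_exact_hits is a helper name I made up, it assumes each sequence sits on a single line (as in the files above), and it does not look at the reverse complement.

```shell
# Hypothetical helper (not part of malt): count reference sequences that
# contain the read as an exact substring.
#   $1 = read sequence, $2 = FASTA file with one sequence per line
count_exact_hits() {
  # Drop the header lines, then count the sequence lines containing the read.
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
  grep -v '^>' "$2" | grep -c -F -- "$1" || true
}
```

With the files above, `count_exact_hits "$(grep -v '^>' test_read.fasta)" test_ref_dataset.fasta` should print 2, one hit per reference sequence.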
When I open the resulting RMA6 file in MEGAN6, the read is assigned to Bos taurus instead of Bos. When I swap the order of the two sequences in test_ref_dataset.fasta, the read is assigned to Bos mutus instead. This suggests that both accessions are correctly recognized, and that the read is assigned to the first match encountered in the reference dataset. Running the same script with malt v. 0.4.1 instead correctly assigns the read to the Bos genus (whether the index was built with v. 0.4.1 or v. 0.5.3). The problem is identical when assigning the read with malt v. 0.5.2.
In the screen log, I notice this message when running malt v. 0.5.3 or 0.5.2:
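For anyone who wants to reproduce the order test, the two records can be swapped with a small helper instead of by hand. This is a sketch only: reverse_fasta_records is a name I made up, and it assumes a well-formed FASTA stream in which every record starts with a ">" header line.

```shell
# Hypothetical helper: read FASTA on stdin, print the records in reverse order.
reverse_fasta_records() {
  awk '
    /^>/  { n++ }                         # a header line starts a new record
    n > 0 { rec[n] = rec[n] $0 "\n" }     # append the line to the current record
    END   { for (i = n; i >= 1; i--) printf "%s", rec[i] }
  '
}
```

For example, `reverse_fasta_records < test_ref_dataset.fasta > test_ref_dataset_swapped.fasta` (the output file name is arbitrary), followed by rebuilding the index from the swapped file.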
Using Best-Hit algorithm for binning: Taxonomy
Instead, using malt v. 0.4.1, I get this:
Using 'Naive LCA' algorithm (80.0 %) for binning: Taxonomy
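To compare versions quickly, the binning line can be pulled out of a saved screen log. A minimal sketch, assuming the log was captured to a file beforehand; binning_algorithm_line is a made-up helper name, and the pattern is based only on the two messages quoted above.

```shell
# Hypothetical helper: read a saved malt-run screen log on stdin and print
# the line(s) reporting which binning algorithm was used.
binning_algorithm_line() {
  grep -E 'algorithm.*for binning'
}
```

For instance, if the screen output was saved with something like `malt-run ... 2>&1 | tee malt_run.log` (log file name arbitrary), then `binning_algorithm_line < malt_run.log` prints only the relevant line.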
So I wonder whether the Naive LCA algorithm is turned off by default in these recent malt versions, replaced by a ‘Best-Hit’ algorithm that simply assigns each read to the closest match in the reference dataset (or to the first one encountered when several matches are identical). I couldn’t find a way to change this behavior in the command-line help or the manual. I also tried various combinations of LCA parameters instead of the defaults, without success.
I may be missing something obvious, but I cannot figure it out. Your help would be greatly appreciated!
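To make explicit what I expect, here is a toy illustration of the difference between the two placements. It is not malt's implementation: the lineage strings are typed by hand rather than taken from the mapping database, lca_of_lineages is a made-up helper, and the percent-to-cover setting (the 80.0 % in the v. 0.4.1 log) is ignored. It only shows that the lowest common ancestor of two equally good hits in Bos taurus and Bos mutus is the genus Bos.

```shell
# Toy illustration only, not malt's code. Reads semicolon-separated lineages
# (one per hit) on stdin and prints the deepest taxon shared by all of them.
lca_of_lineages() {
  awk -F';' '
    NR == 1 { n = NF; for (i = 1; i <= NF; i++) path[i] = $i; next }
    {
      # Walk down while this lineage still agrees with the shared prefix.
      for (i = 1; i <= n && i <= NF && path[i] == $i; i++) ;
      n = i - 1
    }
    END { if (n > 0) print path[n]; else print "root" }
  '
}

# Two equally good hits: the expected placement is the genus.
printf '%s\n' 'Bovidae;Bos;Bos taurus' 'Bovidae;Bos;Bos mutus' | lca_of_lineages
# prints: Bos
```

If only the first hit is passed in, the same function returns Bos taurus, which matches the placement I get with v. 0.5.2 and v. 0.5.3.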
Best,
Arthur