I'm also seeing similar things sometimes, but I don't think it's a problem with MALT. For example, I'm mapping against mammals and getting a hit in the MALT output that reports almost 99% sequence identity to Oryx:
>K00233:206:HLH3VBBXY:8:2213:8592:22749 1:N:0:TCGAATAA+TGAGACCA
TCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCGGCTACAGCCCCAGGATGTGATGAGCCGACATC
DATA[length=76]
K00233:206:HLH3VBBXY:8:2213:8592:22749 [length=76, matches=1]
Oryx dammah; score=134.0
>NW_024072194.1|tax|59534
Length = 5779
Score = 134 bits (147), Expect = 4e-28
Identities = 75/76 (99%), Gaps = 0/76 (0%)
Strand = Plus / Minus
Query: 1 TCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCGGCTACAGCCCCAGGATGTGATGAGCCGACATC 76
||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||
Sbjct: 5177 TCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATC 5102
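As a quick sanity check, the identity count can be recomputed from the two aligned strings above. This is only a minimal sketch: the function name is my own, and it assumes an ungapped alignment of equal-length strings, which is the case for this hit (Gaps = 0/76).

```python
def count_identities(query: str, subject: str) -> tuple[int, int]:
    """Return (matching positions, alignment length) for an ungapped alignment."""
    if len(query) != len(subject):
        raise ValueError("an ungapped alignment needs equal-length strings")
    matches = sum(q == s for q, s in zip(query, subject))
    return matches, len(query)


# Aligned strings copied from the Query and Sbjct lines of the MALT output.
query = "TCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCGGCTACAGCCCCAGGATGTGATGAGCCGACATC"
sbjct = "TCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATC"

matches, length = count_identities(query, sbjct)
print(f"{matches}/{length} = {100 * matches / length:.2f}%")
```

For this hit it gives 75/76, which the output shows rounded to 99%.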
This would be a weird result for this data, so I BLASTed the read against Oryx dammah, but got this message:
No significant similarity found. For reasons why,click here
So I'm assuming it could be a low-complexity region that gets automatically filtered by BLAST.
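One way to test the low-complexity guess without relying on BLAST's own filter is to look at the k-mer diversity of the read. Below is a rough sketch using Shannon entropy over dinucleotides. It is not the DUST algorithm that BLAST uses for nucleotide queries, and the function name and the choice of k=2 are my own assumptions.

```python
import math
from collections import Counter


def kmer_entropy(seq: str, k: int = 2) -> float:
    """Shannon entropy (in bits) of the overlapping k-mer distribution of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    # p * log2(1/p) summed over all observed k-mers
    return sum((c / total) * math.log2(total / c) for c in counts.values())


read = "TCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCGGCTACAGCCCCAGGATGTGATGAGCCGACATC"

# With k=2 there are 16 possible k-mers, so the maximum is log2(16) = 4 bits.
print(f"read:   {kmer_entropy(read):.2f} bits")
print(f"poly-A: {kmer_entropy('A' * 76):.2f} bits")
```

A value close to the maximum would suggest the read is not low complexity, in which case the missing BLAST hit needs another explanation.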
I downloaded this genome, and the region is present on a sequence called:
Oryx dammah isolate SB20612 unplaced genomic scaffold, SCBI_Odam_1.1 HiC_scaffold_2097, whole genome shotgun sequence
To put it lightly, a suspicious name at best (an isolate of a goat-like animal? I wonder what culture media they used to grow these; it must be a big incubator).
So a quick BLAST of the RefSeq accession against the entire database then yields this hit:
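Description: Pseudomonas mendocina strain MAE1-K chromosome, complete genome
Scientific name: Pseudomonas mendocina
Max score: 10318
Total score: 39581
Query cover: 100%
E value: 0.0
Per. ident: 98.96%
Acc. len: 5157724
Accession: CP023641.1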
So 100% of the query is covered, at 98.96% identity, by a bacterium.
So in conclusion, I think these edge cases don't highlight problems with MALT; they highlight issues with the quality of some of the data deposited in RefSeq. The regions we get hits to are probably just low complexity (i.e. they have multiple non-unique locations to map to and therefore don't map).
I've noticed some bacterial chromosomes with >10% N's, for example.
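If you want to screen a reference for this before building a database, the N content per record is easy to compute. A minimal sketch, assuming a plain (uncompressed) FASTA file. The function name is my own, and the toy records below are made up for illustration.

```python
import io


def n_fraction_per_record(handle):
    """Yield (header, fraction of N/n characters) for each FASTA record."""
    header, n_count, length = None, 0, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, (n_count / length if length else 0.0)
            header, n_count, length = line[1:], 0, 0
        elif line and header is not None:
            length += len(line)
            n_count += line.upper().count("N")
    if header is not None:
        yield header, (n_count / length if length else 0.0)


# Toy FASTA (made-up records); for a real file use open("genome.fasta").
toy = io.StringIO(">contig_1\nACGTNNNN\nACGTACGT\n>contig_2\nACGTACGT\n")
for name, frac in n_fraction_per_record(toy):
    flag = "  <-- more than 10% N" if frac > 0.10 else ""
    print(f"{name}\t{frac:.1%}{flag}")
```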
I'm not 100% sure, but I think MALT could also be N-N greedy: if you hard-mask both your reference and your reads, you will just get greedy alignments of N's, which can increase false-positive hits. bwa, if I'm correct, does not score N-N as a match.
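To show why this matters, here is a toy comparison of the two scoring conventions. It does not reproduce MALT's or bwa's actual scoring. It only counts identities with and without treating N-N as a match, and the sequences are made up.

```python
def identity(query: str, subject: str, n_matches_n: bool) -> float:
    """Fraction of identical positions in an ungapped alignment.

    If n_matches_n is False, any position involving an N counts as a mismatch.
    """
    if len(query) != len(subject):
        raise ValueError("an ungapped alignment needs equal-length strings")
    matches = 0
    for q, s in zip(query, subject):
        if "N" in (q, s):
            if n_matches_n and q == s:
                matches += 1
        elif q == s:
            matches += 1
    return matches / len(query)


# Made-up hard-masked read and reference: only 4 informative bases remain.
masked_read = "ACGT" + "N" * 12
masked_ref = "ACGA" + "N" * 12

print(f"N-N counted as match:    {identity(masked_read, masked_ref, True):.1%}")
print(f"N-N counted as mismatch: {identity(masked_read, masked_ref, False):.1%}")
```

With greedy N-N matching the masked stretch dominates the identity, even though only a handful of real bases were compared.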
And lastly, if you run MALT with semiglobal on, MALT could also be ignoring part of the read (good for ancient DNA damage, but if you don't look at it carefully it could also mean that you get spurious hits). In reality there could be better-matching organisms that are just not in your database. The same goes for your identity cutoff: if you set a 90% minimum identity in MALT and the closest organism in your database is 90% similar to a read, MALT will put a taxonomic ID on that read, even though the whole of RefSeq might contain a 99% match. It is a balancing act between setting the cutoff low enough to allow for mutations and DNA damage and high enough to have confidence in the call.
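The database effect can be shown with a toy example. This is not how MALT assigns taxa (there is no LCA step here). It simply picks the best hit at or above the cutoff, and the organism names and identity values are hypothetical.

```python
from typing import Optional


def assign(hits: dict[str, float], min_identity: float) -> Optional[str]:
    """Return the taxon with the highest identity at or above the cutoff, else None."""
    passing = {taxon: ident for taxon, ident in hits.items() if ident >= min_identity}
    if not passing:
        return None
    return max(passing, key=passing.get)


# Hypothetical percent identities of one read against two databases.
partial_db = {"organism_A": 90.0}
full_db = {"organism_A": 90.0, "organism_B": 99.0}

print(assign(partial_db, 90.0))  # the only candidate passes the cutoff
print(assign(full_db, 90.0))     # a better match exists in the larger database
print(assign(partial_db, 95.0))  # a stricter cutoff leaves the read unassigned
```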
Cheers