MALT extract reads and bwa aln

valery_shap · April 5, 2021, 6:35pm

Hello,
This is my first time using MALT.

I ran malt-build on all RefSeq bacterial genomes with the mapping file --acc2taxonomy megan-nucl-Jan201.db
Then I ran malt-run on ancient reads with the flags -id 85 -m BlastN -at SemiGlobal -top 1 -supp 0.01 -mq 100.
I opened the .rma6 file in MEGAN desktop, clicked on the species, then File --> Extract Reads
I got a .fasta file with the reads assigned to that species.
Then I decided to validate these assignments.
I took the ID of the reference from the .blast file (also downloaded after clicking on the species) and ran BWA:
bwa aln ref_index extracted_reads_from_megan.fasta > result.bwa
bwa samse ref_index result.bwa extracted_reads_from_megan.fasta > result.sam
samtools flagstat reports no mapped reads for the result.
I then tested against the reference genome of this species from RefSeq, and again there were no hits.
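
For anyone repeating this check, here is a minimal Python sketch of what "no hits" means at the SAM level: a record counts as mapped when the unmapped bit (0x4) of its FLAG field is not set. The function name and the example records are made up for illustration, and this is only an approximation of what samtools flagstat reports, not its implementation.

```python
def count_mapped(sam_lines):
    """Return (total_records, mapped_records) for an iterable of SAM lines."""
    total = 0
    mapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second SAM column is the bitwise FLAG
        total += 1
        if not flag & 0x4:     # 0x4 = "segment unmapped"
            mapped += 1
    return total, mapped


# Hypothetical records: read1 is unmapped (flag 4), read2 maps to the reverse strand (flag 16).
example = [
    "@SQ\tSN:ref\tLN:1000",
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tref\t10\t37\t4M\t*\t0\t0\tACGT\tIIII",
]
print(count_mapped(example))
```

With a real file you would pass an open file handle, e.g. `count_mapped(open("result.sam"))`.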
Should I use stricter parameters for malt-build and malt-run?
When I ran blastn online on some of these reads, it showed hits to [Eukaryotic synthetic construct chromosome 20]. The reads given to malt-run were the ones that did not map to human.
What could be going wrong?

Many thanks,
Valery

IamIamI · August 4, 2021, 8:31am

I’m also seeing similar things sometimes, but I don’t think it’s a problem with MALT. For example, I’m mapping against mammals and getting a hit in the MALT output with almost 99% sequence identity to Oryx:

>K00233:206:HLH3VBBXY:8:2213:8592:22749 1:N:0:TCGAATAA+TGAGACCA
TCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCGGCTACAGCCCCAGGATGTGATGAGCCGACATC

DATA[length=76]
K00233:206:HLH3VBBXY:8:2213:8592:22749 [length=76, matches=1]

Oryx dammah; score=134.0

>NW_024072194.1|tax|59534
	Length = 5779

 Score = 134 bits (147), Expect = 4e-28
 Identities = 75/76 (99%), Gaps = 0/76 (0%)
 Strand = Plus / Minus

Query:        1  TCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCGGCTACAGCCCCAGGATGTGATGAGCCGACATC  76
                 ||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||
Sbjct:     5177  TCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATC  5102
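
As a sanity check on the "Identities" line, the two aligned strings above can be compared position by position. This is just a sketch (the function name is my own), using the Query and Sbjct rows exactly as printed, which is valid here because the alignment has no gaps.

```python
# Aligned rows copied from the BLAST-style output above (no gaps, so equal length).
query = "TCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCGGCTACAGCCCCAGGATGTGATGAGCCGACATC"
sbjct = "TCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATC"


def identity(a, b):
    """Return (matches, aligned_length, mismatch_positions) for two gap-free aligned rows."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    mismatch_positions = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return len(a) - len(mismatch_positions), len(a), mismatch_positions


matches, length, mismatches = identity(query, sbjct)
print(f"{matches}/{length} = {100 * matches / length:.2f}% identity")
print("0-based mismatch positions:", mismatches)
```

The single mismatch (A in the read, T in the reference) gives 75/76, which the output above rounds to 99%.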

This would be a weird result for this data, so I blasted the read against Oryx dammah, but got the message

No significant similarity found.

So I’m assuming it could be a low-complexity region that gets automatically filtered by BLAST.
I downloaded this genome, and the region is present on something called

Oryx dammah isolate SB20612 unplaced genomic scaffold, SCBI_Odam_1.1 HiC_scaffold_2097, whole genome shotgun sequence

To put it lightly, a suspicious name at best (an isolate of a goat-like animal? I wonder what culture media they used to grow these; it must be a big incubator).

So a quick BLAST of the RefSeq accession against the entire database then yields this hit:

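Description:     Pseudomonas mendocina strain MAE1-K chromosome, complete genome
Scientific name: Pseudomonas mendocina
Max score:       10318
Total score:     39581
Query cover:     100%
E value:         0.0
Per. ident:      98.96%
Acc. len:        5157724
Accession:       CP023641.1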

100% of the query has 98.96% identity to a bacterium.

So in conclusion, I think these edge cases don’t point to problems with MALT; they point to issues with the quality of some of the data deposited in RefSeq. The regions we get hits to are probably just low complexity (i.e. they have multiple non-unique locations to map to, and therefore don’t map).
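
One rough way to test the low-complexity guess on a read is to look at the Shannon entropy of its k-mers. To be clear, this is not the filter BLAST itself applies, just a simple stand-in, and the function name and the choice of k=2 are my own assumptions.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=2):
    """Shannon entropy (in bits) of the k-mer distribution of seq; low values suggest low complexity."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    n = len(kmers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


read = "TCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTGGGACCGGCTACAGCCCCAGGATGTGATGAGCCGACATC"
repeat = "AT" * 38  # an obviously repetitive sequence of the same length, for comparison

print(f"read:   {kmer_entropy(read):.2f} bits")
print(f"repeat: {kmer_entropy(repeat):.2f} bits")
```

With k=2 the maximum is 4 bits (all 16 dinucleotides equally frequent), so a value far below that would support the low-complexity explanation, while a value close to it would argue against it.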

I’ve noticed some bacterial chromosomes with >10% N’s, for example.
I’m not 100% sure, but I think MALT could also be N-N greedy: if you hard-masked both your reference and your reads, for example, you would get greedy alignments of N’s, which can increase false-positive hits. BWA does not score N-N as a match, if I’m correct.
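
If you want to screen your references for this before building the index, a small sketch like the following reports the fraction of N’s per FASTA record. The function name and the toy records are hypothetical, and the 10% threshold is just the figure mentioned above, not a recommended cutoff.

```python
def n_fraction_per_record(fasta_lines):
    """Return {record_id: fraction of N bases} for an iterable of FASTA lines."""
    fractions = {}
    name = None
    n_count = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                fractions[name] = n_count / total if total else 0.0
            name = (line[1:].split() or [""])[0]  # ID = first word of the header
            n_count = 0
            total = 0
        elif name is not None:
            seq = line.upper()
            n_count += seq.count("N")
            total += len(seq)
    if name is not None:
        fractions[name] = n_count / total if total else 0.0
    return fractions


# Toy input: chr1 has 4 N's in 12 bases, chr2 has none.
example_fasta = [">chr1 toy record", "ACGTNNNN", "ACGT", ">chr2", "ACGT"]
for record_id, frac in n_fraction_per_record(example_fasta).items():
    flag = "  <-- more than 10% N" if frac > 0.10 else ""
    print(f"{record_id}\t{frac:.1%}{flag}")
```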

And lastly, if you run MALT in SemiGlobal mode, MALT could also be ignoring part of the read (good for ancient DNA damage, but if not looked at carefully it could also mean you get hits that don’t hold up when mapped). Also, in reality there could be better-matching organisms; they are just not in your database. The same goes for your ID% cutoff: if you set a 90% identity cutoff in MALT and the closest organism in your database is 90% similar to a read, MALT will put a taxonomic ID on that read, even though somewhere in the whole of RefSeq there might be a 99% matching organism. It is a balance between setting the cutoff low enough to allow for mutations and DNA damage and high enough to have confidence in the call.
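
To make the cutoff point concrete, here is a toy sketch with made-up taxa and identities. It simply picks the best hit at or above the cutoff, which is a simplification of my own and not how MALT/MEGAN actually place reads (they work with an LCA over the retained hits).

```python
def assign(hits, min_identity):
    """Return the taxon of the best hit with identity >= min_identity, or None if none passes.

    hits: list of (taxon, percent_identity) tuples.
    """
    passing = [h for h in hits if h[1] >= min_identity]
    if not passing:
        return None
    return max(passing, key=lambda h: h[1])[0]


# Hypothetical hits for one read against a reduced database and against a fuller one.
small_db_hits = [("Taxon A", 90.0)]
full_db_hits = small_db_hits + [("Taxon B", 99.0)]

print(assign(small_db_hits, 90))  # the read gets labelled Taxon A
print(assign(full_db_hits, 90))   # with the better reference present, Taxon B wins
print(assign(small_db_hits, 95))  # a stricter cutoff leaves the read unassigned
```

The same read ends up with a different label depending only on what is in the database, which is why the cutoff has to be chosen with the database content in mind.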

Cheers