Extracting DNA reads and filtering SAM files

Hi there,

I’ve got three questions. I’ve searched the forum and I could not see a similar problem. I hope it will be OK.

I am using Megan Community Edition (6.12.6 built 15 October 2018) to analyze the rma6 files produced by MALT (version 0.4.1, built 24 May 2018) tool. I am trying to filter the SAM files based on the reference sequence and read names to do some further ancient DNA analysis (like authenticating the deamination patterns).

To do this, I selected a species node (lets say X) and clicked extract reads button and created a file (I used the include summarized reads button checked otherwise I got 0 reads). Let’s say I got 25000 DNA reads extracted.

Then I filtered the SAM files based on reference (Accession name and Taxid of the species X), the reads that I have extracted and merged the SAM files into one BAM file. In the end, I have realized that in some of the SAM files, I did not have the exact number of reads (Lets say the sum of the reads decreased to 12500 for species X). So:

First question: Why I got different number of reads? What could be the problem in here? Do you have any idea?

Second question: I didn’t understand the include summarized reads box to extract DNA reads. When we extract DNA reads that are assigned to a species, we extract the reads only for this node right?

Third question: In the alignment viewer there is an expression: “List of 151 reference sequences for X”. Some of them belongs to species X and some of them belongs to taxonomically related references of species X. So, when we extract DNA sequences, are we extracting all the DNA reads that are related to species X or DNA sequences that are only assigned to species X? This is confused me.

best regards

emrah

Dear emarh,

@ First question
MALT gives you alignments of your reads against the references. Next, MEGAN places the read on the taxonomic tree. The fact a read has an alignment to the reference with a taxon X (what you see in SAM), is not enough for MEGAN to place it in the X node. The place MEGAN chooses depends on other alignments of the read. In the case it has a lot of other good alignments, it will be placed above X by the rules governed by LCA.

Question to you:
Does MEGAN say there are assigned reads but then fails to export it? I would suggest updating MEGAN, and trying again.

@ Second question
Yes, summarized refers to the nodes placed on this node or nodes below it (those more specific ones). Assigned refers to the reads placed only on this node. So in the terminal nodes the number of assigned should be the same as summarized.

@ Third question
If you are inspecting a node higher in the taxonomy the references can come from anything below it. That is a reason the read ended up in the higher node. When you extract reads assigned to the node, it is equivalent to taking reads containing sequence conserved across the taxa lower than that node.

Hope it helps!
Ania

Dear Ania,

Thank you for your answers.

I’ve clarified the issue for my first question. I understand that MALT creates corrupted CIGAR strings for some alignments. So when I do SAM to BAM conversion, the process (samtools view -Sb) truncates without giving an error. Since it is still a valid BAM file, the rest of the pipeline works. Now I am filtering the corrupted entries in the SAM file (I am going to open an issue regarding this problem) and continue my analyses.

For the second question, Megan does not say anything about assigned reads, but I understand the idea now. For higher nodes (e.g., Genus), if I don’t check summarized reads box, Megan extracts the reads that are only assigned to this node. But, it should be the same with the species level right?

Let me rephrase my third question: Let’s say Megan shows us that 100 reads assigned to species X (I am working on a terminal node, so no lower nodes involved). When I inspect the alignments I can see other references than X. Like some of our reads aligned to Y and some of them aligned to Z and I understand it. So the other species that we see are just secondary alignments right? So when we extract the reads are we extracting the reads only assigned to species X?

cheers

emrah

no, it depends how many reads went to genus (aligned to more than one species).

do you get several taxonomy headers? Could you provide a screen view?

yes, the other alignments are secondary. MEGAN uses the top scoring alignments for placement (by default, only those alignments that lie within 10% of the best bitscore for the read.)