Taxonomy from paired-end reads only down to species level?

Ralf · March 24, 2017, 4:58pm

Hi there,

I have two fastq.gz files for each of my paired-end metagenomics samples. When I “convert” them separately into .daa files (via diamond blastx), I get a taxonomy tree down to the strain level (e.g. for S. aureus) for each “single-end” file.
But after importing the corresponding .daa files via “Import from Blast” as paired-end data (with taxonomic and functional categorization), and thus combining them into one .rma6 file, the resulting taxonomy only goes down to species level, without the previous strain information. Is that an error or “inherent to the system”?

Thank you,

Ralf

Daniel · March 26, 2017, 12:33am

Dear Ralf,

what the current algorithm does is the following:
Let F and R be a pair of reads.
The algorithm first applies the LCA to the alignments of F to get taxon t(F)
It then applies the LCA to the alignments of R to get taxon t(R)
It then assigns F and R to the LCA of t(F) and t(R)

This is a very conservative approach that explains your observation.
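
To illustrate why this loses strain-level assignments, here is a minimal Python sketch of the three steps above. It is not MEGAN's actual code: the toy taxonomy, the taxon names "strain A" and "strain B", and the functions `lineage`, `lca` and `assign_pair` are all made up for this example, and none of the usual alignment filters (minimum score, top percent, and so on) are modeled.

```python
# Toy taxonomy as a child -> parent map (hypothetical names, illustration only).
PARENT = {
    "S. aureus strain A": "S. aureus",
    "S. aureus strain B": "S. aureus",
    "S. aureus": "Staphylococcus",
    "S. epidermidis": "Staphylococcus",
    "Staphylococcus": "root",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, starting with the taxon."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca(taxa):
    """Lowest common ancestor of a non-empty list of taxa."""
    taxa = list(taxa)
    common = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        common &= set(lineage(taxon))
    # Walking up from the first taxon, the first shared node is the deepest one.
    return next(node for node in lineage(taxa[0]) if node in common)


def assign_pair(hits_f, hits_r):
    """Assign a read pair: LCA of F, LCA of R, then the LCA of both results."""
    t_f = lca(hits_f)  # t(F)
    t_r = lca(hits_r)  # t(R)
    return lca([t_f, t_r])


# F aligns to one strain only; R aligns equally well to two strains.
hits_f = ["S. aureus strain A"]
hits_r = ["S. aureus strain A", "S. aureus strain B"]

print(lca(hits_f))                  # F on its own stays at strain level
print(lca(hits_r))                  # R on its own goes up to the species
print(assign_pair(hits_f, hits_r))  # the pair is placed at the species
```

In this toy case F on its own is assigned to a strain, but as soon as its mate R is ambiguous between strains, the pair as a whole moves up to the species node.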

I will look into implementing an optional, less conservative approach (which will be based on adding the bit scores of alignments to the same reference in the appropriate way…)

Best wishes
Daniel

Ralf · March 27, 2017, 4:18pm

Hi Daniel,

thank you for your quick response. I have to admit that I took the assignments to the strain level in the “single-end” data as a hint that the S. aureus taxon might be split up into different strains. I guess this has to be evaluated thoroughly - maybe with specialized software like PanPhlAn or Sigma. I assume that the data from the “paired-end meganization” might be an ideal input for these tools.

Best wishes

Ralf