Phylogenetic assignments from alignments

aguffanti · October 10, 2016, 2:29pm

Hello dear colleaques, I am using MEGAN Community Edition version 6.5.9

I performed the following ‘positive control’:

Extracted all the DNA sequences and CDS proteins from Genbank bacterial genomes.
Built a query dataset of 200 nt sequences from the genome DNA sequences, sampling at random with replacement a total of 500.000 reads.
Ran diamond blastx versus the protein database; each target protein is identified by its Accession Number such as WP_012957021.1, the protein database is non-redundant and contains a total of 10.397.624 entries. Xml output
Read the output in MEGAN and ranked the tree at species or genus level, selecting the appropriate nodes with select Rank and outputing CVS
The number of species which can be assigned to the query 500.000 200 nt DNA segments is 1746 (cleared from plasmids, chromosome, whole genome annotations). The number of species which I can obtain from the MEGAN tree, ranking at the species level, is 975.
If I work at the genera level: input 792 genera, MEGAN gives back 629 genera
The number of reads assigned to species and genera is sensibly lower than the original input
Why am I losing so many entities ? even if the read do contain intergenic material (the fragments are generated from the whole genome sequence, simulating a real metagenome sequencing), I should obtain a very similar number of species and genera from the MEGAN analysis.

Am I doing something wrong ?

Thanks a lot for your beautiful tool,

Alessandro

Daniel · October 13, 2016, 3:45pm

It is hard to know what is going on without seeing more details.
For example, the min support and min support percent parameters set a threshold for the number of reads that a node by attract before it appears in the output. That might lead to some taxa being missing.
Another possibly is that reads might be conserved across different genera (although it seems unlikely that that will cause whole genera to be missing).
A third issue is whether the mapping file that is used to map reference sequences to taxa is adequate?

Have you determined the fate of some of the reads from missing genera? Where do they end up? Do they get a DIAMOND match? If so, are the matches assigned the correct taxon? Do they appear to high up in the taxonomy?

I could get a student to look at this, if you like

aguffanti · October 13, 2016, 4:27pm

Hello and thanks for the kind reply

For example, the min support and min support percent parameters set a
threshold for the number of reads that a node by attract before it
appears in the output. That might lead to some taxa being missing.

I made also an experiment at the species level.
I used the standards: minsupport pct 0.01, minsupport 1. How do I interpret instead max expected and top percent ? I used diamond blastx; starting from 500.000 reads I obtained 452.463 reads in the xml output, of which 447.242 assigned. Input species: 1746 with a total of 448.204 reads (I may be losing some species name here). MEGAN identified 1211 species with a total of 435.915 reads. Overlapping: 974 species with 377.479 reads, corresponding to 397.455 reads originally from GenBank, so there are also assignments to novel species - I think this depends from the target database, which is non-redundant, so we may have one gene assigned to a specie rather than another one. The correlation between the MEGAN and GenBank reads of this final set is 0.917.

A third issue is whether the mapping file that is used to map reference sequences to >taxa is adequate?

It is a diamond blastx xml otput. 500.000 query genomic 200 nt fragments from GenBank genomes + plasmids, versus all the translated CDS from Genbank genomes + plasmid, non-redundant

Have you determined the fate of some of the reads from missing genera?
Where do they end up? Do they get a DIAMOND match? If so, are the
matches assigned the correct taxon? Do they appear to high up in the
taxonomy?

I will do this.

I could get a student to look at this, if you like

If he is interested, that would be useful. I am trying to identify which is the sequencing depth which I should use for a WGS to obtain a maximal ‘resolution power’ of the procedure.

Regards,

Alessandro