Gene Centric Alignment

etheleon · February 8, 2017, 9:59am

Hi,

In MEGAN6’s manual, the gene-centric function in MEGAN CE is said to be accessible by exporting->Gene-Centric-Assembly.

My question is: what would I have to import?
Will a tabular BLAST output be sufficient, or is a meganized DAA file required?

Secondly, in the paper*, the screenshots show alignments to single reference genes.

Am I right to say that the number of genes “detected”, as shown in Table 1, is the summed number of contigs that passed the threshold (minimum length > 200 bp) and aligned to each reference?

* Huson, D. H., Tappu, R., Bazinet, A. L., Xie, C., Cummings, M. P., Nieselt, K., & Williams, R. (2017). Fast and simple protein-alignment-guided assembly of orthologous gene families from microbiome sequencing reads. Microbiome, 5(1), 11. http://doi.org/10.1186/s40168-017-0233-2

Daniel · February 9, 2017, 12:06pm

A meganized DAA file or an RMA file based on full alignments is required…

(However, it should be possible to extend the algorithm so that BLAST tab output suffices:
In the extended algorithm, MEGAN would have to compute all implied protein overlaps between reads…
I will try to look into this…)
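
As an illustration of what "implied protein overlaps between reads" could mean for tabular BLAST output, here is a minimal Python sketch. It is not MEGAN's implementation. It assumes the standard 12-column tabular layout (subject start and end in columns 9 and 10), and it treats two reads as overlapping when their alignments to the same reference share at least `min_overlap_aa` reference positions. The function name and the threshold are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def implied_overlaps(blast_tab_lines, min_overlap_aa=20):
    """Return (read_a, read_b, reference, overlap_aa) tuples.

    Assumption: each line is standard 12-column BLAST tabular output,
    i.e. qseqid, sseqid, ..., sstart (index 8), send (index 9), ...
    """
    hits_by_ref = defaultdict(list)
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # not enough columns to read subject coordinates
        read, ref = fields[0], fields[1]
        # Normalise so that start <= end on the reference.
        s_start, s_end = sorted((int(fields[8]), int(fields[9])))
        hits_by_ref[ref].append((read, s_start, s_end))

    overlaps = []
    for ref, hits in hits_by_ref.items():
        for (read_a, start_a, end_a), (read_b, start_b, end_b) in combinations(hits, 2):
            if read_a == read_b:
                continue  # two alignments of the same read are not a read pair
            shared = min(end_a, end_b) - max(start_a, start_b) + 1
            if shared >= min_overlap_aa:
                overlaps.append((read_a, read_b, ref, shared))
    return overlaps
```

The pairwise comparison is quadratic in the number of reads per reference, which is fine for a sketch but would need an interval-sorted sweep for large inputs.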

You can use gene-centric assembly in two ways: either select a node in a viewer (such as the KEGG viewer) and have all reads assigned to that node assembled, or open the alignment viewer on some node, select a reference, and then choose Layout->By Contigs, in which case the reads assigned to the selected reference are assembled using that reference.

The number of genes detected is defined in the paper as the number of reference genes that are covered to at least half their length by the longest contig that maps to them:

To assess how well gene sequences are detected for different organisms, we report the number of organisms for which the longest mapped contig covers at least half of the corresponding reference sequence.
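
The definition above can be sketched in a few lines of Python. This is only one reading of the criterion, not code from the paper: it assumes each contig hit is given as a (reference, start, end) triple in reference coordinates, and it takes "longest mapped contig" to mean the contig with the largest aligned span on that reference. The function and parameter names are hypothetical.

```python
def count_detected(ref_lengths, contig_hits, min_fraction=0.5):
    """Count references whose longest mapped contig covers >= min_fraction.

    ref_lengths: dict mapping reference name -> reference length
    contig_hits: iterable of (reference, start, end), 1-based inclusive
                 reference coordinates (an assumption of this sketch)
    """
    longest_span = {}
    for ref, start, end in contig_hits:
        span = abs(end - start) + 1
        # Keep only the largest aligned span seen for each reference.
        longest_span[ref] = max(longest_span.get(ref, 0), span)

    return sum(
        1
        for ref, span in longest_span.items()
        if ref in ref_lengths and span >= min_fraction * ref_lengths[ref]
    )


# Hypothetical example: three contigs map to gene A and four to gene B.
ref_lengths = {"A": 300, "B": 400}
contig_hits = [
    ("A", 1, 200), ("A", 50, 120), ("A", 210, 300),
    ("B", 1, 250), ("B", 10, 90), ("B", 260, 330), ("B", 340, 400),
]
detected = count_detected(ref_lengths, contig_hits)
```

Under this reading, each reference contributes at most one to the count, however many contigs map to it.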

etheleon · February 27, 2017, 10:06am

Hi Daniel,

thanks for the explanation.

So if there were 3 contigs mapping to reference gene A (>50% map), and
4 contigs mapping to reference gene B (>50% map).
The total count will be 7? assuming there’s only matches to ref gene A & B?

what if the same contig maps to multiple reference A and B sequences?

edit: Tried both approaches out using the synthetic dataset provided on the public megan server

1. de novo assembly
2. mapping to reference

The first gives me a fasta output with the contig
While the second gives me this layout by-contigs

Are the contigs related between 1. and 2.? how many contigs are there from the screenshot?

Daniel · February 28, 2017, 9:56am

So if there were 3 contigs mapping to reference gene A (>50% map), and
4 contigs mapping to reference gene B (>50% map).
The total count will be 7? assuming there’s only matches to ref gene A & B?

No, in the paper we only looked at presence/absence of a read that maps to the gene

what if the same contig maps to multiple reference A and B sequences?

Then potentially we counted one reference gene as missed

Tried both approaches out using the synthetic dataset provided on the public megan server

1. de novo assembly
2. mapping to reference

The first gives me a fasta output with the contig
While the second gives me this layout by-contigs

I am pretty sure that you can also save the contigs produced by approach 2 using an appropriate save item from the File menu.

Are the contigs related between 1. and 2.? how many contigs are there from the screenshot?

Not necessarily: 1. is based on all protein-alignment-induced overlaps between reads assigned to a node, while 2. is based only on protein alignments to one specific reference sequence.

etheleon · February 28, 2017, 10:23am

Thanks Daniel