Choice of using summarized / assigned counts for cluster analysis

Dear Daniel and developers of MEGAN,

Hi, we are wondering whether it is possible to allow users to select whether they would like to use summarized or assigned counts for cluster analysis (PCO plots etc)?

We assume that the current versions of MEGAN (5 and 6) use summarized counts. (Is there a way to export the read count matrix that was used for the cluster analysis?) It would also be convenient if, at some stage, we could do cluster analysis using assigned counts at a given taxonomic level.

Thank you for your consideration!

with warm regards,
Maria

Dear Maria,

I just took a look at how the distance calculation is implemented. The news is mixed.
Most of the distance calculations do indeed use summarized counts.
However, the JensenShannonDivergence uses summarized counts for leaves and assigned counts for all non-leaf nodes.
I think that the latter way of calculating distances is more useful: Usually, people select leaves and use them as the basis of distance calculations, in which case either method of computing distances will produce the same result.
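
For readers who want to see what such a distance looks like on a pair of count vectors, here is a minimal sketch of the textbook Jensen-Shannon divergence in Python. It is not MEGAN's code: the function name, the normalization of counts to proportions, and the choice of log base 2 are assumptions made for illustration, and MEGAN's implementation may differ in these details.

```python
import math

def jensen_shannon_divergence(counts_a, counts_b):
    """Textbook Jensen-Shannon divergence (log base 2) between two samples.

    counts_a and counts_b hold the read counts of the same selected nodes,
    in the same order. Illustrative only; not taken from MEGAN.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    # Turn raw counts into proportions so that sample size does not matter.
    p = [c / total_a for c in counts_a]
    q = [c / total_b for c in counts_b]
    # Midpoint distribution of the two samples.
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(u, v):
        # Kullback-Leibler divergence; terms with u_i = 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(u, v) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Whichever counts are fed in (summarized or assigned) determine the vectors, so the choice between the two only matters for selected nodes where they differ.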

To provide more flexibility, in the next release of MEGAN, the distance calculation will be modified as follows:

  • if a selected node is a leaf, then the summarized value associated with the node is used
  • if a selected node is not a leaf, then the assigned value is used

If you want to enforce the use of the summarized value for a given node, then you need to collapse that node so that it is a leaf. If you want to enforce the use of the assigned value for a given node, then ensure that it is not a leaf by uncollapsing it. (If it cannot be uncollapsed, then it has the property that assigned = summarized.)
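
The rule can be sketched in a few lines of Python. This is only an illustration of the logic described above, not MEGAN's actual code: the `Node` class, the `collapsed` flag, the function `count_for_distance`, and the example counts are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    assigned: int                       # reads assigned directly to this node
    children: List["Node"] = field(default_factory=list)
    collapsed: bool = False             # a collapsed node is treated as a leaf

    def summarized(self) -> int:
        # Reads assigned to this node plus all reads assigned below it.
        return self.assigned + sum(c.summarized() for c in self.children)

    def is_leaf(self) -> bool:
        return self.collapsed or not self.children

def count_for_distance(node: Node) -> int:
    # Leaves contribute their summarized count, other nodes their assigned count.
    return node.summarized() if node.is_leaf() else node.assigned

# Hypothetical example: a genus with two species below it.
genus = Node("Bacteroides", assigned=10, children=[
    Node("Bacteroides fragilis", assigned=30),
    Node("Bacteroides ovatus", assigned=5),
])

print(count_for_distance(genus))   # uncollapsed: assigned count is used
genus.collapsed = True
print(count_for_distance(genus))   # collapsed: summarized count is used
```

For a node without children the two counts coincide, which is why selecting only true leaves gives the same result under either scheme.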

Again, this change will not affect the results obtained by most users because, by default, MEGAN selects leaves for this calculation, in which case both the old and the new calculations use summarized values and give the same results.

I’ll upload the new version later this week.

BTW, if you use the Export CSV option to export taxa to read counts and choose the “Assigned” option, then this feature exports summarized counts for leaves and assigned counts for all other nodes. This matches the new way of calculating distances that will be implemented in the next release.

D

I have just uploaded a new version, 6.4.1.
It uses assigned counts for internal nodes and summarized counts for leaves.