Support for per-taxon authenticity metadata (coverage + damage)

I work with ancient environmental DNA and use MEGAN7 CE for taxonomic binning of BLAST/DIAMOND alignments. To assess authenticity, I then run mapDamage/metaDMG on mapped reads to estimate deamination and coverage per taxon.

It would be extremely helpful if MEGAN could ingest a simple TSV with per-taxon, per-sample metrics (e.g. coverage breadth, depth, metaDMG damage_A and Zfit, and a composite “authenticity score”), and then:

  • store these as auxiliary attributes on taxon nodes, and

  • allow colouring of nodes in the Taxonomy viewer by one of these attributes (while keeping node size proportional to read count/coverage).
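
To make the colouring idea concrete, here is a minimal Python sketch of how a 0–1 attribute could be mapped to a node colour. The function name, the two endpoint colours, and the grey fallback for taxa without a metric are all placeholders of mine, not anything MEGAN currently provides.

```python
def attribute_to_hex(value, low=(70, 130, 180), high=(178, 34, 34)):
    """Linearly interpolate between two RGB colours for a value in [0, 1].

    `low` and `high` are arbitrary example colours (assumption).
    """
    if value is None:
        return "#bfbfbf"  # neutral grey for taxa with no metric (assumption)
    # Clamp so out-of-range values still map to one of the endpoint colours.
    t = min(max(float(value), 0.0), 1.0)
    rgb = [round(lo + t * (hi - lo)) for lo, hi in zip(low, high)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


colour_for_low_score = attribute_to_hex(0.0)
colour_for_high_score = attribute_to_hex(1.0)
```

Node size would stay tied to read count as it is now; only the fill colour would come from the chosen attribute.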

This would provide a visual “authentication overlay” on the existing MEGAN taxonomy tree, making it much easier to distinguish likely modern contaminants from genuinely ancient taxa.

I’m happy to provide example TSVs and a minimal R/Python pipeline that computes these metrics from BAM + metaDMG output.

Dear @TMurchie,

Thanks, sounds nice.

This will be helpful.

Best regards,
Anupam

Here’s a proposed TSV format, with explanations, for the metaDMG/coverage aggregation input. Most columns can be taken directly from the metaDMG output; a few, such as coverage depth/breadth and the proposed ‘authenticity score’, would need to be computed separately. I don’t have all of that sorted out yet, but the goal is to produce these columns so that MEGAN’s GUI taxonomy tree can display confidence in authenticity in various ways.

sample             tax_id  tax_name           tax_rank  N_reads  N_alignments  A     significance  damage  cov_breadth  cov_mean_depth  auth_score
Site-Depth-Age-01  9901    Bison bison        species   523      612           0.18  6.2           0.12    0.42         3.1             0.91
Site-Depth-Age-02  9913    Bos taurus         species   37       44            0.02  0.8           0.01    0.05         1.0             0.05
Site-Depth-Age-03  9689    Rangifer tarandus  species   110      131           0.12  4.1           0.08    0.35         2.5             0.73
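
For illustration, this is roughly how such a file could be read with the Python standard library. It is only a sketch: the function name and the typing rules (tax_id kept as a string, N_reads/N_alignments as integers, every other non-text column as a float, empty cells as None) are my assumptions about how an importer might behave, not a description of MEGAN.

```python
import csv
import io

REQUIRED = {"sample", "tax_id"}
TEXT_COLS = {"sample", "tax_id", "tax_name", "tax_rank"}
INT_COLS = {"N_reads", "N_alignments"}


def read_metrics(handle):
    """Return {(sample, tax_id): {column: value}} from a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    table = {}
    for row in reader:
        attrs = {}
        for col, raw in row.items():
            # Skip the key columns and any surplus fields beyond the header.
            if col is None or col in REQUIRED:
                continue
            if raw is None or raw == "":
                attrs[col] = None           # missing value
            elif col in TEXT_COLS:
                attrs[col] = raw
            elif col in INT_COLS:
                attrs[col] = int(raw)
            else:
                attrs[col] = float(raw)     # unknown columns are assumed numeric
        table[(row["sample"], row["tax_id"])] = attrs
    return table


header = ["sample", "tax_id", "tax_name", "tax_rank", "N_reads", "N_alignments",
          "A", "significance", "damage", "cov_breadth", "cov_mean_depth", "auth_score"]
rows = [
    ["Site-Depth-Age-01", "9901", "Bison bison", "species", "523", "612",
     "0.18", "6.2", "0.12", "0.42", "3.1", "0.91"],
    ["Site-Depth-Age-02", "9913", "Bos taurus", "species", "37", "44",
     "0.02", "0.8", "0.01", "0.05", "1.0", "0.05"],
    ["Site-Depth-Age-03", "9689", "Rangifer tarandus", "species", "110", "131",
     "0.12", "4.1", "0.08", "0.35", "2.5", "0.73"],
]
example = "\n".join("\t".join(fields) for fields in [header] + rows)
metrics = read_metrics(io.StringIO(example))
```

Keying on (sample, tax_id) and treating tax_name and tax_rank as ordinary attributes means the lookup never depends on the rank at which a node happens to be displayed.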

Main columns

  • sample — must match the MEGAN sample/document label (string)
  • tax_id — NCBI taxid (string or int)

Columns matching metaDMG output

  • tax_name (string)
  • tax_rank (string)
    Rank mismatches: metaDMG can be run at the genus, species, or family level, whereas MEGAN nodes can be viewed at mixed ranks. Best practice: store attributes keyed by tax_id only (not by rank), and let the viewer show them wherever that tax_id appears.
  • N_reads (int)
  • N_alignments (int)
  • A (float) — background-independent damage amplitude (source: metadmg-dev.github.io)
  • significance (float) — metaDMG “sigmas away from 0” (this is effectively Zfit in common usage)
  • damage (float) — estimated damage D

Coverage columns (computed externally; not in metaDMG by default)

  • cov_breadth (float 0–1) — fraction of reference bases covered (aggregate per taxon)
  • cov_mean_depth (float) — length-weighted mean depth across reference bases (aggregate per taxon)
  • Note: many taxa won’t have meaningful coverage (especially if the reference assembly is fragmentary or the taxon is represented by many short contigs). Treat missing values as cov_breadth=0.
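
As a sketch of the per-taxon aggregation described above, assuming you already have, for each reference sequence assigned to a taxon, its length, the number of covered bases, and its mean depth (for example, the kind of per-reference summary that samtools coverage reports). The function name and input layout are mine, purely for illustration.

```python
def aggregate_coverage(contigs):
    """Aggregate per-reference coverage into per-taxon breadth and mean depth.

    contigs: iterable of (length, covered_bases, mean_depth) tuples,
             one per reference sequence assigned to the taxon (assumed layout).
    Returns (cov_breadth, cov_mean_depth).
    """
    total_len = 0
    covered = 0
    depth_sum = 0.0
    for length, covered_bases, mean_depth in contigs:
        total_len += length
        covered += covered_bases
        # Weight each reference's mean depth by its length.
        depth_sum += mean_depth * length
    if total_len == 0:
        # No reference bases at all: report zero, matching the
        # "treat missing as cov_breadth=0" convention.
        return 0.0, 0.0
    return covered / total_len, depth_sum / total_len


# Two references: 1000 bp with 400 bp covered at 2.0x, 3000 bp with 600 bp covered at 1.0x.
breadth, depth = aggregate_coverage([(1000, 400, 2.0), (3000, 600, 1.0)])
# breadth = 1000 / 4000 = 0.25; depth = (2000 + 3000) / 4000 = 1.25
```

Weighting by length stops a handful of short, deeply covered contigs from dominating the per-taxon depth.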

Other optional columns

  • auth_score (float 0–1) — composite score (user-defined, in progress)
  • mean_L (float) — mean read length (metaDMG; source: metadmg-dev.github.io)
  • rho_Ac (float) — fit diagnostic (metaDMG)
  • any additional numeric columns (MEGAN can ignore unknowns or store them)
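
Since auth_score is still undefined, here is one purely hypothetical way such a composite could be built, just to show the shape of the calculation. The formula, the weights, the significance midpoint, and the damage cap are all invented for this sketch; it is not a validated authenticity measure and it does not reproduce the auth_score values in the example table.

```python
import math


def auth_score(significance, damage, cov_breadth,
               sig_midpoint=2.0, damage_cap=0.3, weights=(0.5, 0.3, 0.2)):
    """Hypothetical composite in [0, 1]; every parameter here is a placeholder."""
    # Squash significance with a logistic curve centred on sig_midpoint.
    sig_term = 1.0 / (1.0 + math.exp(-(significance - sig_midpoint)))
    # Scale damage so that damage_cap or more counts as the maximum.
    dmg_term = min(max(damage / damage_cap, 0.0), 1.0)
    # Breadth is already a 0-1 fraction; clamp defensively.
    cov_term = min(max(cov_breadth, 0.0), 1.0)
    w_sig, w_dmg, w_cov = weights
    weighted = w_sig * sig_term + w_dmg * dmg_term + w_cov * cov_term
    return weighted / (w_sig + w_dmg + w_cov)


# Inputs taken from the first two example rows (significance, damage, cov_breadth).
bison = auth_score(6.2, 0.12, 0.42)
bos = auth_score(0.8, 0.01, 0.05)
```

Whatever the final definition turns out to be, keeping the score in the 0–1 range would let the viewer treat it like any other colourable attribute.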