Support for per-taxon authenticity metadata (coverage + damage)

TMurchie · November 24, 2025, 10:36pm

I work with ancient environmental DNA and use MEGAN7 CE for taxonomic binning of BLAST/DIAMOND alignments. To assess authenticity, I then run mapDamage/metaDMG on mapped reads to estimate deamination and coverage per taxon.

It would be extremely helpful if MEGAN could ingest a simple TSV with per-taxon, per-sample metrics (e.g. coverage breadth, depth, metaDMG damage_A and Zfit, and a composite “authenticity score”), and then:

store these as auxiliary attributes on taxon nodes, and
allow colouring of nodes in the Taxonomy viewer by one of these attributes (while keeping node size proportional to read count/coverage).

This would provide a visual “authentication overlay” on the existing MEGAN taxonomy tree, making it much easier to distinguish likely modern contaminants from genuinely ancient taxa.

I’m happy to provide example TSVs and a minimal R/Python pipeline that computes these metrics from BAM + metaDMG output.

Anupam · December 8, 2025, 4:15am

Dear @TMurchie ,

Thanks, sounds nice.

This will be helpful.

Best regards,
Anupam

TMurchie · December 9, 2025, 9:41pm

Here’s a proposed TSV format, with an explanation of each column, for the metaDMG/coverage aggregation input. Most columns can be pulled directly from the metaDMG output; a few others, such as coverage depth/breadth and the proposed ‘authenticity score’, would need to be computed separately. I don’t have that all sorted yet, but the goal is to produce these columns so that MEGAN’s GUI taxonomy tree can then display confidence in authenticity in various ways.

sample	tax_id	tax_name	tax_rank	N_reads	N_alignments	A	significance	damage	cov_breadth	cov_mean_depth	auth_score
Site-Depth-Age-01	9901	Bison bison	species	523	612	0.18	6.2	0.12	0.42	3.1	0.91
Site-Depth-Age-02	9913	Bos taurus	species	37	44	0.02	0.8	0.01	0.05	1.0	0.05
Site-Depth-Age-03	9689	Rangifer tarandus	species	110	131	0.12	4.1	0.08	0.35	2.5	0.73
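
To make the intended keying concrete, here is a minimal Python sketch of how a reader could index such a file by (sample, tax_id). It only uses the standard library; `load_metrics` is a hypothetical helper name for illustration, not an existing MEGAN or metaDMG function, and the two rows are copied from the example above.

```python
import csv
import io

# Header and two rows copied from the example table above.
HEADER = ["sample", "tax_id", "tax_name", "tax_rank", "N_reads", "N_alignments",
          "A", "significance", "damage", "cov_breadth", "cov_mean_depth", "auth_score"]
ROWS = [
    ["Site-Depth-Age-01", "9901", "Bison bison", "species", "523", "612",
     "0.18", "6.2", "0.12", "0.42", "3.1", "0.91"],
    ["Site-Depth-Age-02", "9913", "Bos taurus", "species", "37", "44",
     "0.02", "0.8", "0.01", "0.05", "1.0", "0.05"],
]
EXAMPLE_TSV = "\n".join("\t".join(r) for r in [HEADER] + ROWS) + "\n"


def load_metrics(handle):
    """Index TSV rows by (sample, tax_id). Hypothetical helper, not a MEGAN API."""
    reader = csv.DictReader(handle, delimiter="\t")
    index = {}
    for row in reader:
        # tax_id is kept as a string so "9901" and 9901 cannot end up as two keys.
        index[(row["sample"], str(row["tax_id"]))] = row
    return index


metrics = load_metrics(io.StringIO(EXAMPLE_TSV))
print(metrics[("Site-Depth-Age-01", "9901")]["auth_score"])
```

Values stay as strings here; type conversion is a separate step and depends on which columns are present.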

Main columns

sample — must match the MEGAN sample/document label (string)
tax_id — NCBI taxid (string or int)

Columns matching metaDMG output

tax_name (string)
tax_rank (string)
Rank mismatches: metaDMG can be run at genus/species/family; MEGAN nodes can be viewed at mixed ranks. Best practice: store attributes keyed by tax_id only, and let the viewer show them wherever that tax_id appears.
N_reads (int)
N_alignments (int)
A (float) — background-independent damage amplitude (metadmg-dev.github.io)
significance (float) — metaDMG “sigmas away from 0” (this is effectively Zfit in common usage)
damage (float) — estimated damage D

Coverage columns (computed externally; not in metaDMG by default)

cov_breadth (float 0–1) — fraction of reference bases covered (aggregate per taxon)
cov_mean_depth (float) — length-weighted mean depth across reference bases (aggregate per taxon)
Many taxa won’t have meaningful coverage (esp. if reference assembly is fragmentary or taxon is represented by many short contigs). Treat missing as cov_breadth=0.
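
As a rough sketch of the per-taxon aggregation I have in mind (my assumption, not something metaDMG produces): given one record per reference contig with its length, number of covered bases and mean depth, breadth is total covered bases over total reference length, and mean depth is weighted by contig length. The input tuple layout and the names `aggregate_coverage` and `coverage_for` are hypothetical.

```python
from collections import defaultdict


def aggregate_coverage(contigs):
    """Aggregate per-contig coverage into per-taxon (cov_breadth, cov_mean_depth).

    contigs: iterable of (tax_id, ref_length, covered_bases, mean_depth),
    one tuple per reference contig (assumed layout, for illustration only).
    """
    # Per taxon: [total length, total covered bases, sum of length * mean depth]
    totals = defaultdict(lambda: [0, 0, 0.0])
    for tax_id, length, covered, mean_depth in contigs:
        t = totals[tax_id]
        t[0] += length
        t[1] += covered
        t[2] += length * mean_depth

    result = {}
    for tax_id, (length, covered, depth_sum) in totals.items():
        if length == 0:
            result[tax_id] = (0.0, 0.0)
        else:
            result[tax_id] = (covered / length, depth_sum / length)
    return result


def coverage_for(result, tax_id):
    """Taxa with no coverage record are treated as cov_breadth=0."""
    return result.get(tax_id, (0.0, 0.0))


per_taxon = aggregate_coverage([
    ("9901", 1000, 500, 2.0),
    ("9901", 3000, 300, 1.0),
])
breadth, depth = coverage_for(per_taxon, "9901")
```

With the two contigs above, breadth is 800/4000 and mean depth is (1000×2.0 + 3000×1.0)/4000, so a long, sparsely covered contig pulls both values down more than a short one would.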

Other optional columns

auth_score (float 0–1) — composite score (user-defined, in progress)
mean_L (float) — mean read length (metaDMG) (metadmg-dev.github.io)
rho_Ac (float) — fit diagnostic (metaDMG)
any additional numeric columns (MEGAN can ignore unknowns or store them)
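
For what it’s worth, here is one way the column handling could look in Python, assuming the types listed above: sample and tax_id are required, known columns are converted to their stated type, and unknown columns are stored only if they parse as numbers. `parse_row` and the `REQUIRED`/`KNOWN` tables are hypothetical names for this sketch; how MEGAN would actually store the values is up to the developers.

```python
REQUIRED = ("sample", "tax_id")

# Column types as listed in the proposal above.
KNOWN = {
    "tax_name": str, "tax_rank": str,
    "N_reads": int, "N_alignments": int,
    "A": float, "significance": float, "damage": float,
    "cov_breadth": float, "cov_mean_depth": float,
    "auth_score": float, "mean_L": float, "rho_Ac": float,
}


def parse_row(row):
    """Split one TSV row (dict of strings) into typed known fields and numeric extras."""
    missing = [c for c in REQUIRED if not row.get(c)]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")

    parsed = {c: str(row[c]) for c in REQUIRED}
    extras = {}
    for col, raw in row.items():
        if col in REQUIRED or raw is None or raw == "":
            continue
        if col in KNOWN:
            parsed[col] = KNOWN[col](raw)
        else:
            try:
                extras[col] = float(raw)  # keep unknown numeric columns
            except ValueError:
                pass  # ignore unknown non-numeric columns

    # Missing coverage is treated as cov_breadth=0, as proposed above.
    parsed.setdefault("cov_breadth", 0.0)
    return parsed, extras


parsed, extras = parse_row({
    "sample": "Site-Depth-Age-01", "tax_id": "9901",
    "N_reads": "523", "A": "0.18",
    "my_metric": "1.5", "note": "abc",
})
```

Here `my_metric` and `note` are made-up column names: the first would be kept as an extra numeric attribute, the second silently dropped.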