I work with ancient environmental DNA and use MEGAN7 CE for taxonomic binning of BLAST/DIAMOND alignments. To assess authenticity, I then run mapDamage/metaDMG on mapped reads to estimate deamination and coverage per taxon.
It would be extremely helpful if MEGAN could ingest a simple TSV with per-taxon, per-sample metrics (e.g. coverage breadth, depth, metaDMG damage_A and Zfit, and a composite “authenticity score”), and then:
- store these as auxiliary attributes on taxon nodes, and
- allow colouring of nodes in the Taxonomy viewer by one of these attributes (while keeping node size proportional to read count/coverage).
This would provide a visual “authentication overlay” on the existing MEGAN taxonomy tree, making it much easier to distinguish likely modern contaminants from genuinely ancient taxa.
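To make the colouring idea concrete, here is a minimal Python sketch of how a 0–1 attribute (such as the proposed authenticity score) could be mapped to a node colour. It does not use any MEGAN API; the function names, the red-to-green gradient, and the grey fallback for taxa without a value are all my own assumptions for illustration.

```python
# Hypothetical helper names and colour choices; nothing here is MEGAN's API.

def score_to_rgb(score, low=(215, 48, 39), high=(26, 152, 80)):
    """Linearly interpolate between two RGB colours for a score in [0, 1]."""
    if score is None:
        return (190, 190, 190)  # assumed: grey for taxa with no metric
    s = min(1.0, max(0.0, float(score)))  # clamp out-of-range values
    return tuple(round(lo + (hi - lo) * s) for lo, hi in zip(low, high))


def rgb_to_hex(rgb):
    """Format an (r, g, b) tuple as a #rrggbb string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


node_colour = rgb_to_hex(score_to_rgb(0.91))
```

Node size would stay tied to read count or coverage, so colour only carries the authenticity signal.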
I’m happy to provide example TSVs and a minimal R/python pipeline that computes these metrics from BAM + metaDMG output.
Here’s a proposed TSV format and explanation for the metaDMG/coverage aggregation input. Most of the columns can be taken directly from the metaDMG output; a few others, such as coverage depth/breadth and the proposed ‘authenticity score’, would need to be computed separately. I don’t have all of that sorted out yet, but the goal is to produce these columns so that confidence in authenticity can be displayed in various ways on MEGAN’s GUI taxonomy tree.
| sample | tax_id | tax_name | tax_rank | N_reads | N_alignments | A | significance | damage | cov_breadth | cov_mean_depth | auth_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Site-Depth-Age-01 | 9901 | Bison bison | species | 523 | 612 | 0.18 | 6.2 | 0.12 | 0.42 | 3.1 | 0.91 |
| Site-Depth-Age-02 | 9913 | Bos taurus | species | 37 | 44 | 0.02 | 0.8 | 0.01 | 0.05 | 1.0 | 0.05 |
| Site-Depth-Age-03 | 9689 | Rangifer tarandus | species | 110 | 131 | 0.12 | 4.1 | 0.08 | 0.35 | 2.5 | 0.73 |
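As a rough illustration of how such a file could be read, here is a minimal Python sketch using only the standard library. The function name `parse_metrics`, the choice to treat only `sample` and `tax_id` as required, and the rule that any other non-text column is parsed as a float are my assumptions, not anything MEGAN or metaDMG prescribes.

```python
import csv
import io

# Assumed conventions for this sketch (not a MEGAN specification).
REQUIRED = ("sample", "tax_id")
TEXT_COLS = {"sample", "tax_id", "tax_name", "tax_rank"}
INT_COLS = {"N_reads", "N_alignments"}


def parse_metrics(handle):
    """Read the per-taxon, per-sample TSV into a list of typed dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    rows = []
    for raw in reader:
        row = {}
        for col, value in raw.items():
            if col is None:  # surplus fields beyond the header are skipped
                continue
            value = (value or "").strip()
            if col in TEXT_COLS:
                row[col] = value
            elif value == "":
                row[col] = None  # empty cell: metric not available
            elif col in INT_COLS:
                row[col] = int(value)
            else:
                row[col] = float(value)  # any other column is assumed numeric
        rows.append(row)
    return rows


HEADER = ["sample", "tax_id", "tax_name", "tax_rank", "N_reads", "N_alignments",
          "A", "significance", "damage", "cov_breadth", "cov_mean_depth", "auth_score"]
EXAMPLE = "\n".join("\t".join(fields) for fields in [
    HEADER,
    ["Site-Depth-Age-01", "9901", "Bison bison", "species",
     "523", "612", "0.18", "6.2", "0.12", "0.42", "3.1", "0.91"],
    ["Site-Depth-Age-02", "9913", "Bos taurus", "species",
     "37", "44", "0.02", "0.8", "0.01", "0.05", "1.0", "0.05"],
]) + "\n"

rows = parse_metrics(io.StringIO(EXAMPLE))
```

Keeping `tax_id` as a string avoids any mismatch between files that store it as text and files that store it as a number.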
Main columns
- sample (string): must match the MEGAN sample/document label
- tax_id (string or int): NCBI taxid
Columns matching metaDMG output
- tax_name (string)
- tax_rank (string)
- N_reads (int)
- N_alignments (int)
- A (float): background-independent damage amplitude (metadmg-dev.github.io)
- significance (float): metaDMG “sigmas away from 0” (this is effectively Zfit in common usage)
- damage (float): estimated damage D
Rank mismatches
metaDMG can be run at the species, genus, or family level, while MEGAN nodes can be viewed at mixed ranks. Best practice: store attributes keyed by tax_id only, and let the viewer show them wherever that tax_id appears.
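The suggestion to key attributes by tax_id only could look roughly like the following Python sketch. The names `index_by_taxon` and `attributes_for_node` are hypothetical, and the nested dict layout (tax_id, then sample) is just one possible choice; how MEGAN would store this internally is not something I know.

```python
from collections import defaultdict


def index_by_taxon(rows):
    """Build {tax_id: {sample: attributes}} so lookups ignore tax_rank."""
    index = defaultdict(dict)
    for row in rows:
        tax_id = str(row["tax_id"])  # normalise int/str taxids to str
        attrs = {k: v for k, v in row.items() if k not in ("tax_id", "sample")}
        index[tax_id][row["sample"]] = attrs
    return dict(index)


def attributes_for_node(index, node_tax_id, sample):
    """Return the attributes for a node, or None if the taxon has no metrics."""
    return index.get(str(node_tax_id), {}).get(sample)


rows = [
    {"sample": "Site-Depth-Age-01", "tax_id": 9901, "tax_rank": "species", "damage": 0.12},
    {"sample": "Site-Depth-Age-02", "tax_id": "9913", "tax_rank": "species", "damage": 0.01},
]
index = index_by_taxon(rows)
bison = attributes_for_node(index, "9901", "Site-Depth-Age-01")
```

Because the lookup never consults `tax_rank`, the same attributes appear wherever that tax_id is drawn, regardless of the rank at which the tree is collapsed.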
Coverage columns (computed externally; not in metaDMG by default)
- cov_breadth (float 0–1): fraction of reference bases covered (aggregate per taxon)
- cov_mean_depth (float): length-weighted mean depth across reference bases (aggregate per taxon)
- Many taxa won’t have meaningful coverage (especially if the reference assembly is fragmentary or the taxon is represented by many short contigs). Treat missing values as cov_breadth=0.
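The per-taxon aggregation described above can be sketched in Python as follows. The input is assumed to be a list of per-reference (per-contig) records with `length`, `covered_bases`, and `mean_depth`; those field names are hypothetical, though equivalent values are reported by tools such as `samtools coverage`. Returning a depth of 0 when no reference data exist is my own assumption; only cov_breadth=0 is specified above.

```python
def aggregate_coverage(refs):
    """Aggregate per-reference coverage into per-taxon breadth and depth.

    cov_breadth    = total covered bases / total reference length
    cov_mean_depth = sum(length * mean_depth) / total reference length
    """
    total_len = sum(r["length"] for r in refs)
    if total_len == 0:
        # No usable reference data: breadth 0 as proposed; depth 0 is assumed.
        return {"cov_breadth": 0.0, "cov_mean_depth": 0.0}
    covered = sum(r["covered_bases"] for r in refs)
    depth_sum = sum(r["length"] * r["mean_depth"] for r in refs)
    return {
        "cov_breadth": covered / total_len,
        "cov_mean_depth": depth_sum / total_len,
    }


# Two contigs assigned to the same taxon (made-up numbers for illustration).
refs = [
    {"length": 1000, "covered_bases": 500, "mean_depth": 2.0},
    {"length": 3000, "covered_bases": 300, "mean_depth": 1.0},
]
result = aggregate_coverage(refs)
# breadth = 800 / 4000 = 0.2; depth = (2000 + 3000) / 4000 = 1.25
```

Weighting by length keeps many short contigs from dominating the mean, which matters for fragmentary assemblies.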
Other optional columns
- auth_score (float 0–1): composite score (user-defined, in progress)
- mean_L (float): mean read length (metaDMG; metadmg-dev.github.io)
- rho_Ac (float): fit diagnostic (metaDMG)
- any additional numeric columns (MEGAN can ignore unknown columns or store them)