Question regarding custom ncbi.map / ncbi.tre

IamIamI · March 8, 2021, 1:38pm

I was wondering whether there is any documentation on generating up-to-date ncbi.map and ncbi.tre files.

I have run malt-build with an “updated” Acc2Taxonomy file, as described here:
http://megan.informatik.uni-tuebingen.de/t/updating-mapping-files/97/4

This was done because we included several species that were published around the end of 2020, and the most recent acc2taxonomy file was from July 2020, I think. However, when we now try to run Hops, it cannot resolve newer entries that aren’t already in our current ncbi.map (made on 30 January 2019) and ncbi.tre (made on 13 July 2020). As a result, our species of interest trigger the following error:

“SEVERE: [Species of interest] has no assigned taxID and cannot be processed!”

I found here that we can simply add them manually:
http://megan.informatik.uni-tuebingen.de/t/adding-custom-entries-to-mapping-database/1611

But as we might be interested in screening other species generated by this Malt run, I was hoping there would be some documentation on how these two reference files are normally generated. It would also help with the reproducibility of my process, as it might be hard to explain that I manually added entries to a reference file.

Thanks in advance

IamIamI · March 11, 2021, 2:02pm

I think I have found it already.

So ETE3 can do this, apparently. I installed it from the conda repos:
http://etetoolkit.org/download/

Then I wrote a small Python script based on the ETE3 documentation.

For this to work, you only need to change the ROOT_TAXA variable. In this example it’s 2, which is the NCBI TaxID for Bacteria.
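
If you only know the name of the group you want as the root, ETE3's `get_name_translator` can look up the TaxID. Below is a minimal sketch of that lookup. The helper `resolve_taxid` is a hypothetical name, not part of ETE3; it takes the translator as an argument so the logic can be tried without downloading the taxonomy database. It refuses to guess when a name matches no TaxID or more than one, since the same name can apply to more than one taxon.

```python
def resolve_taxid(name, name_translator):
    """Return the single TaxID for `name`, or raise if it is missing or ambiguous.

    `name_translator` is any callable that takes a list of names and returns
    a dict mapping each found name to a list of TaxIDs. ETE3's
    NCBITaxa.get_name_translator behaves this way.
    (resolve_taxid itself is a hypothetical helper for illustration.)
    """
    matches = name_translator([name]).get(name, [])
    if not matches:
        raise ValueError("No TaxID found for %r" % name)
    if len(matches) > 1:
        raise ValueError("Ambiguous name %r, candidates: %s" % (name, matches))
    return matches[0]


# Usage with ETE3 (assumes ete3 is installed and its taxonomy DB is available):
#   from ete3 import NCBITaxa
#   ncbi = NCBITaxa()
#   ROOT_TAXA = resolve_taxid("Staphylococcus", ncbi.get_name_translator)
```

If the lookup reports an ambiguous name, pick the right TaxID from the listed candidates by hand and set ROOT_TAXA to that number, as in the script below.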

from ete3 import NCBITaxa, Tree

# TaxID of the root of the subtree to export (2 = Bacteria).
# For another group of organisms, this is the only value that needs to change.
ROOT_TAXA = 2

# Load the local NCBI taxonomy database, then refresh it from the latest NCBI taxdump
ncbi = NCBITaxa()
ncbi.update_taxonomy_database()

# Get the TaxIDs of everything that branches out from ROOT_TAXA.
# Note: by default ETE3 returns leaf taxa only here; pass
# intermediate_nodes=True to include internal nodes as well.
descendants = ncbi.get_descendant_taxa(ROOT_TAXA, collapse_subspecies=False)
names = ncbi.get_taxid_translator(descendants)

# Write one "TaxID<TAB>scientific name" line per taxon.
# Mode "w" creates the file or empties an existing one.
with open("ncbi.map", "w") as ncbimap_out:
    for taxid in descendants:
        print(str(taxid) + "\t" + names[taxid], file=ncbimap_out)

# Grab the descendants again, but this time as a tree
tree = ncbi.get_descendant_taxa(ROOT_TAXA, collapse_subspecies=False, return_tree=True)

# Convert the NCBI tree to Newick; format=3 is ETE's "all branches + all names" format
# http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#trees
t = Tree(tree.write(features=['taxid']), format=3)

# Write it out (this creates or overwrites ncbi.tre)
t.write(format=3, outfile="ncbi.tre")

This should generate current ncbi.tre and ncbi.map files.
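
Before rerunning Hops, it may be worth checking that the species of interest actually appear in the regenerated ncbi.map. The sketch below assumes the two-column, tab-separated layout written by the script above (TaxID, then name) and ignores any further columns. The function names `load_ncbi_map` and `missing_species` are hypothetical helpers for illustration, not part of MEGAN, MALT or Hops.

```python
def load_ncbi_map(path):
    """Read a tab-separated map file into a {name: taxid} dict.

    Only the first two columns (TaxID, name) are used; lines that do not
    have at least two columns are skipped.
    """
    name_to_taxid = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            name_to_taxid[fields[1]] = int(fields[0])
    return name_to_taxid


def missing_species(path, species):
    """Return the names from `species` that have no entry in the map file."""
    known = load_ncbi_map(path)
    return [name for name in species if name not in known]


# Example usage (replace the list with your own species of interest):
#   print(missing_species("ncbi.map", ["Staphylococcus aureus"]))
```

An empty list means every name was found. Note that the comparison is an exact string match, so the names have to be spelled exactly as in the NCBI taxonomy.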