Hi Daniel,
Thanks for investing the time to convert the mapping DB files to SQLite files - it makes them very accessible.
I have a question about customizing the mapping db files for a particular use-case:
Is it possible to include a custom mapping entry for an organism that is not currently present in the NCBI taxonomy?
For example, let’s say we have a specific strain of species X (or a new species Y), and we would like that to show up in MEGAN. I can certainly make up a unique “accession” number for the sequences in the reference database used, and also add those to the mapping db. However, what should be used for the taxonomy ID in the mapping db? The taxonomy ID number for species X would not be ideal, because we would like to distinguish this “new” organism from species X. Similarly, if there is no entry for species Y, is there a possible workaround to allow some type of taxonomic labeling?
MEGAN uses two files to specify the taxonomy that it displays.
First, ncbi.tre contains the hierarchy in Newick format. This describes a rooted tree. Each node is labeled by an integer that represents a taxon.
It should be easy to update this file if all you want to do is to add a sister node. Say that X has taxon id 333085 and you would like to add a sister node Y with (fake) taxon 2000333085 to the tree. Searching for 333085 finds the number here:
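For anyone who would rather script the sister-node edit than do it by hand, here is a rough Python sketch. It assumes ncbi.tre is plain Newick with integer node labels (optionally followed by a branch length) and that the existing taxon id occurs exactly once in the tree. The function name add_sister_node is made up for illustration and is not part of MEGAN. It only touches the tree string, so the new id would presumably still need a matching entry in the second taxonomy file.

```python
import re


def add_sister_node(newick: str, existing_id: int, new_id: int) -> str:
    """Return the Newick string with new_id added as a sister leaf of existing_id.

    Hypothetical helper, not a MEGAN API. Assumes integer node labels.
    """
    # Match the label only when it is a whole token: preceded by '(' ',' or ')'
    # and followed (after an optional ':branch_length') by ',' or ')'.
    # A label followed by ';' is the root, which cannot have a sister.
    pattern = re.compile(
        r"(?<=[(,)])" + str(existing_id) + r"(?::[^,();]*)?(?=[,)])"
    )
    matches = list(pattern.finditer(newick))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one non-root node labeled {existing_id}, "
            f"found {len(matches)}"
        )
    end = matches[0].end()
    # In Newick, siblings are comma-separated at the same nesting level, so
    # inserting ',new_id' right after the existing node makes it a sister,
    # whether the existing node is a leaf or an internal node.
    return newick[:end] + f",{new_id}" + newick[end:]


if __name__ == "__main__":
    # Toy tree, not real NCBI data.
    toy = "((5,333085)10,(3,4)20)1;"
    print(add_sister_node(toy, 333085, 2000333085))
```

To apply this to the real file, one would read the contents of ncbi.tre into a string, call the function, and write the result to a copy rather than overwriting the original.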
If I am using sam2rma for conversion, will I need to edit these taxonomy files and supply them to the sam2rma program first? Or can the edited taxonomy files be used after the RMA is first created with the default taxonomy files (e.g., by importing the alternative files via the MEGAN preferences tab as you suggested)?
Currently I am working with large SAM files from minimap2, so converting to RMA with sam2rma is strongly preferred.