Hi @Daniel! In the past we have sporadically created our own taxonomic mapping database when we run on nr releases that are newer that the available mapping file, primarily to make sure we’re using the most up-to-date information available. However, since all mapping databases are combined, is it still possible to do this? We have a collaborator who has a very large data set run against nr from July 2021.
Yes, the combined mapping file is an SQLite database and so you can manipulate the database quite easily using a program such as sqlite3 or from within Java or Python.
Thanks @Daniel. What I meant is, we used to do the following:
$MEGAN_HOME/tools/ncbi/make-acc2ncbi -i prot.accession2taxid.gz -o prot_acc2tax.abin -v > $SLURM_SUBMIT_DIR/prot_acc2tax.log
Is there an MEGAN-based analog for the current databases that simply updates the mappings accordingly, or would this need to be scripted (eg. via Python)? The above worked to generate the original protein taxonomy mapping file from NCBI’s release mappings, though it did require a fair bit of memory.
NOTE: I forgot to add, this is for the UE combined database. I can check whether that is readable, though I think I checked this before and it didn’t work.
Just re-checked and the database is readable. We’re just looking into whether there is a way to update these already in place with MEGAN. Based on the info
table in the database it seems like everything is bulk-loaded from mapping file, so maybe this isn’t possible?
Here is an outline of how to update the taxonomy values in the mapping file:
You need to use the command-line program sqlite3.
First, prepare a file that has two tab-separated columns, “accession <tab> taxon-id”, call it new-mappings.tab. This file contains the latest mappings. (I think it might be good to sort the lines alphabetically, but not sure whether this is necessary.)
Second, open the mapping file in sqlite3, like this
sqlite3 megan-map-Jan2021-ue.db
In sqlite3, set the separator to <tab>:
.mode tab
Create a new (temporary) table for your new mappings:
CREATE TABLE newmappings (Accession PRIMARY KEY , Taxonomy INT) WITHOUT ROWID;
Import your new mappings into the new table:
.import new-mappings.tab newmappings
Here is the key step, it will insert your new taxonomy values from your newmappings table into the mappings table (keeping the old value for an accession, if no new value provided). It will leave the other columns unaltered:
INSERT INTO mappings (Accession, Taxonomy)
SELECT a.Accession, b.Taxonomy FROM mappings AS a INNER JOIN newmappings as b ON a.Accession=b.Accession
ON CONFLICT(Accession) DO UPDATE SET Taxonomy=excluded.Taxonomy;
Drop your newmappings table:
DROP TABLE IF EXISTS newmappings;
Clean up:
vacuum;
Exit sqlite3:
.quit
I am on holiday at present, so I was only able to try this on a small version of the database, but it appears to work fine. I don’t know how long it will take on the full database. Please let me know whether this works for you and how long it takes.
Hi @Daniel I’ll give this a try and will report back hopefully this week. Thanks for the quick (and detailed!) reply!