Taxonomy2function question

Hi @Daniel thanks for implementing taxonomy2function, the outputs are pretty useful (I’m also looking forward to testing AnnoTree annotation in future analysis).

I did have a question regarding the current implementation. It seems very slow: one recent run, for instance, takes about 18-24 hrs per sample (~50-100M reads). It’s using a single core, so we could reasonably parallelize this on a cluster, but I wanted to know whether this is primarily due to limitations in the way taxonomic and functional information are stored in the DAA file (I think we’ve discussed this at some point in the past, and my recollection is that this was an issue). Or maybe this is something that is addressable?

The main reason I ask is that we have multiple assignments included in the annotations now across many samples, and we’ve contemplated generating various read -> pathway and read -> taxonomy mappings (possibly via daa2info, which runs super-fast) to load into a temp database for joint analyses. In many cases we’re dealing with ~50+ samples these days, so dealing with the many-to-many mappings is already tricky.
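To make the temp-database idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes each mapping has already been exported as (sample, read ID, class) rows; the table names, column names, and example rows are all made up for illustration and are not anything MEGAN or daa2info defines.

```python
import sqlite3

def load_pairs(conn, table, rows):
    """Load (sample, read_id, class) rows into a new table and index it.

    `table` is a hypothetical name chosen by this script, never user input.
    """
    conn.execute(f"CREATE TABLE {table} (sample TEXT, read_id TEXT, class TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    conn.execute(f"CREATE INDEX idx_{table} ON {table} (sample, read_id)")

def taxon_by_pathway(conn):
    """Count reads per (taxon, pathway) pair by joining on sample and read ID."""
    return conn.execute(
        """
        SELECT t.class, p.class, COUNT(*)
        FROM read_taxon AS t
        JOIN read_pathway AS p
          ON t.sample = p.sample AND t.read_id = p.read_id
        GROUP BY t.class, p.class
        ORDER BY t.class, p.class
        """
    ).fetchall()

# Toy rows standing in for exported read -> class mappings (assumed layout).
conn = sqlite3.connect(":memory:")
load_pairs(conn, "read_taxon", [
    ("s1", "r1", "Escherichia"),
    ("s1", "r2", "Escherichia"),
    ("s1", "r3", "Bacteroides"),
])
load_pairs(conn, "read_pathway", [
    ("s1", "r1", "Glycolysis"),
    ("s1", "r2", "Glycolysis"),
    ("s1", "r2", "TCA cycle"),   # one read can map to several pathways
    ("s1", "r3", "Glycolysis"),
])
counts = taxon_by_pathway(conn)
```

Keeping the sample name in every row means one database can hold all ~50 samples, and the many-to-many part is handled by the join rather than by custom code.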

As always, thx for all of the hard work!

Hi cjfields,

Maybe it is worth checking whether your DAA file resides on a network share (e.g., Samba). I experienced very slow extractions from DAA files using read-extractor in such cases, so I moved them to a local drive on the server prior to extraction. The I/O throughput on our Samba share is usually sufficiently high, so I do not know what is causing the slow read extraction.
Unfortunately, I cannot reproduce your case since taxonomy2function does not work for me (Error: Could not find or load main class megan.tools.TaxonomyByFunctio).

Best,
Ralf

Thanks for the hint that Taxonomy2Function is not working in some installations of MEGAN. I have fixed that and the fix will be available in the next release. As a workaround, in the file tools/taxonomy2function, in the last line, replace megan.tools.TaxonomyByFunctio by megan.tools.Taxonomy2Function.
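If you prefer to script the workaround instead of editing by hand, something like the following sketch should do it. The two class names are the ones given above; the helper name patch_launcher is made up, and the path you pass in depends on where MEGAN is installed on your system.

```python
from pathlib import Path

OLD = "megan.tools.TaxonomyByFunctio"   # broken class name from the error message
NEW = "megan.tools.Taxonomy2Function"   # corrected class name

def patch_launcher(script_path):
    """Replace OLD with NEW in the last non-blank line of the launcher script.

    Returns True if the file was changed, False if nothing needed patching.
    """
    path = Path(script_path)
    lines = path.read_text().splitlines(keepends=True)
    # Walk backwards to the last line that is not just whitespace.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip():
            if OLD not in lines[i]:
                return False
            lines[i] = lines[i].replace(OLD, NEW)
            path.write_text("".join(lines))
            return True
    return False
```

Calling patch_launcher on your copy of tools/taxonomy2function a second time returns False, so it is safe to rerun.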

The main reason why extracting this table takes so long is that MEGAN has a table mapping taxa to reads and has other tables mapping functional classes to reads, but no tables mapping reads to taxa or to function. This is for speed and space purposes.
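As a toy illustration of that point (this is not MEGAN's actual code, and all names and data here are invented): when only class -> reads tables exist, a taxon-by-function table has to be built by first inverting one of them into a read -> classes lookup, which means touching every read.

```python
from collections import defaultdict

def invert(class_to_reads):
    """Turn a class -> reads mapping into a read -> classes mapping."""
    read_to_classes = defaultdict(list)
    for cls, reads in class_to_reads.items():
        for read in reads:
            read_to_classes[read].append(cls)
    return dict(read_to_classes)

def cross_tabulate(taxon_to_reads, function_to_reads):
    """Count reads for every (taxon, function) pair.

    The inversion step is the extra work that a stored read -> function
    table would make unnecessary.
    """
    read_to_functions = invert(function_to_reads)
    table = defaultdict(int)
    for taxon, reads in taxon_to_reads.items():
        for read in reads:
            for func in read_to_functions.get(read, []):
                table[(taxon, func)] += 1
    return dict(table)

# Hypothetical example data.
taxon_to_reads = {"A": ["r1", "r2"], "B": ["r3"]}
function_to_reads = {"F1": ["r1", "r3"], "F2": ["r2"]}
table = cross_tabulate(taxon_to_reads, function_to_reads)
```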

I plan to look into using SQLite to represent datasets in MEGAN 7, and then it will be easy to extract data in all kinds of ways that are currently not easily possible…

@Daniel actually, I found that moving the file to a local store like @ralf_m suggested sped up the process dramatically; file generation now takes less than 10 min. We were on a network file share (GPFS) on a cluster. It’s interesting b/c other tools like daa2info run quite quickly, but we have run into issues in the past with similar integrated databases (I think BerkeleyDB but it’s been a while).
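For anyone wanting to automate the copy-to-local step in a pipeline, here is a rough sketch. The helper name staged_locally and the scratch_dir parameter are made up; point scratch_dir at whatever node-local scratch space your cluster provides, or leave it unset to use the system temp directory.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def staged_locally(remote_path, scratch_dir=None):
    """Copy a file from a network share to local scratch for the duration
    of a `with` block, then delete the local copy.

    Results written to the local copy are NOT copied back to the share.
    """
    remote_path = Path(remote_path)
    workdir = Path(tempfile.mkdtemp(dir=scratch_dir))
    try:
        local_copy = workdir / remote_path.name
        shutil.copy2(remote_path, local_copy)   # keeps timestamps and mode
        yield local_copy
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```

Inside the with block you would run the extraction against the yielded local path (for example via subprocess) and write the output wherever you need it; the temporary copy is cleaned up even if the tool fails.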

Thanks @Daniel, this did the trick