Extract to new document takes extremely long

lenaw · February 27, 2018, 9:05am

Hi there,
I’m working with the latest release of the MEGAN6 Community Edition (v6_10_10) and want to extract all reads assigned to “Bacteria” into a new .rma6 file (to get rid of some contamination).
I’m running the “Extract to new document” function on a comparison file. My “Bacteria” node does contain a lot of sequences (~2.5 million), but the extraction takes extremely long: it has been running for 2 days straight and the progress bar has hardly moved…
I have already set the maximum memory usage in MEGAN.vmoptions.
Is there any other way to accelerate the process? Or do I just have to be patient?
Thanks a lot for your help!
Best,
Lena

Daniel · February 28, 2018, 12:36pm

Hi Lena,

It shouldn’t take that long.

Also, we are working on implementing a contamination filter.
You can enable the current version of this by choosing
Edit->Preferences->Enable Software Feature and entering

contaminants

This will allow you to specify one or more contaminant taxa. Note that specifying a higher-rank taxon such as Eukaryota will declare that node and all nodes below it as contaminants.
When this is activated, any read that has a significant alignment to any contaminant taxon will be placed on a special Contamination node.

Please give this new feature a spin and send me your feedback.

lenaw · February 28, 2018, 2:31pm

Hey Daniel,

thanks for the fast reply. The feature sounds really promising, but I don’t quite get how to use it.
I enabled it as you suggested and the notification “Executing: setprop enable-contaminants=true;” pops up in the Messages window.
But there seems to be no additional “contaminant” option or anything where I could specify the contaminant taxa in the GUI…do I need to refresh it somehow (or am I just missing it…)?

Best,
Lena

Daniel · March 1, 2018, 7:24am

You will have to reimport or remeganize your data. Once the contaminants mode has been activated, you will find a new button, “Load Contaminants File”, in the Taxonomy tab of the Import Blast and Meganize dialogs. It allows you to load a file that contains contaminant names or taxon ids, one per line.
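For illustration, here is a minimal Python sketch of how such a file could be generated. It relies only on the "one name or taxon id per line" format described above; the file name `contaminants.txt` and the helper `write_contaminants_file` are hypothetical, and the two entries (the name Chordata and the NCBI taxon id 2759 for Eukaryota) are just examples.

```python
from pathlib import Path


def write_contaminants_file(path, taxa):
    """Write contaminant names or taxon ids to a text file, one per line.

    `path` and the entries in `taxa` are illustrative; only the
    one-entry-per-line layout comes from the description above.
    """
    # Convert ids to text and skip blank entries.
    lines = [str(t).strip() for t in taxa if str(t).strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Hypothetical example: one taxon name and one NCBI taxon id.
entries = write_contaminants_file("contaminants.txt", ["Chordata", 2759])
```

The resulting file can then be selected with the “Load Contaminants File” button when reimporting or remeganizing.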

cjfields · June 23, 2018, 10:50pm

@Daniel I noticed this as an option in daa-meganizer now. I tried this but didn’t see any change to the annotation in MEGAN6 after meganizing with a simple file (just the taxon name ‘Chordata’ in a file), even after activating the ‘contaminants’ feature. Does this need to be enabled via command-line somehow?

Daniel · June 26, 2018, 3:45pm

There was a bug in MEGAN that caused the contaminant filter to be ignored when using daa-meganizer and similar tools. That has been fixed; please update to release 6.12.0.

MTERZIN · September 27, 2022, 12:36am

Hi all, I am having some issues with the “Extract to new document” function as well. In my case it is not even possible to use it… I am using MEGAN v.6.7.18 and I want to export only bacterial, archaeal and viral reads as a new rma6 document. This way I would remove the eukaryotic reads (which are contamination in my case). I would then import the subset rma6 file to get GO and COG terms for only those reads that were annotated as prokaryotic or viral. Is this the best way to do this, and if so, is there a reason why this function may not work? Please let me know. Thank you, Marko

Daniel · October 23, 2022, 3:35pm

Did you consider using MEGAN’s contaminant filter? When importing a Blast file or meganizing a DAA file, the setup dialog allows you to specify a “contamination file”. This file should contain one or more taxon ids (one per line) that you consider to be contaminants. Note that if you specify a higher-rank taxon, e.g. 2759 for Eukaryotes, then that taxon and all taxa that lie below that node in the taxonomy (i.e. all eukaryotic taxa) are considered contaminants. Any read that has significant alignments to such a taxon is placed on a special “contaminant” node, in all classifications, not just the taxonomic one.
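The "taxon and everything below it" rule can be sketched in Python as a walk up the taxonomy. This is only an illustration of the rule, not MEGAN's implementation: the `PARENT` map is a toy, heavily abbreviated excerpt of the NCBI taxonomy (most intermediate ranks omitted), and `is_contaminant` is a hypothetical helper. It also only tests a single taxon, whereas MEGAN decides per read based on its significant alignments.

```python
# Toy child -> parent map using NCBI taxon ids; intermediate ranks omitted.
PARENT = {
    9606: 7711,     # Homo sapiens -> Chordata
    7711: 33208,    # Chordata -> Metazoa
    33208: 2759,    # Metazoa -> Eukaryota
    2759: 131567,   # Eukaryota -> cellular organisms
    562: 2,         # Escherichia coli -> Bacteria
    2: 131567,      # Bacteria -> cellular organisms
    131567: 1,      # cellular organisms -> root
}


def is_contaminant(taxon_id, contaminant_ids, parent=PARENT):
    """Return True if taxon_id is a listed contaminant or lies below one."""
    node = taxon_id
    while node is not None:
        if node in contaminant_ids:
            return True
        node = parent.get(node)  # None once we step past the root
    return False


contaminants = {2759}  # Eukaryota, as in the example above
human_flagged = is_contaminant(9606, contaminants)
ecoli_flagged = is_contaminant(562, contaminants)
```

With 2759 listed, the human taxon (9606) is flagged because Eukaryota is one of its ancestors, while E. coli (562) is not.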

Unfortunately, the current implementation of “Extract to new document” is too slow for the size of today’s datasets. We are planning to reimplement how data is stored, and hopefully we will be able to deal with very large datasets in the future.