Create .megan files from the command line

jakew · February 13, 2020, 2:44am

I process a lot of data, all on the command line. At the request of a researcher, I just started using Diamond and Megan, and I like that I can take the daa output from Diamond and create rma6 files for Megan. However, there is still much to do in the analysis.

Is there a way to create a megan file from these rma6 files, trim the #SampleID from something like 51.h38au.bowtie2-e2e.unmapped.diamond.nr.daa to just 51, add a column like “Classification” containing “Case” and “Control”, add @Color and @GroupId columns, all from the command line?

Daniel · February 14, 2020, 9:33am

Please consider using daa-meganizer rather than daa2rma.
daa-meganizer appends a block to the end of the daa file, and the file can then be opened directly in MEGAN. This is much faster and saves you one file.
You can add metadata to an rma file or a meganized daa file while running daa2rma or daa-meganizer by using the -mdf option.
You can extract a .megan file from an rma or meganized daa file using the program rma2info or daa2info, respectively. The .megan files are text-based, so you can easily find and change the line that contains all the metadata.
There isn’t currently a program for adding or modifying the metadata stored in an rma or meganized daa file. I will look into adding this feature to the programs rma2info and daa2info.

jakew · February 14, 2020, 4:13pm

Thank you. I will look into this.

Where can I find details on the metadata file format?

Can I duplicate the functionality of “File > Compare” from the command line, creating a single megan file from all samples in the data set?

Daniel · February 14, 2020, 4:18pm

There is a program called MEGAN/tools/compute-comparison that does that.

The metadata format is the same as used for QIIME.
You have a header line that starts with #SampleID and is followed by the names of the attributes.
Each subsequent line starts with the name of a dataset and then contains the values of the named attributes. Everything is tab-separated.

Here is an example:

#SampleID	Day	Subject	antibiotic	Treatment	Health
Alice00-1mio	0	Alice	cirprofloxacin	no	good
Alice01-1mio	1	Alice	cirprofloxacin	yes	good
Alice03-1mio	3	Alice	cirprofloxacin	yes	good
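
A file in this layout can be generated from the daa file names rather than typed by hand. The following is only a minimal sketch, not part of MEGAN: the function name make_metadata and the lookup file groups.txt (two whitespace-separated columns, sample ID then Case or Control) are hypothetical, and it assumes the sample ID is the text before the first dot in the file name. Whether MEGAN matches such a shortened ID against the dataset name is not confirmed in this thread.

```shell
#!/usr/bin/env bash
# Sketch: write a tab-separated metadata table from .daa file names.
# Assumptions (not from the thread): the sample ID is everything before the
# first dot of the file name, and the first argument is a hypothetical lookup
# file with lines of the form "<sampleID> <Case|Control>".
set -euo pipefail

make_metadata() {
    local groups_file=$1
    shift
    local f base id class

    # Header line: #SampleID followed by the attribute names.
    printf '#SampleID\tClassification\n'

    for f in "$@"; do
        base=${f##*/}     # drop any leading directory
        id=${base%%.*}    # keep the text before the first dot, e.g. 51
        # Look up the group for this ID; empty if the ID is not listed.
        class=$(awk -v id="$id" '$1 == id { print $2; exit }' "$groups_file")
        printf '%s\t%s\n' "$id" "${class:-NA}"
    done
}

# Example call (file names are illustrative):
#   make_metadata groups.txt *.daa > megan_metadata.tsv
```

Samples missing from the lookup file get NA, so they are easy to spot before the table is passed to daa-meganizer.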

jakew · February 14, 2020, 5:04pm

Perfect. Thank you. I’ll give it a try. Not really sure how I didn’t see that given such an obvious name.

jakew · February 16, 2020, 6:05pm

So this works great and produces a megan file, without using a metadata file.

/Applications/MEGAN/tools/daa-meganizer \
     --mapDB ~/Downloads/megan-map-Oct2019.db --in *daa
/Applications/MEGAN/tools/compute-comparison --in *daa

However, if I add a metadata file in the following format …

#SampleID	Classification
01	Control
02	Control
03	Control
04	Control
05	Case
06	Case
07	Case
08	Case
...

for my 77 samples, and meganize like so …

/Applications/MEGAN/tools/daa-meganizer \
    --mapDB ~/Downloads/megan-map-Oct2019.db \
    --metaDataFile ../megan_metadata.tsv --in *daa

compute-comparison runs out of memory after about 25 of the 77 samples. The same thing happens inside Megan after 19 samples when comparing these daa files.

/Applications/MEGAN/tools/compute-comparison --in *daa

…

JVMDUMP039I Processing dump event "systhrow", 
     detail "java/lang/OutOfMemoryError" 
     at 2020/02/16 09:56:33 - please wait.
JVMDUMP039I Processing dump event "systhrow", 
     detail "java/lang/OutOfMemoryError" 
     at 2020/02/16 09:56:33 - please wait.

Why does adding this little bit of metadata cause such a huge memory issue?

Do you have any suggestions for how to correct this?

Thank you.