Joining multiple .m8 files or merging multiple .daa files into one .mx file

nerdynella · September 30, 2016, 9:38pm

Hi there!

I would like to analyze my metagenomics data using your tools (DIAMOND and MEGAN).
I have 8 samples containing a total of ~235 million reads, at ~230bp long.
I plan to split each sample into 4 smaller chunks to reduce DIAMOND processing time.

(1) If I set the -o parameter for .m8 output, will I be able to join the 4 .m8 outputs created for each sample into one large .m8 file?

(2) If I set the -a parameter instead, will I be able to join the 4 .daa outputs for each sample into one large .daa file?

(3) Still using the -a parameter, will I be able to merge the 4 .daa outputs during the conversion to blast tabular format? i.e. will I be able to convert 4 .daa files into one blast tabular file?

(4) is there a way to join multiple DIAMOND outputs into one MEGAN input file using MEGAN?

If the answer is yes to any of these questions, could you please share how? Any help will be greatly appreciated.

Many thanks in advance,
Nsa

Daniel · October 7, 2016, 3:30pm

It doesn’t make sense to split the files; this won’t speed up DIAMOND.

(1) yes

(2) no

(3) no

(4) using the Import Blast dialog you can select multiple input files that give rise to one output rma6 file.
However, I do not recommend this. Rather, don’t split the 8 samples into smaller chunks but run as is.
Then use daa-meganizer (or the equivalent File menu item in MEGAN) to meganize the 8 daa files. This will be the fastest route.

nerdynella · October 7, 2016, 5:33pm

Thank you for getting back Daniel.
I am running my analysis on a cluster and it’s nearly impossible to request all the resources I need to run DIAMOND on my files as they are. I went ahead and split the files so that I can spread out the jobs without requesting any resources and it worked.

Also, I have been able to join the .m8 outputs and they work in MEGAN!
Cheers,
Nsa

Daniel · October 12, 2016, 6:48pm

I will look into writing a DAA merger program…

nerdynella · October 12, 2016, 10:24pm

That will be really useful.
Thank you!

grendon · July 17, 2017, 7:20pm

Daniel,
Has the DAA merger program been added to MEGAN6 yet?

Sebastian_M · March 26, 2018, 2:05pm

Hi Daniel,

I created many meganized .daa files and I also wondered if the merge script is on the way?

Thanks a lot for the nice program!

Daniel · April 1, 2018, 5:32am

Dear Sebastian,

I believe that we do have a DAA-Merger program and I will look into providing it with the next MEGAN release.

Sebastian_M · April 13, 2018, 12:38pm

Dear Daniel,

that sounds great!

rhall · October 31, 2019, 4:18pm

Was there ever a solution for merging daa files? I’m running into the same issue, I have a large dataset and the only practical solution for the alignments is running on a cluster in small chunks. Thanks.

Daniel · November 25, 2019, 9:28am

We do have a DAA-merger program; I will look into adding it to MEGAN tools.

dportik · July 24, 2020, 12:28am

Hello,
I am working with large long-read datasets (each fasta file with 1-2 million reads of 10-15 kb length). I am also considering splitting fasta files for blastx alignment with Diamond. Currently, I am running some benchmarks to investigate if there is a major improvement in time.

Ultimately I would like to use the daa-meganizer tool, which sounds like the best option for import to MEGAN. Is the DAA-merger program currently available in MEGAN or elsewhere? I downloaded the latest release and did not see this program in the tools bin. This would be extremely helpful for my ongoing work.

Thanks,
Dan

Daniel · July 28, 2020, 6:35am

Our alignment program Ella comes with a daa-merger program. However, the daa-merger program is painfully slow at present and requires more work. If you want to try it out, then please download and install Ella, available here: https://software-ab.informatik.uni-tuebingen.de/download/ella/welcome.html

If you wanted to join multiple .m8 files, then something like this should work:

cat *m8 |sort >a.m8

For use with MEGAN, the key thing is that alignments for the same read must appear together, hence the sort.
If you have paired reads in which both members of the pair have exactly the same name, then you should concatenate and sort all first reads together, then all second reads together, and then concatenate both files.
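
The merge described above could be sketched as a small shell function. This is only a sketch under some assumptions: the chunk outputs are BLAST tabular files whose first tab-separated column is the read name, the `sort` in use supports the `-s` (stable) option as GNU and BSD sort do, and all file names shown are hypothetical placeholders. Sorting on the read name alone with a stable sort keeps alignments for the same read together while leaving the hits for each read in the order the chunk files listed them, which a plain whole-line `sort` would not preserve.

```shell
#!/bin/sh
# Merge chunked BLAST-tabular (.m8) outputs so that all alignments of a
# read end up on adjacent lines. File names below are hypothetical.

# Usage: merge_m8 OUT IN1 [IN2 ...]
merge_m8() {
    out=$1
    shift
    # -t sets the field separator to a tab, -k1,1 sorts on the read name only,
    # -s (assumed available) keeps lines with equal read names in input order.
    cat "$@" | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -s > "$out"
}

# Unpaired reads (the output name must not match the input glob):
#   merge_m8 sample1.m8 sample1_chunk*.m8
#
# Paired reads where both mates share exactly the same name:
# merge each mate separately, then concatenate the two results.
#   merge_m8 R1.m8 sample1_R1_chunk*.m8
#   merge_m8 R2.m8 sample1_R2_chunk*.m8
#   cat R1.m8 R2.m8 > sample1.m8
```

Setting `LC_ALL=C` makes the ordering byte-based and reproducible across machines, which is usually enough here because only the grouping of identical read names matters.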

dportik · July 28, 2020, 6:39pm

Hi Daniel,
Thank you for the suggestion. I won’t be working with paired reads so this will be a little more straightforward. In general, I am considering which formats are best for merging multiple outputs from a given alignment program. My workflows will require splitting input fasta files for processing. I’ve posted in detail about this here: Does sam2rma work for converting SAM protein alignments?