Generic pipeline using DIAMOND and MEGAN6

To work with a large number of large metagenomic shotgun samples, proceed as follows.

Let us assume that you have just received a hard-disk containing a collection of fastq.gz files, each representing one sample from your study.

Put these files into a directly called 00fastq.
Create a second directory called 10daa. This will contain DAA files generated by DIAMOND.
Create a third directory called 20rma. This will contain MEGAN RMA files.

For each file reads.fastq.gz in your 00fastq directory, run DIAMOND as follows:

diamond blastx --query 00fastq/reads.fastq.gz --db nr --daa 10daa/reads.daa

When this has completed, for each file reads.aa in your 10daa directory, run daa2rma as follows:

daa2rma -i 10daa/reads.daa -o 20rma/reads.rma -g2t -g2t gi_taxid.bin -g2kegg gi2kegg.bin -fun KEGG

For this to work:

  • you need to install diamond
  • you need to download the nr database from here: ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz (no need to ungzip,
    can be read as is)
  • you need to build a diamond index for nr as follows:
    diamond makedb --in nr.gz --db nr
    This will generate a new file called nr.dmnd
  • you need to install MEGAN6 and you will find the command daa2rma in the megan/tools directory.
  • You need to download the GI-to-NCBI mapping file gi_taxid.bin and the GI-to-KEGG file gi2kegg.bin from the MEGAN download webpage.
  • Other mapping files are required if you want to also compute SEED, COG or other classifications. -

You may also consider using MeganServer to serve the files contained in your 20rma directory.

3 Likes

The entire analysis as described in the previous post can also be executed with a single command using snakemake as workflow manager.

This will build the nr diamond database, use this database for the alignment of all fastq.gz files and translate the resulting daa files to rma files.

The workflow is platform agnostic and can be used locally or using a queueing system such as LSF or SGE.

Hi Daniel,
When converting .daa files in MEGAN using the command as you wrote (daa2rma -i 10daa/reads.daa -o 20rma/reads.rma -g2t -g2t gi_taxid.bin -g2kegg gi2kegg.bin -fun KEGG), how to specify paired end information?

I have parsed my paired sequences separately using diamond blastx, and got output as reads_1.daa and reads_2.daa
thanks,
Suparna

1 Like

the both daa-meganizer and daa2rma currently do not support explicit handling of paired reads. This is only available when using the MEGAN interactive import dialog.
I will update the daa2rma program so that it can explicitly handle paired reads in the near future.

1 Like

Thanks Daniel. That will be great.
S

Dear Daniel,

I would like to export the functional and taxonomic csv table from command line. My output was converted with meganizer after Diamond alignment.

thanks.

Command line use of MEGAN is only supported by the Ultimate Edition.
There, the command to save a taxonomic profile is as follows:

open file=’/Users/bib/data.daa’;
uncollapse nodes=all;
select nodes=leaves;
export what=CSV format=taxonName_to_count separator=comma counts=summarized file=’/Users/bob/data.csv’;
quit;

Hi Daniel,

I have the same question as Suparna, I have Illumina paired read data, so after I run Blastx using Diamond, I cannot use this data unless I use the GUI? And how exactly do I do that? Also, a separate question when I tried to load the Blastx data in XML format to the GUI, the application stopped responding and I couldn’t load that data.

Thanks for your time!

Malik

Inside the MEGAN installation directory you will find a tools directory that provides a number of command line programs for processing alignment files such as produced by DIAMOND, BLAST, Last etc.

Hi Daniel,

Is there any important difference between daa-meganizer and daa2rma? After running both on the same daa file, I couldn’t notice any significant changes except for slight differences in reads assigned/summarized and the rma file being substantially smaller.

many thanks,
Julian

meganizing a DAA file takes much less time that computing an RMA file, and producing an RMA file allows use instruct MEGAN to explicitly use mate pair information (although the current algorithm appears to be quite conservative, so I’m not sure how useful that is at present.) Both types of files, meganized-DAA and RMA, should be equivalent (except that the RMA file might contain less matches if one sets the limit of alignments to keep for a given read to something that is smaller to what the limit was set to for DIAMOND)

Hi Daniel, You mentioned that you might do an update for daa2rma program so that it can handle paired reads…is there any update yet?

Another point is the any multithread option to run jobs in parallel?
Thanks a lot,
Suparna

a) just looking at the help output of daa2rma, support for paired reads appears to be implemented… I don’t remember, did you report a problem with the feature?

b) MEGAN uses some multithreading, but jobs like computing an RMA file from a DAA file is limited mostly by IO, so multithreading is of limited use.

Thanks for your msg…
Sorry if I understood correctly, is this mean that now there is a option for daa2rma using paired reads in comandline too as can be done in GUI? Then how?
Thanks a lot,
Suparna

Here are the first few lines of the help displayed by the program:

 SYNOPSIS
    	DAA2RMA6 [options]
    DESCRIPTION
    	Computes a MEGAN .rma6 file from a DIAMOND .daa file
    OPTIONS
     Input  
    	-i, --in [string(s)]                 Input DAA file. Mandatory option.
    	-mdf, --metaDataFile [string(s)]     Files containing metadata to be included in RMA6 files. 
     Output  
    	-o, --out [string(s)]                Output file(s), one for each input file, or a directory. Mandatory option.
    	-c, --useCompression                 Compress reads and matches in RMA file (smaller files, longer to generate. Default value: true.
     Reads  
    	-mag, --magnitudes                   Reads are annotated with magnitudes. Default value: false.
    	-p, --paired                         Reads are paired. Default value: false.
    	-ps, --pairedSuffixLength [number]   Length of name suffix used to distinguish between name (i.e. first word in header) of read and its mate (use 0 if read and mate have same name). Default value: 0.
    	-pof, --pairedReadsInOneFile         Are paired reads in one file (usually they are in two). Default value: false.

Use -p to tell the program that reads are paired, use -ps to provide the length of the suffix which is used to distinguish paired reads (e.g. use 1 if pairs are named XXX.1 and XXX.2).
Use -pof to tell the program that paired reads appear together in one file (otherwise you must provide two files).

The paired read mode is overly conservative at the moment, pairs of reads are assigned to the LCA of the LCA of each. I will implement a second strategy (based on adding bitscores) in the near future.

Hi Daniel,

I am currently using the trial version of MEGAN6_UE and I cannot get your script to work on my dataset. It complains about the key…

[grendon@compute-5-19 20170417_MEGAN6]$ cat megan6_script.sh

open file='/.../results/20170417_MEGAN6/533AC_R1.paired.rma';
uncollapse nodes=all;
select nodes=leaves;
export what=CSV format=taxonName_to_count separator=comma counts=summarized file='/.../results/20170417_MEGAN6/533AC_R1.paired.csv';
quit;

[grendon@compute-5-19 20170417_MEGAN6]$ MEGAN -g -E < megan6_script.sh

Version   MEGAN Ultimate Edition (version 6.7.18, built 19 Apr 2017)
Copyright (C) 2016 Daniel H. Huson. This program comes with ABSOLUTELY NO WARRAN
TY.
3D PCoA viewer not supported on this platform

Error: Invalid license key

It looks like the license key has not been entered? You can supply it on the command line using the -rc option.
You only need to do that once, MEGAN will “remember” the code.

Hello Daniel,
Just saw that you have implemented the paired files option in daa2rma.
Thanks for this,
Suparna

I am trying to build a diamond index for nr as follows:
diamond makedb --in nr.gz --db nr

I am getting the following error

diamond v0.9.10.111 | by Benjamin Buchfink buchfink@gmail.com
Licensed under the GNU AGPL https://www.gnu.org/licenses/agpl.txt
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 12
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)

Opening the database file… [0.000936s]
Loading sequences… [22.688s]
Masking sequences… [39.8268s]
Writing sequences… [2.54574s]
Loading sequences… [16.6854s]
Masking sequences… [39.8461s]
Writing sequences… [1.76207s]
Loading sequences… [6.51386s]
Error: Inflate error.

Please suggest me a solution.

Thanks

My guess is that your file nr.gz is incomplete. Please try gunzip first so as to get nr.fa and then run diamond with nr.fa
If that doesn’t help, please contact Benjamin Buchfink via the DIAMOND GitHub page.