Questions about Diamond and MEGAN tools

Hi there,

I am very interested in using software such as Diamond and MEGAN
to process my metagenomics data.

The data is not 16S rRNA but all genes.

I wonder if you could help me with the following three questions:

  1. I have paired-end data. Is it good practice to merge the reads
    before running blastx via Diamond, or should I run Diamond on each
    read file separately?

  2. If running Diamond on them separately, what is the best way to feed
    the data into daa2rma or daa-meganizer? I notice that there are
    options for paired-end data for those two tools but am not sure if I
    understand their usage, in particular the parameter "-ps
    (--pairedSuffixLength)".

For example,

For R1 file:
diamond blastx --db nr --query sample1_R1.fq --threads 24 --outfmt 100 \
    --out sample1.R1.daa

For R2 file:
diamond blastx --db nr --query sample1_R2.fq --threads 24 --outfmt 100 \
    --out sample1.R2.daa

Now we have got two DAA files such as sample1.R1.daa and
sample1.R2.daa. What is the appropriate way to provide them to daa2rma
or daa-meganizer?

I have tried converting those two DAA files into BLAST tabular format,
and I noticed that the resulting tabular files do not distinguish
read 1 and read 2 of the same pair: both reads are given exactly the
same name in those files. From this I speculate that the DAA files
also do not distinguish the two reads of a pair by read name.

  3. Would you recommend daa2rma or daa-meganizer for processing the
    data before loading it into MEGAN?

Many thanks,

Tom

Hi Tom,

my usual advice is to create DAA files with Diamond, then run them through daa-meganizer (separately) to create meganized DAA files, which then can be opened in MEGAN.

However, that does not take paired reads into account.

The alternative is to run both files through daa2rma together, specifying
--paired and --pairedSuffixLength <n>

It sounds like <n>=0 is appropriate for your data because the names (i.e. first word on header line) of paired reads are identical. (It used to be that the names would differ by some suffix, such as .1 and .2, in which case <n>=2 would be appropriate.)
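
To check whether the names really are identical, it can help to compare the first word of the header line in the two FASTQ files. The following is only a minimal sketch: the function names are made up for illustration, and it looks at the first record only, assuming both files list the reads in the same order.

```shell
# Print the read name of the first record in a FASTQ file:
# the first whitespace-delimited word of the header, without the leading '@'.
first_read_name() {
    awk 'NR == 1 { sub(/^@/, "", $1); print $1; exit }' "$1"
}

# Succeed (exit status 0) if the first reads of the two files share a name,
# which is the situation where a suffix length of 0 would fit.
mates_share_name() {
    name1=$(first_read_name "$1")
    name2=$(first_read_name "$2")
    [ "$name1" = "$name2" ]
}
```

For example, `mates_share_name sample1_R1.fq sample1_R2.fq && echo "names identical"` prints the message only when the two names match. If they differ, print both with `first_read_name` and count the characters of the differing suffix by eye.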

In summary, daa-meganizer is much faster than daa2rma and has the advantage that it does not produce a new file, but daa2rma does take paired reads into account. My advice would be to try daa2rma, but if it takes too long, just use daa-meganizer.
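
As a rough sketch of the daa2rma route, the command might be assembled as below. Only --paired and --pairedSuffixLength come from the advice above; the --in and --out spellings, the output file name, and passing both DAA files to a single --in are assumptions that should be checked against the output of `daa2rma -h` for your MEGAN version. Any mapping-file options your version expects are left out.

```shell
# File names follow the sample1 example; the .rma name is hypothetical.
r1=sample1.R1.daa
r2=sample1.R2.daa
out=sample1.rma

# --paired and --pairedSuffixLength 0 as suggested for identical read names.
# --in / --out are assumed option spellings: verify with `daa2rma -h`.
cmd="daa2rma --in $r1 $r2 --paired --pairedSuffixLength 0 --out $out"

# Dry run: print the command for inspection instead of executing it.
echo "$cmd"
```

Once the options have been checked against the help output, the printed command can be run directly.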

Hi Daniel,

Thank you very much for the helpful comments.

If I run daa-meganizer separately on the paired reads from the same sample, I get two meganized files (R1 and R2) for that sample. After loading both files into MEGAN6, the compare module gives a table of normalised read counts per taxonomic classification for R1 and R2 separately. Is there a reasonable way to get a sample-level table from the R1 and R2 tables?

Many thanks,

In the Samples Viewer you can create a new file that contains the union of the two files: select the two samples and then use the Samples->Total Biome menu item.