Incorrect number of reads shown and missing feature?

To the MEGAN team,

I have 2 questions about MEGAN:

  1. I have tried running a diamond/megan pipeline on a sample of my data. I used head -n400000 to collect a sample of 100,000 reads from the top of my fastq files (once from each file for paired reads). I aligned these to nr with diamond:

diamond blastx --db $database --query /nobackup/b7040535/Macaque_Transcriptome_analysis/Raw_data/current_maca_reads_1_sample.fastq -o $outdir/$current_fastq/$current_fastq"_R1" --tmpdir $TMPDIR -f 100 -e 0.000001

and then used daa2rma to convert these into a megan-friendly format:

daa2rma -t 20 -p -i $indir/SRR6425383_R1.daa $indir/SRR6425383_R2.daa -o $outdir -mdb $megandb

When I import the rma6 file into megan, the total number of reads is shown as 125,752. My question is how is this figure derived? I aligned 100,000 paired reads. Is this considered 200,000 reads by mega, and the difference due to reads not aligned by diamond? For functional/taxonomic quantification, surely paired reads cannot be considered independent counts from one another and therefore a function/taxon should be agreed between a pair to assign a function/taxon for the pair?

  1. I would like to use a feature mentioned in the megan manual: taxon disabling. The manual mentions this is in the options menu and can allow given taxa to be excluded from the analysis, however I can’t see this in the options menu in either the community or ultimate editions of megan6

Any help is much appreciated

Nick

The fact that MEGAN reports 125,752 reads doesn’t make any sense…
What number of reads did DIAMOND report?

Taxon disabling is in Edit->Preferences->Taxon Disabling…

I would recommend that you use daa-meganizer rather than daa2rma.
This processes your daa file so that it can be opened in megan, is faster and has the advantage of not creating an additional file.

Hi Daniel,

Thanks for your prompt feedback.

Diamond reported these number of query reads aligned for the paired reads:

R1: 62169 queries aligned.
R2: 63583 queries aligned.

Interestingly 2x the average of these 2 numbers makes exactly 125,752 - so I presume this indicates each read pair is interpreted separately by megan?

Thanks for help finding taxon disabling. Does this affect the results of the functional analysis? The number of reads in this section didn’t change

using daa-meganizer I was getting the warning:

WARNING: Not an RMA6 file, will ignore paired read information

so it seems both methods will not combine information from read pairs?

I’m not sure I see the advantage of modifying the original file, what if the analysis needs to be repeated with different parameters?

Thanks for the help

Best,

Nick

125752=62169 + 63583, so yes, MEGAN counts each read as a single read, and not pairs of reads…

Taxon disabling does not affect functional analysis.

By default, MEGAN will treat each read of a pair of reads as a separate entity. The paired read mode uses some heuristics to try to improve the taxonomic placement of reads that occur in pairs, such as placing read B on the same taxon as read A, if A has a taxonomic assignment and B doesn’t. But I don’t think that the heuristics work well enough to justify the additional hassle of using paired mode.

You can always re-meganize a DAA file if you want to change the analysis, or can also make multiple copies of your DAA file if you want to compare different analyses of the same file side-by-side.
I am pretty sure that a meganized DAA file will be smaller than the corresponding rma6 file. So, the advantages are size and speed (meganizing a DAA file is much faster than creating an rma6 file…)

Perhaps taxon disabling mapping onto the functional analysis would be a useful feature? E.g. this would allow removal of unwanted e.g. host reads and functional analysis without the need for additional steps.

I have implemented contaminant filtering, which is the correct way to address the problem that you mentioned. Any read that aligns to a contaminant is placed on a special contaminant node and is not assigned to any taxon or any function.

Is is quite experimental, so I will need to check that is in working order and then upload a new release in which the feature is activated…

I’d be happy to do so, if you can then give me feed back on whether it does what you desire…

Happy to give this feature a test. I’m interested in how “contaminant” would be defined. Would this be a user-defined set of taxids, as I think the definition of contaminant will vary widely between experiments.

You specify a list of contaminants. For example, if you considered all archaea and eukaryotes to be contaminants, then you would supply the taxon ids 2157 and 2759 during processing of your input file.
MEGAN will report this:

Using contaminants profile: 2 input, 1,522,418 total

This means that you supplied two high-level taxon ids, which sit above a total of 1.5 million ids.

For short reads, any read that has any alignment to some contaminant taxon is put on the contaminant node (where only those alignments are considered that pass all the filters such as top percent, min bitscore etc).

In long read mode, a simpler criterion is used: any read assigned to a contaminant taxon is deemed a contaminant read.

To activate this feature, you must supply a file containing contaminant ids or names, when supplying names, it should be one per line or in single quotes. You will find these at bottom of the Taxonomy tab of the Meganize and Import Blast dialogs.

You can also supply the contaminant file to the command line tools such as daa-meganizer.

This feature is not completely implemented in older releases of MEGAN, you must use release V6_19_7 or later (which I am currently uploading).

This shows the effect of declaring that all archaea and eukaryotes are contaminants:

Reads identified as contaminants are not binned to functions: