Use SILVA taxonomy files in Megan 6

Hi,

I have been trying to load SILVA version 123 taxonomy files in “Edit-Preferences-Use Alternative Taxonomy” (in Megan 6 Community Edition on Windows PC), but I get a “Failed to open files” error message.
I would like to use Megan 6 to work on 16S OTU BIOM files that were generated in QIIME using SILVA v123 taxonomy assignments. The SILVA tree and map files I am using are from the SILVA 123 QIIME release, and I haven’t found any reason why these files should fail to load.

Do you have any suggestions for how I could use SILVA taxonomy in Megan 6?

Thank you,

GK

You can use the NCBI taxonomy with SILVA. To do so, download and use the silva-to-ncbi mapping file available on the MEGAN6 download page.

But that does not address your immediate problem, which I will look into.
D

Hello,

I’m trying to use the mapping file for working with the SILVA SSU (SSURef_132_Nr99_tax_silva_to_NCBI_synonyms.map.gz), but I’m not sure how to write the command for daa-meganizer. Is the following command correct?

daa-meganizer -i file.daa -a2t SSURef_132_Nr99_tax_silva_to_NCBI_synonyms.map.gz

Thanks in advance.
Juanjo

The SILVA mapping file is for use with 16S rRNA nucleotide sequences, which must be aligned against 16S references using a DNA-DNA aligner. DIAMOND is for aligning coding sequence against protein references. It makes no sense to mix the two.

(Also, the command is not correct: you need to use -s2t with the SILVA mapping file because it is a MEGAN “synonyms” mapping file, not a MEGAN “accessions” mapping file.)

Many thanks Daniel. I’d be very grateful if you could verify the validity of the following workflow for taxonomic assignment:

diamond makedb --in nr.gz -d nr

diamond blastx -d nr.dmnd -q sample1.fasta -o sample1.daa --outfmt 100 -F 15 --range-culling --top 10

daa-meganizer -i sample1.daa -a2t prot_acc2tax-Nov2018X1.abin

After applying this workflow I did obtain taxonomic assignment, but nevertheless it would be nice to confirm the validity of these scripts so that I can continue with the taxonomic analysis.

Many thanks in advance.
Juanjo

You need to specify long-read mode when using daa-meganizer:

daa-meganizer --longReads

Hello Daniel,

I’ve tried “daa-meganizer --longReads” as you suggested, and got 7,083,542 assignments to Bacteria. Considering that the corresponding R1.fastq file contains about 14 million reads (151 bp), isn’t this number of assignments too high?
When daa-meganizer is run with the option “-alg weighted”, I get 21,850 assignments to Bacteria. Does this number make more sense?

Thanks in advance.

I’m confused: is this SSU data or shotgun data?
It looks like you have short-read shotgun data.
In that case, please run daa-meganizer without the --longReads option.

Hi Daniel,

This is shotgun data (NovaSeq 6000 2x150), but before using Megan, the reads were assembled into contigs whose lengths range between 300 and 10,000 bp. For this case, you recommended (in “Alignment of long reads”) running blastx with the options “-F 15 --range-culling --top 10” and then daa-meganizer with the option “--longReads”.
Below you can see the different outputs from blastx with and without long-reads options:

diamond blastx -d nr.dmnd -q sample1.fasta -o sample1.daa --outfmt 100 -F 15 --range-culling --top 10
Reported 26,122,619 pairwise alignments, 26,137,913 HSPs.
226,132 queries aligned.

diamond blastx -d nr.dmnd -q sample1.fasta -o sample1.daa --outfmt 100
Reported 5,216,235 pairwise alignments, 5,237,419 HSPs.
225,442 queries aligned.

However, when daa-meganizer is run with the “--longReads” option, the number of assignments is remarkably high (7,083,542 for Bacteria alone), which makes me suspicious about the validity of all these parameters for my samples. On the other hand, when daa-meganizer is run with “-alg weighted”, the output is 21,850 assignments to Bacteria, which seems to make more sense.
Should I use blastx with options “-F 15 --range-culling --top 10” and then daa-meganizer with option “-alg weighted”?

Thanks

For long reads or contigs, run DIAMOND with

-F 15 --range-culling --top 10

Run daa-meganizer with

--longReads

Note that when using the long read option, MEGAN reports the number of aligned bases rather than the number of reads aligned. So, I am guessing that the number “7,083,542 only for bacteria” refers to the number of aligned bases?
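
A minimal dry-run sketch of the two steps above for contigs, using only the flags quoted in this thread. The file names (sample1.fasta, nr.dmnd, prot_acc2tax-Nov2018X1.abin) are taken from the earlier posts; the `run` helper, the `contig_workflow` function and the `RUN` variable are hypothetical conveniences, not part of DIAMOND or MEGAN. By default the script only prints the commands:

```shell
#!/bin/sh
# Dry-run helper (hypothetical): print the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Contig workflow as described in this thread; file names are examples.
contig_workflow() {
  sample=$1                        # e.g. sample1 (expects sample1.fasta)
  db=nr.dmnd                       # DIAMOND database built from nr
  map=prot_acc2tax-Nov2018X1.abin  # MEGAN accession mapping file

  # Step 1: align the contigs with the long-read options.
  run diamond blastx -d "$db" -q "$sample.fasta" -o "$sample.daa" \
      --outfmt 100 -F 15 --range-culling --top 10 || return 1

  # Step 2: meganize the DAA file in long-read mode.
  run daa-meganizer -i "$sample.daa" -a2t "$map" --longReads
}

contig_workflow sample1
```

Setting RUN=1 would execute the commands instead of printing them, assuming diamond and daa-meganizer are on the PATH.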

I see. Then, how can I get the number of reads assigned to each taxon? The aim is to study the taxonomic composition and relative abundance of each taxon, so I would need the raw counts of reads assigned to each taxon.

Also, while the number of assignments to Bacteria is 7,083,542, some taxa at lower taxonomic levels show a much higher number of assignments; for instance, Gammaproteobacteria: 12,280,207. Does that make sense? I assumed the number of assignments would increase as you move up to higher taxonomic levels.

Hi Daniel,

I think I’ve now understood the actual meaning of the terms “assigned” and “summed”. I guess the term “summed” refers to the total number of reads assigned to a given taxon, including the reads assigned to all descendant nodes from that taxon; while “assigned” refers to the number of reads assigned to a given taxon and that could not be further classified into lower taxonomic levels. Is that correct?

I’d still appreciate an answer to the first question: how can I get the number of reads assigned to each taxon, instead of the number of aligned bases, when using the long-read option (--longReads)?
I also still wonder whether running blastx with the options “-F 15 --range-culling --top 10” and then daa-meganizer with the option “-alg weighted” would be a valid way to analyse contigs. The results I’ve obtained with that approach seem to make more sense than those from running blastx with default parameters, or from running daa-meganizer with the long-read option.

Many thanks in advance.

Yes, “assigned” means the number assigned to a specific node, whereas “summed” means assigned to a specific node, or to any of the nodes “below” the specific node.
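
To illustrate the difference, here is a small sketch that computes “summed” from “assigned” on a toy taxonomy. The tree and the counts are made up for illustration and do not come from the data in this thread; the point is only that a lower node can have a larger “assigned” value than its ancestor, while “summed” never decreases on the way up:

```shell
# Toy taxonomy: each line is "node parent assigned" (counts are made up).
# A parent of "-" marks the root.
tree='Bacteria - 5
Proteobacteria Bacteria 3
Gammaproteobacteria Proteobacteria 12
Firmicutes Bacteria 4'

summed=$(printf '%s\n' "$tree" | awk '
  { parent[$1] = $2; assigned[$1] = $3; order[NR] = $1 }
  END {
    for (i = 1; i <= NR; i++) {
      n = order[i]
      # Add the count assigned to this node to itself and to every ancestor.
      for (a = n; a != "-" && a != ""; a = parent[a])
        summed[a] += assigned[n]
    }
    for (i = 1; i <= NR; i++) {
      n = order[i]
      printf "%s assigned=%d summed=%d\n", n, assigned[n], summed[n]
    }
  }')

printf '%s\n' "$summed"
```

Here Gammaproteobacteria has more directly assigned counts (12) than Bacteria (5), yet the summed value for Bacteria (24) still covers everything below it.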

Use the read assignment method flag:

-ram

with possible values

readCount, readLength, alignedBases or readMagnitude
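
As a sketch, the command below asks for read counts in long-read mode. It assumes that -ram can be combined with --longReads and -a2t as used earlier in this thread; `meganize_cmd` is a hypothetical helper that only prints the command line and rejects values other than the four listed above:

```shell
# Build a daa-meganizer command line with a chosen read assignment mode.
# Only prints the command; nothing is executed.
meganize_cmd() {
  daa=$1   # DAA file produced by DIAMOND
  ram=$2   # one of the four -ram values listed above
  case "$ram" in
    readCount|readLength|alignedBases|readMagnitude) ;;
    *) echo "unknown -ram value: $ram" >&2; return 1 ;;
  esac
  printf 'daa-meganizer -i %s -a2t %s --longReads -ram %s\n' \
    "$daa" prot_acc2tax-Nov2018X1.abin "$ram"
}

meganize_cmd sample1.daa readCount
```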

Running DIAMOND with “-F 15 --range-culling --top 10” is a good idea for contigs. (Frame-shift alignment triggered by -F 15 is unnecessary, but at present the other two options can’t be invoked on their own.)

Which LCA performs best depends on the details of the data, of course, and also on what your question is. I would favor the longReads LCA, but the weighted LCA might also be ok.