Is this low functional assignment for meganized DAA files?

Hi there,

I have recently performed a metatranscriptomic analysis with the SAMSA2 pipeline (which uses DIAMOND blastx). With MEGAN UE, I ran daa-meganizer (all default settings) on about 40 .daa files using megan-map-Feb2022-ue.db (10.1 GB).
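For reference, my meganizer call looked roughly like this (a sketch; the paths are shortened placeholders for my actual locations, and everything else was left at the daa-meganizer defaults):

```bash
# Meganize each DIAMOND .daa file against the UE mapping database
# (input and mapping-db paths are placeholders)
for daa in step_4_output/daa_binary_files/*.daa; do
    daa-meganizer -i "$daa" -mdb megan-map-Feb2022-ue.db
done
```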

```
#meganizer output for 1 (relatively small) metatranscriptome file

Functional classifications to use: EC, EGGNOG, GTDB, INTERPRO2GO, KEGG, SEED
Loading ncbi.map: 2,396,736
Loading ncbi.tre: 2,396,740
Loading ec.map: 8,200
Loading ec.tre: 8,204
Loading eggnog.map: 30,875
Loading eggnog.tre: 30,986
Loading gtdb.map: 240,103
Loading gtdb.tre: 240,107
Loading interpro2go.map: 14,242
Loading interpro2go.tre: 28,907
Loading kegg.map: 25,480
Loading kegg.tre: 68,226
Loading seed.map: 961
Loading seed.tre: 962
Meganizing: /projects/soil_ecology/dr359/UM/samsa2/fire_interleaved/step_4_output/daa_binary_files/fire_9.merged.ribodepleted.fastq.nr.daa
Meganizing init
Annotating DAA file using FAST mode (accession database and first accession per line)
Annotating references
10% 100% (3.7s)
Writing
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.2s)
Binning reads Initializing…
Initializing binning…
Using ‘Naive LCA’ algorithm for binning: Taxonomy
Using Best-Hit algorithm for binning: SEED
Using Best-Hit algorithm for binning: EGGNOG
Using Best-Hit algorithm for binning: KEGG
Using ‘Naive LCA’ algorithm for binning: GTDB
Using Best-Hit algorithm for binning: EC
Using Best-Hit algorithm for binning: INTERPRO2GO
Binning reads…
Binning reads Analyzing alignments
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (15.6s)
Total reads: 2,481,396
With hits: 2,481,396
Alignments: 2,481,396
Assig. Taxonomy: 2,315,108
Assig. SEED: 1,002
Assig. EGGNOG: 769
Assig. KEGG: 742
Assig. GTDB: 155,886
Assig. EC: 272
Assig. INTERPRO2GO: 274,910
MinSupport set to: 248
Binning reads Applying min-support & disabled filter to Taxonomy…
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.2s)
Min-supp. changes: 2,500
Binning reads Applying min-support & disabled filter to GTDB…
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.8s)
Min-supp. changes: 671
Binning reads Writing classification tables
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (2.0s)
Binning reads Syncing
100% (0.0s)
Class. Taxonomy: 1,238
Class. SEED: 225
Class. EGGNOG: 54
Class. KEGG: 376
Class. GTDB: 147
Class. EC: 187
Class. INTERPRO2GO: 410
Total time: 30s
Peak memory: 5.4 of 32G

```

This looks like really bad read assignment for all of the functional databases to me. Clearly, taxonomic assignment for this sample was fine (2.3 million of 2.4 million reads), so what have I done wrong for function? In the comparison file of the 36 meganized daa files, I have 154 million reads but only 8,013 KEGG assignments. These data are metatranscriptomes from Illumina HiSeq.
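(Side note for anyone wanting to check their own counts: as I understand it, daa2info can dump these per-classification counts from a meganized file without opening the GUI, along these lines; the -c2c (class-to-count) option is from memory, so treat this as a sketch:)

```bash
# Print class-to-count tables for one meganized DAA file,
# one functional classification at a time
for cls in EC EGGNOG GTDB INTERPRO2GO KEGG SEED; do
    echo "== $cls =="
    daa2info -i fire_9.merged.ribodepleted.fastq.nr.daa -c2c "$cls"
done
```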

Please let me know whether the problem is with my daa files, my meganizing, or something else. Ideally, a response with the daa-meganizer code if that is the issue.

Thanks in advance!

Also, sorry, I should have mentioned: I used the nr database (310 GB) for my DIAMOND blastx annotations.
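(For context, the DIAMOND database itself was built the standard way beforehand, something like the following; the exact file names are approximate:)

```bash
# Build the DIAMOND-formatted database from the NCBI nr protein FASTA
diamond makedb --in nr.gz --db nr    # writes nr.dmnd
```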

This makes no sense…
Could you please make the DAA file available to me? I will re-meganize it in a debugger to see why functional assignments are not being made…

Thanks for the quick reply!

Yes, I agree it is confusing. One thing to note: I did not use the -lg option for daa-meganizer; perhaps this is the problem… though HiSeq reads are not long reads?
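(For completeness, long-read mode would be switched on like this; as I understand it, -lg/--longReads only changes the binning for long reads, so it shouldn't be needed for short HiSeq reads — a sketch with a placeholder file name:)

```bash
# Long-read mode (NOT used here; HiSeq reads are short)
daa-meganizer -i sample.daa -mdb megan-map-Feb2022-ue.db -lg
```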

Here is a link to the original .daa file (~240MB): control_9.merged.ribodepleted.fastq.nr.daa - Google Drive

Hopefully the drive link works?

Thanks again for the help!

Thank you for access to the file.

You have restricted the number of alignments to one per read, the so-called “best hit” strategy. Never use this strategy: for any given read there will be tens or even hundreds of equally good reference sequences, so you will randomly get one (of many possible) taxonomic assignments. You will also get only very few functional assignments, because only some of the references have functions associated with them…

So, run DIAMOND again allowing 25 alignments per read, say, and that should work much better.
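Something along these lines (a sketch; database and file names are placeholders):

```bash
# Re-run DIAMOND keeping up to 25 alignments per read instead of only the best hit.
# --outfmt 100 writes a DAA (DIAMOND alignment archive) file;
# -k is the short form of --max-target-seqs.
diamond blastx \
    --db nr \
    --query fire_9.merged.ribodepleted.fastq \
    --out fire_9.merged.ribodepleted.fastq.nr.daa \
    --outfmt 100 \
    --max-target-seqs 25
```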

Ugh, this is what I suspected when I saw that “-k 1” in my diamond call!

Ok, it is running now and will hopefully solve this problem; I will know in a few days. Thanks so much for looking into it for me; it was giving me a lot of stress! Cheers

Ok, I’ve finally been able to re-process this. Here is output from daa-meganizer after running diamond with -k 25.

```
Meganizing init
Annotating DAA file using FAST mode (accession database and first accession per line)
Annotating references
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (44.9s)
Writing
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (1.8s)
Binning reads Initializing…
Initializing binning…
Using ‘Naive LCA’ algorithm for binning: Taxonomy
Using Best-Hit algorithm for binning: SEED
Using Best-Hit algorithm for binning: EGGNOG
Using Best-Hit algorithm for binning: KEGG
Using ‘Naive LCA’ algorithm for binning: GTDB
Using Best-Hit algorithm for binning: EC
Using Best-Hit algorithm for binning: INTERPRO2GO
Binning reads…
Binning reads Analyzing alignments
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (123.2s)
Total reads: 5,712,765
With hits: 5,712,765
Alignments: 54,675,219
Assig. Taxonomy: 5,629,491
Assig. SEED: 12,342
Assig. EGGNOG: 21,668
Assig. KEGG: 26,656
Assig. GTDB: 961,984
Assig. EC: 10,778
Assig. INTERPRO2GO: 1,530,804
MinSupport set to: 571
```

I think this clearly looks better than the previous meganizing, but is it perhaps still low? Is running in “FAST” mode and assigning function with Best-Hit the ideal approach?

Thanks for your help!

That is a low assignment rate. Stick to the fast assignment mode; the other one won't assign more and will be much slower. If you give me access to the file (or a smaller version of it), I'd be happy to take a look at it to see what is going on.

Well that’s a bummer.

It is significantly more KEGG assignments across all samples than before (~9,000 → ~430,000), but perhaps the rate is still low. I have attached a gzipped version of a 1.2 GB meganized DAA file from diamond blastx with -k 25 (fire_43); please let me know what you find when you get the chance.

Thanks very much for your help!

Hi again Daniel,

Sorry to bug you, but I wanted to know whether you were able to look into this issue of low assignment, mostly in the KEGG database, even after correcting the blastx -k call.

Hope you’re well, thanks!

This is not shotgun data. Select one of the taxonomic nodes and then open the alignment viewer. You will see that the reads stack up on a small number of reference sequences, at least for the more specifically aligned nodes:

I looked at a number of taxonomic nodes, and in most cases all the reads align against a single protein, and that protein is labeled as hypothetical (in the sample shown, hypothetical protein BS099_19495).

So I think that MEGAN is working correctly…

Ok, thanks Daniel. Yes, this is metatranscriptomic data, so perhaps it makes sense…

We will proceed with these results! :)