COG functional assignments

Hi everybody!

I am writing this post because I am a bit confused about how MEGAN assigns COG/eggNOG functional annotations.

Let me describe my particular case.
I am working with genomes downloaded directly from NCBI. I aligned the genes of these genomes against the NCBI nr database using DIAMOND blastx. However, with MEGAN I obtained a considerably high number of genes with no hit (for instance, 1590 of 3800).
Rereading the MEGAN manual, I noticed that the section on functional classification with eggNOG includes a recommendation for COG functional annotation:

“Hence, if a COG-based analysis is desired, then the database that is used in the BLAST alignment must contain RefSeq-ids”

So, the next step was to reduce the nr database to only the sequences with RefSeq ids. However, after using this nr_ref_seq database, MEGAN still didn’t show an improvement in the number of genes with a COG functional annotation. In fact, in some genomes the number of genes with no COG assignment increased. Looking closely at those genes, I observed that most of them have a best hit with very good % identity and e-value, as in the following case.

lcl|FN869568.1_cds_CBV41043.1_160 gi|1041907453|ref|WP_065241784.1| 99.6 276 1 0 1 828 1 276 1.6e-148 533.5
lcl|FN869568.1_cds_CBV41043.1_160 gi|1063487019|ref|WP_069384679.1| 98.6 276 4 0 1 828 1 276 2.3e-147 529.6
lcl|FN869568.1_cds_CBV41043.1_160 gi|1036466143|ref|WP_064702082.1| 94.8 271 14 0 16 828 4 274 1.3e-139 503.8
lcl|FN869568.1_cds_CBV41043.1_160 gi|1036463760|ref|WP_064699725.1| 94.5 271 15 0 16 828 4 274 3.9e-139 502.3
lcl|FN869568.1_cds_CBV41043.1_160 gi|1120907293|ref|WP_073436697.1| 93.1 276 17 1 1 828 1 274 4.3e-138 498.8
lcl|FN869568.1_cds_CBV41043.1_160 gi|498315256|ref|WP_010629412.1| 91.3 276 23 1 1 828 1 275 2.4e-136 493.0
lcl|FN869568.1_cds_CBV41043.1_160 gi|1032606792|ref|WP_064235193.1| 89.8 274 28 0 7 828 2 275 1.7e-134 486.9
lcl|FN869568.1_cds_CBV41043.1_160 gi|496246056|ref|WP_008959441.1| 89.4 274 29 0 7 828 2 275 4.9e-134 485.3
lcl|FN869568.1_cds_CBV41043.1_160 gi|737609914|ref|WP_035580566.1| 89.4 274 29 0 7 828 2 275 6.4e-134 485.0
lcl|FN869568.1_cds_CBV41043.1_160 gi|496589197|ref|WP_009287518.1| 89.1 274 30 0 7 828 2 275 1.9e-133 483.4
lcl|FN869568.1_cds_CBV41043.1_160 gi|667761819|ref|WP_031382883.1| 82.2 276 49 0 1 828 1 276 1.0e-123 451.1
lcl|FN869568.1_cds_CBV41043.1_160 gi|517846634|ref|WP_019016842.1| 81.9 276 50 0 1 828 1 276 2.3e-123 449.9
lcl|FN869568.1_cds_CBV41043.1_160 gi|1016245609|ref|WP_062951245.1| 74.6 276 70 0 1 828 1 276 4.1e-112 412.5
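For anyone who wants to inspect hits like these programmatically, here is a minimal sketch in Python. It assumes the lines are in the standard 12-column BLAST tabular layout (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score), which is what the rows above look like. The names `Hit`, `parse_hit` and `refseq_accession` are made up for this illustration and are not part of DIAMOND or MEGAN.

```python
import re
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """A few fields of one tabular alignment line (hypothetical helper type)."""
    query: str
    subject: str
    pident: float
    length: int
    evalue: float
    bitscore: float


# Matches the accession in subject ids such as "gi|1041907453|ref|WP_065241784.1|".
REFSEQ_RE = re.compile(r"\|ref\|([^|]+)\|")


def parse_hit(line: str) -> Hit:
    """Parse one whitespace-separated line with the assumed 12 columns."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    return Hit(
        query=fields[0],
        subject=fields[1],
        pident=float(fields[2]),
        length=int(fields[3]),
        evalue=float(fields[10]),
        bitscore=float(fields[11]),
    )


def refseq_accession(subject: str) -> Optional[str]:
    """Return the accession after '|ref|', or None if the id has no such tag."""
    match = REFSEQ_RE.search(subject)
    return match.group(1) if match else None


line = ("lcl|FN869568.1_cds_CBV41043.1_160 gi|1041907453|ref|WP_065241784.1| "
        "99.6 276 1 0 1 828 1 276 1.6e-148 533.5")
hit = parse_hit(line)
print(hit.bitscore, refseq_accession(hit.subject))
```

Pulling out the accession this way makes it easy to look the hit up by hand in whatever accession-to-COG mapping is being used.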

I understand that MEGAN assigns functional annotations using only the best hit of the DIAMOND blastx results, so if the best hit has no COG associated with it, the program does not look at the subsequent hits, even if there are many of them with high identities and small e-values. Is that correct?

Thank you so much

Beatriz

I forgot to ask: is there any way to modify MEGAN’s behaviour to improve its COG functional annotation, so that if the best hit has no COG assigned, the program looks for one in the subsequent hits?

Thanks again

Beatriz

Hi Beatriz,

For a given read, MEGAN goes down the list of hits until it finds one that has a COG assignment, but only among the hits that are within 10% of the best bit score seen for the read. I don’t think that going further down the list is a good idea.
However, I will look into ways of improving the mapping file and thus obtaining more assignments.
D
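The rule described in this reply can be sketched roughly as follows. This is only an illustration under stated assumptions, not MEGAN's actual code: it assumes "within 10%" means a bit score of at least 90% of the best score for the read, and it assumes a plain dictionary from accession to COG. The function `assign_cog` and the example mapping are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def assign_cog(
    hits: List[Tuple[str, float]],
    accession_to_cog: Dict[str, str],
    top_percent: float = 10.0,
) -> Optional[str]:
    """Return the COG of the first hit, in descending bit-score order,
    that has a mapping and lies within top_percent of the best bit score."""
    if not hits:
        return None
    best = max(score for _, score in hits)
    cutoff = best * (1.0 - top_percent / 100.0)
    # sorted() is stable, so hits with equal scores keep their input order.
    for accession, score in sorted(hits, key=lambda h: h[1], reverse=True):
        if score < cutoff:
            break  # every remaining hit is below the cutoff as well
        cog = accession_to_cog.get(accession)
        if cog is not None:
            return cog
    return None


# Hypothetical example: only the third hit has a mapping, and it is within range.
hits = [("ACC_A", 100.0), ("ACC_B", 97.0), ("ACC_C", 92.0), ("ACC_D", 80.0)]
mapping = {"ACC_C": "COG0001", "ACC_D": "COG0002"}  # made-up mapping
print(assign_cog(hits, mapping))
```

Under these assumptions a hit below the cutoff is never consulted, even if it is the only one with a mapping.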

Hi Daniel,

Thank you so much for your answer. The 10% best bit score rule makes sense for the functional assignment. I thought it was based only on the best hit.

However, when I tried to check the 10% best bit score rule, it did not seem to hold for my data, and I can’t figure out why.

Here are my results.

This is the DIAMOND blastx output for a specific gene of the genome Halomonas_elongata:

lcl|FN869568.1_cds_CBV43126.1_2243 gi|1041906984|ref|WP_065241315.1| 99.8 546 1 0 1 1638 1 546 0.0e+00 1118.6
lcl|FN869568.1_cds_CBV43126.1_2243 gi|1063485359|ref|WP_069384131.1| 94.0 546 33 0 1 1638 1 546 3.1e-305 1055.0
lcl|FN869568.1_cds_CBV43126.1_2243 gi|1036466037|ref|WP_064701976.1| 92.3 546 42 0 1 1638 1 546 8.7e-300 1036.9
lcl|FN869568.1_cds_CBV43126.1_2243 gi|1036462013|ref|WP_064697978.1| 92.3 546 42 0 1 1638 1 546 1.1e-299 1036.6
lcl|FN869568.1_cds_CBV43126.1_2243 gi|737622602|ref|WP_035592904.1| 78.0 546 120 0 1 1638 1 546 2.5e-254 885.9

MEGAN gives this gene no COG assignment.

And this is the DIAMOND blastx output for a read derived from this gene of Halomonas elongata, generated using MetaSim:

r90061.1 gi|1063485359|ref|WP_069384131.1| 100.0 50 0 0 1 150 130 179 3.6e-19 101.3
r90061.1 gi|1041906984|ref|WP_065241315.1| 100.0 50 0 0 1 150 130 179 3.6e-19 101.3
r90061.1 gi|1036462013|ref|WP_064697978.1| 92.0 50 4 0 1 150 130 179 7.5e-17 93.6
r90061.1 gi|1036466037|ref|WP_064701976.1| 92.0 50 4 0 1 150 130 179 7.5e-17 93.6
r90061.1 gi|764676315|ref|WP_044454276.1| 85.7 49 7 0 1 147 126 174 5.4e-15 87.4

And this is the COG assignment by MEGAN: r90061.1 “COG0069 glutamate synthase”

In both cases, the first four hits (the ones that should be taken into account according to the 10% best bit score rule) are identical for the read and for the gene. So why is there no assignment for the gene, and which hit is MEGAN using for the functional assignment of the read?
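The arithmetic behind the "first four hits" observation can be checked with a short sketch, again assuming that "within 10%" means a bit score of at least 90% of the best one (my reading of the rule, not something confirmed from MEGAN's source). The helper `within_top_percent` is hypothetical; the bit scores are copied from the two outputs above.

```python
from typing import List


def within_top_percent(bitscores: List[float], top_percent: float = 10.0) -> List[float]:
    """Keep the bit scores that are at least (100 - top_percent)% of the best one."""
    best = max(bitscores)
    cutoff = best * (1.0 - top_percent / 100.0)
    return [score for score in bitscores if score >= cutoff]


gene_scores = [1118.6, 1055.0, 1036.9, 1036.6, 885.9]  # gene CBV43126.1
read_scores = [101.3, 101.3, 93.6, 93.6, 87.4]         # read r90061.1

# Gene cutoff: 1118.6 * 0.9 = 1006.74, so 885.9 falls outside.
print(len(within_top_percent(gene_scores)))  # 4
# Read cutoff: 101.3 * 0.9 = 91.17, so 87.4 falls outside.
print(len(within_top_percent(read_scores)))  # 4
```

So under this reading the same four accessions are eligible for both the gene and the read.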

Sorry for insisting on this issue again, but it is very important for me to know how MEGAN does the COG functional assignments.

Thank you so much

Beatriz

Dear Daniel,

First of all, I would like to apologize for writing to you again about this issue, but I would like your help in understanding what MEGAN is doing in the functional COG assignment of reads.
I am sending you a screenshot of the assignment of one read of my mock metagenome using MEGAN. As far as I can see, MEGAN is not using any of the hits that fulfil the 10% best bit score rule. Instead, it is using the first hit with a COG annotation, which is in 178th position. Moreover, this hit has no RefSeq id. So I no longer understand the behaviour of the program with the default parameters.

I am looking forward to your reply

Thank you so much again

Beatriz