Difference in "normalised" read counts between MEGAN and manual calculation

bknair · March 11, 2020, 3:33am

Hi Daniel,

I tried to normalize the “absolute” values manually using the equation |C|/|S|*m. I found minor differences between MEGAN's normalized read counts and my own calculation. Please refer to the figures below:

In total five samples were used for testing.

sample1 (smallest) <> 2,180,466
sample2 <> 2,661,067
sample3 <> 2,679,666
tested-sample <> 2,710,131
sample5 <> 3,113,893

MEGAN (version 6.17)

[read counts obtained from MEGAN]
absolute # <> normalised #
109466 <> 87918
39620 <> 31821
22673 <> 18210
14636 <> 11755
4348 <> 3493
24998 <> 20077
23862 <> 19165
7418 <> 5958

Calculated normalized value using the equation, |C|/|S|*m

(109466/2710131)*2180466 = 88072.08624
(39620/2710131)*2180466 = 31876.7111
(22673/2710131)*2180466 = 18241.814
(14636/2710131)*2180466 = 11775.55638
(4348/2710131)*2180466 = 3498.231697
(24998/2710131)*2180466 = 20112.41858
(23862/2710131)*2180466 = 19198.43716
(7418/2710131)*2180466 = 5968.234299
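The arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the manual calculation from this post, not MEGAN's code: it assumes |C| is the absolute count, |S| is the size of the tested sample (2,710,131) and m is the size of the smallest sample (2,180,466). The function and variable names are made up for illustration.

```python
# Values copied from the post above.
TESTED_SAMPLE_SIZE = 2_710_131    # |S|, size of the tested sample
SMALLEST_SAMPLE_SIZE = 2_180_466  # m, size of the smallest sample

ABSOLUTE = [109466, 39620, 22673, 14636, 4348, 24998, 23862, 7418]
MEGAN_NORMALISED = [87918, 31821, 18210, 11755, 3493, 20077, 19165, 5958]


def normalise(count, sample_size, target):
    """Scale a count to the target sample size: |C| / |S| * m."""
    return count / sample_size * target


# Manual values, kept as floats (no rounding at any step).
manual = [
    normalise(c, TESTED_SAMPLE_SIZE, SMALLEST_SAMPLE_SIZE) for c in ABSOLUTE
]

# Manual value minus the value MEGAN reported.
differences = [m - g for m, g in zip(manual, MEGAN_NORMALISED)]

print("absolute  MEGAN    manual        diff")
for c, g, m, d in zip(ABSOLUTE, MEGAN_NORMALISED, manual, differences):
    print(f"{c:>8} {g:>6} {m:>12.3f} {d:>9.3f}")
```

Every manual value comes out larger than the corresponding MEGAN value, which is the discrepancy asked about below.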

As you can see, there is a difference between the manually calculated and the MEGAN-reported normalized read counts. Can you tell me whether I missed anything, or why there is this difference?

Thanks in advance,

Prem

Daniel · March 30, 2020, 1:17pm

There are indeed minor differences due to the way that MEGAN does book-keeping.

bknair · March 31, 2020, 2:28am

Oh, I see, thanks. Do you think the values that I calculated using the formula are correct, and can I use them for my analysis?

Thanks

Daniel · March 31, 2020, 6:22am

I took another look at the code and at your numbers. The discrepancies are larger than what I would expect. Basically, I would only expect to see differences due to rounding. I just tried this out on a collection of files to confirm this.

When doing normalization, MEGAN reports the following line in the message window:

Normalizing to: N reads per sample.

What was N for your data? Was it exactly 2,180,466?

Also, did you use the “ignore unassigned” option?

bknair · April 19, 2020, 3:54am

Apologies for the delay in reply.

You asked: What was N for your data? Was it exactly 2,180,466? And did you use the “ignore unassigned” option?

I didn’t “ignore unassigned” reads.

Surprisingly it was not 2,180,466.

I have copied the log from the message window here:

Computing comparison:
ignoreUnassigned=false;
Normalizing to: 2,133,096 reads per sample // I am wondering how MEGAN gets this number!

When selecting the “ignore unassigned” option:
ignoreUnassigned=true;
Normalizing to: 916,644 reads per sample
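One way to compare the logged N with the reported counts is to work backwards: for each row, compute the target size that the single-step formula |C|/|S|*m would need in order to give MEGAN's normalised value. This is a rough sketch using only the numbers posted in this thread. It assumes |S| = 2,710,131 for the tested sample, and the function name is hypothetical.

```python
# Values copied from the first post in this thread.
TESTED_SAMPLE_SIZE = 2_710_131  # |S|, size of the tested sample

ABSOLUTE = [109466, 39620, 22673, 14636, 4348, 24998, 23862, 7418]
MEGAN_NORMALISED = [87918, 31821, 18210, 11755, 3493, 20077, 19165, 5958]


def implied_target(absolute, normalised, sample_size):
    """Solve normalised = absolute / sample_size * m for m."""
    return normalised / absolute * sample_size


implied = [
    implied_target(a, n, TESTED_SAMPLE_SIZE)
    for a, n in zip(ABSOLUTE, MEGAN_NORMALISED)
]

for a, n, m in zip(ABSOLUTE, MEGAN_NORMALISED, implied):
    print(f"{a:>7} -> {n:>6}  implied m = {m:,.0f}")
```

With the figures as posted, each implied target falls between the logged 2,133,096 and the smallest sample size 2,180,466, so neither number reproduces the MEGAN values under this single-step formula alone.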

Thanks.

Daniel · May 25, 2020, 7:49am

I think that I want to go back to my original explanation for the discrepancies of counts, e.g. 87918 vs 88072.

MEGAN normalizes by scaling counts down to the smallest input sample size. However, the code uses rounding in a number of key steps. I have rewritten the code to avoid rounding; with the next release of MEGAN, 6.18.9, you should see none of the differences that you have pointed out.
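To illustrate in general terms why rounding inside a calculation matters more than rounding at the end, here is a small sketch. It is not MEGAN's code and does not claim to reproduce its book-keeping: rounding the scaling factor to two decimals is a made-up example of an "early" rounding step, applied to the first count from this thread.

```python
count = 109466            # absolute count from the first post
sample_size = 2_710_131   # size of the tested sample
target = 2_180_466        # size of the smallest sample

# Exact scaling, rounded only once at the very end.
exact = count * target / sample_size
late_rounded = round(exact)

# Hypothetical early rounding: the scaling factor is rounded to two
# decimals before it is applied (an assumption made for illustration only).
coarse_factor = round(target / sample_size, 2)
early_rounded = round(count * coarse_factor)

print(f"exact value:            {exact:.3f}")
print(f"rounded at the end:     {late_rounded}")
print(f"rounded factor first:   {early_rounded}")
```

Rounding only the final result moves it by at most 0.5, whereas rounding an intermediate factor shifts the result by several hundred reads in this example.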

bknair · May 25, 2020, 8:28am

Hi Daniel,

Thanks for letting me know. Appreciate your help.

Daniel · May 25, 2020, 10:16am

Release 6.18.11 will not use rounding and numbers should match.

bknair · May 25, 2020, 10:36am

Great, thanks. Looking forward to it.