Difference in "normalised" read counts between MEGAN and manual calculation

Hi Daniel,

I tried to normalize the “absolute” values manually using the equation |C|/|S|*m. I found minor differences between the MEGAN normalized read counts and my own calculation. Please see the numbers below:

In total five samples were used for testing.

  1. sample1 (smallest) <> 2,180,466
  2. sample2 <> 2,661,067
  3. sample3 <> 2,679,666
  4. tested-sample <> 2,710,131
  5. sample5 <> 3,113,893

MEGAN (version 6.17)

[read counts obtained from MEGAN]
absolute # <> normalised #
109466 <> 87918
39620 <> 31821
22673 <> 18210
14636 <> 11755
4348 <> 3493
24998 <> 20077
23862 <> 19165
7418 <> 5958

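Normalized values calculated manually using the equation |C|/|S|*m (here |C| is the absolute read count, |S| is the size of the tested sample, and m is the size of the smallest sample):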

(109466/2710131)*2180466 = 88072.08624
(39620/2710131)*2180466 = 31876.7111
(22673/2710131)*2180466 = 18241.814
(14636/2710131)*2180466 = 11775.55638
(4348/2710131)*2180466 = 3498.231697
(24998/2710131)*2180466 = 20112.41858
(23862/2710131)*2180466 = 19198.43716
(7418/2710131)*2180466 = 5968.234299
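For anyone who wants to reproduce the manual calculation, here is a minimal Python sketch. It only uses the numbers quoted in this post; the function name `normalize` and the variable names are my own and are not taken from MEGAN.

```python
# Manual normalization |C| / |S| * m, using the figures quoted in this post.
TESTED_SAMPLE_SIZE = 2_710_131    # |S|: size of "tested-sample"
SMALLEST_SAMPLE_SIZE = 2_180_466  # m: size of "sample1", the smallest sample

# Absolute counts and the normalized counts that MEGAN 6.17 reported for them.
absolute = [109466, 39620, 22673, 14636, 4348, 24998, 23862, 7418]
megan = [87918, 31821, 18210, 11755, 3493, 20077, 19165, 5958]


def normalize(count, sample_size, target_size):
    """Scale a read count from a sample of sample_size down to target_size."""
    return count / sample_size * target_size


manual = [
    normalize(c, TESTED_SAMPLE_SIZE, SMALLEST_SAMPLE_SIZE) for c in absolute
]

# Print each absolute count next to both normalized values and their difference.
for c, reported, calculated in zip(absolute, megan, manual):
    print(f"{c:>7} {reported:>7} {calculated:>12.3f} {calculated - reported:>9.3f}")
```

In every row the manually calculated value is slightly larger than the value MEGAN reports.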

As you can see, there is a difference between the manually calculated and the MEGAN-reported normalized read counts. Can you tell me whether I missed anything, or why there is this difference?

Thanks in advance,

Prem

There are indeed minor differences due to the way that MEGAN does book-keeping.

Oh, I see, thanks. Do you think the values that I calculated using the formula are correct, and can I use them for my analysis?

Thanks

I took another look at the code and at your numbers. The discrepancies are larger than what I would expect. Basically, I would only expect to see differences due to rounding. I just tried this out on a collection of files to confirm this.

When doing normalization, MEGAN reports the following line in the message window:

Normalizing to: N reads per sample.

What was N for your data? Was it exactly 2,180,466?

Also, did you use the “ignore unassigned” option?

Apologies for the delay in reply.

What was N for your data? Was it exactly 2,180,466? Also, did you use the “ignore unassigned” option?

I didn’t “ignore unassigned” reads.

Surprisingly, N was not 2,180,466.

I have copied the log from the message window here:

Computing comparison:
ignoreUnassigned=false;
Normalizing to: 2,133,096 reads per sample // I am wondering how MEGAN gets this number!

When selecting the “ignore unassigned” option:
ignoreUnassigned=true;
Normalizing to: 916,644 reads per sample

Thanks.
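As a quick cross-check of the logged N against the counts posted earlier in this thread, the sketch below compares the scale factor implied by MEGAN's output (normalized / absolute) with the factor expected from the formula and with the factor the logged N would give. It assumes the tested sample size is the 2,710,131 listed above; it does not explain how MEGAN arrives at 2,133,096, and the variable names are illustrative only.

```python
# Counts posted earlier in the thread (MEGAN 6.17, ignoreUnassigned=false).
absolute = [109466, 39620, 22673, 14636, 4348, 24998, 23862, 7418]
megan = [87918, 31821, 18210, 11755, 3493, 20077, 19165, 5958]

tested_sample_size = 2_710_131    # size listed for "tested-sample"
smallest_sample_size = 2_180_466  # size listed for "sample1"
logged_n = 2_133_096              # N from "Normalizing to: ..." in the log

# Scale factor MEGAN effectively applied to each count.
observed = [norm / count for count, norm in zip(absolute, megan)]

# Scale factors expected if |S| really is 2,710,131 (an assumption).
expected_from_formula = smallest_sample_size / tested_sample_size
expected_from_log = logged_n / tested_sample_size

print("observed factors:     ", [round(r, 4) for r in observed])
print("expected from formula:", round(expected_from_formula, 4))
print("expected from log N:  ", round(expected_from_log, 4))
```

The observed factors are nearly constant across all rows but sit between the two expected values, so under the assumption above neither 2,180,466 nor 2,133,096 alone reproduces MEGAN's numbers with a sample size of 2,710,131.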

I think that I want to go back to my original explanation for the discrepancies of counts, e.g. 87918 vs 88072.

MEGAN normalizes by scaling counts down to the smallest input sample size. However, the code uses rounding in a number of key steps. I have rewritten the code to avoid rounding, and with the next release of MEGAN, 6.18.9, you should see none of the differences that you have pointed out.

Hi Daniel,

Thanks for letting me know. Appreciate your help.

Release 6.18.11 will not use rounding and numbers should match.

Great, thanks. Looking forward to it.