Different types of normalization

The current normalization method normalizes to the smallest read count among the samples. This approach is easy to criticize, since a large part of the dataset is thrown away, and there are a few well-known papers directly criticizing the use of rarefying/sub-sampling for sequencing data.

This point is typically brought up by reviewers, and the dataset then needs to be normalized differently and re-plotted to show that the results are sound. Another common normalization method, which does not require assembled contigs or estimated gene lengths in the calculation, is counts per million (CPM, i.e. relative proportion x 1,000,000). It is much less criticized in the community because all data is kept; it is used, for example, in the edgeR Bioconductor R package, where it is also combined with trimmed mean of M-values (TMM) normalization.
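
As a rough illustration (not MEGAN or edgeR code), here is a minimal Python sketch of the CPM calculation; the gene names and counts are made up:

```python
def cpm(counts):
    """Counts per million: relative proportion of each feature x 1,000,000."""
    total = sum(counts.values())
    return {feature: c / total * 1_000_000 for feature, c in counts.items()}

# Hypothetical read counts for one sample
sample1 = {"geneA": 30, "geneB": 2_999_970}
print(cpm(sample1))  # {'geneA': 10.0, 'geneB': 999990.0}
```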

Anyway, what I wanted to bring up is that MEGAN only offers one type of normalization, the sub-sampling approach. Because of this, I often use MEGAN just to extract the absolute counts with functional classifications, which are then normalized and analyzed in other software. This seems unnecessary to me, as CPM/TMM values could simply be offered as an alternative normalization method in the MEGAN software and the supplied command-line tool compute-comparison.

Here’s a paper from 2018 that I found on the subject:
Comparison of normalization methods for the analysis of metagenomic gene abundance data

These are valid points, thank you.

I am not sure whether I understand what you are suggesting.
Are you saying that rather than normalizing to the smallest count, we normalize to 1 million (relative proportion x 1,000,000)? If so, this used to be a feature in MEGAN (you could specify the number that you wanted to normalize to), but the problem with that is that species richness is completely misrepresented.
Ok, now every sample has a total count of 1 million, but if a sample only had 10 reads to begin with, then it will have at most 10 different species… and we won’t know whether those 10 species represent a very special community (like an acid mine drainage community) or are just due to insufficient initial sample size.
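
To make that concrete, here is a toy Python sketch (hypothetical data, not MEGAN code) showing that rescaling a shallow sample cannot recover species richness:

```python
import random

random.seed(1)
community = [f"sp{i}" for i in range(100)]  # hypothetical 100-species community

deep = random.choices(community, k=1_000_000)  # deeply sequenced sample
shallow = random.choices(community, k=10)      # only 10 reads

print(len(set(deep)))     # close to 100 observed species
print(len(set(shallow)))  # at most 10, and multiplying the counts by
                          # 100,000 to reach 1 million changes nothing
```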

One suggestion that I have heard, and that makes sense to me, is to use square-root scaling when comparing samples. This evens out differences in counts without throwing away data…
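
For what it’s worth, square-root scaling is just a transform of the raw counts; a minimal sketch with made-up numbers:

```python
import math

counts = {"geneA": 1, "geneB": 100, "geneC": 10_000}  # hypothetical counts
sqrt_scaled = {g: math.sqrt(c) for g, c in counts.items()}
print(sqrt_scaled)  # {'geneA': 1.0, 'geneB': 10.0, 'geneC': 100.0}
```

A 10,000-fold difference in counts becomes a 100-fold difference after scaling, without discarding any reads.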

CPM is just a relative measure, so the data is not subsampled to 1,000,000 read counts before the relative proportion is calculated.

First, the read counts for each gene are normalized as relative proportions (i.e. gene read count / sample sum). Then, instead of multiplying by 100, as is typical for relative abundance (i.e. %), the values are multiplied by 1,000,000, and this is called CPM, a relative unit. Most likely 1,000,000 is used because many genes have very low counts, and the data is easier to work with when the values are not tiny decimals.
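
A small worked example (hypothetical numbers) of why x 1,000,000 is more convenient than x 100 for low-count genes:

```python
count, library_size = 15, 3_000_000  # hypothetical gene in one sample
proportion = count / library_size
print(proportion * 100)        # 0.0005 percent: an awkward small decimal
print(proportion * 1_000_000)  # 5.0 CPM: a readable value
```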

I agree that insufficient sample size is always a problem, even with sub-sampling. Sub-sampling is also a bit problematic when there are large differences between the samples, e.g. sample 1 with 3,000,000 reads vs. sample 2 with 45,000,000 reads; we obviously don’t want to throw away 42,000,000 reads in sample 2. With CPM we at least keep all the information in the samples: even if comparisons between low- and high-depth samples remain a bit problematic, when looking at sample 2 we have the full information.

If it were possible to add different kinds of normalization to MEGAN, I think it would be good to offer a couple of methods that are common in the literature. That would give the user alternatives, and my guess is that one method might be more common for functional gene analysis than for taxonomy, and vice versa.

I realize this topic is a bit old now, but I also think it would be really helpful for the end user to include a few different means of normalizing reads in MEGAN that one could easily switch between. I’m not sure how easily the two best methods found by Pereira et al. (2018) (trimmed mean of M-values [TMM] and relative log expression [RLE]) could be implemented in MEGAN. But if it is possible, having a few different methods would allow one to easily cross-compare in MEGAN to see whether the different approaches skew the results.
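
For what it’s worth, RLE (the DESeq-style median-of-ratios method) is straightforward to sketch; TMM is more involved, since it trims extreme M- and A-values before taking a weighted mean. Here is a rough Python sketch of the RLE idea, with made-up counts (not MEGAN code):

```python
import math

def rle_size_factors(samples):
    """RLE (median-of-ratios) size factors for a dict of
    sample name -> {gene: raw count}. Genes with a zero count
    in any sample are skipped, as the log is undefined."""
    genes = set.intersection(*(set(s) for s in samples.values()))
    usable = [g for g in genes if all(s[g] > 0 for s in samples.values())]
    # The per-gene geometric mean across samples serves as a pseudo-reference
    geo_mean = {
        g: math.exp(sum(math.log(s[g]) for s in samples.values()) / len(samples))
        for g in usable
    }
    factors = {}
    for name, s in samples.items():
        ratios = sorted(s[g] / geo_mean[g] for g in usable)
        mid = len(ratios) // 2
        factors[name] = (ratios[mid] if len(ratios) % 2
                         else (ratios[mid - 1] + ratios[mid]) / 2)
    return factors

samples = {  # hypothetical counts for two samples of different depth
    "s1": {"geneA": 10, "geneB": 200, "geneC": 30},
    "s2": {"geneA": 100, "geneB": 2000, "geneC": 300},
}
print(rle_size_factors(samples))  # s2's factor is ~10x s1's
```

Counts would then be divided by each sample’s factor before comparison.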

Pereira et al. (2018) mention that their relative log expression (RLE) approach was also found to be optimal (based on their measures) in McMurdie and Holmes (2014) and Dillies et al. (2013).

Pereira M, Wallroth M, Jonsson V, et al. Comparison of normalization methods for the analysis of metagenomic gene abundance data. BMC Genomics. 2018; 19:274. https://doi.org/10.1186/s12864-018-4637-6

McMurdie PJ, Holmes S. Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Comput Biol. 2014; 10(4):e1003531. https://doi.org/10.1371/journal.pcbi.1003531

Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloe D, Le Gall C, Schaeffer B, Le Crom S, Guedj M, Jaffrezic F. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinformatics. 2013; 14(6):671–83. https://doi.org/10.1093/bib/bbs046