We were surprised to see an unexpected loss of data after normalization, despite using the “Keep at least 1 read” option (with “ignore all unassigned reads” checked). The figure below shows that the Plasmodium reichenowi taxon disappears completely, although it is stably present with read counts of 69, 614, and 522.
Another example: in absolute mode we find 141 viral taxa, which is reduced to 28 after normalization (table attached).
Comparison001-510-CSV-norm-abs-bugtest.xlsx (252.0 KB)
In both cases, a contig-based analysis was performed, abundance values were determined by remapping, and the TaxID and read-count values were then loaded into MEGAN with the CSV import module.
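For reference, a minimal sketch of producing the kind of two-column (TaxID, read count) file described above. The TaxIDs and counts are made up, and the exact column layout MEGAN's CSV importer expects should be confirmed in its import dialog:

```python
import csv

# Hypothetical TaxID -> remapped read count table (made-up values).
abundances = {5861: 69, 36329: 614, 5855: 522}

# Write one "TaxID,reads" row per taxon for the CSV import module.
with open("sample_counts.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    for taxid, reads in sorted(abundances.items()):
        writer.writerow([taxid, reads])
```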
Thanks in advance for any help.
I was just about to make the same post, so I’m glad you made it, blaize! I’ve found the same thing with my data in multiple projects.
Here are two examples using the same input files (same LCA parameters), comparing normalized and absolute counts (with “Keep at least 1 read” checked). I assume this happens because the numbers become so small when one FASTA has several orders of magnitude fewer reads that the other files are forced to normalize down too far. A workaround for this scenario would be much appreciated, because sometimes I do want to normalize samples with vastly different performance, and I’d like to avoid the built-in normalization mainly making the good samples look far worse. Could the normalized file contain decimal proportions of the total reads, so that it compares proportions rather than read counts? Just an idea.
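The suspected rounding effect, and the proportion idea, can be sketched outside MEGAN. All counts below are invented for illustration, and this is not MEGAN's actual normalization code, just the naive "scale to the smallest sample and round to whole reads" behavior that would explain the vanishing taxa:

```python
def scale_to_min(samples):
    """Scale each sample's counts to the smallest sample total, rounding to whole reads."""
    min_total = min(sum(c.values()) for c in samples.values())
    out = {}
    for name, counts in samples.items():
        total = sum(counts.values())
        # Small counts can round down to 0 here, so the taxon disappears.
        out[name] = {t: round(c * min_total / total) for t, c in counts.items()}
    return out

def to_proportions(samples):
    """Workaround idea: report each taxon as a fraction of its sample's reads."""
    return {name: {t: c / sum(counts.values()) for t, c in counts.items()}
            for name, counts in samples.items()}

# Hypothetical TaxID -> reads tables: one deep sample, one shallow one.
samples = {
    "deep":    {9315: 69000, 5861: 614, 1773: 522},
    "shallow": {9315: 60, 5861: 1, 1773: 0},
}

print(scale_to_min(samples))    # taxon 1773 drops to 0 reads in "deep"
print(to_proportions(samples))  # every taxon with reads keeps a nonzero value
```

With proportions, no taxon that has at least one read can be lost to integer rounding, which is the point of the suggestion above.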
The same major taxa are present, but many hits to nodes with small absolute counts are dropped in the normalized variant.