Magnitude of reads artificially alters sample composition when comparing datasets

plaffy · April 19, 2017, 7:21am

We are using MEGAN6 Community version 6.7.17 on Linux OS (ubuntu 14.04 and CentOS 6.6 (tested on both systems)).
The data we are investigating is generated from metavirome assemblies, and in order to adjust the representation of a particular gene in a dataset, we have calculated contig coverage and use this information to assign a magnitude value for each contig. This is then presented within our blastable data as a magnitude value (eg magnitude=55) within the descriptor for each query sequence.
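
For readers who want to check their own input, here is a minimal sketch of how the absolute and magnitude-adjusted counts relate. It assumes the descriptor carries a tag of the form `magnitude=55` somewhere after the sequence ID, and that a sequence without a tag counts as 1. The header layout, the regular expression, and the function name `read_magnitudes` are illustrative assumptions, not MEGAN's own parser.

```python
import re

# Assumed descriptor format: ">contig_1 some text magnitude=55".
MAGNITUDE_RE = re.compile(r"\bmagnitude=(\d+(?:\.\d+)?)")


def read_magnitudes(fasta_text):
    """Return {sequence_id: magnitude} for every header line in a FASTA string."""
    magnitudes = {}
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence lines carry no magnitude information
        header = line[1:].strip()
        if not header:
            continue
        seq_id = header.split()[0]
        match = MAGNITUDE_RE.search(header)
        # A sequence without a magnitude tag is counted as a single read.
        magnitudes[seq_id] = float(match.group(1)) if match else 1.0
    return magnitudes


# Hypothetical two-sequence input, for illustration only.
example = """>contig_1 magnitude=55
MKV
>contig_2
MAA
"""
mags = read_magnitudes(example)
absolute_count = len(mags)           # number of query sequences
adjusted_count = sum(mags.values())  # magnitude-adjusted count
```

Summing the magnitudes rather than counting the sequences is what turns a few hundred contigs into several thousand adjusted reads per sample.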

This has worked fine in previous versions (MEGAN5) and continues to work in MEGAN6 but we have encountered an issue with this magnitude adjustment when we try to compare individual datasets, specifically when we are making a normalised comparison.

Once we highlight multiple magnitude adjusted datasets in the compare window and select “use Normalized counts” we end up with a grossly inflated sampling of our datasets.

The example we have provided is of a comparison of three different datasets:
samplename | absolute read count | magnitude adjusted counts
Xesto89 | 383 | 5688
Xesto7 | 346 | 5336
Xesto155 | 315 | 4019

Yet if we try to compare these datasets we get the following counts:
samplename | non normalised comparison | normalised comparison
Xesto89 | 5668 | 59676
Xesto7 | 5336 | 61987
Xesto155 | 4019 | 51279

What we would expect is a comparison in which 4019 (or thereabouts) magnitude-adjusted reads are presented for each dataset. This is not the case.

Our best guess is that, when the comparison is made, MEGAN6 tries to take 4019 reads from each dataset for the “normalised” comparison. However, we have fewer than 400 actual reads per sample, so it is artificially duplicating reads. There also seems to be an additional magnitude adjustment going on, hence the final counts showing a more than 10-fold increase in the number of reads in our datasets.
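
To put numbers on the discrepancy, the sketch below uses only the totals from the comparison table above. It assumes that normalisation is meant to scale every sample to the smallest total (4019), which is our expectation rather than a confirmed description of what MEGAN does internally.

```python
# Totals as reported in the comparison table of this post.
non_normalised = {"Xesto89": 5668, "Xesto7": 5336, "Xesto155": 4019}
normalised = {"Xesto89": 59676, "Xesto7": 61987, "Xesto155": 51279}

# Assumption: normalisation scales every sample down to the smallest total.
target = min(non_normalised.values())
expected_scale = {name: target / total for name, total in non_normalised.items()}
expected_total = {
    name: round(total * expected_scale[name])
    for name, total in non_normalised.items()
}

# How much larger the reported normalised totals are than the input totals.
observed_inflation = {
    name: normalised[name] / non_normalised[name] for name in non_normalised
}

for name in non_normalised:
    print(
        f"{name}: expected ~{expected_total[name]}, got {normalised[name]} "
        f"({observed_inflation[name]:.1f}x the non-normalised total)"
    )
```

Every sample ends up more than ten times larger than its non-normalised total, where a scaling to the smallest sample should only ever shrink or preserve the counts.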

When we compare many samples, this issue becomes particularly problematic.

We have attached a normalized and a non-normalized comparison of our datasets using magnitude-adjusted samples, and have also attached the three associated RMA6 files for the three datasets. A single example of the associated blast file and fasta file is attached as well for reference.
These were archived in a tar.gz file due to upload limitations.

files ending in blast.gz are zipped blast files
files ending in .faa are the unzipped fasta files (with magnitude values in descriptors)
files ending in rma6 are the corresponding megan files

example.normalized.magadjusted.megan is the resulting normalized comparison of these three datasets
example.nonnormalized.magadjusted.megan is the resulting non-normalized comparison of these three datasets

If you need any more info, or if we are doing something wrong, please let me know.
thanks
Patrick
meganmagnitudeissue.tar.gz (2.4 MB)

Daniel · April 19, 2017, 10:15am

Dear Patrick,

thank you very much for the detailed bug report. While working on MEGAN’s new “long read” analysis mode, a modification to how MEGAN interprets “read weights” broke the comparison computation.
I have tracked down the problem and have built a new release; please test 6.7.18.

Please note that read magnitudes are now displayed only when you select
Options-> Use Read Weights for Assignments (which was only selected on two of the three files that you made available).

Also, please note that MEGAN’s new long read mode (still under development, but soon to be completed) is appropriate for contigs. If you consider it for analysis, then at present the weight used for a read (or contig) is the product of the magnitude of the read (or 1, if none is defined) and the length of the read. I’m not quite sure whether that is what is wanted, so I might make the multiplication by length optional.
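
As a rough illustration of that weighting rule, here is a short sketch. The function name `read_weight` and the `multiply_by_length` switch are hypothetical, standing in for the option I am considering; this is not MEGAN's actual code, and the example values are made up.

```python
def read_weight(length, magnitude=None, multiply_by_length=True):
    """Weight of a read or contig in long read mode, as described above.

    magnitude defaults to 1 when the read defines none. The
    multiply_by_length switch is a hypothetical option.
    """
    base = 1 if magnitude is None else magnitude
    return base * length if multiply_by_length else base


# Hypothetical contig: 1200 positions long, magnitude=55 in its descriptor.
with_length = read_weight(1200, magnitude=55)
without_length = read_weight(1200, magnitude=55, multiply_by_length=False)
no_magnitude = read_weight(1200)
```

With the multiplication switched on, long contigs dominate the totals even when their magnitudes are modest, which is why it may be better to make it optional.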

Finally, when computing a comparison for files that have weighted reads, the actual number of reads is currently lost; that is, in the comparison file, the number of reads equals the sum of weights, not the sum of the numbers of reads. I will look into fixing this soon.

D

plaffy · April 20, 2017, 4:57am

Hi Daniel,
thanks for the prompt response. We’ve tried the update on the Ubuntu install and it all now works as it did in the previous versions. If we encounter any issues regarding this I will let you know.
thanks again,
Patrick

plaffy · May 12, 2017, 9:49am

Hi Daniel,
there seems to be another issue with the read magnitude parameters in the latest release of MEGAN6 (6.7.20).
I was using 6.7.18 earlier today and had to restart the software. I am prompted to install updates every time I restart the software, and after restarting into 6.7.20 the parameter “use read weights for assignments” is missing from the Options menu and I can’t locate it.

I’ve checked the manual for direction, but I can’t find instructions for implementing magnitude adjustment with the new changes.
Any help and direction you can offer to get this working would be great.

If you need any new test data to troubleshoot this problem I can oblige, but the data I posted in my initial report is in exactly the same format.

thanks
Patrick

Daniel · May 12, 2017, 10:43am

Dear Patrick,

I have been working on reorganizing the code for the features that scale nodes in the viewer by different quantities. By default, “assigned” and “summarized” values refer to read counts. However, one can now also choose to scale by total length of all reads assigned, total base pairs aligned for all reads assigned, or by read magnitudes.

To support these different modes, I had to move the features for controlling the “readAssignmentMode” into the
"Change LCA Parameters dialog" in the Options menu. In this dialog you can switch between the different modes.
(However, this triggers a complete reclassification of all reads and this may take a while, depending on the size of the dataset).

This change became necessary because I wanted to ensure that there is some explicit control over what is being compared when people compute a comparison of multiple samples.

Also, when a file has been produced or updated with a new version of MEGAN, the status bar will now indicate which read assignment mode was used to calculate the assignment values (the axes in charts should reflect this, too).

Sorry for any inconvenience caused.
D