Sam2rma error when using long contig nucleotide alignments

dportik · February 12, 2021, 7:26pm

Hi Daniel,
I recently mapped a few thousand long contigs from an assembly to NCBI nuc using minimap2. I then tried to convert the resulting SAM file to RMA, and encountered the following error:

SAM2RMA6 - Computes a MEGAN RMA (.rma) file from a SAM (.sam) file that was created by DIAMOND or MALT
Options:
Input
--in: 5-merged/tick-contigs-rescreen.merged.sam
--reads: 1-fasta-sort/tick-contigs-rescreen.sorted.fasta
Output
--out: 6-rma/tick-contigs-rescreen.nucleotide.readCount.rma
--useCompression: true
Reads
--paired: false
--pairedSuffixLength: 0
Parameters
--longReads: true
--maxMatchesPerRead: 100
--classify: true
--minScore: 50.0
--maxExpected: 0.01
--topPercent: 10.0
--minSupportPercent: 0.05
--minSupport: 0
--minPercentReadCover: 0.0
--minPercentReferenceCover: 0.0
--lcaAlgorithm: longReads
--lcaCoveragePercent: 100.0
--readAssignmentMode: readCount
Classification support:
--mapDB: /home/dportik/programs/megan/db/megan-nucl-map-Jul2020.db
Deprecated classification support:
--parseTaxonNames: true
--firstWordIsAccession: true
--accessionTags: gb| ref|
Other:
--threads: 24
--verbose: true
Version MEGAN Community Edition (version 6.19.4, built 16 Jul 2020)
Author(s) Daniel H. Huson
Copyright (C) 2020 Daniel H. Huson. This program comes with ABSOLUTELY NO WARRANTY.
Loading ncbi.map: 2,259,889
Loading ncbi.tre: 2,259,893
Current SAM file: 5-merged/tick-contigs-rescreen.merged.sam
Reads file: 1-fasta-sort/tick-contigs-rescreen.sorted.fasta
Output file: 6-rma/tick-contigs-rescreen.nucleotide.readCount.rma
Classifications: Taxonomy
Generating RMA6 file Parsing matches
Annotating RMA6 file using FAST mode (accession database and first accession per line)
Parsing file tick-contigs-rescreen.merged.sam
Parsing file: 5-merged/tick-contigs-rescreen.merged.sam
Input domination filter: MinPercentCoverToStronglyDominate=90.0 and TopPercentScoreToStronglyDominate=90.0
10% 20% 30% 40% 50% 60% Caught:
java.lang.NegativeArraySizeException: -1725067332
at megan/megan.parsers.blast.PostProcessMatches.apply(PostProcessMatches.java:117)
at megan/megan.parsers.blast.SAM2SAMIterator.next(SAM2SAMIterator.java:135)
at megan/megan.rma6.RMA6FromBlastCreator.parseFiles(RMA6FromBlastCreator.java:193)
at megan/megan.tools.SAM2RMA6.createRMA6FileFromSAM(SAM2RMA6.java:339)
at megan/megan.tools.SAM2RMA6.run(SAM2RMA6.java:303)
at megan/megan.tools.SAM2RMA6.main(SAM2RMA6.java:69)

Do you think this is due to using very large contigs? I do not need the alignments, so I am considering replacing the CIGAR, SEQ, and QUAL fields with *. Would this be a quick fix?

Thanks,
Dan

Daniel · February 18, 2021, 2:19pm

This is because MEGAN writes a read and all of its associated matches into a single byte array. I use a doubling strategy when growing the array, which I will fix so that the program can allocate an array of the maximum size (but anything larger than that will require a redesign…)
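
To illustrate the failure mode described above, here is a minimal Python sketch that emulates Java's signed 32-bit `int` arithmetic. It is not MEGAN's code: the function names `to_int32`, `doubled_capacity`, and `capped_capacity` are hypothetical, and the starting size of 1,284,949,982 bytes is only an inference, chosen because doubling it wraps around to exactly the -1725067332 reported in the exception.

```python
# Largest value a Java int can hold; Java array lengths are ints.
# (Assumption: some JVMs cap array length slightly below this value.)
INT_MAX = 2**31 - 1


def to_int32(value):
    """Wrap a Python integer into Java's signed 32-bit int range."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def doubled_capacity(current):
    """Naive growth: double the size, with 32-bit overflow as in Java."""
    return to_int32(current * 2)


def capped_capacity(current, needed):
    """Growth that doubles when possible but never exceeds INT_MAX."""
    if needed > INT_MAX:
        # Larger than any single array can be; this case needs a redesign.
        raise OverflowError("required size exceeds the maximum array size")
    return max(needed, min(current * 2, INT_MAX))


if __name__ == "__main__":
    size = 1_284_949_982  # hypothetical buffer size, inferred from the log
    print(doubled_capacity(size))                # negative after wraparound
    print(capped_capacity(size, 1_300_000_000))  # clamped to INT_MAX
```

With capped growth, a read whose matches need between roughly 1.28 GB and 2 GB would still fit; anything beyond the maximum array size raises an error instead of producing a negative length.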

Daniel · February 19, 2021, 4:18pm

Replacing CIGARs etc. with "*" should work: the code should notice that the information needed to compute the alignments is missing and will not try to report them (thus using fewer bytes per query).
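
The workaround discussed above could be sketched in Python as follows. This is an illustration rather than a tested MEGAN recipe: the function names and the script interface are hypothetical. It relies only on the standard SAM column layout, in which CIGAR, SEQ, and QUAL are the 6th, 10th, and 11th tab-separated fields, and it leaves header lines and optional tags (such as the alignment score) untouched.

```python
import sys


def strip_alignment_fields(line):
    """Return a SAM line with CIGAR, SEQ, and QUAL replaced by '*'."""
    if line.startswith("@"):
        return line  # header lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return line  # not a complete alignment record; leave as-is
    fields[5] = "*"   # CIGAR
    fields[9] = "*"   # SEQ
    fields[10] = "*"  # QUAL
    return "\t".join(fields) + "\n"


def strip_sam_file(in_path, out_path):
    """Stream a SAM file line by line, writing the stripped records."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(strip_alignment_fields(line))


# Hypothetical usage: python strip_sam.py input.sam output.sam
if __name__ == "__main__" and len(sys.argv) == 3:
    strip_sam_file(sys.argv[1], sys.argv[2])
```

Because the file is processed one line at a time, memory use stays small even for SAM files with very long contigs. Whether the stripped file converts cleanly would still need to be checked by rerunning sam2rma on it.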