I tried to index a collection of NCBI RefSeq reference genomes with malt-build v0.5.0, but after the FASTA file was loaded, I received the following error:
Version MALT (version 0.5.0, built 5 Aug 2020)
Author(s) Daniel H. Huson
Copyright (C) 2020 Daniel H. Huson. This program comes with ABSOLUTELY NO WARRANTY.
Classifications to use: Taxonomy
Reference sequence type set to: DNA
Seed shape(s): 111110111011110110111111
Deleting index files: 3
Number input files: 1
Loading FastA files:
100% (6387.4s)
NegativeArraySizeException: -1734967296
Looks like you might have broken MALT… a negative array size exception might mean that the size of the database exceeds what MALT can currently handle…
Could you please rerun with option -v (verbose).
This will hopefully generate a stack trace that would be helpful for me to take a look…
gwdu103:6 05:59:45 /scratch2/jfellow/resources/malt/db/custom/raw/plant > cat slurm-6262828.err
MaltBuild - Builds an index for MALT (MEGAN alignment tool)
Options:
Input:
--input: /scratch2/users/jfellow/resources/malt/db/custom/raw/plant/plant_2021Jan06.fna.gz
--sequenceType: DNA
Output:
--index: /scratch2/users/jfellow/resources/malt/db/custom/raw/plant/
Performance:
--threads: 40
--step: 8
Seed:
--shapes: default
--maxHitsPerSeed: 1000
Classification support:
--mapDB: /scratch2/jfellow/resources/malt/db/megan-map/megan-nucl-map-Jul2020.db
Deprecated classification support:
--parseTaxonNames: true
--noFun: false
Other:
--firstWordIsAccession: true
--accessionTags: gb| ref|
--firstWordOnly: false
--random: 666
--hashScaleFactor: 0.9
--buildTableInMemory: true
--extraStrict: false
--verbose: true
Version MALT (version 0.5.0, built 5 Aug 2020)
Author(s) Daniel H. Huson
Copyright (C) 2020 Daniel H. Huson. This program comes with ABSOLUTELY NO WARRANTY.
Classifications to use: Taxonomy
Reference sequence type set to: DNA
Seed shape(s): 111110111011110110111111
Deleting index files: 3
Number input files: 1
Loading FastA files:
100% (6392.6s)
Caught:
java.lang.NegativeArraySizeException: -1734967296
at malt/malt.io.FastAFileIteratorBytes.growBuffer(FastAFileIteratorBytes.java:194)
at malt/malt.io.FastAFileIteratorBytes.hasNext(FastAFileIteratorBytes.java:131)
at malt/malt.data.ReferencesDBBuilder.loadFastAFile(ReferencesDBBuilder.java:183)
at malt/malt.data.ReferencesDBBuilder.loadFastAFiles(ReferencesDBBuilder.java:167)
at malt/malt.MaltBuild.run(MaltBuild.java:259)
at malt/malt.MaltBuild.main(MaltBuild.java:71)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at com.install4j.runtime/com.exe4j.runtime.LauncherEngine.launch(LauncherEngine.java:84)
at com.install4j.runtime/com.install4j.runtime.launcher.UnixLauncher.start(UnixLauncher.java:66)
at install4j.malt.MaltBuild.main(Unknown Source)
The error was caused when MALT attempted to read a single FastA sequence that was extremely long.
The error occurred when the FastA parser attempted to double the size of the array that it uses to hold the current sequence…
(There was a programming error here: when the requested length is extremely big, the program is supposed to cap the requested length rather than request a negative length. I have fixed that bug.)
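To make the arithmetic concrete: a Java `int` is 32-bit signed, so doubling a buffer that already holds about 1.28 billion bytes wraps around to exactly the negative value in the stack trace. A quick simulation in Python (the wrap helper mimics Java's overflow behaviour; the cap mirrors the fix as described, not MALT's actual source):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def java_int(x):
    """Wrap an arbitrary integer to Java's 32-bit signed `int` semantics."""
    return (x - INT32_MIN) % 2**32 + INT32_MIN

old_len = 1_280_000_000             # buffer already holds ~1.28 GB of sequence
print(java_int(old_len * 2))        # -1734967296, the value in the exception

# The fix as described above: cap the growth instead of letting it wrap.
new_len = min(old_len * 2, INT32_MAX - 8)
print(new_len)                      # 2147483639, the new per-record limit
```

Note that `Integer.MAX_VALUE - 8` is exactly the 2,147,483,639-letter limit quoted below.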
I will upload a new release with the bug fix.
With the bug fix, as long as no individual FastA record exceeds a length of 2,147,483,639, the program should be OK.
Do you really have individual FastA records that are on the order of a gigabase in length? Or is there some other problem causing MALT to try to read your whole input file as one FastA record?
If all individual FastA records are much smaller than 1 Gb, then could you send me the first 1000 records, say, of your file? Then I can run it and verify that MALT is processing each record separately.
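If it helps, something along these lines would pull the first 1000 records out of a gzipped FASTA file (a plain stdlib sketch; filenames would be whatever your input is called):

```python
import gzip

def head_fasta(src, dst, n_records=1000):
    """Copy the first n_records FASTA entries from one gzipped file to another."""
    seen = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in fin:
            if line.startswith(">"):
                seen += 1
                if seen > n_records:    # stop at the header of record n+1
                    break
            fout.write(line)
```

e.g. `head_fasta("plant_2021Jan06.fna.gz", "plant_first1000.fna.gz")`.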
These are a lot of plant assemblies from NCBI RefSeq, so it could be. I will first try to identify whether there is a problematic genome that is very large; otherwise I will send you a subset of the files as requested (might take a couple of days due to lockdown + child).
I am currently uploading a new release 0.5.1 that allows individual records to be up to 2,147,483,639 letters long (but do you really have individual sequences that are that long?)
Awesome thank you very much. I will update the version of MALT on bioconda as well and test it for you.
In the meantime I will also investigate whether any FASTA entry is longer than that. I think this is unlikely, but I don’t actually know.
By individual record, I’m assuming you mean an entry within a single multi-entry FASTA file (e.g. is one chromosome longer than 2,147,483,639 bases)? If you mean the single input FASTA file as a whole exceeding that number, that could very likely be the case (a lot of the genomes are just draft assemblies with many unplaced scaffolds, which I may remove in the future).
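A rough sketch of the check I have in mind, scanning the gzipped FASTA for its longest single entry (nothing MALT-specific, just stdlib Python):

```python
import gzip

def longest_record(path):
    """Return (header, length) of the longest entry in a gzipped FASTA file."""
    best_header, best_len = None, 0
    cur_header, cur_len = None, 0
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith(">"):
                if cur_len > best_len:
                    best_header, best_len = cur_header, cur_len
                cur_header, cur_len = line[1:].strip().split()[0], 0
            else:
                cur_len += len(line.strip())
    if cur_len > best_len:              # don't forget the final record
        best_header, best_len = cur_header, cur_len
    return best_header, best_len
```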
The longest record, at 2,093,616,421 bases, doesn’t reach your specified threshold above, but it’s pretty close. I have submitted a new malt-build run with version 0.5.1 and will let you know the outcome (might take a couple of days as I’m in a low-priority queue due to job size).
For your reference I also attach the file that I used to download the genomes going into the database I was trying to build: sparse_plant_priority_subsampled.tsv (195.1 KB)
One further question: I may need to run on smaller cluster nodes in the future, and I saw there are different memoryModes:
--memoryMode: Load all indices into memory, load indices page by page when needed, or use memory mapping (load, page or map).
I thought maybe page would be ideal, as I guess it chunks up the database and aligns reads against each chunk (presumably meaning a smaller running memory footprint).
I’ve tried this on the same database as above, but the run appears to have hung on aligning (the last log message is “Starting file” and then nothing else). Edit: or rather, it is running extremely slowly. I see there is an RMA6 file, but it is only 47 kB, and there is a 10-byte SAM file (in comparison to ~100 MB files with --memoryMode load).
Do you have any more information on each mode, and why the malt-run job may have hung?
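For context, my rough mental model of load versus map, sketched in Python rather than MALT's actual Java (this illustrates the general OS mechanism only, and the stand-in file is fabricated for the demo): load reads the whole index into RAM up front, while map lets the operating system page index regions in on demand, trading resident memory for potentially much slower random access when the index doesn't fit in RAM.

```python
import mmap, tempfile

# Create a small stand-in "index" file (illustrative only).
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"\x2a" * 8192)
tmp.close()

# "load": read the entire file into RAM up front;
# every subsequent lookup is a fast in-memory access.
with open(tmp.name, "rb") as fh:
    data = fh.read()

# "map": map the file into the address space; the OS pages in only the
# regions that lookups actually touch, so resident memory stays small,
# but a lookup on a cold page costs a disk read.
with open(tmp.name, "rb") as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    first = mm[0]   # touching a byte faults in just that page
    mm.close()

assert data[0] == first == 0x2a
```

If that model is right, scattered seed lookups across a large index would make map (and presumably page) very slow on a node without enough RAM, which might look exactly like a hang.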