I tried to index a collection of NCBI RefSeq reference genomes with malt-build v0.5.0, but after the FASTA file was loaded, I received the following error:
Version MALT (version 0.5.0, built 5 Aug 2020)
Author(s) Daniel H. Huson
Copyright (C) 2020 Daniel H. Huson. This program comes with ABSOLUTELY NO WARRANTY.
Classifications to use: Taxonomy
Reference sequence type set to: DNA
Seed shape(s): 111110111011110110111111
Deleting index files: 3
Number input files: 1
Loading FastA files:
100% (6387.4s)
NegativeArraySizeException: -1734967296
Looks like you might have broken MALT… a negative array size exception might mean that the size of the database exceeds what MALT can currently handle…
Could you please rerun with option -v (verbose).
This will hopefully generate a stack trace that would be helpful for me to take a look…
gwdu103:6 05:59:45 /scratch2/jfellow/resources/malt/db/custom/raw/plant > cat slurm-6262828.err
MaltBuild - Builds an index for MALT (MEGAN alignment tool)
Options:
Input:
--input: /scratch2/users/jfellow/resources/malt/db/custom/raw/plant/plant_2021Jan06.fna.gz
--sequenceType: DNA
Output:
--index: /scratch2/users/jfellow/resources/malt/db/custom/raw/plant/
Performance:
--threads: 40
--step: 8
Seed:
--shapes: default
--maxHitsPerSeed: 1000
Classification support:
--mapDB: /scratch2/jfellow/resources/malt/db/megan-map/megan-nucl-map-Jul2020.db
Deprecated classification support:
--parseTaxonNames: true
--noFun: false
Other:
--firstWordIsAccession: true
--accessionTags: gb| ref|
--firstWordOnly: false
--random: 666
--hashScaleFactor: 0.9
--buildTableInMemory: true
--extraStrict: false
--verbose: true
Version MALT (version 0.5.0, built 5 Aug 2020)
Author(s) Daniel H. Huson
Copyright (C) 2020 Daniel H. Huson. This program comes with ABSOLUTELY NO WARRANTY.
Classifications to use: Taxonomy
Reference sequence type set to: DNA
Seed shape(s): 111110111011110110111111
Deleting index files: 3
Number input files: 1
Loading FastA files:
100% (6392.6s)
Caught:
java.lang.NegativeArraySizeException: -1734967296
at malt/malt.io.FastAFileIteratorBytes.growBuffer(FastAFileIteratorBytes.java:194)
at malt/malt.io.FastAFileIteratorBytes.hasNext(FastAFileIteratorBytes.java:131)
at malt/malt.data.ReferencesDBBuilder.loadFastAFile(ReferencesDBBuilder.java:183)
at malt/malt.data.ReferencesDBBuilder.loadFastAFiles(ReferencesDBBuilder.java:167)
at malt/malt.MaltBuild.run(MaltBuild.java:259)
at malt/malt.MaltBuild.main(MaltBuild.java:71)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at com.install4j.runtime/com.exe4j.runtime.LauncherEngine.launch(LauncherEngine.java:84)
at com.install4j.runtime/com.install4j.runtime.launcher.UnixLauncher.start(UnixLauncher.java:66)
at install4j.malt.MaltBuild.main(Unknown Source)
The error was caused when MALT attempted to read a single FastA sequence that was extremely long.
The error occurred when the FastA parser attempted to double the size of the array that it uses to hold the current sequence…
(There was a programming error here: when the requested length is extremely big, the program is supposed to cap the requested length rather than request a negative length. I have fixed that bug.)
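To make the arithmetic concrete: a Java `int` is 32-bit signed, so doubling a buffer that already holds about 1.28 billion bytes wraps around to exactly the negative value in the stack trace. A quick simulation in Python (the wrap helper mimics Java's overflow behaviour; the cap mirrors the fix as described, not MALT's actual source):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def java_int(x):
    """Wrap an arbitrary integer to Java's 32-bit signed `int` semantics."""
    return (x - INT32_MIN) % 2**32 + INT32_MIN

old_len = 1_280_000_000             # buffer already holds ~1.28 GB of sequence
print(java_int(old_len * 2))        # -1734967296, the value in the exception

# The fix as described above: cap the growth instead of letting it wrap.
new_len = min(old_len * 2, INT32_MAX - 8)
print(new_len)                      # 2147483639, the new per-record limit
```

Note that `Integer.MAX_VALUE - 8` is exactly the 2,147,483,639-letter limit quoted below.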
I will upload a new release with the bug fix.
With the bug fix, as long as no individual FastA record exceeds a length of 2,147,483,639, the program should be OK.
Do you really have individual FastA records that are on the order of a gigabase in length? Or is there some other problem causing MALT to try to read your whole input file as one FastA record?
If all individual FastA records are much smaller than 1 Gb, then could you send me the first 1000 records, say, of your file? Then I can run it and verify that MALT is processing each record separately.
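If it helps, something along these lines would pull the first 1000 records out of a gzipped FASTA file (a plain stdlib sketch; filenames would be whatever your input is called):

```python
import gzip

def head_fasta(src, dst, n_records=1000):
    """Copy the first n_records FASTA entries from one gzipped file to another."""
    seen = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in fin:
            if line.startswith(">"):
                seen += 1
                if seen > n_records:    # stop at the header of record n+1
                    break
            fout.write(line)
```

e.g. `head_fasta("plant_2021Jan06.fna.gz", "plant_first1000.fna.gz")`.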
These are a lot of plant assemblies from NCBI RefSeq, so it could be. I will first try to identify whether there is a problematic genome that is very large; otherwise I will send you a subset of the files as requested (might take a couple of days due to lockdown + child).
I am currently uploading a new release 0.5.1 that allows individual records to be up to 2,147,483,639 letters long (but do you really have individual sequences that are that long?)
Awesome thank you very much. I will update the version of MALT on bioconda as well and test it for you.
In the meantime I will also investigate whether any FASTA entry is longer than that. I think this is unlikely, but I don’t actually know.
By individual record, I’m assuming you mean an entry within a single multi-entry FASTA file (e.g. is one chromosome longer than 2,147,483,639 bases)? If you mean the single input FASTA file as a whole exceeding that number, that could very likely be the case (a lot of the genomes are just draft assemblies with many unplaced scaffolds, which I may remove in the future).
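A rough sketch of the check I have in mind, scanning the gzipped FASTA for its longest single entry (nothing MALT-specific, just stdlib Python):

```python
import gzip

def longest_record(path):
    """Return (header, length) of the longest entry in a gzipped FASTA file."""
    best_header, best_len = None, 0
    cur_header, cur_len = None, 0
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith(">"):
                if cur_len > best_len:
                    best_header, best_len = cur_header, cur_len
                cur_header, cur_len = line[1:].strip().split()[0], 0
            else:
                cur_len += len(line.strip())
    if cur_len > best_len:              # don't forget the final record
        best_header, best_len = cur_header, cur_len
    return best_header, best_len
```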
The longest record, at 2,093,616,421 bases, doesn’t reach your specified threshold above, but it’s pretty close. I have submitted a new malt-build run with version 0.5.1 and will let you know the outcome (might take a couple of days as I’m in a low-priority queue due to job size).
For your reference I also attach the file that I used to download the genomes going into the database I was trying to build: sparse_plant_priority_subsampled.tsv (195.1 KB)
One further question: I may need to run on smaller cluster nodes in the future, and I saw there are different memoryModes:
--memoryMode: Load all indices into memory, load indices page by page when needed, or use memory mapping (load, page or map).
I thought maybe page would be ideal, as I guess it chunks up the database and aligns reads against each chunk (presumably meaning a smaller running memory footprint).
I’ve tried this on the same database as above, but the run appears to have hung on aligning (the last log message is “Starting file” and then nothing else). Edit: or rather, it is running extremely slowly. I see there is an RMA6 file, but it is only 47 kB, and there is a 10-byte SAM file (in comparison to ~100 MB files with --memoryMode load).
Do you have any more information on each mode, and why the malt-run job may have hung?
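For context, my rough mental model of load versus map, sketched in Python rather than MALT's actual Java (this illustrates the general OS mechanism only, and the stand-in file is fabricated for the demo): load reads the whole index into RAM up front, while map lets the operating system page index regions in on demand, trading resident memory for potentially much slower random access when the index doesn't fit in RAM.

```python
import mmap, tempfile

# Create a small stand-in "index" file (illustrative only).
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"\x2a" * 8192)
tmp.close()

# "load": read the entire file into RAM up front;
# every subsequent lookup is a fast in-memory access.
with open(tmp.name, "rb") as fh:
    data = fh.read()

# "map": map the file into the address space; the OS pages in only the
# regions that lookups actually touch, so resident memory stays small,
# but a lookup on a cold page costs a disk read.
with open(tmp.name, "rb") as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    first = mm[0]   # touching a byte faults in just that page
    mm.close()

assert data[0] == first == 0x2a
```

If that model is right, scattered seed lookups across a large index would make map (and presumably page) very slow on a node without enough RAM, which might look exactly like a hang.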