Building the database in MALT: how to make sure all the genomes are retained?

Vasilina · January 11, 2023, 8:15am

Hi all,

I have an issue with malt-build regarding the complexity of the resulting database. As input for the database I include CDS from different genomes, and in the end all my plant genomes seem to collapse into the wheat genome. I tested this with a FASTA file randomly sampled from another plant species (one that is part of the db input), and no matches were found. Can I control this somehow with --maxHitsPerSeed or any other parameter? Thanks!
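For reference, the kind of test sample described above could be drawn roughly as follows. This is only a minimal sketch in plain Python, not necessarily the exact procedure used here: the helper names `read_fasta` and `sample_records`, the toy input, and the seed value are all made up for illustration.

```python
import io
import random

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def sample_records(handle, k, seed=0):
    """Reservoir-sample k records so the whole file never sits in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(read_fasta(handle)):
        if i < k:
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir

# Toy input standing in for one of the reference FASTA files (hypothetical data).
demo = io.StringIO(">seq1 speciesA\nACGT\nACGT\n>seq2 speciesA\nGGGG\n>seq3 speciesA\nTTTT\n")
for header, seq in sample_records(demo, k=2, seed=666):
    print(f">{header}\n{seq}")
```

Because the sampled sequences are taken verbatim from a reference file, they would be expected to match the index if that reference is still represented in it.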

Here is the log file of the run; as far as I can tell it looks fine:

MaltBuild - Builds an index for MALT (MEGAN alignment tool)
Options:
Input:
	--input: .........
	--sequenceType: DNA
Output:
	--index: ...
Performance:
	--threads: 10
	--step: 1
Seed:
	--shapes: default
	--maxHitsPerSeed: 1000
Classification support:
	--mapDB: .../megan-nucl-Feb2022.db
Deprecated classification support:
	--parseTaxonNames: true
	--noFun: false
Other:
	--firstWordIsAccession: true
	--accessionTags: gb| ref|
	--firstWordOnly: false
	--random: 666
	--hashScaleFactor: 0.9
	--buildTableInMemory: true
	--extraStrict: false
	--verbose: true
Version   MALT (version 0.5.3, built 4 Aug 2021)
Author(s) Daniel H. Huson
Copyright (C) 2021 Daniel H. Huson. This program comes with ABSOLUTELY NO WARRANTY.
Classifications to use: Taxonomy
Reference sequence type set to: DNA
Seed shape(s): 111110111011110110111111
Deleting index files: 0
Number input files:          517
Loading FastA files:
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (1466.6s)
Number of sequences: 186,748,193
Number of letters:101,888,004,433
BUILDING table (0)...
Seeds found: 101,887,763,290
tableSize=    2,147,483,639
hashMask.length=31
maxHitsPerHash set to: 1000
Initializing arrays...
100% (0.0s)
Analysing seeds...
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (3808.2s)
Number of low-complexity seeds skipped: 755,530,672
Allocating hash table...
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (464.8s)
Total keys used:     2,145,121,347
Total seeds matched:91,421,691,322
Total seeds dropped: 2,364,654,292
Opening file: table0.db
Allocating: 689.1 GB
Filling hash table...
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (11780.6s)
Randomizing rows...
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (605.8s)
Writing file: table0.idx
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (97.1s)
Writing file: table0.db
Size: 689.1 GB
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (6264.4s)
Writing file: index0.idx
100% (0.1s)
Loading ncbi.map: 2,302,807
Loading ncbi.tre: 2,302,811
Building mappings...
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (474948.8s)
Writing file: taxonomy.idx
10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (2.8s)
Writing file: ref.db
Writing file: ref.idx
100% (285.9s)
Total time:  500,362s
Peak memory: 900.5 of 1900G
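
To put the seed numbers from the log in proportion, here is a small sketch that parses the counts and expresses them as fractions of the seeds found. It assumes that "Total seeds dropped" is the count affected by the hit cap (--maxHitsPerSeed / maxHitsPerHash), which the log does not state explicitly; `parse_counts` is just a helper name made up for this example.

```python
import re

# Lines copied from the log above.
LOG_EXCERPT = """\
Seeds found: 101,887,763,290
tableSize=    2,147,483,639
Number of low-complexity seeds skipped: 755,530,672
Total keys used:     2,145,121,347
Total seeds matched:91,421,691,322
Total seeds dropped: 2,364,654,292
"""

def parse_counts(log_text):
    """Collect 'label: 1,234' and 'label= 1,234' lines into a dict of ints."""
    counts = {}
    for line in log_text.splitlines():
        m = re.match(r"^\s*(.+?)\s*[:=]\s*([\d,]+)\s*$", line)
        if m:
            counts[m.group(1)] = int(m.group(2).replace(",", ""))
    return counts

counts = parse_counts(LOG_EXCERPT)
found = counts["Seeds found"]

for label in ("Number of low-complexity seeds skipped",
              "Total seeds matched",
              "Total seeds dropped"):
    print(f"{label}: {counts[label] / found:.2%} of seeds found")

# How full the hash table is (keys used relative to table size).
print(f"Hash table fill: {counts['Total keys used'] / counts['tableSize']:.2%}")
```

These ratios only show how many seeds were dropped overall. Whether the dropped seeds are concentrated in particular genomes cannot be read from the log.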