New to MALT: index building files?

Kalani · May 6, 2021, 3:36pm

Hi everyone!

I’m new to MALT (using it for HOPS) and I’m not sure what input files I need to build an indexed database. I have one question about the reference file zip type and one question about the mapping files.

Reference file zip type
I’m on an HPC that already has the nt and nr databases on it for using blast+. Can I use this as my reference directory? The MALT manual says

The files must be in FastA format and may be gzipped (in which case they must end on .gz.)

but these files end in .tar.gz.MD5 not .tar.gz. Will that be a problem?

I want to map reference sequences to taxonomic identifiers. What is the easiest way to do this for the nt database? -a2taxonomy seemed like the easiest way to me, but I don’t really understand what these mapping files are. I’ve found some, but how do I know if they match up with my version of the nt database? I’ve found megan-nucl-Jan2021.db.zip at
https://software-ab.informatik.uni-tuebingen.de/download/megan6/welcome.html
as well as nucl_gb.accession2taxid.gz at
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/
However all the examples of malt-build I’ve seen have a .bin extension for the mapping file, so I’m not sure if these would work. I also noticed that the MALT manual says

The options -tre and -map are used to access the NCBI taxonomy, which is needed to perform a taxonomic analysis of the reads as they are aligned.

in reference to the -tre $MALT/data/ncbi.tre.gz -map $MALT/data/ncbi.tre.gz options in the example. What are these files and where can I get them?

Sorry for having such vague questions, I realize I may be way off target with a lot of this stuff. Thank you in advance if you are able to get me on the right track!

IamIamI · May 17, 2021, 7:33am

Heya Kalani,

I’m also a somewhat new user of MALT, but I have been through some of these steps.

Yes, you can use any type of multi-FASTA file. The extension, however, seems to need to be .gz… this is just how the pipeline is set up; you can’t expect it to unpack every file format. But you can just untar the archive and then gzip it back up if needed.
Also a warning: the full nt database is massive, and it will take a lot of memory on your HPC to crunch through. So make sure you actually need the full nt, and not just a specific branch of it. For example, if you are screening for the animal kingdom, it might be easier and faster to remove all fungi, bacteria, etc. This will speed up the build and decrease RAM usage.
The supplied mapping file works; the .bin appears to just be there to compress an otherwise large file. You can also convert the current NCBI accession2taxid file using the description Daniel gave in the comments of the following post: Updating mapping files
If you supply your reference multi-FASTA and the a2t file (either the Jan2021 one or your own) and run malt-build, it will build the -tre and -map files itself, call them taxonomy.tre and taxonomy.map, and store them in the index folder. You don’t need to specify these files when running MALT; just point to the index folder that is created by malt-build. If for some reason you do need to build your own map and tre files, I explained how I did that in this post: Question regarding custom ncbi.map / ncbi.tre
You won’t need this for MALT, but maybe for other software you want to use downstream.
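
For the "untar and gzip it back up" step, here is a rough Python sketch. It assumes the tar archive really contains plain FASTA members (a .md5 file is only a checksum, and an archive that holds pre-formatted BLAST index files instead of FASTA would not work this way). The function name, the suffix list and the output layout are my own choices for illustration, not anything MALT prescribes.

```python
import gzip
import shutil
import tarfile
from pathlib import Path

# Assumed set of FASTA file endings; extend it to match your archive.
FASTA_SUFFIXES = (".fa", ".fasta", ".fna", ".faa")


def regzip_fasta_members(tar_path, out_dir):
    """Write every FASTA member of a tar archive as its own .gz file.

    Returns the list of files written, so they can be passed on as
    reference input.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Mode "r:*" lets tarfile detect the compression (gzip, bzip2, none).
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive:
            is_fasta = member.name.lower().endswith(FASTA_SUFFIXES)
            if not member.isfile() or not is_fasta:
                continue
            # Drop directories from the member name and append .gz.
            target = out_dir / (Path(member.name).name + ".gz")
            source = archive.extractfile(member)
            # Stream the member into the gzip file without loading it whole.
            with source, gzip.open(target, "wb") as gz_out:
                shutil.copyfileobj(source, gz_out)
            written.append(target)
    return written
```

Calling `regzip_fasta_members("refs.tar.gz", "reference_gz")` (both names hypothetical) would leave one `.gz` file per FASTA member in `reference_gz`, which is the naming the manual asks for.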

So, to recap: yes, FASTA is fine! The Jan2021 file is good to go, but making your own is as easy as writing a one-liner. And the tre and map files are auto-generated by malt-build.

As for matching versions… of course, if you get an organism from NCBI that was deposited in Feb 2021, then the Jan2021 index isn’t going to see it. But since you want to use a set that is already on your HPC, it’s likely that it’s already a year old anyway, and then the Jan2021 file should work perfectly.
It just matches the accession numbers of your multi-FASTA to species-level taxonomy nodes from NCBI… so if you have subspecies, they will be ignored and just grouped at the species level. And if you have something new or something that got removed, it might also fail to match (for example, sometimes you get new organisms with “candidate xyz” names that are later renamed).
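
If you want to check how well a mapping file matches your references, something like the sketch below could help. It assumes the NCBI accession2taxid layout (tab-separated, one header row, columns accession, accession.version, taxid, gi) and that each FASTA header starts with the accession. The function names are mine, and this is only a rough check, not what malt-build does internally.

```python
import gzip


def open_text(path):
    """Open a plain or gzipped text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def fasta_accessions(fasta_path):
    """Collect the first whitespace-delimited token of every FASTA header."""
    accessions = set()
    with open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    accessions.add(tokens[0])
    return accessions


def unmapped_accessions(fasta_path, acc2taxid_path):
    """Return the FASTA accessions that the accession2taxid table lacks."""
    missing = fasta_accessions(fasta_path)
    with open_text(acc2taxid_path) as handle:
        next(handle, None)  # skip the header row (assumed to be present)
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue
            # Column 1 is the bare accession, column 2 is accession.version;
            # accept a match on either form.
            missing.discard(fields[0])
            missing.discard(fields[1])
            if not missing:
                break  # everything is accounted for, stop reading early
    return missing
```

The table is streamed line by line, so only the FASTA accessions are held in memory; for the full nt that is still a lot of strings, so it may be worth trying this on a subset first. Anything returned by `unmapped_accessions` is a sequence that the mapping file does not know about.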

Cheers,
Lesley