Does sam2rma work for converting SAM protein alignments?

Hi Daniel,
Thank you for the quick response. I will take a look at the SAM protein output from MALT for comparison.

I understand the optimal way to run DIAMOND for MEGAN is to produce .daa format and then use daa-meganizer. However, running DIAMOND this way creates bottlenecks on HPC. I am working with long-read datasets (PacBio HiFi) and trying to build efficient workflows for projects containing many samples. I have done some benchmarking, and here is what I found:

Inputs:

Fasta: 2,529,830 HiFi reads; 21GB
Reduced nr database: bacteria, fungi, virus; 44,878,615,829 letters

DIAMOND run with 32 threads, block-size 8. This is what I would consider a typical resource request on HPC.

diamond blastx -p 32 --db bac-fung-vir.dmnd --query Sample1.fasta --top 10 --block-size 8 --index-chunks 1 --outfmt 100 --out Sample1.daa

Total time = 196304s (54.5 hrs)
Reported 120613887 pairwise alignments, 164351780 HSPs.
2529625 queries aligned.

DIAMOND run with double the resources: 64 threads, block-size 16. This is probably more than the HPC resources typically available to many researchers.

diamond blastx -p 64 --db bac-fung-vir.dmnd --query Sample1.fasta --top 10 --block-size 16 --index-chunks 1 --outfmt 100 --out Sample1.daa

Total time = 176854s (49.1 hrs)
Reported 120520986 pairwise alignments, 164105732 HSPs.
2529625 queries aligned.

In this case, doubling the resources reduced the 54.5 hr run time by only about 5 hrs.

I also tested splitting the input fasta file into 5 and 10 pieces. You are correct that it does not actually speed up DIAMOND (in terms of total compute time across all files), but running the pieces in parallel does save a considerable amount of wall-clock time.

When the fasta is split into 5 pieces, each file contains ~500,000 records and is 4GB. Using the same commands as above, the results for a single piece are:

Total time = 36840.1s (10.2 hrs)
Reported 23674215 pairwise alignments, 30310046 HSPs.
506454 queries aligned.

When the fasta is split into 10 pieces, each file contains ~250,000 records and is 2GB. Using the same commands as above, the results for a single piece are:

Total time = 21259.6s (5.9 hrs)
Reported 11736020 pairwise alignments, 14953493 HSPs.
257103 queries aligned.

Again, this strategy doesn't reduce the total compute time if the pieces are run serially, but running them in parallel yields major reductions in wall-clock time. I will be involved with projects containing many samples, so splitting files and using a workflow manager (e.g., snakemake) will allow much more efficient use of our HPC system. Using SAM format allows the output files to be combined very easily, with the reads grouped properly. Combining multiple .daa files seems possible with Ella, but perhaps very slow, which would counteract the time saved by splitting the files.
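For reference, the splitting step itself is trivial. Here is a minimal sketch of the kind of splitter I am using (a pure-Python round-robin FASTA splitter; the function name and output naming scheme are just illustrative, not from any particular tool):

```python
import itertools

def split_fasta(path, n_pieces, prefix="piece"):
    """Split a FASTA file into n_pieces files, assigning records
    round-robin so the pieces end up roughly equal in size."""
    outs = [open(f"{prefix}_{i}.fasta", "w") for i in range(n_pieces)]
    try:
        cycle = itertools.cycle(outs)
        current = None
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):  # header line starts a new record
                    current = next(cycle)  # switch to the next output file
                current.write(line)
    finally:
        for f in outs:
            f.close()
```

Each resulting piece then gets its own DIAMOND job submitted through the workflow manager, and the per-piece SAM outputs can simply be concatenated afterwards.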

Would you be able to clarify which information is being used in the SAM for sam2rma?

Are the CIGAR strings and aligned sequence segments being used directly? For example, it would be easy enough to translate the protein CIGAR into a nucleotide CIGAR and simply omit the matched read sequence (using * instead). However, if sam2rma also uses the matched sequence segment, then this would not be as straightforward.
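To illustrate the translation I have in mind, here is a sketch (assuming the CIGAR operation lengths are in protein-space residues, so each unit corresponds to three nucleotides; this ignores any frameshift operations, which would need special handling):

```python
import re

def protein_to_nucleotide_cigar(cigar):
    """Convert a protein-space CIGAR string to nucleotide space by
    tripling every operation length (1 residue = 3 nucleotides)."""
    out = []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        out.append(f"{int(length) * 3}{op}")
    return "".join(out)

# e.g. "100M2I50M" in protein space becomes "300M6I150M" in nucleotide space
```

That part is mechanical; the open question is whether sam2rma needs anything beyond the CIGAR and the alignment coordinates, e.g. the SEQ field.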

I understand that your recommendation is to produce .daa output and use daa-meganizer. My main concern is that reference databases, sequencing data, and project scales will continue to grow. It would be ideal to have alternative pathways into MEGAN, as it is an exceptionally useful program. SAM format is straightforward to work with, has many supporting tools, and seems like a viable option for producing the required rma format. I am actively developing these workflows, and I welcome your feedback and suggestions.

Thanks,
Dan