Does sam2rma work for converting SAM protein alignments?

dportik · July 28, 2020, 12:36am

Hello,
I have used diamond blastx to produce an alignment file in SAM format. My understanding from viewing options in sam2rma is that it:

Computes a MEGAN RMA (.rma) file from a SAM (.sam) file that was created by DIAMOND or MALT.

Given that diamond only performs nucleotide-protein alignments or protein-protein alignments, I assumed this tool could support SAM protein format.

When I try to convert the resulting SAM file from diamond, I am receiving an error indicating illegal characters in the CIGAR string. This is clearly because this SAM file is storing protein alignments, not nucleotide alignments. There are additional characters in the CIGAR, \ and /, which are indicative of nucleotide frameshifts, whereas the remaining CIGAR characters and values represent amino acid positions. I sort of assumed that sam2rma would properly handle these non-standard CIGAR characters.

My main question is: Does sam2rma only work for nucleotide alignments (blastn, minimap2, etc), or does it work for translated alignments too (diamond blastx)?

Given I am unsure of the intended behavior for sam2rma, I am including the commands used, and a truncated version of the input file I was testing chunk1-100.sam (50.7 KB) . This is a dataset consisting of PacBio HiFi reads.

diamond:

diamond blastx -d refseq_bac-fung-vir.dmnd -q reads_chunk1.fasta -o chunk1.sam -f 101 -F 15 --range-culling --top 10 -b 16 -c 1 -p 32

sam2rma:

sam2rma -i chunk1.sam -r reads01.fasta -o Test -lg -alg longReads -t 12 -mdb megan-map-Jul2020.db -v

output:

SAM2RMA6 - Computes a MEGAN RMA (.rma) file from a SAM (.sam) file that was created by DIAMOND or MALT
Options:
Input
–in: chunk1.sam
–reads: reads01.fasta
Output
–out: Test
–useCompression: true
Reads
–paired: false
–pairedSuffixLength: 0
Parameters
–longReads: true
–maxMatchesPerRead: 100
–classify: true
–minScore: 50.0
–maxExpected: 0.01
–topPercent: 10.0
–minSupportPercent: 0.05
–minSupport: 0
–minPercentReadCover: 0.0
–minPercentReferenceCover: 0.0
–lcaAlgorithm: longReads
–lcaCoveragePercent: 100.0
–readAssignmentMode: alignedBases
Classification support:
–mapDB: /home/dportik/programs/megan/db/megan-map-Jul2020.db
Deprecated classification support:
–parseTaxonNames: true
–firstWordIsAccession: true
–accessionTags: gb| ref|
Other:
–threads: 12
–verbose: true
Version MEGAN Community Edition (version 6.19.4, built 16 Jul 2020)
Author(s) Daniel H. Huson
Copyright (C) 2020 Daniel H. Huson. This program comes with ABSOLUTELY NO WARRANTY.
Functional classifications to use: EC, EGGNOG, GTDB, INTERPRO2GO, SEED
Loading ncbi.map: 2,259,889
Loading ncbi.tre: 2,259,893
Loading ec.map: 8,081
Loading ec.tre: 8,085
Loading eggnog.map: 30,875
Loading eggnog.tre: 30,986
Loading gtdb.map: 182,187
Loading gtdb.tre: 182,191
Loading interpro2go.map: 12,738
Loading interpro2go.tre: 28,689
Loading seed.map: 978
Loading seed.tre: 979
Current SAM file: chunk1.sam
Reads file: reads01.fasta
Output file: Test
Classifications: Taxonomy, SEED, EGGNOG, GTDB, EC, INTERPRO2GO
Generating RMA6 file Parsing matches
Annotating RMA6 file using FAST mode (accession database and first accession per line)
Parsing file chunk1.sam
Parsing file: chunk1.sam
Input domination filter: MinPercentCoverToStronglyDominate=90.0 and TopPercentScoreToStronglyDominate=90.0
Error parsing file near line: 78: Unrecognized CigarOperator: 92
Error parsing file near line: 78: Unrecognized CigarOperator: 92

You can see the offending characters in the CIGAR string of line 78:

291M1\348M

My testing indicated that CIGAR strings containing frameshift characters are pervasive, and form a majority of the records in the SAM file.

Any clarification on this would be greatly appreciated!

Thanks,
Dan

Daniel · July 28, 2020, 6:40am

At present, sam2rma only supports

nucleotide alignments
protein alignments produced by MALT

The recommended way to combine DIAMOND + MEGAN is to produce .daa files using DIAMOND and then to meganize the .daa the files.
If you require a different DIAMOND output format for other purposes, then please note that you can use DIAMOND’s view command to create other formats from the .daa file.

I will look into supporting DIAMOND’s SAM output in the future, but this probably will not be available in the foreseeable future.

dportik · July 28, 2020, 5:51pm

Hi Daniel,
Thank you for the quick response. I will take a look at the SAM protein output from MALT for comparison.

I understand the optimal way to run DIAMOND for MEGAN is to produce .daa format and using daa-meganizer. However, there are HPC bottlenecks if DIAMOND is run this way. I am working with long-read datasets (PacBio HiFi) and trying to create efficient workflows for projects containing many samples. I have done some benchmarking and here is what I found:

Inputs:

Fasta: 2,529,830 HiFi reads; 21GB
Reduced nr database: bacteria, fungi, virus; 44,878,615,829 letters

Diamond run with 32 threads, block-size 8. This is what I would consider typical for a resource request on HPC.

diamond blastx -p 32 --db bac-fung-vir.dmnd --query Sample1.fasta --top 10 --block-size 8 --index-chunks 1 --outfmt 100 --out Sample1.daa

Total time = 196304s (54.5 hrs)
Reported 120613887 pairwise alignments, 164351780 HSPs.
2529625 queries aligned.

Diamond run with double the resources: 64 threads, block-size 16. This is probably more than the typical HPC resources available for many researchers.

diamond blastx -p 64 --db bac-fung-vir.dmnd --query Sample1.fasta --top 10 --block-size 16 --index-chunks 1 --outfmt 100 --out Sample1.daa

Total time = 176854s (49.1 hrs)
Reported 120520986 pairwise alignments, 164105732 HSPs.
2529625 queries aligned.

In this case, doubling the resources only reduced the 54 hr run time by 5 hrs.

I also tested splitting the input fasta file into 5 and 10 pieces. You are correct in that it does not actually speed up DIAMOND (regarding total time to run all files), but running them in parallel does save a considerable amount of real-world time.

When the fasta is split into 5 pieces, each file contains ~500,000 records and is 4GB. Using the same commands as above, the results are:

Total time = 36840.1s (10.2 hrs)
Reported 23674215 pairwise alignments, 30310046 HSPs.
506454 queries aligned.

When the fasta is split into 10 pieces, each file contains ~250,000 records and is 2GB. Using the same commands as above, the results are:

Total time = 21259.6s (5.9 hrs)
Reported 11736020 pairwise alignments, 14953493 HSPs.
257103 queries aligned.

Again, this strategy doesn’t reduce the total time to process all files if they are run in serial, but if they are run in parallel there are major reductions in processing time. I will be involved with projects containing many samples, and so splitting files and using a workflow manager (e.g., snakemake) will allow much more efficient use of our HPC system. Using SAM format allows the output files to be combined very easily, and the reads will be grouped properly. Combining multiple .daa files seems possible with Ella, but perhaps very slow and counteracting the time saved by splitting files.

Would you be able to clarify which information is being used in the SAM for sam2rma?

Are the CIGAR strings and aligned sequence segments being used directly? For example, it would be easy enough to translate the protein CIGAR into a nucleotide CIGAR and simply omit the matched read sequence (using * instead). However, if sam2rma also uses the matched sequence segment, then this would not be as straightforward.

I understand that your recommendation is to produce .daa output and used daa-meganizer. My main concern is that the reference databases, sequencing data, and scale of projects will continue to grow in size. It would be ideal to have alternative pathways into MEGAN, as it is an exceptionally useful program. SAM format is straightforward to work with, has many supporting tools, and seems like a viable option for producing the required rma format. I am actively developing these workflows, and I welcome your feedback and suggestions.

Thanks,
Dan

Daniel · July 29, 2020, 6:55am

Hi Dan,

you have made a very good case for using SAM format. It does indeed allow one to split the alignment task into many subtasks whose results can then concatenated and merged.

Workaround: Providing a * as cigar string does work. You will lose the ability to view the alignments in MEGAN, of course.

MEGAN uses SAM cigar strings internally in the RMA6 format to represent alignments, including ones with frame-shifts…
So, there appears to be an incompatibility with the way DIAMOND encodes frame-shifts in SAM format.

I am working on this.

dportik · July 29, 2020, 6:35pm

Hi Daniel,
Thank you again for the quick response. I have looked at the protein-alignment outputs from MALT and can see that the CIGAR strings do not contain any frameshift characters ( / and \ ) that DIAMOND uses.

There could be several options for cleaning the DIAMOND SAM before running sam2rma. The frameshift characters could be eliminated - these are typically in the CIGAR as 1/ or 1\ (although I suppose 2/ and 2\ are also possible). I am not sure how much this would impact the interpreted alignment. In general, when a frameshift character is present there seems to be a mismatch in AA numbers reported in the CIGAR and the number of AA positions in the sequence segment shown. Another possibility is to use the frameshift characters to revise the CIGAR string, though I am not sure how to interpret these single bases in the context of protein format (where other “positions” represent 3 base chunks). Or, as you’ve mentioned, the CIGAR can be eliminated, preventing some MEGAN features from being used.

Any of the above would require an intermediate step between DIAMOND and sam2rma, so I could also see the value of making the DIAMOND SAM directly compatible with sam2rma.

I do have one quick followup question:

Does MEGAN use the sequence segment information from the SAM files (e.g., the 10th column, when starting the column count from 1)?
This column contains a long string of amino acids or nucleotide characters representing the matched sequence from the CIGAR string. I was thinking that if this information is not used, that I will probably pre-clean SAM files to eliminate these strings (using a * instead). This would substantially reduce the size of the merged SAM file that I would convert to RMA format.

Thanks again for considering my request to make SAM protein format more accessible. Please let me know if there is anything I can do on my end to help out.

Thanks,
Dan

Daniel · July 30, 2020, 6:07am

Yes, you can replace both the cigar and the 10-th column by a star. MEGAN will still process the SAM file (but you won’t be able to inspect the alignments).

I am currently working on parsing the cigar string, 10-th column and MD: tag so as to correctly report the alignment. This work has hit a snag, as it looks like the 10-th column of the SAM output of DIAMOND may be incorrect and so this first has to be sorted out.

Daniel · July 30, 2020, 9:21am

SAM format is really a DNA alignment format for short read alignments and DIAMOND and MEGAN misuse it to represent DNA to protein alignments. However, the current implementations in DIAMOND and MEGAN fail when frame-shifts are present. Also, use of SAM is inefficient for reporting alignments of long queries. So, fixing DIAMOND and MEGAN so that SAM format can be used to convey frame-shift aware alignments will not happen soon. However, you can still use the approach of simply replacing the cigar and 10-th column by a *. You can also remove the MD: tag.

The next release of MEGAN (6.19.7) will contain code that silently accepts any SAM alignment that contains a frame shift. The alignment will still be used for calculations, but will not be viewable in the inspector window.

dportik · July 30, 2020, 5:22pm

Hi Daniel,
Thank you for the detailed responses, this helps to clarify several things.

I realize that protein alignment is not the intended use for SAM format. I will be performing nucelotide-nucleotide and nucleotide-protein alignments for my work, so having the same output format was convenient for constructing workflows.

I understand that removing the CIGAR, 10th column, and MD tags will allow the DIAMOND frameshift-aware protein SAM to be read by MEGAN, but will prevent alignment features and some calculations.

I suppose another option is simply to not use frameshift-aware protein alignment in DIAMOND. The HiFi reads are very accurate and the alignments should still be very high quality. HiFi reads do contain indels at low frequency, so it may result in some shorter or split alignments. I can try this and compare the results. However, the frameshift-aware protein alignments will be necessary for other long read types with higher error rates (e.g., ONT), so the workaround will still be useful.

I have one last follow up question:
Are the CIGAR, 10th column, and MD tags all required in SAM to allow alignment viewing and correct calculations by MEGAN?
For example, let’s say I do not use frameshift-aware alignment. Then, in re-formatting a SAM I only include CIGAR, use * for the 10th column, and exclude all optional tags in the SAM. Will MEGAN still function properly? I just want to clarify if all of these fields must be present to enable desirable downstream features, or if some can be left out. Omitting some fields will reduce file sizes, but I would not do this if it disables key features in MEGAN. I will likely be using sam2rma rather than direct import into MEGAN, if that helps narrow down the answer.

Thanks again for your helpful suggestions!

Daniel · July 31, 2020, 8:32am

All fields have to be present for MEGAN to recreate the alignment.
Using SAM for long reads is not ideal.

The best way to solve this would be to use DAA format and then to merge the DAA files. it would be fantastic if DIAMOND provided a merge command that allowed one to merge multiple DAA files into a single one. I will suggest this to Benjamin Buchfink.

dportik · July 31, 2020, 5:15pm

Hi Daniel,
I agree, the DAA merger would be ideal.

I should also mention that it is currently not possible to use the --range-culling feature of DIAMOND blastx without using the -F (frameshift-aware) option. This is unfortunately what produces the illegal CIGAR strings. However, --range-culling is an important aspect for long reads. One workaround I have found is just to set an absurdly high penalty to frameshifts. For example, using -F 2000 (or higher) instead of -F 15. This more or less prevents frameshifts but allows multiple gene hits for a read.

I appreciate the discussion we’ve had about these topics. Thanks for all of your hard work on MEGAN and its support tools!

Daniel · August 4, 2020, 7:04am

Yes, it would be good if the two DIAMOND options were decoupled. For such DIAMOND feature requests, please visit http://www.diamondsearch.org, which is maintained by Benjamin Buchfink, who is working on the DIAMOND code.
(Benjamin said that he will look into implementing a merge command so that DIAMOND can be used to merge multiple DAA files.)