Hello,
I have used diamond blastx to produce an alignment file in SAM format. My understanding from viewing the sam2rma options is that it:
Computes a MEGAN RMA (.rma) file from a SAM (.sam) file that was created by DIAMOND or MALT.
Given that diamond only performs nucleotide-protein or protein-protein alignments, I assumed this tool could handle protein alignments stored in SAM format.
When I try to convert the SAM file produced by diamond, I receive an error indicating illegal characters in the CIGAR string. This appears to be because the SAM file stores protein alignments rather than nucleotide alignments. The CIGAR strings contain two additional characters, \ and /, which indicate frameshifts in the nucleotide query, while the remaining CIGAR operators and lengths refer to amino acid positions. I had assumed that sam2rma would handle these non-standard CIGAR characters.
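To make the CIGAR issue concrete, here is a minimal Python sketch of how I am reading these strings. It assumes that \ and / are the only operators added beyond the standard SAM set and that each carries a leading count, as in the example from my file. The name `tokenize_cigar` is my own and is not part of DIAMOND or MEGAN.

```python
import re

# Standard SAM CIGAR operators plus the two frameshift characters
# seen in the diamond blastx output (assumption: no other extras occur).
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X\\/])")


def tokenize_cigar(cigar):
    """Split a CIGAR string into (length, operator) pairs.

    Raises ValueError if any part of the string is not a
    <number><operator> token from the accepted set.
    """
    tokens = []
    pos = 0
    for match in CIGAR_TOKEN.finditer(cigar):
        if match.start() != pos:
            # Something unparseable sits between two valid tokens.
            raise ValueError(f"unexpected text at position {pos}: {cigar!r}")
        tokens.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    if pos != len(cigar):
        raise ValueError(f"unexpected text at position {pos}: {cigar!r}")
    return tokens


# The CIGAR string from the failing record.
print(tokenize_cigar("291M1\\348M"))
# → [(291, 'M'), (1, '\\'), (348, 'M')]
```

A parser that accepts only the standard operator set would stop at the second token here, which matches the error I am seeing.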
My main question is: Does sam2rma only work for nucleotide alignments (blastn, minimap2, etc.), or does it work for translated alignments too (diamond blastx)?
Since I am unsure of the intended behavior of sam2rma, I am including the commands used and a truncated version of the input file I was testing (chunk1-100.sam). The dataset consists of PacBio HiFi reads.
diamond:
diamond blastx -d refseq_bac-fung-vir.dmnd -q reads_chunk1.fasta -o chunk1.sam -f 101 -F 15 --range-culling --top 10 -b 16 -c 1 -p 32
sam2rma:
sam2rma -i chunk1.sam -r reads01.fasta -o Test -lg -alg longReads -t 12 -mdb megan-map-Jul2020.db -v
output:
SAM2RMA6 - Computes a MEGAN RMA (.rma) file from a SAM (.sam) file that was created by DIAMOND or MALT
Options:
Input
--in: chunk1.sam
--reads: reads01.fasta
Output
--out: Test
--useCompression: true
Reads
--paired: false
--pairedSuffixLength: 0
Parameters
--longReads: true
--maxMatchesPerRead: 100
--classify: true
--minScore: 50.0
--maxExpected: 0.01
--topPercent: 10.0
--minSupportPercent: 0.05
--minSupport: 0
--minPercentReadCover: 0.0
--minPercentReferenceCover: 0.0
--lcaAlgorithm: longReads
--lcaCoveragePercent: 100.0
--readAssignmentMode: alignedBases
Classification support:
--mapDB: /home/dportik/programs/megan/db/megan-map-Jul2020.db
Deprecated classification support:
--parseTaxonNames: true
--firstWordIsAccession: true
--accessionTags: gb| ref|
Other:
--threads: 12
--verbose: true
Version MEGAN Community Edition (version 6.19.4, built 16 Jul 2020)
Author(s) Daniel H. Huson
Copyright (C) 2020 Daniel H. Huson. This program comes with ABSOLUTELY NO WARRANTY.
Functional classifications to use: EC, EGGNOG, GTDB, INTERPRO2GO, SEED
Loading ncbi.map: 2,259,889
Loading ncbi.tre: 2,259,893
Loading ec.map: 8,081
Loading ec.tre: 8,085
Loading eggnog.map: 30,875
Loading eggnog.tre: 30,986
Loading gtdb.map: 182,187
Loading gtdb.tre: 182,191
Loading interpro2go.map: 12,738
Loading interpro2go.tre: 28,689
Loading seed.map: 978
Loading seed.tre: 979
Current SAM file: chunk1.sam
Reads file: reads01.fasta
Output file: Test
Classifications: Taxonomy, SEED, EGGNOG, GTDB, EC, INTERPRO2GO
Generating RMA6 file Parsing matches
Annotating RMA6 file using FAST mode (accession database and first accession per line)
Parsing file chunk1.sam
Parsing file: chunk1.sam
Input domination filter: MinPercentCoverToStronglyDominate=90.0 and TopPercentScoreToStronglyDominate=90.0
Error parsing file near line: 78: Unrecognized CigarOperator: 92
Error parsing file near line: 78: Unrecognized CigarOperator: 92
The operator code 92 in the error is the ASCII value of the backslash character. You can see it in the CIGAR string on line 78 of the SAM file:
291M1\348M
My testing indicated that CIGAR strings containing frameshift characters are pervasive and occur in the majority of records in the SAM file.
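For anyone who wants to repeat this check, the following is a rough sketch of the kind of count involved, not the exact script I ran. It assumes a tab-separated SAM file with the CIGAR string in the sixth column and header lines starting with @. The function name `count_frameshift_records` and the two demo records are made up for illustration.

```python
def count_frameshift_records(sam_lines):
    """Return (records_with_frameshift, total_records) for an iterable of SAM lines."""
    total = 0
    with_frameshift = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        cigar = fields[5]  # CIGAR is the sixth SAM column
        total += 1
        if "\\" in cigar or "/" in cigar:
            with_frameshift += 1
    return with_frameshift, total


# Two made-up placeholder records; only the CIGAR column matters here.
demo = [
    "@HD\tVN:1.5\n",
    "read1\t0\tref1\t1\t255\t291M1\\348M\t*\t0\t0\t*\t*\n",
    "read2\t0\tref2\t1\t255\t120M\t*\t0\t0\t*\t*\n",
]
print(count_frameshift_records(demo))
# → (1, 2)
```

To run it on a real file, pass an open file handle instead of the demo list, for example `with open("chunk1.sam") as handle: print(count_frameshift_records(handle))`.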
Any clarification on this would be greatly appreciated!
Thanks,
Dan