Sam2RMA fails with String or BLOB exceeds size limit

bioinfodonk · January 17, 2025, 12:07am

Hello,

I’ve written a wrapper script that runs MEGAN6 on my long-read SAM files. All other files complete fine, but this one fails with the error pasted below.

The SAM was obtained by mapping the FASTQ to a DIAMOND database of NCBI nr, roughly following this workflow for Nanopore long reads…

Any debugging tips appreciated. Thanks!

Current SAM file: bams/aligned/NCBI_nr/barcode03.sam
Reads file:   fastqs/decontaminated/barcode03.fastq
Output file:  megan_diamond/rmas/barcode03.rma
Classifications: Taxonomy, SEED, EGGNOG, GTDB, EC, INTERPRO2GO
Generating RMA6 file Parsing matches
Annotating RMA6 file using FAST mode (accession database and first accession per line)
Parsing file barcode03.sam
Parsing file: bams/aligned/NCBI_nr/barcode03.sam
Input domination filter: MinPercentCoverToStronglyDominate=90.0 and TopPercentScoreToStronglyDominate=90.0
10% 20% 30% Caught:
org.sqlite.SQLiteException: [SQLITE_TOOBIG] String or BLOB exceeds size limit (statement too long)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.core.DB.newSQLException(DB.java:1179)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.core.DB.newSQLException(DB.java:1190)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.core.DB.throwex(DB.java:1150)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.core.NativeDB.prepare_utf8(Native Method)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.core.NativeDB.prepare(NativeDB.java:126)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.core.DB.prepare(DB.java:264)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.jdbc3.JDBC3Statement.lambda$executeQuery$1(JDBC3Statement.java:81)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.jdbc3.JDBC3Statement.withConnectionTimeout(JDBC3Statement.java:454)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.jdbc3.JDBC3Statement.executeQuery(JDBC3Statement.java:79)
        at megan/megan.accessiondb.AccessAccessionMappingDatabase.getValues(AccessAccessionMappingDatabase.java:220)
        at megan/megan.rma6.RMA6FromBlastCreator.parseFiles(RMA6FromBlastCreator.java:257)
        at megan/megan.tools.SAM2RMA6.createRMA6FileFromSAM(SAM2RMA6.java:340)
        at megan/megan.tools.SAM2RMA6.run(SAM2RMA6.java:307)
        at megan/megan.tools.SAM2RMA6.main(SAM2RMA6.java:69)

Anupam · January 19, 2025, 10:59am

Hi @bioinfodonk,

Would it be possible to share this file? I also recommend using MEGAN7. You can upload the file to a drive and share the link with us.

Best regards,
Anupam

bioinfodonk · January 20, 2025, 2:03am

Hi @Anupam,

The files are very large (~20 GB each for the FASTQ and SAM), but I’m working on it. Unfortunately, using MEGAN7 isn’t currently possible in my pipeline.

Appreciate any tips you might have.
Thanks

bioinfodonk · January 20, 2025, 7:16pm

Hello @Anupam,

Here’s the link: Dropbox

Anupam · January 21, 2025, 12:07am

Thanks @bioinfodonk, will update you soon.

Best
Anupam

bioinfodonk · February 14, 2025, 11:51pm

Hello @Anupam, any chance you’ve found a solution? Thank you.

Anupam · February 17, 2025, 9:05pm

Hi @bioinfodonk,

Sorry for the delay! I’ve been caught up with some work, but I’ll look into it soon and get back to you. Thanks for your patience!

best
Anupam

Anupam · February 24, 2025, 12:01am

Hi @bioinfodonk,

I ran the provided SAM file on my server using MEGAN6 Ultimate Edition and was able to reproduce the error on my end. However, I noticed that the FASTQ file you provided appears to be a BAM file. If you could send the raw FASTQ file, I’d be happy to check it again for you.

Regarding the issue, I believe it stems from DIAMOND reporting a large number of alignments per read—a common occurrence with long-read alignments due to the vast number of entries in the NCBI-nr database. Here are a couple of suggestions to handle this:

Use a DAA file instead of a SAM file:
- DAA files are optimized for MEGAN and generally more efficient to process.
Limit the number of reported alignments per read:
- Instead of using --top 5 in your DIAMOND command, try:
```
-F 5000 --range-culling -k 25
```
- This will report at most the top 25 alignments per read rather than every alignment scoring within 5% of the best one, helping MEGAN process the file without overwhelming the system.

The error occurs because MEGAN queries the SQLite accession mapping database, and with a very large number of alignments more accession lookups are required, so the query can grow past SQLite's size limit (the "statement too long" in the stack trace). Limiting the reported alignments can help avoid this issue.
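
To check whether a few reads really carry an unusually large number of alignments, a minimal Python sketch like the one below could be run on the SAM file before re-running DIAMOND. It only counts aligned records per read name. The function names and the threshold are illustrative assumptions and are not part of MEGAN or DIAMOND.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def alignments_per_read(sam_lines: Iterable[str]) -> Counter:
    """Count aligned SAM records per query (read) name."""
    counts: Counter = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a well-formed alignment record
        qname, rname = fields[0], fields[2]
        # RNAME '*' marks a record without a reference, i.e. unaligned.
        if rname == "*":
            continue
        counts[qname] += 1
    return counts


def reads_over_threshold(counts: Counter, threshold: int) -> List[Tuple[str, int]]:
    """Return (read, count) pairs with more than `threshold` alignments,
    sorted from most to fewest alignments."""
    return [(read, n) for read, n in counts.most_common() if n > threshold]


def report(path: str, threshold: int = 1000) -> None:
    """Stream a SAM file from disk and print the reads above the threshold.
    The default threshold is an arbitrary illustrative value."""
    with open(path) as handle:
        counts = alignments_per_read(handle)
    for read, n in reads_over_threshold(counts, threshold):
        print(f"{read}\t{n}")
```

Calling `report("bams/aligned/NCBI_nr/barcode03.sam")` streams the file line by line, so only the per-read counts are held in memory, not the full 20 GB file. If a handful of reads show far more alignments than the rest, that would be consistent with the explanation above.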

I also noticed that you’re using a high frameshift penalty. Is there a specific reason for this setting?

If you prefer to continue working with the SAM format, we can explore other solutions, but I believe using the -k 25 option should resolve the issue (with SAM format too).

Let me know your preference, and I’ll be happy to assist further!

Best regards,
Anupam

bioinfodonk · February 24, 2025, 5:31am

Hello,

Thank you for looking into this! I do prefer to use SAM if possible. The -F 5000 was actually an incorrectly copied parameter (I was following a PacBio tutorial, whereas I have Nanopore data). I was actually using -F 15.

I will try with -k 25 instead of --top 5 and report back!

bioinfodonk · April 8, 2025, 4:47pm

Thank you, that worked!