Sam2RMA fails with String or BLOB exceeds size limit

Hello,

I’ve written a wrapper script that runs MEGAN 6 on my long-read SAM files. All other files complete fine, but this one fails with the error pasted below.

The SAM was obtained by mapping the FASTQ to a DIAMOND database of NCBI nr, roughly following this workflow for Nanopore long reads…

Any debugging tips appreciated. Thanks!

Current SAM file: bams/aligned/NCBI_nr/barcode03.sam
Reads file:   fastqs/decontaminated/barcode03.fastq
Output file:  megan_diamond/rmas/barcode03.rma
Classifications: Taxonomy, SEED, EGGNOG, GTDB, EC, INTERPRO2GO
Generating RMA6 file Parsing matches
Annotating RMA6 file using FAST mode (accession database and first accession per line)
Parsing file barcode03.sam
Parsing file: bams/aligned/NCBI_nr/barcode03.sam
Input domination filter: MinPercentCoverToStronglyDominate=90.0 and TopPercentScoreToStronglyDominate=90.0
10% 20% 30% Caught:
org.sqlite.SQLiteException: [SQLITE_TOOBIG] String or BLOB exceeds size limit (statement too long)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.core.DB.newSQLException(DB.java:1179)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.core.DB.newSQLException(DB.java:1190)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.core.DB.throwex(DB.java:1150)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.core.NativeDB.prepare_utf8(Native Method)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.core.NativeDB.prepare(NativeDB.java:126)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.core.DB.prepare(DB.java:264)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.jdbc3.JDBC3Statement.lambda$executeQuery$1(JDBC3Statement.java:81)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.jdbc3.JDBC3Statement.withConnectionTimeout(JDBC3Statement.java:454)
        at org.xerial.sqlitejdbc@3.42.0.0/org.sqlite.jdbc3.JDBC3Statement.executeQuery(JDBC3Statement.java:79)
        at megan/megan.accessiondb.AccessAccessionMappingDatabase.getValues(AccessAccessionMappingDatabase.java:220)
        at megan/megan.rma6.RMA6FromBlastCreator.parseFiles(RMA6FromBlastCreator.java:257)
        at megan/megan.tools.SAM2RMA6.createRMA6FileFromSAM(SAM2RMA6.java:340)
        at megan/megan.tools.SAM2RMA6.run(SAM2RMA6.java:307)
        at megan/megan.tools.SAM2RMA6.main(SAM2RMA6.java:69)

Hi @bioinfodonk,

Would it be possible to share this file? I also recommend using MEGAN7. You can upload the file to a drive and share the link with us.

Best regards,
Anupam

Hi @Anupam,

The files are very large (~20 GB each for the FASTQ and the SAM), but I’m working on it. Unfortunately, using MEGAN7 isn’t currently possible in my pipeline.

Appreciate any tips you might have.
Thanks

Hello @Anupam,

Here’s the link: Dropbox

Thanks @bioinfodonk, will update you soon.

Best
Anupam

Hello @Anupam, any chance you’ve found a solution? Thank you.

Hi @bioinfodonk,

Sorry for the delay! I’ve been caught up with some work, but I’ll look into it soon and get back to you. Thanks for your patience!

best
Anupam

Hi @bioinfodonk,

I ran the provided SAM file on my server using MEGAN6 Ultimate Edition and was able to reproduce the error on my end. However, I noticed that the FASTQ file you provided appears to be a BAM file. If you could send the raw FASTQ file, I’d be happy to check it again for you.

Regarding the issue, I believe it stems from DIAMOND reporting a large number of alignments per read—a common occurrence with long-read alignments due to the vast number of entries in the NCBI-nr database. Here are a couple of suggestions to handle this:

  1. Use a DAA file instead of a SAM file:

    • DAA files are optimized for MEGAN and generally more efficient to process.
  2. Limit the number of reported alignments per read:

    • Instead of using --top 5 in your DIAMOND command, try:
      -F 5000 --range-culling -k 25
      
    • This will report the top 25 alignments per read rather than every alignment scoring within 5% of the best one, helping MEGAN process the file without overwhelming the system.

The error occurs because MEGAN looks up accessions in the SQLite mapping database, and with a large number of alignments, more accession lookups are required. The "statement too long" message in the trace suggests that the resulting query exceeds SQLite's size limit. Limiting the reported alignments can help avoid this issue.
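
To check whether a given SAM file really contains reads with very many alignments, something like the following could be used. This is only a minimal sketch and not part of MEGAN or DIAMOND: the function names are hypothetical, and it assumes a plain-text SAM file in which column 1 is the read name and column 3 is the reference name ("*" for unaligned reads).

```python
from collections import Counter


def count_alignments_per_read(sam_lines):
    """Count aligned SAM records per read name (hypothetical helper)."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 1 is QNAME, column 3 is RNAME; "*" marks an unaligned read.
        if len(fields) < 3 or fields[2] == "*":
            continue
        counts[fields[0]] += 1
    return counts


def report_heaviest(counts, n=5):
    """Return the n reads with the most alignments as (name, count) pairs."""
    return counts.most_common(n)


if __name__ == "__main__":
    import sys

    # Usage: python count_alignments.py bams/aligned/NCBI_nr/barcode03.sam
    with open(sys.argv[1]) as handle:
        for name, total in report_heaviest(count_alignments_per_read(handle)):
            print(name, total, sep="\t")
```

The file is read line by line, so a ~20 GB SAM should not need to fit in memory; only the per-read counters are kept.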

I also noticed that you’re using a high frameshift penalty. Is there a specific reason for this setting?

If you prefer to continue working with the SAM format, we can explore other solutions, but I believe using the -k 25 option should resolve the issue (with SAM format too).
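
For a wrapper script, the adjusted DIAMOND call might be assembled roughly as below. This is a sketch under assumptions: the helper name and the database path `nr.dmnd` are made up for illustration, the query and output paths are taken from the log above, and `-F 15` is used as a default only because it is the value mentioned in this thread. `-f 101` asks DIAMOND for SAM output (`-f 100` would write DAA instead).

```python
import shlex


def build_diamond_command(query, database, output, frameshift=15, max_target_seqs=25):
    """Build a DIAMOND blastx argument list that uses -k instead of --top."""
    return [
        "diamond", "blastx",
        "-d", database,              # DIAMOND database built from NCBI nr
        "-q", query,                 # long-read FASTQ
        "-o", output,                # output file
        "-f", "101",                 # 101 = SAM output, 100 = DAA
        "-F", str(frameshift),       # frameshift penalty (frameshift-aware mode)
        "--range-culling",           # suggested for long reads in this thread
        "-k", str(max_target_seqs),  # replaces --top 5
    ]


if __name__ == "__main__":
    cmd = build_diamond_command(
        "fastqs/decontaminated/barcode03.fastq",
        "nr.dmnd",  # hypothetical database path
        "bams/aligned/NCBI_nr/barcode03.sam",
    )
    # Print the command for inspection; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting problems if any of the paths contain spaces.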

Let me know your preference, and I’ll be happy to assist further!

Best regards,
Anupam

Hello,

Thank you for looking into this! I do prefer to use SAM if possible. The -F 5000 was actually an incorrectly copied parameter (I was following a PacBio tutorial, whereas I have Nanopore reads). I was actually using -F 15.

I will try with -k 25 instead of --top 5 and report back!