Clustering nanopore reads before blasting and doing 16S analysis with MEGAN6


Our lab has recently started working on 16S analysis using ONT nanopore sequencing, and I am trying to set up a pipeline for 16S amplicon analysis. I am sorry if the questions sound quite simple, but I am figuring things out as I go.

We amplified the 16S region from some environmental samples, sequenced the amplicons on an ONT MinION, and are now trying to work out which bacteria are present in each sample and in what proportions.

After preprocessing the reads (adapter trimming, then length and quality trimming), we BLASTed them locally against NCBI’s 16S Ribosomal RNA database, which gives an output file in .blastn format. We then imported the BLAST file into MEGAN6 and continued the analysis there.

My question stems from the fact that BLASTing the reads (about 150,000 per sample on average) takes a very long time. Does anyone have suggestions for reducing the BLAST time, for example by clustering or dereplicating the reads before importing into MEGAN? A single sample took us about 7 hours and produced a BLAST file of about 250 GB.

I tried dereplicating with VSEARCH, but because of its strictness, no reads were ever merged. Or perhaps there is a way to reduce the computational load within BLAST itself? I thought of limiting the number of matches per read, but after reading about the -max_target_seqs option, I am now trying to find a different solution.
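To illustrate what I mean by strictness, here is a toy Python simulation (not our data). It assumes a 1,500 bp amplicon and a 5% per-base substitution rate, both made-up values, and it ignores indels. It counts how many distinct sequences are left when noisy copies of a single template are collapsed by exact match, which is my understanding of what dereplication does.

```python
import random


def noisy_copy(template, error_rate, rng):
    """Return a copy of template with random substitutions at error_rate."""
    out = []
    for base in template:
        if rng.random() < error_rate:
            # Replace with a different base (indels are ignored in this toy model).
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def count_unique_after_dereplication(n_reads, read_length, error_rate, seed=0):
    """Simulate noisy reads from one template and count exact-unique sequences."""
    rng = random.Random(seed)
    template = "".join(rng.choice("ACGT") for _ in range(read_length))
    reads = [noisy_copy(template, error_rate, rng) for _ in range(n_reads)]
    # Exact-match dereplication keeps one copy of each distinct sequence.
    return len(set(reads))


# Hypothetical values: 1,000 reads, 1,500 bp amplicon.
noisy_unique = count_unique_after_dereplication(1000, 1500, 0.05)
clean_unique = count_unique_after_dereplication(1000, 1500, 0.0)
print("unique sequences with 5% errors:", noisy_unique)
print("unique sequences with no errors:", clean_unique)
```

In this toy model the error-free reads collapse to a single sequence, while with errors practically every read stays unique, which matches what I see with our real reads.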

We have about 130 samples, each containing about 150,000 reads on average, and our server has neither the computational power nor the disk space to produce and store all the BLAST results.
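As a rough extrapolation, assuming every sample takes about as long and produces about as much output as the one we ran, and that the samples are processed one after another, the totals look like this:

```python
# Figures from the single sample we ran; treating them as typical is an assumption.
samples = 130
hours_per_sample = 7
gb_per_sample = 250

total_hours = samples * hours_per_sample      # sequential BLAST time in hours
total_days = total_hours / 24                 # the same time in days
total_tb = samples * gb_per_sample / 1000     # BLAST output in TB (1 TB = 1000 GB)

print(f"BLAST time:   {total_hours} h (about {total_days:.1f} days)")
print(f"BLAST output: {total_tb:.1f} TB")
```

That works out to about 910 hours (roughly 38 days) of BLAST time and about 32.5 TB of output.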

Again, very sorry if this seems like a bit of a dumb question, but thank you very much in advance.

This is a difficult question. If clustering using VSEARCH (or a similar tool) doesn’t work, then perhaps the best hope is a tool designed explicitly for long reads?
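Before giving up on VSEARCH entirely, it may be worth trying identity-threshold clustering rather than exact dereplication. Below is a minimal Python sketch that only builds the command line. It assumes VSEARCH is installed and the reads are in FASTA format. The file names, the 0.90 identity threshold, and the thread count are placeholders, not recommendations; a suitable threshold depends on the read accuracy, and a loose threshold may merge distinct taxa.

```python
import shlex


def vsearch_cluster_command(reads_fasta, centroids_fasta, uc_file,
                            identity=0.90, threads=4):
    """Build a VSEARCH command that clusters reads at an identity threshold."""
    return [
        "vsearch",
        "--cluster_fast", reads_fasta,    # input reads (FASTA)
        "--id", str(identity),            # minimum identity to join a cluster
        "--strand", "both",               # reads may be in either orientation
        "--centroids", centroids_fasta,   # one representative per cluster
        "--uc", uc_file,                  # cluster membership table
        "--threads", str(threads),
    ]


# Placeholder file names for a single sample.
cmd = vsearch_cluster_command(
    "sample01.fasta", "sample01.centroids.fasta", "sample01.uc"
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`. The idea would be to BLAST only the centroids and use the .uc file to map cluster sizes back onto the hits. Whether that abundance information carries through to MEGAN is something you would need to check.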

A colleague suggests looking at this method:

Perhaps it can be adapted to Nanopore reads?