Alignment of long reads by MALT or DIAMOND

Hello,

I want to use MEGAN on NexSeq 2x150 reads that I have assembled (R1 and R2 files trimmed and assembled together), thus obtaining longer reads. I’ve read that Diamond is not recommended for long reads. Actually, I’ve already employed Diamond on my datasets and only a low proportion of reads are assigned. My question is whether MALT could be used on long reads. I’ve read that LAST is also recommended for long reads.
Looking forward to a reply. Thanks in advance.

DIAMOND will do fine on those reads. If you are not getting any alignments against NR, then it is not a problem of the aligner.

Long reads start at around 500bp, say.
And for those, DIAMOND and new options for dealing with them.

Hello Daniel,

I’ve been checking the contigs length and they range from 300 to 2,000 bp, sometimes reaching up to 10,000 bp. Isn’t that considered long reads? In that case, should I use LAST or MALT?

I’d also like to ask you another thing (it may be pretty basic). My R1.fastq file contains 3,412,750 sequences (152 bp). After diamond blastx (against NCBI-nr) the number of queries aligned is 1,653,102. Doesn’t this represent a too low percentage of the total reads? This is the output from diamond blastx:

Reported 27513386 pairwise alignments, 27513529 HSPs.
1653102 queries aligned.

Use DIAMOND with the following options:

-F 15 --range-culling --top 10

-F 15 will trigger frame-shift aware alignments (rather than straight translated alignments), this if you suspect that there may be frame-shift causing errors in your long sequences. This is probably less the case for contigs, more the case for long reads (Nanopore or PacBio).

–rangeCulling and --top 10 will change the way DIAMOND decides which alignments to report: range culling turned on will cause DIAMOND to report alignments along the whole length of the read, locally reporting all alignments with 10% of the best local bit score (–top 10). If this is not turned on, then DIAMOND will report the best scoring alignments globally for the read, which will tend to concentrate on one region of the query (the most conserved gene).

There is no publication on this yet, but in our hands running DIAMOND in this mode is slightly more sensitive than running LAST.

This depends on the environment that you sampled. For a human gut sample, this would be low, but for a soil sample, that seems right.

Dear Daniel,

As you suggested, I’ve run diamond blastx with the options -F 15 --range-culling --top 10. I’m currently analysing samples from marine bacterioplankton sequenced by NovaSeq 6000 (2x150).

When I run blastx with the standard settings and then daa-meganizer against InterPro mapping file, I get 149,712 assignments out of 225,442 contigs (and 5,237,419 alignments). However, when I run blastx with the options -F 15 --range-culling --top 10 and then daa-meganizer, I get significantly higher number of alignments (26,137,913) but 0 assignments.

Below you can see the commands and outputs in both cases:

diamond blastx -d nr.dmnd -q sample1.fasta -o sample1.daa --outfmt 100

daa-meganizer -i sample1.daa -a2interpro2go acc2interpro-June2018X.abin

Total reads: 225,442
_With hits: 225,442 _
Alignments: 5,237,419
Assig. Taxonomy: 0
Assig. INTERPRO2GO: 149,712

diamond blastx -d nr.dmnd -q sample1.fasta -o sample1.daa --outfmt 100 -F 15 --range-culling --top 10

daa-meganizer -i sample1.daa -a2interpro2go acc2interpro-June2018X.abin

Total reads: 226,132
_With hits: 226,132 _
Alignments: 26,137,913
Assig. Taxonomy: 0
Assig. INTERPRO2GO: 0

Dear Juanjo,

This looks like a different issue, but just in case, when using the long-read mode you should also specificy -lg (for longReads) flag for meganizer.

Your command should look like:

daa-meganizer -i sample1.daa -a2interpro2go acc2interpro-June2018X.abin -lg

The short-read functional assignment algorithm will assume a short-read can span only one gene and therefore assign it to only one function. The long-read algorithm; however, can assign multiple functions to one read/contig as a single long-read/contig may span multiple genes.

Best,
Caner

Dear Caner,

Many thanks for the reply. It actually worked! However, the standard daa-meganizer settings generated more assignments to INTERPRO2GO (100,164) than the option -lg (83,571), even though the contigs employed in this analysis are quite long, ranging between 200 bp and 111,190 bp (average 839 bp, N50 1078 bp). Does it make sense to you?

Dear Juanjo,

This rather seems like a bug in the numbers reported, not the actual assignments (or in the reads that are put under “Not assigned” - they seem to be duplicates @Daniel).

I tested with a small file (50 reads). The commandline reports that in the short-read mode 38 reads get an assignment and in the long-read mode 36. Which confirms your observation. If you, however, open the file and look at the actual assignments, the short-read mode really has 38 reads assigned to a function, whereas the long-read mode actually has 45 reads assigned to a function with a total number of 150 assignments (one long-read may span multiple gene, on average ~3 in this case).
If you look at “Not assigned” node in the long-read mode, you still see 14 reads there (50-14=36, I guess that’s what the commandline reports). But 9 of these 14 reads actually have an assignment somewhere in the InterPro2GO classification. I don’t know why they also appear there.

Hello Caner,

Interesting. Many thanks for the test. So, how should this bug be reported? I’ve seen you’ve added a link to Daniel. Is this enough? I guess there is nothing I can do to continue my analyses before this bug is fixed, isn’t it?

Daniel should have received an email but he’ll be back at work next week.

The answer to your question depends on what you need for your analysis. But in general, I think this bug shouldn’t affect you in any way. The reads still get assigned correctly (at least in the small test case I produced). Some of them also just appear in the “Not assigned” bin. If you don’t care about the “Not assigned” bin, you should be fine. If you want to find which ones are really “Not assigned”, what I did was I selected all nodes, extracted read name to InterPro2GO name in tab format. Then excluded “Not assigned” and did the same again. Then I printed the first column of both files and diff’d them in bash.

awk ‘{print $1}’ mtuberculosis_simulated_50_reads.diamond.nr.lg-ex-all.txt | sort | uniq > all-reads.txt
awk ‘{print $1}’ mtuberculosis_simulated_50_reads.diamond.nr.lg-ex-assigned.txt | sort | uniq > assigned-reads.txt
diff all-reads.txt assigned-reads.txt

The thing is that I’m also interested in a quantitative functional analysis to study the relative abundance of certain genes in the bacterial communities and compare between treatments. Therefore, I need to get a reliable estimation of the number of the different genes in the different samples.
I’ll try as you suggested while I wait for this bug to be addressed.

Many thanks for your help. I’ll be looking forward to hearing from MEGAN community.
Kind regards,

Juanjo

I have just found and fixed the bug, and will upload a new release today