I want to use MEGAN on NextSeq 2x150 reads that I have assembled (R1 and R2 files trimmed and assembled together), thus obtaining longer sequences. I’ve read that DIAMOND is not recommended for long reads. In fact, I’ve already run DIAMOND on my datasets and only a low proportion of reads are assigned. My question is whether MALT can be used on long reads. I’ve read that LAST is also recommended for long reads.
Looking forward to a reply. Thanks in advance.
I’ve checked the contig lengths: they range from 300 to 2,000 bp, sometimes reaching up to 10,000 bp. Aren’t those considered long reads? If so, should I use LAST or MALT?
I’d also like to ask you another thing (it may be pretty basic). My R1.fastq file contains 3,412,750 sequences (152 bp). After diamond blastx (against NCBI-nr), the number of queries aligned is 1,653,102. Isn’t that too low a percentage of the total reads? This is the output from diamond blastx:
-F 15 will trigger frame-shift-aware alignments (rather than straight translated alignments); use it if you suspect that there may be frame-shift-causing errors in your long sequences. This is probably less the case for contigs and more the case for long reads (Nanopore or PacBio).
--range-culling and --top 10 will change the way DIAMOND decides which alignments to report: with range culling turned on, DIAMOND will report alignments along the whole length of the read, locally reporting all alignments within 10% of the best local bit score (--top 10). If this is not turned on, then DIAMOND will report the best-scoring alignments globally for the read, which will tend to concentrate on one region of the query (the most conserved gene).
There is no publication on this yet, but in our hands running DIAMOND in this mode is slightly more sensitive than running LAST.
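Putting those flags together, a long-read DIAMOND run might look roughly like the following sketch (the database, input, and output file names here are placeholders, not taken from this thread):

```shell
# Hypothetical sketch of a frame-shift-aware, range-culling DIAMOND run;
# nr.dmnd, contigs.fasta, and contigs.daa are placeholder file names.
#   -F 15           : frame-shift-aware alignment
#   --range-culling : report alignments along the whole query
#   --top 10        : keep hits within 10% of the best local bit score
#   --outfmt 100    : write a DAA file that MEGAN's tools can read
diamond blastx -d nr.dmnd -q contigs.fasta \
    -F 15 --range-culling --top 10 \
    --outfmt 100 -o contigs.daa
```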
As you suggested, I’ve run diamond blastx with the options -F 15 --range-culling --top 10. I’m currently analysing samples from marine bacterioplankton sequenced by NovaSeq 6000 (2x150).
When I run blastx with the standard settings and then daa-meganizer with the InterPro mapping file, I get 149,712 assignments out of 225,442 contigs (and 5,237,419 alignments). However, when I run blastx with the options -F 15 --range-culling --top 10 and then daa-meganizer, I get a significantly higher number of alignments (26,137,913) but 0 assignments.
Below you can see the commands and outputs in both cases:
The short-read functional assignment algorithm assumes a short read can span only one gene and therefore assigns it to only one function. The long-read algorithm, however, can assign multiple functions to one read/contig, since a single long read or contig may span multiple genes.
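In practice this means telling the meganizing step to use the long-read algorithm. A hedged sketch, assuming a DAA file produced as in the earlier blastx step (the input and mapping-file names are placeholders; depending on your MEGAN version, the mapping may instead be supplied as a combined megan-map database):

```shell
# Hypothetical sketch: meganize the DAA file using the long-read
# algorithm (-lg / --longReads), the option discussed in this thread.
# contigs.daa and acc2interpro2go.map.gz are placeholder names.
daa-meganizer -i contigs.daa -lg \
    -a2interpro2go acc2interpro2go.map.gz
```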
Many thanks for the reply. It actually worked! However, the standard daa-meganizer settings generated more assignments to InterPro2GO (100,164) than the -lg option did (83,571), even though the contigs used in this analysis are quite long, ranging between 200 bp and 111,190 bp (average 839 bp, N50 1,078 bp). Does that make sense to you?
This rather looks like a bug in the numbers reported, not in the actual assignments (or in the reads that are put under “Not assigned” - they seem to be duplicates @Daniel).
I tested with a small file (50 reads). The command line reports that in short-read mode 38 reads get an assignment and in long-read mode 36, which confirms your observation. If you open the file and look at the actual assignments, however, the short-read mode really does have 38 reads assigned to a function, whereas the long-read mode actually has 45 reads assigned to a function, with a total of 150 assignments (one long read may span multiple genes, on average ~3 in this case).
If you look at the “Not assigned” node in long-read mode, you still see 14 reads there (50-14=36, which I guess is what the command line reports). But 9 of these 14 reads actually have an assignment somewhere in the InterPro2GO classification. I don’t know why they also appear there.
Interesting. Many thanks for the test. So, how should this bug be reported? I’ve seen you’ve added a link to Daniel; is that enough? I guess there is nothing I can do to continue my analyses until this bug is fixed, is there?
Daniel should have received an email but he’ll be back at work next week.
The answer to your question depends on what you need for your analysis, but in general I think this bug shouldn’t affect you in any way. The reads still get assigned correctly (at least in the small test case I produced); some of them just also appear in the “Not assigned” bin. If you don’t care about the “Not assigned” bin, you should be fine. If you want to find out which reads are really “Not assigned”, here is what I did: I selected all nodes and extracted read name to InterPro2GO name in tab format. Then I excluded “Not assigned” and did the same again. Then I printed the first column of both files and diff’d them in bash.
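That extract-and-compare step can be sketched in bash roughly like this (the two .tsv files stand in for the MEGAN exports; the file names and read names are made up for illustration):

```shell
# all.tsv      : readName<TAB>class exported from ALL nodes
# assigned.tsv : the same export with the "Not assigned" node excluded
# (mock data standing in for the real MEGAN exports)
printf 'read1\tIPR000001\nread2\tNot assigned\nread3\tIPR000002\n' > all.tsv
printf 'read1\tIPR000001\nread3\tIPR000002\n' > assigned.tsv

# take the read-name column of each export and sort it uniquely
cut -f1 all.tsv      | sort -u > all_names.txt
cut -f1 assigned.tsv | sort -u > assigned_names.txt

# names present only in the full export are the truly unassigned reads
comm -23 all_names.txt assigned_names.txt
```

comm -23 prints lines unique to the first file; a plain diff of the two sorted name lists, as described above, works just as well.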
The thing is that I’m also interested in a quantitative functional analysis, to study the relative abundance of certain genes in the bacterial communities and compare between treatments. Therefore, I need a reliable estimate of the abundance of the different genes in the different samples.
I’ll try as you suggested while I wait for this bug to be addressed.
Many thanks for your help. I’ll be looking forward to hearing from MEGAN community.