Binning for nanopore long-reads assembly before running DIAMOND + MEGAN?

h.v.tran · April 3, 2025, 5:38am

Hi all,
My name is Lenny, and I am a third-year PhD student. I am using the DIAMOND + MEGAN pipeline to analyse metagenomic nanopore long-read sequencing data from faecal samples (48 samples). After assembling contigs for each sample (barcode), I ran CheckM2 (a newer version of CheckM, designed for metagenomic assemblies) to assess the completeness of the assemblies. Most of my assemblies have > 98% completeness, but the contamination is also > 100% (which may be common for metagenomic samples). My question is: do I need to bin the assemblies with MetaBAT2 before submitting them to the DIAMOND + MEGAN pipeline? Should I expect any difference between binned and unbinned assemblies?

Thank you for your help.

Anupam · April 7, 2025, 8:47am

@h.v.tran

MEGAN can help you bin assembled contigs directly. You can align and meganize them using the standard DIAMOND + MEGAN pipeline, but make sure to use long-read mode in both DIAMOND and MEGAN, assuming your assemblies are from long-read sequencing.
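For reference, here is a minimal sketch of what the long-read settings mentioned above could look like, written as a small Python helper that only builds and prints the two command lines. The file names and the mapping database name are placeholders, and the function names are made up for illustration. The options used are DIAMOND's `--long-reads` and daa-meganizer's `--longReads`, but please check `--help` for your installed versions, since options can differ between releases.

```python
import shlex


def diamond_long_read_cmd(contigs_fasta, diamond_db, out_daa, threads=8):
    """Build a DIAMOND blastx command line in long-read mode with DAA output."""
    return [
        "diamond", "blastx",
        "--query", contigs_fasta,   # assembled contigs (placeholder path)
        "--db", diamond_db,         # DIAMOND-formatted protein database
        "--out", out_daa,
        "--outfmt", "100",          # format 100 = DAA, as read by MEGAN
        "--long-reads",             # long-read mode
        "--threads", str(threads),
    ]


def meganize_cmd(daa_file, mapping_db):
    """Build a daa-meganizer command line with long-read analysis enabled."""
    return [
        "daa-meganizer",
        "--in", daa_file,
        "--mapDB", mapping_db,      # MEGAN mapping database (placeholder name)
        "--longReads",
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected first.
    for cmd in (
        diamond_long_read_cmd("barcode01_contigs.fasta", "nr", "barcode01.daa"),
        meganize_cmd("barcode01.daa", "megan-map.db"),
    ):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them makes it easy to paste them into an HPC job script once the paths are adjusted.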

Once aligned, you can use the read-extractor to perform binning.

If you’re using MetaBAT2, then your next task might be to assign taxonomy to the resulting bins. You can do this with MEGAN’s approach or alternatively with GTDB-Tk, depending on your goals.

Also, MEGAN includes a contamination filter, which might be worth exploring to see if it suits your case.

Hope this helps!
Anupam

h.v.tran · April 8, 2025, 12:38am

Hi Anupam,

I have been asked whether, without binning beforehand, the DIAMOND + MEGAN pipeline is reliable for assigning taxonomy and functional pathways. I was not sure about the answer from the protocol and the paper, so I posted my question here. Thank you for your reply; it clarified a lot.

For more information, I did some testing: I binned the assemblies using MetaBAT2, then ran DIAMOND to generate DAA files from those bins. However, this took significantly longer than running DIAMOND on the original assemblies. My friends also once tried running DIAMOND on raw long reads (without assembling them), and that also took much longer to finish with the same computational resources (we run all the steps on an HPC cluster).

I also tested the contamination of the reads assigned to each node after running DIAMOND + MEGAN on the assemblies. For some reason, I could not get read-extractor to run successfully from the command line, so I extracted the reads from each node in the GUI version and then ran CheckM2 on those assigned reads. The results showed almost no contamination (< 1%). However, completeness was also considerably lower (ranging from 10% to 80%). I guess this is because the contigs assigned to a single node contain less total sequence than the whole assembly (which, in my case, had multiple contigs).
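To compare numbers like these across samples, a CheckM2 report can be summarised with a few lines of Python. This is only a sketch: it assumes the report is a tab-separated table with `Name`, `Completeness`, and `Contamination` columns, the example rows are made up, and the 90% / 5% thresholds are illustrative cut-offs rather than values from my analysis.

```python
import csv
import io


def summarise_quality(tsv_text, min_completeness=90.0, max_contamination=5.0):
    """Return (name, completeness, contamination, passes) for each report row.

    Assumes a tab-separated table with Name, Completeness and Contamination
    columns; the thresholds are illustrative defaults.
    """
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    summary = []
    for row in rows:
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        passes = (completeness >= min_completeness
                  and contamination <= max_contamination)
        summary.append((row["Name"], completeness, contamination, passes))
    return summary


if __name__ == "__main__":
    # Made-up example rows: a whole assembly and two per-node extractions.
    example = (
        "Name\tCompleteness\tContamination\n"
        "barcode01_assembly\t98.5\t120.3\n"
        "barcode01_node_A\t45.2\t0.4\n"
        "barcode01_node_B\t92.0\t0.8\n"
    )
    for name, comp, cont, ok in summarise_quality(example):
        print(f"{name}\t{comp:.1f}\t{cont:.1f}\t{'pass' if ok else 'fail'}")
```

For a real report, the file contents can be read with `open(path).read()` and passed to `summarise_quality`.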

Again, thank you for your prompt reply. It means a lot.
Best regards,
Lenny

Anupam · April 11, 2025, 7:46pm

Hi Lenny (@h.v.tran),

Running DIAMOND in long-read mode against the full nr database can be quite time-consuming. As an alternative, you might try using the nr90 database available on the MEGAN7 download page—we’re currently preparing a manuscript using that setup with MEGAN7.

Could you please share the command you used for the read-extractor, or the exact error message you encountered when running it from the command line?

Also, I wasn’t completely sure if you were asking a specific question in your previous comment or simply describing the approach you used. Apologies if I misunderstood—happy to help once I have a bit more context.

Best regards,
Anupam