Pooled contigs and raw reads

Hi Megan community,

I have pooled contigs assembled with MEGAHIT and the raw reads per sample. No taxonomy or functions have been annotated yet. Can I continue my analysis in MEGAN? Could you give me an idea of how to do that?

This is my current workflow:
I was told to map my reads to the contig file using bowtie2, then convert the output from .sam to .bam and index it (.bam.bai). For the pooled contigs, I was told to use DIAMOND to add function and taxonomy, obtain a .daa file, and meganize it. Where I get lost is how to import reads.bam, reads.bam.bai and contigs.megan into MEGAN. Is this even possible?

Thank you.

Run the contigs through the DIAMOND+MEGAN pipeline to assign taxonomy and function to your contigs. You can’t use your bam files in this process.

Hi,

Looking at previous posts and the MEGAN paper, I want to make sure I understood something correctly. I have paired-end reads. Do I first need to create contigs with a program like metaSPAdes before the DIAMOND alignment? Or can I run the DIAMOND alignment directly on the raw fastq.gz files I received from the sequencer?

Hi @jspychalla,

That is up to you. You can align your raw FASTQ files directly against the NCBI-nr database (QC is optional and depends on your dataset type) and then process the alignments using MEGAN. MEGAN will generate a count table, which you can explore within MEGAN itself or use for statistical analyses.

In the approach mentioned above, the authors assembled contigs from the FASTQ files and generated an abundance table by mapping the reads back to the contigs. You will get something like the table below:

Contig   Sample1  Sample2  Sample3
contig1  0.5      0.1      0.4
contig2  0.5      0.5      0.7
contig3  0.4      0.8      0.9

However, the contigs in such a table do not yet carry any taxonomic or functional information. To add it:

  1. Use the contig sequences from contig.fasta.
  2. Align the contigs against the NCBI-nr database.
  3. Process the alignment results using MEGAN for taxonomic and functional binning.

Using MEGAN’s daa2info tool, you can extract taxonomic and functional information for your contigs and merge it with your abundance table. This will allow you to explore the data in MEGAN or perform statistical analyses on it.
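The merge step could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the contig-to-taxon assignments are assumed to be available as a two-column, tab-separated text (contig name, taxon), the taxon names are invented placeholders, and the abundance values are copied from the example table above. The function names are hypothetical and not part of MEGAN.

```python
import csv
import io
from collections import defaultdict

# Assumed format: contig name <TAB> taxon. The taxon names are placeholders.
CONTIG_TO_TAXON = (
    "contig1\tTaxonA\n"
    "contig2\tTaxonB\n"
    "contig3\tTaxonA\n"
)

# Abundance table from read mapping; values taken from the example table.
ABUNDANCE = (
    "Contig\tSample1\tSample2\tSample3\n"
    "contig1\t0.5\t0.1\t0.4\n"
    "contig2\t0.5\t0.5\t0.7\n"
    "contig3\t0.4\t0.8\t0.9\n"
)


def read_assignments(handle):
    """Return a dict mapping contig name to taxon."""
    return {
        row[0]: row[1]
        for row in csv.reader(handle, delimiter="\t")
        if len(row) >= 2
    }


def sum_by_taxon(abundance_handle, contig_to_taxon, unassigned="Unassigned"):
    """Sum per-sample abundances over all contigs assigned to the same taxon."""
    reader = csv.reader(abundance_handle, delimiter="\t")
    samples = next(reader)[1:]  # header row minus the contig column
    totals = defaultdict(lambda: [0.0] * len(samples))
    for row in reader:
        if not row:
            continue
        # Contigs without an assignment are collected under `unassigned`.
        taxon = contig_to_taxon.get(row[0], unassigned)
        for i, value in enumerate(row[1:]):
            totals[taxon][i] += float(value)
    return samples, dict(totals)


assignments = read_assignments(io.StringIO(CONTIG_TO_TAXON))
samples, per_taxon = sum_by_taxon(io.StringIO(ABUNDANCE), assignments)
for taxon in sorted(per_taxon):
    print(taxon, [round(v, 3) for v in per_taxon[taxon]])
```

With real data, the two strings would be replaced by open file handles. The resulting per-taxon table can then be used for statistical analysis.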

Let me know if you need further clarification or assistance!

Best regards,
Anupam

Hi Megan Community,

Following previous discussions, I would like to ask a few additional questions that have been unclear to me for some time regarding read-based vs contig-based analysis using DIAMOND + MEGAN6.

I am currently analyzing PacBio long-read metagenomic data. In initial tests, we observed that read-based classification against the NCBI-nr database can take 2–3 days per sample, so we switched to a contig-based workflow to reduce memory & computation time. Our goal is to characterize both microbiome composition and eDNA signatures from environmental samples.

However, I have not been able to find a clear consensus in the literature regarding whether read-based or contig-based analysis provides more accurate or reliable taxonomic and functional profiles. I am concerned that read-based approaches may increase the number of low-confidence or spurious taxonomic assignments, whereas contig-based approaches may introduce biases related to assembly artifacts, including potential chimeric contigs and uneven representation of taxa.

Moreover, I was wondering how to interpret my contig-based results (output = contig counts per taxon / number of aligned bases)? I used DIAMOND+MEGAN6 as described for long-reads in the following publication together with daa2info -c2c Taxonomy: https://currentprotocols.onlinelibrary.wiley.com/doi/10.1002/cpz1.59

I would greatly appreciate any clarification or advice. Thank you very much for your help and time in advance.

Kind regards,

DanE


Dear Dane,

Since your data is already long-read, you can go either read-based or assembly-based. I usually recommend assembly-based, as it can help you recover complete chromosomes — we found this in one of our group’s papers. Recent long-read technology has also improved a lot, with nice quality scores for long reads.

For interpreting the assembly-based output, I would go with the total number of bases aligned rather than contig counts per taxon, since this accounts for contig length and gives a more quantitative view of taxon abundance, especially if you want to do further statistical analysis.
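To illustrate why the two measures can diverge, here is a small Python sketch. All records are made-up numbers: one long contig for a hypothetical TaxonA and three short contigs for a hypothetical TaxonB. It does not reproduce MEGAN's internal computation; it only contrasts counting contigs with summing aligned bases.

```python
from collections import Counter

# Hypothetical (contig, taxon, aligned_bases) records; values are invented.
RECORDS = [
    ("contig1", "TaxonA", 1_200_000),
    ("contig2", "TaxonB", 15_000),
    ("contig3", "TaxonB", 20_000),
    ("contig4", "TaxonB", 10_000),
]


def summarize(records):
    """Return (contigs per taxon, aligned bases per taxon)."""
    contig_counts = Counter()
    aligned_bases = Counter()
    for _contig, taxon, bases in records:
        contig_counts[taxon] += 1
        aligned_bases[taxon] += bases
    return contig_counts, aligned_bases


def fractions(counter):
    """Convert absolute totals into relative proportions."""
    total = sum(counter.values())
    return {key: value / total for key, value in counter.items()}


contig_counts, aligned_bases = summarize(RECORDS)
by_count = fractions(contig_counts)
by_bases = fractions(aligned_bases)
for taxon in sorted(contig_counts):
    print(f"{taxon}: {by_count[taxon]:.2%} of contigs, "
          f"{by_bases[taxon]:.2%} of aligned bases")
```

In this toy case TaxonB dominates by contig count, while TaxonA accounts for most of the aligned bases, which is the length effect described above.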

Also, please make sure to use the proper long-read flags in both DIAMOND and MEGAN, as the default short-read settings can give misleading results on PacBio data. One more thing to consider — I am not sure whether you need to do some polishing after assembling reads into contigs, but it might be worth looking into, depending on your assembler and data quality.

Let me know if you have further questions.

Best regards,
Anupam


Dear Anupam,

Thank you for your quick response and helpful comments.

I am happy to hear that our assembly-based approach was a good choice for our dataset. To provide a bit more context, our environmental samples have an average sequencing yield of 0.7-0.8 Gb per sample. We assembled the reads using metaFlye and subsequently polished the contigs with Racon. We then proceeded with DIAMOND and MEGAN6 using the recommended long-read settings (as described in the protocol mentioned in my previous post).

The only point that is still unclear to me is how to obtain the “total number of aligned bases per taxon”. After using daa-meganizer, we used the following two scripts:

daa2info -i ${sample} -o ${sample_ID}_tax.txt -c2c Taxonomy -n true -p true

daa2info -i ${sample} -o ${sample_ID}_taxids.txt -c2c Taxonomy -l -m -r

Based on the outputs we obtained, I assume that these results correspond to “contig counts per taxon”, rather than “total number of aligned bases per taxon”. Consequently, I was wondering if I need to do back-mapping of reads to contigs or other approaches for extracting “total number of bases aligned”. Are there other options/flags for extracting this information using daa2info or similar MEGAN6 tools?

Apart from this, could you share the link to the paper from your group that you mentioned?

Thanks again for your help in advance.

Kind regards,

DanE

Would it be possible for you to send your meganized DAA file? Your command should already give the total number of aligned bases per taxon, so it would help me check what’s going on.

Hi Megan Community,

Since my data needs to remain confidential (including meganized DAA files), I continued the discussion with Anupam in private. To summarize the outcome: the counts reported by daa2info correspond to aligned bases, provided the long-read workflow was followed correctly. In my case, this confirmed that the results were as expected.

I hope this is helpful for others, and thanks again to Anupam for the great help.

Kind regards,

DanE
