Taking contig coverage into account

CPPC · September 15, 2017, 1:53am

I would like to use DIAMOND and MEGAN to annotate assembled contigs derived from Illumina metagenome 2 x 250 bp reads. Since I want to compare organism % abundances, I need to consider coverage in the annotations. Is there a way to do this using DIAMOND and MEGAN?

Ania · September 15, 2017, 4:40pm

Dear CPPC,

Diamond is not a good tool to perform any long-sequence alignment. Rather than that, you can use LAST to do the frame-shift-aware alignment of your DNA scaffolds or contigs against the NCBI nr database. Than, you can use the blast2rma tool to make rma files and open them in MEGAN. The MEGAN long reads viewer supports number of aligned bp as a measurement of abundance so it seems to be exactly what you need,

Ania

Daniel · September 16, 2017, 9:42am

You can add a statement to the header line of a contig to indicate its “magnitude”, like this:

>contig666 magnitude=123
ACGTACT…

This would tell MEGAN that contig number 666 has magnitude 123.
Then, when importing your data you should set the “Read Assignment Mode” to “ReadMagnitude”.
MEGAN will weight each contig by its magnitude in all analyses.

As Ania pointed out, you might look into using LAST as an alternative to DIAMOND, as LAST may perform better on long reads.

CPPC · September 18, 2017, 11:55pm

Thanks Daniel and Ania for your suggestions. I will look into using LAST instead of DIAMOND. As for including the read magnitude in the contig header, do you know of a program that will output files with this format? I have several samples totaling millions of contigs, and to do this manually would be infeasible.

Daniel · September 20, 2017, 1:43pm

what format do you currently have that info in?

CPPC · September 21, 2017, 4:24am

I am using 2 different assemblers - IDBA-UD and SPAdes. SPAdes has the the coverage information in the fasta output of assembly. IDBA-UD does not, so I’ll use Bowtie2 to map the reads back to the assembly, so that should be in SAM format.

Daniel · September 21, 2017, 8:23am

Could you please show me a fastA header line generated by SPAdes that contains coverage information and I will look into modifying MEGAN’s parser so as to pickup that information.
I won’t be able to help you with parsing a SAM to obtain the coverage. I suggest that you write a script that extracts that info and then inserts that info into the fastA file.
I will also look into supporting an additional “read mapping file” that allows a user to map each read “name” (i.e. the first word in its fastA header line) to an “extended header line” that contains e.g. a magnitude=xxx statement. However, adding support for this is quite a bit or work and I it will not have much priority until it more users.