I am trying to find out whether it is possible to take a large metagenome assembly, derived from samples sequenced on a whole NovaSeq S4 lane, and process it as follows:
annotate it with diamond using the --long-reads option (which is the advantage of this method, as large assemblies can be difficult to annotate with other common tools),
meganize the resulting .daa file,
and then extract the annotations as a GFF file.
However, I cannot find any command-line tool to export annotations in GFF format. Is there one? A metagenome assembly of this size would produce such a huge .daa file that it would be impractical to download the file and work with it in the MEGAN GUI. I think it would be better if, for example, the daa2info tool were updated with this feature.
At present, only the interactive version of MEGAN allows export of GFF files. However, I have written a new tool, daa2gff3, and will upload a new release containing it.
I am not sure how well it will work, and it probably needs some additional engineering, but if it looks promising to you, I will look into fixing any issues that arise.
Another option is to set up megan-server. This is a Java program that runs on your server and serves your meganized DAA files to MEGAN over the web. It uses authentication, so access is restricted by user name and password.
If you try this and run into any issues, please let me know. I am happy to help.
Thanks for the help with this. Great that there will be a new tool! I have two questions below:
Is it necessary to run a gene prediction tool (such as prodigal) before running diamond with the --long-reads option?
I’ve tried exporting a GFF from a small .daa file (a subset of a metagenome assembly). One issue came up right away: the long text in column 9 was not compatible with featureCounts, which I use to estimate the number of mapped reads per feature. I solved this by removing everything after the first “;” in column 9, which left only the Id= tag with the NCBI accession number. I am not sure why the default GFF exported by MEGAN causes a problem, but this workaround fixed it.
I was then able to run featureCounts. Finally, I just had to merge the annotations in column 9 of the original GFF with the NCBI accession numbers and counts in the output table from featureCounts.
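For anyone who wants to reproduce the column 9 workaround described above, here is a minimal Python sketch. It assumes a tab-separated GFF with nine columns and attributes separated by ";". The function name and the example line (contig name, accession, attribute values) are made up for illustration and are not taken from real MEGAN output.

```python
def trim_gff_attributes(line: str) -> str:
    """Keep only the first attribute (e.g. Id=...) in column 9 of a GFF line."""
    # Comment/header lines and blank lines are passed through unchanged.
    if line.startswith("#") or not line.strip():
        return line
    fields = line.rstrip("\n").split("\t")
    # Lines without nine tab-separated columns are left alone.
    if len(fields) < 9:
        return line
    # Drop everything after the first ";" in the attribute column.
    fields[8] = fields[8].split(";", 1)[0]
    return "\t".join(fields) + "\n"


# Hypothetical example line, for illustration only.
example = (
    "contig_1\tMEGAN\tCDS\t10\t300\t.\t+\t0\t"
    "Id=WP_000001;Acc=WP_000001;Name=hypothetical protein\n"
)
print(trim_gff_attributes(example), end="")
```

To process a whole file, the function can be applied to each line of the exported GFF and the results written to a new file, keeping the original GFF for the later merge step.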
This is where I ran into a problem. The GFF generated from the diamond file does not contain a unique Id= identifier for each CDS. Instead, the same NCBI accession number occurs on several rows in the GFF file. It seems that featureCounts either ignores the repeated instances or sums them somehow (I am not sure which). Tools such as PROKKA assign an ID that is unique per CDS, typically something like ID=PROKKA_01, ID=PROKKA_02 and so on, increasing by one with each new row (CDS) in the GFF file. Would it be possible to do something similar with the GFF from the diamond annotations? Given that column 9 contains both Id= and Acc=, and both hold the NCBI accession number, couldn’t Id= instead be ID=MEGAN_01, ID=MEGAN_02 and so on?
This would make it much easier to merge the mapped-read counts from featureCounts with the annotations in the GFF file based on the diamond annotations.
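Until something like this exists in the exporter, the renumbering could be done as a post-processing step. The sketch below is only an illustration of the idea, not how MEGAN or daa2gff3 works: it replaces any existing Id=/ID= attribute with a running ID=MEGAN_000001 style identifier and keeps the remaining attributes (such as Acc=) so the accession is still available for merging. The function name, the prefix, the zero padding and the example lines are all my own assumptions.

```python
def assign_unique_ids(lines, prefix="MEGAN"):
    """Return GFF lines in which every feature row has a unique ID= attribute."""
    counter = 0
    out = []
    for line in lines:
        # Pass comments, blank lines and malformed rows through unchanged.
        if line.startswith("#") or not line.strip():
            out.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            out.append(line)
            continue
        counter += 1
        new_id = f"ID={prefix}_{counter:06d}"
        # Remove any existing Id=/ID= attribute, keep everything else (e.g. Acc=).
        kept = [
            a for a in fields[8].split(";")
            if a.strip() and not a.strip().lower().startswith("id=")
        ]
        fields[8] = ";".join([new_id] + kept)
        out.append("\t".join(fields) + "\n")
    return out


# Hypothetical rows: two CDS features that share the same accession.
rows = [
    "contig_1\tMEGAN\tCDS\t10\t300\t.\t+\t0\tId=WP_000001;Acc=WP_000001\n",
    "contig_2\tMEGAN\tCDS\t50\t500\t.\t-\t0\tId=WP_000001;Acc=WP_000001\n",
]
for row in assign_unique_ids(rows):
    print(row, end="")
```

With unique IDs in place, featureCounts could presumably be run with -t CDS -g ID so that each row is counted separately, and the counts table could then be joined back to the GFF on the new ID rather than on the repeated accession.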
EDIT January 24: clarified the issue with the GFF file.