Gene Centric Assembly

I meganized a DAA file and then ran the gene-centric assembly. I then aligned the resulting contigs back to the reference database that was used to create the DAA file, to check how well they match. I found that the contigs from the gene-centric assembly do not match the reference genes at 100 percent; some match at only 30 percent. How can I get a gene assembly if the contigs do not match the reference database at 100 percent?


The reads might align against a particular reference even though they are not identical to the underlying reference sequence. Gene-centric assembly aims at producing variants of gene sequences from reads aligned to a certain protein. The result therefore depends on the quality of the read alignments, which you can inspect using the Alignment viewer (right-click on a node > Show Alignment).
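To make the point about non-identical alignments concrete, here is a minimal Python sketch that computes percent identity over the columns of a pairwise alignment. The function name percent_identity and the example peptide strings are made up for illustration; they are not part of MEGAN or DIAMOND, whose alignment scoring is more involved.

```python
def percent_identity(aligned_query, aligned_ref, gap="-"):
    """Percentage of alignment columns in which both sequences carry the same residue.

    Both inputs are rows of one pairwise alignment, so they must have equal length.
    Columns containing a gap never count as a match.
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        return 0.0
    matches = sum(
        1
        for q, r in zip(aligned_query, aligned_ref)
        if q == r and q != gap
    )
    return 100.0 * matches / len(aligned_query)


if __name__ == "__main__":
    # Hypothetical reference peptide compared with itself and with a diverged variant.
    print(percent_identity("MKTAYIAKQR", "MKTAYIAKQR"))  # 100.0
    print(percent_identity("MKTAYIAKQR", "MRSGFLAEHR"))  # 30.0
```

A read like the second one can still be reported as a hit to the reference, so a contig assembled from such reads will look like a distant variant of the gene rather than an exact copy.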


Thank you very much for solving my query.

It would be good to see a concrete example of what you mean.

I sequenced a sample from a river and aligned it against the CARD (Comprehensive Antibiotic Resistance Database) database using DIAMOND, then meganized the DIAMOND (DAA) output to run the gene-centric assembly. Afterwards I aligned the reads in the assembly file against the CARD database again, to check whether they match the genes in CARD perfectly. The result showed that many of the reads in the assembly file match at only 20-30 percent, yet they still ended up in the assembly. So I was wondering why the reads in the assembly file are not identical to the genes in the CARD database.

See Ania’s reply above.

I would like to ask a favor once again. This time I am using MEGAN 6.11 on Linux (gene-centric assembly failed on Windows with the error “Execute failed:java.IO.ioeXCEPTION”). I want to run the gene-centric assembly on the DIAMOND output (DAA) file aligned against the CARD database. I have no idea where to start, as I cannot work out from the manual how to run only the gene-centric assembly on Linux. Can you please help, either with the Windows error message or with running the gene-centric assembly on Linux?
Thanks in advance.

MEGAN on Linux operates the same as MEGAN on Windows.
In addition, there is a command-line program contained in the megan/tools directory called gc-assembler.
Run the program without any command line options to get a help message.
For example, if you installed MEGAN in ~/megan (here ~ represents your home directory, as usual), then type


to see the help message.

Thanks a lot, at least now I am able to use MEGAN on Linux. After meganizing the DAA file, I ran the gc-assembler using the command “./gc-assembler -i try.daa -o out -id ALL -fun none”, but I keep getting these warnings:
(Problem parsing SAM format, alignment may be incorrect: buffer.length=48, gappedQuery.length=49)
Warning: illegal char in diff string: ‘*’ (code: 42)
Can you please help me with this? It would solve a big problem.
Thank you.

Did you meganize the file using a very recent version of MEGAN? Older versions have problems. If you did meganize with a very recent version, then please make the file available to me and I will take a look at it.

Yes, I downloaded the most recent version today and tried making contigs, but the error messages persist. I do get contigs if I provide a DIAMOND file containing only a small subset of the hits (say 35,000 raw reads), but if I provide the full DIAMOND output I cannot retrieve any contigs. I have uploaded a sample file that I am using for the assembly. Please have a look at it: try.daa (2.3 MB)

Thank you for providing the file. Two things:

  1. I have fixed the

Warning: illegal char in diff string: ‘*’ (code: 42)

  2. For “gene-centric” alignment to make sense, the reads must be classified by genes. Your file only contains a taxonomic classification, but not eggNOG, SEED, etc.

You are interested in CARD and we do have an experimental CARD module that can be used with MEGAN. If you would like, then I can make this available to you for testing, but it still requires some work.

Thank you for the help so far. This time I am using the RefSeq plasmid database to perform the gene-centric assembly and, as you suggested, I meganized the DAA file using SEED. I now get contigs, but they are spread over many FASTA files. Is there any way to get a single combined contigs FASTA file, given that the numbering in every contig FASTA file starts from 1?
Thank you once again.
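One possible way to combine the per-gene files outside of MEGAN is a small script that concatenates them and prefixes every header with the name of the file it came from, so that contig IDs stay unique even though each file starts counting at 1. This is only a sketch under assumptions: the function name merge_fasta, the "|" separator and the file names in the usage line are hypothetical, and it does not rely on any MEGAN option.

```python
import sys
from pathlib import Path


def merge_fasta(fasta_paths, merged_path):
    """Concatenate FASTA files into one, prefixing each header with the source file stem.

    Returns the number of sequences written.
    """
    count = 0
    with open(merged_path, "w") as out:
        for path in map(Path, fasta_paths):
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        # ">Contig-1" from geneA.fasta becomes ">geneA|Contig-1"
                        out.write(f">{path.stem}|{line[1:].rstrip()}\n")
                        count += 1
                    elif line.strip():
                        # Copy sequence lines unchanged, skipping blank lines
                        out.write(line.rstrip() + "\n")
    return count


if __name__ == "__main__":
    # Usage: python merge_fasta.py merged.fasta contig_file_1.fasta contig_file_2.fasta ...
    if len(sys.argv) > 2:
        written = merge_fasta(sys.argv[2:], sys.argv[1])
        print(f"wrote {written} contigs to {sys.argv[1]}")
```

Write the merged file to a different directory than the input files, or list the inputs explicitly, so that a rerun does not pick up the merged file as an input.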