Can blastx be a problem? Best strategy for classifying assembled contigs from metagenomes with MEGAN

jov14 · June 3, 2016, 7:25pm

Sorry for the long question, but this has been bugging me for a while now:

I want to use MEGAN to analyze contigs of pre-assembled (and often already binned) metagenome datasets (and possibly also remove the most likely residual contaminants of each bin before reassembly), instead of unassembled reads.

I would prefer to base the classifications on protein-alignments against the nr-database (using either diamond blastx of the contig sequences or diamond blastp of the respective prodigal predictions), because I believe this to be more sensitive for unkown species with few related references in the current databases (protein sequences of distantly related organisms are mostly more similar to each other than the corresponding gene sequences).

HOWEVER; if I use blastx, the BLASTS will result in mutliple HSPs per Contig (hopefully at least one per ORF present on that contig).
Since MEGAN will use only the best X% of the blast hits for classification, will MEGAN use ALL HSPs of each contig for it’s classifications, or will these be predominantly based only on the HSP with the highest scoring best hit (in most cases simply the largest gene product encoded on that contig)?

I ask this, because an HSP with an high scoring best hit may still have quite unsignificant fifth- and tenth-best hits. In those cases it would be preferable to use the best scoring hits of the next HSPs instead of focusing on all hits of just the HSP with the best scoring hit.

Would it, perhaps, make more sense to classify the LCA of each protein sequence predicted on each contig separately (MEGAN analyses of blastp results of prodigal predictions)? In that case I would have to subsequently analyze the LCA classifications of all proteins for each contig and determine the “LCA of LCAs” for that contig (based on all of it’s proteins)

If you would agree to this second option: Do you think i could simplify that process by renaming the query of each blastp-hit to the name of the contig that protein was identified from? And then resorting the blast output to equally represent the best hits of each protein sequence for each contig with descending BLAST-score? (–>thereby “tricking” MEGAN to use only the best hits of every single protein encoded on the contig for it’s classification)?

Thanks for any thoughts and suggestions

Rewati27 · June 7, 2016, 6:04pm

Dear jov14,

If you want to carry out the taxonomic classification of long contigs, nucleotide alignments against a genome DB (BLASTN) would be a good idea.

However, your approach of BLASTX may also lead to interesting findings. In this case your idea of considering LCA of LCA’s is a good one.

Also you may want to try both BLASTN and BLASTX on a subset of the contigs to check what output you get.

Daniel · June 15, 2016, 9:45pm

We are currently looking at the perform of MEGAN’s LCA on Nanopore data… And it is very obvious that the fact that MEGAN doesn’t take multiple alignments at different locations into account is indeed a problem, as you suggest.
I have put this high up on my list of things-to-do…
D

jov14 · July 14, 2016, 1:15pm

Thanks, and sorry for the late reply.
So do you think it would be a helpful workaround to sort blast-tables by query-name and score before importing them to MEGAN6?

Otherwise, It seems to me that the blast results are sorted by query-name, location and score meaning that for each alignment-location all hits are listed from best to worst score. So, if the LCA parameters are set to use the first x% of those hits, all hits (including relatively low scoring ones) from the first location are used before the best hits of the second are even considered.

I hope that presorting the results would prevent low scoring (secondary) hits from the first (best-hit) location from being preferred over high scoring hits from a second location on the same query fragment. By sorting the results by query-name and score (and excluding location) would at least mix the best hits of different locations that have similarly high scores (and would give less priority to low scoring secondary hits).

@Rewati27
my experience is, that by using nucleotide alignments I have a much lower chance of recognizing phyla which are currently underrepresented in the sequencing databases (e.g. Planctomycetes). So my opinion, at least, is that for samples where such organisms are to be expected, or if the community is relatively complex, BLASTX and BLASTP are the methods of choice. However in those cases I generally do not trust the classifications up to the species level but only up to the familiy level.

Daniel · July 15, 2016, 12:07pm

MEGAN does not use the first x% of hits, but rather it uses those hits whose bit-score is within x% of the best bit score seen for the given read. So, if the the best alignment scores 100 bits and the second scores 85 bits, then it will only use the first alignment if the topPercent parameter is set to 10 (10 %).

jov14 · July 15, 2016, 1:17pm

OK thanks, thats good to know. I was afraid MEGAN might be relying on the order of the BLAST-output to determine the top x% hits. Good to know that it doesn’t. I guess that makes my “workaround” of presorting the results unnecessary as MEGAN already takes care of that.
Thanks