Blast2lca tool - meaning of the options, algorithm and output

Hello,

I would like to make sure that I understand how the blast2lca tool works.

Here are my questions:

  • It uses the naive LCA algorithm, right?

  • The parameters:
    --minScore, maxExpected, minPercentIdentity: do I understand correctly that these parameters define which hits are considered for classification? For example, with the default values, only hits with score > 50, e-value < 0.01 and percent identity > 0% would be considered.

    --topPercent: does this also limit the number of hits that are considered (i.e. if we have 10 hits, the read is automatically assigned to the best one), or is it something else?

  • In the output, what are the numbers following the taxon levels? I suppose they are some kind of confidence value, but I'm not sure, and I would like to know exactly what they mean.

Thanks a lot for creating all these cool tools and thanks in advance for the answers,

A

I have a similar question, especially with regard to the output numbers. I first thought that the numbers represent the score, but they don't seem to change if I set --minScore to a higher value. Is this documented somewhere?

Let me try to rephrase my question to make it a bit clearer. I run blast2lca with a minimum score of 100:

$ ./blast2lca -i input.table -f BlastTab -m BlastP -o output.tsv \
  -a2t data/prot_acc2tax-Jun2018X1.abin \
  --minScore 100

However, when I inspect the output:

$ cat output.tsv | cut -f 3,4 -d ";" | sort | uniq -c
     26 
    650 d__Bacteria; 100
      1 d__Bacteria; 55
      1 d__Bacteria; 57
      1 d__Bacteria; 75
      2 d__Bacteria; 83
      2 d__Bacteria; 85
      1 d__Bacteria; 88
      2 d__Bacteria; 89
      1 d__Bacteria; 90
      1 d__Bacteria; 91
      1 d__Bacteria; 92
     10 d__Bacteria; 95
      1 d__Bacteria; 96
      2 d__Bacteria; 98
      4 d__Eukaryota; 100
      1 d__Eukaryota; 74

I expected the values after the taxonomic ranks to be the score values, but values below 100 are still present even though I set the minimum score to 100 with -ms.

Can anyone explain to me what these values represent? Or how I can get access to the scores (via command line)?

Thank you very much for your help

I have a new idea about the values in the table:

I came across this paper: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1670-4, or more precisely Figure 1 of that paper.

Do the numbers refer to bootstrap values, as in the tree at the bottom of that figure?

I would really appreciate any comments on that. Thank you very much for your help and for creating those tools, they are really helpful.

For a given read R and a given taxon T,
the number reported after T is the percentage of all alignments for R
that are to a taxon S that is either T or a descendant of T (that is, T lies on the path from S up to the root).

For example, look at this output:

read666; ;d__Bacteria; 100;p__Proteobacteria; 100;c__Gammaproteobacteria; 100;o__Enterobacterales; 100;f__Enterobacteriaceae; 100;g__Escherichia; 50;s__Escherichia coli; 50;

While 100% of the alignments for read666 are to the family of Enterobacteriaceae, only 50 percent are to the genus of Escherichia.
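
If you want to work with these numbers from a script, here is a minimal Python sketch that splits one such output line into (taxon, percentage) pairs. It is based only on the single line shown above: the function name `parse_blast2lca_line` is my own, and I am assuming that the second field (empty in this example) can be ignored and that the remaining fields always alternate between a taxon and its percentage.

```python
def parse_blast2lca_line(line):
    """Split one output line into (read_id, [(taxon, percent), ...]).

    Assumption: fields are separated by ';', the first field is the read
    name, the second field is ignored, and the rest alternate taxon/percent.
    """
    fields = [f.strip() for f in line.strip().split(";")]
    read_id = fields[0]
    rest = fields[2:]
    # The line ends with ';', which leaves an empty trailing field.
    while rest and rest[-1] == "":
        rest.pop()
    # Pair each taxon label with the number that follows it.
    pairs = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest) - 1, 2)]
    return read_id, pairs


if __name__ == "__main__":
    example = ("read666; ;d__Bacteria; 100;p__Proteobacteria; 100;"
               "c__Gammaproteobacteria; 100;o__Enterobacterales; 100;"
               "f__Enterobacteriaceae; 100;g__Escherichia; 50;"
               "s__Escherichia coli; 50;")
    read_id, pairs = parse_blast2lca_line(example)
    for taxon, percent in pairs:
        print(read_id, taxon, percent)
```

A read without any assignment (an empty taxonomy part) simply yields an empty list of pairs.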

To clarify, here is a summary of the alignments for read666:

  • Escherichia coli CFT073; score=71.0
  • Escherichia coli; score=71.0
  • Shigella flexneri 2a; score=71.0
  • Shigella sonnei Ss046; score=71.0
  • Shigella boydii Sb227; score=71.0
  • Escherichia coli; score=69.0

Note that all six listed organisms belong to Enterobacteriaceae, but only half of them (three of six) belong to Escherichia coli.
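
The arithmetic for read666 can be reproduced with a short Python sketch. This is only an illustration of the counting described above, not the tool's actual code: the lineage table is written by hand for this example, the names `LINEAGE`, `ALIGNMENTS` and `rank_percentages` are hypothetical, and each alignment is simply counted once regardless of its score.

```python
from collections import Counter

# Ranks shared by all six organisms in the example (hand-written).
COMMON = ["d__Bacteria", "p__Proteobacteria", "c__Gammaproteobacteria",
          "o__Enterobacterales", "f__Enterobacteriaceae"]

# Hypothetical lineage table: organism -> taxa from domain down to species.
LINEAGE = {
    "Escherichia coli CFT073": COMMON + ["g__Escherichia", "s__Escherichia coli"],
    "Escherichia coli": COMMON + ["g__Escherichia", "s__Escherichia coli"],
    "Shigella flexneri 2a": COMMON + ["g__Shigella", "s__Shigella flexneri"],
    "Shigella sonnei Ss046": COMMON + ["g__Shigella", "s__Shigella sonnei"],
    "Shigella boydii Sb227": COMMON + ["g__Shigella", "s__Shigella boydii"],
}

# The six alignments listed for read666, one entry per alignment.
ALIGNMENTS = [
    "Escherichia coli CFT073",
    "Escherichia coli",
    "Shigella flexneri 2a",
    "Shigella sonnei Ss046",
    "Shigella boydii Sb227",
    "Escherichia coli",
]


def rank_percentages(alignments, lineage):
    """For each taxon T, return the percentage of alignments at or below T."""
    counts = Counter()
    for organism in alignments:
        # An alignment counts once for every taxon on its path to the root.
        for taxon in lineage[organism]:
            counts[taxon] += 1
    total = len(alignments)
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


if __name__ == "__main__":
    percentages = rank_percentages(ALIGNMENTS, LINEAGE)
    print(percentages["f__Enterobacteriaceae"])  # 100.0
    print(percentages["g__Escherichia"])         # 50.0
    print(percentages["s__Escherichia coli"])    # 50.0
```

The three printed values match the numbers after f__Enterobacteriaceae, g__Escherichia and s__Escherichia coli in the output line for read666.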