Export taxonomic counts in Kraken or MPA report format?

dportik · March 9, 2021, 7:16pm

Hello,
I am wondering if it would be possible to add a feature that would allow taxonomic annotations/counts to be exported in the Kraken report (‘kreport’) format, or MetaPhlAn3 mpa format? The full description of the Kraken format can be found here, and the mpa format is described here (abundance output).

The main advantage of the Kraken format is that it provides the cumulative read counts at each taxonomic rank. For example, if there are 6 species/strains of Bacteroides, the count for the genus Bacteroides would be the sum of these individual species counts. All of the genera in a family would contribute to a summed count for that family, and so on. However, you can also see the number of reads assigned directly to each rank. This makes it very easy to look at different ranks quickly. The same is more or less true of the mpa format, though it only reports abundances and does not provide read counts.
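To make the cumulative counting concrete, here is a minimal Python sketch. The taxa, parent links, and read counts are made-up toy data, and the function names are hypothetical. The rows only imitate the count columns of a Kraken report: the NCBI taxid column and the name indentation are left out.

```python
from collections import defaultdict

# Hypothetical toy data: taxon -> (parent, rank code, reads assigned directly)
TAXA = {
    "Bacteria": (None, "D", 5),
    "Bacteroidaceae": ("Bacteria", "F", 0),
    "Bacteroides": ("Bacteroidaceae", "G", 10),
    "Bacteroides fragilis": ("Bacteroides", "S", 40),
    "Bacteroides ovatus": ("Bacteroides", "S", 25),
}


def cumulative_counts(taxa):
    """Return taxon -> direct reads plus the direct reads of all descendants."""
    totals = defaultdict(int)
    for name, (_parent, _rank, direct) in taxa.items():
        # Add this taxon's direct reads to itself and to every ancestor.
        node = name
        while node is not None:
            totals[node] += direct
            node = taxa[node][0]
    return dict(totals)


def kreport_like_rows(taxa):
    """Build rows of (percent of reads, cumulative reads, direct reads, rank, name)."""
    totals = cumulative_counts(taxa)
    all_reads = sum(direct for _parent, _rank, direct in taxa.values())
    rows = []
    for name, (_parent, rank, direct) in taxa.items():
        percent = round(100.0 * totals[name] / all_reads, 2)
        rows.append((percent, totals[name], direct, rank, name))
    return rows


if __name__ == "__main__":
    for row in kreport_like_rows(TAXA):
        print("\t".join(str(field) for field in row))
```

With this toy data, the genus Bacteroides ends up with 75 cumulative reads: 10 assigned directly plus 40 and 25 from its two species.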

I am attaching an example of the Kraken and mpa formats here, in case that helps in deciding the feasibility of this feature: STD-h500-k20.kraken.report.txt (19.0 KB); STD-profiled_metagenome.txt (29.9 KB)

This should work well for read counts, but I think it could be extended to base pairs as well. It would be ideal to have either of these format options available from the command line. Currently, I have been using rma2info to get read counts: rma2info -i {input} -o {output} -r2c Taxonomy -n --bacteriaOnly. If this could include the Kraken report and/or mpa format, that would be exceptionally useful!

Thanks,
Dan

dportik · March 25, 2021, 11:43pm

In case anyone is also interested in converting to these file formats, I wrote a script to perform this task.

The script converts an NCBI class2count (c2c) file into Kraken report (kreport) and MetaPhlAn (mpa) formats. To obtain this c2c file, you can run MEGAN’s rma2info program on a MEGAN read-count RMA file:

rma2info -i input.RMA -o NCBI.c2c.txt -c2c Taxonomy -n -r

The script is called Convert_MEGAN-NCBI-c2c_to_kreport-mpa.py and can be found in the pb-metagenomics-scripts folder of the PacBio metagenomics repo. Details for using the script are provided in the documentation here. It requires Python 3.7 and the Python packages ete3 and pandas. I hope that it proves useful, especially in allowing comparisons to other methods.
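For anyone curious what an mpa-style line looks like, here is a rough Python sketch. It is not the code from the script above. The rank-to-prefix mapping, the function names, and the example lineage and counts are assumptions for illustration, and real MetaPhlAn output has more columns than the clade name and relative abundance shown here.

```python
# Assumed mapping from rank names to the one-letter mpa prefixes.
# NCBI calls the top rank "superkingdom", so it is mapped to "k" here as well.
RANK_PREFIX = {
    "superkingdom": "k",
    "kingdom": "k",
    "phylum": "p",
    "class": "c",
    "order": "o",
    "family": "f",
    "genus": "g",
    "species": "s",
}


def mpa_clade_name(lineage):
    """Join (rank, name) pairs, highest rank first, into an mpa-style clade name."""
    parts = []
    for rank, name in lineage:
        if rank in RANK_PREFIX:  # skip ranks without an mpa prefix
            parts.append(f"{RANK_PREFIX[rank]}__{name.replace(' ', '_')}")
    return "|".join(parts)


def mpa_like_row(lineage, clade_reads, total_reads):
    """Return (clade name, relative abundance in percent) for one clade."""
    abundance = round(100.0 * clade_reads / total_reads, 5)
    return mpa_clade_name(lineage), abundance


if __name__ == "__main__":
    # Hypothetical lineage and read counts.
    lineage = [
        ("superkingdom", "Bacteria"),
        ("phylum", "Bacteroidetes"),
        ("genus", "Bacteroides"),
        ("species", "Bacteroides fragilis"),
    ]
    name, abundance = mpa_like_row(lineage, clade_reads=40, total_reads=80)
    print(f"{name}\t{abundance}")
```

Here the abundance is simply the clade's cumulative reads divided by the total number of reads, which is one simple way to turn counts into percentages.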

Please also note that complete pipelines for aligning HiFi reads, performing annotation with MEGAN, and producing output files (including kreport and mpa files) are also available on the PacBio metagenomics GitHub: Taxonomic-Functional-Profiling-Protein, Taxonomic-Profiling-Nucleotide, and MEGAN-RMA-summary.