Taxonomy2function suggestions

Hi Daniel,

Thank you for this very useful feature. I’ve a few suggestions to further improve this new tool.

  1. Include the full taxonomic path in column one (i.e. Phylum/Class/Order/Family/Genus/Species). This would make it much easier to identify the phylum, family, etc. of the taxonomy “name” in column one.

  2. Same as above, but for column 2, e.g. KEGG Categories/Pathways/KOs.

  3. Make it possible to input multiple samples per analysis. This is a bit more complicated: column 3 and any additional columns would hold the read counts for each sample, so a taxonomy-function hit shared by several samples would appear once, with one count per sample. Let’s say we have 50 samples in column 3 and onwards. Columns 1 and 2 would still be taxonomy and function, built by merging identical labels across all samples. However, one needs to remember that poorly classified taxonomy (such as rows labelled “Bacteria”-“KEGG K02021”) would still be merged, even though “Bacteria” could derive from many different bacterial species. But at least the information would be more accurate at the genus and species level, where it is of most interest. I don’t know if something like this would be possible, but it could be worth thinking about.
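
To illustrate the layout proposed in suggestion 3, here is a minimal Python sketch. It is not MEGAN code: it assumes each sample has already been reduced to a mapping from (taxonomy, function) to read count, and the function name `merge_samples` and the sample data are made up for illustration.

```python
from collections import defaultdict


def merge_samples(samples):
    """Merge per-sample counts into one row per (taxonomy, function) pair.

    `samples` maps a sample name to a dict of {(taxonomy, function): count}.
    Returns the sample names (column order) and a dict that maps each
    (taxonomy, function) pair to a list with one count per sample.
    """
    names = list(samples)
    table = defaultdict(lambda: [0] * len(names))
    for column, name in enumerate(names):
        for key, count in samples[name].items():
            # Identical labels end up in the same row, whatever the sample.
            table[key][column] += count
    return names, dict(table)


# Hypothetical example data, not taken from a real analysis.
samples = {
    "sample1": {("Bacteria", "K02021"): 12, ("Escherichia coli", "K02021"): 3},
    "sample2": {("Bacteria", "K02021"): 7},
}
names, table = merge_samples(samples)
for (taxon, function), counts in sorted(table.items()):
    print(taxon, function, *counts, sep="\t")
```

The “Bacteria” row shows the caveat mentioned above: reads that could only be assigned to “Bacteria” share one row in every sample, even if they come from different species.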

Thanks for the suggestions.

All three should be possible. I will implement them (soon).

I have implemented all your suggestions and am now uploading a new release, V6_21_16. Please take a look to see whether the options work as intended.

When parsing multiple files, I currently do this sequentially. I will look into doing this in parallel in the near future, when I have time.

I tested the new version using two of my environmental metagenome samples. The samples are two meganized .daa files, each ca. 70 GB in size.

Columns 1 and 2 can now be set to show full paths, and a delimiter can be specified. This works.

However, the analysis of multiple samples is not very usable because the samples are processed sequentially. It took about two days to analyze one of my samples (with 50 GB RAM available), so running a set of 50 samples this way is not practical.

I think one taxonomy2function job could be run per sample, and the user could then set up a for loop to submit all samples in parallel on a high-performance computing cluster. A final command (still using the taxonomy2function tool) could then combine all the output files. That step could also let the user specify a read count normalization method.
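
As a rough sketch of what such a combine step could do, here is a small Python example. It does not use any existing taxonomy2function option. It assumes each per-sample output is a tab-separated file with three columns (taxonomy, function, read count) and no header line, and it uses relative abundance (each count divided by the sample total) as one possible normalization. The names `read_counts` and `combine` and the file contents are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def read_counts(path, delimiter="\t"):
    """Read one per-sample file with rows: taxonomy, function, read count."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if len(row) < 3:
                continue  # skip blank or malformed rows
            key = (row[0], row[1])
            counts[key] = counts.get(key, 0.0) + float(row[2])
    return counts


def combine(paths, normalize=False):
    """Return sample names and {(taxonomy, function): [value per sample]}."""
    paths = [Path(p) for p in paths]
    per_sample = [read_counts(p) for p in paths]
    keys = sorted(set().union(*per_sample))
    columns = []
    for counts in per_sample:
        # Assumed normalization: divide by the sample's total read count.
        total = sum(counts.values()) if normalize else 1.0
        total = total or 1.0  # avoid dividing by zero for empty samples
        columns.append({key: value / total for key, value in counts.items()})
    table = {key: [col.get(key, 0.0) for col in columns] for key in keys}
    return [p.stem for p in paths], table


# Hypothetical per-sample outputs written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "sampleA.txt"
    b = Path(tmp) / "sampleB.txt"
    a.write_text("Escherichia\tK02021\t30\nBacteria\tK02021\t10\n")
    b.write_text("Bacteria\tK02021\t5\n")
    names, table = combine([a, b], normalize=True)

for (taxon, function), values in table.items():
    print(taxon, function, *values, sep="\t")
```

Because each per-sample file is read independently, the expensive per-sample jobs can run in parallel on the cluster and only this cheap merge has to run at the end.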