Megan LR with nucleotide database

Hello,

In your paper you are using MEGAN-LR (with LAST aligner) by mapping to NR protein database. This is working fine for me. However, I would like to try it out with a nucleotide database that I have taken from kraken tool, so that I can compare MEGAN to it.

On my first try, I didn’t get any reads aligned, however, by reading through the forums I found out that the headers in my database are not formatted as they should be for MEGAN. For example, one of the headers is:

kraken:taxid|272844|NC_000868.1 Pyrococcus abyssi GE5, complete genome

I have two questions.

  1. Will MEGAN-LR work with nucleotide database?
  2. Will it be enough to remove “kraken:taxid|” from each header for MEGAN to classify reads correctly?

Best regards,
Krešimir Križanović

could you please send me a small (but complete) kraken output file and I will add an importer for the such files to MEGAN.

Hello,

I’m not trying to import kraken output, just use the same sequence database for aligning reads so that I can compare results from Kraken and MEGAN. Kraken simply downloads sequences from RefSeq database. The files are in FASTA format and have headers in the format that I wrote in the last post. That header seems to contain taxid but it also has some other text before that. I will reformat the headers and see if MEGAN works, but do you see any other problem with running MEGAN-LR with a nucleotide database?

Yes, MEGAN is happy to process nucleotide alignments.

In your case, if you remove

kraken:

and leave

taxid|xxxxx

then MEGAN can perform taxonomic assignment if you select “Extended Mode” and “Id Parsing” on the Taxonomy tab of the Import Blast dialog.

However, the following simpler solution should also work:
When selecting Id Parsing, you can specify the tags used by MEGAN to identify Ids. For taxonomy, the default tags are tax| taxonomy|.
If you specify kraken:taxid| as a tag, then you will not have to edit the lines in the input files.

Thanks a lot,

Since you have been so helpful I have two more questions:

  1. Can I do that from command line? I’m guessing that I can, but how can I set the parameters?

The file with commands that I currently use to run from command line is:

open file=’/mnt/5TB/megan/megan-results/BacArDB/Mock_100000-bac_2bacAr.daa’;
update;
select nodes=all;
export what=CSV format=taxonId_to_percent separator=comma counts=assigned file=’/mnt/5TB/megan/megan-results/Mock_100000-bac_2bacAr-megan.csv’;
quit;

Before this script I mapped reads with LAST to produce maf file and converted that to daa using maf2daa script. Then I also ran daa-meganizer with the option -lg for long reads.

  1. Can MEGAN output relative abundancies of organisms in the sample or just the read counts?

I really appreciate your help. To give you a little more information, we are currently testing several metagenomic classifiers on long read data and MEGAN is one of them.

Sincerely,
Krešimir Križanović

Dear

  1. This is how you turn on idParsing and set the tag prefix to kraken:taxid| :

set idParsing=true cName=Taxonomy prefix=‘kraken:taxid|’;

The run “import from blast” from the command-line in MEGAN UE, the easiest way to figure out the commands is to do this once from the GUI and then copy the command from the message window, as MEGAN echos all interactive commands there.

  1. At present, MEGAN can output read counts, total bases or total aligned bases. If you want relative read counts for multiple samples, then you need to open those samples together in a single document using File->Compare and select relative counts when setting up the comparison.
    Note that relative read counts is not the same as relative abundances, as the former doesn’t take genome sizes into account…

Thanks again, this is very useful.

Best regards,
Krešimir Križanović

Hello again.

I’ve been quite busy and didn’t get to test this until yesterday. Unfortunately, I still didn’t manage to get MEGAN to classify anything. I have trouble testing this with the MEGAN GUI because all my data is on a GUI-less server.

I want to repeat all my steps to make clear what exactly I’m doing. I hope you can check this out and maybe see what I’m doing wrong.

STEP 1. Map reads to a reference database using Last aligner. This produces quite large .maf file. The reference database is a .fna file with headers like this:
‘>kraken:taxid|272844|NC_000868.1 Pyrococcus abyssi GE5, complete genome’

STEP 2. Convert .maf file to .daa file using maf2daa tool. This produces much smaller .daa file that can be loaded into MEGAN.

STEP 3. Prepare results for analysis with MEGAN using daa-meganizer tool. Set option for long reads (-lg).

/daa-meganizer -lg -i Mock_100000-bacteria-l1000-q10_2BacAr.daa -mdb /mnt/5TB/megan/megan-db/megan-map-Oct2019-ue.db

This is where I might be doing something wrong. The output of daa-meganizer tool says that it only managed to classify into two classes (as I can understand it). Last part of daa-meganizer output:

Class. Taxonomy: 2
Class. SEED: 2
Class. EGGNOG: 2
Class. KEGG: 2
Class. INTERPRO2GO: 2

I’m not sure if I should be running daa-meganizer at all when mapping to nucleotide database.

STEP 4. Run MEGAN from command line with the following script.

open file=‘/mnt/data/megan-test/test_dataset_NR.daa’;
set idParsing=true cName=Taxonomy prefix=‘kraken:taxid|’;
update;
select nodes=all;
export what=CSV format=taxonId_to_percent separator=comma counts=assigned
file=‘/home/ubuntu/data/megan-results/silico-30human-70bac-toNR-megan.csv’;
quit;

This results in a really small .csv file:

1,61.025478
-1,0.000179
-2,61.025299

MEGAN is failing to determine the taxonomic identities of the reference sequences.

This makes no sense:

-mdb /mnt/5TB/megan/megan-db/megan-map-Oct2019-ue.db

because you are not using the NCBI-nr database.

As previously mentioned, you will have to use a different mechanism for identifying taxa, namely make use of the fact that your reference sequences already contain a taxon id in them:

kraken:taxid|272844|NC_000868.1 Pyrococcus abyssi GE5, complete genome

In this example, the taxon id is 272844.

As mentioned, MEGAN allows parsing of such ids to be turned on.
Unfortunately, daa-meganizer was missing the necessary option for turning on taxon-id parsing. I have fixed this. I will upload a new release of MEGAN 6.18.11 in which daa-meganizer has a new option:

-t4t, --tags4taxonomy [string] Tags for taxonomy id parsing (must set to activate id parsing).

Here you must specify:

-t4t taxid|

So, your daa-meganizer call should then look like this (using quotes to protect the pipe character ‘|’):

/daa-meganizer -lg -i Mock_100000-bacteria-l1000-q10_2BacAr.daa -t4t “taxid|”