I am trying to extract reads from specific nodes with the command line but am having issues. Extracting the reads from a single node is no problem, but I want to extract the reads assigned to several taxa at once, e.g. Taxon A, Taxon B, Taxon C, etc.
I could do it manually, but that would take too long, so I want to write a command that extracts the node-specific reads for all nodes at a specific level.
Do you have any pointers/tips?
Thank you!
When you say command line, I assume that you mean MEGAN UE?
If so, then this script might be a useful starting point. It is a script that opens a file, collapses to rank of Phylum, selects all Phylum nodes and then exports their counts:
With the script above, is there any way I can modify it to write each taxon to a new output file?
Ideally, I would want to have a file with the extracted reads for every taxon.
Hello @Daniel
Thanks for the ultra quick response
I have tried this, but I don’t understand at all how it works…
Sorry for the silly question, but how can I extract a .fasta file with only bacterial reads?
daa2info -i contigs1.daa -o onlybacteria.fasta -bo
All the best,
L
Sorry, actually you should use the read-extractor tool.
Here is its reported usage:
SYNOPSIS
    ReadExtractorTool [options]
DESCRIPTION
    Extracts reads from a DAA or RMA file by classification
OPTIONS
  Input and Output
    -i, --input [string(s)]            Input DAA and/or RMA file(s). Mandatory option.
    -o, --output [string(s)]           Output file(s). Use %t for class name and %i for class id. (Directory, stdout, .gz ok). Default value(s): ‘stdout’.
  Options
    -fsc, --frameShiftCorrect          Extract frame-shift corrected reads. Default value: false.
    -c, --classification [string]      The classification to use. Legal values: EC, EGGNOG, GTDB, INTERPRO2GO, KEGG, SEED, Taxonomy
    -n, --classNames [string(s)]       Names (or ids) of classes to extract reads from (default: extract all classes).
    -b, --allBelow                     Report all reads assigned to or below a named class. Default value: false.
    -a, --all                          Extract all reads (not by class). Default value: false.
  Other:
    -IE, --ignoreExceptions            Ignore exceptions and continue processing. Default value: false.
    -gz, --gzipOutputFiles             If output directory is given, gzip files written to directory. Default value: true.
    -v, --verbose                      Echo commandline options and be verbose. Default value: false.
    -h, --help                         Show program usage and quit.
Here is an example of how to run it to extract all bacterial reads to a file:
The -b option is important: this “all below” option ensures that not only reads assigned directly to Bacteria are saved, but also reads assigned to any node below Bacteria.
The %t in the output file name is replaced by the taxon name, in this case Bacteria.
This produces a gzipped file called Bacteria.fasta.gz that contains all reads assigned to Bacteria or below (more specific nodes).
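For reference, a command matching the description above might look like the following sketch. It uses only options listed in the usage text. The launcher name read-extractor (assumed to be on the PATH, as found in MEGAN's tools directory) and the input file contigs1.daa (borrowed from the earlier post) are assumptions, so adjust both to your setup. By default the script only prints the command; set DRY_RUN=0 to actually run it.

```shell
#!/bin/sh
# Sketch only. Assumptions: the MEGAN launcher is called 'read-extractor'
# and is on PATH; contigs1.daa is a placeholder input file.
EXTRACTOR="read-extractor"
INPUT="contigs1.daa"

run() {
    # Print the command; execute it only when DRY_RUN=0.
    printf '%s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# -c Taxonomy : classify by taxonomy
# -n Bacteria : class to extract
# -b          : include reads assigned below Bacteria as well
# %t          : replaced by the class name in the output file name
run "$EXTRACTOR" -i "$INPUT" -o '%t.fasta.gz' -c Taxonomy -n Bacteria -b
```

According to the usage text, -n takes one or more names and extracts all classes when omitted, so combined with %t in the output name this should give one file per taxon, which is what the original question asked for.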
I am also trying to extract reads, but based on specific KEGG KOs. Is this possible? So far I have been unable to use the -n option to specify KEGG KOs, KEGG pathway names, etc.
The problem is that my .daa files are too large (ca. 60 GB each) and I have too many of them to 1) move them to my local computer storage, and 2) work with each one of them manually using the UI. I need to use one of the command-line tools supplied with MEGAN; preferably, the read-extractor tool could be updated. I am already a big fan of the meganizer, compute comparison, and taxonomy2function tools.
I tested this and it worked. Yes, I have a paid license for MEGAN UE so this isn’t a problem. The issue was that I had written the full KEGG KOs (e.g. K06206) rather than just 6206.
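For anyone scripting this over many large .daa files, here is a rough sketch of the id conversion described above, wrapped in a loop. The helper ko_to_id, the output directory ko_reads, and the launcher name read-extractor are all assumptions made for illustration. The loop only prints the commands so they can be checked before running them.

```shell
#!/bin/sh
# Convert a KEGG KO accession such as K06206 into the bare number (6206)
# that, per the post above, -n accepts. ko_to_id is a hypothetical helper.
ko_to_id() {
    id="${1#K}"                                  # drop a leading "K", if any
    id="$(printf '%s' "$id" | sed 's/^0*//')"    # drop leading zeros
    printf '%s\n' "${id:-0}"                     # all-zero input falls back to 0
}

EXTRACTOR="read-extractor"   # assumed launcher name, adjust to your install
OUT_DIR="ko_reads"           # hypothetical output directory

# Print one extraction command per .daa file in the current directory.
for daa in *.daa; do
    [ -e "$daa" ] || continue                    # no .daa files: nothing to do
    base="$(basename "$daa" .daa)"
    for ko in K06206; do
        # %i in the output name is replaced by the class id
        printf '%s\n' "$EXTRACTOR -i $daa -o $OUT_DIR/$base-%i.fasta.gz -c KEGG -n $(ko_to_id "$ko")"
    done
done
```

Once the printed commands look right, they can be piped to sh, or the printf can be replaced by a direct call to the tool.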