Extract rank-based reads with command line

Hi everyone,

I am trying to extract reads from specific nodes with the command line but am having issues. If I want to extract the reads from a specific node, that’s no issue but I want to extract the reads assigned to eg Taxon A, Taxon B, Taxon C etc.
I could do it manually but that would take too long, hence why I want to write a command that will extract the node-specific reads for all nodes at a specific level.
Do you have any pointers/tips?
Thank you!

When you say command line, I assume that you mean MEGAN UE?

If so, then this script might be a useful starting point. It is a script that opens a file, collapses to rank of Phylum, selects all Phylum nodes and then exports their counts:

open file=‘/Users/huson/test.rma6’;
collapse rank=Phylum;
select rank=Phylum;
export what=CSV format=taxonName_to_count separator=comma counts=assigned file=‘/Users/huson/test.txt’;
quit;

This is what appears in the output file:

“Proteobacteria”,933.0
“Firmicutes”,1.0

1 Like

I did, yes :sweat_smile:

the script above, is there any way I can modify it to write each taxon to a new output file?
Ideally, I would want to have a file with the extracted reads for every taxon.

Thanks again for the help!

Yes,

Use this command:

extract what=reads outDir= outFile= [data={{EC|EGGNOG|GTDB|INTERPRO2GO|KEGG|SEED|Taxonomy}]
[ids=<SELECTED|numbers…>] [names=<names…>] [allBelow={false|true}];

Any occurrence of %t or %i in the provided file-name template will be replaced by the node name, or id, respectively.

For example, I selected all genus level nodes in the taxonomy viewer and then ran this command:

extract what=reads outdir=‘/Users/huson/tmp’ outfile=‘reads-%t.fasta’ data=Taxonomy ids=selected allBelow=true;

This produced the following files:

/Users/huson/tmp/reads-Acidaminococcus.fasta /Users/huson/tmp/reads-Collinsella.fasta /Users/huson/tmp/reads-Paraprevotella.fasta …

Dear @Daniel,
is it not possible to extract reads from the command line in Megan6 Community Edition (v6.22.2)?

I want extract all bacterial reads, and from the desktop version is taking to much time.

All the best,
L

Please use the daa2info or rma2info programs to extract reads from a meganized DAA or RMA file

Hello @Daniel
Thanks for the ultra quick response :slight_smile:
I have tried this, but I don’t understand at all how it works…
Sorry for the silly question, but how can I extract a .fasta file with only bacterial reads?
daa2info -i contigs1.daa -o onlybacteria.fasta -bo
All the best,
L

Sorry, actually you should use the read-extractor tool.
Here is its reported usage:

SYNOPSIS
ReadExtractorTool [options]
DESCRIPTION
Extracts reads from a DAA or RMA file by classification
OPTIONS
Input and Output
-i, --input [string(s)] Input DAA and/or RMA file(s). Mandatory option.
-o, --output [string(s)] Output file(s). Use %t for class name and %i for class id. (Directory, stdout, .gz ok). Default value(s): ‘stdout’.
Options
-fsc, --frameShiftCorrect Extract frame-shift corrected reads. Default value: false.
-c, --classification [string] The classification to use. Legal values: EC, EGGNOG, GTDB, INTERPRO2GO, KEGG, SEED, Taxonomy
-n, --classNames [string(s)] Names (or ids) of classes to extract reads from (default: extract all classes).
-b, --allBelow Report all reads assigned to or below a named class. Default value: false.
-a, --all Extract all reads (not by class). Default value: false.
Other:
-IE, --ignoreExceptions Ignore exceptions and continue processing. Default value: false.
-gz, --gzipOutputFiles If output directory is given, gzip files written to directory. Default value: true.
-v, --verbose Echo commandline options and be verbose. Default value: false.
-h, --help Show program usage and quit.

Here is an example of how to run it to extract all bacterial reads to a file:

read-extractor -b -i /Users/huson/data/asari/1mio/Alice00-1mio.daa -o /Users/huson/data/asari/1mio/%t.fasta.gz -c Taxonomy -n Bacteria

The -b option is important: this “all below” option ensures that not only reads assigned to Bacteria are saved, but also assigned below the Bacteria node are saved, too.

The %t in the output file name is replaced by the taxon name, in this case Bacteria.

This produces a gzipped file called Bacteria.fasta.gz that contains all reads assigned to Bacteria or below (more specific nodes).

1 Like

Thank you Daniel :slight_smile:
L

I am also trying to extract reads but based on specific KEGG KOs, is this possible? So far I have been unable to use the -n to specify KEGG KOs, KEGG pathway names etc.

Yes, select the nodes you are interested in and then select the File->Extract Reads… menu item:

This will extract reads to a given file (or files, you can use %t and %i for place-holders for the name
and integer id of KEGG classes).

The problem is that my .daa files are too large (ca 60 GB each) and I have too many of them to 1) move them to my local computer storage, and 2 ) work with each one of them manually using the UI. I need to use one of the command tool supplied with MEGAN, preferbly the read-extractor tool could be updated. I am already a big fan of the meganizer, compute comparison, and taxonomy2function tools :+1:

I what way were you “unable to use the -n to specify KEGG KOs, KEGG pathway names etc”?

This works for me (however, you need to use MEGAN UE for KEGG):

tools/read-extractor -i Alice00.daa -c KEGG -n 6206

extracts all reads assigned to ‘K06206 sugar fermentation stimulation protein A’.

Hi again,

I tested this and it worked. Yes, I have a paid license for MEGAN UE so this isn’t a problem. The issue was that I had written the full KEGG KOs (e.g. K06206) rather than just 6206.

Thank you very much for the help.

How does it look like if I want to download EC 3.2.1.4 or some other enzymes. Empty folders are created. I don’t know the proper syntax.

#SBATCH --job-name=read_extractor
#SBATCH --account=“biotechnology”
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --time=7-00:00:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

Path to the read-extractor tool

READ_EXTRACTOR_PATH=“/home/stud/tgangar/megan/tools/read-extractor”

Input .daa file

INPUT_DAA=“…/Mayur_main_files/Mayur_Compost_R1_pair.daa”

Base output directory

BASE_OUTPUT_DIR=“./extracted_reads”

Function to extract reads for a given EC class

extract_reads() {
local ec_class=“$1”
local output_dir=“$BASE_OUTPUT_DIR/$ec_class”
mkdir -p “$output_dir”
xvfb-run --auto-servernum --server-num=1 “$READ_EXTRACTOR_PATH”
-i “$INPUT_DAA”
-o “$output_dir/%t.txt”
-c EC
-n “$ec_class”
}

Extract reads for EC class 3.2.1.4

extract_reads “3.2.1.4”

If I use -b option it is extracting all from EC 1 which I don’t want. I tried using full name like

/home/stud/tgangar/megan/tools/read-extractor -i …/Mayur_main_files/Mayur_Compost_R1_pair.daa -o /scratch/tgangar/Mayur_main_files/extracted_reads/%t.fasta.gz -c EC -n 3.2.1.4 Cellulase

I’m getting empty folders is it that I have to write -EC -n 3/3.2/3.2.1/3.2.1.4

Try putting it in double quotes “3.2.1.4 Cellulase”