Strange "coincidences"

BioSina · April 13, 2022, 12:48pm

I am working with a sample that has a lot of (human) host DNA. I removed the host reads by mapping to HG38 and am mapping the remaining reads against NCBI NR (downloaded 3rd April 2022). I run DAA2RMA on the alignments to NR with the current (Feb2022) mapping file for MEGAN.
From the over 16 00000 read pairs, only around 4000 show up as mapped reads, but no taxonomy can be assigned to any of them.
I did check random IDs from the mapped, unassigned reads and it looks like all of those proteins have been added to NR between the 11th (or maybe earlier, this is the earliest I have seen) February 2022 and end of March 2022, so in the short time between creation of the megan-map and download and creation of NR fasta for the DIAMOND database.

I am not sure what is going on here. The sample is clearly not optimal itself, but this looks like there might be a bug in the current mapping file or MEGAN version as well? I have most definitely mapped against the full NR (the DIAMOND database file is ~230 GB big).

Daniel · April 14, 2022, 7:50am

Could you please give me access to some of the alignments for reads that have no assignment. To do so, open them in the inspector window, uncollapse the reads and their alignments, and then copy and paste them into a text file and send to me. (Ideally, of course, reads that you suspect should have a taxonomic placement).
I will look into this.

BioSina · April 14, 2022, 8:45am

I am currently looking deeper into the problem.
It looks like a mixture of missing alignments from DIAMOND (A sample with 1668301 reads has only 1656 alignments against NR in the DAA file for excample.) and these alignments all being against sequences that have been added to NR after the creation of the MEGAN-map file, for example this one:

>WP_239926560.1
	Length = 286

 Score = 56 bits (136), Expect = 6e-05
 Identities = 30/30 (100%), Positives = 30/30 (100%), Gaps = 0/30 (0%)
 Frame = +2

Query:       62  MTATGAGALIDDDFAGAGESVSADGAETFA  151
                 MTATGAGALIDDDFAGAGESVSADGAETFA
Sbjct:        1  MTATGAGALIDDDFAGAGESVSADGAETFA  30

Or this one:
>WP_238399680.1
Length = 95

 Score = 78 bits (193), Expect = 1e-11
 Identities = 36/36 (100%), Positives = 36/36 (100%), Gaps = 0/36 (0%)
 Frame = -1

Query:      151  LVVSYLFYNFASVSDCIHCKSYFMITLNLHIFGDNR  44
                 LVVSYLFYNFASVSDCIHCKSYFMITLNLHIFGDNR
Sbjct:       60  LVVSYLFYNFASVSDCIHCKSYFMITLNLHIFGDNR  95

The NCBI IDs match proteins added to NR at the 22nd and 13th February 2022 respectively.

I thought I might have the wrong NR database to map against, but the uncompressed FASTA file is 230 GB (and matches the checksum), so I am fairly confident it is not just including “newly added sequences”.

I aligned to NR using a command like this, NOT using the LCA version of DIAMOND:
diamond blastx -p 40 -d /path/to/nr.dmd -a path/to/output/sampleID.daa -q path/to/input/sampleID.fastq

I did that for forward and reverse reads and then used daa2rma: with the mapping files found currently on the MEGAN website (Feb2022)
daa2rma -c -i path/to/input/sampleID_1.daa path/to/input/sampleID_2.daa -p true -tsd temp -t 60 -a2t megan-nucl.db -mdb megan-map.db -me 0.0 -supp 0.0

The few reads that do have alignments align only to proteins which have been added to NR after the creation of the mapping file. It looks like the problem actually starts with DIAMOND already, but I don’t know why DIAMOND would ignore most of the database when I did not give it the megan-map file at all.

The database for DIAMOND was created like this:
diamond makedb -p 30 -d nr.dmnd --in nr.fasta

nr.fasta is the uncompressed file retrieved by
wget https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

I took the sequence of the read from my second example (which matched only to WP_238399680.1) and ran it through the browser blastx on NR and it does ONLY match to WP_238399680.1 there as well. However that still leaves me with the question, why are only reads that match only very novel protein sequences (in NR) aligned by DIAMOND to the whole NR database and many, many other reads do not get any alignments.

To test, I took the sequence of the very last read in the input fastq.
AAATAAGACTGCTGAGCAAGCGAAATGAGACTGCTGGACAAGCAGAACGAGATTACTGAGTAAGCGCAGTGAGACGTTTAGTCAGGTTCAAAGATATAGAATATGAGGTGAAAGGAGATAGGTTCTAAGGTATAAGAAGACTGCTTCTAGA
and did also blastX that quickly, in the hopes it would have more alignments to “older” NR proteins.
Through using diamond view on the matching DAA file I can see it has no DIAMOND alignments. Unfortunately that really seems to have no blastx alignments to NR and no blastn alignments to NT (which was my guess what it could be, a remnant of human sequence that was not filtered out of the sample before).

I am stumped right now. QC of the 1668301 reads left after filtering out human for this sample looks fine. Good quality, no duplications, no overrepresented sequences, no significant adapter content.
Still, most of them do not seem to match anything in NR (and not even NT?) and the ones who do have only been very recently added to the databases.

Either there is something very odd going on with that sample, or DIAMOND or even the whole NR database. I am at the end of my ideas.

The sample itself is a normal human fecal sample, processed with the standard kit for WGS metagenomics as recommended by EMBL.

Edit: I did reduce the blastn requirements for my test sequence and if you use actual blastn on nt it eventually maps some uncultered bacterium. That seems to hint at our samples including a lot of unknown/newly or yet unsequenced bacteria. But I wouldn’t know why. The donors should not have an extraordinarily unusual gut microbiome, definitely not all of them.