I am currently looking deeper into the problem.
It looks like a mixture of missing alignments from DIAMOND (A sample with 1668301 reads has only 1656 alignments against NR in the DAA file for excample.) and these alignments all being against sequences that have been added to NR after the creation of the MEGAN-map file, for example this one:
>WP_239926560.1
Length = 286
Score = 56 bits (136), Expect = 6e-05
Identities = 30/30 (100%), Positives = 30/30 (100%), Gaps = 0/30 (0%)
Frame = +2
Query: 62 MTATGAGALIDDDFAGAGESVSADGAETFA 151
MTATGAGALIDDDFAGAGESVSADGAETFA
Sbjct: 1 MTATGAGALIDDDFAGAGESVSADGAETFA 30
Or this one:
>WP_238399680.1
Length = 95
Score = 78 bits (193), Expect = 1e-11
Identities = 36/36 (100%), Positives = 36/36 (100%), Gaps = 0/36 (0%)
Frame = -1
Query: 151 LVVSYLFYNFASVSDCIHCKSYFMITLNLHIFGDNR 44
LVVSYLFYNFASVSDCIHCKSYFMITLNLHIFGDNR
Sbjct: 60 LVVSYLFYNFASVSDCIHCKSYFMITLNLHIFGDNR 95
The NCBI IDs match proteins added to NR at the 22nd and 13th February 2022 respectively.
I thought I might have the wrong NR database to map against, but the uncompressed FASTA file is 230 GB (and matches the checksum), so I am fairly confident it is not just including “newly added sequences”.
I aligned to NR using a command like this, NOT using the LCA version of DIAMOND:
diamond blastx -p 40 -d /path/to/nr.dmd -a path/to/output/sampleID.daa -q path/to/input/sampleID.fastq
I did that for forward and reverse reads and then used daa2rma: with the mapping files found currently on the MEGAN website (Feb2022)
daa2rma -c -i path/to/input/sampleID_1.daa path/to/input/sampleID_2.daa -p true -tsd temp -t 60 -a2t megan-nucl.db -mdb megan-map.db -me 0.0 -supp 0.0
The few reads that do have alignments align only to proteins which have been added to NR after the creation of the mapping file. It looks like the problem actually starts with DIAMOND already, but I don’t know why DIAMOND would ignore most of the database when I did not give it the megan-map file at all.
The database for DIAMOND was created like this:
diamond makedb -p 30 -d nr.dmnd --in nr.fasta
nr.fasta is the uncompressed file retrieved by
wget https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
I took the sequence of the read from my second example (which matched only to WP_238399680.1) and ran it through the browser blastx on NR and it does ONLY match to WP_238399680.1 there as well. However that still leaves me with the question, why are only reads that match only very novel protein sequences (in NR) aligned by DIAMOND to the whole NR database and many, many other reads do not get any alignments.
To test, I took the sequence of the very last read in the input fastq.
AAATAAGACTGCTGAGCAAGCGAAATGAGACTGCTGGACAAGCAGAACGAGATTACTGAGTAAGCGCAGTGAGACGTTTAGTCAGGTTCAAAGATATAGAATATGAGGTGAAAGGAGATAGGTTCTAAGGTATAAGAAGACTGCTTCTAGA
and did also blastX that quickly, in the hopes it would have more alignments to “older” NR proteins.
Through using diamond view on the matching DAA file I can see it has no DIAMOND alignments. Unfortunately that really seems to have no blastx alignments to NR and no blastn alignments to NT (which was my guess what it could be, a remnant of human sequence that was not filtered out of the sample before).
I am stumped right now. QC of the 1668301 reads left after filtering out human for this sample looks fine. Good quality, no duplications, no overrepresented sequences, no significant adapter content.
Still, most of them do not seem to match anything in NR (and not even NT?) and the ones who do have only been very recently added to the databases.
Either there is something very odd going on with that sample, or DIAMOND or even the whole NR database. I am at the end of my ideas.
The sample itself is a normal human fecal sample, processed with the standard kit for WGS metagenomics as recommended by EMBL.
Edit: I did reduce the blastn requirements for my test sequence and if you use actual blastn on nt it eventually maps some uncultered bacterium. That seems to hint at our samples including a lot of unknown/newly or yet unsequenced bacteria. But I wouldn’t know why. The donors should not have an extraordinarily unusual gut microbiome, definitely not all of them.