Duplicate row names in maltExtract output

Hi, I have used MALT and MaltExtract to map some samples against the full NCBI fungi database as part of the HOPS pipeline. The problem is that I’m getting two instances of duplicate row names in my MaltExtract outputs, which means I can’t run the HOPS post-processing step. The first is Candida and the second is Cryptococcus. Since I am most interested in the latter, I was thinking I could just remove all lines corresponding to Candida from the output files. However, I do want to pull out the mapping stats for Cryptococcus. One of the duplicated Cryptococcus lines doesn’t have any reads mapped to it, so I assume it should be fine to remove it from RunSummary.txt, but I don’t know what to remove from the damageMismatch.txt file. The duplicated rows are shown below.

Perhaps there is a way to fix it in either my original database or the rma6 files before MaltExtract is run? I already tried removing certain .fna files from my database before running malt-build, but that did not resolve the problem.

Any help appreciated!

Candida duplication in RunSummary.txt:

(hops2) [he11@login-a default]$ grep "Candida" RunSummary.txt
Candida 7       1       5       2       9       4       4
Candida 702     715     345     4388    526     964     875
Candida_albicans        76      39      23      7       56      44      45
Candida_albicans_A155   0       0       0       0       0       0       0
Candida_albicans_A67    0       0       0       0       0       0       1
Candida_albicans_CHN1   1       0       1       0       0       1       0
Candida_albicans_Ca529L 2       2       0       0       0       0       1
Candida_albicans_Ca6    1       0       0       0       1       1       1
Candida_albicans_P34048 0       1       0       0       0       0       0
Candida_albicans_P37037 1       0       0       0       0       0       0
Candida_albicans_P37039 1       0       0       0       0       0       0
Candida_albicans_P57072 0       0       0       0       0       1       0
Candida_albicans_P60002 1       0       0       0       0       0       0
Candida_albicans_P75010 1       0       0       0       0       0       0
Candida_albicans_P75016 0       0       0       0       0       1       0
Candida_albicans_P94015 0       0       0       0       0       0       1
Candida_albicans_SC5314 1       0       0       0       2       0       0
Candida_albicans_WO-1   0       0       0       0       1       0       0
Candida_corydali        1       2       2       0       1       0       2
Candida_dubliniensis    39      20      8       2       20      21      8
Candida_dubliniensis_CD36       4       1       2       0       3       2       1
Candida_maltosa_Xu316   12      3       4       0       5       9       5
Candida_metapsilosis    9       6       2       3       5       5       11
Candida_orthopsilosis   1       9       1       0       0       0       2
Candida_orthopsilosis_AY2       4       4       0       0       3       0       1
Candida_orthopsilosis_Co_90-125 5       2       1       0       4       3       4
Candida_orthopsilosis_MCO456    49      13      6       7       27      16      5
Candida_oxycetoniae     21      2       9       2       12      11      12
Candida_parapsilosis    56      25      20      85      41      25      72
Candida_parapsilosis_GA1        7       2       0       1       5       1       0
Candida_sanyaensis      21      9       4       1       12      14      5
Candida_sojae   9       0       6       4       4       5       1
Candida_sp._JCM_15000   0       4       1       0       1       1       1
Candida_sp._LDI48194    2       1       1       0       0       0       1
Candida_theae   8       1       3       0       3       4       4
Candida_tropicalis      38      35      14      9       32      58      16
Candida_tropicalis_MYA-3404     2       0       1       0       0       0       0
Candida_viswanathii     0       2       1       0       4       1       3

Cryptococcus duplication in RunSummary.txt:

(hops2) [he11@login-a default]$ grep "Cryptococcus" RunSummary.txt
Cryptococcus    0       0       0       0       0       0       0
Cryptococcus    1393    4022    204     3239    892     3523    1144
Cryptococcus_amylolentus        0       2       0       0       0       2       3
Cryptococcus_amylolentus_CBS_6039       2       1       1       0       0       3       1
Cryptococcus_amylolentus_CBS_6273       2       0       0       0       2       2       1
Cryptococcus_depauperatus       0       0       0       0       0       2       1
Cryptococcus_depauperatus_CBS_7841      0       0       2       0       0       0       2
Cryptococcus_depauperatus_CBS_7855      0       1       1       1       0       0       1
Cryptococcus_gattii_CA1280      2       0       0       0       0       0       0
Cryptococcus_gattii_CA1873      0       0       0       0       0       2       0
Cryptococcus_gattii_EJB2        0       0       0       0       0       0       1
Cryptococcus_gattii_NT-10       2       1       1       0       0       2       2
Cryptococcus_gattii_Ru294       0       0       0       0       0       2       1
Cryptococcus_gattii_VGI 2       2       0       0       4       7       1
Cryptococcus_gattii_VGII        4       0       1       0       0       1       1
Cryptococcus_gattii_WM276       1       2       0       0       1       0       0
Cryptococcus_gattii_species_complex     0       0       0       0       1       1       0
Cryptococcus_neoformans 13      19      7       0       6       24      9
Cryptococcus_neoformans_AD_hybrid       0       1       0       0       0       2       1
Cryptococcus_neoformans_species_complex 0       1       2       0       1       0       1
Cryptococcus_neoformans_var._grubii     32      27      12      6       23      32      14
Cryptococcus_neoformans_var._grubii_125.91      1       0       0       0       0       0       0
Cryptococcus_neoformans_var._grubii_A5-35-17    0       0       0       0       1       0       0
Cryptococcus_neoformans_var._grubii_Br795       12      4       2       1       5       4       2
Cryptococcus_neoformans_var._grubii_Bt1 0       1       0       0       0       0       0
Cryptococcus_neoformans_var._grubii_Bt15        0       0       0       0       0       0       0
Cryptococcus_neoformans_var._grubii_Bt63        0       1       0       0       0       0       0
Cryptococcus_neoformans_var._grubii_C23 0       0       0       0       0       1       0
Cryptococcus_neoformans_var._grubii_CHC193      1       2       1       0       1       2       0
Cryptococcus_neoformans_var._grubii_D17-1       3       2       1       0       3       2       2
Cryptococcus_neoformans_var._grubii_MW-RSA1955  2       0       1       0       0       0       0
Cryptococcus_neoformans_var._grubii_MW-RSA36    0       0       0       0       0       0       0
Cryptococcus_neoformans_var._grubii_MW-RSA852   0       0       1       0       0       0       0
Cryptococcus_neoformans_var._grubii_Tu401-1     0       1       0       0       0       0       1
Cryptococcus_neoformans_var._grubii_c45 1       1       0       0       0       1       0
Cryptococcus_neoformans_var._neoformans 0       0       0       0       0       0       2
Cryptococcus_neoformans_var._neoformans_B-3501A 0       1       0       0       1       0       0
Cryptococcus_neoformans_var._neoformans_JEC21   0       0       0       0       2       0       1
Cryptococcus_neoformans_var._neoformans_XL280   1       0       0       0       0       0       0
Cryptococcus_sp._05/00  68      189     19      0       68      206     12
Cryptococcus_sp._JCM_24511      12      34      5       1       13      32      4
Cryptococcus_wingfieldii        1       2       1       0       2       9       0
Cryptococcus_wingfieldii_CBS_7118       2       0       1       0       1       2       1

And Cryptococcus in readDist/*_alignmentDist.txt:

(hops2) [he11@login-a default]$ grep "Cryptococcus" readDist/S1.rma6_alignmentDist.txt
Cryptococcaceae Cryptococcus    1       2       2       2       2430807
Cryptococcus    Cryptococcus    0.972   270     285     319     2099666
Cryptococcus    NA      0       0       0       0       0
Cryptococcus_amylolentus        Cryptococcus_amylolentus_CBS_6273       1       1       1       1       1425963
Cryptococcus_amylolentus_CBS_6039       Cryptococcus_floricola  1       1       1       1       92997
Cryptococcus_amylolentus_CBS_6273       Cryptococcus_amylolentus_CBS_6273       1       1       1       1       1093067
Cryptococcus_depauperatus       Cryptococcus_depauperatus_CBS_7855      1       2       2       2       939986
Cryptococcus_gattii_CA1873      Purpureocillium_takamizusanense 1       1       1       1       1096568
Cryptococcus_gattii_NT-10       Cryptococcus_gattii_NT-10       1       2       2       2       4847
Cryptococcus_gattii_Ru294       Cryptococcus_gattii_Ru294       1       1       1       1       1133708
Cryptococcus_gattii_VGI Cryptococcus_gattii_EJB2        1       1       1       3       346420
Cryptococcus_gattii_VGII        Cryptococcus_gattii_VGII        0       0       2       2       1116177
Cryptococcus_gattii_WM276       Cryptococcus_gattii_WM276       0       0       2       2       1325755
Cryptococcus_gattii_species_complex     Cryptococcus_gattii_Ru294       1       1       1       1       536216
Cryptococcus_neoformans Cryptococcus_neoformans_var._grubii_H99 0.604   6       10      10      1621675
Cryptococcus_neoformans_AD_hybrid       Cryptococcus_neoformans_AD_hybrid       1       1       1       1       84756
Cryptococcus_neoformans_species_complex NA      0       0       0       0       0
Cryptococcus_neoformans_var._grubii     Cryptococcus_neoformans_var._grubii_A1-35-8     0.487   7       16      17      174521
Cryptococcus_neoformans_var._grubii_Br795       NA      0.035   0       4       4       1344473
Cryptococcus_neoformans_var._grubii_Bt15        Cryptococcus_neoformans_var._grubii_Bt15        0.043   0       2       2       3096
Cryptococcus_neoformans_var._grubii_C23 Cryptococcus_neoformans_var._grubii_C23 1       1       1       1       2917
Cryptococcus_neoformans_var._grubii_CHC193      Cryptococcus_neoformans_var._grubii_CHC193      1       1       1       1       3675
Cryptococcus_neoformans_var._grubii_D17-1       Cryptococcus_neoformans_var._grubii_D17-1       0.036   0       4       4       2878
Cryptococcus_neoformans_var._grubii_c45 Cryptococcus_neoformans_var._grubii_c45 0.083   0       2       2       153309
Cryptococcus_sp._05/00  Cryptococcus_sp._05/00  0.371   3       17      18      6454
Cryptococcus_sp._JCM_24511      Cryptococcus_sp._JCM_24511      1       6       6       8       2178591
Cryptococcus_wingfieldii        Cryptococcus_wingfieldii        1       2       2       2       1480951
Cryptococcus_wingfieldii_CBS_7118       Cryptococcus_wingfieldii_CBS_7118       1       1       1       1       606306

I’ve looked into this, but can’t figure out how the same taxon would occur twice in the RMA file.

Hi Daniel,
Thanks for looking at it. It’s pretty frustrating, particularly as it is the Cryptococcus reads I’m most interested in.
Is there any way to find these in the rma file and rename them?
Or can I edit the RunSummary file produced by maltExtract (I’m assuming this is produced first?) and then continue the rest of the process using the edited file?

Or perhaps I have to collapse species nodes and just look at genus level?

I’m tempted just to delete the row of Cryptococcus with zeroes from the files, but not sure if this could mess anything else up somewhere…
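Before deleting anything, it may help to list exactly which row names are repeated and what values each copy holds. The following is a minimal sketch, assuming the file is tab-separated with the taxon name in the first column and a header on the first line (the header in the example is made up for illustration). The function name `find_duplicate_names` is hypothetical and not part of HOPS or MaltExtract.

```python
def find_duplicate_names(lines, has_header=True):
    """Map each repeated first-column name to the list of its value rows."""
    seen = {}
    for index, line in enumerate(lines):
        if has_header and index == 0:
            continue  # assumed: the first line holds column headers
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        seen.setdefault(fields[0], []).append(fields[1:])
    # Keep only names that occur more than once
    return {name: rows for name, rows in seen.items() if len(rows) > 1}


# In-memory example built from rows quoted in this thread;
# the header line is invented for illustration.
example = [
    "Node\tS1\tS2\tS3\tS4\tS5\tS6\tS7\n",
    "Cryptococcus\t0\t0\t0\t0\t0\t0\t0\n",
    "Cryptococcus\t1393\t4022\t204\t3239\t892\t3523\t1144\n",
    "Cryptococcus_amylolentus\t0\t2\t0\t0\t0\t2\t3\n",
]
duplicates = find_duplicate_names(example)
for name, rows in duplicates.items():
    # Report, per copy, whether every value in that row is zero
    print(name, [all(value == "0" for value in row) for row in rows])
```

An open file handle can be passed in place of the list, for example `find_duplicate_names(open("RunSummary.txt"))`. This only reports duplicates; it does not explain why the same name appears twice.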

Please give me access to a file that exhibits the problem and I will try to debug this.

Thanks, Daniel. What files would you need? The rma?

Precisely, I should be able to figure this out from an RMA file.

Hi Daniel,
Hopefully I sent my file share to the correct email. If so, I was wondering if you have had a chance to look into this yet?
Thanks so much,
Hannah

Unfortunately, I wasn’t able to log in to Box because a Box account appears to be required (which I don’t have).

Hello, apologies I know it’s been a while but I’m still struggling with this. I can get through to some final outputs on the HOPS pipeline by deleting one of the rows with duplicate names in all the following files:
editDistance/*_editDistance.txt
damageMismatch/*_damageMismatch.txt
filterInformation/*_assignedReads.txt
filterInformation/*_filterTable.txt
readDist/*_additionalNodeEntries.txt
readDist/*_readLengthStat.txt
readDist/*_alignmentDist.txt
RunSummary.txt

but I’m not sure how much this could affect my final results, particularly as I am interested in pulling out the Cryptococcus taxon if possible.
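The manual deletion described above could be scripted so that it is applied the same way to every file. The sketch below is one possible approach, not a HOPS feature: it drops a duplicated row only when all of its values are empty (0, 0.0, NA, or blank) and leaves everything else untouched. It assumes tab-separated files with the name in the first column and a header line; whether every file in the list follows that layout has not been checked here. `drop_empty_duplicates` and `clean_file` are hypothetical names.

```python
EMPTY_VALUES = {"0", "0.0", "NA", ""}  # assumption: these all mean "no data"


def drop_empty_duplicates(lines, has_header=True):
    """Remove rows whose name is repeated and whose values are all empty.

    Rows with unique names are always kept. If every copy of a repeated
    name is empty, the first copy is kept so the taxon does not vanish.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines]
    start = 1 if has_header else 0  # assumed: first line is a header
    counts = {}
    nonempty = set()
    for fields in rows[start:]:
        name = fields[0]
        counts[name] = counts.get(name, 0) + 1
        if any(value not in EMPTY_VALUES for value in fields[1:]):
            nonempty.add(name)

    kept = rows[:start]
    emitted = set()
    for fields in rows[start:]:
        name = fields[0]
        is_empty = all(value in EMPTY_VALUES for value in fields[1:])
        if counts[name] > 1 and is_empty and (name in nonempty or name in emitted):
            continue  # redundant empty copy of a duplicated name
        emitted.add(name)
        kept.append(fields)
    return ["\t".join(fields) + "\n" for fields in kept]


def clean_file(src, dst, has_header=True):
    """Write a de-duplicated copy of src to dst, leaving src unchanged."""
    with open(src) as handle:
        cleaned = drop_empty_duplicates(handle, has_header)
    with open(dst, "w") as handle:
        handle.writelines(cleaned)
```

With this rule the all-zero Cryptococcus row would be removed, but the two Candida rows would both remain because each has reads assigned, so that duplicate would still need a separate decision. Writing to a new file rather than editing in place keeps the original MaltExtract output available for comparison.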

I shared the rma file over Drive, did it work this time?

Thanks, Hannah