Hi, everyone! I have performed a metaproteomics analysis and obtained identifications like the following:
The first column is the UniProt Swiss-Prot accession number. Can I import this table into MEGAN 6 for taxonomic and functional analysis with the help of prot.accession2taxid?
What file format is needed to import peptides like this?
After the proteomic database search, each identified peptide has been assigned to one or more accession numbers, so the link between peptide and protein/gene is already established. We should therefore be able to load such results into MEGAN directly and skip the blastp step.
However, it seems that MEGAN accepts only BLAST input. Is there a way to load the proteomic identification table into MEGAN directly?
You can import data in CSV format.
For this, you prepare a file in which each line has one of these two forms:
accession-string <tab> peptide-count
peptide-name <tab> accession-string <tab> 50
In the first case, you list accessions and how many peptides have been assigned to each accession.
In the second case, you list a peptide name (it could be the peptide sequence itself) and then the accession that the peptide maps to. This needs to be followed by a score, such as 50.
Then, when importing, you have to select “Taxonomy” and then click “Parse accession ids”. The latter will open a dialog that allows you to specify a “mapping” file.
This could be one downloaded from the MEGAN webpage, or one that you set up yourself.
Here is a fake example:
(using commas rather than tabs…)
If you set the taxonomy accession mapping file to
prot_acc2tax-Nov2018X1.abin then MEGAN will happily import these two fake peptides and will assign one to
Bacteroides and the other to Bacteroides uniformis.
If a peptide has multiple assignments, then list it multiple times on consecutive lines, once for each accession.
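To make the second format concrete, here is a rough Python sketch that writes such a file from a list of (peptide, accession) pairs. The function name `write_megan_import`, the fixed score of 50, and the choice of a tab delimiter are all just assumptions for illustration; adapt them to however your search engine exports its identifications.

```python
import csv


def write_megan_import(peptide_hits, out_path, score=50, delimiter="\t"):
    """Write 'peptide <delim> accession <delim> score' lines.

    peptide_hits: iterable of (peptide, accession) pairs in any order.
    Lines for the same peptide are written consecutively, one per
    distinct accession. Returns the number of lines written.
    """
    grouped = {}  # peptide -> list of distinct accessions, in first-seen order
    for peptide, accession in peptide_hits:
        accessions = grouped.setdefault(peptide, [])
        if accession not in accessions:
            accessions.append(accession)

    written = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
        for peptide, accessions in grouped.items():
            for accession in accessions:
                writer.writerow([peptide, accession, score])
                written += 1
    return written
```

Calling it with `delimiter=","` gives the comma-separated variant instead of the tab-separated one.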
Let me know if you have problems getting this to work.
Thank you very much for your kind reply! The method you proposed works fine. My file has 47,115 lines and 10,824 distinct peptides. The mapping result includes 4,076 peptides in cellular organisms (234 bacteria, 4 archaea, 3,822 eukaryota) and 5 in viruses. This means that more than half of the peptides fail to map, which seems strange. I use prot_acc2tax-Nov2018X1.abin for accession mapping. For the proteomic database search, I used Uniprot_Swissprot v1903 for identification. Could you please give me some advice?
lnrep1-1_pep_acc.csv (1.1 MB)
I took a look at your data and I see this:
Reads in: 10,824
Reads out: 2,330
Class names: 47,115
This means that MEGAN is failing to map about 36 thousand of the 47 thousand accessions. This is because the mapping file that we provide only uses the first accession of each NR entry. You are not using NR, which is why many of your accessions are not being mapped.
One way to fix this is to create your own mapping file.
Each line should consist of an accession and the corresponding NCBI taxon ID, like this:
B7M0C7 <tab> 585034
You should be able to get all of these entries out of the file
prot.accession2taxid.gz from ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid.
You can then give the resulting file to MEGAN as an accession file. (But don’t use the whole of prot.accession2taxid.gz, as that is too big for MEGAN to handle directly.)
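Something along these lines should do the extraction. This is only a sketch: the function name `build_mapping` is made up, and it assumes the usual four-column, tab-separated layout of prot.accession2taxid.gz (accession, accession.version, taxid, gi) with a header line. It streams the file, so the whole thing is never held in memory.

```python
import gzip


def build_mapping(accessions, accession2taxid_gz, out_path):
    """Write 'accession <tab> taxid' for each wanted accession found.

    Assumes columns: accession, accession.version, taxid, gi.
    Returns the set of wanted accessions that were not found.
    """
    wanted = set(accessions)
    found = set()
    with gzip.open(accession2taxid_gz, "rt") as src, open(out_path, "w") as out:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # Skip malformed lines and the header row.
            if len(fields) < 3 or fields[0] == "accession":
                continue
            accession, taxid = fields[0], fields[2]
            if accession in wanted and accession not in found:
                out.write(f"{accession}\t{taxid}\n")
                found.add(accession)
    return wanted - found
```

The returned set tells you which of your accessions have no entry in the NCBI file, which is worth checking before you import into MEGAN.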
Dear Daniel, your suggestion is great! I generated an accession mapping file as you advised, and I now get a complete mapping of all of the peptides. I really appreciate your help.
Good to hear that this works.