1 Department of Biological Science and Technology, and
2 Institute of Bioinformatics, National Chiao Tung University, HsinChu 30050, Taiwan
3 Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan
Reprint request to: Jenn-Kang Hwang, Department of Biological Science and Technology, National Chiao Tung University, HsinChu 30050, Taiwan; e-mail: jkhwang{at}cc.nctu.edu.tw; fax: 886-3-572-9288; or Chih-Jen Lin, Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan; e-mail: cjlin{at}csie.ntu.edu.tw; fax: 886-2-2362-8167.
(RECEIVED October 7, 2003; FINAL REVISION January 30, 2004; ACCEPTED February 7, 2004)
| Abstract |
|---|
Keywords: subcellular localization; support vector machine; Gram-negative bacteria; machine-learning method; proteome; genome; n-peptide compositions
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.03479604.
| Introduction |
|---|
Gram-negative bacteria have five major subcellular localization sites: the cytoplasm, the inner membrane, the outer membrane, the periplasm, and the extracellular space. PSORT I (Nakai and Kanehisa 1991) has been the most widely used predictive tool for Gram-negative bacteria. However, it does not predict extracellular sequences, and its overall prediction accuracy reaches only 61% on a standard data set (Gardy et al. 2003). Recently, Gardy et al. (2003) developed PSORT-B, a multimodular method that combines different algorithms and input information. This approach comprises six modules that examine the query sequence for different characteristics, such as amino acid composition, similarity to proteins of known localization, presence of a signal peptide, transmembrane α-helices, and motifs corresponding to specific localizations. The program then constructs a Bayesian network to generate a final probability value for each localization site. This approach yields an overall prediction accuracy of 75% for all localization sites, an improvement of 14 percentage points over PSORT I. Despite this improvement, PSORT-B predicts some subcellular locations only modestly well: its accuracy is 58% for periplasmic sequences and 69% for cytoplasmic sequences. In this work, we present an approach that uses a single module, an SVM classifier based on multiple feature vectors (Yu et al. 2003), to predict subcellular localization for Gram-negative bacteria.
| Materials and methods |
|---|
An important issue in optimizing SVMs is the selection of parameters. A few parameters, such as the penalty parameter and the kernel parameter of the RBF function, must be determined before training. We use cross-validation over different parameter values for model selection (Duan et al. 2003). In this work, all SVM calculations are performed with LIBSVM (Chang and Lin 2001), a general library for support vector classification and regression.
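The parameter search described here can be sketched as a grid search scored by k-fold cross-validation. This is only an illustration under stated assumptions: the power-of-two exponent ranges, the fold count, the contiguous (unshuffled) folds, and the callback name `train_and_score` are hypothetical and are not taken from the paper. In practice the callback would train an RBF-kernel SVM on the training indices and return its accuracy on the held-out indices.

```python
from itertools import product
from math import log2


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


def select_parameters(n_samples, k, c_values, gamma_values, train_and_score):
    """Return (mean_score, C, gamma) with the highest mean cross-validation score.

    train_and_score(C, gamma, train_idx, test_idx) is a user-supplied
    (hypothetical) callback returning a score for one fold.
    """
    best = None
    for C, gamma in product(c_values, gamma_values):
        scores = [train_and_score(C, gamma, tr, te)
                  for tr, te in k_fold_indices(n_samples, k)]
        mean = sum(scores) / len(scores)
        if best is None or mean > best[0]:  # ties keep the first grid point
            best = (mean, C, gamma)
    return best


# Illustrative grid (assumed ranges): penalty C and RBF kernel parameter gamma.
C_VALUES = [2.0 ** e for e in range(-5, 16, 2)]
GAMMA_VALUES = [2.0 ** e for e in range(-15, 4, 2)]


def toy_score(C, gamma, train_idx, test_idx):
    """Stand-in for real training: a score that peaks at C = 2^3, gamma = 2^-3."""
    return -abs(log2(C) - 3) - abs(log2(gamma) + 3)


best = select_parameters(100, 5, C_VALUES, GAMMA_VALUES, toy_score)
print("best (mean score, C, gamma):", best)
```

With a real classifier, the same loop would be run once per coding scheme, so that each scheme ends up with its own penalty and kernel parameters.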
Sequence coding schemes
We have shown in previous work (Yu et al. 2003) that protein descriptors based on generalized n-peptide compositions are effective in predicting protein three-dimensional folds. If n = 1, the n-peptide composition reduces to the amino acid composition, and if n = 2, it gives the dipeptide composition. As n gets larger, the n-peptide compositions cover more global sequence information, but the number of components grows as 20^n, so the coding scheme becomes not only impractical from a computational viewpoint but also intractable from a learning viewpoint. This size problem can be overcome by regrouping the amino acids into a smaller number of classes according to their physicochemical or structural properties. In this work, we use An to denote the n-peptide composition of amino acids, Fn to denote the reduced amino acid composition in which the 20 amino acids are classified into four groups (charged, polar, aromatic, and nonpolar), and Xk to denote the partitioned amino acid composition in which the sequence is partitioned into k regions of equal length. Similar sequence coding schemes, such as the n-gram hashing function, have also been successfully applied to protein classification (Wu et al. 1992, 1996).
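A minimal sketch of these coding schemes may help make the vector sizes concrete. Several details are assumptions made for illustration and are not given in the text: the assignment of residues to the four groups (`REDUCED_GROUPS`), the use of overlapping windows, the handling of regions whose lengths do not divide evenly, and the reading of F3X5 as a reduced 3-peptide composition computed in each of five regions. All function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical four-group assignment (c = charged, p = polar, a = aromatic,
# n = nonpolar); the paper's actual grouping is not given in this section.
REDUCED_GROUPS = {"c": "DEKR", "p": "STNQCH", "a": "FWY", "n": "AVLIMPG"}


def n_peptide_composition(sequence, n=1, alphabet=AMINO_ACIDS):
    """Normalized frequencies of all overlapping n-peptides over `alphabet`."""
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        if window in counts:  # windows with non-standard symbols are skipped
            counts[window] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in keys]


def reduce_sequence(sequence, groups=REDUCED_GROUPS):
    """Map each residue to its group label; unknown residues become '?'."""
    lookup = {aa: label for label, members in groups.items() for aa in members}
    return "".join(lookup.get(aa, "?") for aa in sequence)


def partition(sequence, k):
    """Split into k contiguous regions whose lengths differ by at most one."""
    base, extra = divmod(len(sequence), k)
    regions, start = [], 0
    for r in range(k):
        end = start + base + (1 if r < extra else 0)
        regions.append(sequence[start:end])
        start = end
    return regions


def partitioned_composition(sequence, k, n=1, alphabet=AMINO_ACIDS):
    """Concatenate the n-peptide compositions of k consecutive regions."""
    vector = []
    for region in partition(sequence, k):
        vector.extend(n_peptide_composition(region, n, alphabet))
    return vector


# Toy sequence for illustration only.
seq = "MKKTAIAIAVALAGFATVAQAAPKDNTWYTGAKLGWSQ"
a1 = n_peptide_composition(seq, 1)                                 # 20 components
a2 = n_peptide_composition(seq, 2)                                 # 400 components
x4 = partitioned_composition(seq, 4)                               # 4 x 20 components
f3x5 = partitioned_composition(reduce_sequence(seq), 5, 3, "cpan") # 5 x 4^3 components
print(len(a1), len(a2), len(x4), len(f3x5))
```

The reduced alphabet is what keeps longer peptides tractable: a 3-peptide composition over 20 residues has 8000 components, whereas over four groups it has only 64.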
SVM training and testing
For multiclass SVM classification, we use the one-against-one method (Yu et al. 2003). For five classes of subcellular locations, we construct 5(5 - 1)/2 = 10 SVM classifiers for a given type of input vector. Each classifier is trained with proteins from two different subcellular locations. For each pair of penalty and kernel parameters, cross-validation combined with the one-against-one method is used to estimate the performance of the model; therefore, for each model, the 10 decision functions share the same parameters. Each protein in the test set gets one vote from each binary classifier. In this work, we use four sequence coding schemes (A1, A2, X4, and F3X5); therefore, we have constructed 10 x 4 = 40 SVM classifiers. We combine the votes from these classifiers and use the jury votes to determine the final assignment. In the case of identical votes, we give more weight to the votes from A1. The general architecture of our predictive system is shown in Figure 1. Note that the program SubLoc, which is based on amino acid compositions, can be seen as a special case of our predictive system. For convenience, we will refer to our Subcellular Localization Predictive System as CELLO.
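The voting step can be sketched as follows, assuming the binary decisions of the 40 classifiers are already available. The data layout, the function name `jury_vote`, and the exact tie-breaking rule (among tied locations, pick the one with the most A1 votes, then fall back to list order) are assumptions made for this illustration; the text states only that A1 votes carry more weight when votes are identical.

```python
from collections import Counter
from itertools import combinations

LOCATIONS = ["cytoplasm", "inner membrane", "periplasm",
             "outer membrane", "extracellular"]
SCHEMES = ["A1", "A2", "X4", "F3X5"]

# One binary classifier per unordered pair of locations: 5 * (5 - 1) / 2 = 10.
PAIRS = list(combinations(LOCATIONS, 2))


def jury_vote(pairwise_predictions, tie_break_scheme="A1"):
    """Combine one-against-one decisions from several coding schemes.

    pairwise_predictions maps scheme -> {(loc_a, loc_b): winning location}.
    Returns the location with the most votes over all schemes.
    """
    total = Counter()
    for decisions in pairwise_predictions.values():
        total.update(decisions.values())
    top = max(total.values())
    tied = [loc for loc in LOCATIONS if total[loc] == top]
    if len(tied) == 1:
        return tied[0]
    # Assumed tie-break: prefer the tied location with the most A1 votes.
    a1_votes = Counter(pairwise_predictions.get(tie_break_scheme, {}).values())
    return max(tied, key=lambda loc: a1_votes[loc])


def toy_decision(pair):
    """Stand-in for a trained binary SVM: favors 'periplasm' when present."""
    return "periplasm" if "periplasm" in pair else pair[0]


toy_predictions = {s: {pair: toy_decision(pair) for pair in PAIRS} for s in SCHEMES}
print(len(PAIRS) * len(SCHEMES), "binary decisions ->", jury_vote(toy_predictions))
```

Restricting `pairwise_predictions` to the A1 scheme alone reduces this to plain one-against-one voting on amino acid composition.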
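The prediction accuracy $P_i$ for subcellular location $i$ and the overall prediction accuracy $P$ are given by

$$P_i = \frac{c_i}{n_i} \qquad (1)$$

$$P = \sum_{i=1}^{J} w_i P_i \qquad (2)$$

where $c_i$ is the number of sequences correctly predicted in the $i$th subcellular location, $n_i$ is the number of sequences in that location, $J$ is the number of locations, $w_i = n_i/N$, and $N$ is the total number of sequences. We also use the Matthews correlation coefficient (MCC; Matthews 1975) as a measure of the predictive performance for each location:

$$\mathrm{MCC}_i = \frac{c_i n_i - u_i o_i}{\sqrt{(c_i + u_i)(c_i + o_i)(n_i + u_i)(n_i + o_i)}} \qquad (3)$$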
where $n_i$ here denotes the number of correctly predicted sequences not of location $i$ (as distinct from the sequence count $n_i$ used above), $u_i$ is the number of underpredicted sequences, and $o_i$ is the number of overpredicted sequences. The value of $\mathrm{MCC}_i$ is one for a perfect prediction and zero for a completely random assignment. Following the method of Gardy et al. (2003), sequences with dual locations are considered correctly predicted if one of their locations is predicted. This leads to a slight overestimation of the prediction accuracy (~10% of the protein sequences in the data set, 141 of 1443, have multiple localization sites). At present, CELLO does not predict multiple subcellular sites for protein sequences.
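These measures can be computed as in the short sketch below. It assumes the per-location accuracy is c_i/n_i, the overall accuracy is the sum of those accuracies weighted by w_i = n_i/N, and the MCC takes its standard form with true positives, true negatives, underpredictions, and overpredictions. The function names and the toy counts are hypothetical and are not results from the paper.

```python
from math import sqrt


def location_accuracy(correct, size):
    """Accuracy for one location: correctly predicted / number of sequences."""
    return correct / size


def overall_accuracy(correct_counts, sizes):
    """Weighted sum of per-location accuracies with weights n_i / N."""
    total = sum(sizes)
    return sum((n / total) * location_accuracy(c, n)
               for c, n in zip(correct_counts, sizes))


def mcc(true_pos, true_neg, under, over):
    """Matthews correlation coefficient for one location.

    true_pos: sequences of the location predicted correctly
    true_neg: sequences not of the location predicted correctly
    under:    sequences of the location missed (underpredicted)
    over:     sequences wrongly assigned to the location (overpredicted)
    """
    denom = sqrt((true_pos + under) * (true_pos + over)
                 * (true_neg + under) * (true_neg + over))
    # Convention assumed here: return 0.0 when the denominator vanishes.
    return (true_pos * true_neg - under * over) / denom if denom else 0.0


# Toy counts for two locations of 10 sequences each (illustration only).
print(overall_accuracy([8, 6], [10, 10]))
print(mcc(true_pos=8, true_neg=6, under=2, over=4))
```

Because the weights are n_i/N, the overall accuracy equals the total number of correct predictions divided by N, so large classes dominate it; the per-location MCC is less sensitive to class size.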
Data sets
We use the same data set as Gardy et al. (2003), extracted from SWISS-PROT release 40.29 (Bairoch and Apweiler 2000). This data set consists of 1443 protein sequences. Of these, 1302 proteins are localized in a single subcellular site: 248 cytoplasmic, 268 inner membrane, 244 periplasmic, 352 outer membrane, and 190 extracellular. The remaining 141 proteins reside at multiple localization sites: 14 cytoplasmic/inner membrane, 50 inner membrane/periplasmic, and 77 outer membrane/extracellular.
| Results and Discussion |
|---|
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 45-48.
Cai, Y.D., Liu, X.J., Xu, X.B., and Chou, K.C. 2002. Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J. Cell. Biochem. 84: 343-348.
Cedano, J., Aloy, P., Perez-Pons, J.A., and Querol, E. 1997. Relation between amino acid composition and cellular location of proteins. J. Mol. Biol. 266: 594-600.
Chang, C.-C. and Lin, C.-J. 2001. LIBSVM: A library for support vector machines. Software. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chou, K.C. 2001. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 43: 246-255.
Chou, K.C. and Cai, Y.D. 2002. Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 277: 45765-45769.
Chou, K.C. and Elrod, D.W. 1999. Protein subcellular location prediction. Protein Eng. 12: 107-118.
Duan, K., Keerthi, S.S., and Poo, A.N. 2003. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing 51: 41-59.
Emanuelsson, O., Nielsen, H., and von Heijne, G. 1999. ChloroP: A neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci. 8: 978-984.
Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G. 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300: 1005-1016.
Gardy, J.L., Spencer, C., Wang, K., Ester, M., Tusnady, G.E., Simon, I., Hua, S., deFays, K., Lambert, C., Nakai, K., et al. 2003. PSORT-B: Improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res. 31: 3613-3617.
Hua, S. and Sun, Z. 2001. Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17: 721-728.
Jensen, L.J., Gupta, R., Blom, N., Devos, D., Tamames, J., Kesmir, C., Nielsen, H., Staerfeldt, H.H., Rapacki, K., Workman, C., et al. 2002. Prediction of human protein function from post-translational modifications and localization features. J. Mol. Biol. 319: 1257-1265.
Matthews, B.W. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405: 442-451.
Nakai, K. 2000. Protein sorting signals and prediction of subcellular localization. Adv. Protein Chem. 54: 277-344.
Nakai, K. and Kanehisa, M. 1991. Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins 11: 95-110.
Nakai, K. and Kanehisa, M. 1992. A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14: 897-911.
Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. 1997. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10: 1-6.
Reinhardt, A. and Hubbard, T. 1998. Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res. 26: 2230-2236.
Tusnady, G.E. and Simon, I. 1998. Principles governing amino acid composition of integral membrane proteins: Application to topology prediction. J. Mol. Biol. 283: 489-506.
Tusnady, G.E. and Simon, I. 2001. The HMMTOP transmembrane topology prediction server. Bioinformatics 17: 849-850.
Vapnik, V. 1995. The nature of statistical learning theory. Springer, New York.
Wu, C., Whitson, G., McLarty, J., Ermongkonchai, A., and Chang, T.C. 1992. Protein classification artificial neural system. Protein Sci. 1: 667-677.
Wu, C.H., Zhao, S., Chen, H.L., Lo, C.J., and McLarty, J. 1996. Motif identification neural design for rapid and sensitive protein family search. Comput. Appl. Biosci. 12: 109-118.
Yu, C.-S., Wang, J.-Y., Yang, J.-M., Lyu, P.C., Lin, C.-J., and Hwang, J.-K. 2003. Fine-grained protein fold assignment by support vector machines using generalized n-peptide coding schemes and jury voting from multiple-parameter sets. Proteins 50: 531-536.
Yuan, Z. 1999. Prediction of protein subcellular locations using Markov chain models. FEBS Lett. 451: 23-26.