1 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0111, Japan
2 Centre de Géostatistique, Ecole des Mines de Paris, 77300 Fontainebleau, France
3 Division of Bioinformatics, Medical Institute of Bioregulation, Kyushu University, Fukuoka 812-8582, Japan
Reprint requests to: Setsuro Matsuda, Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan; e-mail: smatsuda{at}kuicr.kyoto-u.ac.jp; fax: +81-774-38-3022.
(RECEIVED May 20, 2005; FINAL REVISION August 22, 2005; ACCEPTED August 22, 2005)
| Abstract |
|---|
Keywords: subcellular location; signal sequence; amino acid composition; distance frequency; support vector machine; predictive accuracy
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.051597405.
| Introduction |
|---|
Many prediction methods have been developed to date. PSORT (Nakai and Kanehisa 1992; Horton and Nakai 1997) was historically the first method for predicting subcellular locations. It uses various sequence-derived features such as the presence of sequence motifs and amino acid compositions. Most existing methods can be roughly classified into two groups according to their input data: those based on the N-terminal sequence of a protein and those based on its amino acid composition. TargetP (Emanuelsson et al. 2000) feeds the N-terminal sequence into two layers of artificial neural networks (ANNs) and can also predict the peptidase cleavage site of a protein. The first layer comprises the earlier binary predictors, SignalP (Nielsen et al. 1997) and ChloroP (Emanuelsson et al. 1999). Reczko and Hatzigeorgiou (2004) used a bidirectional recurrent neural network with the first 90 residues of the N-terminal sequence. Yuan (1999) applied a Markov chain model to the prediction, but used the entire sequence as the input data.
ProtLock (Cedano et al. 1997) requires the amino acid composition and is based on the least Mahalanobis distance algorithm. Chou and Elrod (1998, 1999) also used the amino acid composition, but employed the covariant discriminant algorithm. NNPSL (Reinhardt and Hubbard 1998) is an ANN-based method using the amino acid composition. After the successful report of Reinhardt and Hubbard (1998), the application of machine learning techniques became popular in this field. In SubLoc (Hua and Sun 2001), a support vector machine (SVM) was implemented instead of the ANN. Incorporating the amino acid order as well as the amino acid composition is expected to improve prediction performance. Chou (2001) proposed the pseudo-amino acid composition to take the effect of the amino acid order into account. Furthermore, Cai and Chou (2004) have recently developed an accurate method integrating the pseudo-amino acid composition, the functional domain composition (Chou and Cai 2002, 2004), and gene ontology information (Chou and Cai 2003). Park and Kanehisa (2003) developed an SVM-based method that incorporates compositions of dipeptides and gapped amino acid pairs in addition to the conventional amino acid composition. The concepts of the pseudo-amino acid and gapped amino acid pair compositions were merged in the residue-couple model proposed by Guo et al. (2005).
Incorporating information from homology searches can also improve the prediction performance (Bhasin and Raghava 2004; Kim et al. 2004; Bhasin et al. 2005). However, when evaluating prediction methods based on homology search, one should pay close attention to the sequence similarity between training and test data. If a query sequence in the test data has a high similarity with a sequence in the training data, then its subcellular location can be easily predicted without using a complicated predictor. In other words, the data set used for training and testing must be sufficiently redundancy-reduced.
Although Reinhardt and Hubbard (1998) pointed out that prediction methods based on the amino acid composition are robust to gene annotation errors in the 5'-region, using only the amino acid composition loses the information carried by signal sequences. To overcome this problem, concepts such as the pseudo-amino acid composition have been introduced. In this work, we propose a novel representation of protein sequences to further improve the accuracy of prediction methods. Our method, which employs an SVM with an RBF kernel, is based on local compositions of amino acids and twin amino acids, and on local frequencies of the distance between successive amino acids. As benchmark data, we adopt the data sets provided by Reinhardt and Hubbard (1998) and Emanuelsson et al. (2000) because they have been widely used in earlier studies. For convenience, we call the former "NNPSL data sets" and the latter "TargetP data sets."
Each amino acid is represented by its one-letter code hereafter. In this work, basic amino acids encompass R, K, and H. Hydrophobic amino acids are I, V, L, F, M, A, G, W, and P. The remainder, D, N, E, Q, Y, S, T, and C are called "other amino acids."
| Results |
|---|
In Tables 3 and 4, our overall accuracies are the highest if we consider the jackknife accuracies of Chou and Cai (2004). According to Chou and Zhang (1995), the jackknife test is more rigorous and objective than the cross-validation test, because the number of possible data divisions is too large to be handled in the latter test. However, we adopted the cross-validation test to save CPU time and to compare our method with as many recent methods as possible. For plant proteins, our sensitivities for chloroplast, nuclear, and cytosolic (other) proteins are lower than those of Emanuelsson et al. (2000), but the sensitivity for mitochondrial proteins was improved by 0.104. For non-plant proteins, our sensitivity for nuclear and cytosolic proteins is higher than that of any other method. It is noteworthy that the MCCs of our method are over 0.82 for all locations.
To compare the predictive accuracies under the same conditions, we implemented the method proposed by Kim et al. (2004). Therefore, the values of sensitivity, specificity, MCC, and overall accuracy are different from those in Kim et al. (2004). They also employed the SVM with RBF kernel and characterized protein sequences by the Needleman-Wunsch scores (Needleman and Wunsch 1970) against all the sequences in training data. ALIGN0 (Myers and Miller 1988) in the FASTA 2.0 package (Pearson and Lipman 1988; Pearson 1990) was used for calculating the scores. The gap penalty is 3 and the scoring matrix is BLOSUM50. Each sequence was truncated after the N-terminal 90 residues for the calculation. The values of the regularization parameter C and the parameter γ of the RBF kernel are the same as those in Kim et al. (2004; see Table 8, below).
| Discussion |
|---|
Usefulness of the distance frequency
The distance frequency was developed in consideration of nuclear export signal (NES) and chloroplast transit peptide. In Figure 1, we visualized distance frequencies of three hydrophobic amino acids (L, I, and V) for 75 protein sequences containing the NES (called "with NES"). These sequences were downloaded from NES-base 1.0 (la Cour et al. 2003) and their NESs were experimentally verified. We also depicted the distance frequencies for the 75 sequences with their NES removed (called "without NES"). These frequencies are slightly smaller than those of "with NES" at H = 2, 3, 4, where H represents the distance between successive amino acids. This decline implies that the distance frequency modestly reflects the existence of NES.
6. The difference of distance frequencies related to the two locations is significantly large compared with Figure 2B.
Internal signal sequences, which are positioned in the middle part, are less well characterized than those in the N-terminal and C-terminal parts. However, some biological experiments indicate the importance of signal sequences in the middle part. Miyakawa and Imamura (2003) found that two fibroblast growth factors, FGF-9 and FGF-16, require both the N-terminal region and a central hydrophobic region as a secretory signal. This hydrophobic region belongs to the middle part here. Furthermore, this bipartite signal sequence is not cleaved off by proteases during the transport process. We collected three sequences (human FGF-9, human FGF-16, and rat FGF-16) from databases available on the Internet and then predicted their subcellular locations by SignalP 3.0 (Bendtsen et al. 2004). This is the latest version of SignalP and employs both the ANN and a hidden Markov model. As a result, these sequences were predicted as nonsecretory proteins. In contrast, our method, which can consider the features in the middle part, correctly predicted all the sequences as secretory proteins.
From the aforementioned results, we conclude that separating a protein sequence into the N-terminal, middle, and C-terminal parts helps to capture signal sequences. In addition, our method has the advantage of requiring less CPU time to construct the feature vector than the method proposed by Kim et al. (2004).
Feature weights
Here we describe how to estimate the importance of each feature and discuss the relation between these features and subcellular locations. Unlike a linear SVM, an RBF SVM does not assign an explicit weight to each feature. To estimate the importance of each feature, we used the following procedure: (1) prepare a feature vector whose components are all 0, (2) assign 1 to the feature whose importance is to be estimated, (3) feed this vector into the trained SVMs and obtain their outputs, and (4) repeat steps 1–3 for all features. The outputs are regarded as the weights of the RBF SVM, quantifying the contributions of the features.
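The probing procedure described above can be sketched in a few lines of Python. This is only an illustration: `probe_feature_weights` and `toy_rbf_decision` are hypothetical names, and the toy decision function (two made-up support vectors, γ = 0.5) merely stands in for a trained RBF SVM, whose real outputs would come from the SVM software.

```python
import math

def probe_feature_weights(decision_fn, n_features):
    """Return decision_fn's output for each unit vector, one per feature."""
    weights = []
    for j in range(n_features):
        probe = [0.0] * n_features          # step 1: all-zero feature vector
        probe[j] = 1.0                      # step 2: switch on one feature
        weights.append(decision_fn(probe))  # step 3: record the SVM output
    return weights                          # step 4: repeated for all features

def toy_rbf_decision(x,
                     support_vectors=((1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
                     coeffs=(1.0, -1.0), gamma=0.5, bias=0.0):
    """Hypothetical stand-in for a trained RBF SVM: sum_s a_s * K(s, x) + b."""
    total = bias
    for sv, a in zip(support_vectors, coeffs):
        sq_dist = sum((u - v) ** 2 for u, v in zip(sv, x))
        total += a * math.exp(-gamma * sq_dist)
    return total

weights = probe_feature_weights(toy_rbf_decision, 3)
```

In this toy setting the first feature receives a positive weight, the third a negative weight of the same size, and the second a weight of zero, mirroring how the positive and negative support vectors are placed.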
Since our prediction method adopted the one-versus-rest method, we have one specific SVM for each subcellular location. Figure 3 shows the feature weights of the SVMs specifically for (A) SP and (B) "other" on the TargetP plant data set. Feature number j of the X-axis corresponds to the j-th component of a feature vector (see Equation 1). For easy understanding, we discuss the possible meaning of the features with the most positive weights. In Figure 3A, we can see that the weights of hydrophobic amino acids in the N-terminal 20 residues are large. Interestingly, the weights of cysteine in the N-terminal part are relatively large. It is noteworthy that the distance frequency of other amino acids in the middle part (h1(M)) has a large weight. In Figure 3B, it is clarified that aspartic and glutamic acids in the N-terminal 40 residues are important. We can also see that the weights of lysine in the N-terminal 20 residues and the middle part are large.
The above results indicate that the SVMs in our method were successfully trained, because their feature weights are consistent with features of signal sequences described later. Moreover, it is concluded that the first 20 residues in the N terminus are particularly important to predict the subcellular location.
| Materials and methods |
|---|
Important features of signal sequences
In general, proteins destined for chloroplast, mitochondria, and secretory pathway have signal sequences in their N termini. On the other hand, proteins destined for nucleus and cytosol have one or more signal sequences in the middle part of their sequence. Furthermore, chloroplast proteins transported into thylakoid have an internal signal sequence after the chloroplast transit peptide (cTP) (Keegstra and Cline 1999; Robinson et al. 2001).
The length of cTP is believed to be at most 100 residues. That of mitochondrial targeting peptide (mTP) ranges from 10 to 80 residues (Neupert 1997; Omura 1998). cTPs are rich in hydroxylated amino acids (S and T) and contain basic amino acids separated by gaps of several residues (Bruce 2000). mTPs, especially those for the mitochondrial matrix and intermembrane space, can form an amphipathic α-helix with basic amino acids (Omura 1998). Signal peptides (SPs) for secretion are abundant in hydrophobic amino acids (von Heijne 1990). Secretory proteins that have the KDEL or KKXX motif in their C terminus return from the Golgi apparatus to the endoplasmic reticulum (Cosson and Letourneur 1997).
The nuclear localization signal (NLS) and nuclear export signal (NES) are rich in basic and hydrophobic amino acids (particularly L, I, and V), respectively. The basic amino acids in NLS can form one or more clusters. NESs have the hydrophobic amino acids with approximately constant gaps between each hydrophobic amino acid. Some examples of signal sequences are summarized in Table 7.
Feature vector
First, we defined the N-terminal, middle, and C-terminal parts depending on sequence length L. Most of the sequences used here conform to the definition in Figure 4A. The N-terminal part is further divided into four regions of length dN, because we assumed that proteins are directed by the approximate amount of specific amino acids, so that the signal sequence is flexible, and that clusters of such amino acids can be distributed in various regions even in the N terminus. dN is set to 20 and 24 for eukaryotic and prokaryotic proteins, respectively. It was also assumed that the middle part has at least 20 residues, equal to the number of distinct amino acids. The length of the C-terminal part, dC, is set to nine and eight on the NNPSL and TargetP data sets, respectively.
4dN + dC, we assumed that the lengths of the N-terminal and middle parts are the same. That is, these lengths are defined by (L − dC)/2 and the N-terminal part is not divided at all (Fig. 4C).
The feature vector to represent protein sequence i is expressed as follows:
![]() | (1) |
where the capital letters N, M, C, and E indicate the N-terminal, middle, C-terminal, and entire parts, respectively. The entire part means the whole length of a sequence. The numerals in the parentheses (1–4) correspond to the regions in the N-terminal part in Figure 4, A and B. x1(p), ..., x20(p) indicate the composition of 20 amino acids in part p (p = 1, 2, 3, 4, M, C, E). y1(M), ..., y20(M) are the composition of 20 twin amino acids (e.g., RR, KK) in the middle part. In the case that a sequence is too short to be divided on its N-terminal part (see Fig. 4C), the amino acid composition of the whole N-terminal residues is equally assigned to the four regions, i.e., xj(1) = xj(2) = xj(3) = xj(4) (j = 1, ..., 20).
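As a rough illustration of the composition features, the following Python sketch computes the 20-component amino acid composition of a segment and a twin amino acid composition. The function names are hypothetical, and two details are assumptions rather than statements from the text: twin pairs are counted with overlap, and counts are divided by the segment length (or the number of adjacent pairs) instead of the maximum-based normalization used in the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def composition(segment):
    """Fraction of each of the 20 amino acids in a sequence segment."""
    n = len(segment)
    return [segment.count(a) / n if n else 0.0 for a in AMINO_ACIDS]

def twin_composition(segment):
    """Fraction of adjacent pairs that are twins such as RR or KK.

    Assumption: overlapping pairs are counted and divided by len(segment) - 1.
    """
    pairs = max(len(segment) - 1, 1)
    counts = {a: 0 for a in AMINO_ACIDS}
    for left, right in zip(segment, segment[1:]):
        if left == right and left in counts:
            counts[left] += 1
    return [counts[a] / pairs for a in AMINO_ACIDS]

# Short-sequence case: the composition of the undivided N-terminal part
# is copied into all four N-terminal regions.
n_terminal = "MRRKA"  # hypothetical N-terminal part
x_regions = [composition(n_terminal)] * 4

r = AMINO_ACIDS.index("R")
comp_rrka = composition("RRKA")
twin_rrka = twin_composition("RRKA")
```

For the toy segment RRKA, the R component of the composition is 0.5 and the RR component of the twin composition is 1/3 under these assumptions.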
f1(q), ..., f6(q) represent the distance frequencies of basic amino acids in part q (q = N, M). To calculate distance frequencies, we defined six distance classes (H = 1, 1 < H ≤ 6, 6 < H ≤ 11, 11 < H ≤ 16, 16 < H ≤ 21, H > 21). Similarly, g1(M), ..., g6(M) are the distance frequencies of hydrophobic amino acids and h1(M), ..., h6(M) are those of other amino acids in the middle part. Altogether, this feature vector has 184 components and each component is normalized between 0 and 1 by its possible maximum.
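The six distance classes and the component count can be checked with a short Python sketch. It assumes the class boundaries read as H = 1, 1 < H ≤ 6, 6 < H ≤ 11, 11 < H ≤ 16, 16 < H ≤ 21, and H > 21; `distance_class` is a hypothetical helper name.

```python
def distance_class(h):
    """Map a distance H >= 1 to one of the six classes (index 0 to 5)."""
    if h < 1:
        raise ValueError("distance must be at least 1")
    if h == 1:
        return 0
    for index, upper in enumerate((6, 11, 16, 21), start=1):
        if h <= upper:
            return index
    return 5  # H > 21

# Tally of feature-vector components described in the text.
n_composition = 20 * 7  # x_j(p) for p = 1, 2, 3, 4, M, C, E
n_twin = 20             # y_j(M)
n_basic = 6 * 2         # f_j(q) for q = N, M
n_hydrophobic = 6       # g_j(M)
n_other = 6             # h_j(M)
total_components = (n_composition + n_twin + n_basic
                    + n_hydrophobic + n_other)
```

The tally gives 140 + 20 + 12 + 6 + 6 = 184, which agrees with the number of components stated above.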
Distance frequency
In this work, we introduced a new feature, called "distance frequency" to encode a protein sequence. This is the frequency of the distance between two successive amino acids. For example, consider the following protein sequence:
![]() |
where underlined letters denote basic amino acids. The distances between successive basic amino acids, Hb, take the values 3, 2, 3, 2, and 3 starting from the left. Note that Hb is calculated in a left-to-right fashion. As a result, the distance frequencies for Hb = 2 and Hb = 3 are 2 and 3, respectively.
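A minimal Python sketch of this calculation follows. The sequence used is hypothetical (it is not the example from the paper); it was chosen so that its basic residues reproduce the distances 3, 2, 3, 2, and 3. The distance is taken to be the difference between the positions of two successive residues of the same class, which is an assumption consistent with the values quoted above.

```python
from collections import Counter

BASIC = set("RKH")  # basic amino acids as defined in this work

def successive_distances(sequence, residue_class):
    """Distances between successive residues of one class, left to right."""
    positions = [i for i, aa in enumerate(sequence) if aa in residue_class]
    return [b - a for a, b in zip(positions, positions[1:])]

example = "RAAKAHLLRAKGGH"  # hypothetical sequence, basic residues at 0, 3, 5, 8, 10, 13
distances = successive_distances(example, BASIC)
frequency = Counter(distances)
```

Here `distances` is `[3, 2, 3, 2, 3]`, so the distance frequencies for Hb = 2 and Hb = 3 are 2 and 3, as in the example in the text.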
SVM training
In order to implement SVM, we used the free software, SVMlight developed by Joachims (1999). As the kernel, the radial basis function (RBF) was selected because this function outperformed linear and polynomial kernels in terms of overall predictive accuracy (data not shown). The RBF kernel is defined by the following equation:
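$$K(\mathbf{v}_i, \mathbf{v}_j) = \exp\left(-\gamma \lVert \mathbf{v}_i - \mathbf{v}_j \rVert^{2}\right) \tag{2}$$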
where vi and vj are feature vectors representing protein sequences. The parameter γ in Equation 2 and the regularization parameter C are adjusted in training to produce reliable performance. As γ becomes smaller, the decision boundary for discriminating positive and negative examples becomes smoother. C controls the trade-off between training error and margin. We determined the two parameters as shown in Table 8 by trial and error. Other options for SVMlight are set to their defaults.
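The effect of γ can be illustrated with a small Python sketch. It assumes the usual RBF form exp(−γ‖vi − vj‖²); `rbf_kernel` is a hypothetical helper, and the vectors and γ values are arbitrary examples rather than the settings from Table 8.

```python
import math

def rbf_kernel(v_i, v_j, gamma):
    """RBF kernel value for two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(v_i, v_j))
    return math.exp(-gamma * sq_dist)

a = [0.2, 0.0, 0.5]
b = [0.1, 0.3, 0.4]  # squared distance to a is 0.11
k_small_gamma = rbf_kernel(a, b, 0.1)   # close to 1: distant points still interact
k_large_gamma = rbf_kernel(a, b, 10.0)  # much smaller: influence is more local
```

With a small γ the kernel value changes slowly with distance, which is why the resulting decision boundary is smoother.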
For multiclass classification, the one-versus-rest method (Schölkopf and Smola 2002; Nguyen and Rajapakse 2003) was adopted. That is, the l-th SVM is trained on sequences belonging to the l-th location with the positive label "+1" and on sequences belonging to the remaining locations with the negative label "−1." We also tested the one-versus-one method, but the overall accuracy was lower than that of the one-versus-rest method (data not shown).
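The labeling scheme can be sketched as follows in Python. Both function names are hypothetical. Choosing the location whose SVM gives the largest output is the common one-versus-rest decision rule and is assumed here; the text does not state how the final location is selected.

```python
def one_vs_rest_labels(locations, target):
    """Label sequences of the target location +1 and all others -1."""
    return [1 if loc == target else -1 for loc in locations]

def predict_location(decision_values):
    """Pick the location whose SVM output is largest (assumed decision rule)."""
    return max(decision_values, key=decision_values.get)

training_locations = ["SP", "mTP", "other", "SP"]  # toy annotations
labels_for_sp = one_vs_rest_labels(training_locations, "SP")

# Hypothetical outputs of the three location-specific SVMs for one query.
predicted = predict_location({"SP": -0.2, "mTP": 0.7, "other": 0.1})
```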
Measures for evaluation of the prediction performance
To evaluate the prediction performance of our method, sensitivity, specificity, Matthews (1975) correlation coefficient (MCC) for each subcellular location, and overall accuracy were calculated. The definitions of these measures are as follows:
$$\text{Sensitivity}(l) = \frac{tp(l)}{tp(l) + fn(l)} \tag{3}$$
$$\text{Specificity}(l) = \frac{tp(l)}{tp(l) + fp(l)} \tag{4}$$
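$$\text{MCC}(l) = \frac{tp(l)\,tn(l) - fp(l)\,fn(l)}{\sqrt{\bigl(tp(l) + fn(l)\bigr)\bigl(tp(l) + fp(l)\bigr)\bigl(tn(l) + fp(l)\bigr)\bigl(tn(l) + fn(l)\bigr)}} \tag{5}$$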
$$\text{Overall accuracy} = \frac{1}{n} \sum_{l=1}^{k} tp(l) \tag{6}$$
where n is the total number of protein sequences and k is the number of subcellular locations. tp(l) is the number of correctly predicted sequences belonging to location l (true positive). tn(l) is the number of correctly predicted sequences that do not belong to location l (true negative). fp(l) is the number of overpredicted sequences in location l (false positive). fn(l) is the number of underpredicted sequences in location l (false negative).
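The counts defined above translate directly into code. The Python sketch below computes tp, tn, fp, and fn for one location and derives the sensitivity, the MCC, and the overall accuracy from them, using the standard definitions of these measures. Specificity is left out of the sketch because its convention varies across the literature. All names and the toy labels are hypothetical.

```python
import math

def confusion_counts(true_labels, predicted_labels, location):
    """Return (tp, tn, fp, fn) for one subcellular location."""
    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == location and p == location:
            tp += 1
        elif t != location and p != location:
            tn += 1
        elif p == location:
            fp += 1  # overpredicted
        else:
            fn += 1  # underpredicted
    return tp, tn, fp, fn

def sensitivity(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def overall_accuracy(true_labels, predicted_labels):
    """Sum of tp(l) over all locations divided by n."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

true_labels = ["SP", "SP", "mTP", "other", "other", "mTP"]
predicted_labels = ["SP", "mTP", "mTP", "other", "SP", "mTP"]
tp, tn, fp, fn = confusion_counts(true_labels, predicted_labels, "SP")
sp_sensitivity = sensitivity(tp, fn)
sp_mcc = mcc(tp, tn, fp, fn)
accuracy = overall_accuracy(true_labels, predicted_labels)
```

For the toy labels, the SP counts are tp = 1, tn = 3, fp = 1, and fn = 1, giving a sensitivity of 0.5 and an MCC of 0.25, while four of the six sequences are predicted correctly overall.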
| Acknowledgments |
|---|
| References |
|---|
Bendtsen, J.D., Nielsen, H., von Heijne, G., and Brunak, S. 2004. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 340: 783–795.
Bhasin, M. and Raghava, G.P.S. 2004. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res. 32: W414–W419.
Bhasin, M., Garg, A., and Raghava, G.P.S. 2005. PSLpred: Prediction of subcellular localization of bacterial proteins. Bioinformatics 21: 2522–2524.
Bruce, B.D. 2000. Chloroplast transit peptides: Structure, function and evolution. Trends Biochem. Sci. 10: 440–447.
Cai, Y.-D. and Chou, K.-C. 2004. Predicting subcellular localization of proteins in a hybridization space. Bioinformatics 20: 1151–1156.
Cedano, J., Aloy, P., Pérez-Pons, J.A., and Querol, E. 1997. Relation between amino acid composition and cellular location of proteins. J. Mol. Biol. 266: 594–600.
Chou, K.-C. 2001. Prediction of protein cellular attributes using pseudoamino acid composition. Proteins 43: 246–255.
Chou, K.-C. and Cai, Y.-D. 2002. Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 277: 45765–45769.
Chou, K.-C. and Cai, Y.-D. 2003. A new hybrid approach to predict subcellular localization of proteins by incorporating gene ontology. Biochem. Biophys. Res. Commun. 311: 743–747.
Chou, K.-C. and Cai, Y.-D. 2004. Predicting subcellular localization of proteins by hybridizing functional domain composition and pseudo-amino acid composition. J. Cell. Biochem. 91: 1197–1203.
Chou, K.-C. and Elrod, D.W. 1998. Using discriminant function for prediction of subcellular location of prokaryotic proteins. Biochem. Biophys. Res. Commun. 252: 63–68.
Chou, K.-C. and Elrod, D.W. 1999. Protein subcellular location prediction. Protein Eng. 12: 107–118.
Chou, K.-C. and Zhang, C.T. 1995. Review: Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 30: 275–349.
Cosson, P. and Letourneur, F. 1997. Coatomer (COPI)-coated vesicles: Role in intracellular transport and protein sorting. Curr. Opin. Cell Biol. 9: 484–487.
Emanuelsson, O., Nielsen, H., and von Heijne, G. 1999. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci. 8: 978–984.
Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G. 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300: 1005–1016.
Guo, J., Lin, Y., and Sun, Z. 2005. A novel method for protein subcellular localization: Combining residue-couple model and SVM. Proceedings of the 3rd Asia-Pacific Bioinformatics Conference, pp. 117–129. Imperial College Press, Singapore.
Horton, P. and Nakai, K. 1997. Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology, pp. 147–152. AAAI Press, Menlo Park, CA.
Hua, S. and Sun, Z. 2001. Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17: 721–728.
Joachims, T. 1999. Making large-scale SVM learning practical. In Advances in kernel methods: Support vector learning (eds. B. Schölkopf et al.), pp. 41–56. MIT Press, Cambridge, MA.
Keegstra, K. and Cline, K. 1999. Protein import and routing systems of chloroplasts. Plant Cell 11: 557–570.
Kim, J.K., Raghava, G.P.S., Kim, K.S., Bang, S.Y., and Choi, S. 2004. Prediction of subcellular localization of proteins using pairwise sequence alignment and support vector machine. Proceedings of the 3rd Annual Conference of the Korean Society for Bioinformatics, pp. 158–166. Seoul, Korea.
la Cour, T., Gupta, R., Rapacki, K., Skriver, K., Poulsen, F.M., and Brunak, S. 2003. NESbase version 1.0: A database of nuclear export signals. Nucleic Acids Res. 31: 393–396.
Matthews, B.W. 1975. Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405: 442–451.
Miyakawa, K. and Imamura, T. 2003. Secretion of FGF-16 requires an uncleaved bipartite signal sequence. J. Biol. Chem. 278: 35718–35724.
Myers, E.W. and Miller, W. 1988. Optimal alignments in linear space. Comput. Appl. Biosci. 4: 11–17.
Nakai, K. and Kanehisa, M. 1992. A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14: 897–911.
Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443–453.
Neupert, W. 1997. Protein import into mitochondria. Annu. Rev. Biochem. 66: 863–917.
Nguyen, M.N. and Rajapakse, J.C. 2003. Multi-class support vector machines for protein secondary structure prediction. Genome Inform. Ser. Workshop Genome Inform. 14: 218–227.
Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. 1997. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10: 1–6.
Omura, T. 1998. Mitochondria-targeting sequence, a multi-role sorting sequence recognized at all steps of protein import into mitochondria. J. Biochem. 123: 1010–1016.
Park, K.-J. and Kanehisa, M. 2003. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 19: 1656–1663.
Pearson, W.R. 1990. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183: 63–98.
Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence analysis. Proc. Natl. Acad. Sci. 85: 2444–2448.
Reczko, M. and Hatzigeorgiou, A. 2004. Prediction of the subcellular localization of eukaryotic proteins using sequence signals and composition. Proteomics 4: 1591–1596.
Reinhardt, A. and Hubbard, T. 1998. Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res. 26: 2230–2236.
Robinson, C., Thompson, S.J., and Woolhead, C. 2001. Multiple pathways used for the targeting of thylakoid proteins in chloroplasts. Traffic 2: 245–251.
Schölkopf, B. and Smola, A.J. 2002. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge, MA.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195–197.
von Heijne, G. 1990. The signal peptide. J. Membr. Biol. 115: 195–201.
Yuan, Z. 1999. Prediction of protein subcellular locations using Markov chain models. FEBS Lett. 451: 23–26.