Protein Science
Protein Science (2005), 14:2804-2813. Published by Cold Spring Harbor Laboratory Press. Copyright © 2005 The Protein Society

A novel representation of protein sequences for prediction of subcellular location using support vector machines

Setsuro Matsuda1, Jean-Philippe Vert2, Hiroto Saigo1, Nobuhisa Ueda1, Hiroyuki Toh3 and Tatsuya Akutsu1

1 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0111, Japan
2 Centre de Géostatistique, Ecole des Mines de Paris, 77300 Fontainebleau, France
3 Division of Bioinformatics, Medical Institute of Bioregulation, Kyushu University, Fukuoka 812-8582, Japan

Reprint requests to: Setsuro Matsuda, Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan; e-mail: smatsuda{at}kuicr.kyoto-u.ac.jp; fax: +81-774-38-3022.

(RECEIVED May 20, 2005; FINAL REVISION August 22, 2005; ACCEPTED August 22, 2005)


    Abstract
As the number of complete genomes rapidly increases, accurate methods to automatically predict the subcellular location of proteins are increasingly useful to help their functional annotation. In order to improve the predictive accuracy of the many prediction methods developed to date, a novel representation of protein sequences is proposed. This representation involves local compositions of amino acids and twin amino acids, and local frequencies of distance between successive (basic, hydrophobic, and other) amino acids. For calculating the local features, each sequence is split into three parts: N-terminal, middle, and C-terminal. The N-terminal part is further divided into four regions to consider ambiguity in the length and position of signal sequences. We tested this representation with support vector machines on two data sets extracted from the SWISS-PROT database. Through fivefold cross-validation tests, overall accuracies of more than 87% and 91% were obtained for eukaryotic and prokaryotic proteins, respectively. It is concluded that considering the respective features in the N-terminal, middle, and C-terminal parts is helpful to predict the subcellular location.

Keywords: subcellular location; signal sequence; amino acid composition; distance frequency; support vector machine; predictive accuracy

Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.051597405.


    Introduction
Predicting the subcellular location of proteins is important to infer their biological function. As the number of complete genomes rapidly increases, accurate methods that automatically predict the subcellular location become more necessary. In particular, in the case where no homologous protein is found in protein databases, such methods are important tools to help annotate the function of unknown proteins.

Many efforts have been made to develop prediction methods to date. PSORT (Nakai and Kanehisa 1992; Horton and Nakai 1997) is historically the first method for predicting subcellular locations. It uses various sequence-derived features such as the presence of sequence motifs and amino acid compositions. Most existing methods can be roughly classified into two groups according to their input data. One is the method based on the N-terminal sequence of a protein and the other on its amino acid composition. TargetP (Emanuelsson et al. 2000) requires the N-terminal sequence as an input into two layers of artificial neural networks (ANNs), and can also predict the peptidase-cleaved site of a protein. The first layer comprises the earlier binary predictors, SignalP (Nielsen et al. 1997) and ChloroP (Emanuelsson et al. 1999). Reczko and Hatzigeorgiou (2004) used a bidirectional recurrent neural network with the first 90 residues in the N-terminal sequence. Yuan (1999) applied the Markov chain model to the prediction, but the entire sequence was used as the input data.

ProtLock (Cedano et al. 1997) requires the amino acid composition and is based on the least Mahalanobis distance algorithm. Chou and Elrod (1998, 1999) also used the amino acid composition but the covariant discriminant algorithm was employed in their method. NNPSL (Reinhardt and Hubbard 1998) is an ANN-based method using the amino acid composition. After the successful report in Reinhardt and Hubbard (1998), application of machine learning techniques became popular in this field. For SubLoc (Hua and Sun 2001), a support vector machine (SVM) was implemented instead of the ANN. It is expected that incorporating an amino acid order as well as the amino acid composition makes it possible to improve prediction performance. Chou (2001) proposed the pseudo–amino acid composition to take the effect of the amino acid order into account. Furthermore, Cai and Chou (2004) have recently developed an accurate method integrating the pseudo-amino acid composition, the functional domain composition (Chou and Cai 2002, 2004), and the information of gene ontology (Chou and Cai 2003). Park and Kanehisa (2003) developed an SVM-based method that incorporates compositions of dipeptides and gapped amino acid pairs in addition to the conventional amino acid composition. The concepts of the pseudo–amino acid and gapped amino acid pair compositions were merged in the residue-couple model proposed by Guo et al. (2005).

Incorporating the information of homology search can also improve the prediction performance (Bhasin and Raghava 2004; Kim et al. 2004; Bhasin et al. 2005). However, one should pay much attention to the sequence similarity between training and test data in evaluating prediction methods based on homology search. If a query sequence in the test data has a high similarity with a sequence in the training data, then its subcellular location can be easily predicted without using a complicated predictor. In other words, the data set used for training and testing must be sufficiently redundancy-reduced.

Although Reinhardt and Hubbard (1998) pointed out that prediction methods based on the amino acid composition are robust to gene annotation errors in the 5'-region, using the amino acid composition alone loses the information carried by signal sequences. To overcome this problem, concepts such as the pseudo–amino acid composition have been introduced. In this work, we propose a novel representation of protein sequences to further improve the accuracy of prediction methods. Our method, which employs the SVM with RBF kernel, is based on local compositions of amino acids and twin amino acids, and local frequencies of distance between successive amino acids. As benchmark data, we adopt the data sets provided by Reinhardt and Hubbard (1998) and Emanuelsson et al. (2000) because they have been widely used in earlier studies. For convenience, we call the former "NNPSL data sets" and the latter "TargetP data sets."

Each amino acid is represented by its one-letter code hereafter. In this work, basic amino acids encompass R, K, and H. Hydrophobic amino acids are I, V, L, F, M, A, G, W, and P. The remainder, D, N, E, Q, Y, S, T, and C are called "other amino acids."
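The three residue classes defined above can be written down directly. The following is a minimal Python sketch; the names BASIC, HYDROPHOBIC, OTHER, and residue_class are illustrative and do not come from the original work.

```python
# Residue classes as defined in the text (one-letter codes).
# The constant and function names are illustrative assumptions.
BASIC = set("RKH")
HYDROPHOBIC = set("IVLFMAGWP")
OTHER = set("DNEQYSTC")


def residue_class(aa: str) -> str:
    """Return 'basic', 'hydrophobic', or 'other' for a standard amino acid."""
    if aa in BASIC:
        return "basic"
    if aa in HYDROPHOBIC:
        return "hydrophobic"
    if aa in OTHER:
        return "other"
    raise ValueError(f"unexpected residue: {aa!r}")
```

The three sets are disjoint and together cover the 20 standard amino acids, so every residue of a sequence without ambiguous codes falls into exactly one class.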


    Results
Tables 1 and 2 compare the predictive accuracies of our method with those of existing methods on the NNPSL data sets. Likewise, Tables 3 and 4 show the comparison on the TargetP data sets. The overall accuracies of Chou and Cai (2003, 2004) are remarkably high for all data sets. The sensitivity, specificity, and MCC of their methods are not shown because these values are not given in their papers. As explained in the previous section, their methods require gene ontology and functional domain information retrieved from the InterPro database (Apweiler et al. 2001). In contrast, all other methods use sequence information alone. Therefore, we cannot compare their methods with the other methods directly. It should be pointed out, however, that our protein representation may be incorporated into their methods.


Table 1. Comparison of predictive accuracies for eukaryotic proteins in the NNPSL data set
 

Table 2. Comparison of predictive accuracies for prokaryotic proteins in the NNPSL data set
 

Table 3. Comparison of predictive accuracies for plant proteins in the TargetP data set
 

Table 4. Comparison of predictive accuracies for non-plant proteins in the TargetP data set
 
In Table 1, our sensitivity for mitochondrial proteins is 0.13–0.25 higher than that of the other four methods. Since the average MCC of our method (0.82) is the highest, the performance of our method appears to be well balanced. In Table 2, the overall accuracy of our method is close to that of Guo et al. (2005). Although the performance of machine learning techniques such as ANN and SVM is affected by the number of training data, it seems that the discrimination between cytoplasmic and the remaining (Extra and Peri) proteins is relatively easy.

In Tables 3 and 4, our overall accuracies are the highest if we consider the jackknife accuracies of Chou and Cai (2004). According to Chou and Zhang (1995), the jackknife test is more rigorous and objective than the cross-validation test, because the number of possible data divisions is too large to be handled in the latter test. However, we adopted the cross-validation test to save CPU time and to compare our method with as many recent methods as possible. For plant proteins, our sensitivities for chloroplast, nuclear, and cytosolic (other) proteins are lower than those of Emanuelsson et al. (2000), but the sensitivity for mitochondrial proteins was improved by 0.104. For non-plant proteins, our sensitivity for nuclear and cytosolic proteins is higher than that of any other method. It is noteworthy that the MCCs of our method are over 0.82 for all locations.

To compare the predictive accuracies under the same conditions, we implemented the method proposed by Kim et al. (2004). Therefore, the values of sensitivity, specificity, MCC, and overall accuracy are different from those in Kim et al. (2004). They also employed the SVM with RBF kernel and characterized protein sequences by the Needleman-Wunsch scores (Needleman and Wunsch 1970) against all the sequences in the training data. ALIGN0 (Myers and Miller 1988) in the FASTA 2.0 package (Pearson and Lipman 1988; Pearson 1990) was used for calculating the scores. The gap penalty is –3 and the scoring matrix is BLOSUM50. Each sequence was truncated after the N-terminal 90 residues for the calculation. The values of the regularization parameter C and the parameter γ of the RBF kernel are the same as those in Kim et al. (2004; see Table 8, below).


Table 8. Regularization parameter C and parameter γ of RBF kernel used in the SVM training
 

    Discussion
We proposed a new representation for protein sequences using distance frequencies of basic, hydrophobic, and other amino acids and separated a protein sequence into several regions. In this section, we discuss how the distance frequency is useful and whether the separation of a sequence is meaningful. Furthermore, we estimate and analyze the weights of features used in the representation.

Usefulness of the distance frequency
The distance frequency was developed with the nuclear export signal (NES) and the chloroplast transit peptide in mind. In Figure 1, we visualized distance frequencies of three hydrophobic amino acids (L, I, and V) for 75 protein sequences containing the NES (called "with NES"). These sequences were downloaded from NES-base 1.0 (la Cour et al. 2003) and their NESs were experimentally verified. We also depicted the distance frequencies for the 75 sequences with their NES removed (called "without NES"). These frequencies are slightly smaller than those of "with NES" at H = 2, 3, 4, where H represents the distance between successive amino acids. This decline implies that the distance frequency modestly reflects the existence of NES.



Figure 1. Distance frequencies of three hydrophobic amino acids (L, I, and V) for 75 protein sequences containing the NES (with NES). The dotted line shows the distance frequencies for the 75 sequences with their NES removed (without NES). Each value of the frequency was divided by sequence length and averaged over the 75 sequences.

 
Distance frequencies for plant proteins in the TargetP data set are shown in Figure 2. Figure 2, A and B, shows the distance frequencies of basic amino acids in the N-terminal and middle parts, respectively. Figure 2, C and D, shows those of hydrophobic and other amino acids in the middle part. In Figure 2A, the distance frequency for mTP is the largest and that for SP is the smallest when 1 < H ≤ 6. The difference between the distance frequencies for these two locations is considerably larger than in Figure 2B. This indicates that the distance frequency is useful for discriminating between mitochondrial and secretory proteins. In Figure 2, C and D, the differences in frequency among the four locations seem to be small. However, the performance of our method was improved by incorporating the frequencies of hydrophobic and other amino acids.



Figure 2. Distance frequencies of basic amino acids in the N-terminal part (A), basic amino acids in the middle part (B), hydrophobic amino acids in the middle part (C), and other amino acids in the middle part (D), for the TargetP plant proteins. Each value of the frequency was divided by sequence length and averaged over all sequences belonging to each subcellular location. The X-axis is common to the four panels. cTP, mTP, SP, and "other" indicate proteins destined for chloroplast, mitochondria, secretory pathway, and other locations (nucleus and cytosol), respectively.

 
Implication of separating a protein sequence
Each sequence was separated into three parts: N-terminal, middle, and C-terminal. The N-terminal part was further divided into four regions in calculating local amino acid compositions. Each region has 20 residues, except for prokaryotic proteins (24 residues). We tested different region lengths in the range 19–25. Interestingly, the highest overall accuracy was always obtained by using 20 residues. As a transmembrane domain consists of ~20 hydrophobic residues, the 20-residue length may have a biological meaning. For the C-terminal part, we changed the number of residues from 6 to 10. We observed that nine and eight residues were suitable on the NNPSL and TargetP data sets, respectively. The overall accuracy drastically varied depending on the number of residues in the C-terminal part. This indicates that considering the amino acid composition in the C terminus is important to predict the subcellular location. Although peroxisomal proteins, which can have the SKL motif in their C terminus, are not handled in this work, our method would be able to capture the peroxisomal targeting signal. We also examined two cases where the N-terminal part is divided into three regions with 30 residues and not divided at all. However, the overall accuracies were lower than the case mentioned above (four regions with 20 residues). As for the SVM parameters, we found that the effect of tuning them is not so critical compared with the adjustment of sequence lengths of the N-terminal and C-terminal parts.

Internal signal sequences, which are positioned in the middle part, are unclear compared with ones in the N-terminal and C-terminal parts. But some biological experiments indicate the importance of signal sequences in the middle part. Miyakawa and Imamura (2003) found out that two fibroblast growth factors FGF-9 and FGF-16 require both the N-terminal region and central hydrophobic region as a secretory signal. This hydrophobic region belongs to the middle part here. Furthermore, this bipartite signal sequence is not cleaved off by proteases during the transport process. We collected three sequences: human FGF-9, human FGF-16, and rat FGF-16 from databases available on the Internet and then predicted their subcellular locations by SignalP 3.0 (Bendtsen et al. 2004). This is the latest version of SignalP and employs both the ANN and hidden Markov model. As a result, these sequences were predicted as nonsecretory proteins. In contrast our method, which can consider the features in the middle part, correctly predicted all the sequences as secretory proteins.

From the aforementioned fact, it is concluded that separating a protein sequence into the N-terminal, middle, and C-terminal parts is helpful to capture signal sequences. In addition, our method has an advantage of small CPU time requirement to construct the feature vector compared with the method proposed by Kim et al. (2004).

Feature weights
Here we describe how to estimate the importance of each feature and discuss the relation between these features and subcellular locations. Unlike a linear SVM, the RBF SVM does not assign a weight to each feature. To estimate the importance of each feature, we used the following procedure: (1) prepare a feature vector whose components are all 0, (2) assign 1 to the feature whose importance is to be estimated, (3) feed this vector into the trained SVMs and obtain their outputs, and (4) repeat steps 1–3 for all features. The outputs are regarded as the weights of the RBF SVM, quantifying the contributions of the features.
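The probing procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the decision function is a hand-written toy RBF expansion (support vectors, coefficients, bias, and γ are all made up), and the names rbf_decision and probe_feature_weights are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def rbf_decision(x: Sequence[float],
                 support_vectors: Sequence[Sequence[float]],
                 coeffs: Sequence[float],
                 bias: float,
                 gamma: float) -> float:
    """Toy RBF SVM output: bias + sum_s coeff_s * exp(-gamma * ||x - s||^2)."""
    total = bias
    for sv, coeff in zip(support_vectors, coeffs):
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, sv))
        total += coeff * math.exp(-gamma * sq_dist)
    return total


def probe_feature_weights(decision: Callable[[Sequence[float]], float],
                          n_features: int) -> List[float]:
    """Estimate one weight per feature by feeding unit vectors to the model."""
    weights = []
    for j in range(n_features):
        probe = [0.0] * n_features        # step 1: all-zero feature vector
        probe[j] = 1.0                    # step 2: switch on feature j only
        weights.append(decision(probe))   # step 3: record the SVM output
    return weights                        # step 4: repeated for every feature


# Made-up model with three features: one positive and one negative support vector.
toy_svs = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
toy_coeffs = [1.0, -1.0]
toy_weights = probe_feature_weights(
    lambda x: rbf_decision(x, toy_svs, toy_coeffs, bias=0.0, gamma=1.0), 3)
```

In this toy model the first feature receives a positive weight, the third a negative one, and the second a weight of zero, which mirrors how the sign of the output is read as evidence for or against a location.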

Since our prediction method adopted the one-versus-rest method, we have one specific SVM for each subcellular location. Figure 3 shows the feature weights of the SVMs specifically for (A) SP and (B) "other" on the TargetP plant data set. Feature number j of the X-axis corresponds to the j-th component of a feature vector (see Equation 1). For simplicity, we discuss the possible meaning of the features with the most positive weights. In Figure 3A, we can see that the weights of hydrophobic amino acids in the N-terminal 20 residues are large. Interestingly, the weights of cysteine in the N-terminal part are relatively large. It is noteworthy that the distance frequency of other amino acids in the middle part (h1(M)) has a large weight. Figure 3B shows that aspartic and glutamic acids in the N-terminal 40 residues are important. We can also see that the weights of lysine in the N-terminal 20 residues and the middle part are large.



Figure 3. Feature weights of the SVMs specifically for SP (A) and "other" (B) on the TargetP plant data set. Feature number j of the X-axis corresponds to the j-th component of a feature vector. The capital letters represent amino acids and the superscripts indicate a region in a protein sequence. Refer to the definitions of the regions in Figure 4. h1(M) represents the distance frequency of other amino acids in the middle part.

 
With respect to the SVM for cTP, it was confirmed that serine and threonine in the N-terminal 20 residues are strongly weighted. As to that for mTP, the weight of arginine in the N-terminal 20 residues was solely positive.

The above results indicate that the SVMs in our method were successfully trained, because their feature weights are consistent with features of signal sequences described later. Moreover, it is concluded that the first 20 residues in the N terminus are particularly important to predict the subcellular location.


    Materials and methods
Data sets
In this work, the data sets provided by Reinhardt and Hubbard (1998) and Emanuelsson et al. (2000) were used. Both data sets (the NNPSL and TargetP data sets) were collected from the SWISS-PROT database, and any sequences containing ambiguous residues such as X and B were excluded. The NNPSL data sets (Table 5) consist of eukaryotic and prokaryotic proteins but do not include plant proteins. Within each subcellular location, none of the sequences has more than 90% identity to any other sequence. This criterion to reduce the redundancy is not strict, because we found that the subcellular locations except for mitochondria and periplasm can be predicted with a high accuracy (>82%) by simple homology search using the Smith-Waterman algorithm (Smith and Waterman 1981).


Table 5. Number of sequences in each subcellular location on the NNPSL data sets
 
The TargetP data sets comprise two sets: plant and non-plant proteins (Table 6). However, the mitochondrial proteins contain sequences from both plant and non-plant proteins, because the number of mitochondrial proteins extracted from SWISS-PROT was too small to be used. The redundancy reduction for plant proteins in cTP, mTP, and SP and for non-plant proteins in SP was done on their pre-sequence plus the first residue of the mature protein. Plant and non-plant proteins in "other" were redundancy-reduced on their N-terminal 68 residues. The remaining group, non-plant proteins in mTP, was redundancy-reduced on the mitochondrial targeting peptide plus three residues. To check the effect of the redundancy reduction, we predicted the subcellular locations based on the Smith-Waterman score. That is, the location of the training sequence with the highest score is assigned to the corresponding query sequence in the test data. As a result, we obtained overall accuracies of 75.7% and 84.0% for plant and non-plant proteins, respectively. This indicates that the redundancy in the TargetP data sets is relatively small.
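The homology-based check described above amounts to a nearest-neighbor rule on local alignment scores. The sketch below illustrates the idea only: it uses a simple match/mismatch score with a linear gap penalty rather than a substitution matrix, since the scoring scheme used for this check is not stated here, and the function names and toy sequences are hypothetical.

```python
from typing import List, Tuple


def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The match/mismatch/gap values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def nearest_neighbor_location(query: str,
                              training: List[Tuple[str, str]]) -> str:
    """Assign the location of the highest-scoring training sequence."""
    _, location = max(training,
                      key=lambda item: smith_waterman_score(query, item[0]))
    return location


# Toy training data (made-up sequences and labels).
toy_training = [("MKRLLT", "mTP"), ("GGGGGG", "SP")]
toy_prediction = nearest_neighbor_location("MKRLLS", toy_training)
```

A high accuracy from such a rule signals that test sequences have close relatives in the training data, which is why it serves as a redundancy check rather than as a predictor in its own right.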


Table 6. Number of sequences in each subcellular location on the TargetP data sets
 
In order to perform a fivefold cross-validation test, each data set was partitioned into five subsets that have approximately equal sizes. Before partitioning, we shuffled the sequences within each set by using at least 1000 random numbers. One subset is regarded as test data and the remaining four subsets as training data. This procedure is repeated five times so that each subset is used as test data once.
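The partitioning step can be sketched as follows. This is a minimal illustration assuming a plain shuffle followed by an even split; whether the original partitioning was stratified by location is not stated, and the function name five_fold_splits and the seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def five_fold_splits(items: Sequence[T], k: int = 5,
                     seed: int = 0) -> List[Tuple[List[T], List[T]]]:
    """Shuffle the items, cut them into k near-equal subsets, and return
    (training, test) pairs in which each subset is the test data once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)          # fixed seed for repeatability
    folds = [shuffled[i::k] for i in range(k)]     # sizes differ by at most one
    splits = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, test))
    return splits


# Example with 23 dummy sequence identifiers.
toy_splits = five_fold_splits(list(range(23)))
```

Each item appears in exactly one test subset, so the accuracies computed over the five rounds cover every sequence once.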

Important features of signal sequences
In general, proteins destined for chloroplast, mitochondria, and secretory pathway have signal sequences in their N termini. On the other hand, proteins destined for nucleus and cytosol have one or more signal sequences in the middle part of their sequence. Furthermore, chloroplast proteins transported into thylakoid have an internal signal sequence after the chloroplast transit peptide (cTP) (Keegstra and Cline 1999; Robinson et al. 2001).

The length of cTP is believed to be at most 100 residues. That of mitochondrial targeting peptide (mTP) ranges from 10 to 80 residues (Neupert 1997; Omura 1998). cTPs are rich in hydroxylated amino acids (S and T) and have basic amino acids separated by gaps of several residues (Bruce 2000). mTPs, especially those for the mitochondrial matrix and intermembrane space, can form an amphipathic α-helix with basic amino acids (Omura 1998). Signal peptides (SPs) for secretion are abundant in hydrophobic amino acids (von Heijne 1990). Secretory proteins that have the KDEL or KKXX motif in their C terminus return from the Golgi apparatus to the endoplasmic reticulum (Cosson and Letourneur 1997).

The nuclear localization signal (NLS) and nuclear export signal (NES) are rich in basic and hydrophobic amino acids (particularly L, I, and V), respectively. The basic amino acids in NLS can form one or more clusters. In NESs, the hydrophobic amino acids are separated by approximately constant gaps. Some examples of signal sequences are summarized in Table 7.


Table 7. Signal sequences and their target locations
 
Although sequence motifs such as those above were clarified by biological experiments, consensus sequences as localization signals are still obscure. This indicates that prediction of the subcellular location should not depend much on motif finding. As stated in the introduction of this paper, prediction methods based on the amino acid composition take into account only the whole length of a sequence, and the methods based on the N-terminal sequence ignore the existence of signal sequences in the middle and C-terminal parts. Therefore, it should be effective to treat the three parts (N-terminal, middle, and C-terminal) separately when characterizing protein sequences.

Feature vector
First, we defined the N-terminal, middle, and C-terminal parts depending on sequence length L. Most of the sequences used here conform to the definition in Figure 4A. The N-terminal part is further divided into four regions of length dN, because we assumed that proteins are directed by the approximate amount of specific amino acids, which makes the signal sequence flexible, and that clusters of such amino acids can be distributed in various regions even within the N terminus. dN is set to 20 and 24 for eukaryotic and prokaryotic proteins, respectively. It was also assumed that the middle part has at least 20 residues, equal to the number of distinct amino acids. The length of the C-terminal part, dC, is set to nine and eight on the NNPSL and TargetP data sets, respectively.



Figure 4. Definitions of the N-terminal, middle, and C-terminal parts depending on sequence length L. dN represents the length of a region in the N-terminal part (in gray). dC is the length of the C-terminal part (in black).

 
For short sequences, we prepared two more definitions. If L is >4dN + dC and <4dN + 20 + dC, the middle part is regarded as the 20 residues extending from the start of the C-terminal part toward the N-terminal part (Fig. 4B). If L is ≤4dN + dC, we assumed that the lengths of the N-terminal and middle parts are the same. That is, these lengths are defined by (L − dC)/2 and the N-terminal part is not divided at all (Fig. 4C). Only 3.7%–6.3% of the sequences in the data sets satisfy L < 4dN + 20 + dC.
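The three length-dependent definitions can be sketched as a single splitting function. This is an illustration, not the authors' implementation: the function name split_sequence is hypothetical, the defaults dN = 20 and dC = 8 are the values quoted for the TargetP data sets, and the handling of an odd L − dC in the shortest case (the extra residue goes to the middle part) is an assumption, since the text does not specify it.

```python
from typing import List, Tuple


def split_sequence(seq: str, d_n: int = 20, d_c: int = 8,
                   min_middle: int = 20) -> Tuple[List[str], str, str]:
    """Return (N-terminal regions, middle part, C-terminal part)."""
    length = len(seq)
    if length <= d_c:
        raise ValueError("sequence must be longer than the C-terminal part")
    c_start = length - d_c
    c_part = seq[c_start:]
    if length > 4 * d_n + d_c:
        # Four N-terminal regions of d_n residues each.
        n_regions = [seq[i * d_n:(i + 1) * d_n] for i in range(4)]
        if length >= 4 * d_n + min_middle + d_c:
            # Usual case: the middle part is everything in between.
            middle = seq[4 * d_n:c_start]
        else:
            # Short case: the 20 residues preceding the C-terminal part,
            # which overlap the end of the N-terminal part.
            middle = seq[c_start - min_middle:c_start]
    else:
        # Shortest case: undivided N-terminal part and a middle part of
        # (L - d_c) / 2 residues each (assumption: middle takes any odd residue).
        half = (length - d_c) // 2
        n_regions = [seq[:half]]
        middle = seq[half:c_start]
    return n_regions, middle, c_part
```

For a 200-residue sequence with these defaults, the function returns four 20-residue regions, a 112-residue middle part, and an 8-residue C-terminal part.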

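The feature vector to represent protein sequence i is expressed as follows:

$$
\mathbf{v}_i = \bigl( x_1^{(1)},\ldots,x_{20}^{(1)},\; x_1^{(2)},\ldots,x_{20}^{(2)},\; x_1^{(3)},\ldots,x_{20}^{(3)},\; x_1^{(4)},\ldots,x_{20}^{(4)},\; x_1^{(M)},\ldots,x_{20}^{(M)},\; x_1^{(C)},\ldots,x_{20}^{(C)},\; x_1^{(E)},\ldots,x_{20}^{(E)},\; y_1^{(M)},\ldots,y_{20}^{(M)},\; f_1^{(N)},\ldots,f_6^{(N)},\; f_1^{(M)},\ldots,f_6^{(M)},\; g_1^{(M)},\ldots,g_6^{(M)},\; h_1^{(M)},\ldots,h_6^{(M)} \bigr) \tag{1}
$$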

where the capital letters N, M, C, and E indicate the N-terminal, middle, C-terminal, and entire parts. The entire part means the whole length of a sequence. The numerals in the parentheses (1–4) correspond to the regions in the N-terminal part in Figure 4, A and B. x1(p), ..., x20(p) indicate the composition of 20 amino acids in part p (p = 1, 2, 3, 4, M, C, E). y1(M), ..., y20(M) are the composition of 20 twin amino acids (e.g., RR, KK) in the middle part. In the case that a sequence is too short to be divided on its N-terminal part (see Fig. 4C), the amino acid composition of the whole N-terminal residues is equally assigned to the four regions, i.e., xj(1) = xj(2) = xj(3) = xj(4) (j = 1, ..., 20).
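The composition terms can be illustrated with a short Python sketch. Several details are assumptions made for the example rather than statements from the text: compositions are computed as counts divided by the number of positions, twin amino acids are counted over overlapping adjacent pairs (so RRR contains two RR pairs), and the residue order in AMINO_ACIDS is alphabetical. The function names are hypothetical.

```python
from typing import List

# Alphabetical one-letter codes; the ordering is an illustrative assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(part: str) -> List[float]:
    """Fraction of each of the 20 amino acids in a sequence part."""
    n = len(part)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [part.count(aa) / n for aa in AMINO_ACIDS]


def twin_composition(part: str) -> List[float]:
    """Fraction of adjacent pairs that are twins (AA, CC, ..., YY)."""
    n_pairs = len(part) - 1
    if n_pairs <= 0:
        return [0.0] * len(AMINO_ACIDS)
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for left, right in zip(part, part[1:]):
        if left == right and left in counts:
            counts[left] += 1
    return [counts[aa] / n_pairs for aa in AMINO_ACIDS]
```

For example, composition("AACK") assigns 0.5 to A and 0.25 each to C and K, and twin_composition("RRKKK") assigns 0.25 to RR and 0.5 to KK.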

f1(q), ..., f6(q) represent the distance frequencies of basic amino acids in part q (q = N, M). To calculate distance frequencies, we defined six distance classes (H = 1, 1 < H ≤ 6, 6 < H ≤ 11, 11 < H ≤ 16, 16 < H ≤ 21, H > 21). Similarly, g1(M), ..., g6(M) are the distance frequencies of hydrophobic amino acids and h1(M), ..., h6(M) are those of other amino acids in the middle part. Altogether, this feature vector has 184 components and each component is normalized between 0 and 1 by its possible maximum.
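As a quick check of the bookkeeping, the six distance classes and the total of 184 components can be written out in Python. The function name distance_class and the constant names are illustrative; the class boundaries and the component counts are taken from the text.

```python
def distance_class(h: int) -> int:
    """Map a distance H >= 1 to the index (0-5) of its distance class."""
    if h < 1:
        raise ValueError("distance must be at least 1")
    if h == 1:
        return 0      # H = 1
    if h <= 6:
        return 1      # 1 < H <= 6
    if h <= 11:
        return 2      # 6 < H <= 11
    if h <= 16:
        return 3      # 11 < H <= 16
    if h <= 21:
        return 4      # 16 < H <= 21
    return 5          # H > 21


# Component counts of the feature vector.
N_COMPOSITION = 7 * 20   # parts 1, 2, 3, 4, M, C, E with 20 amino acids each
N_TWIN = 20              # twin amino acids in the middle part
N_DISTANCE = 4 * 6       # f(N), f(M), g(M), h(M) with six classes each
FEATURE_DIM = N_COMPOSITION + N_TWIN + N_DISTANCE
```

The sum 140 + 20 + 24 reproduces the 184 components stated above.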

Distance frequency
In this work, we introduced a new feature, called "distance frequency" to encode a protein sequence. This is the frequency of the distance between two successive amino acids. For example, consider the following protein sequence:


where underlined letters denote basic amino acids. The distances between successive basic amino acids, Hb, take the values 3, 2, 3, 2, and 3 starting from the left. Note that Hb is calculated in a left-to-right fashion. As a result, the distance frequencies for Hb = 2 and Hb = 3 are 2 and 3, respectively.
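The calculation can be sketched in Python as follows. The sequence used here is a made-up example chosen so that its basic residues reproduce the distances 3, 2, 3, 2, and 3; it is not the sequence from the original example, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Set


def successive_distances(seq: str, residues: Set[str]) -> List[int]:
    """Distances between successive occurrences of the given residues,
    read left to right (adjacent residues have distance 1)."""
    positions = [i for i, aa in enumerate(seq) if aa in residues]
    return [b - a for a, b in zip(positions, positions[1:])]


def distance_frequency(seq: str, residues: Set[str]) -> Counter:
    """Number of times each distance occurs."""
    return Counter(successive_distances(seq, residues))


# Hypothetical sequence with basic residues (R, K, H) at positions 0, 3, 5, 8, 10, 13.
toy_seq = "RASKLHGTRVKDEH"
toy_distances = successive_distances(toy_seq, set("RKH"))
toy_freq = distance_frequency(toy_seq, set("RKH"))
```

Here toy_freq counts the distance 2 twice and the distance 3 three times, matching the frequencies described above.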

SVM training
To implement the SVM, we used the free software SVMlight developed by Joachims (1999). As the kernel, the radial basis function (RBF) was selected because this function outperformed linear and polynomial kernels in terms of overall predictive accuracy (data not shown). The RBF kernel is defined by the following equation:

$$
K(\mathbf{v}_i, \mathbf{v}_j) = \exp\left(-\gamma \lVert \mathbf{v}_i - \mathbf{v}_j \rVert^{2}\right) \tag{2}
$$

where vi and vj are feature vectors representing protein sequences. The parameter γ in Equation 2 and regularization parameter C are adjusted in training to produce reliable performance. As γ becomes smaller, the decision boundary for discriminating positive and negative examples becomes smoother. C controls the trade-off between training error and margin. We determined the two parameters as shown in Table 8 by trial and error. Other options for SVMlight are set to their default.
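For readers who want to see the kernel evaluated, here is a minimal Python sketch. It assumes the standard RBF form exp(−γ‖vi − vj‖²); the function name rbf_kernel and the example vectors are illustrative.

```python
import math
from typing import Sequence


def rbf_kernel(v_i: Sequence[float], v_j: Sequence[float], gamma: float) -> float:
    """RBF kernel value exp(-gamma * ||v_i - v_j||^2) for two feature vectors."""
    if len(v_i) != len(v_j):
        raise ValueError("feature vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(v_i, v_j))
    return math.exp(-gamma * sq_dist)


# Two made-up feature vectors at squared distance 1.
k_small_gamma = rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=0.1)
k_large_gamma = rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=10.0)
```

With a small γ the two vectors still look similar (kernel value close to 1), whereas a large γ makes the kernel decay quickly with distance, which is the sense in which a smaller γ gives a smoother decision boundary.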

For multiclass classification, the one-versus-rest method (Schölkopf and Smola 2002; Nguyen and Rajapakse 2003) was adopted. That is, the l-th SVM is trained on sequences belonging to the l-th location with the positive label "+1" and on sequences belonging to the remaining locations with the negative label "–1." We also tested the one-versus-one method, but the overall accuracy was lower than the one-versus-rest method (data not shown).
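The one-versus-rest scheme can be sketched in Python as below. The labeling step follows the description above; the final decision rule (choosing the location whose SVM gives the largest output) is the usual convention and is an assumption here, since the text does not spell it out. The function names and the constant-valued scorers are hypothetical stand-ins for trained SVMs.

```python
from typing import Callable, Dict, List, Sequence


def one_vs_rest_labels(locations: Sequence[str], target: str) -> List[int]:
    """Training labels for the SVM of one location: +1 for it, -1 for the rest."""
    return [1 if loc == target else -1 for loc in locations]


def predict_location(x: Sequence[float],
                     scorers: Dict[str, Callable[[Sequence[float]], float]]) -> str:
    """Pick the location whose SVM output is largest (assumed decision rule)."""
    return max(scorers, key=lambda loc: scorers[loc](x))


# Toy example: four stand-in scorers that return fixed decision values.
toy_scorers = {
    "cTP": lambda x: -0.7,
    "mTP": lambda x: 0.4,
    "SP": lambda x: 1.2,
    "other": lambda x: -0.1,
}
toy_labels = one_vs_rest_labels(["cTP", "mTP", "SP", "other", "SP"], "SP")
toy_choice = predict_location([0.0, 0.0], toy_scorers)
```

In practice each scorer would be the decision function of the SVM trained for that location on the labels produced by one_vs_rest_labels.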

Measures for evaluation of the prediction performance
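To evaluate the prediction performance of our method, sensitivity, specificity, Matthews’ (1975) correlation coefficient (MCC) for each subcellular location, and overall accuracy were calculated. The definitions of these measures are as follows:

$$
\mathrm{Sensitivity}(l) = \frac{tp(l)}{tp(l) + fn(l)} \tag{3}
$$

$$
\mathrm{Specificity}(l) = \frac{tp(l)}{tp(l) + fp(l)} \tag{4}
$$

$$
\mathrm{MCC}(l) = \frac{tp(l)\,tn(l) - fp(l)\,fn(l)}{\sqrt{\bigl(tp(l) + fn(l)\bigr)\bigl(tp(l) + fp(l)\bigr)\bigl(tn(l) + fp(l)\bigr)\bigl(tn(l) + fn(l)\bigr)}} \tag{5}
$$

$$
\mathrm{Overall\ accuracy} = \frac{1}{n}\sum_{l=1}^{k} tp(l) \tag{6}
$$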

where n is the total number of protein sequences and k is the number of subcellular locations. tp(l) is the number of correctly predicted sequences belonging to location l (true positive). tn(l) is the number of correctly predicted sequences that do not belong to location l (true negative). fp(l) is the number of overpredicted sequences in location l (false positive). fn(l) is the number of underpredicted sequences in location l (false negative).
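These measures are straightforward to compute from the four counts. The Python sketch below uses the definitions common in this literature; in particular, taking specificity as tp/(tp + fp), as in the TargetP evaluation, is an assumption, as is returning 0 when a denominator vanishes. The function names and the example counts are illustrative.

```python
import math
from typing import Sequence


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of sequences in a location that are predicted correctly."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tp: int, fp: int) -> float:
    """Fraction of predictions for a location that are correct (assumed form)."""
    return tp / (tp + fp) if tp + fp else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for one location."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def overall_accuracy(tp_per_location: Sequence[int], n: int) -> float:
    """Sum of true positives over all k locations divided by n sequences."""
    return sum(tp_per_location) / n


# Made-up counts for a two-location problem with 20 sequences.
toy_sens = sensitivity(tp=8, fn=0)
toy_spec = specificity(tp=8, fp=2)
toy_mcc = mcc(tp=8, tn=10, fp=2, fn=0)
toy_acc = overall_accuracy([8, 10], n=20)
```

The MCC is useful alongside sensitivity and specificity because it uses all four counts and is therefore less affected by the unequal numbers of sequences per location.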


    Acknowledgments
 
We thank Dr. Bill Pearson for help with the usage of the FASTA package and Dr. Morihiro Hayashida of Kyoto University for valuable comments. This work was supported in part by Grant-in-Aid for Scientific Research on Priority Areas (C) "Genome Information Science" and the Education and Research Organization for Genome Information Science, both from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan.


    References
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D.R., et al. 2001. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29: 37–40.

Bendtsen, J.D., Nielsen, H., von Heijne, G., and Brunak, S. 2004. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 340: 783–795.

Bhasin, M. and Raghava, G.P.S. 2004. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res. 32: W414–W419.

Bhasin, M., Garg, A., and Raghava, G.P.S. 2005. PSLpred: Prediction of subcellular localization of bacterial proteins. Bioinformatics 21: 2522–2524.

Bruce, B.D. 2000. Chloroplast transit peptides: Structure, function and evolution. Trends Cell Biol. 10: 440–447.

Cai, Y.-D. and Chou, K.-C. 2004. Predicting subcellular localization of proteins in a hybridization space. Bioinformatics 20: 1151–1156.

Cedano, J., Aloy, P., Pérez-Pons, J.A., and Querol, E. 1997. Relation between amino acid composition and cellular location of proteins. J. Mol. Biol. 266: 594–600.

Chou, K.-C. 2001. Prediction of protein cellular attributes using pseudoamino acid composition. Proteins 43: 246–255.

Chou, K.-C. and Cai, Y.-D. 2002. Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 277: 45765–45769.

Chou, K.-C. and Cai, Y.-D. 2003. A new hybrid approach to predict subcellular localization of proteins by incorporating gene ontology. Biochem. Biophys. Res. Commun. 311: 743–747.

Chou, K.-C. and Cai, Y.-D. 2004. Predicting subcellular localization of proteins by hybridizing functional domain composition and pseudo-amino acid composition. J. Cell. Biochem. 91: 1197–1203.

Chou, K.-C. and Elrod, D.W. 1998. Using discriminant function for prediction of subcellular location of prokaryotic proteins. Biochem. Biophys. Res. Commun. 252: 63–68.

Chou, K.-C. and Elrod, D.W. 1999. Protein subcellular location prediction. Protein Eng. 12: 107–118.

Chou, K.-C. and Zhang, C.T. 1995. Review: Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 30: 275–349.

Cosson, P. and Letourneur, F. 1997. Coatomer (COPI)-coated vesicles: Role in intracellular transport and protein sorting. Curr. Opin. Cell Biol. 9: 484–487.

Emanuelsson, O., Nielsen, H., and von Heijne, G. 1999. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci. 8: 978–984.

Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G. 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300: 1005–1016.

Guo, J., Lin, Y., and Sun, Z. 2005. A novel method for protein subcellular localization: Combining residue-couple model and SVM. Proceedings of the 3rd Asia-Pacific Bioinformatics Conference, pp. 117–129. Imperial College Press, Singapore.

Horton, P. and Nakai, K. 1997. Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology, pp. 147–152. AAAI Press, Menlo Park, CA.

Hua, S. and Sun, Z. 2001. Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17: 721–728.

Joachims, T. 1999. Making large-scale SVM learning practical. In Advances in kernel methods—Support vector learning (eds. B. Schölkopf et al.), pp. 41–56. MIT Press, Cambridge, MA.

Keegstra, K. and Cline, K. 1999. Protein import and routing systems of chloroplasts. Plant Cell 11: 557–570.

Kim, J.K., Raghava, G.P.S., Kim, K.S., Bang, S.Y., and Choi, S. 2004. Prediction of subcellular localization of proteins using pairwise sequence alignment and support vector machine. Proceedings of the 3rd Annual Conference of the Korean Society for Bioinformatics, pp. 158–166. Seoul, Korea.

la Cour, T., Gupta, R., Rapacki, K., Skriver, K., Poulsen, F.M., and Brunak, S. 2003. NESbase version 1.0: A database of nuclear export signals. Nucleic Acids Res. 31: 393–396.

Matthews, B.W. 1975. Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405: 442–451.

Miyakawa, K. and Imamura, T. 2003. Secretion of FGF-16 requires an uncleaved bipartite signal sequence. J. Biol. Chem. 278: 35718–35724.

Myers, E.W. and Miller, W. 1988. Optimal alignments in linear space. Comput. Appl. Biosci. 4: 11–17.

Nakai, K. and Kanehisa, M. 1992. A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14: 897–911.

Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443–453.

Neupert, W. 1997. Protein import into mitochondria. Annu. Rev. Biochem. 66: 863–917.

Nguyen, M.N. and Rajapakse, J.C. 2003. Multi-class support vector machines for protein secondary structure prediction. Genome Inform. Ser. Workshop Genome Inform. 14: 218–227.

Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. 1997. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10: 1–6.

Omura, T. 1998. Mitochondria-targeting sequence, a multi-role sorting sequence recognized at all steps of protein import into mitochondria. J. Biochem. 123: 1010–1016.

Park, K.-J. and Kanehisa, M. 2003. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 19: 1656–1663.

Pearson, W.R. 1990. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183: 63–98.

Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. 85: 2444–2448.

Reczko, M. and Hatzigeorgiou, A. 2004. Prediction of the subcellular localization of eukaryotic proteins using sequence signals and composition. Proteomics 4: 1591–1596.

Reinhardt, A. and Hubbard, T. 1998. Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res. 26: 2230–2236.

Robinson, C., Thompson, S.J., and Woolhead, C. 2001. Multiple pathways used for the targeting of thylakoid proteins in chloroplasts. Traffic 2: 245–251.

Schölkopf, B. and Smola, A.J. 2002. Learning with kernels—Support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge, MA.

Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195–197.

von Heijne, G. 1990. The signal peptide. J. Membr. Biol. 115: 195–201.

Yuan, Z. 1999. Prediction of protein subcellular locations using Markov chain models. FEBS Lett. 451: 23–26.

