Institute of Microbial Technology, Sector 39A, Chandigarh, India
Reprint requests to: G.P.S. Raghava, Scientist, Bioinformatics Centre, Institute of Microbial Technology, Sector 39A, Chandigarh, India; e-mail: raghava{at}imtech.res.in; fax: 91-172-690557.
(RECEIVED August 14, 2002; FINAL REVISION November 22, 2002; ACCEPTED November 22, 2002)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0228903.
Supplemental material: See www.proteinscience.org.
| Abstract |
|---|
Keywords: ß-Turns; prediction; neural networks; multiple alignment; secondary structure; Web server
| Introduction |
|---|
Protein secondary structure comprises α-helices, ß-sheets, loops, and tight turns. Helices and sheets are referred to as regular structures, whereas loops and tight turns belong to the category of irregular secondary structures. Tight turns are irregular structures with nonrepeating backbone torsion angles and often have at least one hydrogen bond (Chou 2000). Depending on the number of residues forming the turn, tight turns are classified as δ-turns, γ-turns, ß-turns, α-turns, and π-turns. A ß-turn is a four-residue reversal in a protein chain that is not in an α-helix and in which the distance between Cα(i) and Cα(i + 3) is < 7 Å (Richardson 1981; Rose et al. 1985). About one-fourth of all protein residues are in ß-turns. They are responsible for the compact globular shape of proteins because they can reverse the direction of the protein chain within a span of a few residues. ß-Turn formation is also an important stage in protein folding (Takano et al. 2000). Moreover, the occurrence of ß-turns on solvent-exposed surfaces makes them suitable candidates for molecular recognition processes and interactions between peptide substrates and receptors (Rose et al. 1985).
Therefore, it is useful to develop an accurate method for identifying the location of ß-turns within a protein sequence. Such a method would not only be a small step toward the overall prediction of the three-dimensional structure of a protein from its amino acid sequence but would also be helpful in fold-recognition studies and in the identification of structural motifs such as the ß-hairpin.
| ß-Turn prediction methods |
|---|
| Inferences from previous study |
|---|
To begin, the network is first trained with single sequences. A second filtering network then processes the output from the first turn/nonturn network and is trained by using secondary structure information from PSIPRED (Jones 1999). Together, these two networks give a substantial improvement in prediction accuracy over previous methods, owing to training on a larger data set and to the secondary structure information from PSIPRED. The method achieves a Matthews correlation coefficient (MCC) of 0.41, compared with an MCC of 0.35 for BTPRED.
| A new approach to ß-turn prediction |
|---|
It is interesting to note that a significant improvement in prediction accuracy over single-sequence input has been achieved by training the net on PSI-BLAST-generated position-specific scoring matrices. The method achieves an MCC of 0.37, compared with 0.31 for single sequences. Moreover, prediction from a second filtering network trained on predictions from the first network and on predicted secondary structure information from PSIPRED yields an MCC of 0.43 and a marked improvement in other performance statistics.
| Results |
|---|
Prediction of ß-turns
In this work, we have used unbalanced sets containing the natural ratio of ß-turn residues and non-ß-turn residues as found in proteins. The results averaged over seven tests are presented. The value of the learning parameter has been set to 0.0001. Training has been performed for 5000 epochs for both networks, after which the learning has been terminated when the error reached a stable value; that is, when the differences between errors in subsequent steps became sufficiently small. Prediction performance measures have been averaged over seven sets and are expressed as the mean ± SD.
Prediction with single sequences
The net is trained with single sequences encoded as binary bits, with a window size of nine residues, and contains no secondary structure information. When applying a sevenfold cross-validation test on a data set containing single sequences, we found that the network reached an overall accuracy of 71.6 ± 0.7%. The prediction results are presented in Table 1. The net has achieved an MCC of 0.31 ± 0.01, which is comparable to that of BTPRED with a single network; however, the percentage accuracy is lower than that of BTPRED. A comparison of MCC and Qpred/Qobs values of BTPRED (Shepherd et al. 1999) and other statistical methods tested on the same data set (Kaur and Raghava 2002) has also been made (see Electronic Supplemental Material). The probability of correct prediction is better than with statistical methods but lower than that of BTPRED. However, the coverage of turns is the highest among all the methods. Moreover, with single sequences as input to the network, the performance is better than that of statistical methods.
Although our data set is nonhomologous, it contains some of the protein chains used to train PHD and PSIPRED. As a consequence, we have cross-validated the results by removing those proteins from our data set that were used to develop PHD and PSIPRED. The difference in prediction results is very small or almost negligible, as evident from the values given in Table 1.
Prediction with multiple alignment
To further enhance the prediction performance, multiple sequence alignment is used for prediction. The first-level network 9(21)-10-1 is trained on PSI-BLAST-generated position-specific matrices. The comparative results of the network with single sequence and with multiple alignment are shown in Table 2.
Prediction with multiple alignment and secondary structure information
Accuracy is further improved by using a second filtering network and secondary structure information. Output from the first network (trained on PSI-BLAST scoring matrices) and secondary structure predicted by PSIPRED are applied to the second network, which is trained for an additional 5000 cycles. Use of PSIPRED-predicted secondary structure and multiple alignment information improves the MCC to 0.43 ± 0.01 and prediction accuracy to 75.5 ± 1.7%, the best available at present (Table 2). The final network yields a Qpred value of 49.8 ± 2.0% and a Qobs value of 72.3 ± 1.6% and is marginally better than the second-level network with single sequence. Therefore, the use of multiple alignment information in the form of PSI-BLAST position-specific matrices as input to the first network, followed by filtering with the second network, has further improved the level of prediction performance.
The prediction results with multiple alignment information have also been validated by removing those proteins from our data set that were used to develop PSIPRED. The results, given in Table 2, show negligible differences in performance measures except in the Qobs value.
Receiver operating characteristic results
Performance of different networks has also been evaluated by calculating the area under the receiver operating characteristic (ROC) curve. Figure 1 shows the ROC curves for four different networks. The four curves have been compared by computing the area under the curves. The corresponding areas under the curves are as follows: single sequence, 0.67; multiple alignment, 0.72; single sequence with secondary structure, 0.76; and multiple alignment with secondary structure, 0.77. These values reflect the better discrimination of the network system that consists of a first network trained on multiple alignment profiles and a second network trained on secondary structure, in comparison with the other three network systems.
| Discussion |
|---|
Here, we have used the same approach for ß-turn prediction, and it differs significantly from the earlier methods. We have developed a method based on neural networks by using multiple alignment information in the form of PSI-BLAST-generated scoring matrices to improve the ß-turn prediction accuracy. From this study, it is clear that a combination of a neural network and the evolutionary information contained in multiple sequence alignments has improved the performance of the ß-turn prediction method. There are three possible explanations for the improvements obtained: (1) use of a large and recent data set for learning; (2) use of PSI-BLAST profiles, because PSI-BLAST, searching against a nonredundant database, finds more distantly related homologs than pairwise search methods do; and (3) use of a second filter network, which includes predictions from the first network and secondary structure information from the highly successful method PSIPRED.
To begin, the net is first trained with single sequences encoded as binary bits. The prediction results when the net is trained on single sequences are better than those of statistical methods and comparable to BTPRED results with a single network, with an MCC of 0.31. A second-level network has been used to refine the results produced by the first network. In the second network, at each position in the window, the turn/nonturn outputs from the first network and the predicted secondary structure states are used in place of sequence information as input to the network. The architecture of the second-level network is the same as that of the first-level network. The performance is further improved by using this second filtering network and secondary structure information. The accuracy is improved by 3%, and the MCC is raised from 0.31 to 0.41. A significant improvement in Qpred and Qobs values has also been achieved. The effect of two secondary structure prediction methods, PHD and PSIPRED, on ß-turn prediction accuracy has also been assessed, and it has been found that, for the same cross-validated set, ß-turn prediction incorporating PSIPRED-predicted secondary structure is more accurate than that incorporating PHD predictions. The higher prediction accuracy of PSIPRED compared with PHD is the reason for the better ß-turn prediction results with PSIPRED.
A new approach that uses PSI-BLAST to generate multiple sequence alignment profiles has been implemented for ß-turn prediction. The first-level net is trained on the PSI-BLAST-generated position-specific matrices, which are produced as part of the PSIPRED prediction method. The MCC increases from 0.31 with single sequences to 0.37, which is even better than that of BTPRED. Improvement in other performance measures can also be observed. Thus, ß-turn prediction accuracy is improved by taking into account the information carried by multiple alignment matrices. The overall results show an additional gain in performance over single sequences, and the method reaches a final accuracy of 73.5%. The reason for this better performance is that, at the alignment level, PSI-BLAST produces profiles by searching for homologs against a large nonredundant database. It uses a sensitive scoring system that involves the probabilities with which amino acids occur at various positions. As expected, the prediction accuracy is further improved by using a second filtering network trained on predictions from the first network, along with predicted secondary structure information. The MCC is raised from 0.37 to 0.43, the maximum value for ß-turn prediction achieved so far.
The improvements in ß-turn prediction performance so obtained are significant, especially in the context of the overall increase in the accuracy of secondary structure prediction, and will be helpful to researchers working in the field of fold recognition. The method depends on the accuracy of the secondary structure prediction method. The suggested approach has considerable potential for further improvement of prediction accuracy, especially in view of the continued growth of protein sequence databases and further improvement in protein secondary structure prediction.
| Materials and methods |
|---|
2.0-Å resolution. Each chain contains at minimum one ß-turn. The PROMOTIF program has been used to assign ß-turns in proteins (Hutchinson and Thornton 1996). The extracted ß-turn residues have been assigned secondary structure states by DSSP (Kabsch and Sander 1983). Most ß-turn residues are assigned the T state, followed by the S state (see Electronic Supplemental Material).
Sevenfold cross-validation
A prediction method is often developed by cross-validation or the jack-knife method (Chou and Zhang 1995). Because of the size of the data set, the jack-knife method (individual testing of each protein in the data set) was not feasible, so a more limited cross-validation technique has been used, in which the data set is randomly divided into seven subsets, each containing an equal number of proteins. Each set is an unbalanced set that retains the naturally occurring proportion of ß-turns (~25%) and nonturns.
The data set has been divided into a training set, a validation set, and a testing set. The training set consists of five of these subsets. The network is validated for minimum error on the validation set to avoid overtraining, and the network is tested on the excluded set of proteins, the testing set. This has been done seven times so that each subset is tested once. The final prediction results have been averaged over the seven testing sets.
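The rotation of subsets described above might be sketched as follows. This is only an illustration: the function name, the round-robin dealing of chains, and the choice of the next subset as the validation set are assumptions, because the text does not state how the validation subset is selected in each round.

```python
import random

def sevenfold_splits(chains, k=7, seed=0):
    """Yield (train, validation, test) lists of chain ids, one triple per round."""
    shuffled = list(chains)
    random.Random(seed).shuffle(shuffled)
    # Deal chains round-robin so that subset sizes differ by at most one.
    subsets = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = subsets[i]
        # Assumption: the subset following the test subset is held out for validation.
        validation = subsets[(i + 1) % k]
        train = [chain
                 for j, subset in enumerate(subsets)
                 if j not in (i, (i + 1) % k)
                 for chain in subset]
        yield train, validation, test
```

Each chain appears in exactly one testing set over the seven rounds, and the remaining five subsets form the training set in every round.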
Neural network architecture
In the present study, two feed-forward back-propagation networks with a single hidden layer are used. Both networks have an input window that is nine residues wide and have 10 units in a single hidden layer. The target output consists of a single binary number and is one or zero (true or false). The window is shifted residue by residue through the protein chain, thus yielding N patterns for a chain with N residues. This is in accordance with the previous work (Shepherd et al. 1999), which showed that a window size of nine gave optimal prediction results. The architecture of the network system used in the present work is shown in Figure 2.
First level: sequence-to-structure net
The input to the first network is either a single sequence or multiple alignment profiles. Patterns are presented as windows of nine residues, in which a prediction is made for the central residue. With single sequence input, a binary encoding scheme has been used. In this scheme, each amino acid at each window position is encoded by a group of 21 inputs: 20 units code for the possible amino acid types at that position, and one is used when the moving window overlaps the amino- or carboxy-terminal end of the protein. In each group of 21 inputs, the input corresponding to the amino acid type at that window position is set to one, and all other inputs are set to zero.
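The binary encoding can be illustrated with a short sketch. The ordering of the 20 amino acids and the function names are assumptions made for illustration; the sketch also assumes that the sequence contains only the 20 standard one-letter codes.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 amino acid units
WINDOW = 9
UNITS = 21                            # 20 amino acid units plus 1 terminal-overlap unit

def encode_window(sequence, center):
    """Return the 9 x 21 = 189 binary inputs for the residue at index `center`."""
    half = WINDOW // 2
    inputs = []
    for pos in range(center - half, center + half + 1):
        group = [0] * UNITS
        if pos < 0 or pos >= len(sequence):
            group[20] = 1  # window overlaps the amino- or carboxy-terminal end
        else:
            group[AMINO_ACIDS.index(sequence[pos])] = 1
        inputs.extend(group)
    return inputs

def encode_chain(sequence):
    """One pattern per residue, so a chain of N residues yields N patterns."""
    return [encode_window(sequence, i) for i in range(len(sequence))]
```

Exactly one input is set in each group of 21, so every pattern contains nine ones and 180 zeros.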
With multiple alignment profile input, the position-specific scoring matrix generated by PSI-BLAST has been used as input to the neural network. The matrix has 21 x M real-number elements, where M is the length of the target sequence. Each element represents the likelihood of that particular residue substitution at that position. Thus, 21 real numbers rather than binary bits encode each residue.
Second level: Structure-to-structure net
An important feature of the predictions generated by the first network is that they are uncorrelated; that is, the network makes a prediction for each residue in isolation, without reference to neighboring predictions. The correlation can be taken into account by using a second-level, structure-to-structure network. Qian and Sejnowski (1988) achieved a 1% improvement in secondary structure prediction accuracy by using a second filtering network.
The input to the second filtering network consists of the predictions obtained from the first net and the predicted secondary structure. Four units encode each residue: one unit codes for the turn/nonturn prediction from the first network and is set to either one or zero, and the remaining three units code for the three secondary structure states (helix, strand, and coil; Fig. 2).
Secondary structure information is also encoded by the actual probabilities of three states provided in the output of the PSIPRED prediction. The probabilities are just the strengths of the prediction for each of the three target states (helix, strand, coil) and are represented by a real number in the range zero to one. The actual score of turn/nonturn predictions obtained from first network is also used as input to the network in the place of binary bits.
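A minimal sketch of the second-level input, assuming the binary variant with four units per residue. The state letters H, E, and C, the function name, and the zero-filling of window positions beyond the chain termini are assumptions; the text does not say how the second network handles the termini.

```python
WINDOW = 9
SS_STATES = "HEC"  # helix, strand, coil; the letter codes are an assumption

def second_level_inputs(turn_pred, ss_pred, center):
    """Return the 9 x 4 = 36 inputs for the residue at index `center`.

    turn_pred: first-network output per residue (0/1, or a score in [0, 1])
    ss_pred:   predicted secondary structure string over 'H', 'E', 'C'
    """
    half = WINDOW // 2
    inputs = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(turn_pred):
            one_hot = [1 if ss_pred[pos] == state else 0 for state in SS_STATES]
            inputs.extend([turn_pred[pos]] + one_hot)
        else:
            # Assumption: positions beyond the termini contribute four zeros.
            inputs.extend([0, 0, 0, 0])
    return inputs
```

For the probability-based variant, the one-hot triple would be replaced by the three PSIPRED state probabilities and the binary turn value by the raw first-network score.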
Multiple alignment or position-specific scoring matrices
PSIPRED uses PSI-BLAST to detect distant homologs of a query sequence and to generate a position-specific scoring matrix as part of the prediction process. Here, we have used these intermediate PSI-BLAST-generated position-specific scoring matrices as direct input to the first-level network. The matrix has 21 × M elements, where M is the length of the target sequence, and each element represents the frequency of occurrence of each of the 20 amino acids at one position in the alignment (Altschul et al. 1997).
Secondary structure prediction and assignment
The second filtering network is trained with the output obtained from the first network and predicted secondary structure information. To show that ß-turn prediction accuracy depends on the accuracy of secondary structure prediction, two methods have been used for predicting secondary structure: PHD (Rost 1996) and PSIPRED (Jones 1999). The protein secondary structure assignment by DSSP is used to establish an upper bound of predictive performance. DSSP provides an eight-state assignment of secondary structure (Kabsch and Sander 1983). The eight states of DSSP have been reduced to three states (G, H, and I are taken as helix; B and E as strand; and the rest as coil).
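The reduction from eight DSSP states to three can be written as a simple mapping. The function name and the output letters H, E, and C are illustrative choices; the grouping itself follows the text.

```python
HELIX = set("GHI")
STRAND = set("BE")

def reduce_dssp(dssp_states):
    """Map a DSSP state string to three states: H (helix), E (strand), C (coil)."""
    reduced = []
    for state in dssp_states:
        if state in HELIX:
            reduced.append("H")
        elif state in STRAND:
            reduced.append("E")
        else:
            reduced.append("C")  # T, S, blank, and anything else count as coil
    return "".join(reduced)
```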
Filtering the prediction
Because the prediction is performed for each residue separately, the final prediction includes a number of unusually short ß-turns of one or two residues. Although the second-level structure-to-structure network corrects the tendency of the first-level sequence-to-structure network to predict ß-turns that are too short, the final predictions still contain single-residue ß-turns. To exclude such unrealistic turns, we have applied a simple filtering rule, the "state-flipping" rule, as described in the work of Shepherd et al. (1999). A set of four rules has been used in the following order: flip isolated nonturn predictions to turn (i.e., t-t → ttt), flip isolated turn predictions to nonturn (i.e., -t- → ---), flip isolated pairs of turn predictions to nonturn (i.e., -tt- → ----), and flip the adjacent nonturn predictions to turn (i.e., -ttt- → tttt- or -tttt).
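The four rules can be sketched as successive passes over a prediction string of `t` (turn) and `-` (nonturn). This is an illustrative reading, not the original implementation: the helper names are hypothetical, positions beyond the chain ends are treated as nonturn, and for the last rule the preceding residue is flipped when one exists, which is an assumption because the text does not say how the side is chosen.

```python
def _runs(states, symbol):
    """Return (start, end) pairs (end exclusive) of maximal runs of `symbol`."""
    runs, start = [], None
    for i, s in enumerate(states + ["$"]):  # sentinel closes a trailing run
        if s == symbol and start is None:
            start = i
        elif s != symbol and start is not None:
            runs.append((start, i))
            start = None
    return runs

def state_flip(prediction):
    """Apply the four state-flipping rules, in order, to a string of 't' and '-'."""
    s = list(prediction)
    # Rule 1: an isolated nonturn flanked by turns becomes a turn (t-t -> ttt).
    for a, b in _runs(s, "-"):
        if b - a == 1 and a > 0 and b < len(s):
            s[a] = "t"
    # Rules 2 and 3: isolated single turns, then isolated pairs, become nonturn.
    for length in (1, 2):
        for a, b in _runs(s, "t"):
            if b - a == length:
                s[a:b] = ["-"] * length
    # Rule 4: a three-residue turn is extended to four residues.
    for a, b in _runs(s, "t"):
        if b - a == 3:
            if a > 0:
                s[a - 1] = "t"  # assumption: prefer the preceding residue
            elif b < len(s):
                s[b] = "t"
    return "".join(s)
```

After filtering, no predicted turn is shorter than four residues unless the chain itself is too short to extend it.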
Performance measures
Performance measures used are categorized as the following.
Threshold-dependent measures
Four parameters have been used in the present work to measure the performance of the prediction method, as described by Shepherd et al. (1999): (1) Qtotal (or prediction accuracy) is the percentage of correctly classified residues, (2) MCC accounts for both over- and under-predictions, (3) Qpred is the percentage of ß-turn predictions that are correct (or probability of correct prediction), and (4) Qobs is the percentage of observed ß-turns that are correctly predicted (or percent coverage). The parameters can be calculated by the following equations:
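$$Q_{\text{total}} = \frac{p + n}{t} \times 100$$

$$\text{MCC} = \frac{pn - ou}{\sqrt{(p + o)(p + u)(n + o)(n + u)}}$$

$$Q_{\text{pred}} = \frac{p}{p + o} \times 100$$

$$Q_{\text{obs}} = \frac{p}{p + u} \times 100$$

where p is the number of correctly predicted ß-turn residues, n is the number of correctly predicted nonturn residues, o is the number of nonturn residues predicted as ß-turn (over-predictions), u is the number of ß-turn residues predicted as nonturn (under-predictions), and t = p + n + o + u is the total number of residues.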
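As a worked illustration, the four threshold-dependent measures can be computed from per-residue labels as follows. The sketch assumes the usual confusion-matrix forms of these measures; the function and variable names are illustrative, and undefined ratios are returned as zero by convention.

```python
import math

def turn_measures(observed, predicted):
    """Return (Qtotal, MCC, Qpred, Qobs) for per-residue 0/1 labels (1 = turn)."""
    pairs = list(zip(observed, predicted))
    p = sum(1 for obs, pred in pairs if obs == 1 and pred == 1)      # correct turns
    n = sum(1 for obs, pred in pairs if obs == 0 and pred == 0)      # correct nonturns
    over = sum(1 for obs, pred in pairs if obs == 0 and pred == 1)   # over-predictions
    under = sum(1 for obs, pred in pairs if obs == 1 and pred == 0)  # under-predictions
    total = p + n + over + under
    q_total = 100.0 * (p + n) / total
    denom = math.sqrt((p + over) * (p + under) * (n + over) * (n + under))
    mcc = (p * n - over * under) / denom if denom else 0.0
    q_pred = 100.0 * p / (p + over) if p + over else 0.0
    q_obs = 100.0 * p / (p + under) if p + under else 0.0
    return q_total, mcc, q_pred, q_obs
```

For ten residues with three correct turn predictions, five correct nonturn predictions, one over-prediction, and one under-prediction, this gives Qtotal = 80%, Qpred = Qobs = 75%, and MCC = 14/24 ≈ 0.58.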
Threshold-independent measures
One problem with the threshold-dependent measures is that they measure the performance at a given threshold and thus fail to use all the information provided by a method. The ROC is a threshold-independent measure that was developed as a signal processing technique. For a prediction method, the ROC plot is obtained by plotting, for all available thresholds, the sensitivity values (true-positive fraction) on the y-axis against the corresponding (1 − specificity) values (false-positive fraction) on the x-axis. The area under the ROC curve is taken as an important index because it provides a single measure of overall accuracy that is not dependent on a particular threshold (Deleo 1993). It measures discrimination, the ability of a method to correctly classify ß-turn and nonturn residues. Sensitivity (Sn) and specificity (Sp) are defined as
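$$S_n = \frac{p}{p + u}, \qquad S_p = \frac{n}{n + o}$$

where p and n are the numbers of correctly predicted ß-turn and nonturn residues, o is the number of nonturn residues predicted as ß-turn, and u is the number of ß-turn residues predicted as nonturn.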
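The area under the ROC curve can be estimated from the raw network scores, as in the following sketch. It assumes that every distinct score is used as a threshold and that the area is obtained by the trapezoidal rule; neither detail is stated in the text, and the function names are illustrative. Both classes must be present in the labels.

```python
def roc_points(scores, labels):
    """Return (false-positive fraction, sensitivity) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An area of 0.5 corresponds to random assignment and an area of 1.0 to perfect separation of turn and nonturn residues.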
| Electronic supplemental material |
|---|
Availability
The program is implemented as the Web server BetaTPred2, available at http://www.imtech.res.in/raghava/betatpred2/, by using a CGI/Perl script. The SNNS-generated network is converted into a C program and is used as an interface.
Users can enter a primary amino acid sequence in FASTA or free format. The residues are predicted as ß-turn or non-ß-turn residues. Predictions can also be e-mailed to users after a short period of time, depending on the server load.
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Chou, K.C. 1997. Prediction of ß-turns. J. Pept. Res. 49: 120–144.
Chou, K.C. 2000. Prediction of tight turns and their types in proteins. Anal. Biochem. 286: 1–16.
Chou, K.C. and Blinn, J.R. 1997. Classification and prediction of ß-turn types. J. Protein Chem. 16: 575–595.
Chou, K.C. and Zhang, C.T. 1995. Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 30: 275–349.
Chou, P.Y. and Fasman, G.D. 1974. Conformational parameters for amino acids in helical, ß-sheet and random coil regions calculated from proteins. Biochemistry 13: 211–222.
Chou, P.Y. and Fasman, G.D. 1979. Prediction of ß-turns. Biophys. J. 26: 367–384.
Deleo, J.M. 1993. Proceedings of the second International Symposium on Uncertainty Modelling and Analysis, pp. 318–325. IEEE, Computer Society Press, College Park, MD.
Garnier, J., Osguthorpe, D.J., and Robson, B. 1987. Analysis and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120: 97–120.
Guruprasad, K. and Rajkumar, S. 2000. ß- and γ-turns in proteins revisited: A new set of amino acid-dependent positional preferences and potential. J. Biosci. 25: 143–156.
Hutchinson, E.G. and Thornton, J.M. 1996. PROMOTIF: A program to identify and analyze structural motifs in proteins. Protein Sci. 5: 212–220.
Jones, D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195–202.
Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
Kaur, H. and Raghava, G.P.S. 2002. An evaluation of ß-turn prediction methods. Bioinformatics 18: 1508–1514.
Przybylski, D. and Rost, B. 2002. Alignments grow, secondary structure prediction improves. Proteins 46: 197–205.
Qian, N. and Sejnowski, T.J. 1988. Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol. 202: 865–884.
Richardson, J.S. 1981. The anatomy and taxonomy of protein structure. Adv. Protein Chem. 34: 167–339.
Rose, G.D., Gierasch, L.M., and Smith, J.A. 1985. Turns in peptides and proteins. Adv. Protein Chem. 37: 100–109.
Rost, B. 1996. PHD: Predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol. 266: 525–539.
Rost, B. and Sander, C. 1993. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232: 584–599.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. 1986. Learning representations by back-propagating errors. Nature 323: 533–536.
Shepherd, A.J., Gorse, D., and Thornton, J.M. 1999. Prediction of the location and type of ß-turns in proteins using neural networks. Protein Sci. 8: 1045–1055.
Takano, K., Yamagata, Y., and Yutani, K. 2000. Role of amino acid residues at turns in the conformational stability and folding of human lysozyme. Biochemistry 39: 8655–8665.
Wilmot, C.M. and Thornton, J.M. 1988. Analysis and prediction of the different types of ß-turns in proteins. J. Mol. Biol. 203: 221–232.
Zell, A. and Mamier, G. 1997. Stuttgart neural network simulator, version 4.2. University of Stuttgart, Stuttgart, Germany.
Zhang, C.T. and Chou, K.C. 1997. Prediction of ß-turns in proteins by 1–4 & 2–3 correlation model. Biopolymers 41: 673–702.