|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, Iowa 50011-3020, USA
2 L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, Iowa 50011-3020, USA
(RECEIVED January 30, 2006; FINAL REVISION May 11, 2006; ACCEPTED July 31, 2006)
| Abstract |
|---|
|
|
|---|
Keywords: secondary structure prediction; GOR; Fragment Database Mining; structural fragments; multiple sequence alignments; PSIPRED
| Introduction |
|---|
|
|
|---|
38,000 (counting all, even highly homologous structures), a number that is significantly below the 233,000 or so protein sequences available in UniProtKB/Swiss-Prot database, and the 3,050,000 translations of EMBL nucleotide sequences collected in UniProtKB/TrEMBL database. Additionally, due to completion of many large-scale genome-sequencing projects, the number of known sequences grows continuously at an incredible rate. Protein tertiary structure prediction from sequence is one of the most important problems in molecular biology. Recent significant advances in protein tertiary structure prediction using computational methods, measured by the Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments, may help reduce this large gap. These structure prediction methods can be broadly grouped into three categories: homology/comparative modeling, fold recognition/threading, and de novo (ab initio) modeling. Among these methods, homology modeling requires the highest sequence similarity to known structures from the PDB, while de novo modeling relies to a lesser extent on information from the sequences and from the structures of proteins in the PDB. Many tertiary structure prediction methods incorporate secondary structure prediction for improvement of the accuracy of their modeling, or to significantly reduce the sampling of conformational space that is required (Kolinski 2004; Solis and Rackovsky 2004; Kolinski and Bujnicki 2005).
Currently no secondary structure prediction techniques yield >80% accuracy in cross-validated predictions, measured by Q3 prediction accuracy (Rost 2001). For example, the most successful techniques based on neural networks, such as PHD (Rost 1996) and PSIPRED (Jones 1999), reported accuracies
76%. This limitation in prediction accuracy of secondary structures is subsequently transferred into many tertiary structure prediction methods, limiting their performance whenever such secondary structure predictions are used as the input to the structure prediction algorithm.
Secondary structure prediction is an active research area. Recently, support vector machines (Hua and Sun 2001; Nguyen and Rajapakse 2005), sequence-based two-level (Huang et al. 2005), and dihedral angle-based (Wood and Hirst 2005) neural network algorithms were successfully used with accuracies <80%. Neural networks were also applied for the cases where secondary structures are combined not only into three categories (helix, sheet, and coil), but also into seven categories in a more detailed representation (Lin et al. 2005).
Despite the variety of these prediction methods, the barrier of cross-validated 80% accuracy is still present and has not yet been overcome. Is there a structural explanation for this limit? In a recent, interesting work, Kihara (2005) pointed out the importance of long-range interactions on the formation of secondary structure. He argued that as long as secondary structure predictions are based on a sliding sequence window, the long-range effects, not only for
-sheets but even for helices, will be treated to a limited extent. A comparison of accuracies as a function of residue contact order (sequence separation between contacts) supports this argument, at least for some helical and coil fragments, and provides interesting implications for protein folding (Tsai and Nussinov 2005). However, the accuracies for some other helices with high-contact order were also low, suggesting that there might be other effects not taken into account in the present secondary structure prediction algorithms. Note that the incorporation of multiple sequence alignments into predictions implicitly introduces long-range effects since sequence conservation is guided by structural constraints; GOR V and now the present novel hybrid method both benefit from this inclusion.
In order to improve the accuracy of secondary structure predictions, we propose a new hybrid method, Consensus Data Mining (CDM), which combines our two previous successful secondary structure prediction methods: the recently developed Fragment Database Mining (FDM) (Cheng et al. 2005) and the latest version of the well known GOR algorithm, GOR V (Kloczkowski et al. 2002; Sen et al. 2005). The basic premise of CDM is that the combination of these two complementary methods can enhance the performance of secondary structure prediction by harnessing the distinct advantages that both methods offer. FDM exploits the availability of sequentially similar fragments in the PDB, which leads to the highly accurate (much better than GOR V) prediction of structure for such fragments, but such fragments are not available for many cases. On the other hand, GOR V predicts the secondary structure of less similar fragments fairly accurately, where usually the FDM method cannot find suitable structures.
| Results |
|---|
|
|
|---|
|
The performance of GOR V can also be improved with multiple sequence alignments: The GOR V method tested with the full jack-knife methodology yields an accuracy of 73.5%, when multiple sequence alignments (MSA) are included; otherwise, its accuracy is 67.5%.
One of the significant advantages of FDM is its applicability to various evolutionary problems because the algorithm does not rely exclusively on the sequences with the highest sequence similarity, but assigns weights to BLAST-aligned sequences that apparently capture divergent evolutionary relationships. As a result, CDM, which incorporates FDM, can be successfully used, even when there is a significant range of sequence similarities among the BLAST identified sequences.
Although the availability of sequences with high similarity in the PDB essentially depends upon the target sequence, the question remains as to what the optimum value of the sequence identity threshold should be. To identify this optimal threshold, we applied the CDM algorithm to our data set with a wide range of identity thresholds ranging from 30% to 95%; some results are shown in Figure 2, where a distinct dependence of CDM on the upper sequence identity limit can be seen. We observe a 10% drop in the prediction accuracy when the upper sequence identity limit drops from 100% to 99%. Our results show that the 50% sequence identity threshold gives the best performance of the CDM method for the upper sequence identity limit (Fig. 2). This optimum value increases to 55% when multiple sequence alignments are incorporated in GOR V (data not shown). Note that the upper sequence identity limit also affects GOR V results because, for some positions within the sequence, only fragments with high sequence identity are available. When the upper sequence identity limit is decreased, those regions that were previously predicted by FDM are now predicted by GOR V.
|
|
|
-sheet predictions: If their lengths are too short (e.g., helices shorter than five residues or sheets shorter than three residues), these predictions are converted to coil.
|
90%, CDM confers a higher accuracy than individual GOR V, with or without MSA. Additionally, on average CDM is always better for the entire sequence than individual FDM regardless of the upper sequence identity limit.
For the cases of 100% and 99% sequence identities, only a small portion of sequences are predicted by GOR V (1% and 12%, respectively). At these upper sequence identity limits, GOR V without MSA performs better than GOR V with MSA for this small number of cases. Only when the upper sequence identity limit falls to
90% does GOR V with MSA then perform better. Although it is generally assumed that adding multiple sequence alignments to predictions increases the accuracy, the data in Table 1 clearly demonstrate that MSA is not effective for low sequence identities, rather the inclusion of MSA increases noise in the data.
Another interesting feature shown in Table 1 is the coverage by the FDM method, i.e., the fraction of FDM predictions in the consensus CDM method. When the upper sequence identity limit drops from 100% to 90%, the FDM coverage plummets from 99% to 65%, illustrating the lack of aligned sequences with high identity. Compare this value with the 12% coverage lost when the upper sequence identity limit drops further from 90% to 60%.
Another measure of prediction accuracy besides Q3 is the Matthews correlation coefficient. The correlation coefficients for three secondary structure elements,
-helices (H),
-sheets (E), and coil (C), are shown in Figure 5. The eight-letter DSSP alphabet has been reduced to the three-letter code as described in the Materials and Methods section. The plots in Figure 5 were obtained at the sequence identity threshold of 50% for a varying upper sequence identity. Similar to the majority of secondary structure algorithms, the correlation coefficients are highest for
-helices (H), followed by those for
-sheets (E), and lastly for coils (C). The correlation coefficients obtained by CDM show a consistent and smoother monotonic decrease with a decrease in the upper sequence identity limit.
|
80%, i.e., it significantly exceeds the original accuracy of prediction of GOR V (73.5%). We should note, however, that the result of 80% is an overprediction of PSIPRED because the result is non-cross-validated. The PSIPRED database that is used for the Neural Network training contains some sequences similar to sequences in the CB513 data set. The most accurate comparison between GOR V and PSIPRED should be made by applying the CB513 database (used in GOR V) for the training of PSIPRED and full jack-knife for the prediction for the CB513 data set. Such an approach would likely decrease the accuracy of prediction of PSIPRED to
76%, i.e., to the currently cross-validated accuracy of prediction of the PSIPRED method. The advantage of the original CDM method with GOR V is that we have developed the codes of both GOR V and Fragments Data Mining programs that comprise the CDM algorithm, and we are able to run and fully control the CDM performance. Because of this, the computations using the original CDM method based on GOR V are much faster than similar computations using the CDM variant with PSIPRED, and it is easier for us to implement cross-validation. We have also developed a variant of the Consensus Data Mining method that uses PSIPRED instead of GOR V for the prediction when fragments with sufficiently high identities cannot be found in the database. These results are shown in Figure 6. We repeated the calculations for a set of sequence identity cutoff values ranging from 30% to 95%. We obtained the best performance with a 70% sequence identity cutoff for combining FDM with PSIPRED. The performance of the CDM method with PSIPRED exceeds the original CDM method based on GOR V, but, as we have discussed previously, we were not able to cross-validate these results.
|
| Discussion |
|---|
|
|
|---|
| Materials and methods |
|---|
|
|
|---|
DSSP alphabet reduction
The Database of Secondary Structure in Proteins (DSSP) developed in 1983 by Kabsch and Sander (1983) is a widely used method for the assignment of the secondary structure based mostly on identification of hydrogen bonds in the crystallographic data. (There are, however, other alternative assignment methods, such as STRIDE [Frishman and Argos 1997] or, most recently, KAKSI [Martin et al. 2005]). DSPP classifies secondary structure elements into eight classes: H (
-helix), E (extended
-strand), G (310 helix), I (
-helix), B (bridge, a single residue
-strand), T (
-turn), S (bend), and C (coil). We follow a standard method of reduction of this eight-letter alphabet to the regular three-letter secondary structure code in the following manner: Helix (H) in the three-letter code includes three DSSP states, H, G, and I;
-strand (E) contains E and B; and coil (C) consists of T, S, and C.
GOR V (Kloczkowski et al. 2002; Sen et al. 2005)
The GOR method, proposed originally by Garnier, Osguthorpe, and Robson in 1978 (Garnier et al. 1978), is one of the first, important methods for prediction of secondary structure from sequence. The GOR method involves information theory and Bayesian statistics (Garnier et al. 1978; Gibrat et al. 1987; Garnier and Robson 1989; Garnier et al. 1996; Kloczkowski et al. 2002). The information entropy I can be written as a function of secondary structure S for a given amino acid R:
|
|
However, this function depends only on single-residue statistics. The predictions can be greatly improved by incorporating the information of flanking residues when a sliding window is used. For GOR V, a variable size window proved to produce the best results. Then, using relative informational content gives:
|
|
In this equation, Ri represents the ith residue in the sliding window, S is a secondary structure of the jth residue (Sj), and n S are all the conformations different than S. In the case when secondary structures are abstracted into three classes, S can be helix, sheet, or coil. n S represents the other two secondary structures. Total information content can be expressed as
|
|
If we keep information on single and pairs of residues, algebraic manipulation finally leads to
|
|
for the residue site in the middle of the sliding window. In this equation, d is the number of flanking residues, and 2d + 1 is the size of the sliding window.
Over decades, the GOR method has been constantly improved by including larger databases and more detailed statistics, where these changes were gradually integrated into the first four versions of GOR. With these improvements, the Q3 accuracy reached 64% in GOR IV. However, studies by other groups showed that the accuracy of prediction for secondary structure prediction methods could be significantly increased by including evolutionary information through multiple sequence alignments (MSAs) (Zvelebil et al. 1987; Levin and Garnier 1988; Rost 1996; for a recent review, see Simossis and Heringa 2004). In the most recent GOR V (Kloczkowski et al. 2002), evolutionary information in the form of MSAs is included using PSI-BLAST (Altschul et al. 1997) (GOR V Server is available at http://gor.bb.iastate.edu; Sen et al. 2005). MSA is generated using PSI-BLAST with the nr database, allowing up to five iterations. MSA increases the information content and therefore allows an improved discrimination of secondary structures. In the last stage, heuristic rules related to the predicted secondary structure distribution are used to improve predictions. With the help of evolutionary information, the full jack-knifed prediction accuracy of GOR V using the Cuff and Barton data set attains Q3 = 73.5%, an almost 10% increase from the previous GOR IV performance. The segment overlap (SOV) (Zemla et al. 1999), an alternative to the Q3 measure of prediction accuracy, is also high at 70.8%. These results substantiate the reliability of GOR V algorithm in our consensus method: Although the algorithm does not provide as much accuracy as the prediction methods based on neural networks (i.e., PHD [Rost 1996]; PSIPRED [Jones 1999]), it can definitely be used as a complement to Fragment Database Mining, which performs poorly with fragments of low sequence similarity. In this work, we use GOR V without MSA (with Q3 accuracy 67.5%) and with MSA (accuracy of 73.5%) to test the performance of hybrid methods.
Fragment Database Mining (FDM)
FDM (Cheng et al. 2005) searches for sequences in the PDB similar to the target sequences and aligns the sequence hits for the secondary structure prediction. For a given target sequence the BLAST similarity alignment search is performed first. Matching segments from BLAST alignments are assigned weights according to their sequence similarity to fragments of the target sequence, followed by normalization. Several different parameters are taken into account in the weight assignment: various substitution matrices, a range of similarity/identity thresholds, degree of solvent exposure, and protein classification and sizes. The secondary structure for each residue in the target sequence is predicted based on the highest normalized score.
Local sequence alignments are obtained with BLAST on the Cuff and Barton (1999, 2000) data set CB513 using BLOSUM-45 (the best performance) and several other substitution matrices. We weighted fragments with a scoring based on their identity scores id and their powers idx, where x is a positive number. The value x = 3 was found to provide the optimum performance. For each position in the sequence, the secondary structure is predicted based on the secondary structures of the matching fragments at that position.
Consensus Data Mining (CDM)
CDM is a three-step algorithm based on the simple idea of combining two complementary secondary structure prediction methods, each with distinct strengths. In the first step, FDM calculations are performed for a given target sequence, and for each residue in the sequence, the normalized similarity score is computed. For some sites, the sequence identity could be as high as 100%. In the second step, the GOR V algorithm is applied to obtain a second set of secondary structure predictions. In the last step, a sequence identity threshold is defined to decide whether the FDM or the GOR V result will be used in the consensus prediction for a given residue. In the CDM method, the FDM predictions are used for the residues with an identity score above the sequence identity threshold, and the GOR V predictions are used for the residues with an identity score below the sequence identity threshold.
Prediction performance metrics
We used Q3 for the secondary structure prediction accuracy. In the accuracy matrix [Aij] of the size 3 x 3, i and j correspond to the three states H, E, C. The ijth element, Aij, of the accuracy matrix is defined as the number of residues predicted to be in state j, which are actually in state i. The diagonal entries of [Aij] are numbers of correctly predicted residues for each state, and Q3 is defined as:
|
|
Here N is the number of residues in the query sequence. Matthews correlation coefficient is another measure of prediction accuracy defined for each secondary structure element separately. For example, the Matthews correlation coefficient for the helix (H) is:
|
|
where TP, TN, FN, and FP with subscripts
are the numbers of true positives, true negatives, false negatives, and false positives for helices, respectively.
| Footnotes |
|---|
Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.062125306.
| Acknowledgments |
|---|
|
|
|---|
| References |
|---|
|
|
|---|
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235242.
Camacho, C.J. and Vajda, S. 2002. Proteinprotein association kinetics and protein docking. Curr. Opin. Struct. Biol. 12: 3640.[CrossRef][Medline]
Chaudhuri, A. and Chant, J. 2005. Protein-interaction mapping in search of effective drug targets. Bioessays 27: 958969.[CrossRef][Medline]
Cheng, H., Sen, T.Z., Kloczkowski, A., Margaritis, D., and Jernigan, R.L. 2005. Prediction of protein secondary structure by mining structural fragment database. Polymer 46: 43144321.
Cuff, J.A. and Barton, G.J. 1999. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins 34: 508519.[CrossRef][Medline]
Cuff, J.A. and Barton, G.J. 2000. Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 40: 502511.[CrossRef][Medline]
Frishman, D. and Argos, P. 1997. Seventy-five percent accuracy in protein secondary structure prediction. Proteins 27: 329335.[CrossRef][Medline]
Garnier, J. and Robson, B. 1989. The GOR method for predicting secondary structures in proteins. In Prediction of protein structure and the principles of protein conformation (ed. G.D. Fasman), pp. 417465. Plenum Press, New York.
Garnier, J., Osguthorpe, D.J., and Robson, B. 1978. Analysis of the accuracy and implications of simple methods for predicting the decondary structure of globular proteins. J. Mol. Biol. 120: 97120.[CrossRef][Medline]
Garnier, J., Gilbrat, J.F., and Robson, B. 1996. GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol. 266: 540553.[Medline]
Gibrat, J.F., Garnier, J., and Robson, B. 1987. Further developments of protein secondary structure prediction using information theory: New parameters and consideration of residue pairs. J. Mol. Biol. 198: 425443.[CrossRef][Medline]
Halperin, I., Ma, B., Wolfson, H., and Nussinov, R. 2002. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins 47: 409443.[CrossRef][Medline]
Hua, S. and Sun, Z. 2001. A novel method of protein secondary structure prediction with high segment overlap measure: Support vector machine approach. J. Mol. Biol. 308: 397407.[CrossRef][Medline]
Huang, X., Huang, D.S., Zhang, G.Z., Zhu, Y.P., and Li, Y.X. 2005. Prediction of protein secondary structure using improved two-level neural network architecture. Protein Pept. Lett. 12: 805811.[CrossRef][Medline]
Jones, T.D. 1999. Protein secondary structure prediction based on position specific matrices. J. Mol. Biol. 292: 195202.[CrossRef][Medline]
Kabsch, W. and Sander, C. 1983. A dictionary of secondary structure. Biopolymers 22: 25772637.[CrossRef][Medline]
Keskin, O., Ma, B., Rogale, K., Gunasekaran, K., and Nussinov, R. 2005. Proteinprotein interactions: Organization, cooperativity and mapping in a bottom-up Systems Biology approach. Phys. Biol. 2: S24S35.[CrossRef][Medline]
Kihara, D. 2005. The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci. 14: 19551963.
Kloczkowski, A., Ting, K.L., Jernigan, R.L., and Garnier, J. 2002. Combining the GOR V algorithm with evolutionary information for protein secondary structure prediction from amino acid sequence. Proteins 49: 154166.[CrossRef][Medline]
Kolinski, A. 2004. Protein modeling and structure prediction with a reduced representation. Acta Biochim. Pol. 51: 349371.[Medline]
Kolinski, A. and Bujnicki, J.M. 2005. Generalized protein structure prediction based on combination of fold-recognition with de novo folding and evaluation of models. Proteins 61: 8490.
Levin, J.M. and Garnier, J. 1988. Improvements in a secondary structure prediction method based on a search for local sequence homologies and its use as a model building tool. Biochim. Biophys. Acta 955: 283295.[CrossRef][Medline]
Lin, K., Simossis, V.A., Taylor, W.R., and Heringa, J. 2005. A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics 21: 152159.
Martin, J., Letellier, G., Marin, A., Taly, J.F., de Brevern, A., and Gibrat, J.F. 2005. Protein secondary structure assignment revisited: A detailed analysis of different assignment methods. BMC Struct. Biol. 5: 17.[CrossRef][Medline]
Mendes, J., Guerois, R., and Serrano, L. 2002. Energy estimation in protein design. Curr. Opin. Struct. Biol. 12: 441446.[CrossRef][Medline]
Nguyen, M.N. and Rajapakse, J.C. 2005. Two-stage multi-class support vector machines to protein secondary structure prediction. Pac. Symp. Biocomput. 10: 346357.
Park, S., Yang, X., and Saven, J.G. 2004. Advances in computational protein design. Curr. Opin. Struct. Biol. 14: 487494.[CrossRef][Medline]
Rost, B. 1996. PHD: Predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol. 266: 525539.[CrossRef][Medline]
Rost, B. 2001. Review: Protein secondary structure prediction continues to rise. J. Struct. Biol. 134: 204218.[Medline]
Sen, T.Z., Kloczkowski, A., Jernigan, R.L., Yan, C., Honavar, V., Ho, K.M., Wang, C.Z., Ihm, Y., Cao, H., and Gu, X., et al. 2004. Predicting binding sites of hydrolase-inhibitor complexes by combining several methods. BMC Bioinformatics 5: 205.[CrossRef][Medline]
Sen, T.Z., Jernigan, R.L., Garnier, J., and Kloczkowski, A. 2005. GOR V server for protein secondary structure prediction. Bioinformatics 21: 27872788.
Simossis, V.A. and Heringa, J. 2004. Integrating protein secondary structure prediction and multiple sequence alignment. Curr. Protein Pept. Sci. 5: 249266.[CrossRef][Medline]
Smith, G.R. and Sternberg, M.J.E. 2002. Prediction of proteinprotein interactions by docking methods. Curr. Opin. Struct. Biol. 12: 2835.[CrossRef][Medline]
Solis, A.D. and Rackovsky, S. 2004. On the use of secondary structure in protein structure prediction: A bioinformatic analysis. Polymer 45: 525546.
Szilagyi, A., Grimm, V., Arakaki, A.K., and Skolnick, J. 2005. Prediction of physical proteinprotein interactions. Phys. Biol. 2: S1S16.[CrossRef][Medline]
Tovchigrechko, A., Wells, C.A., and Vakser, I.A. 2002. Docking of protein models. Protein Sci. 11: 18881896.
Tsai, C.J. and Nussinov, R. 2005. The implications of higher (or lower) success in secondary structure prediction of chain fragments. Protein Sci. 14: 19431944.
Vizcarra, C.L. and Mayo, S.L. 2005. Electrostatics in computational protein design. Curr. Opin. Chem. Biol. 9: 622626.[Medline]
Wood, M.J. and Hirst, J.D. 2005. Protein secondary structure prediction with dihedral angles. Proteins 59: 476481.[CrossRef][Medline]
Zemla, A., Venclovas, C., Moult, J., and Fidelis, K. 1999. Processing and analysis of CASP3 protein structure predictions. Proteins (Suppl. 3): 37: 2229.[CrossRef]
Zvelebil, M.J., Barton, G.J., Taylor, W.R., and Sternberg, M.J.E. 1987. Prediction of protein secondary structure and active-sites using the alignment of homologous sequences. J. Mol. Biol. 195: 957961.[CrossRef][Medline]
![]()
CiteULike
Connotea
Del.icio.us
Digg
Reddit
Technorati What's this?
This article has been cited by other articles:
![]() |
H. Cheng, T. Z. Sen, R. L. Jernigan, and A. Kloczkowski Consensus Data Mining (CDM) Protein Secondary Structure Prediction Server: Combining GOR V and Fragment Database Mining (FDM) Bioinformatics, October 1, 2007; 23(19): 2628 - 2630. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |