|
|
||||||||
1 Department of Biological Science and Technology and Institute of Bioinformatics, National Chiao Tung University, Hsinchu, 30050, Taiwan
2 Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan
3 Institute of Biomedical Sciences, Academia Sinica, Taipei 115, Taiwan
Reprint requests to: Jinn-Moon Yang, Department of Biological Science and Technology and Institute of Bioinformatics, National Chiao Tung University, Hsinchu, 30050, Taiwan; e-mail: moon{at}cc.nctu.edu.tw; fax: 886-3-5729288 or Ming-Jing Hwang, Institute of Biomedical Sciences, Academia Sinica, Taipei 115, Taiwan; e-mail: mjhwang{at}ibms.sinica.edu.tw; fax: 886-2-27887641.
(RECEIVED December 11, 2001; FINAL REVISION May 8, 2002; ACCEPTED May 8, 2002)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.4940102.
| Abstract |
|---|
|
|
|---|
1, 66% for
1 + 2, and 1.36 Å for the root mean square deviation of side-chain positions. We found that if our scoring function was perfect, the prediction accuracy was also essentially perfect. However, perfect prediction could not be achieved if only a discrete search mechanism was applied. These results suggest that GEM is robust and can be used to examine the factors limiting the accuracy of protein side-chain prediction methods. Furthermore, it can be used to systematically evaluate and thus improve scoring functions. Keywords: Evolutionary algorithm; Gaussian mutation; protein-structure prediction; side-chain conformation; rotamer library
| Introduction |
|---|
|
|
|---|
A good scoring function should be able to distinguish between correct and incorrect conformations. Various scoring functions have been developed to predict side-chain conformations, including simple molecular force fields which, for the sake of fast computation, usually employ a Lennard-Jones 12-6 form (Lee and Subbiah 1991; Koehl and Delarue 1994; Hwang and Liao 1995) or 6-9 form (Holm and Sander 1992; Väsquez 1995) to remove close-range interactions. More sophisticated and longer range functions (Tuffery et al. 1991), as well as statistically derived contact potentials (Samudrala and Moult 1998), have also been studied.
Until recently (Xiang and Honig 2001), the combinatorial nature of side-chain placement was generally considered the major obstacle in protein side-chain prediction. Various approaches have been developed to circumvent the combinatorial problem and can be roughly divided into three categories: knowledge-based statistical methods, tree-based elimination methods, and stochastic search methods.
Knowledge-based statistical methods include the homology modeling methods (Holm and Sander 1992; Laughton 1994; Bower et al. 1997), in which side-chain conformations are predicted on the basis of localized similarity between target structures and database templates. Other examples are approaches that use a rotamer library, in which the statistical distributions of side-chain orientations derived from known structures are tabulated. In general, there are two kinds of rotamer library, one in which the rotamers are dependent on the local main-chain environment (Dunbrack and Karplus 1993) and one in which they are not (Ponder and Richards 1987; Tuffery et al. 1991; De Maeyer et al. 1997; Xiang and Honig 2001). The information in these two different kinds of rotamer libraries can also be implicitly captured in neural networks (Hwang and Liao 1995). The dead-end elimination algorithm (Desmet et al. 1992; Looger and Hellinga 2001) and A* algorithm (Leach 1994; Leach and Lemon 1998) are tree-based elimination approaches to reduce search spaces. Whereas homology rotamer library and tree-based approaches are deterministic stochastic methods, such as simulated annealing (Lee and Subbiah 1991; Laughton 1994; Hwang and Liao 1995), genetic algorithms (Tuffery et al. 1991), and mean field theory (Koehl and Delarue 1994; Mendes et al. 1999), use biased sampling to reach approximated solutions.
Despite the diversity of the strategies, algorithms, and energy functions used in these methods, they all seem to produce comparable results, with, for example, an accuracy of
7080% in the
1 angles of all residues (Levitt et al. 1997). However, it is not yet fully understood which factors (e.g., search algorithms, energy functions, or experimental errors) are mainly responsible for the 20%30% of errors (Levitt et al. 1997). In addition, many methods use a discrete search to identify an optimal combination of side-chain rotamer states before finding the optimal side-chain conformations by energy minimization (Fig. 1
), and it is not clear to what extent the nature of discrete search has limited the theoretical accuracy of a search method in protein side-chain prediction.
|
The main difference in methodology between the present work and our previous studies is the addition of a global discrete-search to a continuous-search mechanism. To the best of our knowledge, the present work on protein side-chain prediction is also the first to integrate global discrete- and global continuous-search mechanisms (Fig. 1
). This is distinct from previous studies in which, although both discrete- and continuous-search mechanisms have been used (e.g., Dunbrack and Karplus 1993; Väsquez 1995; Bower et al. 1997; Xiang and Honig 2001), they were used separately, with the result that the continuous search was only a local search.
| Results and Discussion |
|---|
|
|
|---|
1 and
1 + 2 angles for all residues and the root mean square deviation (RMSD) error in side-chain heavy atoms. The 38 structures selected constitute a minimal set encompassing most of the common structures tested in different prediction methods, allowing comparison with these other methods (Holm and Sander 1992; Dunbrack and Karplus 1993; Laughton 1994; Hwang and Liao 1995; Samudrala and Moult 1998; Looger and Hellinga 2001). The results of the comparison are shown in Tables 2 and 3
|
|
|
1 angles that were within 30° of those determined from the crystal structure in 80% of cases and side-chain atomic positions with a mean RMSD of 1.36 Å. For core residues with <20% solvent exposure, the prediction accuracy increased to 93% for
1 and the mean RMSD was reduced to 0.92 Å. Thus, even though no energy minimization was applied to the final structure and the bond lengths and bond angles used were those of standard templates and not of the individual residues in the individual X-ray structures (see Materials and Methods), our results for core residues, on a limited test set, were approaching those from a recent study in which a very detailed rotamer library (7560 rotamers) and experimental bonds and angles for every residue were used (Xiang and Honig 2001).
In terms of residue types, it appears easier to predict the orientation of small nonpolar or aromatic side chains, as evidenced by a significantly better accuracy in
1 angles (90%) being achieved for these residues (Fig. 2a,b
). This means that prediction errors are contributed mainly by polar and charged amino acids, for which interaction with the solvent plays a significant role in determining their side-chain orientation. The largest error was the 58%
1 accuracy for serine, which was probably attributable to its small size and its conformation therefore being dictated by hydrogen bonding (Koehl and Delarue 1994). These results indicate that, in general, it is harder to predict the conformation of those amino acids in which the side-chain conformation is more susceptible not only to steric hindrance, but also to environmental factors, such as hydrogen bonds, salt bridges, and solvent interactions. This observation is consistent with the findings of previous studies (Table 3
; Dunbrack and Karplus 1993; Koehl and Delarue 1994), as well as with the fact that the conformation of residues in the protein core can be more accurately predicted (Levitt et al. 1997).
|
1 angles in terms of amino acid type is quite consistent between these methods (Table 3
1 accuracy being 81%, and
1+2 accuracy being 64%. These results suggest that the accuracy of GEM is comparable with those of previous prediction methods. However, the GEM approach, as discussed below, can be used to analyze elements of methodology, such as search scheme and energy function, and should therefore help in moving toward error-free prediction of protein side-chain conformations.
Evaluation of the energy function used
The main objective of this study was to evaluate whether the evolutionary algorithm that we recently developed and applied to flexible ligand docking (Yang and Kao 2000a) was also applicable to the prediction of side-chain conformations of proteins. To simplify the task, we adopted a typical van der Waalstype energy function used in previous studies (see Materials and Methods). However, to enable the global continuous search mechanism of GEM to energetically discriminate between different torsions, we found it necessary to add a torsion term. In two previous studies in which a very large set of rotamer states were sampled, a torsion function was also used (Lee and Subbiah 1991; Xiang and Honig 2001). The torsion term used in the present work was that employed in the HIV protease-inhibitor docking study of Gehlhaar et al. (1995).
The fact that although we did not attempt to refine any parameters of the energy function used, we still achieved a comparable prediction accuracy attested to the viability of the use of GEM for protein side-chain prediction. However, with uncertainty in the scoring (fitness) function, the robustness of GEM was difficult to assess. To address this question, we made use of the high adaptability of GEM and simply replaced the force-field energy function with a perfect-scoring function (i.e., one that would produce zero RMSD in atom positions). As shown in Table 1
and Figure 2, c and d
, using the RMSD scoring function (equation 1
; Materials and Methods), GEM could indeed approach perfect prediction. Not only were almost all
1 angles predicted to within 30° (Fig. 2c
; Table 1
), but the absolute values of the angles were very accurately reproduced, with >90% having an error of <5° compared to a value of only 30% when the force-field energy function (equation 2
) was used (Fig. 3a
). The residual errors can be attributed to the use of not completely flexible side-chain templates (Materials and Methods), which have rigid bond lengths and bond angles. This attribution is supported by the work of Xiang and Honig (2001), which showed that the use of a standardized geometry could lead to considerable errors. This was most evident in the RMSD results for lysine and arginine, and, to a lesser extent, those for glutamate and glutamine (Fig. 2d
), that is, those amino acids with side chains that are charged or polar and are relatively long and flexible. Interestingly, in the case of lysine and arginine, although almost all the
1 angles were predicted without error, the RMSD value was as large as that obtained using the force-field energy function (Fig. 2
), reinforcing the fact that either RMSD or
1 alone cannot provide a complete assessment of side-chain prediction accuracy. It is also worthy of note that GEM converges much faster with the perfect-fitness function (Fig. 3b
).
|
1 and 1.72 Å for RMSD for the 38 test proteins (Table 4
1 prediction and 0.36 Å in average RMSD of side-chain positions (Fig.2a,b
|
1 angle deviating by >30° from the three energetically favored values (+/-60° and 180°). As shown in Figure 4
1 accuracy between nonstrained and strained side-chain conformations. Such a large difference could often be easily overlooked using statistical averages because strained residues represent a small minority of all amino acid side-chain conformations (e.g., of the 4313 residues examined here, only 177 [5%] were strained).
|
1 from 78.8% to 85.6% and RMSD from 1.26 Å to 0.87 Å; Table 4
In summary, we have demonstrated the robustness and adaptability of GEM for exploring the conformational space of protein side chains and efficiently finding the combinatorial solution under the constraint of the fitness function used. The key novelty of the present work is the seamless ability of GEM to blend global discrete search and global continuous search, the former being required for efficiency and the latter for accuracy, and to allow them to work cooperatively. This was achieved through the incorporation of a number of genetic operators, each having unique search mechanisms (Table 5
). For example, whereas the rotamer mutation operator performs a discrete search on the global rotamer space of the rotamer library, the self-adaptive Gaussian mutation performs a local, but continuous, search on torsional conformations, etc. Importantly, the flexibility of GEM should allow us to begin to systematically improve the forms and parameters of energy function for protein side-chain and other protein-structure prediction problems.
|
| Materials and methods |
|---|
|
|
|---|
25 min.
|
![]() | ((1)) |
1 and
1+2 were calculated excluding glycine and alanine residues. The accuracy for
1+2 refers to those cases in which both
1 and
2 were correctly predicted.
Rotamer library and side-chain construction
A main-chain independent rotamer library was built using a modification of the method of Tuffery et al (1991). For each amino acid, only up to ten of the highest populated rotamers were considered. The number of rotamers was 1 for Pro; 3 for Cys, Ser, Thr, Val, Asp, and Phe; 4 for Asn; 5 for Ile; 6 for His, Leu, and Tyr; 7 for Trp; and 10 for the remaining amino acids (excluding Ala and Gly, which have no rotamers). The total number of rotamers was 103.
The side-chain atoms were geometrically constructed by placing side-chain templates onto the main chain of the X-ray structures, using the side-chain dihedral angles generated by GEM. These templates have standard bond lengths and bond angles according to the AMBER force field (Weiner et al. 1984). The initial dihedral angles of the side chains came either from the rotamer library or from an assigned value within the feasible region (-
to
). GEM then adapted these dihedral angles to search for optimal side-chain conformations by minimizing the scoring function.
Scoring energy function
Our scoring energy function was modified from Levitt (1983) and Hwang and Liao (1995):
![]() | ((2)) |
![]() | ((3)) |
ij and rij are constants which depend on the chemical characteristics of atoms i and j; Rij is the distance between the atoms i and j; M is the number of atoms in a protein; NB is the number of atoms within a preset distance (8 Å) of atom i;
ij is the depth of the energy well, and rij is the equilibrium interatomic distance for the van der Waals interaction between atoms i and j. The same function form was used to compute the potential of the hydrogen bonds (EHbond) and disulfide bonds (ESbond). Table 7
|
![]() | ((4)) |
0 were 3, 3, and
, respectively, for the sp3-sp3 type and 1.5, 6, and 0, respectively, for the sp3-sp2 type (Gehlhaar et al. 1995).
GEM algorithm details
Here, we provide an outline of our evolutionary approach for predicting protein side-chain conformations, which can be represented by adjustable variables of dihedral angles as
![]() | ((5)) |
,
). Repeat this N times to generate the initial population of N side-chain conformations for a protein structure. Evaluate the objective value of each conformation based on the scoring function.
Main procedure
In the following subsections, we present the details of our approach for side-chain prediction. The method integrates a global discrete-search mechanism based on a rotamer library and continuous-search mechanisms based on Gaussian mutations. The basic structure of the method is as follows (Fig. 5
): N solutions are generated as the initial population. Each solution (side-chain conformation) is represented as a set of three n-dimensional vectors (
i,
i,
i), where n is the number of adjustable variables (dihedral angles) of a side-chain conformation and i = 1,. . .,N. The vector
in equation 5
represents the adjustable variable to be optimized, and
and
are the step-size vectors of decreasing-based mutation and self-adaptive Gaussian mutation, respectively (Yang and Kao 2000a, b). In other words, each solution,
, is associated with some parameters for step-size control. In this paper, the initial value of
jwas randomly selected either from the rotamer library or from -
to
in radians. The initial step sizes,
and
, were 0.8 and 0.2 radians, respectively.
|
The FC_adaptive procedure employs two parameters, the population operator (P, with N solutions) and the mutation operator (M), to generate a new quasi-population. The main purpose of the FC_adaptive procedure is to produce offspring and then perform the family competition. Each individual in the population sequentially becomes the "family father." This family father and another solution randomly chosen from the rest of the parent population are used as parents for a recombination operation, with a probability of recombination of pc; then a mutation operates on the new offspring or the family father (if recombination does not occur). For each family father, this procedure is repeated L times (family competition length). Finally, L children are produced, but only the one with the lowest objective value survives. Because we create L children from one family father and perform a selection, this is a family-competition strategy.
For easy description of operators, we use a = (
a,
a,
a) to represent the family father and b = (
b,
b,
b) as another parent (only for the recombination operator). The offspring of each operation is represented as c = (
c,
c,
c). We also use the symbol
jd to denote the jth dihedral angle of a side-chain conformation d.
Recombination operators
Modified discrete recombination
Because experience indicated that our method was more robust if the child inherited genes from the family father with a higher probability, we therefore modified the original discrete recombination (Bäck 1996) as follows:
![]() | ((6)) |
Intermediate recombination
This operator is applied in the continuous-search stages as follows:
![]() | ((7)) |
![]() | ((8)) |
is
or
, depending on the mutation operator applied. For example, if the self-adaptive Gaussian mutation was used in this FC_adaptive procedure,
is
. ß is a constant set as 0.5 in this work.
Mutation operators
Mutations are the main operators of our method. After recombination, a mutation operator is applied to the family father or to the new offspring generated by a recombination operator.
Rotamer mutation operator
This operator is used at the rotamer-search stage to find a combination of rotamer conformations. For each residue, this operator is biased to select rotamers of higher probabilities and mutates all of the dihedral angles of a residue according to the rotamer library. For example, this operator changes 3 dihedral angles (
j,
j + 1, and
j + 2) if the residue is Gln, Glu, or, Met. For each of these dihedral angles, this operator is applied with probability 0.2 as follows:
![]() | ((9)) |
ki and pki are the angle value and probability, respectively, of the ith rotamer of the kth residue type. The values of
ki and pki are defined in the rotamer library.
Self-adaptive Gaussian mutation
The mutation is accomplished by first mutating the step size
j and then mutating the dihedral angle
j as follows:
![]() | ((10)) |
![]() | ((11)) |
and
as (
2n)-1 and (
2
n)-1, respectively.
Decreasing-based Gaussian mutations
The decreasing-based Gaussian mutation is accomplished by mutating the step-size vector
with a fixed decreasing rate
= 0.95 as follows:
![]() | ((12)) |
![]() | ((13)) |
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
|
|
|---|
Bower, M.J., Cohen, F.E., and Dunbrack, R.L.J. 1997. Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: A new homology modeling tool. J. Mol. Biol. 267: 12681282.[CrossRef][Medline]
De Maeyer, M., Desmet, J., and Lasters, I. 1997. All in one: A highly detailed rotamer library improves both accuracy and speed in the modeling of sidechains by dead-end elimination. Fold. Des. 2: 5366[CrossRef][Medline]
Desmet, J., De Maeyer, M., Hazes, B, and Lasters, I. 1992. The dead-end elimination theorem and its use in protein side-chain positioning. Nature 356: 539542.[CrossRef]
Dunbrack, R.L.J. and Karplus, M. 1993. Backbone-dependent rotamer library for proteins: Application to side-chain prediction. J. Mol. Biol. 230: 543574.[CrossRef][Medline]
Fogel, DB. 1995. Evolutionary computation: Toward a new philosophy of machine intelligence. IEEE Press, Piscataway, NJ.
Gehlhaar, D.K., Verkhivker, G.M., Rejto, P., Sherman, C.J., Fogel, D.B., Fogel, L.J., and Freer, S.T. 1995. Molecular recognition of the inhibitor AG-1343 by HIV-1 protease: Conformationally flexible docking by evolutionary programming. Chem. Biol. 2: 317324.[CrossRef][Medline]
Goldberg, D.E. 1989. Genetic algorithms in search, optimization and machine learning. Addison-Wesley Publishing Company, Reading, MA.
Holm, L. and Sander, C. 1992. Fast and simple Monte Carlo algorithm for side-chain optimization in proteins: Application to model building by homology. Proteins 14: 213223.[CrossRef][Medline]
Hubbard, S.J. and Thornton, J.M. 1993. NACCESS, computer program. Department of Biochemistry and Molecular Biology, University College London.
Hwang, H.K. and Liao W.F. 1995. Side-chain prediction by neural networks and simulated annealing optimization. Protein Eng. 18: 363370.
Koehl, P. and Delarue, M. 1994. Application of a self-consistent mean field theory to predict protein side-chains conformation and estimate their conformational entropy. J. Mol. Biol. 239: 249275.[CrossRef][Medline]
Laughton, C. 1994. Prediction of protein side-chain conformations from local three-dimensional homology relationships. J. Mol. Biol. 235: 10881097.[CrossRef][Medline]
Leach, A.R. 1994. Ligand docking to proteins with discrete side-chain flexibility. J. Mol. Biol. 235: 345356.[Medline]
Leach, A.R. and Lemon, A.P. 1998. Exploring the conformational space of protein side chains using dead-end elimination and the A* algorithm. Proteins 33: 227239.[CrossRef][Medline]
Lee, C. and Subbiah, S. 1991. Prediction of protein side chain conformation by packing optimization. J. Mol. Biol. 217: 373388.[CrossRef][Medline]
Levitt, M. 1983. Protein folding by constrained energy minimization and molecular dynamics. J. Mol. Biol. 170: 723764.[Medline]
Levitt, M., Gerstein, M., Huang, E., Subbiah, S., and Tsai, J. 1997. Protein folding: The endgame. Annu. Rev Biochem. 66: 549579.[CrossRef][Medline]
Liang, S. and Grishin, N.V. 2002. Side-chain modeling with an optimized scoring function. Protein Sci.11: 322331.
Looger, L.L. and Hellinga, H.W. 2001. Generalized dead-end elimination algorithms make large-scale protein side-chain structure prediction tractable: Implications for protein design and structural genomics. J. Mol. Biol. 307: 429445.[CrossRef][Medline]
Mendes, J., Soares, C.M., and Carrondo, M.A. 1999. Improvement of side-chain modeling in proteins with the self-consistent mean field theory method based on an analysis of the factors influencing prediction. Biopolymers 50: 111131.[CrossRef][Medline]
Morris, G.M., Goodsell, D.S., Halliday, R.S., Huey, R., Hart, W.E., Belew, R.K., and Olson, A.J. 1998. Automated docking using a Lamarckian genetic algorithm and empirical binding free energy function. J. Comp. Chem. 19: 16391662.[CrossRef]
Ponder, J.W. and Richards, F.M. 1987. Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes. J. Mol. Biol. 193: 775791.[CrossRef][Medline]
Samudrala, R. and Moult, J. 1998. Determinants of side chain conformational preferences in protein structures. Protein Eng. 11: 991997.
Schrauber, H., Eisenhaber, F., and Argos, P. 1993. Rotamers: To be or not to be? An analysis of amino acid side-chain conformations in globular proteins. J. Mol. Biol. 230: 592612.[CrossRef][Medline]
Tuffery, P., Etchebest, C., Hazout, S., and Lavery, R. 1991. A new approach to the rapid determination of protein side-chain conformations. J. Biomol. Struct. Dyn. 8: 12671289.[Medline]
. 1993. A critical comparison of search algorithms applied to the optimization of protein side-chain conformations. J. Comp. Chem. 14: 790798.
Väsquez, M. 1995. An evaluation of discrete and continuum search techniques for conformational analysis of side-chains in proteins. Biopolymers 36: 5370.[CrossRef]
Weiner, S.J., Kollman, P.A., Case, D.A., Singh, U.C., Ghio, C., Alagona, G., Profeta Jr., S., and Weiner, P. 1984. A new force field for molecular mechanical simulation of nucleic acids and proteins. J. Am. Chem. Soc. 106: 765784.[CrossRef]
Xiang, Z. and Honig, B. 2001. Extending the accuracy limits of prediction for side-chain conformations. J. Mol. Biol. 311: 421430.[CrossRef][Medline]
Yang, J.M. and Kao, C.Y. 2000a. Flexible ligand docking using a robust evolutionary algorithm. J. Comp. Chem. 21: 988998.[CrossRef]
. 2000b. Integrating adaptive mutations and family competition into genetic algorithms as function optimizer. Soft Computing 4: 89102.
. 2001. An evolutionary algorithm for the synthesis of multiplayer coatings at oblique light incidence. IEEE/OSA J. Lightwave Technol. 19: 559570.[CrossRef]
Yang, J.M., Horng, J.T., and Kao, C.Y. 2000. A genetic algorithm with adaptive mutations and family competition for training neural networks. Int. J. Neural Syst. 10: 333352.[Medline]
![]()
CiteULike
Connotea
Del.icio.us
Digg
Reddit
Technorati What's this?
This article has been cited by other articles:
![]() |
V. Z. Spassov, L. Yan, and P. K. Flook The dominant role of side-chain backbone interactions in structural realization of amino acid code. ChiRotor: A side-chain prediction algorithm based on side-chain backbone interactions Protein Sci., March 1, 2007; 16(3): 494 - 506. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Larranaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armananzas, G. Santafe, A. Perez, et al. Machine learning in bioinformatics Brief Bioinform, March 1, 2006; 7(1): 86 - 112. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Wang, O. Schueler-Furman, and D. Baker Improved side-chain modeling for protein-protein docking Protein Sci., May 1, 2005; 14(5): 1328 - 1339. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP |