|
|
||||||||
1 Department of Biochemistry, University of Washington, Seattle, Washington 98195, USA
2 Molecular Biophysics, Structure & Design Program, University of Washington, Seattle, Washington 98195, USA
3 Department of Chemistry and Pharmacology, Vanderbilt University, Nashville, Tennessee 37232, USA
4 Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
(RECEIVED May 17, 2006; FINAL REVISION September 29, 2006; ACCEPTED October 2, 2006)
| Abstract |
|---|
|
|
|---|
Keywords: enzyme design; protein design; active site recapitulation; proteinligand interactions; geometric hashing
| Introduction |
|---|
|
|
|---|
There has been exciting progress in enzyme design. On the experimental side, catalytic antibodies, elicited by immunization with transition state analogs, have been shown to possess catalytic activity (Lerner et al. 1991; Hilvert 2000). More recently, several successful enzyme designs have been reported. Kaplan and DeGrado (2004) designed a de novo O2-dependent phenol oxidase within a designed four-helix bundle fold. Using computational protein design, Bolon and Mayo (2001) created a histidine-bearing catalyst for the hydrolysis of p-nitrophenyl acetate into p-nitrophenol. More recently, Dwyer et al. (2004) created a highly active enzyme by grafting the triose phosphate isomerase (TIM) active site on to the ribose binding protein scaffold.
The computational methods used in enzyme site design to date, such as ORBIT from the Mayo group (Dahiyat and Mayo 1996) and Dezymer from the Hellinga group (Hellinga and Richards 1991), have primarily been used to search for catalytic site placement in one or a small number of scaffolds. In contrast, computational methods for searching for functional sites that employ geometric hashing (Russell 1998) are able to search through thousands of scaffolds rapidly. However, these methods have been used in cases where the positions of the side chain functional groups are known, as in the case of proteins of known structure and thus are not immediately applicable to enzyme active site design.
In general, how to evaluate and optimize computational design methods for the creation of new molecules is a nontrivial problem. For robust conclusions, it is desirable to compare alternative methods and parameter choices by comparing results on a representative set of test systems. In the "protein design cycle" approach described by Dahiyat and Mayo (1997), alternative choices in a design method are tested by producing designs and experimentally characterizing them, and the choice is selected that produces designs with the desired properties. While this is a very powerful approach, experimentally characterizing a large number of designs for a number of different methods is slow and expensive, and therefore, it is desirable to have a faster and cheaper test. A purely in silico test for monomeric protein design developed in our group based on recapitulation of native protein sequences (Kuhlman and Baker 2000) has proven invaluable in guiding improvement of our protein design methodology.
In this study, we describe an in silico benchmark for computational enzyme design based on recapitulation of the locations and structures of native enzyme active sites in a set of naturally occurring enzymatic scaffolds. Given the backbone coordinates of 10 naturally occurring enzymes, and a list of the 10 reactions they catalyze, active sites are designed for each reaction in each scaffold. The designs for each reaction are collectively ranked based on their computed catalytic efficacy (see Materials and Methods). To evaluate and guide the optimization of enzyme design methodology, we make the assumption that the actual native enzyme is likely to be a better catalyst than any of the designed enzymes. With this assumption, alternative design methods can be evaluated based on the ranks of the actual native active site for each reaction among all the designs found and the associated computational cost required for the large number of calculations involved.
We also describe two new methods for computational enzyme design that utilize hashing algorithms to enable active site searches in large numbers of scaffolds. Given a description of a catalytic site consisting of a transition state structure surrounded by protein functional groups in geometrical positions optimal for catalysis and a set of protein scaffolds, the methods first search for sites in the scaffolds where the active site can be recapitulated. In the first method, an "inverse rotamer tree" approach is used with a modified version of the geometric hashing algorithm (Bachar et al. 1993) to find positions in a set of scaffolds that can support the catalytic site. In the second method, based on iterative side chain placement and hashing in six-dimensional space, candidate catalytic sites in scaffolds are detected in linear time. Both methods are followed by the design of the pocket using the standard Rosetta design methodology (Kuhlman and Baker 2000). We describe the performance of the two methods in the in silico enzyme design benchmark.
| Results |
|---|
|
|
|---|
Recapitulation of native enzymatic sites
We use two native active site recapitulation tests to benchmark the two new methods. Ten crystal structures of enzyme-transition state analog complexes or enzymeinhibitor complexes with a resolution of 2.5 Å or better were taken from the Protein Data Bank (PDB) (Berman et al. 2002). The resulting benchmark set includes members of the hydrolase, lyase, isomerase, and transferase enzyme families (Table 1); the only major family missing is the oxidoreductase family, which typically employs non-protein cofactors. The number of catalytic residues at the active sites varies from two to four, and the catalytic amino acids include Asp, Glu, Asn, His, Cys, Ser, Tyr, and Lys (Table 1). The catalytic residues documented as being involved in the catalytic mechanism for each enzyme of our benchmark are used to build the catalytic site descriptions for the corresponding reaction. For each chemical reaction, two benchmark tests were carried out using the complete protocol described in Figure 1, using either the inverse rotamer tree method or the RosettaMatch method. In the first benchmark test, the geometrical parameters relating the TS analog and the functional atoms are taken directly from the crystal structure of the complex. In the second benchmark test, the geometrical parameters are set to optimal values based on the simple rules described in Table 2. The challenge is to recapitulate the native active site by correctly identifying among all designs in all scaffolds the native site in the native scaffold based on the predicted catalytic efficacy.
|
|
|
|
|
|
|
-chymotrypsin, cytosine deaminase, and bovine carboxypeptidase A, which contain two, three, four, and four catalytic residues, respectively. The number of homolog structures and their backbone root mean square deviation (RMSD) to the query structure for each enzyme are summarized in Table 5. As indicated in Table 2, our methods are capable of finding the active site for homolog structures of up to
4.0 Å backbone RMSD, showing that they are tolerant of variation in backbone coordinates up to this level (the native site can be found multiple times because of the fineness of the rotamer sampling).
|
| Discussion |
|---|
|
|
|---|
Although the algorithms we describe are new, the overall approach of starting with a geometric description of an active site, searching through a protein scaffold for positions where it can be placed, and designing the surrounding residues has been used in previous studies (Hellinga and Richards 1991; Bolon and Mayo 2001). The algorithms described here have several advantages over previously described methods. The inverse rotamer treebased search complexity does not depend on the number of scaffolds searched, whereas previous methods scale at least linearly with the number of positions (and, consequently, scaffolds searched). Dezymer (Hellinga and Richards 1991), for example, places all rotamers for the anchor residue at each position, thereby scaling at least proportionally with the number of positions considered. The approach taken by Bolon and Mayo (2001) also places an extended rotamer (that includes the TS model) on each search position, leading to the same dependence. The computational efficiency of the inverse rotamer treebased algorithm can be a tremendous advantage if large-scale enzyme site searches are required. The algorithm, however, is limited by its exponential dependence on the number of rotamer combinations considered. In the case of active sites with four or more active site residues, the algorithm performs poorly. Since it is not possible to use large rotamer libraries, the use of this algorithm is limited to a more coarse-grained search.
The RosettaMatch method avoids the combinatorial explosion by treating each catalytic side chain independently in building up the hash table. It thus scales linearly with the number of rotamer combinations considered. Once the hash maps have been built up, the complexity of the look-up step is constant time on average. In the worse-case scenario (i.e., when many TS models placed in different boxes map to the same hash key), the hash look up scales as O(N), where N is the number of entries for the box. Although it is not easy to directly compare the complexity of the algorithm with Hellinga's Dezymer, the RosettaMatch method has the advantage that the algorithm complexity depends only linearly on the number of residues making up the active site and the total number of rotamers used.
The design methods in their current form can be used to design new active sites in existing scaffolds based either on the structures of naturally occurring active sites or on chemical intuition; the speed of the methods makes it possible to search large sets of scaffolds for optimal active site placements. In the benchmark test, a number of the non-native designs have nearly perfect catalytic geometries and transition state binding energies as low or lower than the native match and potentially represent viable enzymes. As an example, Figure 4 shows a design for an aldolase active site built on a decarboxylase scaffold with a calculated binding energy after design comparable to the native enzyme. The experimental evaluation of the activity of such high-ranking designs in non-native scaffolds will test our understanding of the mechanisms of enzyme catalysis. To extend to new reactions for which natural enzymes provide less guidance, it should be very advantageous to use quantum chemistry methods to compute transition states and ideal active site geometries. In particular, the "theozyme" concept developed by Houk and coworkers (Tantillo et al. 1998) fits very nicely into our approach as the coordinates of the theozyme can be used directly as input for the matching process.
|
| Materials and methods |
|---|
|
|
|---|
|
|
,C) for each pair of residues for each scaffold are mapped onto a unique key that is computed from the C
C
distance and the [C
,C
] vector orientations. For speed, all the scaffolds are mapped into a single hash in memory at the beginning of the program. Each combination of backbone atom coordinates from the inverse rotamer tree is matched against the backbone distances and orientations stored in the hash table using a subgraph isomorphism algorithm similar to that described by Russell (1998). Matches are ranked based on their structural similarity (in RMSD) to the specified active site geometry, and the absence of atomic clashes between the TS model, the placed catalytic side chains, and the protein backbone.
The RosettaMatch approach
The idea of this approach is to build forward from the protein backbone to the TS model for each catalytic side chain independently, and then to identify TS placements compatible with placement of each catalytic residue. The method may be viewed as an extension of the MetalSearch algorithm (Clarke and Yuan 1995) to include ligand orientation as well as center of mass coordinates. We describe first the storage of the position of the TS model for each catalytic side chain rotamer placed at each position using a hash table and, second, the processing of the hash table to extract sets of positions compatible with the specified active site geometry. Finally, we describe performance enhancements to the method using precomputed grids to restrict TS placement to clefts and pockets in the scaffolds, and to speed up the evaluation of atomic clashes with the protein backbone.
For each protein scaffold, a set of potential active site positions is predefined, either all positions in the protein or positions lining cavities or small molecule binding sites. For each amino acid residue in the catalytic site description, all rotamers form the Dunbrack backbone dependent library (Dunbrack and Cohen 1997) are placed at each position. If there is no clash with protein backbone, the TS model for the reaction is positioned as specified in the catalytic site definition. For catalytic side chainTS interactions such as hydrogen bonds, where there are many chemically equivalent interaction geometries, a large set of TS model placements are considered; the fineness of the sampling around the varying degrees of freedom (the side chainTS dihedral in the hydrogen bonding case) is specified in Table 2. Each TS rigid body placement is represented by [v, q ], where v is the vector of the coordinates of the center of mass (x,y,z) of the TS model, and q is the unit quaternion (q1,q2,q3,q4) (Latombe 1991) associated with the rotation that moves the TS model from a reference frame to its current placement. TS model placements are recorded in a hash table if there are no clashes with the protein backbone or the catalytic side chain using the key K computed as follows:
|
|
where the bracket is the integer part, N h is the expected size of the hash, c i is the coordinate in direction i, c i0 is the origin for the direction i, d i is grid spacing for each direction, and N j is the total number of grid points in direction j.
For each placement of the TS model, the following information is stored in the hash table at the position identified by the key K: the box coordinates (c 1,...,c 7) in which the TS model falls, the position in the protein sequence, the residue type (e.g., His, Asp, etc.), the index of the rotamer in the backbone-dependent library, and the rigid body orientation of the TS model [v, q ]. The position in the hash does not suffice to specify the TS position because the hash operator cannot be inverted. For each key K, one list per catalytic residue is kept that records all the information described above for each TS model that hashed with the key K.
Each key of the hash table (corresponding to each discrete box of the six-dimensional space) thus contains N lists, where N is the number of residues making up the catalytic site. If at least one of the N lists is empty, a catalytic site with the specified geometry does not exist for the corresponding TS model location. If the N lists are all not empty, a complete active site can be generated, and every combination of catalytic residues for which there are no significant atomic clashes between the catalytic side chains and no two residues originate in the same backbone position are selected for subsequent minimization and design as described below.
Finding the active site matches requires on the order of 15 min per scaffold on an Intel Xeon machine at 2.8 Ghz with 2 Gb of RAM, with no diversification for the three-residue active site for type II aldolase. The runs take
2 h on the same machine with full diversification of the free degrees of freedom for the same active site. In addition, the RosettaMatch method is easily amenable to parallelization by splitting the pocket into different spatial regions and distributing the building of the hash table among different processors.
To focus the design calculations on promising regions of the scaffold, the center of mass of the TS model may be restricted to clefts or pockets that are likely to be large enough to comprise a viable active site. A square grid box is first constructed that covers the regions targeted for active site design. This grid is then trimmed to remove all the grid points that are <2.25 Å from any protein backbone atom. Any residue on the protein backbone that has a C
C
vector pointing toward one of those grid points and a C
<3.5 Å from any grid point is then included in the set of active site positions. In practice, the use of the grid does not substantially reduce the number of matches found, but it considerably speeds up the search process by eliminating regions unlikely to contribute high ranking active site designs.
To speed up the evaluation of clashes between the TS model and the protein backbone, a "backbone" grid is constructed that contains points that are <2.25 Å from any backbone atom. TS model placements for which atoms overlap the backbone grid are not included in the hash.
Step 2: Optimization of catalytic site placement in scaffold
For each match found with the inverse rotamer tree or the RosettaMatch method, residues around the TS model, other than the catalytic residues, are truncated to glycines. The initial placements of the TS model and catalytic side chain conformations are optimized by rigid body minimization followed by side chain minimization using Rosetta (Gray et al. 2003a,b; Wang et al. 2005). The potential used for minimization consists of the repulsive part of a standard Lennard Jones 612 potential (Kuhlman and Baker 2004), a side chain torsional statistical potential (Dunbrack and Cohen 1997) complemented by a "virtual energy" term that describes the extent to which the functional groups on the catalytic side chains satisfy the ideal geometry described in the active site. The virtual energy term is a quadratic penalty function of the geometrical parameters that relate the functional groups of the catalytic residues to the TS (Fig. 5). Minimization is carried out multiple times using Powell's method (Flannery et al. 2002), gradually increasing the weight on the repulsive interactions between iterations. A very low value is used initially to avoid repulsion of the TS model from the active site.
Step 3: Sequence optimization around the TS model
The minimization step leads to pockets in which a non-clashing TS model is placed with catalytic side chains positioned with functional atoms close to the optimal geometry required for catalysis. It is then necessary to design the surrounding, noncatalytic protein residues to maximally stabilize the transition state. The conformations and identities of residues surrounding the TS model are optimized using Monte Carlo simulated annealing as described previously (Kuhlman and Baker 2000). The potential consists of (1) a 12-6 Lennard-Jones potential with an attenuated repulsive component (Kuhlman and Baker 2004), (2) an implicit solvation model (Lazaridis and Karplus 1999), (3) an orientation-dependent hydrogen bonding term (Kortemme and Baker 2002; Kortemme et al. 2003, 2004; Jiang et al. 2005), (4) a Coulomb model with a distance dependent dielectric constant, (5) a pair potential derived from the Protein Data Bank (Simons et al. 1999) that captures features of side chain side chain electrostatics, and (6) a backbone dependent side chain torsional potential derived from known structures (Dunbrack and Cohen 1997). This potential has performed very well in proteinsmall molecule docking calculations (Meiler and Baker 2006).
| Footnotes |
|---|
Reprint requests to: David Baker, Department of Biochemistry, University of Washington, Seattle, WA 98195, USA; e-mail: dabaker{at}u.washington.edu; fax: (206) 685-1792.
Abbreviations: TS, transition state; RMSD, root mean square deviation; DERA, deoxyribose-phosphate aldolase.
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.062353106.
| Acknowledgments |
|---|
|
|
|---|
| References |
|---|
|
|
|---|
Bachar, O., Fischer, D., Nussinov, R., and Wolfson, H. 1993. A computer vision based technique for 3-D sequence-independent structural comparison of proteins. Protein Eng. 6: 279288.
Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., and Jain, S., et al. 2002. The Protein Data Bank. Acta Crystallogr. D Biol. Crystallogr. 58: 899907.[CrossRef][Medline]
Bolon, D.N. and Mayo, S.L. 2001. Enzyme-like proteins by computational design. Proc. Natl. Acad. Sci. 98: 1427414279.
Clarke, N.D. and Yuan, S.M. 1995. Metal search: A computer program that helps design tetrahedral metal-binding sites. Proteins 23: 256263.[CrossRef][Medline]
Dahiyat, B.I. and Mayo, S.L. 1996. Protein design automation. Protein Sci. 5: 895903.[Abstract]
Dahiyat, B.I. and Mayo, S.L. 1997. De novo protein design: Fully automated sequence selection. Science 278: 8287.
Dunbrack Jr., R.L. and Cohen, F.E. 1997. Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Sci. 6: 16611681.[Abstract]
Dwyer, M.A., Looger, L.L., and Hellinga, H.W. 2004. Computational design of a biologically active enzyme. Science. 304: 19671971.
Flannery, B., Vetterling, W.T., Teukolsky, S.A., and Press, W.H. 2002. Numerical recipes in C++, 2nd ed. Cambridge University Press, Cambridge, UK.
Gray, J.J., Moughon, S., Wang, C., Schueler-Furman, O., Kuhlman, B., Rohl, C.A., and Baker, D. 2003a. Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J. Mol. Biol. 331: 281299.[CrossRef][Medline]
Gray, J.J., Moughon, S.E., Kortemme, T., Schueler-Furman, O., Misura, K.M., Morozov, A.V., and Baker, D. 2003b. Protein-protein docking predictions for the CAPRI experiment. Proteins 52: 118122.[CrossRef][Medline]
Hellinga, H.W. and Richards, F.M. 1991. Construction of new ligand binding sites in proteins of known structure. I. Computer-aided modeling of sites with pre-defined geometry. J. Mol. Biol. 222: 763785.[CrossRef][Medline]
Hilvert, D. 2000. Critical analysis of antibody catalysis. Annu. Rev. Biochem. 69: 751793.[CrossRef][Medline]
Jiang, L., Kuhlman, B., Kortemme, T., and Baker, D. 2005. A "solvated rotamer" approach to modeling water-mediated hydrogen bonds at protein-protein interfaces. Proteins 58: 893904.[CrossRef][Medline]
Kaplan, J. and DeGrado, W.F. 2004. De novo design of catalytic proteins. Proc. Natl. Acad. Sci. 101: 1156611570.
Kortemme, T. and Baker, D. 2002. A simple physical model for binding energy hot spots in protein-protein complexes. Proc. Natl. Acad. Sci. 99: 1411614121.
Kortemme, T., Morozov, A.V., and Baker, D. 2003. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J. Mol. Biol. 326: 12391259.[CrossRef][Medline]
Kortemme, T., Joachimiak, L.A., Bullock, A.N., Schuler, A.D., Stoddard, B.L., and Baker, D. 2004. Computational redesign of protein-protein interaction specificity. Nat. Struct. Mol. Biol. 11: 371379.[CrossRef][Medline]
Kuhlman, B. and Baker, D. 2000. Native protein sequences are close to optimal for their structures. Proc. Natl. Acad. Sci. 97: 1038310388.
Kuhlman, B. and Baker, D. 2004. Exploring folding free energy landscapes using computational protein design. Curr. Opin. Struct. Biol. 14: 8995.[CrossRef][Medline]
Latombe, J. 1991. Quaternions. In Robot motion planning,, pp. 7274. Kluwer Academic Publishers, Boston.
Lazaridis, T. and Karplus, M. 1999. Effective energy function for proteins in solution. Proteins 35: 133152.[CrossRef][Medline]
Lerner, R.A., Benkovic, S.J., and Schultz, P.G. 1991. At the crossroads of chemistry and immunology: Catalytic antibodies. Science 252: 659667.
Meiler, J. and Baker, D. 2006. ROSETTALIGAND: Protein-small molecule docking with full side-chain flexibility. Proteins 65: 538548.[CrossRef][Medline]
Russell, R.B. 1998. Detection of protein three-dimensional side-chain patterns: New examples of convergent evolution. J. Mol. Biol. 279: 12111227.[CrossRef][Medline]
Simons, K.T., Ruczinski, I., Kooperberg, C., Fox, B.A., Bystroff, C., and Baker, D. 1999. Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 34: 8295.[CrossRef][Medline]
Tantillo, D.J., Chen, J., and Houk, K.N. 1998. Theozymes and compuzymes: Theoretical models for biological catalysis. Curr. Opin. Chem. Biol. 2: 743750.[CrossRef][Medline]
Wang, C., Schueler-Furman, O., and Baker, D. 2005. Improved side-chain modeling for protein-protein docking. Protein Sci. 14: 13281339.
![]()
CiteULike
Connotea
Del.icio.us
Digg
Reddit
Technorati What's this?
This article has been cited by other articles:
![]() |
I. Georgiev, D. Keedy, J. S. Richardson, D. C. Richardson, and B. R. Donald Algorithm for backrub motions in protein design Bioinformatics, July 1, 2008; 24(13): i196 - i204. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Jiang, E. A. Althoff, F. R. Clemente, L. Doyle, D. Rothlisberger, A. Zanghellini, J. L. Gallaher, J. L. Betker, F. Tanaka, C. F. Barbas III, et al. De Novo Computational Design of Retro-Aldol Enzymes Science, March 7, 2008; 319(5868): 1387 - 1391. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. DeChancie, F. R. Clemente, A. J.T. Smith, H. Gunaydin, Y.-L. Zhao, X. Zhang, and K.N. Houk How similar are enzyme active site geometries derived from quantum mechanical theozymes to crystal structures of enzyme-inhibitor complexes? Implications for enzyme design Protein Sci., September 1, 2007; 16(9): 1851 - 1866. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. Georgiev and B. R. Donald Dead-End Elimination with Backbone Flexibility Bioinformatics, July 1, 2007; 23(13): i185 - i194. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |