Protein Science Sheba protein
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


Protein Science (2005), 14:1741-1752. Published by Cold Spring Harbor Laboratory Press. Copyright © 2005 The Protein Society
This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Right arrow Citation Map
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Chen, W. W.
Right arrow Articles by Shakhnovich, E. I.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chen, W. W.
Right arrow Articles by Shakhnovich, E. I.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us   Add to Digg   Add to Reddit   Add to Technorati  
What's this?

Lessons from the design of a novel atomic potential for protein folding

William W. Chen1 and Eugene I. Shakhnovich2

1 Department of Biophysics, Harvard University, Boston, Massachusetts 02115, USA
2 Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, USA

Reprint requests to: Eugene I. Shakhnovich, Harvard University, Department of Chemistry and Chemical Biology, 12 Oxford St., Cambridge, MA 02138, USA; e-mail: eugene{at}belok.harvard.edu; fax: (617) 495-3075.

(RECEIVED March 2, 2005; FINAL REVISION March 2, 2005; ACCEPTED April 11, 2005)


    Abstract
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 References
 
We investigate all-atom potentials of mean force for estimating free energies in protein folding and fold recognition. We search through the space potentials and design novel atomic potentials with a random mixing approximation and a contact-correlated Gaussian approximation of decoy states. We show that the two derived potentials are highly correlated, supporting the use of the random energy model as an accurate statistical description of protein conformational states. The novel atomic potentials perform well in a Z-score and fold decoy recognition test. Furthermore, the designed atomic potential performs slightly and significantly better than atomic potentials derived under a quasi-chemical assumption. While accounting for connectivity correlations between atom types does not improve the performance of the designed potential, we show these correlations lead to ambiguities in the distribution of energetic contributions for atoms on the same residue. Within the confines of the model then, many potentials may exist which stabilize all native folds in subtly different ways. Comparison of different protein conformations under the various atomic potentials reveals both a remarkable degree of correspondence in the estimated free energies and a remarkable degree of correspondence in the identity of the contacts types that make the dominant contributions to the estimated free energies. This consistency may be interpreted as a sign that the design procedure is extracting physically meaningful quantities.

Keywords: atomic; potential; protein folding; fold recognition; random energy model; optimization

Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.051440705.


    Introduction
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 References
 
When describing the conformation of a protein, one is given a choice for the degree of coarse-graining. At one end of the spectrum, there is little or no coarse-graining and all atoms of both protein and solvent are retained (Duan and Kollman 1998). While such a description may offer a detailed description of the underlying physics, it suffers from unreasonable computational complexity. At the other end of the spectrum, a protein may be simplified to the point of confinement to some underlying lattice, with the solvent implicitly defined on all exposed lattice sites (Krigbaum and Lin 1982). Such a description gives one considerable reduction in conformational space, at the cost of losing accurate predictive power of the physics. In between these two extremes are two oft-used approximations: the implicit solvent/residue representation and the implicit solvent/all-atom representation. In the residue representation, each residue is reduced to a point, and energetics are assumed to be independent of the internal conformations of the amino acids. This description is thought to be inadequate for ab initio folding or difficult fold recognition (Vendruscolo et al. 2000). For this reason, more detailed models involving many-body potentials, residue orientational dependences, and all-atom representation have been used in computational studies of protein folding and fold prediction (Shimada and Shakhnovich 2002; Buchete et al. 2004). The all-atom representation offers proper accounting of volume effects of amino acids, and proper accounting of the energetics of the different atom types in a single residue. Some multi-body residue-level effects arising from sterics are naturally encapsulated in an all-atom model. Furthermore, the implicit solvent condition ensures that the model is not overly computationally inaccessible to simulation. The Hamiltonian of this coarse-grained system is typically given in terms of a sum of two-body contact terms.


(1)

The presence of a contact between any two atoms i and j is indicated by a 1 or 0 in {Delta}ij. Each atom is typed according to some scheme in which an atom is classified by its physico-chemical properties or its identity in an amino acid. The type of atom i is given by {eta}i. The potential of mean force between two atom types {eta}i and {eta}j is given by the N x N array, B ({eta}i, {eta}j) where N is the number of atom types. An equivalent way of writing this contact energy is the following.


(2)

Here the term B({eta}i, {eta}j) is given an identity {varepsilon}k, and nk is the number of contacts in the conformation involving the types {eta}i and {eta}j. This is an "effective" potential, or a potential of mean force because all degrees of freedom arising from the solvent have been averaged out. The commonly accepted Anfinsen hypothesis (Anfinsen et al. 1954), that the native state is thermodynamically dominant, means that the minimization of the effective Hamiltonian (Equation 1) over protein conformations should yield the native conformation.

The big problem arising from this process of coarsegraining is the lack of energy parameters for the coarsegrained Hamitonian in Equation 1. In principle, it is possible to run equilibrium simulations to calculate potentials of mean force for each particular coarsegrained representation from more fundamental physical parameters. In fact, it has been shown that there is good correspondence between detailed physical potentials and coarse-grained ones (Mohanty et al. 1999). In practice, this is computationally quite expensive.

To circumvent these computational difficulties, in the case of residue models, a host of knowledge-based approaches have been used to extract or design potentials of mean force with varying adherence to physical theories. A notable example of potential extraction is found in an early study by Miyazawa and Jernigan (1985). The quasi-chemical approximation (Bethe approximation) contained therein was originally derived for extracting the Hamiltonian parameters between particles, using equilibrium snapshots of the particle mixtures (Bethe 1935). Efforts have been devoted to finding the appropriate reference state to describe a compact polymeric chain in the quasi-chemical picture (Skolnick et al. 1997, 2000). Because there is controversy as to whether it is appropriate to treat the native protein polymer as an equilibrium fluid mixture of residues, other approaches completely eschew the quasi-chemical assumptions and treat the problem as one of optimization (Mirny and Shakhnovich 1996; Tobi and Elber 2000; Vendruscolo et al. 2000). The objective, roughly speaking, is to maximize the gap between the energy of the native conformation and all other conformations in the space of parameters. Surprisingly, the extracted parameters tend to be highly correlated to the quasichemical parameters.

The study of all-atom potentials has been approached from similarly diverse angles. The quasi-chemical approach, when applied to atoms can give potentials that distinguish native from decoys in gapped threading, and can give high Z-scores to native structures when compared to folding decoys (Lu and Skolnick 2001; Zhou and Zhou 2002). The atomic µ-potential, derived from a set of native protein structures in a manner different from the quasi-chemical approach, gives a potential good enough to fold a small {alpha}-helical bundle (Kussell et al. 2002). Hybrid atomistic potentials for single proteins, in models incorporating the volume effects of atoms and a residue-based atom-typing scheme, have been generated by minimizing the native state free energy of a single protein (Crippen 2001). As in the case of the residue models, it would be interesting to know whether extracted all-atom potentials are really the best possible potentials, and if not, what relation exists between the extracted all-atom potentials and the optimal potential.

To address this question, it is possible to design an all-atom potential without a priori physical assumptions of the underlying physics of interatomic interactions, and use Z-score and gapped threading tests to assess the performance of the designed potential. Appropriate benchmarks for these tests are other atomic potentials generated from physical and mathematical considerations (e.g., µ-potentials and quasi-chemical potentials). Moreover, it would be of interest to know the appropriate assumptions needed for an accurate description of the decoy energy distribution.

To see what these assumptions might entail, we consider first that the design procedure necessarily requires one to maximize the energy gap against a set of decoys. With finite computational resources, the set of decoys should be appropriately picked for the potential under consideration, e.g., designing a hydrogen-bonding potential would require decoys with poorly formed or misformed hydrogen-bonds, whereas designing a residuebased potential might not have such a requirement. However, to avoid explicitly considering a finite set of decoys, one may also assume a general mathematical description of their energetic distribution. An example of this is theRandomEnergyModel (REM) (Shakhnovich and Gutin 1989). Drawn from the statistical mechanics of a polymer chain with quenched sequence disorder, the REM is essentially a thermodynamic description of the energy levels of chain conformations. The main implication is that the energy levels are independent values drawn from a Gaussian distribution. The design procedure outlined here may be carried out under such mathematical formulations for the decoy energy distribution, and the resulting potentials can be compared for their ability to distinguish native folds from non-native ones.

Finally, it would be interesting to see whether this design procedure for an atomic typing scheme, which begins with very little physics, can actually recapitulate meaningful physical parameters that are expected of a real potential of mean force. In this study, we design a novel implicit-solvent/all-atom potential of mean force using the procedure described in the work by Mirny and Shakhnovich (1996). The design objective is the simultaneous maximization of the energy gap for each protein in the design set.

We test the robustness of the design procedure to the background decoy distribution. While there are studies that show, with certain restrictions, the random mixing description (see Methods) for residue models of proteins is a good assumption (Skolnick et al. 1997), the picture is less clear in the all-atom representation. A simple extension of the random mixing model permits us to test whether real correlations between atom types in realistic decoys have an effect on both the decoy spectrum and the design procedure. These potentials can be compared via a Z-score test with a published set of compact decoys (Park and Levitt 1996), and also via a gapped threading test we have developed for all-atom models similar to previous studies (Lu and Skolnick 2001). While positive performance on these tests is not sufficient to show that the potentials are a good approximation of the "true" potential, it is at least necessary.

To compare two potentials, it is insufficient to look at the correlation of individual terms across the potentials. The contribution of each parameter to the stability of the native state depends on both the sign and magnitude of that term, and the number of contacts that involve the particular parameter. For this reason, two potentials may correlate very poorly, but the energies they assign to protein conformations may correlate very well. Thus, it makes most sense to compare the total energies of conformations under the different potentials of mean force. Similarly, to assess the importance of each parameter, it makes most sense to look at the weighted contribution of that parameter, i.e., the value of the parameter weighted by the number of contacts involving that parameter. In this way, we are able to assess the relative importance of each contact type in stabilizing the native structure for the different potentials.


    Methods
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 References
 
Typing scheme
We employ a twenty-three atom typing scheme (see Fig.1Go) rather than one in which each atom is typed separately (Kussell et al. 2002). This typing scheme is based on human intuition, with grouping by similar physico-chemical properties or by proximity in the side chain. Thus, for example, {gamma}- and {delta}-carbons of lysine are given one type. Although it may seem most general to give each residue atom its own identity, in reality the issue of typing is slightly more delicate. Recent work has shown that there should be balance between the maximal number of types and the size of a finite database: increasing the number of types becomes detrimental to the extracted potential at some point as the frequency of observed contacts falls and relative fluctuations become large (Dominy and Shakhnovich 2004). While a rigorous study of the optimal typing scheme is beyond the scope of the study presented here, the number of atom types we have chosen here is satisfactory for two reasons. First, this typing scheme can be accommodated into the Gaussian Approximation design procedure, which is memory intensive. Second, at the same time, this typing scheme is detailed enough to make distinctions between, say, the {beta}- and {gamma}-carbons of a long side chain. A description and diagram of the typing scheme is given in Figure 1Go.



View larger version (23K):
[in this window]
[in a new window]
 
Figure 1. The twenty-three atom typing scheme. Typing scheme was chosen to reflect either physico-chemical similarity or positional equivalence. For example, the {beta}-carbon of glutamine and asparagine are both typed 10 according to positional and chemical equivalence, but the {beta}-carbon of arginine is typed 15.

 
Training and testing sets
For designing the potential, we chose 70 compact, small proteins randomly selected from the FSSP for their structural diversity. All proteins in this set had 129 residues or less. We purposefully selected proteins without disulfide bonds so there would be no covalent effects in the designed potential. In order to test the potentials, we randomly selected another 91 compact, small proteins from the FSSP. The 70-protein design set and the 91-protein testing set have no overlap. When any one protein from set is compared to any other protein from the other set with BLAST, no pair exhibits >30% sequence identity. Pairwise BLAST of the two sets shows an average sequence identity of 8%, with a variance of 5%. Likewise, self-pairwise BLAST of the 70 protein set and 91 protein set reveals low sequence identity (average sequence identity 7.6% and 6.2%, respectively.) For the quasi-chemical potential and µ-potential, we choose a much larger training set of 488 proteins. These 488 proteins were selected randomly from the FSSP, to represent structurally diverse folds. Though a selection from the FSSP should guarantee a good sampling of folds, there is a question as to whether there would still be biases in the training set because of the dominance of a few protein folds. To check stability and robustness of our results, we generated the µ-potential and the quasi-chemical potentials from a randomly selected 100-protein subset of the 488-protein set. All results showed statistically insignificant changes (unpubl.), thereby lending us some confidence that our results involving the µ and quasi-chemical potentials do not depend strongly on the training set. Lists of PDB files can be found online (http://www-shakh.harvard.edu/~will/atom_pot_design.html).

Design procedure
The design procedure we use essentially simultaneously maximizes the energy gap between the native and the mean of the decoy spectrum. The scale measuring the energy gap is appropriately set by the width of the decoy spectrum. This is the Z-score. For one protein, the definition is as follows:


(3)


(4)


(5)

The number M is the number of decoys for this particular sequence. The term EN is the energy of the native state with the given potential. The term Ei is the energy of the ith decoy for the protein. Because we seek to maximize all gaps simultaneously, we combine the Z-scores using the harmonic mean of all n proteins in the design set. The use of the harmonic mean discourages a few outlier Z-scores from dominating the maximization process. For computer memory reasons, the design set was kept small (70 proteins). To prevent covalent effects from appearing in the potential, the design set was picked so that no protein contained a disulfide bond.


(6)

Because only the first and second moments of the decoy energies E are required to calculate the Z-score, it is convenient to use the contact representation of Equation 7 and calculate the average number of contacts for each type, and the correlators between each pair of types for each decoy set.


(7)


(8)

The choice of decoys determines the matrix of correlators (Equation 8) to be used in design, two of which we describe below. The design algorithm proceeds along the following lines. The all-atom contact energy values are initially generated from a Gaussian distribution. The mean and variance are scaled to 0 and 1, respectively. The quantity ZHARM is calculated for all proteins in the design set. A random term in the potential is selected, and its value is adjusted by a number drawn from a Gaussian distribution with mean and variance of 0 and 0.1 respectively. The quantity ZHARM is recalculated, and a metropolis criterion is used to accept or reject this particular move. Repeated design runs with different initial conditions converge to very similar potentials (correlations of 0.99).

Random mixing approximation
One possibility for decoys states is to imagine collapsed states of the protein as "soups" of atoms. As in the quasi-chemical approximation, atoms randomly diffuse in the soup, and thus the types of contacts are randomly distributed. We generate correlators for this random mixing model (<nknl>RANDOM) by repeatedly shuffling the atoms in the native structure and recording the correlations between atom types. The correlators of these random mixing model decoys are dictated primarily by the amino acid composition of the protein. We designate this potential the Random Mixing Approximation potential.

Gaussian approximation
Unlike the case of residue models (Skolnick et al. 1997), it is unclear as to whether the random mixing model of decoys holds true in an all-atom picture. When two atoms interact, for example, a positively charged nitrogen of arginine and a negatively charged oxygen of aspartic acid, they necessarily "drag" along other atoms in the side chains, possibly bringing them into the contact. Within the framework of the second moment approximation of the decoy spectrum, one can go beyond the random mixing model and introduce these correlations for real decoys explicitly. We generate real, compact decoys with secondary structure by threading the original sequence onto backbone templates from a set of 488 protein structures. The sequence of interest is threaded onto a non-native template. Side chains were then built into the structure. We used a Monte-Carlo procedure to sample and minimize sidechain clashes. The system was sampled at a temperature of 5 (the energy scale has been set by fixing the variance of the potential terms to 1; see Methods, "Design Procedure"). The best results were obtained with high clash penalties (e.g., 100). Side-chain rotamer angles were chosen from a flat distribution over 2{pi} radians. For physical reasons, we ignored interactions between atoms of neighboring and nearest neighboring residues. Clash radii were chosen as in another study (Kussell et al. 2002). For each compact decoy, we record the value nknl for each pair of contact types to build up the correlator. Three thousand decoys were sampled for their contact correlators. Because this is the second moment approximation to real decoys, we term this correlator the Gaussian Approximation correlator, <nknl>GAUSSIAN. The Gaussian Approximation correlator allows us to probe two points: to what extent is the random mixing model sufficient to describe spectrum of compact, protein-like decoys, and to what extent would accounting for such correlations affect the potentials that are designed under these conditions. We designate this potential the Gaussian Approximation potential.

Quasi-chemical approximation
For comparison with the same 23-atom typing scheme, we constructed a quasi-chemical potential using a formulation previously studied (Skolnick et al. 2000; Lu and Skolnick 2001). In these studies, three different types of reference states were considered. Of the three, the so-called "partial-composition-corrected" method demonstrated superior results in threading tests (Lu and Skolnick 2001) when compared against the other formulations. We used this partial-compositioncorrected reference state. Each pairwise energy parameter was defined as the following:


(9)


(10)

The term ni,jobserved is the pooled observed number of contacts between atom types i and j from the training set of proteins. For each protein, we can calculate the expected number of contacts by the mole fraction of atom types and the total number of contacts, where the mole fraction of type i in protein {alpha} and the number of contacts in protein {alpha} are given by {chi}i{alpha}, and nTOTAL,{alpha} , respectively. Thus, for a protein {alpha}, the individual reference state is given by


The final reference state ni,jref is the sum over the individual reference states (Skolnick et al. 2000).

µ-Potential
We also constructed a µ-potential within this typing scheme. The µ-potential for a single protein is derived via the following definition (Kussell et al. 2002).


(11)

In the case of deriving a potential for use on multiple proteins, we can define the term n{alpha}i,j(ñ{alpha}i,j), which is the number of pairs of type i and j observed to be in contact (not in contact) in protein{alpha}. We pool these terms for an aggregate count of "contacts" and "non-contacts." The final form of the µ-potential then, is the following.


(12)

Gapless threading test
The ultimate test for any potential energy function is its ability to fold a protein ab initio.Aless stringent test, though nevertheless useful, is the ability of a potential to discriminate the native fold from non-native folds. This can be accomplished by generating a set of all-atom decoys from all-atomthreading, and comparing the native energy to such all-atom threading decoys. We generated decoys by threading the sequence onto the non-native template. Side chains were then built into the structure. We used a Monte-Carlo procedure to anneal side-chain conformations for twenty thousand steps. As in the decoy generation procedure, sidechain rotamer angles were chosen from a flat distribution over 2{pi} radians. A total of 3000 decoys were generated for each protein in the test set. For comparison, residue-based threading was also performed on the same test set, with the same decoys. The side-chain rebuilding step was omitted, and residue potentials were used to score the native versus decoy folds. The two designed atomic potentials, the µ-potential, the quasi-chemical potential, and four residue potentials are compared to each other below.

Levitt decoy Z-score test
In addition to gapless threading decoys, we also tested all atomic potentials on a standard set of decoys generated from another study by Park and Levitt (1996). These are also compact decoys which satisfy certain physical filters and are deemed protein-like.


    Results
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 References
 
Decoy spectrum
The choice of the background decoy spectrum is crucial to the potential design procedure. The condition for energy gap minimization requires the native energy to be lower than some reference energy. The reference energy, as defined in the Z-score, is usually taken to be the mean energy of the decoys. In this study of atomic potentials, we pursue two types of decoy distributions. In the first, atom types are considered to be randomly mixing in decoy states. In the second, real decoys are sampled via all-atom gapless threading, and the correlators are recorded such that the energy distribution can be approximated by a Gaussian. The apparent difference between these two types of decoy distributions can be seen in Figure 2Go. It is clear that the random mixing spectrum is quite different from gapless threading decoys. The gross features, such as the mean and variance, of the two distributions are different—and what is consistent, given the results of previous studies (Skolnick et al. 1997), is that the all-atom decoy spectrum of shuffled residues continues to resemble that of the gapless threading decoys. Thus, insofar as a quantitative prediction of the all-atom decoy spectrum is required, the Random Mixing Approximation is inaccurate. This is one reason why the Gaussian Approximation should be included in the design procedure, to account for the simplest possible approximation derivable from gapless threading decoys. However, as to whether accounting for second order correlations improves the designed potential can only be answered by testing the potential’s efficacy directly.



View larger version (20K):
[in this window]
[in a new window]
 
Figure 2. Decoy energy spectrum under the various assumptions. The energy distribution of gapless threading decoys is shown by the dashed curve. The energy distribution of the random mixing decoys is shown by the solid curve. While the random mixing model approximates the real distribution, note both the difference in the average energy and dispersion of the two distributions. For comparison, decoys were also generated from random mixing of entire residues (clashes eliminated; diamond curve). Note that the decoy spectrum from shuffling residues is much more similar to the gapless threading decoys with only the mean shifted. A similar test was done for the same protein, using a residue-based gapless threading and random mixing of residues (inset). In this case, note that the random mixing and the gapless decoy spectra coincide very well.

 
Gapless threading
To assess the usefulness of the newly designed potentials, we compared gapless threading results for both all-atom and residue potentials (residue-based threading was applied to the test set of 91 proteins.) The results are shown in Table 1Go. In terms of ranking, all potentials performed very similarly to the Miyazawa-Jernigan (QC) potential. The potential designed under the Gaussian Approximation showed a significantly higher average Z-score, but its ability to distinguish the native conformation was more or less the same as the other potentials. For the same test set, we applied all-atom threading and fold recognition. The results are shown in Table 2Go. Again, in terms of recognition of the native structure, all atomic potentials performed quite well, with recognition of almost all proteins in the test set. While the number of recognized proteins in the test set is higher than the number recognized using residue-based models, we discuss below (see Discussion) why the two results are not directly comparable. Furthermore, theGaussian Approximation potential shows a significantly higher Z-score than the quasi-chemical potential (p=1.49 x 10–10). Compared to the µ-potential, the Gaussian Approximation potential also performs slightly better, although less significantly so (p=0.07). What is surprising, given that the background decoy spectra are so different, is the similar performance between the Gaussian Approximation and Random Mixing Approximation potentials. The Gaussian Approximation potential is slightly but insignificantly higher (p=0.29).


View this table:
[in this window]
[in a new window]
 
Table 1. Rank distribution in gapless threading test
 

View this table:
[in this window]
[in a new window]
 
Table 2. Rank distribution in gapless threading test
 
To test for significance, we assume that the Z-scores of the designed potential and the Z-scores of the quasi-chemical potential are drawn from two distributions. Our null hypothesis is that the two distributions are the same, and thus the means of the distributions are the same. This would imply the difference in means is distributed in a Gaussian manner. The appropriate z-statistic (ZSTAT) for testing significance is the following:


(13)


(14)


(15)

where <ZA>is the mean of the Z-scores under potential A, {sum}2A is the standard deviation of Z-scores of proteins under potential A, n=91 is the number of Z-scores from the test set. The ZSTAT parameter is distributed in a Gaussian manner also.

Decoy Z-scores
To place the results of the all-atom gapless threading into context, we also examined the ability of the atomic potentials to discriminate the native state from a set of published decoys for seven small proteins. The results are shown in Table 3Go. The best performing residue-based potential (MS) was also tested on this set of decoys. The small set of proteins show that all atomic potentials performed comparably. Using Z-score and ranking as measures, the Gaussian Approximation potential, like the quasi-chemical potential and µ-potentials, is able to rank all native conformations to the first twelve positions, except for one case (pdb code 4PTI [PDB] ). This protein is held together by three disulfide bonds. For this reason, the potentials designed under the Gaussian Approximation and Random Mixing Approximation probably cannot discriminate well the native structure from decoys due to the lack of disulfides in the design set. The Random Mixing Approximation potential performs similarly. As in the gapless threading tests, it is also the case that the atomic potentials perform better than the residue potentials on all counts. We will discuss why these tests are also not directly comparable.


View this table:
[in this window]
[in a new window]
 
Table 3. Z-scores for native structures when compared to decoys from Park and Levitt (1996)
 
Comparison of total energies and individual potential terms
On a gross level, it can be seen that the potentials are similar in the way they assign energies to sets of decoys for one protein. For the protein 1A3C [PDB] , gapless threading decoys were generated. For each decoy, the total energy was calculated, using each of the four all-atom potentials. This gives four sets of energies for the same decoys. For each pair of sets, the correlation coefficient can be calculated to examine how highly correlated the total energies are for the same states. This is shown in Table 4Go. In contrast, the correlations of individual terms between the potentials can be very poor, as seen in Table 5Go. These two observations can be reconciled by the fact that the total energetics of any one protein is dominated by only a few types of contacts (data not shown), so while the potentials may not correlate well, the few terms that are dominant may allow the total energies to correlate well for different protein conformational states.


View this table:
[in this window]
[in a new window]
 
Table 4. Correlation of total energy for gapless threading decoys
 

View this table:
[in this window]
[in a new window]
 
Table 5. Correlation of energy parameters for all 267 contact types
 
Dominant contributions in one protein
It is of interest to compare these dominant terms in more detail, in terms of the identities of the contacts that make the dominant contributions to the energetics of one protein. For any protein, the energy gap may be defined as EGAP{equiv}ENATIVE–EAVERAGE where EAVERAGE is the average energy of decoys. We define the stabilizing contributions as those terms which are negatively contributing to the energy gap of a native protein state when compared against the average decoy state.


(16)

The total stabilizing portion of the energy gap is the sum of all stabilizing terms. We measure the fractional contribution of each stabilizing term defined as follows:


(17)

where {varepsilon}i is a stabilizing term, and {sum}' j indicates the restricted sum over all stabilizing terms. The top 10 stabilizing terms for the protein 1A3C [PDB] are shown under each of the four all-atom potentials in Table 6Go. Remarkably, of the top 10, five contact types are shared by all four potentials. This is despite the fact that the designed potentials were extracted from an optimization procedure, and that the µ and quasi-chemical potentials were extracted with simple mathematical relations. While these five contact types show up as the dominant contributors to the energy gap, the distribution of the stability between these five terms is different from potential to potential. The fractional contribution to stability for each contact type was then averaged over all 91 proteins from the testing set. The top averaged fractional contributors to stability are given in Table 7Go. Qualitatively, the results differ little from the single case of 1A3C. In the table of averaged values, six contact types of the top 10 are shared by all four potentials. These dominant stabilizing contacts are composed mainly of hydrophobic and salt-bridge interactions. Specifically, hydrophobic interactions come predominantly from both aliphatic and cyclic hydrophobic atoms (e.g., contacts between atom types 2,2, and contacts between atom types 2,6). This is in slight contrast to the specific case of 1A3C: Here the dominant hydrophobic stabilizing interactions come from only aliphatic side-chain carbon atoms, and may be reflective of the fact that there are only four cyclic hydrophobic residues in this protein of 166 amino acids.


View this table:
[in this window]
[in a new window]
 
Table 6. Top 10 contributors to stability
 

View this table:
[in this window]
[in a new window]
 
Table 7. Top 10 contributors to average stability
 

    Discussion
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 References
 
We used a Monte-Carlo procedure to search potential space to design a new potential. The choice of assumptions regarding the background distribution, and the resulting types of potentials and their differences and similarities to the knowledge-based potentials gives us valuable insight into how we can systematically explore potential space under different coarse-grainings.

The assumption of the two background distributions, a Random Mixing model and a Gaussian model of contact–contact correlations, shows us that though the decoy spectra may be quantitatively different (Fig. 2Go), the potentials designed under the two assumptions may be very similar. We surmise this to be because the gross features of the spectra change in similar ways in different regions of the potential space. While not exactly the same, the mean and variance are most certainly within an order of magnitude of each other for the two spectra. The typing scheme used in this study was simple, but the results highlight two important, and somewhat contradictory, points. The first point is that typing cannot be done naively, as correlations between atoms cause interactions to emerge between nonobvious pairs. A case in point is the strong contribution to the total stability of interactions between atoms of type 16 and type 22. The {gamma}-, {delta}-, and {varepsilon}-carbons of lysine belong to type 16. The oxygens of glutamate belong to type 22. Therefore, given the restrictions of this model and parameters, it is possible that carbon atoms and nitrogen atoms exhibit an effective attraction. This by itself should not be completely alarming, since we can rationalize that the carbons of lysine are always "dragged" along with the respective nitrogen. The strong spatial correlation provides for an explanation of why the lysine carbon and the glutamatic oxygen exhibit an effective attraction. Given this fact, it is clear that typing should be done with some care—for example, assigning the same type to all C{beta} atoms can yield a worse potential. The second point is that, because of correlations, there is some flexibility in assigning energies to interactions involving atoms that neighbor each other on a side chain. Explicitly, for example, when examining the dominant contributors to stability, contacts between types 2 and 2 may contribute strongly to stability in one potential (e.g., in the designed potentials), but contacts between types 2 and 1 may contribute strongly to stability in another potential (e.g., in the quasi-chemical and µ-potential). This is shown in Table 7Go. But this is quite understandable given that contacts between types 2 and 1 and contacts between types 1 and 1 are both aliphatic hydrophobic interactions. Furthermore, types 2 and 1 tend to be strongly correlated in real space since they are connected in aliphatic chains. Contacts between types 1 and 1 will tend to "drag" atoms of type 2 nearby and produce contacts between types 2 and 1. In this view, there is some flexibility in typing because the potential can trade off energies between different, correlated contact types without adversely affecting stability. These two contact types are equivalent insofar as one is working within the confines of this particular model.

The use of these all-atom potentials in such an application as threading reveals some caveats though. While it would seem, from gapless threading and decoy Z-score tests, that all-atom potentials perform remarkably better than residue potentials in fold recognition, the results are actually not comparable for one important reason. Threading of sequences for which structures are not known usually relies upon templates that imperfectly represent the native state. The prediction from such a template would most necessarily be distorted. In a residue-based model, this is fine since residue representations do not have true volume, and can accommodate such distortions. In the case of all-atom models, the situation is more complex. Because all-atom models account for volume effects more accurately than the residue models, distorted templatesmay have amuch greater impact in all-atom threading. An imperfect template may mean a sequence of similar fold would be "ill-fitted" in an all-atom model, while it would "fit" quite well in a residue model.

Examination of the greatest contributors to stability reveals another interesting fact. The design procedure, which is based largely on optimization with the one physical condition that the native state must exhibit an energy gap to all other states, manages to produce two potentials, both of which match in six of the top 10 dominant contributors to stability in the two knowledge-based potentials. This detail, coupled with the fact that there is high correlation of energies between all potentials, and the fact that the designed potentials perform remarkably well in gapless threading and Z-score, means that the design procedure is extracting something physically meaningful. We suggest that other simplified models, for which there exists no proper theoretical treatment of the potential of mean force, can be subjected to the same design procedure in order that physically meaningful, and perhaps as importantly, useful energetic terms can be extracted. They are useful insofar as they are able to give reasonable estimates to free energies of protein conformations. And similarly, more complicated models with extended type schemes, or with more complex degrees of freedom, can be subjected to the same design procedure; the extracted parameters resembling real physics may similarly emerge from such a procedure.


    Acknowledgments
 
We wish to thank Huseyin Kaya for help in implementing the µ-potential. This work was supported by the National Institutes of Health.


    References
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 References
 
Anfinsen, C.B., Redfield, R.R., and Choate, W.L. 1954. Studies on the gross structure, cross-linkages, and terminal sequences in ribonuclease. J. Biol. Chem. 207: 201–210.[Free Full Text]

Bethe, H.A. 1935. Statistical theory of superlattices. Proc. Roy. Soc. Lond. A 150: 552–575.

Buchete, N.V., Straub, J.E., and Thirumalai, D. 2004. Development of novel statistical potentials for protein fold recognition. Curr. Opin. Struct. Biol. 14: 225–232.[CrossRef][Medline]

Crippen, G.M. 2001. A Gaussian statistical mechanical model for the equilibrium thermodynamics of barnase folding. J. Mol. Biol. 306: 565–573.[Medline]

Dominy, B.D. and Shakhnovich, E.I. 2004. Native atom types for knowledge-based potentals: Application to binding energy prediction. J. Med. Chem. 47: 4538–4558.[Medline]

Duan, Y. and Kollman, P.A. 1998. Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 282: 740–744.[Abstract/Free Full Text]

Krigbaum, W.R., and Lin, S.F. 1982. Monte-Carlo simulation of protein folding using a lattice model. Macromolecules 15: 1135–1145.[CrossRef]

Kussell, E., Shimada, J., and Shakhnovich, E.I. 2002. A structure-based method for derivation of all-atom potentials for protein folding. Proc. Natl. Acad. Sci. 99: 5343–5348.[Abstract/Free Full Text]

Lu, H. and Skolnick, J. 2001. A distance-dependent atomic knowledgebased potential for improved protein structure selection.Proteins 44: 223–232.[CrossRef][Medline]

Mirny, L.A. and Shakhnovich, E.I. 1996. How to derive a protein folding potential? A new approach to an old problem. J. Mol. Biol. 264: 1164–1179.[CrossRef][Medline]

Miyazawa, S. and Jernigan, R.L. 1985. Estimation of effective interresidue contact energies from protein crystal-structures—quasi-chemical approximation. Macromolecules 18: 534–552.[CrossRef]

Mohanty, D., Dominy, B.D., Kolinski, A., Brooks, C.L., and Skolnick, J. 1999. Correlation between knowledge-based and detailed atomic potentials: application to the unfolding of the GCN4 leucine zipper. Proteins 35: 447–452.[Medline]

Park, B. and Levitt, M. 1996. Energy functions that discriminate X-ray and near-native folds from well-constructed decoys. J. Mol. Biol. 258: 367–392.[CrossRef][Medline]

Shakhnovich, E.I. and Gutin, A.M. 1989. Formation of unique structure in polypeptide-chains theoretical investigation with the aid of a replica approach. Biophys. Chem. 34: 187–199.[CrossRef][Medline]

Shimada, J. and Shakhnovich, E.I. 2002. The ensemble folding kinetics of protein G from an all-atom Monte Carlo simulation. Proc. Natl. Acad. Sci. 99: 11175–11180.[Abstract/Free Full Text]

Skolnick, J., Jaroszewski, L., Kolinski, A., and Godzik, A. 1997. Derivation and testing of pair potentials for protein folding. When is the quasi-chemical approximation correct? Protein Sci. 6: 676–688.[Abstract]

Skolnick, J., Kolinski, A., and Ortiz, A. 2000. Derivation of proteinspecific pair potentials based on weak sequence fragment similarity. Proteins 38: 3–16.[CrossRef][Medline]

Tobi, D. and Elber, R. 2000. Distance-dependent, pair potential for protein folding: Results from linear optimization. Proteins 41: 40–46.[CrossRef][Medline]

Vendruscolo, M., Najmanovich, R., and Domany, E. 2000. Can a pairwise contact potential stabilize native protein folds against decoys obtained by threading? Proteins 38: 134–148.[CrossRef][Medline]

Zhou, H.Y. and Zhou, Y.Q. 2002. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 11: 2714–2726.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us   Add to Digg Digg   Add to Reddit Reddit   Add to Technorati Technorati    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
J. E. Donald, W. W. Chen, and E. I. Shakhnovich
Energetics of protein-DNA interactions
Nucleic Acids Res., February 28, 2007; 35(4): 1039 - 1047.
[Abstract] [Full Text] [PDF]


Home page
Protein Sci.Home page
M.-y. Shen and A. Sali
Statistical potential for assessment and prediction of protein structures.
Protein Sci., November 1, 2006; 15(11): 2507 - 2524.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Right arrow Citation Map
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Chen, W. W.
Right arrow Articles by Shakhnovich, E. I.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chen, W. W.
Right arrow Articles by Shakhnovich, E. I.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us   Add to Digg   Add to Reddit   Add to Technorati  
What's this?


HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS