|
|
||||||||
Department of Biopharmaceutical Sciences, Department of Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California 94158, USA
(RECEIVED June 28, 2006; FINAL REVISION August 18, 2006; ACCEPTED August 18, 2006)
| Abstract |
|---|
|
|
|---|
Keywords: statistical potential; protein structure prediction; comparative or homology modeling; model assessment
| Introduction |
|---|
|
|
|---|
The pioneering work of Tanaka and Scheraga (1976) related the frequencies of contact between different residue types, obtained from known native structures, to the free energies of corresponding interactions using the simple relationship between free energy and the equilibrium constant. Their work was followed by that of Miyazawa and Jernigan (1985, 1996, 1999), who developed residue contact statistical potentials using a quasichemical approximation. A new form of a statistical potential dependent on a distance between two residue types was then proposed independently by Sippl (1990, 1993a, b), based on the assumption that the distributions of distances between different residue types in diverse native structures in PDB are Boltzmann-like.
Subsequently, a large number of different statistical potentials were described and tested (Hendlich et al. 1990; Colovos and Yeates 1993; Sippl 1993a; Kocher et al. 1994; Huang et al. 1995; Rooman and Wodak 1995; Jernigan and Bahar 1996; Jones and Thornton 1996; Miyazawa and Jernigan 1996; Moult 1997; Park and Levitt 1996; Park et al. 1997; Reva et al. 1997; Simons et al. 1997; Vajda et al. 1997; Furuichi and Koehl 1998; Melo and Feytmans 1998;Rooman and Gilis 1998; Samudrala and Moult 1998; Betancourt and Thirumalai 1999; Jones 1999b; Rojnuckarin and Subramaniam 1999; Simons et al. 1999; Bastolla et al. 2000; Chiu and Goldstein 2000; Gatchell et al. 2000; Lazaridis and Karplus 2000; Vendruscolo et al. 2000; Lu and Skolnick 2001; Melo et al. 2002; Keasar and Levitt 2003; Zhou and Zhou 2003; Betancourt and Skolnick 2004; Buchete et al. 2004a,b; Wang et al. 2004; Zhang et al. 2004; Chen and Shakhnovich 2005; Fang and Shortle 2005; Qiu and Elber 2005; Summa et al. 2005; Dehouck et al. 2006; Eramian et al. 2006). Statistical potentials can be classified by the following characteristics: (1) protein representation (e.g., centroids of amino acid residues, C
/C
atoms, and all atoms), (2) the restrained spatial feature (e.g., solvent accessibility, contact, distance, torsional angle), and (3) the reference state. Statistical potentials for the all-atom representation are generally more accurate than those for an amino acid residue representation (Samudrala and Moult 1998; Lu and Skolnick 2001; Melo et al. 2002; Zhou and Zhou 2002). The most commonly used statistical potentials depend on atomic distances only.
Statistical potentials are widely used in numerous applications because of their relative simplicity, accuracy, and computational efficiency. These applications include assessment of experimentally determined and computationally predicted protein structures (Sippl 1993b; DeBolt and Skolnick 1996; Gatchell et al. 2000; Melo et al. 2002; John and Sali 2003; Wang et al. 2004; Topf and Sali 2005; Topf et al. 2006), ab initio protein structure prediction (Bowie et al. 1991; Sun 1993; O'Donoghue and Nilges 1997; Chiu and Goldstein 2000; Tobi and Elber 2000; Tobi et al. 2000), fold recognition or threading (Maiorov and Crippen 1992; Sippl and Weitckus 1992; Bryant and Lawrence 1993; Ouzounis et al. 1993; Huang et al. 1995; DeBolt and Skolnick 1996; Jones and Thornton 1996; Reva et al. 1997; Jones 1999a; Kolinski et al. 1999; Miyazawa and Jernigan 1999, 2000; Panchenko et al. 2000; Skolnick et al. 2000), detection of native-like protein conformations (Hendlich et al. 1990; Casari and Sippl 1992; Bauer and Beyer 1994; Samudrala and Moult 1998; Simons et al. 1999; Gatchell et al. 2000; Vendruscolo et al. 2000), and prediction of protein stability (Gilis and Rooman 1996, 1997).
Perhaps the most essential question in the derivation of a statistical potential is how best to formulate and interpret a scoring function derived from a sample of native structures. In general, the derivation of a statistical potential has been motivated by a presumed analogy between a sample of native structures and the canonical ensemble in statistical mechanics. The principal of the corresponding assumptions is that the distributions of different structural features obtained from a sample of native structures obey the Boltzmann distribution of statistical mechanics (Sippl 1990). However, such a sample contains native states of different sequences at different temperatures, not states of the same sequence over a longer period of time at a specific temperature (Thomas and Dill 1996b), as required by the definition of the canonical ensemble to which the Boltzmann distribution applies. Therefore, alternative interpretations of the origin of the Boltzmann-like distribution for structural features in a sample of native structures have also been suggested (Finkelstein et al. 1995). In this other view, the Boltzmann-like distribution is a consequence of evolution that favors structural features for which more sequences have the global free energy minimum.
As a result of the uncertainties in the very formulation of a statistical potential, there are several related problems, including the question of the most appropriate reference state (Skolnick et al. 1997), the additivity of the individual terms in a statistical potential (BenNaim 1997), as well as balancing of a statistical potential with other terms that may be used in a complete scoring function for protein structure prediction (Misura et al. 2006).
Here, we first identify a statistical potential with the negative logarithm of the joint probability density function of a given protein. We then derive an atomic distance-dependent statistical potential from a sample of native structures based entirely on the probability theory, without recourse to statistical mechanics, thus circumventing the assumption of the Boltzmann distribution. Subsequently, we clarify the assumptions and approximations needed to interpret a statistical potential as a potential of mean force. This approach allowed us to treat the problem of the reference state more accurately than has been done previously. In our theory, the reference state is a finite sphere of uniform density and appropriate size, instead of the distribution of interatomic distances in the sample native structures irrespective of their sizes and atom types. In other words, in contrast to the previous approaches, our reference state explicitly depends on the sizes of the native structures from which the statistical potential is derived. This improvement results in an increased accuracy of protein structure assessment, as demonstrated by testing various statistical potentials, including ours, on multiple decoy sets. We term our new statistical potential Discrete Optimized Protein Energy (DOPE).
We begin by deriving DOPE from a sample of the native structures (see Theory). Next, we describe its accuracy compared to five other scoring functions with the aid of six multiple target decoy sets (see Results). We proceed by discussing its relative successes, failures, and applications (see Discussion). A detailed description of the training and decoy sets, the evaluated scoring functions, and the evaluation criteria are provided in Materials and Methods.
| Theory |
|---|
|
|
|---|
Joint pdf for a native structure as a product of pair pdfs
Prediction of the native structure of a protein would be enabled by expressing our knowledge of any kind as a scoring function whose global optimum corresponds to the native structure. One such function is a joint probability density function of the Cartesian coordinates of the protein atoms, given available information I about the system,
, where N is the number of atoms in the protein and
are the Cartesian coordinates of atom i. For each atom in a given protein, the joint pdf p gives the probability density that the atom i of the native structure is positioned very close to
, given the information I we wish to consider in the calculation. In general, information I may include the sequence of the protein, a molecular mechanics force field, experimental structural information, a sample of known native structures, and an alignment of the sequence to a related known protein structure. For example, when information I reflects only the sequence and the laws of physics under the conditions of the canonical ensemble, the joint pdf corresponds to the Boltzmann distribution. If I also includes a crystallographic data set sufficient to define the native structure precisely, the joint pdf is a Dirac delta function centered on the native atomic coordinates. For simplicity, we omit I from notation in the rest of the paper.
We now wish to estimate the joint pdf for a given protein from a sample of the native structures for different proteins, deposited in the PDB. To minimize the needed size of the sample and to derive a joint pdf for any protein sequence, we seek to approximate the joint pdf p in terms of pdfs for all pairs of atoms in the system,
(i.e., pair pdfs); a pair pdf will depend only on the type and position of the two atoms, not on the whole sequence.
As suggested above, the joint pdf p can be approximated by a normalized product of the pair pdfs for all protein atom pairs:
![]() |
The denominator is derived from the condition that the joint pdf must be a product of single body pdfs when all the pair pdfs are uncorrelated with each other. The terms in the denominator,
, are single-body distribution functions that depend only on the composition of the protein and the total volume of the system. In other words,
is the number density of atom i, equal to the reciprocal volume of the system. Because
is constant for a given protein, it does not impact on the rank order of different conformations and is ignored here.
In the context of the statistical mechanical liquid state theory, Equation 1 is also known as the Kirkwood superposition approximation (Kirkwood 1935). The superposition approximation would be exact only if all the pair pdfs were mutually independent from each other
. The pair pdfs
of atom pairs are generally interdependent because each atom in the system interacts with more than one other atom; for example, a major source of interdependence of pair pdfs for nonbonded atoms within the same amino acid residue are the chemical bonds. For simple dense liquids, the ratio between the exact three-body distribution and its two-body approximation ranges from 0.8 to 1.2 (Alder 1964; Rahman 1964). In general, the Kirkwood approximation of the joint pdf
by pair pdfs
is clearly more accurate than a product of N single-body pdfs,
.
It is tempting to equate interdependence with redundancy and thus minimize the problem by including only a subset of atoms (e.g., only C
atoms) in the joint pdf. However, it was found empirically that such simplifications reduce the accuracy of the resulting statistical potentials (Samudrala and Moult 1998; Lu and Skolnick 2001; Melo et al. 2002; Zhou and Zhou 2002). A probable reason is that the all atom pair pdfs jointly encode the preferred relative orientations between whole residues. Correspondingly, it appears to be possible to reduce the number of atoms considered in the joint pdf without sacrificing its accuracy by using orientation-dependent terms (Buchete et al. 2004b), albeit this reduction comes at a cost of introducing additional degrees of freedom into each term.
A general reduction of the joint pdf to lower-order pdfs is provided by the BogolyubovBornGreenKirkwoodYvon hierarchy of a chain of integral equations connecting the N-body pdf with simpler pdfs (McQuarrie 1975). Therefore, in principle, the formalism developed here for deriving the joint pdf as a sum of pairwise terms can also be applied to derive the joint pdf approximated by a sum of higher-order terms.
The joint pdf may in principle be improved beyond the superposition approximation by iteratively modifying the individual terms so as to maximize the discrimination between the native and non-native structures (Thomas and Dill 1996a). Although this approach has been applied successfully to a contact statistical potential, it is less likely to work well for the larger parameter space of an atomic distance-dependent statistical potential, such as the one developed here.
Calculation of the pair pdf from the distance pdf estimated from a single sample native structure
We now estimate the pair pdf
for all atom pairs (i,j), using a single sample native structure. A structure is defined by internal coordinates that are invariant with respect to translation and rotation. Thus, the interparticle distance r between
and
is the most relevant internal coordinate for a pair of atoms. Consequently, the distribution that can be estimated directly from a sample native structure is the distance pdf for a pair of atom types:
|
|
where m and n denote the atom types and Nmn(r) is the number of atom type pairs (m,n) at a distance within [r, r +
r]. The distance pdf is proportional to the number of (m,n) pairs in a spherical shell of volume 4
r2
r; thus, the density of the (m,n) pairs in the shell is pmn(r)/4
r2
r. For a finite and nonspherical native structure, only a fraction
(r) of the spherical shell between r and r +
r centered on
is occupied by protein atoms (0
(r)
1) (Fig. 1A). Thus, the density of the (m,n) pairs at the distance r is pmn(r)/[4
r2
(r)].
|
|
|
where n(r) is the normalization function equal to 4
r2
(r), and m and n are the types of atoms i and j, respectively. As mentioned above, the single-body pdf
is the number density of atom i and is ignored because it does not impact on the ranking of different conformations of the same protein.
Normalization function n(r; a) for a single sample native structure
The calculation of the normalization function n(r) is not straightforward because the native structures are finite and varying in size (Fig. 1A). Therefore, we explicitly denote n(r) as dependent on the size a of the sample native structure, n(r;a). We define the size a to be the radius of the sphere of uniform density that has the same radius of gyration Rg as the sample native structure; thus,
. Similarly, for clarity, we also denote the distance pdf pm,n(r) as pm,n(r;a).
The key simplification in the calculation of the normalization function pm,n(r;a) is as follows: We construct a special state (i.e., the reference state) for which the calculation of pm,n(r;a) is analytically tractable and then assume that this pm,n(r;a) is applicable to any protein of size a. We define the reference state for a sample native structure with the radius of gyration Rg to be a sphere with the same radius of gyration and density, but with uncorrelated uniformly distributed atomic positions (Fig. 1B); this reference state is independent of the composition. The assumption of uncorrelated uniform atomic density is grounded in the maximum entropy principle corresponding to no prior knowledge about the native structures. It is difficult to justify this particular reference state, especially its finite spherical attribute, other than by intuition and the performance of the corresponding statistical potential in protein structure prediction (Results).
According to Equation 3, in a reference state with uncorrelated positions of atoms i and j
, the normalization function n(r;a) is equal to the distance pdf
. Although
and n(r;a) for a sphere have already been calculated (Deltheli 1919; Hammersley 1950; Lord 1954; de Smith 1977; Tu and Fischbach 2002; Garcia-Pelayo 2005), we re-derive them here for completeness.
We start by placing a stationary point (sp) on the center of the sphere encompassing the reference state (h = 0) while allowing a mobile point (mp) to move freely on the surface of a smaller "probe" sphere at a distance r from sp (small sphere in Fig. 1C). Because the mobile point cannot be outside of the reference sphere, the partial normalization function m(r;a) for an atom at the center of the reference sphere is
|
|
Next, we offset the stationary point from the center of the reference sphere by distance h
a. For distance r smaller than a h, the mobile point must reside inside the sphere; thus, m(r;a) remains 4
r2. For a h
r
a + h, the mobile point touches the reference sphere surface; thus, m(r;a) is proportional to the intersecting surface area of the probe sphere that remains within the reference sphere (Wodak and Janin 1980). Therefore, m(r;a) of the offset point is
![]() |
The normalization function is thus determined by integrating m(r;a) over all possible offsets h:
|
|
which yields the normalization function:
![]() |
where rc is some upper bound on the range of the statistical potential. Equation 7 is validated numerically by the histogram of one million pairs of randomly generated distances inside a 22 Å sphere (the average size of a
150 residue protein domain) (Fig. 2). The most probable distance lies at
, which is
5% greater than the sphere radius. When r is infinitesimally small, n(r;a) approaches r2; as the distance r increases, n(r;a) can be expressed as r
, where
depends on both the distance r and sphere radius a. The effective exponent
can be derived by taking the logarithm and then the first derivative of both sides of the definition m(r;a) = r
; thus,
|
|
and finally combining this result with Equation 7 (Fig. 3):
|
|
|
|
are
2 as r
0 and
0 as r
Using a sample of many native structures of varying size
We now need to derive the joint pdf from a sample of many native structures of varying size, which is nontrivial because both the distance pdf pm,n(r;a) and the normalization function n(r;a) depend on size a of a sample structure. We note in passing that if we had a sufficiently large sample of native structures, we could subdivide it into subsets of proteins of equal size and then derive the joint pdf separately for each subset without the complication of combining the statistics from structures of varying size.
The dependence of both pm,n(r;a) and n(r;a) on protein size a might suggest that their ratio, pair pdf
, also depends on a. However, we suggest that the pair pdf may be approximately independent of the protein size. While we are not able to support this approximation based on statistics, we can ground it in physics if we assume that the pair pdf reflects only the potential of mean force between the corresponding atom types (Equation 12). The potential of mean force between two atoms in a protein, in turn, should depend only on their chemical properties and their distance, not on the protein size. Thus, the pair pdfs derived from subsets of protein structures of different sizes will not vary, except for statistical fluctuations due to small sample sizes, and can thus be averaged to obtain a more accurate estimate of the pair pdf.
Calculating DOPE
Based on the arguments and equations above, the complete calculation of DOPE is as follows. First, for each sample native structure, the distance pdf pm,n(r;a) is estimated by Equation 2; the details about the sample of the native structures, the sampling of the distances, atom types, and implementation in MODELLER-8 are described in Materials and Methods. The normalization function n(r;a) is calculated from Equation 7 using
.
Second, for each atom type pair (m,n) in the sample native structure, the pair pdf
is calculated using Equation 2 and 3.
Third, the weight ws of the sample native structure is calculated as the ratio between the number of all atom pairs in this structure and the number of atom pairs in all sample structures, irrespective of their types.
Fourth, the pair pdf
for the sample of all native structures is calculated as a weighted sum of the pair pdfs corresponding to the individual sample structures:
|
|
where index s runs over all sample native structures. This averaging procedure is based on the presumed independence of the pair pdf from the protein size, as rationalized above.
Requirements needed for a physical interpretation of the joint pdf as the free energy
There are two approximations needed to relate the joint pdf
(Equations 13) derived from a sample of native structures and the free energy of a protein in solvent. However, we do not argue that these approximations are actually accurate or physically correct. The first approximation is that the pair pdfs
estimated from a sample of known native structures obey the Boltzmann statistics of a canonical ensemble at some temperature T (Sippl 1990). While this approximation is partly validated by the observed Boltzmann statistics of structural features (e.g., atomic distances) in a set of known native structures (Finkelstein et al. 1995), serious concerns about its correct interpretation remain (Finkelstein et al. 1995; Thomas and Dill 1996b). The second approximation is the superposition approximation (above). Importantly, we note that the derivation of our statistical potential is entirely independent of the hypothetical relationships outlined in this section.
The joint pdf p defines the N-body correlation function g (Hill 1956):
|
|
The total free energy G of the system can then be expressed in terms of the correlation function g:
|
|
where kB is the Boltzmann constant. Therefore, an approximate free energy of a system is (Equations 1, 3, 9, 10):
![]() |
where
is the radial distribution function equal to pm,n(r)/n(r), and
i,j(r) is the potential of mean force for a pair of atoms
.
Because
and thus
i,j(r) = 0 for the reference state as defined above, the potential of mean force derived from the observed distance pdf pm,n(r) is
|
|
where
and
are the numbers of atom type pairs (m,n) at a distance r within [r, r +
r] for the "interacting" real system and the "noninteracting" reference state, respectively. This interpretation of our reference state is identical to that of the "discharged" state used for free energy calculations in statistical mechanics (Onsager 1933; Hill 1956). Equation 13 establishes the relation between the statistical potential derived from a sample of known structure and the potential of mean force.
| Results |
|---|
|
|
|---|
Leu C
and Asp C
Leu C
pairs (i.e., main chainside chain and polar side chainhydrophobic side chain; Fig. 4B,D) are very similar for all three statistical potentials. The structured distance dependence of the Cys NTrp O main chainmain chain pair is present in DOPE as it is in DFIRE and RAPDF (Fig. 4A). The minor difference between DOPE and DFIRE lies at
2.75 Å, where a small peak in DFIRE is absent in DOPE. The difference between DOPE and DFIRE is more pronounced for the Ile C
Leu C
hydrophobic side chainhydrophobic side chain pair, where DOPE lies between DFIRE and RAPDF.
|
|
For all six decoy sets, DOPE correctly identifies 47 native states for 52 targets (90%), while DFIRE, Rosetta, ModPipe-Pair, Modpipe-Surf, and Modpipe-Comb succeed in 46, 33, 38, 25, and 37 cases (88%, 63%, 73%, 48%, and 71%), respectively.
Correlation between the score and model error
The percentage of detected native states provides a useful but incomplete indication of the accuracy of a scoring function. In particular, it does not describe the correlation between the score of a model and its structural similarity to the native state, suggested by the funnel-shaped free energy landscape of protein folding. To quantify these scoreerror correlations for the six scoring functions, we calculated the individual and average scoreerror correlation coefficients between the scores and the C
RMS errors for the 4state_reduced decoy set (Table 2). ModPipe-Pair yields the highest average scoreerror correlation coefficient (0.69) among the six tested scoring functions, with DOPE and Modpipe-Comb following closely second (0.67). Moreover, ModPipe-Pair and DOPE both have the highest scoreerror correlation in three out of seven individual targets in the set.
|
0.9, indicating a better performance of DOPE compared to DFIRE (6) and Rosetta (4).
|
|
Selection of the most accurate non-native model
Identification of the non-native structure closest to the native structure among a set of decoys is generally significantly more difficult than identification of the native structure. Even the actual free energy function would generally not succeed if the best model were far enough from the native state. Yet, it is this more difficult task that needs to be performed in realistic model assessments where the native structure is not available. Therefore, to further test the practical utility of DOPE, we assess each of the 300 models of the 20 targets in the moulder decoy set by DOPE and the five other scoring functions. For the moulder decoy set, DOPE is the most accurate score according to three assessment measures (Table 3BD) as follows.
Using
RMSD as the accuracy criterion, DOPE is the best of all six tested scores with average and median
RMSDs of 0.58 Å and 0.30 Å, respectively. DOPE outperforms DFIRE, Rosetta, and ModPipe-Comb, whose average (median)
RMSDs are 0.69 Å (0.44 Å), 0.87 Å (0.43 Å), and 1.23 Å (0.76 Å), respectively. In four cases, DOPE selects a model with the lowest
RMSD, a better performance than DFIRE and Rosetta (three cases each). The two structures (1cau and 1cew) for which DOPE failed to select a model with
RMSD < 2.0 Å are both difficult cases in the sense that they failed at least half of the tested scoring functions.
DOPE also performs better than other scores according to the n% enrichment measures (Table 3C,D). The DOPE average 20% enrichment for the 20 targets is 3.92, compared to 3.85, 3.70, and 3.78 for DFIRE, ModPipe-Comb, and Rosetta, respectively. Ten-percent enrichment results in a similar ranking of the scoring functions. The difference in enrichment ratios between DOPE and DFIRE is more pronounced at the 10% threshold (6.25 compared to 6.00) than at the 20% threshold (3.92 compared to 3.85). This difference is primarily due to the relatively higher accuracy of DOPE in the near-native region.
The relative accuracy of a score can also be illustrated by the number of the 10% most accurate models identified by the score; there are 600 such top 10% models in the moulder decoy set (10% x 20 targets x 300 models per target). DOPE identifies 375 of these models, which is 15, 26, and 28 more than identified by DFIRE, Rosetta, and ModPipe-Comb, respectively.
| Discussion |
|---|
|
|
|---|
RMS error, and the identification of the most accurate non-native model. The improvement results primarily from a more rigorous treatment of the reference state in the derivation of DOPE. Next, we first compare the theory of DOPE with other existing methods for deriving statistical potentials; second, we discuss the importance of the size of the spherical reference state of DOPE; and finally, we describe four regimes where DOPE tends to be less accurate: incomplete models, small structures, low-accuracy models, and NMR structures. We conclude by listing several current applications of DOPE.
Comparison of statistical potential reference states
All statistical potentials depend on the same protein structure database (i.e., PDB). Therefore, the differences between the distance pdfs pm,n(r) of various statistical potentials depend only on the specific choice of the sample structures and are not significant conceptually. In contrast, significantly different definitions of the distance pdf for the reference state
(Equation 3), which is equal to the normalization function n(r) (Equation 3) or equivalently
(Equations 2 and 3), have been used in the derivation of different statistical potentials. For example, RAPDF uses a conditional pdf to construct a distance pdf for the reference state (Samudrala and Moult 1998), and AKBP relies on a mole fraction-dependent reference state function (Lu and Skolnick 2001). We now compare the DOPE n(r) function to that of DFIRE, which is the most similar statistical potential to DOPE.
A physical picture of noninteracting atoms in a finite spherical volume has inspired the DFIRE reference state (Zhou and Zhou 2002; Zhang et al. 2004), just as it did for DOPE. The DFIRE normalization function is n(r) = r
, relying on a constant effective exponent parameter
that is used for all sample native structures irrespective of their size. The optimal value of
was found empirically to be 1.57 (Zhou and Zhou 2002) and subsequently refined to 1.61 (Zhou and Zhou 2002, 2003). The DFIRE reference state captures an important feature of particle density in a finite spherical volume; namely, it does not grow with r2, but with r
, where
is smaller than 2.
DOPE goes a step further in the definition of the reference state and the corresponding normalization function. The DOPE reference state is a sphere whose size reflects that of the sample native structure, but with a uniform uncorrelated atom density; the reference state is independent of the composition of the protein. In contrast to DFIRE, the DOPE normalization function is derived analytically, without any adjustable parameters. It turns out that
is not a constant but, in fact, a function of both the distance between the interacting atoms r and the sample native structure size a (Equation 8). Thus, the DFIRE reference state can be considered to be an empirical approximation of the DOPE theory. For example, the average (over r from 0 to 15 Å) effective exponent
for a 22 Å reference sphere is 1.63, which is very close to the effective exponent
of DFIRE (1.61). The DOPE reference state results in a more accurate (see Results) and presumably more broadly applicable statistical potential; in principle, even though derived from sample native structures of different sizes, it is applicable to models of any size. Moreover, it affords a greater opportunity for generalization to other kinds of statistical potentials and other future developments.
The size of the reference state
The reference state and the corresponding normalization function n(r) depend on the reference sphere radius a (see Theory). To further illustrate the impact of the reference sphere on the accuracy of the DOPE potential, we derived a version of DOPE in which the pair pdf in Equation 3 was calculated with the normalization function for the single sphere size a. This limited version of DOPE is termed DOPE-a. DOPE-a is highly inaccurate when derived with an incorrect reference sphere radius a. For example, DOPE-16 cannot identify the native structure of 1bbh (Fig. 6A). In contrast, DOPE-24 does correctly identify the 1bbh native state (Fig. 6B). The scoring function derived from a reference state that is too small results in an erroneous preference for loosely packed structures because the short-range interactions are deemed to be more repulsive by the corresponding DOPE-a than by DOPE.
|
Accuracy of DOPE as the function of completeness of assessed model
In general, the failures in picking the native structure may be caused by a number different factors, which will be discussed in the following four sections. The first of these factors is the incompleteness of the assessed model. For example, the native interface between two domains in a single protein may be needed for its stability. In such a case, it is conceivable that neither a physics-based energy function nor a statistical potential will score the native state of an isolated domain correctly. There is not much that can be done about this problem, other than striving to assess biologically stable units of structure. A possible example of such a failure is 1b0n-B, whose crystal structure indicates that the stable unit is a dimer, not a monomer.
Accuracy of DOPE as the function of assessed model size
Yet another difficulty in model assessment is a small size of an assessed model; for example, 1b0n-B, 1bba, and 1fc2 have only 31, 36, and 43 residues, respectively. DOPE, like other statistical potentials, is less accurate for smaller proteins (Table 1). There are two potential sources of errors in assessing small proteins. First, small proteins are assessed by a smaller number of pairwise terms, thus resulting in a larger statistical error of the final score. The number of individual pairwise terms in the scoring function is proportional to the square of the protein sequence length; thus, the relative statistical fluctuation of the final score (Equation 12) is inversely proportional to the sequence length.
Second, the statistical mechanics of large proteins, which contribute most of the distances toward the construction of a statistical potential, may be different from that of small proteins, in addition to the differences accounted for by varying a. For example, the relatively more aspherical structure of small protein domains makes the spherical reference states less accurate. Also, a small protein domain usually lacks a well-packed hydrophobic core. To maintain the 1:1 hydrophobic-to-polar residue ratio found in these proteins, a substantial fraction of hydrophobic residues must be exposed (Shen et al. 2005). This irregularity is not captured well by a statistical potential and thus causes assessment errors. In an attempt to quantify this problem, we derived and tested a statistical potential by using only small protein structures (data not shown). This specialized statistical potential did not perform better than the statistical potential constructed from structures of all sizes (i.e., DOPE), presumably because of the first problem that apparently cancels the gains made by using the specialized sample consisting of only small native structures.
Accuracy of DOPE as the function of model accuracy
The performance of DOPE in selecting the most accurate model tends to increase with the accuracy of the best model in the decoy set. The average C
RMSD, scoreC
RMS error correlation coefficient, and 10% enrichment for high-accuracy targets (best model C
RMS error <3.0 Å) in the moulder decoy set are 0.41 Å, 0.90, and 6.62, respectively. All three measures are significantly better than the averages of 0.58 Å, 0.87, and 6.25 for all 20 targets in the decoy set. This trend is also observed for other scoring functions to a lesser extent; for example, the average correlation coefficient of the Rosetta score exhibits slight improvement from 0.846 to 0.854 when limited to high-accuracy targets.
The dependence of DOPE performance on model accuracy is illustrated further by the relation between the median C
RMS error and the scoreerror correlation coefficient for the 20 targets in the moulder decoy set (Fig. 7). The DOPE scoreerror correlation begins to decrease when the median model C
RMS error is greater than
10 Å. In contrast, the Rosetta scoreerror correlation coefficients remain relatively stable (although lower than those of DOPE for median model C
RMS error <10 Å) in all model accuracy ranges. This difference between DOPE and Rosetta is presumably due to the different information used in the construction of the two scoring functions. By construction, DOPE is based entirely on the native structures and lacks the ability to discriminate models with large C
RMS errors. In contrast, Rosetta contains several physics-based terms, including electrostatics, hydrogen bonds, and solvation, making it possible at least in principle to assess non-native models more accurately based on the laws of physics. In conclusion, a combination of physics-based energy terms and DOPE may be able to improve DOPE's correlation with model accuracy when the C
RMS error is large.
|
Assessment of NMR structures
Another difficulty is the assessment of structures determined by NMR spectroscopy. For example, neither DOPE nor DFIRE is able to identify the native structure of 2pna in the moulder decoy set. Initially, the 2pna native structure was arbitrarily chosen to correspond to the first of the 22 separately listed structures in the 2pna PDB file. To elaborate, we also assessed all 22 native structures in the PDB file; the corresponding DOPE scores ranged from 8997 to 9391. However, even the best scoring native structure does not score better than some of the decoys, although the rank does improves from 103 to 81 for the first and best scoring native structures, respectively. The apparent decrease of DOPE's ability to identify native NMR structures compared to X-ray structures is presumably a consequence of both the derivation of DOPE from X-ray structures and the inherent relative inaccuracy of the NMR structures compared to X-ray structures. It is conceivable that a DOPE-like score derived exclusively from the NMR structures will perform better than the current DOPE.
Applications of DOPE
Protein structure scoring functions have many applications (see introductory section). To facilitate applications of DOPE in particular, we implemented it in MODELLER-8, using cubic splines to interpolate between sampled points for smooth function values and first derivatives (see Materials and Methods) (http://salilab.org/modeller) (Sali and Blundell 1993; Fiser et al. 2000). This implementation allows us to use DOPE for both assessment of given structures as well as their refinement with optimization methods that depend on first derivatives, such as conjugate gradients and molecular dynamics. It also allows us to benefit from DOPE combined together with other scoring functionals that have been already incorporated in MODELLER.
So far, DOPE has already been applied to ab initio protein structure prediction (Colubri et al. 2006), proteinprotein docking (Shen et al. 2005), various problems in comparative modeling, including fold assignment, template selection, sequencestructure alignment (John and Sali 2003; Eramian et al. 2006; Pieper et al. 2006), loop modeling (M.-Y. Shen and A. Sali, unpubl.), side chain modeling (B. Webb and A. Sali, unpubl.), refinement of whole models (B. Webb and A. Sali, unpubl.), and model assessment (Eramian et al. 2006), the modeling of quaternary structure restrained by small angle scattering spectra (F. Foerster, D. Agard, and A. Sali, unpubl.), as well as fitting into cryo-electron microscopy mass density maps combined with comparative modeling (Topf and Sali 2005; Topf et al. 2006).
In the future, further improvements and generalizations of DOPE will be facilitated by its rigorous statistical formulation, with clear assumptions and approximations, free of adjustable parameters. Ongoing work includes a comparison between statistical and physics-based potentials, which may result in a combined function with better accuracy in non-native regions, adapting the potential for maximum accuracy with small proteins, generalization of statistical potentials to multibody forms with a rigorous reference state, and inclusion of information about sequences that do not fold into a given native structure as well as conformations that are not the native structure for a given sequence.
| Materials and methods |
|---|
|
|
|---|
1.8 Å resolution and with an R-factor
0.25. These representative structures share <30% sequence identity with each other. The list was constructed with the PISCES Web server (Wang and Dunbrack 2003). No effort was made to exclude chains with chain breaks, ligands, and quaternary interactions.
Implementation of DOPE
We calculated DOPE for all pairs of non-hydrogen atoms in each of the 20 standard residue types, ignoring the N-terminal and C-terminal nitrogen and oxygen atoms, respectively. Thus, there are a total of 158 residue-dependent atom types. Nine of these atom types include differently labeled but chemically equivalent atoms (i.e., NH1 and NH2 in Arg, OD1 and OD2 in Asp, OE1 and OE2 in Glu, CD1 and CD2 in Leu, CD1 and CD2 in Phe, CE1 and CE2 in Phe, CD1 and CD2 in Tyr, CE1 and CE2 in Tyr, and CG1 and CG2 in Val). The possibility of equivalent atom pairs existing in different protonation states (e.g., OD1 and OD2 in Asp) was not addressed here. Such alternative assignments are difficult to resolve because of the absence of the experimentally determined positions of the hydrogen atoms in most of the sample structures, although an iterative scheme to address the problem has been described recently (Weichenberger and Sippl 2006).
DOPE was tabulated for distances from 0 to 15 Å (rc), with an interval of 0.5 Å (
R). These values were based on prior work (Melo et al. 2002; Zhou and Zhou 2002). The interval counts are converted into the DOPE scores, as described in the Theory section, except for counts of zero, which are assigned a DOPE score of 10, corresponding to the least favorable score.
DOPE was implemented in MODELLER-8 (http://salilab.org/modeller) (Sali and Blundell 1993) and the molecular dynamics package TINKER (Pappu et al. 1998) (http://dasher.wustl.edu). The MODELLER implementation relies on cubic splines (Press 1992) that allow us to smoothly interpolate between the sampled histogram points of DOPE as well as analytically calculate its continuous first derivatives. Thus, we can use DOPE, potentially combined with other scoring functions, for an optimization of a given model, in addition to its assessment.
Tested scoring functions
We assessed DOPE against five other scoring functions: DFIRE (Zhou and Zhou 2002), Rosetta (Simons et al. 1997, 1999; Misura et al. 2006), as well as ModPipe-Surf, ModPipe-Pair, and ModPipe-Comb (Melo et al. 2002). This selection of scores includes both single-body and two-body scoring functions, as well as coarse-grained and all-atom scoring functions. The coarse-grained residue-based scores are ModPipe-Surf (single-body), ModPipe-Pair (two-body), and ModPipe-Comb (combined). The all-atom distance-dependent statistical potentials are DOPE and DFIRE. Rosetta is an all-atom scoring function that includes both physics-based models and statistical potentials, quantifying stereochemistry, nonbonded interactions, and solvation.
We calculated the Rosetta score by the Rosetta program, kindly provided by the authors, using standard argument values (i.e., -score). The DOPE and three ModPipe scores were calculated by MODELLER-8 (http://salilab.org/modeller). We calculated the DFIRE scores for the moulder decoy set by the DFIRE program, also kindly provided by the authors, while the assessments of DFIRE by the five Decoys R Us decoy sets were taken from the original description of DFIRE (Zhou and Zhou 2002).
Decoy sets of protein structures
Six multiple decoy sets, including the 4-state_reduced, fisa, fisa_casp3, lmds, lattice_ssfit, and moulder decoy sets, were used to evaluate the performance of the DOPE statistical potential. The first five decoy sets are available through Decoys R Us (http://dd.stanford.edu). The 4state-reduced decoy set, containing from 632 to 689 models per target (seven targets in total), was generated using a four-state off-lattice model with a conformational relaxation method (Park and Levitt 1996). The fisa and fisa_casp3 decoy sets with four and three targets (5001400 models per target), respectively, were obtained using a combination of a Bayesian scoring function and a simulated annealing protocol (Simons et al. 1997, 1999). The lmds decoy set with 215500 models for each one of 10 primarily short targets, was obtained by a local optimization method and a reduced ENCAD energy function (Keasar and Levitt 2003). The largest lattice_ssfit decoy set, containing 2000 decoys for each of eight targets, was generated using a tetrahedral lattice model with the all-atom ENCAD energy function (Xia et al. 2000).
The moulder decoy set is derived by iterative target-template alignment and comparative model building of 20 target sequences only remotely related to their template structures, relying on MODELLER-6 (John and Sali 2003); it contains 300 models for each target, based on a wide range of target-template alignment accuracy.
All the target native structures in all decoy sets were determined by X-ray crystallography, except for 1bba in lmds and 2pna in moulder, which were determined by NMR spectroscopy.
Assessment of scoring function accuracy
Scoring functions were tested by five different criteria: First, the rank of the native structure in the list of decoys sorted by the tested score (NR). Second, the fraction of the targets for which the native structure was the best scoring structure in a decoy set. Third, the Pearson correlation coefficient r between the scoring function and the C
RMS error for the target decoys (the scoreerror correlation); the C
RMS error was calculated by MODELLER-8, upon least-squares rigid body superposition of a model and the native structure. Fourth, the difference in the C
RMS error between the best scored model and the best model (
RMSD); ideally, C
RMSD should be 0 Å. Fifth, the relative occurrence of the most accurate (C
RMS error) n% models among the n% best scoring models compared to that for the entire decoy set (n% enrichment). The best possible enrichment ratio (i.e., all n% most accurate models are recognized by the scoring function) is 1/n% [i.e., (n%/n%)/n%]. For example, the 10%-enrichment ratio for the best possible scoring function is 10 (i.e., 1/0.10). In contrast, a random scoring function has the probability of n% to correctly identify n% most accurate models, thus its enrichment ratio is n%/n% = 1.
| Footnotes |
|---|