Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
(RECEIVED December 18, 2006; FINAL REVISION February 28, 2007; ACCEPTED April 2, 2007)
| Abstract |
|---|
|
|
|---|
Keywords: protein structure assessment; knowledge-based potentials; statistical potentials; comparative modeling; protein structure prediction
| Introduction |
|---|
|
|
|---|
Many tools for error detection are currently available (Marti-Renom et al. 2000), most of which use statistical potentials or molecular mechanics force fields to assess the accuracy of a protein model. They include contact potentials (Miyazawa and Jernigan 1985; DeBolt and Skolnick 1996; Park and Levitt 1996; Park et al. 1997; Melo et al. 2002), residue-residue distance-dependent potentials (Sippl 1993a; Jones 1999; Xia et al. 2000; Melo et al. 2002), residue-based solvent-accessibility potentials (Jones et al. 1992; Sippl 1993b; Melo et al. 2002), atomic solvent accessibility, and pairwise interaction potentials (Melo and Feytmans 1997, 1998; Lazaridis and Karplus 1998; Samudrala and Moult 1998; Lu and Skolnick 2001; Zhou and Zhou 2002; Wang et al. 2004). All these scoring functions have proven to be able to discriminate between native and misfolded proteins.
The accurate detection of structural errors constitutes a challenging problem, and it is clear that different scoring functions exhibit varying degrees of performance (Park and Levitt 1996). The success of a given energy or scoring function at detecting structural errors will depend on the nature of the model being assessed and on the parameters of the energy function itself. In this respect, most of the protein decoys available for the testing of new scoring functions are not complete and contain many models of low resolution, with incomplete atomic coordinates and significant structural errors (Sanchez and Sali 1998; Jones 1999; Tsai et al. 2003). Therefore, if the ability of scoring functions in the detection of errors of small magnitude is going to be tested, new decoys containing highly accurate and complete protein models are needed.
A scoring function capable of detecting structural errors in highly accurate protein models should have some specific requirements. First, the scoring function must be able to detect a few specific errors of small magnitude that would probably be scattered along the protein sequence and mapping onto different locations in the three-dimensional (3D) structure. Then, this scoring function should be highly sensitive and therefore parameterized at the atomic level with a high-resolution description of the interactions. Second, it is also probable that few localized errors of small magnitude within a structure will not interact with each other in 3D space at a short distance range. Thus, a successful scoring function at detecting errors in highly accurate protein models should not require a strong nonlocal component, nor have a long interacting distance range to describe the atomic interactions.
The requirements described above point toward a central role of a local and short distance range component in a scoring function that is capable of detecting small and localized errors in protein structures. To date, all knowledge-based potentials described have a local term with a lower limit that accounts for the interactions of atoms belonging to two consecutive amino acids. In contrast, force fields like CHARMM (Brooks et al. 1983; MacKerell et al. 1998), AMBER (Weiner and Kollman 1981), and CVFF (Hagler et al. 1974) go beyond this limit and calculate the energy from the 1–4 nonbonded interactions for the Lennard-Jones and the Coulomb terms of the energy function. Therefore, local interactions within a residue are currently considered and their energy calculated when using force fields, but not when using knowledge-based potentials.
Knowledge-based potentials do not incorporate nonbonded terms within a residue because of the lack of an appropriate reference system to accurately calculate them. These energy functions are derived from a set of native protein conformations that have been experimentally solved. The reference state is also calculated from the same data set, because an unbiased and representative experimental data source for the unfolded state of proteins is not yet available. Neither the uniform density (Sippl 1990; Samudrala and Moult 1998; Lu and Skolnick 2001) nor the distance-scaled finite ideal-gas (Zhou and Zhou 2002) reference state can explicitly deal with atom-atom connectivity issues, which become extremely important for nonbonded interactions of short distance range. Additionally, the assumption of independence among interatomic distances for these types of interactions is clearly invalid (Dill 1997). The use of current reference systems to derive the pseudo-energy of nonbonded interactions at a short distance range leads to functions that are either flat (not informative) or nearly identical for different interacting atom pairs (not discriminative).
We have already described the calculation of an atomic knowledge-based potential that considered local and nonlocal interactions (Melo and Feytmans 1997). As in other potentials, the nonbonded interactions within residues were not included, and sequence separation was used to discriminate between local and nonlocal interactions. The shape of each pairwise energy curve of this potential exhibited a high variability for very short sequence separations, which resulted from its local energy component (Melo and Feytmans 1997). On the other hand, when the sequence separation was increased to values larger than 10 amino acids, a monotonic curve shape was observed for a given pairwise interaction (Melo and Feytmans 1997). Based on those observations, we concluded that the splitting of nonlocal interactions into different functions is unnecessary, and therefore two atoms that belong to sequentially distant amino acids could be considered as free particles. Then, in a subsequent study, we derived a new atomic potential that included only the nonlocal interactions. In this nonlocal potential, all interacting atom pairs belonging to residues with sequence separations above 10 residues were collapsed and represented by a single energy function (Melo and Feytmans 1998). This grouping substantially increased the number of observations and allowed us to increase the number of classes of interatomic distances, resulting in a more accurate potential. Additionally, a short distance range that allows a maximum of about two atomic shells was used, thus minimizing the impact of side effects due to atom connectivity. Since then, the resulting potential, which is called ANOLEA, has been used to assess the quality of protein structure models (Melo and Feytmans 1998; Gonzalez et al. 2004; Feliciangeli et al. 2006; Pizzo et al. 2006), and it is currently incorporated into the SwissModel Web server and repository (Schwede et al. 2003; Kopp and Schwede 2004).
The ANOLEA potential was also incorporated into MODELLER (Sali and Blundell 1993), a comparative protein structure modeling software, and used for the prediction of the conformation of loops through molecular dynamics simulations (Fiser et al. 2000). In that work, the ANOLEA potential (Melo and Feytmans 1998) was combined with the CHARMM-22 molecular mechanics force field (Brooks et al. 1983; MacKerell et al. 1998) by replacing Lennard-Jones and Coulomb nonbonded terms from CHARMM with the complete set of terms in ANOLEA. Therefore, the 1–2 bond, 1–3 angle, and 1–4 dihedral and improper dihedral energies were obtained from CHARMM, and the 1–5 and above nonbonded pseudo-energies were obtained by cubic spline interpolation from the discrete ANOLEA functions. Thus, all 1–5 or higher nonbonded interactions were obtained from a single pseudo-energy function that was derived nonlocally for each atomic pair. Contributions from both potentials, as well as residue side-chain dihedral angle pseudo-energies derived from the observed statistical preferences in experimental data (Sali and Blundell 1993), were equally weighted and combined to get the total pseudo-energy of the system. It turned out that this approach resulted in the most accurate predictions of loop conformations (Fiser et al. 2000).
In this study, we have evaluated the impact of incorporating 1–5 nonbonded terms on the ability to detect small and localized errors in a benchmark of protein structure models built by comparative modeling. To avoid the connectivity issues and the problems with the reference system mentioned above, and motivated by the results obtained in the modeling of loops (Fiser et al. 2000), the distance-dependent 1–5 nonbonded terms of each atom-atom interaction were directly extrapolated from our previous nonlocal knowledge-based potential (Melo and Feytmans 1998).
We begin the Results section by describing the building of a new benchmark set of accurate comparative models, which was used to test the potentials in error detection. We continue by assessing the performance in error detection of different potentials with and without the calculation of nonbonded terms. After that, we compare the performance of the potential that includes nonbonded terms described here against the performance of other existing and commonly used energy functions. We conclude by summarizing and discussing the main lessons learned from this study. In the Materials and Methods section, we describe in detail the building of the benchmark set of models and define the derivation and calculation scheme for the different knowledge-based potentials, as well as the criteria used to assess their performance.
| Results |
|---|
|
|
|---|
|
The first set, the class A set, contains 55 highly accurate protein models with >95% of their Cα atoms within 3.5 Å from their corresponding native structures, after the optimal superposition of the structures in 3D space. The total Cα root mean square deviation (RMSD) of all residues in each model is <1.1 Å. The second set, the class B set, contains 57 accurate protein models with >90% of their Cα atoms within 3.5 Å from their corresponding native structures. The total Cα RMSD of all residues in each model is >1.5 Å and <2.6 Å. The structural modeling accuracy of each residue in a model was assessed by comparison to the corresponding native structure. The 55 models in the class A set contain a total of 10,295 residues, out of which only 201 contain structural errors. The 57 models in the class B set contain 10,714 residues, out of which a total of 1257 were erroneously modeled. For details about the building of these two sets of models and the definition of structural errors, see Materials and Methods.
Performance of knowledge-based potentials with different interaction terms on the detection of wrongly modeled residues
Different knowledge-based potentials were calculated in this study, which differed only in the type of interactions included (Table 1). These potentials were derived by considering only local, only nonlocal, both types of interactions (i.e., local + nonlocal), or all 1–5 and above nonbonded interactions. Based on previous observations (Melo and Feytmans 1997, 1998), we have used a sequence separation threshold of eight to nine amino acids to differentiate local from nonlocal interactions. The lower limit for the calculation of local interactions was set to a minimal sequence separation of at least one amino acid. This threshold was adopted because none of the different reference systems that can be used to derive the potentials (Sippl 1990; Samudrala and Moult 1998; Lu and Skolnick 2001; Zhou and Zhou 2002) is able to remove the distance-dependency bias that arises from connectivity issues at this short sequence separation range (Sippl 1990; Sippl and Weitckus 1992; Melo et al. 2002). However, the major aim of this study was to assess the effect of more local terms such as close nonbonded interactions that occur between two atoms belonging to the same residue or to adjacent residues. As mentioned above, these nonbonded (NB) pseudo-energy terms cannot be directly derived from a source set of native protein structures, irrespective of the reference system that is adopted. To overcome this limitation, we have extrapolated the terms from the nonlocal (NL) potential in order to obtain an estimate of the energy scores for interacting atom pairs at shorter connectivity ranges. Therefore, a total of five different schemes were used and tested (Table 1), which are defined not only by how the potential is derived, but also by how it is used to calculate the pseudo-energies of a protein structure.
|
|
The comparative performance of optimal classifiers based on these potentials was also assessed in the class A and class B sets of models (Table 2). The results show that for the class A set, the NL-NB potential is clearly the best, achieving an accuracy of 80%, a specificity of 80%, and a sensitivity of 70%. The second best classifier, the L-L potential, achieves an accuracy of 72%, a specificity of 73%, and a sensitivity of 67%. The other potentials perform clearly worse on this set upon the evaluation of these measures. For the class B set, although a similar trend is observed, the NL-NB potential is not clearly better than the L-L potential.
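The three measures quoted above follow from the per-residue confusion counts. The snippet below is a minimal sketch of that arithmetic, assuming that a "positive" is a wrongly modeled residue; the function name and the example counts are hypothetical and are not taken from Table 2.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion counts.

    Assumption: a positive is a residue labeled as wrongly modeled.
    tp/fn count erroneous residues that were flagged/missed;
    tn/fp count correct residues that were left alone/flagged.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # fraction of erroneous residues detected
    specificity = tn / (tn + fp)  # fraction of correct residues not flagged
    return accuracy, sensitivity, specificity


# Hypothetical counts chosen only to illustrate the calculation.
acc, sens, spec = classification_metrics(tp=70, fp=200, tn=800, fn=30)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Because correct residues greatly outnumber erroneous ones in both sets, accuracy is dominated by specificity, which is one reason to report all three measures together.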
|
atoms; and (3) DFIRE based on only Cβ atoms (Zhou and Zhou 2002). These three potentials were used because they are freely available, they are widely used for error detection in protein structures, they produce as an output the energy profiles that are required to assess their performance in this benchmark, and because they are derived by using two different reference systems. We were unable to compare against other potentials because the available implementations do not output an energy profile (i.e., only the total raw and normalized energies of a protein are reported).
The results of this comparison show that the NL-NB potential clearly outperforms the other potentials in both sets of models, for almost all possible classification thresholds (Fig. 3). The most important observation is the large difference in performance for the class A set when a high specificity is required. For example, for a fixed specificity of 95% (i.e., a false-positive rate of 5%), the NL-NB potential exhibits a sensitivity of 45%. For the same specificity value, DFIRE-Cα and DFIRE-Cβ have a sensitivity of 18%, and ProSa exhibits a sensitivity of 11%. Analogously, for a fixed specificity of 80%, the NL-NB potential exhibits a sensitivity of 71%. This performance compares favorably against that obtained for DFIRE-Cα, DFIRE-Cβ, and ProSa, which exhibit sensitivities in the range between 30% and 40%. For the class B set, the same trend is observed, although the differences are not as large as those observed for the class A set.
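Reading a sensitivity at a fixed specificity amounts to scanning classification thresholds over the per-residue scores. The following is a minimal sketch of that scan, assuming that a higher score marks a residue as more likely to be wrong and that a residue is flagged when its score exceeds the threshold; the function name and the toy data are hypothetical.

```python
def sensitivity_at_specificity(scores, is_error, target_specificity):
    """Best sensitivity among thresholds whose specificity meets the target.

    Assumptions: higher score = more likely wrongly modeled, and a
    residue is flagged when score > threshold.
    """
    positives = [s for s, e in zip(scores, is_error) if e]
    negatives = [s for s, e in zip(scores, is_error) if not e]
    if not positives or not negatives:
        raise ValueError("need at least one erroneous and one correct residue")
    best = 0.0
    for threshold in sorted(set(scores)):
        # Correct residues at or below the threshold are true negatives.
        specificity = sum(s <= threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            sensitivity = sum(s > threshold for s in positives) / len(positives)
            best = max(best, sensitivity)
    return best


# Toy profile: four correct residues followed by four erroneous ones.
scores = [0.1, 0.2, 0.3, 0.4, 0.9, 0.8, 0.35, 0.05]
labels = [False, False, False, False, True, True, True, True]
print(sensitivity_at_specificity(scores, labels, 0.75))
```

The quadratic scan is adequate for a few thousand residues; a sorted single pass would be preferable for larger sets.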
|
The comparative performance of optimal classifiers based on the NL-NB and the external potentials was also assessed in the class A and class B sets of models (Table 3). In accordance with the observations described above, the NL-NB potential is clearly the best classifier of residue modeling accuracy in both sets of models.
|
| Discussion |
|---|
|
|
|---|
The central point addressed in this study involves the incorporation of close nonbonded terms for the evaluation of protein structure conformations. By close nonbonded terms, we refer to those nonbonded interactions occurring between pairs of atoms that are closely connected through a few intermediate covalent bonds (i.e., 1–5 interactions and above). This type of interaction is clearly not independent of the protein backbone and side-chain configurations of the amino acids since some pairs of atoms are restricted to be close in 3D space because of the existing connectivity. Additionally, some interactions such as those occurring between atoms that belong to a ring are very restricted and do not adopt alternative conformations. This poses a challenge to current methodologies used to derive the knowledge-based potentials, if these terms are to be included. The problem resides in the reference system used to derive the potential. The derivation of knowledge-based potentials for protein structure prediction is carried out by using the inverse Boltzmann law, which in its general form states:
$$\Delta E(s) = -RT \ln\left[\frac{p_F(s)}{p_U(s)}\right]$$

where $\Delta E(s)$ represents the change of energy associated with the transition between the unfolded and the folded states defined by the variable $s$; $p_F(s)$ represents the probability of occurrence of the subsystem defined by $s$ in the folded state F; $p_U(s)$ represents the probability of occurrence of the subsystem defined by $s$ in the unfolded state U; $R$ is the gas constant; and $T$ is the absolute temperature measured in kelvins. Unfortunately, the term $p_U(s)$, which describes the reference state, cannot be directly calculated because a homogeneous and unbiased experimental sample of the unfolded state of proteins is not available. As an attempt to circumvent this problem, the interactions observed among any pair of atoms, under the assumption of a null interaction model, are often used as a reference system to derive the potential. However, these probability estimates are derived from the same experimental data set as the $p_F(s)$ term, that is, the folded state F, and they do not necessarily correspond to the real probabilities that exist in the unfolded state U.
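As a numerical illustration of the inverse Boltzmann law, the sketch below converts a pair of probabilities into a pseudo-energy. The function name is hypothetical, the default temperature of 293 K is the value reported in Materials and Methods, and the probabilities in the example are made up.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def inverse_boltzmann(p_folded, p_reference, temperature=293.0):
    """Pseudo-energy (kcal/mol) of a subsystem from two probabilities.

    p_folded is the probability observed in native structures and
    p_reference the probability assigned by the reference state.
    """
    if p_folded <= 0 or p_reference <= 0:
        raise ValueError("probabilities must be strictly positive")
    return -R_KCAL * temperature * math.log(p_folded / p_reference)


# A subsystem seen twice as often as the reference predicts is favorable.
print(round(inverse_boltzmann(0.2, 0.1), 3))
```

A subsystem observed exactly as often as the reference predicts scores zero, which is why a poorly chosen reference state yields flat, uninformative functions.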
When a reference system based on the folded state is used to derive a knowledge-based potential, most of the close nonbonded terms result in noninformative or nondiscriminative energy functions. One way to circumvent this problem, as proposed in this work, is to derive a nonlocal potential with a short distance range and then to extrapolate the resulting energy values of nonlocal atom pairs to the close nonbonded terms. The calculation of a nonlocal potential with a short distance range is a good approximation to capture atom-atom interactions as if they were free particles, because there should not exist any connectivity constraints for two nonlocal atoms to be close in space. In addition, the short distance range also allows us to minimize some side effects or dependency on bonded atoms when deriving the potential. For example, the existence of a real interaction between two atoms i and j should not lead to the observation of an artificial favorable interaction between atoms bonded to i and atoms bonded to j.
Based on a benchmark of accurate protein models, we have demonstrated that this approximation exhibited the best performance at classifying residue modeling accuracy. The incorporation of close nonbonded terms by direct extrapolation improved the potential's performance as compared to other potentials, whether or not they included more distant local terms. On the other hand, the direct derivation of a potential including close nonbonded terms resulted in a poor classifier of residue modeling accuracy. These results suggest that close nonbonded terms are important to describe native protein properties and should be included in the potentials, but their direct derivation by using the current methodologies and experimental data sets does not lead to an accurate description of this type of interaction.
The accurate detection of small and localized errors in the benchmark of protein structures used in this study is not an easy task. This is demonstrated by the maximum accuracy, specificity, and sensitivity achieved by all the potentials tested here. The best performance was achieved by the NL-NB potential, with a maximum accuracy of 80%, a sensitivity of 70%, and a specificity of 80%. This means a rate of 20% false positives and a rate of 30% false negatives, which certainly does not constitute an ideal classifier. However, it is important to mention that this performance was achieved by considering only distance-dependent pairwise terms. It is expected that the incorporation of solvation terms will further improve the current performance.
Knowledge-based potentials are informatics functions (Solis and Rackovsky 2006). Their capacity to properly describe the atomic interactions that are recurrent in native protein conformations depends on many parameters and on how the data are compressed and classified. In addition, the performance of a potential depends not only on how the information was extracted, compressed, and classified, but also on how the information is used. In this study, we have demonstrated that a potential that was derived by considering only the nonlocal interactions can be successfully used to infer the close nonbonded interactions by direct extrapolation. We are not claiming that these extrapolated terms reflect the true existing nonbonded interactions, but only that this methodology improves the performance for the task of classifying residue modeling accuracy. This improvement can only be explained by an information gain for this classification task when using this methodology. This information gain was certainly not achieved when using the NB-NB potential, although the total number of terms considered to estimate the accuracy of a given residue was exactly the same as that for the NL-NB potential.
The limited performance of the ProSa and DFIRE potentials in the benchmark of models used in this study is partially due to the fact that these are residue-based potentials, and, as such, they only include a description of the Cα, Cβ, or backbone atom interactions. Therefore, a fair comparison of their performance in error detection against full-atom potentials is simply not possible. On the other hand, these potentials are distributed in this form, and therefore it is the only way that they can be used and compared against other potentials in a particular benchmark and error detection task. It is evident that, for the detection of errors of small magnitude like the ones present in our benchmark of models, residue-based potentials exhibit a low sensitivity and specificity.
Future directions to further improve the performance of knowledge-based potentials for the task of error detection include the incorporation of implicit solvation models to account for the interactions between protein and solvent atoms, the consideration of atom shielding to describe more accurately the effective atomic interactions, and the exploration of different reference systems to derive the potentials. We are presently working on these topics and expect to incorporate these elements in a new generation of knowledge-based potentials that should be more accurate for the task of protein structure assessment.
| Materials and Methods |
|---|
|
|
|---|
Derivation of knowledge-based potentials
Different types of distance-dependent potentials were calculated. The only distinct feature for the derivation of these potentials was the type of interaction considered (Table 1). The types of interaction are only local, only nonlocal, both types of interaction (i.e., local + nonlocal), or all 1–5 and above nonbonded interactions. Local interactions are defined as those occurring between any two atoms that belong to amino acids in the same chain with a separation k along the sequence in the range 2–8. Sequence separation (k) of amino acids in positions i and j is defined as |i − j|. Therefore, this definition of local interactions excludes all interactions between atoms belonging to the same residue or to adjacent residues. Nonlocal interactions are defined as those occurring between any two atoms that belong to amino acids in the same chain with a separation k along the sequence ≥ 9, or to amino acids from different chains. All interactions are the union of the local and nonlocal interactions defined above (i.e., local + nonlocal). Finally, 1–5 and above nonbonded interactions are defined as those occurring between two interacting atoms that are separated by a connectivity of at least four covalent bonds in the structure (see next subsection below), irrespective of the existing sequence separation k between the amino acids that contain the interacting atom pair. The 1–5 and above nonbonded interactions are also assessed for pairs of atoms belonging to amino acids with a sequence separation of 0 (same amino acid) or 1 (adjacent amino acids).
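The sequence-separation rules above can be written as a small classifier. This is a minimal sketch under the stated thresholds (local for k between 2 and 8, nonlocal for k of 9 or more, or for atoms in different chains); the function name and the label "close" for k of 0 or 1 are hypothetical conveniences, not terms used by the potentials themselves.

```python
def interaction_class(res_i, res_j, chain_i="A", chain_j="A"):
    """Classify an atom pair by the sequence separation of its residues.

    res_i and res_j are residue positions along the sequence; atoms in
    different chains are always treated as nonlocal.
    """
    if chain_i != chain_j:
        return "nonlocal"
    k = abs(res_i - res_j)  # sequence separation k = |i - j|
    if k >= 9:
        return "nonlocal"
    if k >= 2:
        return "local"
    # k = 0 or 1: same or adjacent residues, outside the local definition;
    # such pairs are only scored through the 1-5 and above nonbonded terms.
    return "close"


print(interaction_class(10, 12), interaction_class(10, 30), interaction_class(10, 11))
```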
The distance-dependent potentials were calculated as described (Sippl 1993a; Melo and Feytmans 1997, 1998):
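$$E^{k}_{ij}(d) = RT \ln\left[1 + M_{ijk}\,\sigma\right] - RT \ln\left[1 + M_{ijk}\,\sigma\,\frac{f^{k}_{ij}(d)}{f^{k}_{xx}(d)}\right]$$

where $M_{ijk}$ is the number of occurrences for the interaction of atom types i and j at sequence separation k:

$$M_{ijk} = \sum_{d=1}^{r} N^{k}_{ij}(d)$$

with $N^{k}_{ij}(d)$ the number of observations of that interaction in the class of distance d; r is the number of classes of distance; and $\sigma$ is the weight given to each observation. $\sigma = 0.02$ was used (Sippl 1990), so that with 50 observations $f^{k}_{ij}(d)$ and $f^{k}_{xx}(d)$ have equal weights for the calculation of $E^{k}_{ij}(d)$. $f^{k}_{ij}(d)$ is the relative frequency of occurrence for the interaction of atom types i and j at sequence separation k in the class of distance d:

$$f^{k}_{ij}(d) = \frac{N^{k}_{ij}(d)}{M_{ijk}}$$

$f^{k}_{xx}(d)$ is the relative frequency of occurrence for all the interactions of any two atom types at sequence separation k in the class of distance d:

$$f^{k}_{xx}(d) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} N^{k}_{ij}(d)}{\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{d=1}^{r} N^{k}_{ij}(d)}$$

where n is the number of different atom types; and r is the number of distance classes. The temperature T was set to 293 K, so that RT is equivalent to 0.582 kcal/mol.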
The common parameters of the potentials tested in this work are the same as those previously optimized for the ANOLEA potential (Melo and Feytmans 1997). The potentials were derived for a maximal distance range of 7.0 Å using 35 discrete classes of 0.2 Å each. A total of 40 atom types were defined for all heavy atoms in the 20 standard amino acids.
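To make the derivation concrete, the sketch below bins a distance into one of the 35 classes and turns raw counts into pseudo-energies. It assumes the sparse-data-corrected form of Sippl (1990), E = RT ln(1 + Mσ) − RT ln(1 + Mσ f_ij/f_xx), with σ = 0.02 and RT = 0.582 kcal/mol as given in the text. The function names, the toy counts, and the choice to return zero for a distance class with no reference observations are illustrative assumptions.

```python
import math

RT = 0.582        # kcal/mol at 293 K, as stated in the text
SIGMA = 0.02      # weight given to each observation (Sippl 1990)
N_CLASSES = 35    # discrete distance classes
BIN_WIDTH = 0.2   # width of each class in angstroms (35 * 0.2 = 7.0)


def distance_class(distance):
    """Return the class index (0..34) of a distance, or None beyond 7.0 A."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    index = int(distance / BIN_WIDTH)
    return index if index < N_CLASSES else None


def pair_potential(pair_counts, all_counts):
    """Pseudo-energy per distance class for one atom-type pair.

    pair_counts[d]: observations of this pair in class d.
    all_counts[d]: observations of all atom-type pairs in class d.
    """
    m = sum(pair_counts)
    total = sum(all_counts)
    energies = []
    for n_pair, n_all in zip(pair_counts, all_counts):
        if n_all == 0:
            # Assumption: an empty reference class is treated as uninformative.
            energies.append(0.0)
            continue
        f_pair = n_pair / m if m else 0.0
        f_all = n_all / total
        energies.append(RT * math.log(1 + m * SIGMA)
                        - RT * math.log(1 + m * SIGMA * f_pair / f_all))
    return energies


# Toy example with three classes and 50 observations of the pair.
print([round(e, 3) for e in pair_potential([0, 25, 25], [100, 100, 100])])
```

With 50 observations the product Mσ equals 1, so the pair-specific and reference frequencies contribute equally, which matches the stated rationale for σ = 0.02.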
Use of knowledge-based potentials
The potentials describing local, nonlocal, and all interactions were used to assess the quality of protein structure models with the same sequence separation parameters defined to derive them (Table 1). The NB-NB potential was calculated for all 1–5 nonbonded pairs without considering the sequence separation as a parameter. Additionally, the nonlocal potential was also used to calculate the energies for up to the 1–5 nonbonded interactions through direct extrapolation, based on the particular observed distance among any two given interacting atom types (Table 1). In this study, the 1–5 and above nonbonded interactions were considered. A 1–5 nonbonded interaction is defined as one formed by two interacting atoms that are separated by a connectivity of four covalent bonds in the structure (e.g., the Cα and the Cε of a lysine). More generally, a 1–X nonbonded interaction is defined as that existing between two atoms that are separated by a connectivity of X − 1 (X minus one) covalent bonds in the structure. In the case of rings, the minimal connectivity in terms of covalent bonds between the two atoms is adopted.
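The 1–X definition reduces to a shortest-path search on the covalent bond graph, which also yields the minimal connectivity required for rings. The sketch below uses a breadth-first search; the function names and the hand-written lysine bond list (heavy atoms only, standard PDB atom names) are illustrative and are not taken from the software used in this study.

```python
from collections import deque

# Heavy-atom covalent bonds of a lysine residue (PDB atom names).
LYSINE_BONDS = [("N", "CA"), ("CA", "C"), ("C", "O"), ("CA", "CB"),
                ("CB", "CG"), ("CG", "CD"), ("CD", "CE"), ("CE", "NZ")]


def bond_separation(bonds, start, end):
    """Minimal number of covalent bonds between two atoms, or None."""
    graph = {}
    for a, b in bonds:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if start not in graph or end not in graph:
        raise KeyError("atom not present in the bond list")
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom == end:
            return dist
        for neighbor in graph[atom]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # atoms are not covalently connected


def is_1_5_or_above(bonds, atom_a, atom_b):
    """True when the pair is separated by at least four covalent bonds."""
    separation = bond_separation(bonds, atom_a, atom_b)
    return separation is None or separation >= 4


print(bond_separation(LYSINE_BONDS, "CA", "CE"))  # Calpha to Cepsilon
```

Breadth-first search returns the shortest path first, so for two ring atoms the shorter way around the ring is the one counted.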
The detection of residues with structural errors was carried out by means of smoothed and normalized energy profiles. The normalized total energy score per residue (ER ) is defined as follows:
$$E_R = \frac{1}{T_R} \sum_{i=1}^{N} \sum_{j=1}^{M} E^{k}_{ij}(d)$$

where N is the total number of i atoms belonging to a given residue, and M is the total number of atoms j that interact with atom i at sequence separation k and below the distance range defined in the potential. $E^{k}_{ij}(d)$ corresponds to the energy score assigned by the statistical potential to the interaction between atoms i and j at sequence separation k and at distance d. $T_R$ is the total sum of ij interactions recorded for residue R and lies in the range $0 \le T_R \le N \times M$. The normalized energy profiles were smoothed by a sliding window with a length of seven residues. These normalized and smoothed energy profiles were finally used to assess the structural errors in our benchmark of protein models. Thus, for each protein model, the three amino acids at the C and N termini were not assessed. We initially tested smoothing windows of three, five, seven, and nine residues. Because the windows of five, seven, and nine residues gave similar results, all slightly better than those obtained when using a window of three residues, we decided to fix this parameter to the middle value of seven. Different potentials showed the same trend upon changing the smoothing window size parameter value.
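The profile calculation can be sketched in two steps: average the pair energies recorded for each residue, then smooth the profile with a seven-residue window. The code below assumes an unweighted mean inside the window and assigns zero to a residue with no recorded interactions; both choices, like the function names, are illustrative assumptions.

```python
def residue_energy(pair_energies):
    """Normalized energy of one residue: sum of pair energies over T_R.

    pair_energies holds one value per recorded i-j interaction, so its
    length plays the role of T_R.
    """
    if not pair_energies:
        return 0.0  # assumption: no recorded interactions scores zero
    return sum(pair_energies) / len(pair_energies)


def smooth_profile(profile, window=7):
    """Sliding-window mean; residues lacking a full window are dropped."""
    if window % 2 == 0 or window < 1:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for center in range(half, len(profile) - half):
        segment = profile[center - half:center + half + 1]
        smoothed.append(sum(segment) / window)
    return smoothed


# Ten residues give four smoothed values: three are lost at each terminus.
print(smooth_profile([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

With a window of seven the half-width is three, which is why the three residues at each terminus carry no smoothed value and are left unassessed.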
External knowledge-based potentials and semiempirical force fields
In addition to the potentials described above, we also tested the performance in error detection of DFIRE (Zhou and Zhou 2002), ProSa (Sippl 1993b), and CHARMM (Brooks et al. 1983; MacKerell et al. 1998). DFIRE is a knowledge-based potential that uses a distance-scaled finite ideal-gas reference state (Zhou and Zhou 2002). We requested a full-atom version of DFIRE from the authors, as well as the versions of only Cα (DFIREca) or Cβ (DFIREcb). However, the full-atom version of the software does not output an energy profile for each residue in the protein, but a total energy score. Therefore, we were only able to evaluate the performance of the Cα and Cβ versions of DFIRE on our benchmark set of models. ProSa was initially developed in 1993 (Sippl 1993b), but here we used the most recent version of this software, which was released in 2003. The software was downloaded from http://www.came.sbg.ac.at. We also used the semiempirical force field CHARMM22 (Brooks et al. 1983; MacKerell et al. 1998), as implemented in the comparative modeling software MODELLER (Sali and Blundell 1993).
Building of a benchmark set of accurate and highly accurate protein structure models
To assess the performance of knowledge-based potentials in the detection of structural errors of small magnitude, two sets of comparative protein structure models were built (Fig. 1). The first set (class A) contains 55 highly accurate protein models. The second set (class B) contains 57 accurate protein models. The highly accurate protein models from the class A set have >95% equivalent α-carbons with their corresponding native structures and a total root mean square deviation (RMSD) of <1.1 Å for all α-carbons. The accurate protein models from the class B set have >90% equivalent α-carbons with their corresponding native structures and a total RMSD of >1.5 Å and <2.6 Å for all α-carbons.
The models in the class A set represent 35 distinct folds. In the class B set, 34 distinct folds are represented. Therefore, even though not all the models in these two sets were built for proteins representing different folds, they are not strongly biased to any particular fold since a large fraction of the models has a distinct fold. In terms of a more general classification, such as the one defined by the composition and arrangement of secondary structure elements, a fairly good and poorly biased representation is achieved. For the class A set, 26% of the models contain only α-helices as secondary structure elements, 24% contain only β-sheets, 18% contain α and β, and 33% of the models contain α + β secondary structure elements in their structures. In the class B set of models, 36% of the models contain only α-helices, 31% contain only β-sheets, 13% contain α and β, and 20% contain α + β.
The 57 accurate and 55 highly accurate protein models were selected from an existing set of 3375 models with a correct fold that has been described previously (Sanchez and Sali 1998; Melo et al. 2002). This original set of 3375 models was built by the comparative modeling of representative chains of the Protein Data Bank (PDB) (Berman et al. 2002). The models were built based on the correct templates and mostly correct alignments between the target sequences and the template structures. The models were obtained by applying MODPIPE to 1085 representative chains of the PDB (Sanchez and Sali 1998). These representative sequences corresponded to the protein chains in PDB that shared <30% sequence identity or were >30 residues different in length. The templates for comparative modeling, selected by MODPIPE, were 1637 PDB chains with <80% sequence identity to each other or >30 residues different in length. Each target sequence was aligned separately with each one of the 1637 known structures using the program ALIGN, which implements local sequence alignment by dynamic programming (Altschul 1998). Only the target–template alignments with a significance score higher than 22 nats (corresponding approximately to the PSI-BLAST e-value of 10⁻⁴) were used, resulting in 3993 models. Models with <30% structural overlap with the actual experimentally determined structure were eliminated. Structural overlap was defined as the fraction of the equivalent Cα atoms that lie within a 3.5 Å cutoff of each other upon least squares superposition of the two structures. This procedure also removed models based on correct templates that had a poor alignment and models based on templates that had large domain or rigid-body movements with respect to the target structure. The final set contained 3375 correct models (Melo et al. 2002).
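The structural overlap filter described above can be illustrated with a short Python sketch. It is not the code used to build the model set. It assumes that the two Cα coordinate lists are already superposed and residue-matched, that the cutoff is inclusive, and that the denominator is the number of equivalent Cα pairs. The function names `structural_overlap` and `passes_overlap_filter` are hypothetical.

```python
import math

def structural_overlap(model_ca, native_ca, cutoff=3.5):
    """Fraction of equivalent C-alpha pairs lying within `cutoff` angstroms.

    Both arguments are lists of (x, y, z) tuples for equivalent residues.
    Assumption: the structures were already superposed by least squares.
    """
    if len(model_ca) != len(native_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    close = sum(1 for a, b in zip(model_ca, native_ca) if math.dist(a, b) <= cutoff)
    return close / len(model_ca)

def passes_overlap_filter(model_ca, native_ca, min_overlap=0.30):
    """Keep a model unless its structural overlap is below 30%."""
    return structural_overlap(model_ca, native_ca) >= min_overlap

# Toy coordinates (made up): pairwise distances are 1, 0, 10 and 5 angstroms.
model_ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
native_ca = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
overlap = structural_overlap(model_ca, native_ca)
print(overlap, passes_overlap_filter(model_ca, native_ca))  # prints: 0.5 True
```

In this toy case two of the four Cα pairs fall within 3.5 Å, so the overlap is 0.5 and the model would be retained.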
The set of 3375 correct models was initially filtered by updating the target and template structures currently available at the PDB and checking the sequence alignments originally used to build the models. A total of 132 models presented inconsistencies between the target sequence in the original alignment used to build the model and the current target sequence available at the PDB. These 132 models were removed, resulting in a total of 3243 entries (Fig. 1). Then a second filter was applied, which selected only those protein models longer than 100 residues for which >90% of the residues could be modeled. Finally, as explained above, two independent filters were applied to select those models belonging to the class A and class B sets (Fig. 1). All models in both sets were built for monomeric target proteins. The 3D coordinates of these models in PDB format are available as supplemental material at http://protein.bio.puc.cl/sup-mat.html.
Identification and definition of those residues incorrectly modeled
To identify those residues that contained structural errors, all accurate and highly accurate protein models were optimally superposed to their corresponding native target structures using MODELLER software release 7v7 (Sali et al. 2001). Those residues that had an RMSD >1.8 Å for all main-chain atoms and a total side-chain RMSD >3.5 Å were defined as wrongly modeled. Otherwise, the residues were defined as correctly modeled. This binary classification of structural quality for each residue is called the real classification. Under this classification scheme, 201 residues (1.95%) are defined as wrongly modeled in the 55 models belonging to the class A set, which contains a total of 10,295 residues. In the 57 models belonging to the class B set, 1257 residues (11.73%) are defined as wrongly modeled, from a total of 10,714 residues. Although arbitrary, the definition of wrongly modeled residues based on these RMSD cutoffs was derived from the visual inspection of all protein models after optimal superposition with their corresponding native structures. This definition of error clearly correlates with the observed structural deviation. All protein models, along with their corresponding error definitions and computer scripts for RASMOL software (Sayle and Milner-White 1995; Bernstein 2000) to visualize them graphically in three dimensions, are available as supplemental material at http://protein.bio.puc.cl/sup-mat.html.
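The per-residue error definition above can be sketched in Python as follows. This is only an illustration under stated assumptions. Coordinates are taken to be already superposed. The RMSD is computed per residue over matched atoms. Both cutoffs must be exceeded, as the text states. Residues without side-chain atoms (for example, glycine) are not handled, because the text does not say how they were treated. The names `rmsd` and `is_wrongly_modeled` are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

def is_wrongly_modeled(main_model, main_native, side_model, side_native,
                       main_cutoff=1.8, side_cutoff=3.5):
    """True when main-chain RMSD > 1.8 A and side-chain RMSD > 3.5 A."""
    return (rmsd(main_model, main_native) > main_cutoff
            and rmsd(side_model, side_native) > side_cutoff)

# Toy residue (made-up coordinates): main-chain RMSD is 2.0, side-chain RMSD is 4.0.
main_model = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
main_native = [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
side_model = [(0.0, 0.0, 0.0)]
side_native = [(4.0, 0.0, 0.0)]
wrong = is_wrongly_modeled(main_model, main_native, side_model, side_native)

# Error fractions reported in the text, recomputed from the residue counts.
class_a_percent = round(100 * 201 / 10295, 2)   # 1.95
class_b_percent = round(100 * 1257 / 10714, 2)  # 11.73
print(wrong, class_a_percent, class_b_percent)
```

The last two lines only recompute the reported percentages from the residue counts given in the text.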
Evaluation of energy functions for the detection of residues incorrectly modeled
As mentioned above, each potential was used to obtain a smoothed and normalized energy profile for each protein model in both sets. For a given energy score threshold, a binary classifier was built for each potential, in which each residue was predicted as correctly or wrongly modeled depending on whether its energy score fell below or above the threshold, respectively. A positive instance was defined as a wrongly modeled residue, and a negative instance as a correctly modeled residue. The predictions generated by each classifier at each possible energy score threshold for all residues within a set of models, named hypothetical classifications, were then compared to those previously defined by the real classification of errors (see above). When comparing a real classification against a hypothetical one, four outcomes are possible: (1) if the instance is positive and it is classified as positive, it is counted as a true positive (TP); (2) if the instance is positive and it is classified as negative, it is counted as a false negative (FN); (3) if the instance is negative and it is classified as negative, it is counted as a true negative (TN); and (4) if the instance is negative and it is classified as positive, it is counted as a false positive (FP). Then, the following metrics are defined:
$$
tp = \frac{TP}{P}, \qquad fp = \frac{FP}{N}
$$
where tp is the true-positive rate, TP is the count of true-positive instances, P is the total number of positive instances (i.e., the sum of TP and FN counts), fp is the false-positive rate, FP is the count of false-positive instances, and N is the total number of negative instances (i.e., the sum of TN and FP counts).
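The counting scheme and the two rates can be illustrated with a short Python sketch. It assumes that a residue is flagged as wrongly modeled when its energy score is strictly above the threshold; the text does not specify how ties are handled. The scores and labels are made-up toy values, and the function names are hypothetical.

```python
def confusion_counts(scores, is_error, threshold):
    """Return (TP, FN, TN, FP) when residues scoring above `threshold` are flagged."""
    tp = fn = tn = fp = 0
    for score, wrong in zip(scores, is_error):
        flagged = score > threshold  # assumption: strict inequality
        if wrong and flagged:
            tp += 1
        elif wrong:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp

def rates(tp, fn, tn, fp):
    """Return (tp_rate, fp_rate) with P = TP + FN and N = TN + FP."""
    positives, negatives = tp + fn, tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("both positive and negative instances are required")
    return tp / positives, fp / negatives

# Toy per-residue energy scores and real classification (True = wrongly modeled).
scores = [0.9, 0.2, 0.7, 0.1, 0.4, 0.8]
is_error = [True, False, True, False, True, False]
counts = confusion_counts(scores, is_error, threshold=0.5)
tp_rate, fp_rate = rates(*counts)
print(counts)  # prints: (2, 1, 2, 1)
```

With a threshold of 0.5, two of the three wrongly modeled residues are detected and one of the three correct residues is flagged, giving tp = 2/3 and fp = 1/3.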
Based on these two rates, receiver operating characteristic (ROC) curves were calculated for each potential as previously described (Fawcett 2004) and used to assess its performance. ROC curves are two-dimensional graphs in which the true-positive rate is plotted on the Y-axis and the false-positive rate is plotted on the X-axis (Swets 1988; Swets et al. 2000; Fawcett 2004). An ROC graph depicts the relative tradeoffs between benefits (true positives) and costs (false positives) for all possible decision thresholds.
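A minimal Python sketch of how such a curve can be traced is given below. It is a simple quadratic-time threshold sweep, not the more efficient sorted procedure described by Fawcett (2004). It assumes that residues scoring strictly above the threshold are flagged as errors. The function name `roc_points` and the toy data are hypothetical.

```python
def roc_points(scores, is_error):
    """Return (fp_rate, tp_rate, threshold) triples sorted from (0, 0) to (1, 1)."""
    positives = sum(1 for wrong in is_error if wrong)
    negatives = len(is_error) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both positive and negative instances are required")
    # Each distinct score, plus minus infinity, yields every distinct classification.
    thresholds = [float("-inf")] + sorted(set(scores))
    points = []
    for threshold in thresholds:
        tp = sum(1 for s, w in zip(scores, is_error) if w and s > threshold)
        fp = sum(1 for s, w in zip(scores, is_error) if not w and s > threshold)
        points.append((fp / negatives, tp / positives, threshold))
    points.sort(key=lambda point: (point[0], point[1]))
    return points

# Toy per-residue energy scores and real classification (True = wrongly modeled).
scores = [0.9, 0.2, 0.7, 0.1, 0.4, 0.8]
is_error = [True, False, True, False, True, False]
curve = roc_points(scores, is_error)
for fp_rate, tp_rate, threshold in curve:
    print(f"fp={fp_rate:.2f} tp={tp_rate:.2f} threshold={threshold}")
```

The highest threshold flags no residue and gives the point (0, 0); a threshold of minus infinity flags every residue and gives (1, 1).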
In addition to the ROC curves, which constitute the best way to compare two classifiers, we also calculated four overall measures to assess and compare the performance of the potentials in the detection of errors in accurate and highly accurate protein structure models. The first measure was the area under the ROC curve (AUC), which ranges from 0.5 for a random classifier to 1.0 for a perfect one. The AUC summarizes the ROC curve in a single number and has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance (Fawcett 2004). The second measure was the accuracy (ACC), which is defined as:
$$
\mathrm{ACC} = \frac{TP + TN}{P + N}
$$
Finally we define the sensitivity (Sn) and specificity (Sp) measures as follows:
$$
\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}
$$
Since almost all the metrics mentioned above depend on the particular decision threshold that is used to classify the instances (with AUC as the only exception), we report these metrics at a single and fixed value that is called the optimal threshold (OT). The OT is uniquely defined by the point in the ROC curve that has the minimal distance to the upper-left corner of the ROC graph, which would correspond to a perfect classifier (i.e., fp = 0.0 and tp = 1.0).
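The AUC and the choice of the optimal threshold can be sketched in Python as follows. The sketch assumes that the ROC curve is given as (fp, tp) pairs sorted from (0, 0) to (1, 1), that the area is obtained by trapezoidal integration, and that the first point is taken when several points are equally close to the corner. The function names and the toy curve are hypothetical.

```python
import math

def auc_trapezoid(points):
    """Area under a ROC curve given (fp_rate, tp_rate) pairs sorted by fp_rate."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def optimal_point(points):
    """ROC point with minimal Euclidean distance to the corner fp = 0, tp = 1."""
    return min(points, key=lambda point: math.hypot(point[0], 1.0 - point[1]))

# Toy ROC curve for three positive and three negative instances (made-up data).
third = 1.0 / 3.0
curve = [(0.0, 0.0), (0.0, third), (third, third), (third, 2 * third),
         (third, 1.0), (2 * third, 1.0), (1.0, 1.0)]
auc = auc_trapezoid(curve)
best = optimal_point(curve)
print(round(auc, 4), best)
```

For this toy curve the area is 7/9, and the point closest to the upper-left corner is fp = 1/3, tp = 1. The decision threshold that produces that point would be used as the OT when reporting the threshold-dependent measures.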
| Footnotes |
|---|
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.062735907.
| References |
|---|
Baker, D. and Sali, A. 2001. Protein structure prediction and structural genomics. Science 294: 93–96.
Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., et al. 2002. The Protein Data Bank. Acta Crystallogr. D Biol. Crystallogr. 58: 899–907.
Bernstein, H.J. 2000. Recent changes to RasMol, recombining the variants. Trends Biochem. Sci. 25: 453–455.
Bonneau, R., Strauss, C.E., Rohl, C.A., Chivian, D., Bradley, P., Malmstrom, L., Robertson, T., and Baker, D. 2002. De novo prediction of three-dimensional structures for major protein families. J. Mol. Biol. 322: 65–78.
Bradley, P., Chivian, D., Meiler, J., Misura, K.M., Rohl, C.A., Schief, W.R., Wedemeyer, W.J., Schueler-Furman, O., Murphy, P., Schonbrun, J., et al. 2003. Rosetta predictions in CASP5: Successes, failures, and prospects for complete automation. Proteins 53: 457–468.
Brooks, B., Bruccoleri, R., Olafson, B., States, D., Swaminathan, S., and Karplus, M. 1983. CHARMM: A program for macromolecular energy, minimizations and dynamic calculations. J. Comput. Chem. 4: 187–217.
Burley, S.K., Almo, S.C., Bonanno, J.B., Capel, M., Chance, M.R., Gaasterland, T., Lin, D., Sali, A., Studier, F.W., and Swaminathan, S. 1999. Structural genomics: Beyond the Human Genome Project. Nat. Genet. 23: 151–157.
DeBolt, S.E. and Skolnick, J. 1996. Evaluation of atomic level mean force potentials via inverse folding and inverse refinement of protein structures: Atomic burial position and pairwise nonbonded interactions. Protein Eng. 9: 637–655.
Dill, K.A. 1997. Additivity principles in biochemistry. J. Biol. Chem. 272: 701–704.
Fawcett, T. 2004. ROC graphs: Notes and practical considerations for researchers. Kluwer Academic Publishers, The Netherlands.
Feliciangeli, S.F., Thomas, L., Scott, G.K., Subbian, E., Hung, C., Molloy, S.S., Jean, F., Shinde, U., and Thomas, G. 2006. Identification of a pH sensor in the Furin propeptide that regulates enzyme activation. J. Biol. Chem. 281: 16108–16116.
Fiser, A., Do, R.K., and Sali, A. 2000. Modeling of loops in protein structures. Protein Sci. 9: 1753–1773.
Gonzalez, E.M., Reed, C., Bix, G., Fu, J., Zhang, Y., Gopalakrishnan, B., Greenspan, D.S., and Iozzo, R.V. 2004. BMP-1/Tolloid-like metalloproteases process endorepellin, the angiostatic C-terminal fragment of perlecan. J. Biol. Chem. 280: 7080–7087.
Hagler, A.T., Huler, E., and Lifson, S. 1974. Energy functions for peptides and proteins. I. Derivation of a consistent force field including the hydrogen bond from amide crystals. J. Am. Chem. Soc. 96: 5319–5327.
Jones, D.T. 1999. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287: 797–815.
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. A new approach to protein fold recognition. Nature 358: 86–89.
Kopp, J. and Schwede, T. 2004. The SWISS-MODEL repository of annotated three-dimensional protein structure homology models. Nucleic Acids Res. 32: D230–D234.
Kryshtafovych, A., Venclovas, C., Fidelis, K., and Moult, J. 2005. Progress over the first decade of CASP experiments. Proteins S7: 225–236.
Lazaridis, T. and Karplus, M. 1998. Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J. Mol. Biol. 288: 477–487.
Lu, H. and Skolnick, J. 2001. A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins 44: 223–232.
MacKerell Jr, A.D., Bashford, D., Bellott, M., Dunbrack Jr, R.L., Evanseck, J.D., Field, M.J., Fischer, S., Gao, J., Guo, H., Ha, S., et al. 1998. All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 102: 3586–3616.
Marti-Renom, M.A., Stuart, A., Fiser, A., Sanchez, R., Melo, F., and Sali, A. 2000. Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29: 291–325.
Melo, F. and Feytmans, E. 1997. Novel knowledge-based mean force potential at atomic level. J. Mol. Biol. 267: 207–222.
Melo, F. and Feytmans, E. 1998. Assessing protein structures with a non-local atomic interaction energy. J. Mol. Biol. 277: 1141–1152.
Melo, F., Sanchez, R., and Sali, A. 2002. Statistical potentials for fold assessment. Protein Sci. 11: 430–448.
Miyazawa, S. and Jernigan, R.L. 1985. Estimation of effective interresidue contact energies from protein crystal structures: Quasi-chemical approximation. Macromolecules 18: 534–552.
Park, B.H. and Levitt, M. 1996. Energy functions that discriminate X-ray and near-native folds from well-constructed decoys. J. Mol. Biol. 258: 367–392.
Park, B.H., Huang, E.S., and Levitt, M. 1997. Factors affecting the ability of energy functions to discriminate correct from incorrect folds. J. Mol. Biol. 266: 831–846.
Pieper, U., Eswar, N., Braberg, H., Madhusudhan, M.S., Davis, F., Rossi, A., Marti-Renom, M.A., Karchin, R., Webb, B., Melo, F., et al. 2006. MODBASE, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res. 33: 291–295.
Pizzo, E., Buonanno, P., Di Maro, A., Ponticelli, S., De Falco, S., Quarto, N., Cubellis, M.V., and D'Allesio, G. 2006. Ribonucleases and angiogenins from fish. J. Biol. Chem. 281: 27454–27460.
Rohl, C.A., Strauss, C.E., Misura, K.M., and Baker, D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383: 66–93.
Rychlewski, L., Zhang, B., and Godzik, A. 1998. Function and fold predictions for Mycoplasma genitalium proteins. Fold. Des. 3: 229–238.
Rychlewski, L., Zhang, B., and Godzik, A. 1999. Insights from structural predictions: Analysis of Escherichia coli genome. Protein Sci. 8: 614–624.
Sali, A. and Blundell, T.L. 1993. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234: 779–815.
Sali, A., Fiser, A., Sanchez, R., Marti-Renom, M.A., Jerkovic, B., Badretdinov, A., Melo, F., Overington, J., and Feyfant, E. 2001. MODELLER, a protein structure modeling program, Release 6v0. http://salilab.org/modeller/.
Samudrala, R. and Moult, J. 1998. An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. 275: 895–916.
Sanchez, R. and Sali, A. 1998. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. 95: 13597–13602.
Sanchez, R., Pieper, U., Melo, F., Eswar, N., Marti-Renom, M.A., Madhusudhan, M.S., Mirkovic, N., and Sali, A. 2000. Protein structure modeling for structural genomics. Nat. Struct. Biol. 7 (Suppl): 986–990.
Sayle, R. and Milner-White, E.J. 1995. RasMol: Biomolecular graphics for all. Trends Biochem. Sci. 20: 374.
Schwede, T., Kopp, J., Guex, N., and Peitsch, M.C. 2003. SWISS-MODEL: An automated protein homology-modeling server. Nucleic Acids Res. 31: 3381–3385.
Sippl, M.J. 1990. Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 213: 859–883.
Sippl, M.J. 1993a. Boltzmann's principle, knowledge-based mean fields and protein folding. An approach to the computational determination of protein structures. J. Comput. Aided Mol. Des. 7: 473–501.
Sippl, M.J. 1993b. Recognition of errors in three-dimensional structures of proteins. Proteins 17: 355–362.
Sippl, M.J. and Weitckus, S. 1992. Detection of native like models for amino acid sequences of unknown three dimensional structure in a data base of known protein conformations. Proteins 13: 258–271.
Solis, A.D. and Rackovsky, S. 2006. Improvement of statistical potentials and threading score functions using information maximization. Proteins 62: 892–908.
Swets, J.A. 1988. Measuring the accuracy of diagnostic systems. Science 240: 1285–1293.
Swets, J.A., Dawes, R.M., and Monahan, J. 2000. Better decisions through science. Sci. Am. 283: 82–87.
Tsai, J., Bonneau, R., Morozov, A.V., Kuhlman, B., Rohl, C.A., and Baker, D. 2003. An improved protein decoy set for testing energy functions for protein structure prediction. Proteins 52: 76–87.
Wang, K., Fan, B., Levitt, M., and Samudrala, R. 2004. Improved protein structure selection using decoy-dependent discriminatory functions. BMC Struct. Biol. 4: 8.
Weiner, P.K. and Kollman, P.A. 1981. AMBER: Assisted model building with energy refinement. A general program for modeling molecules and their interactions. J. Comput. Chem. 2: 287–303.
Xia, Y., Huang, E.S., Levitt, M., and Samudrala, R. 2000. Ab initio construction of protein tertiary structures using a hierarchical approach. J. Mol. Biol. 300: 171–185.
Zhang, L., Godzik, A., Skolnick, J., and Fetrow, J.S. 1998. Functional analysis of the Escherichia coli genome for members of the α/β hydrolase family. Fold. Des. 3: 535–548.
Zhang, B., Rychlewski, L., Pawlowski, K., Fetrow, J.S., Skolnick, J., and Godzik, A. 1999. From fold predictions to function predictions: Automation of functional site conservation analysis for functional genome predictions. Protein Sci. 8: 1104–1115.
Zhang, Y., Kolinski, A., and Skolnick, J. 2003. TOUCHSTONE II: A new approach to ab initio protein structure prediction. Biophys. J. 85: 1145–1164.
Zhou, H. and Zhou, Y. 2002. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 11: 2714–2726.