|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Max-Planck-Institute for Informatics, 66123 Saarbrücken, Germany
(RECEIVED November 29, 2006; FINAL REVISION March 1, 2007; ACCEPTED April 3, 2007)
| Abstract |
|---|
|
|
|---|
1 and
1 + 2 dihedral angles amounts to 84.7% and 71.6%, respectively, using a 40° cutoff. When we compared IRECS with SCWRL and SCAP, the performance of IRECS was comparable to that of both methods. IRECS and the ROTA potential are available for download from the URL http://irecs.bioinf.mpi-inf.mpg.de. Keywords: new methods; protein structure prediction, IRECS, side-chain prediction; structural ensembles
| Introduction |
|---|
|
|
|---|
1 and
1 + 2 are just below 90% and 80%, respectively, considering all side chains and a rotational deviation of ±40° (Mendes et al. 1999; Liang and Grishin 2002; Peterson et al. 2004). In fields such as structure-based drug design or proteinprotein docking, it is quite important to take protein flexibility into account. Hence a representation of the protein consisting of only a single conformation is insufficient (Leach and Lemon 1998; Claussen et al. 2001; Teodoro and Kavraki 2003; Gray 2006). In computational approaches protein flexibility is often represented via a set of protein conformations, also known as a conformational ensemble. This representation is motivated by the observation that a protein thermodynamically adopts no unique but one of many conformations. There are many low-energy conformations of the protein with the global minimum-energy conformation (GMEC) as the most probable conformation. In the conformational ensemble, each of the side chains has a limited set of allowed rotamers that are either attached to a fixed backbone conformation or may be combined with different backbone conformations. These rotamer sets usually comprise the side-chain conformations observed in different crystal structures, multiple molecular simulation snapshots, or from differently parameterized side-chain predictions. In the latter case, the obtained rotamer sets usually represent subsets of the full rotamer sets that could be assigned to the respective side chains from an underlying rotamer library.
One of the most prominent approaches for generating rotamer subsets for protein ensembles is the Dead-End-Elimination (DEE) algorithm (Desmet et al. 1992). This algorithm removes rotamers that cannot be part of the GMEC. There are some performance extensions that apply tighter removal criteria than Desmet et al. (1992) did, like the Goldstein DEE (Goldstein 1994) or Split DEE (Gordon et al. 2003). There are also several algorithms that speed up the application of the DEE criteria by preprocessing the graph representing all pairwise side-chain interactions, like the articulation-point reduction technique of SCWRL (Canutescu et al. 2003). The R3 algorithm of Xie and Sahinidis (2006) combines several of these techniques. As Gordon and coworkers found, the performance of the DEE algorithm in eliminating rotamers strongly depends on the rotamer library and scoring function used (Gordon et al. 2003). DEE works best if there are few rotamers per side chain and if the scoring function assigns a higher weight to single-residue contributions than to pairwise residue-interaction energies.
All potential functions developed so far introduce some (mostly not quantified) errors in the absolute and relative values of the computed energies. Even if these error ranges were known, it would not be possible to include them in the DEE criteria without drastically decreasing the DEE performance. Moreover, the existence of these errors raises the question if the theoretical GMEC is also the most probable conformation in reality.
In the present study, we propose the IRECS algorithm as a heuristic alternative to the DEE algorithm for rotamer ensemble selection. The algorithmic nature of IRECS easily allows for considerations of error ranges as well as handling the flexibility of side chains by presetting the average size of the selected rotamer sets right from the start. First, we formulate the probabilistic rotamer-subset problem. Second, we describe the IRECS algorithm for solving this problem heuristically. Third, we introduce a new scoring function that we calibrate to be applicable to rotamer scoring. Fourth, the algorithm is evaluated on a large protein test set, and its performance is compared to the other side-chain prediction programs, SCWRL and SCAP. Finally, we discuss the pros and cons of using IRECS instead of DEE or single side-chain conformation prediction algorithms and point to possible improvements and extensions to our method.
Probabilistic rotamer-subset problem
First, we formally define the problem of finding the most probable rotamer subset and introduce the notation used below in the manuscript. L = {L 1, ..., Ln } is the set of the initial sets of rotamers that are assigned to each of the n side chains of a target protein. For simplicity, we assume that these sets are maximal, that is, any set Li contains all rotamers from the underlying rotamer library for the amino acid at position i of the target protein. X = {X 1, ..., Xn } denotes any possible collection of subsets that can be created by eliminating rotamers from the sets L 1, ..., Ln . The set L is the rotamer universe of the target protein, and the set X is the considered reduced rotamer universe of the target protein. X represents a selection on L and is a solution candidate of the IRECS algorithm. In the special case that all X 1, ..., Xn contain exactly one rotamer, X corresponds to a solution candidate for the traditional side-chain prediction problem for a single protein conformation.
G(L) is the discrete conformational space available to the target protein created from a fixed backbone conformation and exactly one rotamer per side chain i out of Li . By restriction of all sets Li to the smaller subsets Xi , the conformational space G(L) is restricted to the conformational space G(X). We assign a probability to the restricted rotamer universe X and denote it PG (X). This probability can be interpreted as the probability of observing a conformation of a protein from G(L) that contains only rotamers from X. By this definition, PG (L) is one. The probability PG (X) is the sum of the probabilities of all conformations C of a protein that can be constructed using the fixed-backbone conformation and rotamers from X:
|
|
The probability PG (X) thus involves two sets of terms, (1) the individual probabilities of the conformations and (2) the total number of conformations in G(X). Desired solutions X for a usual protein-modeling purpose would have a maximal conformational probability and a minimal number of conformations. Since both cannot be optimized simultaneously, we restrict the number of conformations of G(X) to lie within a user-defined range and then maximize PG (X). The number of conformations of G(X) is
|
|
and thus depends on the number of side chains of the target protein and the sizes of the rotamer subsets Xi . We use the rotamer density
|
|
as an estimate for the conformational variety. The division by n allows for comparing the conformational complexity of target proteins of different sizes. This measure allows us to fix the overall number of rotamers but allows for choosing the individual number of rotamers in each rotamer subset Xi .
The probabilistic rotamer-subset problem can now be defined as follows: Given the full rotamer universe L for n side chains and a fixed-backbone conformation of a target protein, find a selection X that meets the following criteria:
|
|
with Y being any possible selection from L. Here b low and b high are user-defined constants. The subset probabilities PG (X) are given by Boltzmann's law, restricted to conformations C of G(L). For a single conformation, P(C) is given by
|
|
where E(C) is the energy of conformation C, k is the Boltzmann constant, and T is the temperature. Insertion of Equation 4 into the definition of the conformation probability gives
|
|
This equation relies on the inaccurate but useful assumption that the full conformational space of the protein can be approximated by G(L). The computation of the subset probability PG (X) requires enumerating all conformations in G(X) and exactly determining their energies. Since neither is possible, we must rely on heuristic scoring schemes and algorithms for computing PG (X). If we fix the rotamer density
rot to one, the probability PG (X) of the subset X that contains all rotamers of the GMEC is greater than that of any other possible subset selection with the rotamer density of one. As noted above, the probabilistic rotamer-subset problem with a rotamer density of one is equivalent to the side-chain prediction problem. Thus, in this case, the probabilistic rotamer-subset problem is also NP-hard. On the other side, the problem becomes trivial if the rotamer density has no upper limit, since then the full rotamer universe L is a legal solution candidate.
| Materials and Methods |
|---|
|
|
|---|
1,
1 + 2 dihedral angle matching. Before distance and coordinate comparisons are performed, the coordinates of symmetric atoms in the residues arginine, aspartic acid, glutamic acid, phenylalanine, and tyrosine are selected to yield to the optimal atom matching. Hydrogen atoms are not considered. Two side-chain dihedral angles are considered as matching if they differ by <40°, as it is common for the evaluation of many side-chain prediction methods using such coarse-grained rotamer libraries as we do. A predicted side-chain conformation is considered as being correct if all of its dihedral angles match those of the conformation of the corresponding side chain inside a reference high-quality protein structure (usually determined by X-ray crystallography). A selection of multiple rotamers for a single side chain is considered correct if it contains at least one correct rotamer.
For the comparison of pairs of protein conformations by their RMSD, we use only coordinates of heavy side-chain atoms, excluding coordinates of C
atoms. The discrimination between core and surface residues is performed using relative solvent accessibility (RSA) as a burial indicator. RSA values are calculated with the Naccess tool (version 2.1.1, http://wolf.bi.umist.ac.uk/naccess/; Hubbard et al. 1991). A residue is defined as belonging to the protein core if its relative accessibility is <20%. This marks nearly half of all residues of the proteins in the test set as belonging to the core.
The IRECS algorithm: Building rotamer subsets
We developed a new heuristic algorithm IRECS (Iterative REduction of Conformational Space) for the selection of rotamer subsets. The rotamer library used, the scoring function, and the final average number of rotamers assigned to the side chains can be configured freely. The algorithm combines the effective energy approach of the SCMF algorithm (Koehl and Delarue 1996) with a heuristic greedy iteration scheme. The general workflow of the algorithm is sketched in Figure 1.
|
After this filtering step, pairwise interaction scores between all rotamers of different side chains are computed. This is one of the computationally most elaborate steps because of the still large number of active rotamers at this stage. This computation is accelerated by the definition of a local residue neighborhood: For a given residue, its local neighborhood comprises all other residues of the protein that can possibly interact with it. The neighborhood can be computed by checking for each candidate residue if the sum of the radii of the enclosing spheres of the target residue and the candidate residue is smaller than the distance between the sphere centers plus the distance cutoff of the applied scoring function (ROTA: 10 Å). Next, a probability is assigned to each rotamer. These probabilities assess how likely it is that the side chain actually is in the respective rotameric state. Initially, for each side chain, rotamer probabilities are distributed evenly. The effective energy E eff of a rotamer xi is defined as:
|
|
The calculation of the terms U self and U inter is described below. p(yi ) is the probability of the rotamer yi , S denotes the set of all side chains, b is the protein backbone, and Cj is the set of rotamers of some side chain j. This definition is patterned after the application of the self-consistent mean-field (SCMF) theory to the side-chain prediction problem (Koehl and Delarue 1994).
After this initialization procedure, IRECS enters an iterative optimization scheme that tries to minimize the effective energy of the whole system. For this purpose, the range of effective rotamer energies is calculated for each side chain:
|
|
where xi and yi are the side-chain rotamers with the largest and smallest effective energy, respectively. The side chain with the largest
E eff(s) is identified, and for this side chain the probability of the rotamer with the highest energy is set to zero and the rotamer is labeled inactive. This leads effectively to the elimination of energetically unfavorable rotamers. In addition, side chains are preferably chosen in which the energy gap between different conformations is large, that is, which have only a low number of energetically favorable conformations and thus limited flexibility. Subsequently, all probabilities of the remaining active rotamers for this side chain are set to 1/m, where m is the number of active rotamers of this side chain. We also performed test runs in which we calculated the probabilities by Boltzmann's law using the effective rotamer energies of all remaining rotamers of a side chain. This, however, did not lead to an improvement of the overall performance of IRECS.
The algorithm described above has several advantages. First, the algorithm is certain to terminate, since the probability of each rotamer increases monotonically until termination or until it is set to zero for the rest of the optimization. Second, the number of iterations required for system calibration is independent from the scoring function used. Since exactly one rotamer is removed per iteration, the number of iterations is determined by the number of side chains, the starting rotamer density, and the user-defined rotamer density at the end of the optimization. Third, the size of the conformational space of all side chains also decreases monotonically by removing available rotamers. Fourth, the one-by-one removal of rotamers defines a unique elimination order of rotamers. This order is used to define meaningful rotamer subsets of arbitrary size. The algorithm can be stopped at any rotamer density,
rot(X) (see Equation 2), that is, any average number of rotamers per side chain. For a given rotamer density, the actual number of rotamers per side chain in the final structure can vary and is correlated to the flexibility of a specific residue. Specifically, such a set comprises all rotamers that are eliminated in the latest IRECS steps. It has to be stressed that the process of rotamer removal is heuristic. Rather than performing a complete re-computation of all rotamer probabilities as in the SCMF algorithm, only the probabilities of the rotamers of the considered side chain are readjusted.
Finally, the protein model is written to a text file in PDB format. If a side chain has multiple rotamers, each of these rotamers is assigned an uppercase letter, and the atoms of these rotamers are labeled with this letter in their alternative location field. Also, the effective energy of the rotamers is written to the temperature field of each atom of a rotamer. Using the Boltzmann law and all rotamers of a side chain, rotamer probabilities are calculated that fit to the final effective energy. These probabilities are written to the occupancy field of each atom of a rotamer. Since only one or a few rotamers of a side chain are left after an IRECS optimization and appear in the protein model file, their probabilities do not add to one. This helps the user to estimate for each side chain the loss in conformational representation that occurred through the application of IRECS on the full rotamer sets from the rotamer library.
The performance of IRECS is crucially dependent on the scoring function used; thus we derived a special scoring function (IRECSscore) based on a new knowledge-based scoring term ROTA, which is used to calculate the second and third terms of Equation 6.
IRECSscore scoring function
The overall functional form of the IRECSscore is given in Equation 6. In this section we define the terms U self and U inter of the effective rotamer energy. U self calculates self-scores of the rotamers and interaction scores of rotamers with the local backbone, and U inter calculates nonbonded interactions of a rotamer with other rotamers and interactions with the whole protein backbone. Both terms are weighted with respect to each other using the training set of single protein structures (set A) mentioned below.
The self-score, U self, is calculated using the BBDEP (backbone-dependent rotamer library) frequency scores. It is calculated analogously to the definition of the scoring function used in SCWRL (Canutescu et al. 2003). The conditional rotamer probabilities are taken from the BBDEP using the local backbone torsion angles
i and
i .
|
|
The interaction term, U inter, for a rotamer xi with another rotamer or the protein backbone is computed by summing up all ROTA scores of all pairwise distances between all atoms a of this rotamer and all atoms b of the other entity z, which is either another rotamer or the protein backbone as defined in Equation 6.
|
|
The weights w 1 and w2 were fitted using the training part of protein set A defined below. Their values are 1.0 and 0.14, respectively.
The ROTA function
We follow the formalism for conditional probabilities of pairwise atomatom preferences in proteins using statistical observations on native structures introduced by Samudrala and Moult (1998a,b) to derived the RAPDF (residue-atom specific probability density function) scoring function. This function proved to be useful in model quality assessment programs (MQAP) (Tosatto 2005) as well as for side-chain prediction (Samudrala and Moult 1998a,b). Our knowledge-based scoring function ROTA is derived from a "correct" set of crystal structures and a reference set of "incorrect" (also: decoy) protein models with the same backbone conformation as the structures from the correct set but enumerating multiple and possibly wrong rotameric states of the side chains.
The aim is to analyze a set of correct {C} and incorrect {I} structures on the basis of certain structural features f, which in our case are atomatom distances. The scoring function sums over all negative log-conditional probabilities for observing a given set of features in the set of correct structures divided by the probability of observing the same feature in the overall set of combined correct and incorrect structures:
|
|
This sum is proportional to ln P(C | {f}), where P(C | {f}) is the probability of a structure being correct if features {f} are observed. The derivation of the scoring function requires calculating the probability of observed features {f} both in a set of correct structures and a reference set comprising correct and incorrect structures. Since no a priori information about the correctness of a structure must be included in P({f}), an appropriate composition of a reference set from correct and incorrect structures is hard to find. In order to solve this problem, we assume that the set of incorrect structures contains examples for all relevant error cases, in our case structural motives that originate from side-chain prediction errors. Usually it is not known what exactly characterizes all incorrect protein structures. Since side-chain prediction is a quite straightforward process, though, we show that we can construct artificial sets of incorrect structures that contain the most relevant errors occurring in side-chain modeling. Given such a representative set of incorrect structures, we can derive a second scoring function:
|
|
Analogous to S,
is proportional to ln P(I | {f}), where P(I | {f}) is the probability of a structure being incorrect if features {f} are observed. In combination, both yield a difference function
S
|
|
The required conditional probabilities P(f | C) and P(f | I) are obtained by counting the individual frequency of each feature f in the correct and incorrect set. The features used here and also for the RAPDF are different distance intervals of atom pairs that all have one of the unique 167 PDB atom types (residue name plus atom identifier). The probability of observing two atoms with type a and b within distance range d is calculated by
|
|
where dab is a specific distance bin for atom types a and b, and N(dab ) is the number of atoms observed in the correct set to be in the distance range d and that have atom types a and b. The denominator contains the sum of all counts of all distance ranges for these two atom types. Substitution of P(dab | C) and P(dab | I) (computed analog) into Equation 12 yields the ROTA function. We used distance bins of width 0.25 Å, within the range 010 Å. In cases in which there are no instances of a certain atom pair within a distance range in the correct set, the score for this distance bin is set to the score of the distance bin that has the smallest correct/incorrect counts ratio. This correction was performed especially for small-distance ranges, which account for steric clashes; the score for this distance range was rounded up to 4.0. Given a query atom distance, we interpolate the scores of the two neighboring distance bins linearly.
Side-chain decoy sets for the derivation of the ROTA function
Since ROTA must discriminate between correct and incorrect side-chain conformations, we created the decoy sets AllLib, AllRot, TripleLib, and TripleRot with different distortion techniques for side-chain conformations. For the set of correct protein structures, we chose crystal structures from the high-quality top-500 set (Lovell et al. 2003), downloaded from the server at http://kinemage.biochem.duke.edu/ (Richardson Laboratory). We randomly split the set into 490 structures for training and 10 structures for testing. The PDB codes of the test set comprise 1edg
[PDB]
, 1jet, 1lmb4, 1pen, 1qb7, 1qq4, 1vfy, 1wab, 3ebx, and 3ezm. All of the 500 structures were used as base structures for generating decoy sets. Only the decoys of the training set were used to derive ROTA. The decoy sets were generated using four different decoy procedures. Each of them was designed to capture some aspect of fallacious side-chain prediction. For the AllLib and AllRot sets, all side chains were randomly rotated simultaneously. This procedure was performed 10 times for each protein structure. The decoys of the sets TripleLib and TripleRot were created by rotating the side chains of only three neighboring residues simultaneously. We slid a window comprising three residues along the peptide chain, creating one decoy per shift. This procedure is supposed to create decoy distances between atom pairs of native and disturbed side chains and for atom pairs of close pairs of disturbed side chains. The sets AllLib and TripleLib draw their new side-chain dihedral angles from the backbone-dependent rotamer library, with a selection probability among all available rotamers as defined in the library. This causes the decoys to contain mostly the rotamers favored by the BBDEP, but ignoring any long-range forces caused by atomic interactions. When distance distributions of atom pairs are drawn from the correct set and this kind of decoy set, the ROTA scores receive two important properties: First, ROTA favors rotamers that are less probable in the BBDEP; and second, ROTA favors those rotamers that establish meaningful atomic interactions with their environment. Rotamers that have a high frequency in the library but are not represented proportionally in the correct structure set are penalized. This should be the case for most of the false rotamer assignments from the BBDEP. This setup was chosen to adjust the ROTA scores to the BBDEP frequency scores.
The side-chain dihedral angles of the decoys in the AllRot and TripleRot sets are distributed uniformly over all rotational angles from 0° to 360°. Steric clashes were not removed from the generated conformations, since clashes should also be penalized by ROTA without help of another scoring function. As a result, most of the decoys are clearly identifiable as decoys by quite simple methods. Only in the case of small perturbations, which occurred often in the TripleLib and TripleRot sets, are decoys hard to distinguish from the starting crystal structures. See Table 1 for details on the rotamer sets.
|
Structure sets for testing the performance of IRECS
The PISCES Web server at http://dunbrack.fccc.edu/pisces (Wang and Dunbrack 2003, 2005) was used to compile a list of high-quality protein structures. The list holds 641 structures with a resolution better than or equal to 1.5 Å, an R and R free under 0.3, and a pairwise sequence identity lower than 25%. The list was divided into two main sets: The first set A contains 194 crystal structures with single conformations for all side chains. Set A was split again into a test subset of 160 structures and a training subset of 34 structures. The second set B contains 447 crystal structures that have at least one side chain with multiple conformations.
This selection does not take the structural training set of ROTA and the BBDep into account. The Top-500 set has 31 overlaps with the test subset of set A and 59 overlaps with the set B. The training set of the BBDep has 83 overlaps with the test subset of set A and 146 overlaps with the set B. These overlaps imply that the evaluation of IRECS and SCWRL will be biased in so far as a single side chain in the test set can benefit from its membership in the training sets. As it would be easy to reduce these overlaps by exchanging structures with those of close homologous proteins, such a replacement would more conceal the bias than reduce it. Since all protein sets used here aim at representing the conformational space of the proteins currently deposited in the PDB, it is not possible to eliminate this bias without reducing the represented conformational space of the training sets. As SCWRL and IRECS both make use of the BBDep, the influence of this bias remains controllable in a comparison of these programs.
Construction of conformations
We calculated atom coordinates of the rotamers by using the standard values for bond angles and lengths of the CHARMM force field (Brooks et al. 1983; MacKerell et al. 1998; MacKerell et al. 2004). Only the coordinates of heavy atoms were calculated. Dihedral angles were drawn from the BBDEP (Dunbrack and Cohen 1997) and therefore depend on the local conformation of the backbone. The BBDEP (version of May 2002) was downloaded from http://dunbrack.fccc.edu/bbdep/. We experienced limited accuracy improvements of IRECS when using native bond lengths and angles, but they were not used in the evaluation shown here.
Comparison setup
The RAPDF was downloaded from the Decoys R Us Web site (http://dd.compbio.washington.edu/; Samudrala and Levitt 2000). All MQAP scores were calculated using an in-house Python-script implementation. SCWRL and SCAP were started using default parameters. We used SCWRL version 3.0. All side-chain prediction tools were tested on the same machine (AMD Opteron V20z).
| Results |
|---|
|
|
|---|
|
|
Prediction of final structures with single side-chain conformations
By predicting protein models with a rotamer density of one, we can compare the performance of IRECS directly with that of other side-chain prediction tools. The tools SCWRL and SCAP were selected for comparison because of their popularity, availability, and high accuracy. The optimization algorithms used by these tools are quite different. All tools were used to predict models of all 160 proteins of the test part of set A. None of the tools included the knowledge of the true side-chain conformations from the native structure, except that some of the proteins were part of the training sets of the BBDep and ROTA. The performance of the tools was measured by the average similarity of the predicted model to the reference structure, once for all side chains and once for all side chains belonging to the core. In Table 4 the average values for all similarity measures are shown. In general, IRECS performs at best with respect to most of the applied similarity measures, except for the
1 + 2 accuracy of side chains in the core. However, the differences observed here are all quite small and may be dependent on the selection of the test set. It should also be noted that dependent of the protein target each of the three tools is sometimes able to outperform the other tools by far. Figure 2 shows the average prediction accuracy of IRECS for each amino acid. The achieved prediction accuracy is comparable to that of other methods and does not show much difference (>10% increase or decrease) to the average prediction accuracy reported for other methods (Liang and Grishin 2002; Peterson et al. 2004).
|
|
1 accuracy increased to 85.4% (84.7%), and the
1 + 2 accuracy to 72.9% (71.6%). If only the side chains in the core of the proteins are considered, the overall RMSD was lowered to 0.999 Å (1.046 Å for the IRECS program), the average RMSD to 0.483 Å (0.502 Å), the
1 accuracy was slightly lower at 91.9% (92.0%), and the
1 + 2 accuracy increased to 82.8% (82.2%). These results show that the IRECS algorithm quite accurately approximates the combinatorial problem of side-chain prediction and implies that the scoring function and modeling of the molecular environment are the accuracy-limiting components.
Relation between the selected rotamer subsets and side-chain flexibility
IRECS selects those residues for rotamer reduction that have the largest range of effective rotamer energies. This implies that, at the start of the optimization, a selected side chain has a pair of rotamers in which one rotamer dominates the other by some energy difference
E eff. As the optimization continues,
E eff decreases monotonically toward zero, for all side chains, and
E eff finally becomes zero by definition for all side chains that only have one rotamer left. This also implies that in later stages the rotamers left in the ensembles have nearly the same interaction energies with their environment. This is the condition for a side chain to be flexible, since no rotameric state can dominate the others; conformational changes of flexible side chains have low energy barriers between the available rotamers. Therefore, we can interpret the final ensemble of rotamers of a side chain as its discrete conformational space and use the differences in effective rotamer energies as a measure for the degree of side-chain flexibility. We can also compare the flexibility of different side chains by defining a threshold value on the range of effective rotamer energies
E eff and simply counting the still reachable rotameric states. Since IRECS aims at minimizing the energy difference
E eff inside the rotamer ensembles of each side chain, the ensembles created by IRECS maximize the overall conformational flexibility of the side chains for a given rotamer density. IRECS assigns different numbers of rotamers to each side chain and thus can weight their conformational flexibility. In Figure 3, the average number of rotamers per remaining side chain is shown, resulting from 160 protein models built with IRECS on the native backbone and a rotamer density of two.
|
1 dihedral angle that is not represented by the rotamer library at all. IRECS selects the closest possible rotamers here, but the
1 differ all
60°. Arg 75 has three rotamers assigned, with two of them being far from the native conformation and only one rotamer with a similar
1. This rotamer lies quite close to the native side-chain conformation, which gives the rotamer a similar ROTA interaction score, but its other dihedral angles are far away from those of the crystal structure. Lys 78 has two alternative conformations in the crystal structure, with an occupancy of 0.5 each. IRECS assigns eight rotamers to this residue, matching one of the alternative conformations quite well and missing the second alternative conformation completely. All buried residues are predicted optimally, considering the granularity of the backbone-dependent rotamer library.
|
rot). The ability of the IRECS algorithm to generate rotamer ensembles of sizes that reflect the flexibility of the respective side chains enables us to predict side-chain conformations with higher accuracy at the cost of obtaining ambiguous conformations. The rotamer ensembles were predicted for all 160 structures of the test part of the protein set A, using the native backbone. First, we tested the efficiency of IRECS to predict rotamer ensembles that contain at least one rotamer with a similar
1 angle to that of the respective reference side chain. In Figure 5 the average percentage of rotamer ensembles fulfilling this criterion is shown for different stages of the rotamer removal process, measured by the current overall density of rotamers,
rot. Each of the curves describe this average percentage, once using the IRECS scoring function (ROTA + BBDEP), using only components of the scoring function or using randomized rotamer interaction scores (the shaded area under the curves was just added for better visibility).
|
1 angle in the ensemble more than doubles in the last steps of the algorithm, going from 6% for a rotamer density of two up to 15% for a rotamer density of one. This high chance of removing the correct rotamer in the last steps of the algorithm has three reasons. First, at the start, the rotamer ensembles contain many rotamers that can be clearly identified as useless because of their high effective rotamer energy. In later iteration steps, all rotamers of an ensemble have almost the same effective energy; thus, the correct rotamer is hard to identify. Second, the error range of the scoring function leads to incorrect removals of rotamers. Although we have not measured the error range of ROTA and BBDEP, the differences of the effective rotamer energy can decrease below the error range of the IRECS scoring function. The removal of the last rotamers from the ensembles thus may become nearly random. Last, this behavior can be explained with the flexibility of some side chains that require a whole ensemble of rotamers for an adequate representation of their conformation. Since the performance of IRECS is evaluated with only single conformations for all side chains here, the chance of selecting the same conformation as observed in the crystal structure is quite low for these side chains. This also shows that reducing the rotamer density beyond the amount of flexibility that some side chains require helps little in raising the accuracy of the generated protein conformation but worsens it by neglecting any conformational side-chain flexibility.
In a second analysis, these characteristics were examined by comparing the rotamer ensembles generated using IRECS with rotamer ensembles sparsely found in the 447 crystal structures of set B. Such sets are formed whenever multiple side-chain conformations can be assigned to the observed electron density during the crystallographic refinement, leading to alternative coordinates for a given side chain in the PDB file. We chose only those side chains whose conformations differ by >40° in one of their
1 and
2 dihedral angles for this evaluation. We did not use NMR ensembles for this analysis since these ensembles usually also contain multiple conformations of the backbone. Since IRECS can only handle a single conformation of the backbone, it is not meaningful to compare the output of IRECS with such NMR ensembles.
The protein models computed with IRECS for this evaluation all have a rotamer density of two. The number of rotamer ensembles that were compared between crystal structures and IRECS models and their respective sizes are listed in Table 5. The results of the comparison of rotamer ensembles computed with IRECS and those rotamer ensembles found in crystal structures are listed in Table 6. In this table the accuracy of IRECS is measured for all ensemble pairs with certain sizes by dividing the number of rotamers in the crystal ensemble having one or more matching rotamers in the ensemble of the IRECS model, respectively, by the number of all ensemble pairs of such sizes. Two rotamers are considered as matching here if their
1 and
2 angles differ by <40°. For example, the upper-right cell of Table 6 shows the percentage (92.3%) of ensemble pairs in which the ensemble derived from the crystal structure has three rotamers, the ensemble computed with IRECS has one rotamer, and this rotamer matches at least to one of the rotamers from the ensemble of the crystal structure. A visualization of such pairs of rotamer ensembles can be found in Figure 4. The overall results support the strategy of representing the conformation of flexible side chains with an ensemble of rotamers. IRECS maintains a high degree of accuracy by assigning more rotamers to those side chains that have multiple conformations also in the crystal structure. Whenever IRECS assigns only a single rotamer to a side chain, the prediction of this side chain is much more accurate (88.7%, 81.2%, and 92.3%) on average than for models with rotamer density one (71.6%) (see Table 4). If some side chains are harder to predict, IRECS adapts by assigning more rotamers to these side chains. In this way an overall good accuracy is maintained.
|
|
| Discussion |
|---|
|
|
|---|
The ability of IRECS to create rotamer ensembles that model the flexibility of each side chain is of special interest in all applications in which the number of rotamers per side chain has a large impact on the runtime of the modeling procedure. Such effects are commonly observed in docking applications that handle the flexibility of the ligand and receptor simultaneously (Leach and Lemon 1998; Claussen et al. 2001; Teodoro and Kavraki 2003; Gray 2006). One limiting factor of IRECS is the granularity of the rotamer library used. It has been observed that the accuracy of side-chain prediction often improves when using more detailed rotamer libraries than we have done (Xiang and Honig 2001; Shetty et al. 2003). However, using such libraries requires very large runtimes. We therefore decided to use the more coarse-grained backbone-dependent rotamer library (Dunbrack and Cohen 1997), which represents a suitable tradeoff between runtime and detail of representation of the conformational space. Using the rotamers of the BBDEP and their associated conditional probabilities, we are able to curb the combinatorial explosion effectively and to find a good first guess for all side-chain conformations in the discrete rotamer space. Further refinement techniques that work on single protein conformations like standard energy minimization by steepest gradient descent are then free to leave the discrete rotamer conformation space and to optimize the protein conformation in continuous space. The quality of the final structure then relies first on the ability of the side-chain prediction tool to sample discrete conformational space and second on the accurate modeling of the energetic landscape around the discrete solution by the refinement method. As Xiang and Honig (2001) have already stated, we also conclude that the quality of the energy functions remains the dominant factor for side-chain prediction.
In deriving ROTA, we followed the strategy of enforcing the detection of structural features that are highly discriminatory between correct structures and explicit side-chain decoys. Structural features that are equally common to both correct and decoy structures are automatically filtered out and do not have any influence on the score of a query structure. Using side-chain decoys as a structural basis for the reference state instead of using a theoretical reference state renders ROTA much more sensitive toward any errors arising from false side-chain prediction, at the cost of lower performance in identifying other modeling errors. The procedure of generating the reference state shown here is applicable to other scoring and modeling applications, whenever the space of relevant modeling errors can be simulated realistically.
Since our and all other approaches for side-chain prediction published so far still have insufficient accuracy, both scoring techniques based on knowledge-based statistical potentials and force fields have to be improved further. Whereas the accuracy of force-field scoring can be improved primarily through more detailed and complex modeling of the protein environment (especially the solvent), knowledge-based scoring schemes allow for treating many complex structural features implicitly. We therefore regard the knowledge-based scoring schemes as the tools of choice for further efforts toward increasing the accuracy of side-chain prediction. Possible improvements include the use of an extended feature space used for scoring. In our view, the publicly available data on protein structure offer such a dense description of protein conformational space that it should be possible to derive scoring functions with much more detailed features than only pairwise atom distances. To further improve the prediction accuracy of IRECS, we plan to include conformational information about side-chain conformations in homologous proteins.
| Footnotes |
|---|
Reprint requests to: Christoph Hartmann, Max-Planck-Institute for Informatics, 66123 Saarbrücken, Germany; e-mail: Hartmann{at}mpi-inf.mpg.de; fax: 49-681-9325399.
Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.062658307.
| Acknowledgments |
|---|
|
|
|---|
| References |
|---|
|
|
|---|
Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., and Karplus, M. 1983. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4: 187217.[CrossRef]
Canutescu, A.A., Shelenkov, A.A., and Dunbrack, R.L. 2003. A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci. 12: 20012014.
Claussen, H., Buning, C., Rarey, M., and Lengauer, T. 2001. FlexE: Efficient molecular docking considering protein structure variations. J. Mol. Biol. 308: 377395.[CrossRef][Medline]
DeLano, W.L. 2002. The PyMOL molecular graphics system. DeLano Scientific, San Carlos, CA.
Desmet, J., Demaeyer, M., Hazes, B., and Lasters, I. 1992. The dead-end elimination theorem and its use in protein side-chain positioning. Nature 356: 539542.[CrossRef]
Dunbrack, R.L. and Cohen, F.E. 1997. Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Sci. 6: 16611681.[Abstract]
Goldstein, R.F. 1994. Efficient rotamer elimination applied to protein side-chains and related spin glasses. Biophys. J. 66: 13351340.[Medline]
Gordon, D.B., Hom, G.K., Mayo, S.L., and Pierce, N.A. 2003. Exact rotamer optimization for protein design. J. Comput. Chem. 24: 232243.[CrossRef][Medline]
Gray, J.J. 2006. High-resolution proteinprotein docking. Curr. Opin. Struct. Biol. 16: 183193.[CrossRef][Medline]
Holm, L. and Sander, C. 1992. Fast and simple Monte Carlo algorithm for side chain optimization in proteins: Application to model building by homology. Proteins 14: 213223.[CrossRef][Medline]
Hubbard, S.J., Campbell, S.F., and Thornton, J.M. 1991. Molecular recognition. Conformational analysis of limited proteolytic sites and serine proteinase protein inhibitors. J. Mol. Biol. 220: 507530.[CrossRef][Medline]
Koehl, P. and Delarue, M. 1994. Application of a self-consistent mean field theory to predict protein side-chains conformation and estimate their conformational entropy. J. Mol. Biol. 239: 249275.[CrossRef][Medline]
Koehl, P. and Delarue, M. 1996. Mean-field minimization methods for biological macromolecules. Curr. Opin. Struct. Biol. 6: 222226.[CrossRef][Medline]
Leach, A.R. and Lemon, A.P. 1998. Exploring the conformational space of protein side chains using dead-end elimination and the A* algorithm. Proteins 33: 227239.[CrossRef]