|
|
||||||||
Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, Pennsylvania 19111, USA
Reprint requests to: Roland L. Dunbrack Jr., Institute for Cancer Research, Fox Chase Cancer Center, 7701 Burholme Avenue, Philadelphia, PA 19111, USA; e-mail: RL_Dunbrack{at}fccc.edu; fax: (215) 728-2412.
(RECEIVED April 22, 2003; FINAL REVISION June 16, 2003; ACCEPTED June 17, 2003)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.03154503.
| Abstract |
|---|
|
|
|---|
1 and
1 + 2 dihedral angle accuracies are 82.6% and 73.7% using a simple energy function based on the backbone-dependent rotamer library and a linear repulsive steric energy. The new algorithm will allow for use of SCWRL in more demanding applications such as sequence design and ab initio structure prediction, as well addition of a more complex energy function and conformational flexibility, leading to increased accuracy. Keywords: Structure prediction; rotamer library; homology modeling; SCWRL; side-chain prediction
| Introduction |
|---|
|
|
|---|
Nearly all of these are based on using a rotamer library of discrete side-chain conformations (McGregor et al. 1987; Ponder and Richards 1987; Tuffery et al. 1991; Dunbrack and Karplus 1993, 1994; De Maeyer et al. 1997; Dunbrack and Cohen 1997; Lovell et al. 2000; Xiang and Honig 2001; Dunbrack 2002) obtained from statistical analysis of the Protein Data Bank (PDB; Berman et al. 2000). Rotamer libraries can either be backbone-independent or backbone-dependent. Backbone-dependent rotamer libraries contain information on side-chain dihedral angles and rotamer populations as a function of the backbone dihedral angles
and
(Dunbrack and Karplus 1993, 1994; Dunbrack and Cohen 1997; Dunbrack 2002), whereas backbone-independent libraries ignore such dependence.
In side-chain prediction methods, rotamers are chosen based on the desired protein sequence and the given backbone coordinates, by using a defined energy function and search strategy. Energy functions that have been used in side-chain prediction include simple steric energy functions combined with log probabilities of backbone-dependent rotamer populations (Dunbrack 1999), as well as molecular mechanics potential energy functions with complex solvation free energy terms (Mendes et al. 2001a; Xiang and Honig 2001; Liang and Grishin 2002).
For some emerging uses of side-chain prediction, fast calculations are a requirement. These include Web-based homolog and fold-recognition programs that align query sequences to proteins of known structure and model proteins based on these identifications and alignments. For instance, 3D-PSSM (Kelley et al. 2000) uses SCWRL (Bower et al. 1997) to generate models of proteins from structure-derived profile alignments. As the numbers of available sequences and experimentally determined protein structures both increase, improved models of proteins can be constructed on the basis of a number of parent structures and many alternative alignments, as demonstrated at the recent CASP5 meeting. In some cases, hundreds of models can be used to determine the best parent and alignment for a single query sequence, thus necessitating fast side-chain prediction. The protein design problem, in which a sequence is designed that will fold into a given defined set of backbone coordinates (Desjarlais and Handel 1995; Dahiyat and Mayo 1996) also requires efficient side-chain search methods, because several residue types and many rotamers must be tested at every site of the protein. In this case, the combinatorial problem can be quite severe.
A search method used in side-chain prediction can be classified as either exact or approximate, depending on whether it is guaranteed to find the global minimum energy given the rotamer library and energy function, or instead whether it finds a low-energy conformation without such a guarantee. The many versions of the dead-end elimination algorithm (Desmet et al. 1992, 1997; Lasters and Desmet 1993; Goldstein 1994; Keller et al. 1995; Lasters et al. 1995; Gordon and Mayo 1999; Pierce et al. 1999; De Maeyer et al. 2000; Voigt et al. 2000; Looger and Hellinga 2001) are designed to find the global minimum energy, when they are able to converge. Conversely, Monte Carlo methods (Holm and Sander 1991; Liang and Grishin 2002), cyclical search methods (Dunbrack and Karplus 1993; Xiang and Honig 2001), and some other algorithms (Desmet et al. 2002) are not guaranteed to find a global minimum, but they will almost always find a low-energy conformation in a reasonable time.
We have continued to develop the SCWRL program, originally written by Michael Bower, over a number of years (Bower et al. 1997; Dunbrack 1999). SCWRL uses a backbone-dependent rotamer library (Dunbrack and Karplus 1993, 1994; Dunbrack and Cohen 1997; Dunbrack 2002) and an energy function based on log probabilities of these rotamers and a simple repulsive steric energy term. In this article, we will present a new algorithm for SCWRL based on graph theory, which is considerably faster than the original algorithm. The new algorithm will enable new uses such as protein design and ab initio structure prediction, as well as increases in accuracy with new energy functions and side-chain flexibility.
Before describing the new algorithm, it is useful to summarize the previous method used to solve the side-chain combinatorial problem in SCWRL as well as its drawbacks. The original SCWRL algorithm initially places side chains onto the backbone according to the backbone-dependent rotamer probabilities and steric interactions with the nonlocal backbone. Any side chains that clash with other side chains according to the steric energy function are then defined as "active." These active residues are placed in each of their possible rotamers in turn, and if any rotamer clashes with any currently "inactive" residues, these residues become active as well. This procedure continues until a set of active residues is identified. The combinatorial problem is reduced to finding the minimum energy configuration of these active residues, because the inactive residues are in their minimum energy conformation. This is de facto a dead-end elimination of all rotamers except one for each inactive residue.
We can represent the side-chain combinatorial problem in terms of graph theory by making each residue a vertex in an undirected graph. If at least one rotamer of residue i interacts with at least one rotamer of residue j, then there is an edge between vertices i and j of the graph. In the original SCWRL algorithm, the active residues can be represented as a graph, one that is not necessarily connected. The active residues can therefore be grouped into interacting clusters. Residues in different clusters do not have contacts with one another. Each cluster is a connected subgraph of the entire graph. The combinatorial problem is therefore reduced to enumerating the combinations of rotamers for the residues in each connected graph. In SCWRL, the lowest energy of each cluster is found by using a backtracking algorithm with pruning (see Materials and Methods).
Some clusters are too large to be solved in a reasonable period of time. The solution to this problem (Bower et al. 1997) is to locate one residue (the "keystone") that when removed from the cluster graph, breaks the graph into two separate subgraphs. This procedure is shown in Figure 1A
. Assuming each residue has the same number of rotamers, nrot, the number of combinations in this graph, is n11rot. The graph in Figure 1A
has one residue labeled as a dark gray vertex that breaks the graph into two pieces when removed, as shown on the lower part of the figure. The global minimum of the energy can be found by identifying the minimum energy configuration for each subgraph once for each rotamer of the keystone residue. The solution is the one that finds the minimum energy using the equation
![]() | (1) |
L,j
i} E({rj}
ri) is the lowest energy combination of rotamers in the left subgraph (L) with the keystone residue i fixed in rotamer ri, and ER is the term for the right subgraph. Eself (ri) is the energy of interaction of the side chain with the backbone and other fixed nonside-chain atoms, such as ligands. Splitting the graph results in n5rot + n7rot combinations. This process may result in subgraphs that are also too large to be solved in a reasonable time. SCWRL breaks these up by using a similar procedure, so that each subgraph is broken up for each rotamer of the first keystone residue and then for each rotamer of the second keystone residue, and so on.
|
Third, SCWRL also uses a number of heuristic steps to reduce the number of rotamers and the number of interactions considered in the cluster-solving algorithm. These include removing rotamers with backbone (local and nonlocal) energies >5.0 kcal/mole, and not forming edges between residues in the graph unless there is some pair of rotamers with interaction energy >10.0 kcal/mole. The effect of using these heuristics is that SCWRL does not locate the global energy minimum of its own energy function, because many interactions are ignored when setting up the combinatorial searches to make the calculations tractable.
In this article, we propose a new algorithm for SCWRL based on graph theory that overcomes the combinatorial problem in a novel way, as illustrated in Figure 1B
. Instead of recursively breaking up graphs into pieces as in previous versions of SCWRL (Fig. 1A
), the new algorithm breaks up clusters of interacting side chains into the biconnected components of an undirected graph. Biconnected graphs are those that cannot be broken apart by removal of a single vertex. Biconnected graphs are cycles, nested cycles, or a single pair of residues connected by an edge (sometimes referred to as a bridge). In Figure 1B
, the graph in Figure 1A
is shown, before and after being broken up into biconnected components. This graph can be broken up into five biconnected components: A is residues V10, R15, Y21, and K17; B is residues K17 and T30; C is residues T30, Q25, M32, H29, and W11; D is residues W11 and F35; and E is residues W11 and D39.
Vertices that appear in more than one biconnected component are called "articulation points." Removing an articulation point from a graph breaks the graph into two separate subgraphs. In this graph, vertices K17, T30, and W11 are articulation points. The order of an articulation point is the number of biconnected components of which it is a member minus one. A first-order articulation point connects only two components. Both K17 and T30 are first-order articulation points (Fig. 1B
, light gray), whereas residue W11 is a second-order articulation point (Fig. 1B
, dark gray). Finding biconnected components and their articulation points is easily accomplished by using an algorithm developed by Tarjan (1972). The algorithm uses a standard depth-first search algorithm from graph theory contained in many computer science textbooks.
To solve the side-chain combinatorial problem with biconnected components, we use the following simple procedure, illustrated in Figure 2
. For each biconnected component with only one articulation point, we find the minimum energy over all combinations of rotamers of the residues in the component for each rotamer of the articulation point. This energy includes all interactions among these residues and between these residues and the fixed rotamer of the articulation point. So in the first panel of Figure 2
, we calculate the minimum energy over all combinations of rotamers of residues V10, R15, and Y21 for each rotamer of residue K17. Once this is accomplished, we can collapse the biconnected component onto the articulation point, obtaining the graph in the second panel of Figure 2
. That is, we can now consider residue K17 a superresidue with a number of superrotamers. If residue K17 has 10 rotamers, superresidue {K17, V10, R15, Y21} also has 10 rotamers. The energies of these superrotamers include both the self-energies of residue K17 and the minimum energy of the biconnected component consisting of residues V10, R15, and Y21. After collapse, we reduce the order of articulation point K17 by one, so it is now no longer an articulation point in the second panel (i.e., is of zeroth order).
|
At this point, after collapsing several biconnected components, we have two components that now have only one articulation point, whereas before they had two each. We solve the first one, B, for each rotamer of residue T30 and form the superresidue {T30, K17, V10, R15, Y21}. We are left with one biconnected component with no articulation points. We find the minimum energy of this component over all combinations of its rotamers.
The procedure as just described always converges into a single biconnected component or a single residue, each rotamer of which now contains information about all other residues in the cluster as well as the total energy. The order of complexity of the side-chain problem is now reduced to the size of the largest biconnected component in any cluster.
To achieve reasonably sized biconnected components, the new program, referred to here as SCWRL3.0, begins with a dead-end elimination (DEE) step, based on the simple Goldstein criterion (Goldstein 1994). This DEE step replaces the heuristics in SCWRL previously used to remove many high-energy rotamers not likely to be part of the minimum-energy configuration. Because the simple DEE criterion is not sufficient to solve the combinatorial problem on its own for reasonably sized proteins, a number of recent papers have presented complex DEE schedules that remove clusters of rotamers. However, the new SCWRL algorithm based on biconnected components stands in contrast to recent DEE implementations (Pierce et al. 1999; De Maeyer et al. 2000; Looger and Hellinga 2001) in its simplicity and speed.
We show that the new algorithm is extremely fast even for very large proteins. We evaluate the accuracy and speed of the new algorithm, as implemented in SCWRL3.0. The new algorithm will allow for new applications and for the development of more accurate energy functions and consideration of side-chain flexibility without a significant impact on computational times.
| Materials and methods |
|---|
|
|
|---|
Step 1: Input
Read in backbone coordinates and (optionally) a new sequence file and ligand coordinates. Measure backbone
and
dihedral angles from the input coordinates, checking for chain connectivity.
Step 2: Rotamers
For each residue, read in rotamer dihedral angles and probabilities from binary-formatted backbone-dependent rotamer library, given measured values of
and
and the amino acid type. Currently rotamers are read in from highest to lowest probability until the cumulative density reaches at least 90% for each residue. Calculate cartesian coordinates for rotamers of all side chains, and calculate side-chain/backbone energy terms
![]() | (2) |
and
, and the second term consists of the interactions of the side-chain rotamer ri of residue i with the backbone of all residues further than one amino acid away in the sequence. The backbone-dependent rotamer library is assumed to take care of the backbone/side-chain interactions of residue i with itself and its neighbors one residue away. Determine which pairs of side chains may be able to interact given the distance between their Cß atoms and the maximum distance of any atom in the side chain from its Cß. These distances have been calculated from data in the PDB used in the backbone-dependent rotamer library.
Step 3: Disulfides
If desired, determine likely disulfide pairings (see below). Fix Cys side chains that are designated disulfides for the rest of the calculation.
Step 4: Dead-end elimination
Perform a DEE of rotamers that cannot be part of the global minimum energy configuration by using the "Goldstein criterion." The Goldstein criterion is the simplest version of DEE. If the total energy for all side chains is expressed as the sum of self and pairwise energies,
![]() | (3) |
![]() | (4) |
In words, rotamer si of residue i can be eliminated from the search if another rotamer of residue i, ri, always has a lower interaction energy with all other side chains and the backbone regardless of which rotamer is chosen for the other side chains. The DEE step can be made fairly efficient by applying the algorithm for rotamers from highest to lowest Eself, and once removed, a rotamer is not included in the minimum or sum steps in Equation 4
. Residues that have only one rotamer left after the DEE step are fixed for the rest of the calculation. The energy of interaction of these fixed side chains (also including disulfide-bonded cysteines) with any unfixed side chains is added to the self-energy of the unfixed side chains. That is,
![]() | (5) |
![]() |
Step 5: Residue graph
Define residues that have more than one rotamer left after the DEE step as "active" residues. These residues can be viewed as the vertices of a graph that may or may or not be connected. The first step in solving the combinatorial problem of the active residues is to compute the edges among the active residue pairs. For every pair of residues, i,j, there is an edge if at least one energy, Epair(ri,rj)
0. This does not necessarily mean checking every pair of rotamers, because as soon as one pair of interacting rotamers is found, then an edge is established, and the energies for other rotamer pairs for residue pair i,j do not need to be evaluated at this stage. Determine the interacting clusters of residues. That is, determine which sets of residues form connected graphs, given the list of edges.
Step 6: Biconnected components
For each cluster, determine the set of biconnected components and articulation points in the graph by sing a depth-first search procedure (see below; Tarjan 1972). The order of each articulation point is defined as Nbicon - 1, the number of biconnected components attached to the articulation point minus one. An articulation point that connects only two biconnected components is first order.
Step 7: Solve clusters
Find the minimum energy for each connected graph (cluster) in turn: For each biconnected component of the cluster with only one articulation point, find the energy minimum of the residues in the component for each rotamer of the articulation point. That is, the rotamer of the articulation point is fixed, and the rotamers of the other residues in the component are searched until the minimum energy is found. A branch-and-bound backtracking algorithm is used for this purpose (see below), in which the residues in the component are sorted from lowest to highest number of rotamers and the rotamers of each residue are sorted from lowest to highest self-energy. Store the lowest energy of the biconnected component as well as the rotamer identities with each rotamer of the articulation point. The articulation point rotamers now have a new energy component in their self energiesthe energy of the biconnected component, B, which includes the self energies of these rotamers as well as their interactions with each other and with the articulation point rotamer, ri,
|
| (6) |
The total self-energy for the articulation point rotamers is now
![]() | (7) |
After a biconnected component is solved for each rotamer, the component can be collapsed so that the articulation point rotamers are now superrotamers, defined as a group of residues with one rotamer defined for each residue. The superrotamers have self-energies defined in Equation 7
. After the collapse, we reduce the order of the articulation point by one. If the order is zero, then the residue is no longer an articulation point. After collapsing all biconnected components with only one articulation point and reducing the orders of these articulation points, additional biconnected components will have only one articulation point, as illustrated in Figure 2
. These biconnected components are then solved in a similar fashion. The process of solving each biconnected component for the rotamers of its articulation point is performed until there is only one biconnected component left with no articulation points. This final component is solved with the branch-and-bound algorithm described below.
Step 8: Output
Print coordinates of the chosen rotamers, including the disulfides, the fixed side chains, and those solved in each of the clusters.
Articulation points and biconnected components
In Step 6 above, a cluster of interacting residues, represented as vertices in a graph, are broken up into biconnected components. In this process, the articulation points that connect the biconnected components are also identified. A vertex of a connected graph is an articulation point if its removal from the graph, together with all the edges incident from that vertex, would split the graph into two or more components, with no edges between them. If a graph does not contain any articulation point, it is defined as biconnected.
The procedure for identifying biconnected components and articulation points is described in a number of computer science textbooks. The first step involves a depth-first search, or tree traversal, in which each vertex of the graph is assigned a depth-first number (DFN), numbering the vertices in the order in which they are visited. This is illustrated in Figure 3
. In a depth-first search, the vertices and edges of the graph are used to build a tree, in which each vertex may have several children, or descendants, connected by edges of the graph, and each vertex has a single parent, or ancestor. The nodes of the tree are searched by descending the tree as far as possible, before exploring additional nodes attached to each traversed vertex, in reverse order. For example, given the residue interaction graph from Figure 1B
, a depth-first traversal starting from residue V10 assigns the DFNs shown inside of the circles representing each node in Figure 3
. In the figure, the edges that were used during depth-first tree traversal are drawn in continuous lines and are called tree edges, whereas the ones that were not traversed are called back edges and are drawn with dashed lines.
|
|
| (8) |
In words, L(u) is the lowest DFN that can be reached from u using a path that includes only descendants of u and at most one back edge. In our example (Fig. 3
), low numbers are printed within the square boxes near each node. A node u that has a child w that satisfies DFN(u) < L(w) is either the root node or an articulation point.
We implemented a recursive function ART that scans the tree and computes the L and DFNs. Also it stores all the traversed edges in a stack. If for a certain vertex the articulation point inequality is met, a new biconnected component is created. All the edges are popped off of the edge stack and the vertices of these edges are used to define a biconnected component. For each identified articulation point, we keep track of the number of biconnected components that it is part of. The algorithm is shown in pseudocode in Figure 4
.
|
|
|
| (9) |
As defined by Equation 9
, the energies may be negative, positive, or zero.
We show in the Results section that a further optimization may be implemented by sorting the residues in the tree by the number of rotamers (from lowest to highest) and by sorting the rotamers in order of self-energy, from lowest to highest. This results in finding low-energy configurations of the system as early as possible in the search, thereby resulting in more frequent pruning in the search process. Sorting the tree by the number of rotamers is illustrated in Figure 5B
.
Rotamer library
SCWRL3.0 uses a new version of the backbone-dependent rotamer library (D.A. Montgomery and R. Dunbrack, unpubl.). A number of improvements have been made in the Bayesian statistical analysis in the determination of probabilities and average dihedral angles and variances for each rotamer at each value of
and
. Also, the quality control measures advocated by Lovell et al. (2000) in their backbone-independent rotamer library have been implemented. These include steric checks on hydrogen atoms (Word et al. 1999a) and B-factor cutoffs of 30.0 for any atom in a side chain, before it is included in the library data set. Also, Asn, Gln, and His residues have been checked for hydrogen-bonding partners and flipped when there was a clear preference for the opposite orientation of the terminal dihedral angle than the coordinates imply (Word et al. 1999b). Side chains with ambiguous orientations were not used in the data set. These three residue types now have six, four, and three rotamer types for their outermost dihedral angle respectively, compared with three, three, and two in previous versions of the rotamer library. This new rotamer library is available at http://dunbrack.fccc.edu/bbdep.html.
Energy function
The energy function consists of a log-probability term from the backbone-dependent rotamer library and steric terms between the side chains and the backbone and between side chains. The library term has the form
![]() | (10) |
![]() |
![]() |
![]() | (11) |
Input and output
As with earlier versions of SCWRL, SCWRL3.0 takes a PDB-formatted file that contains backbone coordinates and outputs a file, also in PDB format, containing backbone and predicted side-chain coordinates. Also, SCWRL optionally takes two other files as input. First, a sequence file may be used to change the sequence of the protein. This is an ASCII text file containing only the new sequence. As with earlier versions of SCWRL, this file can be used to instruct SCWRL to preserve the cartesian coordinates of some side chains from the input PDB file by placing them in lower case in the file. These side chains are kept fixed throughout the calculation. Their interaction energies with the rotamers of all other side chains are added to the self-energies of the active side-chain rotamers. In benchmarking tests of homology modeling, keeping conserved residues in their template conformation leads to improved prediction accuracy (data not shown).
Another optional input file for SCWRL contains coordinates for ligands (ions, small molecules, nucleic acids, or proteins) that may interact with side chains to be predicted. These atoms are kept fixed, and are assigned atomic radii based on their atom types. All naturally occurring elements are represented within SCWRL. SCWRL attempts to determine the element from the atom name (see online documentation). Radii were obtained from an analysis of nonbonded ligand-protein atom distances in 5319 proteins in the PDB obtained from the PISCES server (http://dunbrack.fccc.edu/pisces; Wang and Dunbrack 2003). Given the radii for protein atoms (given above), the radii for other atom types, mostly ions, were determined from the minimum distances between ligand atoms of each element type and protein atoms of any type. For example, minimum ZnO distances in the PDB are
1.9 Å. Because the radius of oxygen in SCWRL is 1.3 Å, the radius of Zn is 0.6 Å. Distances were determined such that there would not be significant repulsive interactions in the SCWRL energy function at atomatom distances observed in the PDB. The radii for 46 elements were determined in this manner. Radii for other elements were arbitrarily set to 1.0 Å.
Disulfide bonds
We use an empirically derived scoring function to determine disulfide bond pairings. The function is evaluated for all cysteine pairs in a protein, and a disulfide bond is made if for any rotamers of a given pair, the following scoring function is <45.0,
![]() | (12) |
![]() |
![]() |
1S
2 distance, A1 is the bond angle Cß1-S
1-S
2, A2 is the bond angle S
1-S
2-Cß2,
2 is the dihedral angle C
1-Cß1-S
1-S
2,
4 is the dihedral angle S
1-S
2-Cß2-C
2,
3 is the dihedral angle Cß1-S
1-S
2-Cß2, and r1 and r2 are the rotamers for the first and second cysteines in the proposed disulfide. In a set of 1865 disulfides in 2778 protein structures, this function correctly identifies 1848 of them. It overpredicts 301 disulfides, nearly all of which occur with cysteines that are ligands for ions. In these cases, the disulfide bond function can be turned off with a command line flag (-u).
Implementation details
SCWRL3.0 implementation was done in object-oriented C++. C++ was chosen over other widely used programming languages used in computational biology (FORTRAN, C) because of the high level of abstraction and flexibility it confers. Taking advantage of data encapsulation, inheritance, and polymorphism, the source code uses logical high-level entities (e.g., cluster, biconnected component, residue). At the same time, the code is highly modular, so that changes can be made without affecting the rest of the code. Other advantage are the more error-proof syntax imposed by C++ comparing, for instance, with C, as well as memory allocation/deallocation via wrapper classes instead of the classical C-style memory handling, which is prone to overflows. In terms of design, we preferred an object aggregation approach versus object composition. This provides us with more options regarding future changes and code reusability. We also made extensive use of the Standard Template Library for the parts of the code that involved vectors, sets, stacks, and maps. The development and debugging was done with Microsoft Visual C++ .Net, but the source code was constrained to standard C++ only, in order to keep it cross-compiler and cross-platform compatible. For a C++ compiler for Linux and MacOS X, we used gcc, version 3. SCWRL3.0 has been compiled on Microsoft Windows XP, RedHat Linux 7.3 and 8.0, and MacOS X (version 10.2).
Benchmarking was performed on a dual AMD MP1800+ computer running RedHat Linux 7.3.
Test set
The test set used for speed and accuracy assessments consists of 180 proteins of resolution
1.8 Å and mutual sequence identity of <50%, and were obtained from the PISCES server. The chains used were the single-chains in PDB entries 2ilk
[PDB]
, 1bec
[PDB]
, 1rb9
[PDB]
, 1thv
[PDB]
, 1pot
[PDB]
, 1tca
[PDB]
, 2end
[PDB]
, 6cel
[PDB]
, 1lam
[PDB]
, 8abp
[PDB]
, 3ebx
[PDB]
, 1thg
[PDB]
, 2erl
[PDB]
, 1pmi
[PDB]
, 1kuh
[PDB]
, 1ixh
[PDB]
, 1c52
[PDB]
, 1a7s
[PDB]
, 1gci
[PDB]
, 1nkd
[PDB]
, 1koe
[PDB]
, 1ycc
[PDB]
, 1tif
[PDB]
, 1orc
[PDB]
, 2pth
[PDB]
, 3grs
[PDB]
, 1bs9
[PDB]
, 2por
[PDB]
, 1l58
[PDB]
, 1dcs
[PDB]
, 1tag
[PDB]
, 3lzt
[PDB]
, 1bd8
[PDB]
, 1tyv
[PDB]
, 1qnf
[PDB]
, 1npk
[PDB]
, 1mrj
[PDB]
, 1eca
[PDB]
, 3pte
[PDB]
, 1bfd
[PDB]
, 7rsa
[PDB]
, 1nls
[PDB]
, 1a68
[PDB]
, 3cyr
[PDB]
, 1lcl
[PDB]
, 1uae
[PDB]
, 2eng
[PDB]
, 1a8d
[PDB]
, 2acy
[PDB]
, 1xnb
[PDB]
, 1vqb
[PDB]
, 1msi
[PDB]
, 1moq
[PDB]
, 1hxn
[PDB]
, 3vub
[PDB]
, 1hfc
[PDB]
, 2hbg
[PDB]
, 1aqb
[PDB]
, 1kid
[PDB]
, 1rzl
[PDB]
, 1ctf
[PDB]
, 1a8e
[PDB]
, 1csh
[PDB]
, 1pdo
[PDB]
, 1smd
[PDB]
, 1aba
[PDB]
, 2tgi
[PDB]
, 1bm8
[PDB]
, 2rn2
[PDB]
, 2a0b
[PDB]
, 1ako
[PDB]
, 2ctc
[PDB]
, 1b6g
[PDB]
, 1msk
[PDB]
, 2qwc
[PDB]
, 1bfg
[PDB]
, 3nul
[PDB]
, 1pda
[PDB]
, 1arb
[PDB]
, 1wab
[PDB]
, 1gof
[PDB]
, 1bg6
[PDB]
, 2cpl
[PDB]
, 1vie
[PDB]
, 1cor
[PDB]
, 1rhs
[PDB]
, 1aie
[PDB]
, 1bgf
[PDB]
, 3cla
[PDB]
, 1ifc
[PDB]
, 1vjs
[PDB]
, 4xis
[PDB]
, 1ha1
[PDB]
, 1dhn
[PDB]
, 1amm
[PDB]
, 2sak
[PDB]
, 1jer
[PDB]
, 1vhh
[PDB]
, 1g3p
[PDB]
, 1cyo
[PDB]
, 1hyp
[PDB]
, 1cbn
[PDB]
, 3pyp
[PDB]
, 1cex
[PDB]
, 1ctj
[PDB]
, 1a8i
[PDB]
, 1mof
[PDB]
, 1a6m
[PDB]
, 1cnv
[PDB]
, 1ryc
[PDB]
, 1mrp
[PDB]
, 1xjo
[PDB]
, 2sn3
[PDB]
, 2fdn
[PDB]
, 1din
[PDB]
, 2baa
[PDB]
, 1aru
[PDB]
, 1chd
[PDB]
, 1cv8
[PDB]
, 1bx7
[PDB]
, 16pk, 1bdo
[PDB]
, 2sns
[PDB]
, 3lck
[PDB]
, 3seb
[PDB]
, 1al3
[PDB]
, 1rie
[PDB]
, 2dri
[PDB]
, 1lbu
[PDB]
, 1sbp
[PDB]
, 1mml
[PDB]
, 1ush
[PDB]
, 1edg
[PDB]
, 1ads
[PDB]
, 2mcm
[PDB]
, 1yge
[PDB]
, 1whi
[PDB]
, 1iab
[PDB]
, 1ppn
[PDB]
, 1zin
[PDB]
, 1lst
[PDB]
, 1hka
[PDB]
, 1fna
[PDB]
, 1poa
[PDB]
, 1svy
[PDB]
, 3sil
[PDB]
, 1fus
[PDB]
, 1mun
[PDB]
, 1oaa
[PDB]
, 1gai
[PDB]
, 153l, 1cem
[PDB]
, 119l, 1iuz
[PDB]
, 1e70
[PDB]
, 1mla
[PDB]
, 1bj7
[PDB]
, 1ezm
[PDB]
, 1ra9
[PDB]
, 2igd
[PDB]
, 1nox
[PDB]
, 1fnc
[PDB]
, 1aop
[PDB]
, 1opd
[PDB]
, 2ayh
[PDB]
, 1byi
[PDB]
, 1cvl
[PDB]
, 1ayl
[PDB]
, 1axn
[PDB]
, 2cba
[PDB]
, 1pgs
[PDB]
, 5pti
[PDB]
, 1bkf
[PDB]
, 1vns
[PDB]
, 1aho
[PDB]
, 1b6a
[PDB]
, 1c3d
[PDB]
, 1phb
[PDB]
, 1rcf
[PDB]
, and 1atg
[PDB]
.
Availability
SCWRL3.0 is freely available to nonprofit research groups from http://dunbrack.fccc.edu/scwrl3/scwrl3.html. Commercial companies should contact the investigators by electronic mail at RL_Dunbrack{at}fccc.edu.
| Results |
|---|
|
|
|---|
Backtracking algorithm
We use a backtracking algorithm (Kreher and Stinson 1999) to find the minimum-energy combination of rotamers for each biconnected component. Backtracking algorithms are used to generate all feasible solutions to a combinatorial optimization problem by traversing a tree that represents all combinations of the system. This is illustrated in Figure 5
. A number of techniques are available for pruning the tree, so that not all combinations must be enumerated. One of these is to use a bounding function, which is defined at each level of the tree. For the side-chain problem here, we use a bounding function shown in Equation 9
that is the sum of minimum self-energies and minimum pairwise energies for all residues further down in the tree. Another way to improve the pruning is to sort the rotamers of each residue by their self-energies, so that low-energy rotamers are encountered as early in the tree search as possible. It is also possible to sort the residues by their respective number of rotamers. We investigated the effect of these three variants, singly and in combination, and the results are shown in Table 1
. In this table, "ener+" and "ener-" indicate sorting of rotamers from lowest-to-highest and highest-to-lowest energy, respectively. Similarly, "nrot+" and "nrot-" indicate sorting the residues from lowest to highest and highest to lowest number of rotamers, respectively. "Bound" means using the bounding function indicated in Equation 9
.
|
50,000 to <100. With random energies, the number of nodes can be pushed to
200 and still be solved in less than a minute (data not shown). With the energy function used currently in SCWRL in which many pairwise energies are zero, the algorithm is not as powerful because the bounding function sometimes does not change from level to level. However, breaking the degeneracy with favorable packing and electrostatic interactions will allow the branch-and-bound algorithm greater efficiency in solving larger clusters.
Computational complexity
SCWRL3.0 uses an updated version of the backbone-dependent rotamer library (Dunbrack and Cohen 1997; Dunbrack 2002), with improved Bayesian analysis of the probabilities and mean and variance of dihedral angles at all values of
and
. The backbone-dependent rotamer library contains information on all rotamer types for each side chain. For instance, lysine has four dihedral angle degrees of freedom, each of which can take on one of three different rotamer values, for a total of 81 rotamers. However, many of these are sterically impossible and have never or only very rarely been seen in the PDB. Because the library contains many rotamers of very low probability, SCWRL reads in the rotamers from highest to lowest probability, given the backbone conformation, and keeps only the top D proportion of the cumulative density. Currently D = 0.9 is used, because the prediction rate does not improve with higher values of D (M.J. Bower, unpubl.). After the coordinates are constructed for the input rotamers, the DEE step with the simple Goldstein criterion (Goldstein 1994) is used to remove rotamers that cannot be part of the global minimum-energy configuration. In Table 2
, we list the average number of rotamers that fall within the top 90% of density for each side-chain type, as well as the number that pass the DEE step. For Lys and Arg, the number of rotamers is reduced from 81 to less than 2 on average by the top-D and DEE criteria. However, the range is quite broad due to the variability in side-chain environments (buried, surface, number of near neighbors).
|