Protein Science
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


Protein Science (2005), 14:1643-1660. Published by Cold Spring Harbor Laboratory Press. Copyright © 2005 The Protein Society
This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Right arrow Citation Map
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Wallin, S.
Right arrow Articles by Chan, H. S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Wallin, S.
Right arrow Articles by Chan, H. S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us   Add to Digg   Add to Reddit   Add to Technorati  
What's this?

A critical assessment of the topomer search model of protein folding using a continuum explicit-chain model with extensive conformational sampling

Stefan Wallin and Hue Sun Chan

Department of Biochemistry and Department of Medical Genetics & Microbiology, University of Toronto, Toronto, Ontario M5S 1A8, Canada

Reprint requests to: Hue Sun Chan, Department of Biochemistry, University of Toronto, 1 King’s College Circle, Toronto, Ontario M5S 1A8, Canada; e-mail: chan{at}arrhenius.med.utoronto.ca; fax: 1-416- 978-8548.

(RECEIVED December 16, 2004; FINAL REVISION February 23, 2005; ACCEPTED February 23, 2005)


    Abstract
 TOP
 Abstract
 Introduction
 Results
 Discussion
 Materials and methods
 References
 
Recently, a series of closely related theoretical constructs termed the "topomer search model" (TSM) has been proposed for the folding mechanism of small, single-domain proteins. A basic assumption of the proposed scenarios is that the rate-limiting step in folding is an essentially unbiased, diffusive search for a conformational state called the native topomer defined by an overall native-like topological pattern. Successes in correlating TSM-predicted folding rates with that of real proteins have been interpreted as experimental support for the model. To better delineate the physics entailed, key TSM concepts are examined here using extensive Langevin dynamics simulations of continuum C{alpha} chain models. The theoretical native topomers of four experimentally well-studied two-state proteins are characterized. Consistent with the TSM perspective, we found that the sizes of the native topomers increase with experimental folding rate. However, a careful determination of the corresponding probabilities that the native topomers are populated during a random search fails to reproduce the previously predicted folding rates. Instead, our results indicate that an unbiased TSM search for the native topomer amounts to a Levinthal-like process that would take an impossibly long average time to complete. Furthermore, intraprotein contacts in all four native topomers considered exhibit no apparent correlation with the experimental {phi}-values determined from the folding kinetics of these proteins. Thus, the present findings suggest that certain basic, generic yet essential energetic features in protein folding are not accounted for by TSM scenarios to date.

Keywords: protein folding; topomer search model; topomer-sampling model; Levinthal search; explicit-chain modeling

Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.041317705.


    Introduction
 TOP
 Abstract
 Introduction
 Results
 Discussion
 Materials and methods
 References
 
Many single-domain proteins fold via apparent two-state processes with rates that vary widely, from µsec–1 to sec–1 (Jackson 1998; Baker 2000). In 1998, an insightful discovery was made: These folding rates were found to be predictable to approximately within one to two orders of magnitude using a very simple topology-based parameter called the relative contact order, CO (Plaxco et al. 1998). This parameter quantifies the average sequence separation between native contacts divided by protein length. Proteins with low CO values (typically {alpha}-helical proteins) were found to fold faster than proteins with high CO values (typically {alpha}/{beta}- and {beta}-proteins). Since this seminal finding, other topology-based parameters have been found that exhibit similar correlation with folding rates. These include long-range order (Selvaraj and Gromiha 2001), total contact distance (Zhou and Zhou 2002), "cliquishness" (Micheletti 2003), local secondary structure content (Gong et al. 2003), and the topomer-derived parameter considered in this work (Makarov and Plaxco 2003). Several theoretical constructs that did not directly consider explicit chain representations have had notable successes in reproducing this empirically observed rate– topology relationship (Alm and Baker 1999; Galzitskaya and Finkelstein 1999; Muñoz and Eaton 1999; Weikl and Dill 2003). For more realistic explicit-chain models that involve direct simulations of folding kinetics, however, it has proven less straightforward (Koga and Takada 2001; Faisca and Ball 2002; Cieplak and Hoang 2003); nonetheless, significant recent advances have been made (Jewett et al. 2003; Kaya and Chan 2003c; Chavez et al. 2004; Ejtehadi et al. 2004).

Among the theoretical efforts aiming to elucidate the remarkable contact order-dependent folding rates, a series of closely related models under the name of "topomer search model" or "topomer-sampling model" (TSM) has been proposed in the past five years as a possible mechanistic basis for the empirically observed rate–topology relationship (Debe and Goddard 1999; Debe et al. 1999; Makarov and Metiu 2002; Makarov et al. 2002; Makarov and Plaxco 2003; Gillespie and Plaxco 2004). These topomer constructs share the idea that the folding of small single-domain proteins is a two-step process: The first step is an essentially unbiased, diffusive search for an overall native-like topological state termed the "native topomer state." In the second step, after the chain has formed the correct gross topology, the native contacts rapidly zipper to form the completed native structure. Since this second step is assumed to occur very rapidly, the rate-limiting step in folding is the essentially unbiased search process in the first step, and folding rates are therefore determined by the "size," or extent in conformational space, of the native topomer state.

The topomer folding picture may be schematically illustrated by the highly simplified energy landscape caricatures in Figure 1Go. Here the energy landscape "outside" of the native topomer (the part represented by a steep funnel in Fig. 1Go) has been drawn flat to highlight the unbiased nature of the hypothetical TSM process (Debe and Goddard 1999). This corresponds to the case in Makarov et al. (2002) for which "the barrier to folding is purely entropic," although the possibility of a general energy favoring contact formation has also been considered as part of a TSM process (Makarov et al. 2002, p. 3538; see Discussion below). Thus, in a proposed unbiased topomer-search scenario, the first step of the folding process is a search on this flat region of the energy landscape. Then, as soon as the native topomer is found, folding proceeds quickly and downhill to the native state until all native contacts are formed. As a consequence of this folding picture, the folding rate of a protein is controlled by a quantity Ptop, the probability that the native topomer state is populated during a random search. Slowand fast-folding proteins thus have "tight" and "loose" native topomers, respectively, corresponding to the two landscapes in Figure 1Go with smaller and larger "golf holes."



View larger version (11K):
[in this window]
[in a new window]
 
Figure 1. Simple schematic illustration of the energy landscape in the TSM, for a slow-folding (left) and a fast-folding (right) protein. The horizontal dimensions represent protein conformational variation or conformational entropy; the vertical dimension provides the free energy of every given protein conformation—with appropriate averaging of solvent degrees of freedom (Dill and Chan 1997).

 
To our knowledge, the original version of TSMwas first proposed by Debe et al. (1999). Using a discrete chain growth method to generate large numbers of random conformations, the total number of distinct topomer states (of which the native topomer is one)was estimated for proteins with chain lengths up to N=100 amino acid residues. For instance, they found the number of topomers to be ~3 x 107 for N=100. It was then argued that if the topomer state space was sampled on a nanosecond timescale, native topomers could be found on typical folding times, even in the absence of a mechanism that simplifies and expedites the search. In a subsequent paper, Debe and Goddard (1999) set out to test themodel further by estimating the quantities Ptop for a set of 18 {beta}- and {alpha}/{beta}-proteins using a similar chain growth method (all-{alpha} proteins were excluded from this data set). The folding rates that resulted from their calculations showed a good correlation with experimental data (correlation coefficient 0.78).

In all TSM scenarios, the proposed folding process is quickly completed once the native topomer is located. It follows that if TSM is to provide a rationalization of apparent two-state folding mechanisms, the precise definition of the native topomer state is of critical importance. In the original TSMstudy by Debe and colleagues, "topomers are tubes of topologically equivalent conformations" (Debe and Goddard 1999): Two structures (conformations) were classified as topomeric (i.e., belonging to the same topomer state) based on a test procedure in which equal-strength harmonic forces were applied between corresponding C{alpha} atoms of two optimally superimposed structures. If a conjugate gradient minimization procedure could completely relax the spring forces, that is, if small backbone moves of one structure could bring it onto the other without getting trapped in a local minimum, the two structures were classified as belonging to the same topomer (Debe et al. 1999).

In the subsequent development of the TSM approach by Makarov and colleagues, however, the native topomer state was defined by specific collections of native contacts, as the set of conformations in which all sequence-distant native contacts are formed. In other words, TSM is the conformational ensemble in which all sequence-distant residues (as defined by the formalism) that are in contact in the native state are in close spatial proximity. Hence, the number of sequence-distant native contacts, {Lambda}D, plays a central role in this more recent TSM formulation, as will be detailed below. This approach is different in many respects from the original TSM of Debe and Goddard (1999) and Debe et al. (1999). Certain features introduced in the later formulation of Makarov and colleagues are seen as instrumental in extending the TSM’s ability to predict folding rates to include also purely {alpha}-helical proteins, as opposed to only {beta}- and {alpha}/{beta}-proteins (Makarov and Metiu 2002; Makarov et al. 2002; Makarov and Plaxco 2003).

The kinetic process of topomer sampling was envisioned by Debe and colleagues to be mainly among relatively compact conformations with low (favorable) solvation energies, as was depicted by the kinetic path on their "Rose Bowl" (a stadium in Pasadena, California) landscape (see Debe et al. 1999, Fig. 5Go). This amounts to postulating a folding mechanism that involves a fast, "burst-phase" collapse to an ensemble of compact conformations that are partitioned into different topomers. In that case, the flat areas in Figure 1Go represent only such compact conformations but not the open conformations that presumably have higher solvation energies. The bulk of the search time for the native topomer is then spent in different compact conformations. But such a folding mechanism does not appear to correspond to that of many real, apparent two-state proteins—typified by chymotrypsin inhibitor 2 (Jackson and Fersht 1991)—that exhibit no significantly populated compact folding intermediates. In contrast, the topomer search approach of Makarov and colleagues seeks the probability of locating the native topomer among all possible conformations, most of which are presumably open (Makarov et al. 2002), and is therefore conceptually more in line with the principles of cooperative protein folding (Chan et al. 2004; Gillespie and Plaxco 2004).



View larger version (46K):
[in this window]
[in a new window]
 
Figure 5. Comparison between TSM-predicted (+) and experimental ({circ}) {phi}-values. The lines between data points serve merely as a guide for the eye. The secondary structure of the proteins along the sequences is indicated by large green and small black rectangular shapes for {alpha}-helical and {beta}-sheet regions, respectively.

 
For this reason, our present analysis focuses primarily on the more recent TSM formulation of Makarov and colleagues with the above-described contact-based definition of the native topomer (Makarov and Metiu 2002; Makarov et al. 2002; Makarov and Plaxco 2003). In their approach, TSM prediction of folding rates is reduced to solving the following problem: What is the probability P(nD) that nD sequence-distant residue pairs are brought into proximity by a random search process? In this notation, the native topomer probability Ptop is identified with P({Lambda}D). The parameter {Lambda}D (which is denoted QD in Makarov and Plaxco 2003) is closely related to the long-range order parameter LRO proposed previously by Selvaraj and Gromiha (2001), namely, {Lambda}D/N=LRO provided that the same definition of a contact is used. Based on results from inert Gaussian chains (without consideration of polymer excluded volume), Makarov and Plaxco (2003) introduced the approximate formula P({Lambda}D)={gamma}a{Lambda}D where a and {gamma} are constants. This expression leads to a simple topology-based rate-prediction formula that was tested on a set of 24 two-state proteins (Makarov and Plaxco 2003), a diverse set that did not exclude all-{alpha} proteins as had been done previously (Debe and Goddard 1999). The rate-prediction formula showed excellent correlation with experimental folding rate data with a correlation coefficient ~0.9 (Makarov and Plaxco 2003), even though folding rates predicted using {Lambda}D can sometimes be drastically different from those predicted using CO (Jones and Wittung-Stafshede 2003).

Because of this empirical success and the potential physical insights it may offer, it is imperative to take an in-depth look at the TSM perspective as several key aspects of it remain to be better elucidated. As it stands, TSM is not a self-contained, explicit-chain model. TSM investigations thus far have not used direct simulations of folding kinetics. Instead, kinetic behaviors and folding rates were deduced from presumed thermodynamics–kinetics relationships. In this basic respect, TSM is akin to other non-explicit-chain modeling of CO-dependent folding that have also enjoyed high degrees of success (Alm and Baker 1999; Muñoz and Eaton 1999). However, the relationship between the thermodynamics and kinetics of chain molecules can be subtle, and explicit-chain results do not always agree with expectations from non-explicitchain considerations (Karanicolas and Brooks 2003; Chan et al. 2004). Therefore, ultimately, TSM or any other non-explicit-chain model of protein folding has to be evaluated by ascertaining whether and how the model assumptions can emerge from polymer physics. With this in mind, this article explores the feasibility of the proposed TSM folding mechanism by performing extensive Langevin molecular dynamics simulations using an explicit C{alpha} chain model. We focus on four apparent two-state proteins with diverse CO values. These chain models allow us to assess the logic of the physical picture afforded by TSM, and to perform structural comparisons with experiments.

In the analysis below, we first derive a clear structural characterization of the native topomers for the four proteins. Then, by using standard histogram reweighting techniques, we estimate their P({Lambda}D) values. We find them to be much smaller than that stipulated by the TSM approximation P({Lambda}D) = {gamma}a{Lambda}D. In fact, our calculations show that finding the correct native topomer state in a random, unbiased TSM search is so unlikely that it is comparable to the hypothetical Levinthal search process. Finally, to connect TSM with common understanding of folding kinetics, we compare experimental {phi}-values with TSM-predicted {phi}-values calculated by taking the ensemble of conformations constituting a native topomer state to be the TSM-prescribed folding transition state.


    Results
 TOP
 Abstract
 Introduction
 Results
 Discussion
 Materials and methods
 References
 
Proteins
We focus on the four single-domain proteins in Table 1Go that cover a variety of chain lengths, secondary structure contents, and folding rates. They are part of a larger set of roughly 30 single-domain proteins that have been shown to fold via simple, apparent two-state thermodynamics and kinetics. A list of these proteins along with experimental folding data can be found in Ivankov et al. (2003).


View this table:
[in this window]
[in a new window]
 
Table 1. Data for the four proteins studied
 
Preliminaries: Definitions and assumptions
Following Makarov and Plaxco (2003), we use a spatial proximity cutoff parameter rc and a sequence separation cutoff parameter lc to determine the set of sequencedistant native contacts: An amino acid pair ij is considered part of the native topomer contact set if the C{alpha}’s of amino acids i and j are separated by a spatial distance rnij<rc and their sequence separation |i – j| > lc. Thus, the number of contacts {Lambda}D in this set depends on the choice of lc and rc. In Table 2Go, we report the {Lambda}D values for our four proteins, for several different choices of lc and rc within the limits 4 ≤ lc ≤ 12, and 6 Å ≤ rc ≤ 8 Å . Table 2Go shows that {Lambda}D depends rather strongly on lc and rc, and this dependence is particularly strong for 1lmb. For this protein, {Lambda}D varies almost by a factor 10 between the different values of lc and rc considered here.


View this table:
[in this window]
[in a new window]
 
Table 2. Number of sequence-distant native contacts {Lambda}D
 
We then define the native topomer state of a protein as the set of conformations in which every {Lambda}D amino acid pair has formed a "contact" or is otherwise in close spatial proximity. To simplify the terminology, in the following we will use "contact" to indicate simply that two C{alpha} atoms are close in space. For this purpose, two definitions of contact are used. Unless stated otherwise, we use the contact criterion rij<1.2rnij , where rij is the C{alpha}–C{alpha} distance between i and j. This criterion has been used in previous studies of native-centric C{alpha} models (Clementi et al. 2000; Kaya and Chan 2003a). In certain considerations, we find it useful to use a slightly more permissive contact criterion, rij<rnij+3.0 Å , which entails a broader definition of a native topomer. It should be noted that both of these contact criteria are independent of the cutoff parameter rc.

Chain simulation of the native topomer search process
To assess the TSM folding picture by an explicit-chain approach, it is necessary to (1) construct an overall conformational ensemble that is essentially unbiased, (2) determine the size of the conformational space covered by the native topomer as a sub-ensemble of the overall conformational ensemble, and (3) estimate by explicit construction the probability that this conformational sub-ensemble can be located by a diffusive search, as envisioned by TSM. We adopt a simple C{alpha} chain model for this purpose, with the following potential energy function:


(1)

where bi, {theta}i, {phi}i, and rij are the virtual bond lengths, bond angles, torsion angles, and C{alpha}–C{alpha} distances, respectively, and bni, {theta}ni, {phi}ni, and rnij are the corresponding native values. This energy function is designed to capture the basic polymer statistics of a generic self-avoiding polypeptide chain. The use of the {theta}ni and {phi}ni values in the last Here we focus on the cutoff values lc and rc within the limits suggested by Makarov and Plaxco (2003). two terms introduces a small bias toward the native state. These terms are included here for computational efficiency. Their presence does not alter our main conclusion, based on the results from Equation 1, that the probability of locating the native topomer by an unbiased search is vanishingly small (see below). This is because the native-centric nature of these terms in Equation 1 implies that the relative size of the native topomer sub-ensemble would be even smaller in an entirely unbiased overall conformational ensemble. The first term of the above potential function accounts for polymer excluded volume. Its summation is over all amino acid pairs ij, and the parameters {sigma}ij are chosen in the following manner: {sigma}ij=0.85 rnij for pairs ij in contact in the native state (rnij <rc); otherwise, {sigma}ij=4.0 Å . To speed up the calculations, the repulsive energy term is evaluated using a cutoff procedure with cutoff radius 2{sigma}ij. This choice of the {sigma}ij parameters ensures that the native conformation would not be made inaccessible by unphysical repulsive forces.

A motivation for adopting a simplified C{alpha} model in this study rather than using chain representations that account for more structural details is computational efficiency: A simplified chain model allows for extensive conformational sampling and reliable probability estimations that are essential in addressing issues pertinent to the TSM. In this regard, it is noteworthy that the TSM itself—like several other topology-based folding scenarios—is also an inherently structurally low-resolution formulation that does not refer directly to sequence-specific interactions (Makarov and Plaxco 2003). Hence, a C{alpha} chain model enjoys the advantage of making direct connections to the underlying mathematical approximations of the TSM. Continuum (off-lattice) C{alpha} models have been used extensively in the literature for studying protein folding, mostly in combination with Go-type energy functions (Clementi et al. 2000) but also for structure prediction (Nanias et al. 2003). In particular, our model is similar to the one introduced in Clementi et al. (2000) and further studied in Kaya and Chan (2003a). Following Kaya and Chan (2003a), we choose the strengths of energy terms in Equation 1 to be krep= {varepsilon}, kbon=100 {varepsilon}, kben=20 {varepsilon}, and ktor=0.5 {varepsilon}, respectively, where the energy unit {varepsilon} is set to 1, and use Langevin dynamics to calculate the thermodynamic behavior. The temperature is kept constant at T=1.0, and snapshots are taken every 100 time steps. Simulation details are otherwise the same as in Kaya and Chan (2003a).

The native topomers
The energy function in Equation 1 describes a disordered and "floppy" protein chain that samples many different conformations. To obtain a subset of these conformations that corresponds to the native topomer state of each of the four proteins studied here, we impose harmonic constraints representing the sequence-distant topomer contact pairs on the C{alpha} chain. This is achieved by using the energy function


(2)

where the summation shown is over the {Lambda}D sequence-distant native contacts as defined by the cutoff parameters lc and rc (see Table 1Go). The conformational sampling procedure here is similar to the determination of ensembles of partially disordered protein conformational states that uses experimental data as constraints (Choy and Forman- Kay 2001; Vendruscolo et al. 2001), the main difference being that the constraints in this study are supplied by the TSM hypothesis instead of experimental measurements. In Equation 2, the topomer constraint strength is taken to be ktop=10 {varepsilon}, making these harmonic forces one order of magnitude weaker than the strength kbon of the virtual bonds between sequentially adjacent C{alpha} atoms. We perform Langevin dynamics simulations of the energy function Etop for each of our four proteins and different choices of the parameters lc and rc. Each simulation is started from the native structure. The system is first equilibrated for 108 simulation time steps, after which sampling is performed for 109 time steps.

In general, a smaller lc and a larger rc will produce a larger number of topomer constraints, resulting in a topomer ensemble that more closely resembles the native pdb structure. This trend can be seen from Table 3Go, which shows thermodynamic averages of the radius of gyration Rg, the root-mean-square deviation (rmsd) from the corresponding native structure, and the fraction of native contacts Q for four different parameter choices of lc and rc. Smaller lc and larger rc naturally lead to smaller <rmsd> and larger <Q> values, consistent with more native-like ensembles. We note also that <Rg> remains fairly close to the corresponding native values (see Table 1Go) across the various choices of lc and rc.


View this table:
[in this window]
[in a new window]
 
Table 3. Native topomer thermodynamic averages
 
The conformational ensembles of native topomers obtained here are a direct logical consequence of the TSM assumptions of Makarov and Plaxco (2003). An advantage of our explicit-chain approach is that it can supply relatively detailed conformational information that was not available in their TSM formulation that used an analytical approximation of conformations of chains without excluded volume. To provide a graphic structural illustration of the ensemble of topomer conformations obtained from our simulations, we display in Figure 2Go a selected set of these conformations optimally superimposed on the corresponding native structure. These conformations are the first 25 centroids from a simple clustering scheme on the complete topomer ensembles (see figure caption). It is interesting to note that, qualitatively, these drawings appear to be consistent with the TSM perspective (Makarov et al. 2002; Makarov and Plaxco 2003) that a slow-folding protein such as 1aps [PDB] has a "tighter" transition state ensemble than, for example, the much faster folding protein 1lmb [PDB] , which has a very "loose" native topomer state (Fig. 2Go). A similar picture is offered by a more recent non-explicit- chain analysis of the correlation between folding rates and presumed ensemble sizes of "transition states" (Bai et al. 2004) derived from the total contact distance measure (Zhou and Zhou 2002). Here, the trend exhibited in Figure 2Go is further demonstrated by Figures 3Go and 4Go, which show the normalized probability distributions P(rmsd) and P(Q) for lc=12 and rc=8 Å . The order of these distributions in terms of their closeness to the native structure follows the order of the experimental folding rates for these proteins (see Table 1Go), as expected from the TSM picture. The same rmsd and Q order holds for practically all the other choices of lc and rc within the limits 4 ≤ lc ≤ 12 and 6 Å ≤ rc ≤ 8 Å studied here.



View larger version (94K):
[in this window]
[in a new window]
 
Figure 2. Illustrations of the native topomers for the proteins 1aps [PDB] , 1ci2, 1urn, and 1lmb. Shown are the top 25 conformations (red) selected by a simple clustering procedure on the full lc=12, rc=8 Å topomer ensembles. Each of the 25 conformations is optimally superimposed on the corresponding native structure (dark trace). In the clustering procedure, the highest ranking conformation is the one with the largest number of conformational neighbors, where two conformations are considered to be neighbors if their rmsd is less than a certain cutoff (1.3–4.0 Å). The next highest ranking conformation is the conformation with most neighbors in the reduced ensemble, where the highest-ranking conformation and its neighbors have been excluded, and so on.

 


View larger version (23K):
[in this window]
[in a new window]
 
Figure 3. Probability distributions P(rmsd), where rmsd is from the corresponding native pdb structure, for 1aps [PDB] , 2ci2, 1urn, and 1lmb, as obtained using lc=12 and rc=8 Å in Equation 1, and sampling at T=1.0.

 


View larger version (20K):
[in this window]
[in a new window]
 
Figure 4. Probability distributions P(Q) for 1aps, 2ci2, 1urn, and 1lmb, as obtained using lc=12 and rc=8 Å in Equation 1, and sampling at T=1.0.

 
Native topomer ensemble sizes and folding rate predictions
We turn now to a more quantitative assessment of the TSM. In the previous section we saw that the native topomer ensembles of the four proteins display various degrees of "floppiness"; the 1lmb [PDB] topomer ensemble is, in a sense, larger than that of 1aps (see Fig. 2Go). The TSM perspective hypothesizes that it is these differences in ensemble sizes that underlie the different folding rates observed for different proteins. More precisely, TSM stipulates that the folding rate is essentially proportional to {Lambda}DP({Lambda}D), the probability of populating the native topomer during a random, diffusive search, since TSM considers this to be the rate-limiting step in the folding of small, single-domain proteins (Makarov and Plaxco 2003).

As mentioned briefly above, Makarov and Plaxco (2003) made the approximation, based on considerations of Gaussian inert chains, that P({Lambda}D)= {gamma}a{Lambda}D, where {gamma} and a are constants applicable to all proteins. This means that bringing each additional sequence-distant native pair together contributes on average a constant multiplicative factor a (<1) to the overall probability P({Lambda}D) of the native topomer state among the full ensemble of all accessible conformations. (The first three contact pairs are assumed to contribute with somewhat different factors, which are taken into account by the overall multiplicative factor {gamma} != 1.) The topomer search rate was then postulated to be given approximately by the formula kf= {kappa} {Lambda}DP({Lambda}D), where kf is the folding rate, and {kappa}{Lambda}D is the frequency of attempted contact formation. This formula was fitted to experimental folding rate data for a set of 24 twostate proteins yielding an excellent correlation coefficient of {approx}0.88 (Makarov and Plaxco 2003). It should be noted that the dominant, exponential dependence on {Lambda}D in this Makarov-Plaxco expression is in P({Lambda}D). Thus, this rate correlation formula resembles closely the ln kf {vprop} LRO hypothesis proposed previously by Selvaraj and Gromiha (2001) because LRO={Lambda}D/N.

The success of this empirical two-parameter fit implies that the structural quantity {Lambda}D or LRO—like the original CO parameter (Plaxco et al. 1998)—is capturing significant aspects of the physics of apparent two-state protein folding. Obviously, the more important issue, then, is whether the physical picture offered by the TSM perspective is supported by this remarkable folding rate correlation as well. Does the empirical success of the TSM kf ({Lambda}D) formula necessarily imply that the proposed process of topomer search is correct or even physically plausible? To provide a logical answer to this question, an essential first step is to determine the mathematical validity of the P({Lambda}D)= {gamma}a{Lambda}D formula; more specifically, whether this formula, indeed, provides the native topomer probability it purports to describe. Obviously, unless the answer to this basic question is affirmative, the empirical success of the TSM kf({Lambda}D) formula can only be compared with empirical rate correlation with LRO and other proposed parameters but cannot logically be translated into support for the broader topomer-search discourse.

We therefore first focus on the TSM formula P({Lambda}D)= {gamma}a{Lambda}D. By fitting experimental data, the two constants {gamma} and a have been determined to be a=0.86 and {gamma} =4x 10–5 for lc=12 and rc=6 Å (Makarov and Plaxco 2003). Using these fitted values of Makarov and Plaxco in conjunction with the {Lambda}D values for lc=12 and rc=6 Å in Table 2Go, the {gamma}a{Lambda}D values for our four proteins are found to be ranging from 10–5 to 10–10 (Table 4Go). Thus, according to this Makarov-Plaxco formula, populating the native topomer state during a random conformational search is about 3 x 104 times more likely for llmb than for 1aps, for example; and this predicted probability difference is approximately equal to the folding rate difference of these two proteins.


View this table:
[in this window]
[in a new window]
 
Table 4. Estimated probabilities P ({Lambda}D)
 
Now, is this predicted probability difference valid? Does the Makarov-Plaxco P({Lambda}D) formula adequately describe the relative size of the native topomer subensemble relative to the full conformational ensemble? To tackle this question, we compute the probability P({Lambda}D) for our four proteins in explicit-chain conformational ensembles defined by the energy function E0 in Equation 1. To err on the side of affording a higher probability to the native topomer, in this analysis we use the more permissive contact criterion rij<rnij+3.0 Å to determine if amino acid residues (C{alpha} positions) are in contact. The most direct way to ascertain P({Lambda}D) would be to perform Langevin dynamics simulations with E0 and count the fraction of conformations that fulfill the native topomer criterion, that is, having the {Lambda}D amino acid pairs close in space. But such a bruteforce approach is not computationally feasible because of the smallness of P({Lambda}D), making the native topomer very unlikely to be visited by unbiased conformational sampling. Thus, to enhance sampling of the native topomer, we introduce the auxiliary energy function


(3)

where Ebias is a potential with a bias toward the native topomer and {lambda} is a bias strength parameter. Two-dimensional histograms P{lambda} (nD, Ebias) are recorded for a series of {lambda}, where nD denotes the number of sequencedistant native contacts formed. A standard histogram reweighting technique (Ferrenberg and Swendsen 1989) is then applied to obtain an improved estimate of P0(nD, Ebias), and hence of P0(nD={Lambda}D)=P({Lambda}D). Further details of this reweighting procedure, including the precise form of Ebias, are given in Materials and Methods. It should be noted that by design Ebias is not a function of all C{alpha} positions, but only of those that are part of the native topomer constraint set, such that the added bias is toward the somewhat disordered native topomer state (whose sampling we aim to enhance) rather than toward the fully native pdb structure.

Table 4Go shows the results of our P({Lambda}D) calculations for four different choices of lc and rc. For the four proteins studied, our explicit-chain estimates of P({Lambda}D) lie in the range ~10–86 to 10–18, in contrast to the much larger values produced by the formula P({Lambda}D)= {gamma}a{Lambda}D using the particular a and {gamma} parameters of Makarov and Plaxco. In fact, some of the computed explicitchain P({Lambda}D) probabilities in Table 4Go are as small as what would be expected for a Levinthal search. For instance, assuming that each Ramachandran torsion angle can occupy three discrete values, the total number of conformations for a 64-amino-acid protein (such as 2ci2 [PDB] ) would be roughly (3x3)64 {approx} 1061. (The essential exponential increase with chain length N of the number ~ µN of all possible conformations—most of which are noncompact—applies to chains with excluded volume as well. The effect of excluded volume in three dimensions leads to an appreciable but not large decrease in µ. For instance, for chains configured on simple cubic lattices, µ decreases from 6 to {approx}4.68; e.g, see Barber and Ninham 1970; Chan and Dill 1991). From a polymer physics perspective, the results in Table 4Go are not surprising. This is because the native topomer ensembles, especially those with relatively large {Lambda}D’s (e.g., 1aps), P({Lambda}D) is the probability that the {Lambda}D sequence-distant native contacts are formed during a search process described by the energy function E0 (see Equation 1). Two amino acid residues i and j are assumed to contact each other if rij<rnij +3.0 Å . For comparison, we have included the corresponding values from the Makarov-Plaxco P({Lambda}D)= {gamma}a{Lambda}D formula with a=0.86 and {gamma} =4 x 10–5. are constituted of conformations quite closely resembling that of the native pdb structure itself (cf. Figs. 3Go and 4Go). A probable cause of the gross overestimation of native topomer probabilities by the Makarov-Plaxco {gamma}a{Lambda}D formula is its neglect of excluded volume. Excluded volume places enormous restrictions on the conformational freedom of compact chains (Chan and Dill 1990), but excluded volume effects are not taken into account in the Makarov-Plaxco TSM formulation. We note that the relative values of the P({Lambda}D) probabilities computed here mostly follow the same order as that provided by the P({Lambda}D)= {gamma}a{Lambda}D expression, suggesting that this formula might apply approximately for a different set of a and {gamma}. For example, fitting this formula to our explicitchain P({Lambda}D) results for lc=12 and rc=8 Å yields the parameter values of a=0.60 and {gamma} =4.5 x 10–33; but in that case the exceedingly small value of {gamma} would not conform to the TSM picture of Makarov and Plaxco (2003), who have related a {gamma} ~ 10–5 value to "the extra entropy associated with the first few ordering events" (Makarov and Plaxco 2003, p. 21). More fundamentally, however, the very small values of the present explicitchain P({Lambda}D) values strongly support the argument against the TSM speculation that the "entropic cost of finding the native topomer may be reasonable even in the absence of native-like interactions that may favor this set of conformations" (Makarov and Plaxco 2003, p. 22). Quite to the contrary, the vanishingly small native topomer probabilities we obtained imply that an unbiased, diffusive search for the native topomer among all accessible conformations, like the hypothetical Levinthal search, is highly unlikely to succeed and is therefore not a viable mechanism for apparent twostate folding.

{phi}-values and rate-limiting formation of the native topomer
In the TSM picture, the rate-limiting step of folding is the search for the native topomer. Thus, TSM predicts how the conformational ensemble of any given protein should look at the point when the rate-limiting step is achieved during the folding process. In principle, these structural predictions of TSM can be independently verified or falsified by experiments, irrespective of the mathematical/physical question raised above regarding whether the TSM formula provides the correct probability for the hypothetical native topomer state.

The native topomer state has been identified with the folding transition state. According to Makarov et al. (2002), the TSM model "assumes that the folding rate is controlled by the rate of forming all the native contacts observed in the folded protein" (note that these "contacts" refer to native topomer contacts that require a contacting pair to be separated by at least a certain number of residues along the chain), and that the state created when all contacts that define the native topomer is formed is a "transition state" of the model (Makarov et al. 2002, pp. 3536, 3539). Therefore, it is instructive to compare the structural characteristics of the native topomer state with the folding transition state properties inferred by experimental techniques.

A common approach to characterize the transition state ensemble for two-state proteins is through {phi}-value analysis (Fersht et al. 1992). In these experiments, collection of single amino acid mutations is performed and their effects on the folding rate and stability are measured. The goal of {phi}-value analysis is to map out the degree of "interaction" of the mutated amino acids in the transition state. The final result for an amino acid is expressed as a {phi}-value, which states whether the amino acid is "ordered" ({phi} {approx} 1) or "disordered" ({phi} {approx} 0) in the transition state ensemble. Notwithstanding the recent controversy about the interpretation of experimental {phi}-values (Sanchez and Kiefhaber 2003; Fersht 2004; Hubner et al. 2004), {phi}-value analysis is a widely adopted approach. Thus, it is worthwhile to delineate the logical relationship between experimental {phi}-values and the corresponding quantities implied by TSM, even though {phi}-values were not computed by Makarov et al. themselves.

It is relatively straightforward to use the conformational ensembles from our simulations of the hypothetical native topomers to calculate TSM-predicted {phi}-values, which can then be compared with available experimental {phi}-value data for the four proteins studied here. Similar procedures have been applied before by other researchers to compare experimental {phi}-values and theoretical {phi}-values predicted by non-explicit-chain (Alm and Baker 1999; Galzitskaya and Finkelstein 1999; Muñoz and Eaton 1999) as well as explicit-chain (Clementi et al. 2000, 2003; Ejtehadi et al. 2004) models.

Before turning to more detailed calculations, several salient conclusions about TSM-predicted {phi}-values can be drawn by considering the general properties of the native topomer state (cf. Fig. 2Go). Any given amino acid residue making mostly sequence-distant native contacts in a protein’s pdb structurewill, because of the topomer constraints, havemanyof its spatially neighboringamino acid residues in the native structure also in proximity in the native topomer state. Consequently, TSM would predict a high {phi}-value for the given amino acid residue. In fact, if all of this amino acid residue’s native pdb contacts were sequence-distant, {phi} would be predicted byTSMto be {approx}1.Onthe other hand, an aminoacid residuemakingmostlylocal contacts, as mightbe the case for residues in {alpha}-helical regions, is likely to have a lower TSM-predicted {phi}-value because by definition the amino acid residues it contacted in the native pdb structure are not required to be in spatial proximity in the native topomer state.

To calculate the TSM-predicted {phi}-values, we use the following definition of {phi}-value for an amino acid residue i:


(4)

where <Ni> is the average number of native contacts made by amino acid residue i in the native topomer conformational ensemble, and Nn i is the total number of native pdb contacts for this amino acid residue. This definition is identical to that introduced by Vendruscolo et al. (2001) to construct transition-state ensembles from experimental data. In the present calculation, a contact is considered formed if rij<1.2rnij. Figure 5Go compares {phi}icalc and {phi}-values obtained from experiments (Itzhaki et al. 1995; Burton et al. 1997; Chiti et al. 1999; Ternström et al. 1999). As expected, the TSM-predicted {phi}icalc-values display a tendency to be lower in {alpha}-helical regions than in {beta}-sheet regions. It is quite clear from Figure 5Go that the TSM-predicted and experimental {phi}-values do not follow the same trend along the chain sequences. The correlation coefficient between {phi}icalc and {phi}iexp is <0.1 in each of the four cases, although there are isolated individual calculated {phi}-values that match the corresponding experimental values well. These results indicate that, at least for the four proteins studied, the TSM predictions about the structural characteristics of the rate-limiting transition- state conformational ensemble are very different from that obtained from common experimental {phi}-value analysis.

TSM "{psi}-values"
A more informative way to characterize the folding transition state than the {phi}-values for individual amino acid residues is to consider the participation of pairs of contacting amino acid residues in the transition state. We apply this to the conformational ensembles of the four native topomer states, and determine the probability {psi}ijcalc for each amino acid residue pair ij in the native contact set to be in contact in these topomer ensembles (Fig. 6Go). As before, a contact in these ensembles is defined by rij<1.2rnij. These TSM-predicted quantities are analogous to the -values in experimental analysis, wherein bi-histidine metal binding sites are introduced for selected pairs of amino acid residue positions, and the stability of folding pathways with the metal binding pair formed relative to other pathways is regulated by metal ion concentration. The resulting -value of a native contact ij, a number between 0 and 1, then represents the ratio of protein molecules that have the given metal binding site in the transition state (Sosnick and Krantz 2001).



View larger version (30K):
[in this window]
[in a new window]
 
Figure 6. Probabilities of native contacts being formed ({psi}ijcalc-values) in the lc=12, rc=8 Å topomer ensembles. The color scale goes from =0 (green) to =1 (red). The triangular regions above and below the main diagonal provide identical information.

 
Figure 6Go displays our computed {psi}ijcalc values for the lc=12, rc=8 Å native topomer ensembles. By TSM definition, the probability of being in contact is close to 1 for pairs ij that are more than lc steps from the diagonal i=j. It is thus a general TSM prediction that {psi}ijTSM {approx} for amino acid pairs ij that are distant in sequence. Not surprisingly, in line with the calculated {phi}-values, the TSM-predicted -values in Figure 6Go indicate that the {alpha}-helical regions are relatively more disordered in the native topomer states. {psi}-value analysis is a relatively novel method, and experimental data are quite limited thus far. None of our four proteins has been studied experimentally with this technique. Nonetheless, TSM predictions in Figure 6Go should be useful for future assessment when pertinent experimental data become available.

TSM and native-centric explicit-chain folding models
In the above, we have shown by explicit-chain sampling that the population of the hypothetical native topomer state as a fraction of all accessible conformations is dramatically lower than that originally stipulated by Makarov and Plaxco (2003), and that the structural properties of the native topomer states of the four proteins considered in this study do not appear to resemble that of the corresponding folding transition states inferred from conventional experimental {phi}-value analysis. To further elucidate the relationship between the TSM picture and explicit-chain dynamics, we now consider a class of native-centric models that allow for direct kinetic simulations of folding rates (Clementi et al. 2000; Koga and Takada 2001; Kaya and Chan 2003a).

One of the main justifications for the use of nativecentric models to study the folding process has been the empirically observed relationship between native topology and folding rate, emphasizing the important energetic information embodied in the native topology of natural proteins. This principle is, naturally, also central to the TSM. We therefore find it worthwhile to compare and to put into context these two types of approaches. In particular, we choose to compare the TSM with the Langevin dynamics version (Kaya and Chan 2003a) of the continuum C{alpha} model of Clementi et al. (2000), because a recent variation of this explicit-chain modeling construct has been shown to provide a good correlation between model-predicted and real folding rates of 16 apparent two-state proteins (Chavez et al. 2004).

Conformational search in these native-centric models is directed, not unbiased. The main energy term driving chain collapse and folding in this native-centric model consists of Lennard-Jones-like interactions between every pair of amino acid residues ij that satisfies |i – j|>3 and forms a contact in the native pdb structure (Clementi et al. 2000). We consider two different definitions for two residues to be a native contact pair in this model: (1) if their C{alpha}–C{alpha} separation in the native pdb structure is less than a certain cutoff distance rc; (2) if any two heavy (non-hydrogen) atoms, one from each residue, are within 4.5 Å ; this is identical to the definition used recently by Chavez et al. (2004). Definition (1) is sensitive to the choice of rc. It turns out that the thermodynamic behaviors of three of our four proteins do not appear to be two-state-like for rc=6 Å and rc=7 Å (data not shown). Here we focus only on the better-behaved rc=8 Å case. Computational runs of 109 simulation time steps are executed for each of the four proteins for the two different native contact sets we just described, using the potential energy function and Langevin dynamics simulation protocol detailed elsewhere (Equations 1, 2; see related discussion in Kaya and Chan 2003a). In the present investigation, the simulations are performed at these model proteins’ respective folding temperature Tf, the temperature at which the given chain model’s specific heat capacity is at its maximum.

Figure 7Go shows our simulated free-energy profiles in the fraction of native contacts Q. The location of the free-energy barrier along these profiles (the single peak between the unfolded and folded minima at low and high Q, respectively) varies slightly between the four proteins and the two different contact sets. The chain structures in these peak regions are commonly taken as the transition state conformations for these and similar native-centric explicit-chain models (Clementi et al. 2000; Nymeyer et al. 2000; Chavez et al. 2004). However, it has been pointed out that not all structural details of transition-state conformations can be captured by such simple procedures, because the factors determining whether a conformation belongs to the folding transition state can be complex (Du et al. 1998; Hubner et al. 2004). In other words, some true transition- state conformations of the model may reside outside the Q-based free-energy peak region. Nonetheless, inasmuch as kinetic transitions between different Q-values are not too discontinuous, conformations populating the free-energy barrier region are indicative of general features of the conformations associated with the rate-limiting step in folding (Kaya and Chan 2002). With this in mind, in the following discussion we refer to the conformations in the free-energy barrier regions in Figure 7Go as the putative transition states of the native-centric explicit-chain models.



View larger version (63K):
[in this window]
[in a new window]
 
Figure 7. Free-energy profiles for the four proteins are given by the negative logarithm of the conformational distribution P(Q) in Q, obtained in this study by simulations of native-centric models (Kaya and Chan 2003a) at T {approx} Tf. Results from two native contact sets are shown: (1) contact pairs are defined by native C{alpha}–C{alpha} distance rnij<8 Å (dashed curves); and (2) contact pairs are defined by the shortest spatial separation between non-hydrogen atoms in the two amino acid residues as described by Chavez et al. (2004) and in the text (solid curves). For comparison with the hypothetical TSM picture, the range of variation of <Q> of the corresponding native topomers across different (lc, rc) criteria (Table 3Go) is indicated by the shaded areas.

 
Figure 7Go shows that these putative transition-state conformation are found roughly around Q {approx} 0.4–0.6, and that they are quite dissimilar to the corresponding native topomer ensembles, which have very different <Q>-values (Fig. 7Go, shaded areas). For 1aps, 2ci2, and 1urn, in terms of Q, the native topomers are far more native-like than the explicit-chain putative transition states. In these cases, even if some of the true transition- state conformations do reside outside the Q-based free-energy peak region, it is highly unlikely that they would share many similarities with the native topomer ensembles that have Q-values essentially equal to (2ci2 and 1urn) or higher than (1aps) the Q-based native free-energy minima of the explicit-chain models (although the possibility cannot be ruled out entirely without an exhaustive determination of transition-state conformations in the explicit-chain models). For 1lmb, the situation is more complicated because of the strong dependence of <Q> on the cutoff parameters lc and rc. Taken together, these observations suggest strongly that the native topomer states do not generally correspond to the folding transition state in these models. Instead, they are more representative of conformations that are visited after the rate-limiting step in the folding of these explicit-chain models.

Finally, we compare the TSM rate-prediction formula kf {vprop} {Lambda}Da{Lambda}D and the folding rate results from the native-centric models. Following Chavez et al. (2004), we determine folding rates in the native-centric model at Tf using the direct Langevin dynamics simulation method of Kaya and Chan (2003a). Folding and unfolding rates are identical at the transition midpoint. Thus, the folding rate kf nc of the native-centric models at Tf is defined as the reciprocal of the mean first passage time (MFPT) from the beginning of an unfolded "phase" to the initiation of a folded phase, or vice versa, during the Langevin dynamics simulation [i.e., kf nc=(MFPT)–1]. The unfolded and folded phases correspond to Q≤Qu and Q≥Qf, respectively, where Qu and Qf are the unfolded (low Q) and folded (high Q) free-energy minima. The MFPT was determined by averaging over 40 (1aps) to 3225 (1lmb) folding/unfolding events recorded during 109 simulation time steps. Figure 8Go shows that the correlation between {Lambda}Da{Lambda}D and kf nc is good, which is perhaps not surprising given that both sets of results have been shown previously to correlate strongly with real folding rate data (Makarov and Plaxco 2003; Chavez et al. 2004). We note, however, that both the topomer search rate-prediction formula and the native-centric model predict 1urn to fold roughly an order of magnitude slower than 2ci2, but in fact the opposite is true (see Table 1Go).



View larger version (16K):
[in this window]
[in a new window]
 
Figure 8. Comparison between the TSM quantity {Lambda}Da{Lambda}D (a=0.86) (Makarov and Plaxco 2003) of the four proteins and the corresponding folding rate kf nc computed in this study from direct kinetic simulations of the native-centric Langevin dynamics models at each model’s Tf. Results for kf nc denoted by x and {dotsquare} are, respectively, for the native contact sets (1) and (2) specified in the caption for Fig. 7Go.

 
Despite the limitations of native-centric models with essentially pairwise additive interactions (which include the models used here) in their ability to capture experimental features of kinetic cooperativity when native stability is significantly higher than that at the transition midpoint (Kaya and Chan 2003a; Chan et al. 2004), comparing them with TSM predictions offers valuable insights. The predicted folding rates of the two approaches correlate well, but the native topomer states most likely do not correspond to the explicit-chain transition states. Moreover, as an unbiased search for the native topomer state is not viable (see above), it is not clear whether there is any definitive way to construct viable explicit-chain models that are consistent with the TSM. In any event, the results in Figures 7Go and 8Go indicate clearly that a good rate correlation by itself is far from sufficient to pin down the underlying folding mechanism.

For the native-centric explicit-chain model considered here, every point in the conformational space has an energetic bias toward the native state. Frustration in the model arises solely from chain connectivity and excluded-volume effects. As Q increases, the combined effects of these incremental energetic favorabilities and the reduced conformational freedom as the chain acquires an increasing number of favorable contacts result in a free-energy slope (with respect to Q) that becomes first zero and then negative at intermediate values of Q. In contrast, TSM predicts conformational characteristics that are much more native-like at the rate-limiting step of folding. In this connection, it has been pointed out that, as a consequence of experimentally observed protein folding cooperativity, "an excess of 90% of the native structure is required for the free energy of a typical single-domain protein to drop below zero" (Gillespie and Plaxco 2004, p. 855). However, the configurational position of the transition state is determined by the gradient of free energy with respect to the reaction coordinate(s), not by the value of free energy itself. As an illustration of this basic principle, we note that a chain model’s stability and cooperativity can be arbitrarily enhanced by adding an ad hoc favorable energy to the unique native structure as a whole without affecting the folding kinetics (Kaya and Chan 2003b). Therefore, the above observation of Gillespie and Plaxco does not necessarily imply that transition states of real, cooperative proteins have to be very close to the native state. In fact, such a proposal would appear to be inconsistent with many chevron predictions of intermediate solvent exposure of folding transition states (P