Protein Science
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


Protein Science (2004), 13:113-124. Published by Cold Spring Harbor Laboratory Press. Copyright © 2004 The Protein Society
This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Right arrow Citation Map
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Tiana, G.
Right arrow Articles by Colombo, G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Tiana, G.
Right arrow Articles by Colombo, G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us   Add to Digg   Add to Reddit   Add to Technorati  
What's this?

Understanding the determinants of stability and folding of small globular proteins from their energetics

Guido Tiana1,2, Fabio Simona1, Giacomo M.S. De Mori3, Ricardo A. Broglia1,2,4 and Giorgio Colombo3

1 Department of Physics, University of Milano, 20133 Milano, Italy
2 Istituto Nazionale Fisica Nucleare (INFN), 20133 Milano, Italy
3 Istituto di Chimica del Riconoscimento Molecolare, Consiglio Nazionale delle Richerche (CNR), 20131 Milano, Italy
4 The Niels Bohr Institute, 2100 Copenhagen, Denmark

Reprint requests to: G. Colombo, Istituto di Chimica del Riconoscimento Molecolare, CNR, via Mario Bianco 9, 20131 Milano, Italy; e-mail: giorgio. colombo{at}icrm.cnr.it; fax: 39-02-28500036.

(RECEIVED May 26, 2003; FINAL REVISION August 4, 2003; ACCEPTED September 29, 2003)

Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.03223804.


    Abstract
 TOP
 Abstract
 Introduction
 The method and its...
 Application to proteins
 {alpha}-Spectrin SH3 domain
 src SH3 domain
 Protein G
 Chymotrypsin inhibitor 2
 Discussion
 Materials and methods
 References
 
The results of minimal model calculations indicate that the stability and the kinetic accessibility of the native state of small globular proteins are controlled by few "hot" sites. By means of molecular dynamics simulations around the native conformation, which describe the protein and the surrounding solvent at the all-atom level, an accurate and compact energetic map of the native state of the protein is generated. This map is further simplified by means of an eigenvalue decomposition. The components of the eigenvector associated with the lowest eigenvalue indicate which hot sites are likely to be responsible for the stability and for the rapid folding of the protein. The comparison of the results of the model with the findings of mutagenesis experiments performed for four small proteins show that the eigenvalue decomposition method is able to identify between 60% and 80% of these (hot) sites.

Keywords: protein folding; protein stability; molecular dynamics; local elementary structures


    Introduction
 TOP
 Abstract
 Introduction
 The method and its...
 Application to proteins
 {alpha}-Spectrin SH3 domain
 src SH3 domain
 Protein G
 Chymotrypsin inhibitor 2
 Discussion
 Materials and methods
 References
 
The study of how the structure and stability of a protein is connected with its amino acid sequence has been, during recent years, a major focus of research. In particular, the problem of protein folding and its relation to stability has been studied through a variety of experimental and theoretical techniques. These studies have highlighted the central role the free energy landscape plays in determining the properties displayed by proteins. The discussion has then been turned to the problem of determining this landscape for specific proteins (Shea and Brooks 2001).

Experimental techniques have yielded detailed information on macroscopic features of protein dynamical behavior, such as folding times and stability, and on some specific issues at the level of amino acids, such as the sensitivity to mutations (cf. Fersht 1999). However, there is still no experimental procedure capable of providing insight at the amino acid level into either the folding process in its completeness or the stabilization determinants of proteins. To gain insight into these questions, one must turn to theoretical and computational methods.

Because of the high computational costs involved, realistic models that provide an exhaustive description of the free energy landscape have been used in the study of only short peptides. Ferrara and Caflish (2000), for instance, could reconstruct the whole free energy landscape of a small designed ß-sheet peptide with an all-atom representation of the solute and an implicit model of the solvent. Daura et al. (1998) were able to demonstrate the reversible folding of a small (seven residues) helix forming ß-peptide in methanol solvent using long molecular dynamics (MD) simulations.

On the other hand, minimal models (Chan and Dill 1991; Sali et al. 1994; Mirny and Shakhnovich 2001; Micheletti 2003) have provided important results about the general features of the free energy landscape of proteins. They give an approximate description of both the interaction energy among amino acids (usually through a contact potential encoded in a 20 x 20 matrix; see Miyazawa and Jernigan 1985), and of the entropy. By making use of lattice models, that is, describing the protein as a chain of beads occupying the vertices of a cubic lattice, it has been possible to understand the basic energetic properties that characterize a folding sequence, that is, a large gap (measured in terms of typical contact energies) between the energy of the native and of the lowest, structurally dissimilar state (Mirny and Shakhnovich 2001). Studying the space of sequences and the role mutations have on good folders, it has been shown that the existence or less of such a gap depends on the existence of few "hot" sites in the native conformation, occupied by highly conserved, strongly interacting amino acids (Tiana et al. 1998). Based on these results, it has been found that the process of folding takes place in a hierarchical fashion (Baldwin and Rose 1999; Yeh and Rousseau 2000; Levy et al. 2001), guided by local elementary structures (LESs), formed very early in the folding process and stabilized by the interaction among hot amino acids (Tiana and Broglia 2001a). The docking of LES leads to the overcoming of the highest free energy barrier encountered by the protein in the transition between the denatured to the native conformation, giving rise to the folding nucleus (FN; Abkevich et al. 1994), and thus to a conformation that sits in the energetic basin of attraction of the native state. Such hierarchical mechanism makes the folding process possible by efficiently squeezing out the entropy from the system. Although the extension of this solution to the case of real proteins is not expected to be straightforward, it depends on working out the (ab initio) interaction between single protein residues (Broglia and Tiana 2001).

Other approaches focus on the geometric features of the protein, providing a detailed picture of the structure and using simple approximations for the potential function. From the physical point of view, this means focusing the attention on the entropic part of the energy landscape, neglecting the energetics. An example of this approach is to be found in Plaxco et al. (1998), who correlate the folding rate of proteins with a "contact order" parameter accounting for the locality of the interactions. The Go model is another approximation that describes well the entropy of the protein chain, making a crude approximation on the interaction energy (Go 1975), being a contact function that assumes the value -1 if the contact is a native contact and zero otherwise.

In the present work, we use a simple novel approach to extract information concerning the folding process and the stability of proteins from the structure of their native state. It is solely based on a detailed description of the interaction energy and neglects the protein entropy. The rationale behind this choice of approach is that the stabilization energy is distributed quite unevenly in proteins, as testified by the fact that mutations in most of the sites of a protein have little effect on its folding properties, whereas there are few key (hot) sites (where the stabilization energy is concentrated) that are highly sensitive to mutations (Mirny et al. 1998; Tiana et al. 1998). These sites are those that, in the analysis performed with minimal models, build out the LESs that control the folding process (Tiana and Broglia 2001b). Consequently, by determining these sites, one can get insight not only into the mechanism that stabilizes the protein but also into the kinetics of the folding process.

The understanding of how the different amino acids determine the free energy of the protein, in particular which are the residues that play a key role in its the stability and kinetics, is of great interest, not only to unravel how the one-dimensional sequence of amino acids encodes for the three-dimensional native conformation but also to eventually design new proteins with specific tasks (see Dahiyat and Mayo 1997) or to modify existing proteins (Colombo and Merz 1999).

In the following sections, we will describe the protocol used for determining the hot sites of the protein. We shall first test the method on a lattice designed model proteins (The Method and Its Validation From a Minimal Model) and then apply it to four small proteins (Application to Proteins), characterized by different folds and representative of diverse structural motifs. The last two sections contain the discussion of the results and the conclusions.


    The method and its validation from a minimal model
 TOP
 Abstract
 Introduction
 The method and its...
 Application to proteins
 {alpha}-Spectrin SH3 domain
 src SH3 domain
 Protein G
 Chymotrypsin inhibitor 2
 Discussion
 Materials and methods
 References
 
The basic idea of the present analysis is to extract energetic information on the protein from MD simulations and, from it, to gain insight into the determinants of the stability of the native protein conformation and their influence on the folding process. The main information needed to achieve this goal is the interaction matrix Mij, calculated averaging the corresponding interaction energies, comprising all the nonbonded interresidue energy components (e.g., van der Waals and electrostatic), over a MD trajectory starting from the native conformation. The matrix Mij can be decomposed in eigenvalues, in the form


(1)

where N is the number of amino acids in the protein, {lambda}{alpha} is an eigenvalue, and are the components of the associated eigenvector. We assume that the eigenvectors are normalized to unity and, because Mij is symmetrical, all the eigenvalues are real.

For the sake of simplicity, we label the N eigenvalues in increasing order, so that {lambda}1 is the most negative. Accordingly, the different terms in the sum in equation 1Go approximate the real interaction energy Mij to an increasing extent, the first term containing the largest contribution to the stabilization of the native conformation. The components of the associated eigenvector indicate to which extent each amino acid participates to the stabilization. In other words, each term in equation 1Go accounts for an amount of energy {lambda}{alpha} that is shared among the different residues according to the corresponding eigenvector .

If the second eigenvalue {lambda}2 is much higher than {lambda}1, one can approximate the whole interaction matrix as


(2)

reducing the information needed to specify the interaction from N2 to N numbers.

Because the interaction patterns of real protein are particularly complicated ("frustrated," in the language of Frauenfelder and Wolynes 1994), one does not expect this approximation to be particularly accurate. On the other hand, it reduces a complicated two-body interaction into an effective interaction determined solely by a simple number (a generalized "charge," such as in the case of the electrostatic interaction) that can be ascribed to each amino acid. Thus, the approximation given in equation 2Go states that there are some amino acids that are strongly interacting and others that are weakly interacting tout court, depending on the charges and of the two amino acids. Consequently, if amino acid A strongly attracts amino acids B (e.g., both with large positive charge; unlike the case of electrostatic interaction, charges with the same sign attract each other and charges with opposite sign repel each other, because AI in equation 2Go is negative, being defined as the most negative of the eigenvalues), and amino acid B strongly attracts amino acid C (e.g., both with positive large charge), in the present approximation A strongly attracts C. The opposite is also true: If the interactions A–B and B–C are weak, the interaction A–C will be weak. On the other hand, the real interaction Mij could be more complicated than this, being able to account for any combination of attractive and repelling pairs (e.g., A attracts B, B attracts C, but A repels C). Consequently, the approximation introduced in equation 2Go is, in principle, rather crude. On the other hand, from the analysis of minimal models, it emerged that the folding and stabilization of small single domain proteins is essentially controlled by a FN displaying a rich network of contacts among the most strongly interacting residues (Abkevich et al. 1994; Tiana and Broglia 2001a,b) and thus containing a large fraction (~30% to 50%) of the stabilization energy of the system. This network arises from the fact that the FN results from the docking of the LES, in which each residue interacts favorably with many other residues belonging to the same or to one of the other LESs. Consequently, essentially each residue in the nucleus attracts each other, and can be easily described in terms of charges.

Residues belonging to the nucleus display a large charge, whereas the other residues carry a much lower charge. In other words, the presence of the nucleus is the physical reason why {lambda}1<<{lambda}2.

Summing up, the analysis of the eigenvalues spectrum provides information about the folding features of the protein. If {lambda}1 is much smaller (i.e., more negative) than the other eigenvalues, it means that the approximation introduced in equation 2Go is good for our purposes, and the protein displays a well-defined nucleus. Again, we should make the caveat that interactions in real proteins are very complicated, and even if the folding is essentially controlled by a FN, the approximation given in equation 2Go can never be quantitative.

In keeping with this caveat, in what follows, we will use the approximation introduced in equation 2Go not to describe the whole network of interactions within a given protein, but only to investigate the possible presence of a stabilizing nucleus and to describe the energetics of this nucleus. As a consequence, it is not necessary for the approximation to be useful to provide small error bars on all the sites of the protein, but only on those participating in the FN. In fact, the validity test for the method will be its ability to identify which sites of the protein are important for the stability of the native conformation of the protein and for the dynamics of the associated folding process (as measured by {varphi}-value analysis, see below).

In what follows we will also show that for all of the proteins studied, the energy of the nucleus is well approximated by equation 2Go. On the other hand, we will find that the overall correlation coefficient between eigenvector components and {varphi}-values is, as a rule, only qualitative as a consequence of the fact that we are concentrating on the energetics of the folded state of the protein, neglecting the entropic contribution due to the distribution of folding–unfolding events displayed by proteins in solution.

Furthermore, the method used in the present article, which is centered around the approximated relation introduced in equation 2Go, although being an application of the well-known principal component analysis (PCA), is not related in any way with the various kinds of conformational PCA, or "essential dynamics," present in the literature (Amadei et al. 1993), whose goal is to effectively decrease the dimensionality of the conformational space.

To test the above procedure, we have studied a 36mer lattice model sequence designed (making use of the contact energies given in Table 6 of Miyazawa and Jernigan [1985]) to fold to a unique conformation, the nucleus of which is known to be built out of the three LESs: 3–6, 11–14, and 27–30 (Tiana and Broglia 200la), which contains 47% of the stabilization energy. The spectrum of eigenvalues of the interaction matrix calculated in the native conformation is displayed as a solid curve in the upper part of Figure 1Go. For comparison, the spectrum associated with a random sequence is displayed in the lower part of the same figure. In the case of the designed sequence, the lowest eigenvalue is well separated from the others (i.e., {Delta}{lambda} = {lambda}2 - {lambda}1 = 113.6 in units of RTroom, where the average spacing between the others is = 13.3, leading to the ratio {Delta}{lambda}/ = 8.5), whereas in the case of the random sequence this is not true (i.e., {lambda}2 - {lambda}1 = 0.4, the average spacing between the others being 4.0). As discussed above, this difference reflects the fact that the designed sequence displays a nucleus of mutually interacting residues, where the stabilization energy is concentrated (i.e., 50% of the native energy is carried by 30% of the residues, i.e., those belonging to the LES that form the FN). On the other hand, in the random sequence there is a complicated pattern of attractive and repulsive contacts.



View larger version (18K):
[in this window]
[in a new window]
 
Figure 1. The negative eigenvalues of the interaction matrix in the lattice model. The eigenvalues are plotted in ascending order, for a folding sequence and for a random nonfolding sequence. The solid curve is associated with the interaction in the native conformation. The dashed curve displays the eigenvalues of the average interaction matrix, calculated starting from the native conformation, and is allowed to fluctuate for 104 steps at the folding temperature, whereas the dotted curve is associated with the average interaction, starting from a random conformation and letting the system evolve for 104 steps.

 
Because for real proteins the equilibrium state, at the length scale of single amino acids, is an ensemble of different conformations (Frauenfelder et al. 1988) sharing the same overall topology, we have repeated the calculation, letting the system fluctuate among these conformations. This is done by performing a Monte Carlo search (104 MC steps) at folding temperature and calculating the interaction energy between pairs of amino acids averaged over these Monte Carlo steps. The resulting eigenvalue spectrum, displayed as a dashed curve in Figure 1Go, is qualitatively equal to that calculated in the unique native conformation. In particular, the FN energy is -5.90 in the present case, compared with -7.80 in the case of considering a unique conformation.

A difference between the eigenvalue spectra of the designed and of the random sequence is detectable also for interaction matrices calculated starting from random conformations and performing the average over a time that is much smaller than the overall folding time (103 MC steps, compared to a characteristic folding time of 8 x 105 MC steps for the folding sequence). The results are displayed as a dotted curve in Figure 1Go. For the folding sequence {lambda}2 - {lambda}1 = 43.2, to be compared to the average spacing of 7.6, whereas for the random sequence {lambda}2 - {lambda}1 = 1.1 and the average spacing is 2.6. The reason for this behavior is that LESs, which are part of the nucleus and carry some of the stabilization energy of the protein, are formed at the very early stages of the folding process of good sequences. On the other hand, random sequences undergo a disordered collapse, not displaying any internal structure. The details concerning how the stabilization energy is distributed among the amino acids are contained in the normalized eigenvector associated to the lowest eigenvalue. In Figure 2Go are displayed the components of such an eigenvector in the case of the interaction matrix calculated in the native conformation of the designed sequence (corresponding to the solid curve in the upper part of Fig. 1Go). The plots corresponding to the other two curves in the upper part of Figure 1Go are similar. The plot displays six peaks in correspondence to the amino acids building the LESs (residues 3, 6, 11, 14, 27, and 30) and one (residue 16) that, although not participating in a LES, interacts strongly with the FN. Anyway, these seven sites are exactly those that, if mutated, lead to the denaturation of the protein, and are called hot and warm sites in the article (Tiana et al. 1998).



View larger version (18K):
[in this window]
[in a new window]
 
Figure 2. The eigenvector associated with the largest eigenvalue for the lattice-model folding sequence (corresponding to the first point of the solid curve in the upper part of Figure 1Go).

 

    Application to proteins
 TOP
 Abstract
 Introduction
 The method and its...
 Application to proteins
 {alpha}-Spectrin SH3 domain
 src SH3 domain
 Protein G
 Chymotrypsin inhibitor 2
 Discussion
 Materials and methods
 References
 
Four experimentally well-characterized protein domains, displaying different three-dimensional structures, were used to test the model with realistic all-atom simulations, including an atomic representation of the solvent: the {alpha}-spectrin SH3 domain (57 residues), the src SH3 domain (60 residues), the IgG binding domain of protein G (herein called simply "protein G"; 56 residues), and Chymotrypsin inhibitor 2 (CI2; 65 residues). All of these systems have been studied in depth in terms of the mutations needed to (de) stabilize their folded states and to influence their folding kinetics. Most of these studies were based on an extensive experimental analysis analyzing the role of point mutations and on the characterization of the effects of every single mutation on the thermodynamic stability and on the kinetic properties of the protein. The ability to identify the mutation sites via a fast and simple computational method thus becomes an extremely appealing feature of all-atom molecular simulations. For the sake of simplicity in the discussion of the eigenvalue and of the eigenvector properties, the numbering followed here for each protein starts from residue 1.

The four systems mentioned above were analyzed through 10-nsec-long all-atom MD simulations, using an explicit representation of the solvent and the Particle Mesh Ewald method to calculate electrostatic interactions, in order to avoid cut-off induced artifacts.

The lower part of the eigenvalue spectrum, consisting on the first 20 eigenvalues for each of the four proteins, is displayed in Figure 3Go, A–D. For all the proteins examined, the separation {Delta}{lambda} between eigenvalues {lambda}1 and {lambda}2 is much larger than the average spacing between the other eigenvalues (Table 1Go), as in the case of the lattice model protein discussed above. Only in the case of CI2, this property becomes marginal, and in particular, the first three eigenvalues are almost equally spaced.



View larger version (18K):
[in this window]
[in a new window]
 
Figure 3. The negative eigenvalues of the interaction matrix for the all-atom MD simulations. The eigenvalues are plotted in ascending order: filled circles refer to the {alpha}-spectrin SH3 domain; diamonds, to the src SH3 domain; open circles, to the IgG binding domain of protein G; and filled triangles, to CI2.

 

View this table:
[in this window]
[in a new window]
 
Table 1. The difference {Delta}{lambda} between the first two eigenvalues, the average spacing {lambda} between the other eigenvalues, and their ratio
 
The reason of why the ratio between {Delta}{lambda} and is somewhat smaller for the four real proteins than for the idealized lattice protein are as follows: (1) the use of a general-purpose force field, which has not been optimally tuned for the particular fold of the proteins under study; (2) the noise present in MD simulations due to the numerical approximations in the integration steps; and (3) real proteins can concentrate energy in clusters of amino acids for reasons different than their folding and stabilization properties, such as for binding other molecules, for enzymatic activity, etc.


    {alpha}-Spectrin SH3 domain
 TOP
 Abstract
 Introduction
 The method and its...
 Application to proteins
 {alpha}-Spectrin SH3 domain
 src SH3 domain
 Protein G
 Chymotrypsin inhibitor 2
 Discussion
 Materials and methods
 References
 
The spectrum of eigenvalues for {alpha}-spectrin SH3 domain is reported in Figure 3AGo (filled circles). In this case, the difference between the first two lowest eigenvectors is {Delta}{lambda} = 20.9, which means five times the average spacing = 4.2.

In Figure 4AGo are displayed the components of the normalized eigenvector associated with the lowest eigenvalue, components that indicate how the stabilization energy is distributed among the amino acids of the {alpha}-spectrin SH3 domain. The plot displays particularly large values of the components in the intervals 10–20, 34–39, and 46–49 (in particular, the maximum of the peaks are at sites 10, 14, 18, 36, 47, and 48).



View larger version (24K):
[in this window]
[in a new window]
 
Figure 4. The eigenvectors associated with the largest eigenvalue in the all-atom MD simulations: {alpha}-spectrin (A), src SH3 domain (B), IgG-binding domain (C), and CI2 (D).

 
This protein has been extensively studied by Martinez et al. (1998). By means of point mutations, they showed that residues V18, V39, and V48 are important for the stability of the protein, as their mutation leads to a strong destabilization of the native state (Table 2Go) with {Delta}{Delta}GN-U values ranging between 7.5 and 12.9 kJ/mole. These sites are indeed among those showing, in our analysis, a large value of the eigenvector component. This is a straightforward result, considering that the components of the eigenvector express the degree to which a given residue participates in the stability of the protein.


View this table:
[in this window]
[in a new window]
 
Table 2. Effect of point mutations on the amino acids of {alpha}-spectrin SH3 domain, as found by Martinez et al. (1998)
 
A more striking result is that the residues that play a key role in the kinetics display large values of the eigenvector components. In fact, through the analysis of {varphi}-values, Martinez et al. (1998) showed that the formation of the distal hairpin (residues 38–48) and the anchoring of the second strand of the RT loop (residues 18–19) are determinant in the kinetics of folding. The highest peak in Figure 4AGo corresponds to residue VI8 (RT loop), whereas the two peaks centered at residues W36 and V48 correspond to the sites stabilizing the distal hairpin. The summary of stability and kinetic data is listed in Table 2Go.

The connection between the energetic properties of a residue and its role in the folding kinetics is not unexpected and can be easily rationalized in terms of stabilization of LESs and their subsequent assembly into the FN (Tiana and Broglia 2001a). Such LESs, being the elements that lead the folding process, have to be strongly stabilized at the very early stages of the kinetics and, consequently, must display large values of the eigenvector components. In the case of {alpha}-spectrin SH3, the distal hairpin plays the role of LES and, interacting with the second strand of the RT loop, forms the FN (as discussed by Martinez et al. [1998]).

According to Figure 4AGo, the first strand of the RT loop (residues 10–15) also plays an important role, if not for the folding and stability of the whole protein at least for the stabilization of its N-terminal region.

Summing up, the eigenvector components are useful in locating hot sites. The performance of the method is assessed by evaluating how many hot sites it is able to identify. To provide a consistent definition of hot site on the basis of the components of the eigenvector (and, for comparison, of {varphi}-values), we calculate the most probable value of the eigenvector components distribution and define as hot all the sites of the protein displaying an eigenvector component larger than the most likely value. The distribution of eigenvector components displays a peak and a long tail, and consequently, the most probable value is not the average but the one associated to the top of the distribution. On the other hand, because {varphi}-values are not calculated in all sites, the statistics is low and it is difficult to define the most likely value from the {varphi}-values distribution. In this case, hot sites are defined as those displaying {varphi}-values above the average.

For {alpha}-spectrin SH3, the eigenvector method is able to identify three hot sites out of four, displaying two false positives (Table 3Go). One of the two false positives (site 38) is at the borderline of being a hot site, displaying a {varphi}-value of 0.37, when the threshold value we take as threshold for this protein is 0.4. The other false positive (site T19) is directly bound to a correctly predicted hot site (site 18), with its motions and its interactions strongly coupled to those of residue 18. It is then conceivable that perturbing residues in the region of a hot spot should have definite consequences on the protein stability.


View this table:
[in this window]
[in a new window]
 
Table 3. Summary of the correctly predicted hot spots and false positives, and of the energetic contribution of hot spots to the protein stabilization
 
Note that the overall correlation coefficient is quite low, giving a value of 0.45. This is an expected result, because the approximation used is meant to describe only hot sites, not the rest of the protein.

It is also interesting to study the energetics properties of the FN, defined as the set of sites with a numerical value of the first eigenvector that is higher than the most probable value (Table 3Go). This shares a stabilization energy of -1334 kJ/mole (Table 3Go). This is equivalent to the 35% of the overall stabilization energy of the protein. The energy of the nucleus evaluated through the approximation of equation 2Go is -1209 kJ/mole, equivalent to a relative error of 9%.


    src SH3 domain
 TOP
 Abstract
 Introduction
 The method and its...
 Application to proteins
 {alpha}-Spectrin SH3 domain
 src SH3 domain
 Protein G
 Chymotrypsin inhibitor 2
 Discussion
 Materials and methods
 References
 
The SH3 domain of src protein displays the same fold as the {alpha}-spectrin domain discussed above, although its sequence homology is only 36%. Also, for src SH3 the lowest eigenvalue is well separated from the others, being {Delta}{lambda} = 18.7 kJ/mole and the average separation = 4.0 kJ/mole (see Fig. 3CGo). The components of the corresponding eigenvectors are displayed in Figure 5Go.



View larger version (25K):
[in this window]
[in a new window]
 
Figure 5. (Bottom) The {varphi}-values of src SH3 (solid curve) and the corresponding components of the eigenvector corresponding to the lowest eigenvalue (dashed curve). The curve for the eigenvector components is presented in a blown up form (multiplied by four). (Top) The change in free energy upon mutations. Thicker axis lines indicate the areas important for stability.

 
Baker and coworkers performed a mutation analysis on src SH3 domain (Grantcharova et al. 1998; Riddle et al. 1999), concluding that mutating residues G34, W36, 150, and Y54 to alanine causes a destabilization of the native state ranging from 5.8 to 16.7 kJ/mole. Furthermore, they showed that residues D9, Y10, S12, and L18 are important for protein stability, in that their mutations to alanine causes destabilizations ranging between 4.1 and 17.1 kJ/mole (Grantcharova et al. 1998). Also, in this case there is a good agreement between the sites playing a major role in the stability and the kinetics of the protein and the peaks in the components of the eigenvector (residues 10–20, 35–37, 49–51, and 54; Fig. 5Go). Only site 23 is found in experiments to be important for the stability of the protein even though it displays a small component in the eigenvector. This could be due to the fact that this site is occupied by a glycine that, lacking the side chain, may display a lower number of interactions with the flanking amino acids in MD simulations.

The comparison of the eigenvectors associated with spectrin and src SH3 domains shows a remarkable similarity, although the sequences are rather different (only 36% homology). This fact indicates a stronger evolutionary relationship between the two proteins than what the mere comparison of sequences would indicate. In other words, {alpha}-spectrin and src SH3 domains have diverged enough to change 74% of their amino acids but not enough to mutate their topologies and energetic pattern and, consequently, their folding mechanism.

This aspect is reflected by the chemical nature of the residues at the hot sites of the two proteins. Both SH3 domains display peaks corresponding to acidic (D9) and hydrophobic (VI8 for {alpha}-spectrin SH3 and LI8 for src SH3) residues in the N-terminal region, and to sterically demanding hydrophobic (W36 for {alpha}-spectrin SH3, W37 for src SH3) together with aliphatic (V48 for cu-spectrin SH3, 150 for src SH3) at the C-terminal. Moreover, these hot sites are part of a mostly hydrophobic core coincident with the mechanical nucleus defined by Martinez et al. (1998), formed during the folding process of the two domains studied. These investigators showed that passing through the transition state barrier in SH3 domains requires the formation of a defined structure with little conformational variability in well-defined regions, such as the ones identified here. It is worth noting that the regions rich in hot sites, involving residues 34–39 and 46–49, are the peptide ligand binding regions, indicating that the conservation of the pattern also assumes a functional role. We can also conclude that the analysis of the energetic patterns of proteins sharing the same fold (topology) could be used to assess their evolutionary relationships in a way that is more efficient than is the comparison of the sequences, because proteins have some degree of freedom of mutating their amino acids (even the key amino acids) for functional reasons, whereas the energetic pattern has to remain similar.

A summary of the parameters providing information on the validity of the method is given in Table 3Go. Repeating the same procedure as that used for the analysis of {alpha}-spectrin SH3, we are able to correctly identify 13 out of 17 hot sites. In this case there are six false positives, but three of them are consecutive to hot sites (17, 19, and 35), and this determines strong steric interactions with the hot spot regions, resulting in high component values in the first eigenvector. Also, in the case of this protein, it is most likely that the introduction of a mutation in the position of one false positive can be effective in changing the stability properties of the domain. The energy of the nucleus is -1661 kJ/mole (that is the 43% of the overall energy), which has to be compared with the approximated value of -1322 kJ/mole, equivalent to a relative error of 20%.


    Protein G
 TOP
 Abstract
 Introduction
 The method and its...
 Application to proteins
 {alpha}-Spectrin SH3 domain
 src SH3 domain
 Protein G
 Chymotrypsin inhibitor 2
 Discussion
 Materials and methods
 References
 
Protein G is particularly suitable for our analysis because it has been extensively studied by McCallister et al. (2000), who measured the effect of mutations in many sites of the protein.

The {varphi}-values of protein G are shown in Figure 6Go and display a high peak in the interval 45–52 and two small peaks around residues 3 and 22. Moreover, residues D22, A26, and F30 are important for the stability of the native state; their mutation to A causes a decrease of the stabilization free energy of 7.3, 12.3, and 5.9 kJ/mole, respectively.



View larger version (20K):
[in this window]
[in a new window]
 
Figure 6. Same as Figure 5Go, for protein G. (Bottom) The {varphi}-curve (solid line) is multiplied by two for clarity.

 
The spectrum of eigenvalues for protein G, as in the case of SH3 domains, displays a gap between the lowest and the next-to-lowest level {Delta}{lambda} = 15.7 kJ/mole, larger than the average distance between the others = 4.1. The components of the associated eigenvector are also displayed in Figure 6Go (dashed curve). The highest peak in this plot matches the peak in the {varphi}-values at sites 45–52, corresponding to the second hairpin with a formation that is critical to the folding mechanism (Kim et al. 2000). Moreover, the other minor peaks in the eigenvector components at sites 3–7, D22, A26, and V29 correspond to the others sites relevant to the stability and to the kinetics (Fig. 6Go).

Some of the residues relevant for stabilization are also involved in the definition of the binding region of the IgG-binding domain from protein G. In fact, mutations of residues in the region 27–32 and 43–46 cause major variations in the binding constant of human immunoglobulin Fc fragment (Sloan and Hellinga 1999). In particular, the binding properties of the region 27–32 explain the peak in the associated eigenvector components that, from the analysis of the mere folding behavior, has to be considered a false positive.

Also, for proteins G the outcome of the model is summarized in Table 3Go. For protein G the method can identify five hot sites over seven with three false positives (sites 22, 45, and 52; the second and the third being consecutive to hot sites). The energy of the nucleus is -1001 kJ/mole (i.e., 28% of the overall energy), whereas the approximated value is -966 kJ/mole, equivalent to a relative error of 3%.


    Chymotrypsin inhibitor 2
 TOP
 Abstract
 Introduction
 The method and its...
 Application to proteins
 {alpha}-Spectrin SH3 domain
 src SH3 domain
 Protein G
 Chymotrypsin inhibitor 2
 Discussion
 Materials and methods
 References
 
CI2 is a 64-residue polypeptide inhibitor of serine proteases (Bycroft et al. 1991). CI2 is a single module of structure: The interatomic interactions are quite uniform over the structure, and they do not segregate into regions that make more tertiary interactions within themselves than they do with neighboring atoms. As such, CI2 can be considered a basic folding unit or foldon (Panchenko et al. 1996). This is reflected by the fact that most {varphi}-values of this protein are fractionary (Jackson et al. 1993).

This peculiarity of CI2 is underlined both by the fact that the difference between the first two lowest eigenvalues is {Delta}{lambda} = 13.2 kJ/mole, the average spacing being = 3.9, and by the absence of well defined isolated peaks in the components of the eigenvector associated to the first eigenvalue.

Nonetheless, a careful inspection of Figure 7DGo shows that the stabilizing regions of the molecule are located around residues 10–20 and 40–50, corresponding to the helix of the N-terminal region making contacts with residues located on a ß-sheet at the opposite end of the polypeptide. Mutations in these regions cause destabilization of the folded state between 4.0 and 20.6 kJ/mole. The long-range interactions between these regions are those required to stabilize the helix in peptide fragments of CI2, as shown by Gay et al. (1995a) in a study on the stability of the secondary structure of peptides isolated from CI2. Moreover, residues around position 16, 49, and 57 constitute a nucleation site, the disruption of which leads to loss of stability and fast unfolding of the protein at high-temperature conditions (Gay et al. 1995b; Kazmirski et al. 1995; Neira et al. 1997).



View larger version (27K):
[in this window]
[in a new window]
 
Figure 7. Same as Figure 5Go, for CI2. (Bottom) The {varphi}-curve (solid line) is multiplied by two for clarity.

 
As shown in Table 3Go, the method can identify eight hot sites over a total of 12. The number of false positives is 13. We attribute this behavior to the conformational flexibility of CI2, also present in the transition state ensemble as seen by the high number of fractionary {varphi}-values, that determines a wide network of intramolecular interactions in the folded state of the protein. The energy of the nucleus is predicted to be -1690 kJ/mole, with an error of 6% and equivalent to 44% of the overall stabilization energy.


    Discussion
 TOP
 Abstract
 Introduction
 The method and its...
 Application to proteins
 {alpha}-Spectrin SH3 domain
 src SH3 domain
 Protein G
 Chymotrypsin inhibitor 2
 Discussion
 Materials and methods
 References
 
By simulating the equilibrium fluctuations of a protein around its native conformation through MD simulations, it is possible to investigate the energetic pattern (i.e., interaction energy, not free energy) that defines the native state and contributes to make it a minimum in the free energy landscape. A feature common to all proteins that emerges from such calculations is that, in each protein, few residues build clusters of strong interactions, surrounded by the other, weakly interacting residues (Fig. 8Go). This fact can be mathematically assessed, noting that the lowest eigenvalue of the residue–residue interaction matrix is well separated from the others, and the clusters can be detected searching for the peaks in the components of the associated eigenvector.



View larger version (69K):
[in this window]
[in a new window]
 
Figure 8. The hot sites mapped on the three dimensional structure for {alpha}-spectrin SH3 domain (A), src SH3 domain (B), protein G (C), and CI2 (D). The hot sites are part of the folding nucleus.

 
The presence of a strongly interacting nucleus is not unexpected and has been observed in minimal lattice models of proteins (Abkevich et al. 1994; Tiana et al. 1998; Tiana and Broglia 2001a; Micheletti 2003). This nucleus not only gives account for the all-or-none character of the folding transition (Mirny and Shakhnovich 2001) and for its remarkable tolerance to point mutations (Tiana et al. 1998), but it is also responsible for the fast folding of the protein. In fact, the hierarchical assembly of LESs, composed of strongly interacting residues lying close along the chain, has been shown to drive the protein to the native state in the case of simple models (see Introduction). This structure is necessarily associated with a cluster of strong interactions composed of those that stabilize the LES and those that stabilize the nucleus, keeping together the LESs.

Minimal model systems are ideally designed such that the same nucleus residues determine both the folding mechanism and the stability of the native over the unfolded state: The kinetic and thermodynamic determinants of folding are, as a consequence, coincident in these simplified representations.

In real proteins this might not necessarily be the case, as the pathways to the native state can be mediated by interactions with the solvent, by stereospecific interactions among the side chains, etc. However, at least in the case of single domain proteins such as the ones considered here, {varphi}-value analysis has proved that a localized cluster of residues is determinant for both the stability and the kinetics of folding. This is not to be considered a totally unexpected phenomenon, if we consider that folding transition states and intermediates show a high degree of structural similarity to the native states (Snow et al. 2002).

The analysis of the components of the eigenvector associated with the lowest eigenvalue of the native-state interaction matrix is helpful in determining such a cluster. Consequently, the sites individuated by this analysis play a major role in the kinetics and stability of the protein (Fig. 8Go), as shown by the good agreement between the plot of the eigenvector components and the plots displaying {varphi}-values and change in stability upon mutations. From this analysis, it is not possible to tell which of the sites indicated by the eigenvector are connected with stability, which with kinetics, and which with both of them. The eigenvector contain a superposition of the two pieces of information, and only an analysis of the entropy of the system can separate them. From this point of view, the present analysis consists in an approximation of the free energy of the native state in which the entropic term is neglected, and consequently, it is quite obvious that the results it can give are partial.

To be noted that the average spacing between the peaks in the eigenvector component of all the proteins we analyzed is ~19 residues. This is in accordance with the findings of Berezovsky et al. (2001) and Berezovsky and Trifonov (2002), who analyze the size of closed loops of the backbone of a large number of proteins and observe an average ranging from 25 to 30 residues, a size that is optimal from the point of view of a fast search in conformational space. We suggest that the loops found by Berezovsky et al. (2001) and Berezovsky and Trifonov (2002) could be related to the loops between LESs discussed above (Tiana and Broglia 2001a).

Two caveats to the method presented here need to be highlighted. First, the MD simulations with explicit water we used produce an interaction matrix that accounts for the van der Waals energy, for the electrostatic energy, and for the hydrogen bonds energy, but misses the hydrophobic interaction. This is, in fact, an effective interaction that arises from the averaging of the solvent degrees of freedom and does not appear in the potential function of the system. The fact that we obtain good agreement between these model calculations and the experimental results indicates that the hydrophobic interaction does not play a major role in determining the LESs, but only in the generic collapse of the chain into a globule (which is not considered here). Another problem is the choice of the time duration of the simulations. They have to be long enough to average the uncertainties in the structure connected with the experimental determination of the native conformation and to be equilibrated. Running simulations in which multiple folding-unfolding events could be observed would give access to the entropic term that has been neglected in the analysis, but reaching the timescales required for that purpose is currently out of reach even for small proteins. We have nonetheless performed simulations ranging from 2 to 15 nsec, observing no qualitative differences between the different cases belonging to this range.

From the practical point of view, the analysis of the components of the eigenvector associated with the lowest eigenvalue of the native-state interaction matrix can be helpful in determining the positions or the regions where mutations have to be introduced in order to change the stability, and connected with it, the dynamical features of the folded protein. Once the hot sites have been identified, saturation mutagenesis experiments can be focused on that region to modify the properties of the protein under study. In many cases (Liebeton et al. 2000) hot sites have been identified through several cycles of error-prone mutagenesis and analysis, which relies on the random introduction of mutations in different parts of the protein, and subsequent saturation mutagenesis on the above defined hot sites. This procedure, despite being quite efficient and successful, is not based on a rational approach and might be time consuming. Previous MD studies (Li and Daggett 1994) have yielded a reasonable agreement with experimentally determined points of mutation in CI2 by using a geometrical approach based on the relative number of contacts of one particular residue in the native and in the transition state ensembles. This approach, however, relies on the generation via MD of the ensemble of transition state structures based on high-temperature unfolding simulations.

In the case of our method, the straightforward application of realistic MD simulations of the native state followed by the analysis of the eigenvectors of the energy interaction matrix can be a useful help to a rational protein design, without the need to generate transition state structural ensembles.

Conclusions
It is possible to get information on the sites of a protein that contribute most to the stability and to the accessibility of the native state by mean of a simple energetic analysis of MD simulations, simulations that can be performed even for large proteins just with the help of a personal computer. This allows us not only to gain insight into the folding pathways of a given protein but also to be able to predict sites where to focus mutagenesis experiments.


    Materials and methods
 TOP
 Abstract
 Introduction
 The method and its...
 Application to proteins
 {alpha}-Spectrin SH3 domain
 src SH3 domain
 Protein G
 Chymotrypsin inhibitor 2
 Discussion
 Materials and methods
 References
 
The starting structure for the five all-atom MD simulations were taken from the Protein Data Bank: 1SHG [PDB] .pdb for {alpha}-spectrin SH3 domain, 1FMK [PDB] .pdb for src SH3 domain, 1PGB [PDB] .pdb for protein G, and 2CI2 [PDB] .pdb for CI2. The proteins were protonated to give a zwitterionic form (with N-terminal and C-terminal COO- groups) with the carboxyl NH side-chain groups in their charged states. The total charges on the proteins resulted in +1 for {alpha}-spectrin SH3 domain, -4 for src SH3 domain, protein G, and -1 for CI2. The proteins were solvated with water in a octahedral box large enough to contain 1.2 nm of solvent around the peptide. The simple point charge (SPC) water model was used (Berendsen et al. 1987) was used to solvate each protein in the simulation box.

In all simulations, suitably charged counterions were used to yield an electrically neutral system. Each system was subsequently energy-minimized with a steepest descent method for 1000 steps. The calculation of electrostatic forces used the PME implementation of the Ewald summation method. The LINCS algorithm (Hess et al. 1997) was used to constrain all bond lengths. For the water molecules the SETTLE algorithm (Miyamoto and Kollman 1992) was used. A dielectric permittivity, {varepsilon} = 1, and a time step of 2 fsec were used. All atoms were given an initial velocity obtained from a Maxwellian distribution at the desired initial temperature of 300 K. The density of the system was adjusted, performing the first equilibration runs at NPT condition by weak coupling to a bath of constant pressure (P0 = one bar, coupling time {tau}p = 0.5 psec; Berendsen et al. 1984).

In all simulations the temperature was maintained close to the intended values by weak coupling to an external temperature bath (Berendsen et al. 1984) with a coupling constant of 0.1 psec. The peptide and the rest of the system were coupled separately to the temperature bath. Each of the five MD simulations was extended to 10 nsec.

All simulations and analysis were carried out by using the GROMACS package (version 3.0; van der Spoel et al. 1994), using the GROMOS96 43A1 forcefield (van Gunsteren et al. 1998). All calculations were performed on clusters of personal computers, with the Linux operating system.


    Acknowledgments
 
We gratefully ackowledge Dr. Massimiliano Meli for technical support. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.


    References
 TOP
 Abstract
 Introduction
 The method and its...
 Application to proteins
 {alpha}-Spectrin SH3 domain
 src SH3 domain
 Protein G
 Chymotrypsin inhibitor 2
 Discussion
 Materials and methods
 References
 
Abkevich, V.I., Gutin, A.M., and Shakhnovich, E.I. 1994. Specific nucleus as the transition state for protein folding: Evidence from the lattice model. Biochemistry 22: 10026–10036.

Amadei, A., Linssen, A.B.M., and Berendsen, H.J.C. 1993. Essential dynamics of proteins. Proteins 17: 412–425.[CrossRef][Medline]

Baldwin, R.L. and Rose, G.D. 1999. Is protein folding hierarchic? I: Local structure and peptide folding. Trends Biochem. Sci. 24: 26–33.[CrossRef][Medline]

Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F., Di Nola, A., and Haak, J.R. 1984. Molecular dynamics with coupling to an external bath. J. Chem. Phys. 81: 3684.[CrossRef]

Berendsen, H.J.C., Grigera, J.R., and Straatsma, T.P. 1987. The missing term in effective pair potentials. J. Phys. Chem. 91: 6269–6271.[CrossRef]

Berezovsky, I.N. and Trifonov, E.N. 2002. Loop fold structure of proteins: Resolution of Levinthal’s paradox. J. Biomol. Struct. Dyn. 20: 5–6.[Medline]

Berezovsky, I.N., Kirzhner, V.M., Kirzhner, A., and Trifonov, E.N. 2001. Protein folding: Looping from hydrophobic nuclei. Proteins 45: 346–350.[CrossRef][Medline]

Broglia, R.A. and Tiana, G. 2001. Reading the three-dimensional structure of a protein from its amino acid sequence. Proteins 45: 421–427.[CrossRef][Medline]

Bycroft, M., Ludvigsen, S., Fersht, A.R., and Poulsen, F.M. 1991. Determination of the 3-dimensional solution structure of barnase using nuclear magnetic resonance spectroscopy. Biochemistry 30: 8697–8701.[CrossRef][Medline]

Chan, H.S. and Dill, K. 1991. Polymer principles in protein structure and stability. Annu. Rev. Biophys. Biophys. Chem. 20: 447–490.[CrossRef][Medline]

Colombo, G. and Merz Jr., K.M. 1999. Stability and activity of mesophilic subtilisin e and its thermophilic homolog: Insights from molecular dynamics simulations. J. Am. Chem. Soc. 121: 6895–6903.[CrossRef]

Dahiyat, B.I. and Mayo, S.L. 1997. De novo protein design: Fully automated sequence selection. Science 278: 82–87.[Abstract/Free Full Text]

Daura, X., Jaun, B., Seebach, D., van Gunsteren, W.F., and Mark, A.E. 1998. Reversible peptide folding in solution by molecular dynamics simulation. J. Mol. Biol. 280: 925–932.[CrossRef][Medline]