|
|
||||||||
1 Department of Physics, University of Milano, 20133 Milano, Italy
2 Istituto Nazionale Fisica Nucleare (INFN), 20133 Milano, Italy
3 Istituto di Chimica del Riconoscimento Molecolare, Consiglio Nazionale delle Richerche (CNR), 20131 Milano, Italy
4 The Niels Bohr Institute, 2100 Copenhagen, Denmark
Reprint requests to: G. Colombo, Istituto di Chimica del Riconoscimento Molecolare, CNR, via Mario Bianco 9, 20131 Milano, Italy; e-mail: giorgio. colombo{at}icrm.cnr.it; fax: 39-02-28500036.
(RECEIVED May 26, 2003; FINAL REVISION August 4, 2003; ACCEPTED September 29, 2003)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.03223804.
| Abstract |
|---|
|
|
|---|
Keywords: protein folding; protein stability; molecular dynamics; local elementary structures
| Introduction |
|---|
|
|
|---|
Experimental techniques have yielded detailed information on macroscopic features of protein dynamical behavior, such as folding times and stability, and on some specific issues at the level of amino acids, such as the sensitivity to mutations (cf. Fersht 1999). However, there is still no experimental procedure capable of providing insight at the amino acid level into either the folding process in its completeness or the stabilization determinants of proteins. To gain insight into these questions, one must turn to theoretical and computational methods.
Because of the high computational costs involved, realistic models that provide an exhaustive description of the free energy landscape have been used in the study of only short peptides. Ferrara and Caflish (2000), for instance, could reconstruct the whole free energy landscape of a small designed ß-sheet peptide with an all-atom representation of the solute and an implicit model of the solvent. Daura et al. (1998) were able to demonstrate the reversible folding of a small (seven residues) helix forming ß-peptide in methanol solvent using long molecular dynamics (MD) simulations.
On the other hand, minimal models (Chan and Dill 1991; Sali et al. 1994; Mirny and Shakhnovich 2001; Micheletti 2003) have provided important results about the general features of the free energy landscape of proteins. They give an approximate description of both the interaction energy among amino acids (usually through a contact potential encoded in a 20 x 20 matrix; see Miyazawa and Jernigan 1985), and of the entropy. By making use of lattice models, that is, describing the protein as a chain of beads occupying the vertices of a cubic lattice, it has been possible to understand the basic energetic properties that characterize a folding sequence, that is, a large gap (measured in terms of typical contact energies) between the energy of the native and of the lowest, structurally dissimilar state (Mirny and Shakhnovich 2001). Studying the space of sequences and the role mutations have on good folders, it has been shown that the existence or less of such a gap depends on the existence of few "hot" sites in the native conformation, occupied by highly conserved, strongly interacting amino acids (Tiana et al. 1998). Based on these results, it has been found that the process of folding takes place in a hierarchical fashion (Baldwin and Rose 1999; Yeh and Rousseau 2000; Levy et al. 2001), guided by local elementary structures (LESs), formed very early in the folding process and stabilized by the interaction among hot amino acids (Tiana and Broglia 2001a). The docking of LES leads to the overcoming of the highest free energy barrier encountered by the protein in the transition between the denatured to the native conformation, giving rise to the folding nucleus (FN; Abkevich et al. 1994), and thus to a conformation that sits in the energetic basin of attraction of the native state. Such hierarchical mechanism makes the folding process possible by efficiently squeezing out the entropy from the system. Although the extension of this solution to the case of real proteins is not expected to be straightforward, it depends on working out the (ab initio) interaction between single protein residues (Broglia and Tiana 2001).
Other approaches focus on the geometric features of the protein, providing a detailed picture of the structure and using simple approximations for the potential function. From the physical point of view, this means focusing the attention on the entropic part of the energy landscape, neglecting the energetics. An example of this approach is to be found in Plaxco et al. (1998), who correlate the folding rate of proteins with a "contact order" parameter accounting for the locality of the interactions. The Go model is another approximation that describes well the entropy of the protein chain, making a crude approximation on the interaction energy (Go 1975), being a contact function that assumes the value -1 if the contact is a native contact and zero otherwise.
In the present work, we use a simple novel approach to extract information concerning the folding process and the stability of proteins from the structure of their native state. It is solely based on a detailed description of the interaction energy and neglects the protein entropy. The rationale behind this choice of approach is that the stabilization energy is distributed quite unevenly in proteins, as testified by the fact that mutations in most of the sites of a protein have little effect on its folding properties, whereas there are few key (hot) sites (where the stabilization energy is concentrated) that are highly sensitive to mutations (Mirny et al. 1998; Tiana et al. 1998). These sites are those that, in the analysis performed with minimal models, build out the LESs that control the folding process (Tiana and Broglia 2001b). Consequently, by determining these sites, one can get insight not only into the mechanism that stabilizes the protein but also into the kinetics of the folding process.
The understanding of how the different amino acids determine the free energy of the protein, in particular which are the residues that play a key role in its the stability and kinetics, is of great interest, not only to unravel how the one-dimensional sequence of amino acids encodes for the three-dimensional native conformation but also to eventually design new proteins with specific tasks (see Dahiyat and Mayo 1997) or to modify existing proteins (Colombo and Merz 1999).
In the following sections, we will describe the protocol used for determining the hot sites of the protein. We shall first test the method on a lattice designed model proteins (The Method and Its Validation From a Minimal Model) and then apply it to four small proteins (Application to Proteins), characterized by different folds and representative of diverse structural motifs. The last two sections contain the discussion of the results and the conclusions.
| The method and its validation from a minimal model |
|---|
|
|
|---|
![]() | (1) |
where N is the number of amino acids in the protein, 
is an eigenvalue, and
are the components of the associated eigenvector. We assume that the eigenvectors are normalized to unity and, because Mij is symmetrical, all the eigenvalues are real.
For the sake of simplicity, we label the N eigenvalues in increasing order, so that
1 is the most negative. Accordingly, the different terms in the sum in equation 1
approximate the real interaction energy Mij to an increasing extent, the first term containing the largest contribution to the stabilization of the native conformation. The components
of the associated eigenvector indicate to which extent each amino acid participates to the stabilization. In other words, each term in equation 1
accounts for an amount of energy 
that is shared among the different residues according to the corresponding eigenvector
.
If the second eigenvalue
2 is much higher than
1, one can approximate the whole interaction matrix as
![]() | (2) |
reducing the information needed to specify the interaction from N2 to N numbers.
Because the interaction patterns of real protein are particularly complicated ("frustrated," in the language of Frauenfelder and Wolynes 1994), one does not expect this approximation to be particularly accurate. On the other hand, it reduces a complicated two-body interaction into an effective interaction determined solely by a simple number (a generalized "charge," such as in the case of the electrostatic interaction) that can be ascribed to each amino acid. Thus, the approximation given in equation 2
states that there are some amino acids that are strongly interacting and others that are weakly interacting tout court, depending on the charges
and
of the two amino acids. Consequently, if amino acid A strongly attracts amino acids B (e.g., both with large positive charge; unlike the case of electrostatic interaction, charges with the same sign attract each other and charges with opposite sign repel each other, because AI in equation 2
is negative, being defined as the most negative of the eigenvalues), and amino acid B strongly attracts amino acid C (e.g., both with positive large charge), in the present approximation A strongly attracts C. The opposite is also true: If the interactions AB and BC are weak, the interaction AC will be weak. On the other hand, the real interaction Mij could be more complicated than this, being able to account for any combination of attractive and repelling pairs (e.g., A attracts B, B attracts C, but A repels C). Consequently, the approximation introduced in equation 2
is, in principle, rather crude. On the other hand, from the analysis of minimal models, it emerged that the folding and stabilization of small single domain proteins is essentially controlled by a FN displaying a rich network of contacts among the most strongly interacting residues (Abkevich et al. 1994; Tiana and Broglia 2001a,b) and thus containing a large fraction (~30% to 50%) of the stabilization energy of the system. This network arises from the fact that the FN results from the docking of the LES, in which each residue interacts favorably with many other residues belonging to the same or to one of the other LESs. Consequently, essentially each residue in the nucleus attracts each other, and can be easily described in terms of charges.
Residues belonging to the nucleus display a large charge, whereas the other residues carry a much lower charge. In other words, the presence of the nucleus is the physical reason why
1<<
2.
Summing up, the analysis of the eigenvalues spectrum provides information about the folding features of the protein. If
1 is much smaller (i.e., more negative) than the other eigenvalues, it means that the approximation introduced in equation 2
is good for our purposes, and the protein displays a well-defined nucleus. Again, we should make the caveat that interactions in real proteins are very complicated, and even if the folding is essentially controlled by a FN, the approximation given in equation 2
can never be quantitative.
In keeping with this caveat, in what follows, we will use the approximation introduced in equation 2
not to describe the whole network of interactions within a given protein, but only to investigate the possible presence of a stabilizing nucleus and to describe the energetics of this nucleus. As a consequence, it is not necessary for the approximation to be useful to provide small error bars on all the sites of the protein, but only on those participating in the FN. In fact, the validity test for the method will be its ability to identify which sites of the protein are important for the stability of the native conformation of the protein and for the dynamics of the associated folding process (as measured by
-value analysis, see below).
In what follows we will also show that for all of the proteins studied, the energy of the nucleus is well approximated by equation 2
. On the other hand, we will find that the overall correlation coefficient between eigenvector components and
-values is, as a rule, only qualitative as a consequence of the fact that we are concentrating on the energetics of the folded state of the protein, neglecting the entropic contribution due to the distribution of foldingunfolding events displayed by proteins in solution.
Furthermore, the method used in the present article, which is centered around the approximated relation introduced in equation 2
, although being an application of the well-known principal component analysis (PCA), is not related in any way with the various kinds of conformational PCA, or "essential dynamics," present in the literature (Amadei et al. 1993), whose goal is to effectively decrease the dimensionality of the conformational space.
To test the above procedure, we have studied a 36mer lattice model sequence designed (making use of the contact energies given in Table 6 of Miyazawa and Jernigan [1985]) to fold to a unique conformation, the nucleus of which is known to be built out of the three LESs: 36, 1114, and 2730 (Tiana and Broglia 200la), which contains 47% of the stabilization energy. The spectrum of eigenvalues of the interaction matrix calculated in the native conformation is displayed as a solid curve in the upper part of Figure 1
. For comparison, the spectrum associated with a random sequence is displayed in the lower part of the same figure. In the case of the designed sequence, the lowest eigenvalue is well separated from the others (i.e., 
=
2 -
1 = 113.6 in units of RTroom, where the average spacing between the others is
= 13.3, leading to the ratio 
/
= 8.5), whereas in the case of the random sequence this is not true (i.e.,
2 -
1 = 0.4, the average spacing between the others being 4.0). As discussed above, this difference reflects the fact that the designed sequence displays a nucleus of mutually interacting residues, where the stabilization energy is concentrated (i.e., 50% of the native energy is carried by 30% of the residues, i.e., those belonging to the LES that form the FN). On the other hand, in the random sequence there is a complicated pattern of attractive and repulsive contacts.
|
A difference between the eigenvalue spectra of the designed and of the random sequence is detectable also for interaction matrices calculated starting from random conformations and performing the average over a time that is much smaller than the overall folding time (103 MC steps, compared to a characteristic folding time of 8 x 105 MC steps for the folding sequence). The results are displayed as a dotted curve in Figure 1
. For the folding sequence
2 -
1 = 43.2, to be compared to the average spacing of 7.6, whereas for the random sequence
2 -
1 = 1.1 and the average spacing is 2.6. The reason for this behavior is that LESs, which are part of the nucleus and carry some of the stabilization energy of the protein, are formed at the very early stages of the folding process of good sequences. On the other hand, random sequences undergo a disordered collapse, not displaying any internal structure. The details concerning how the stabilization energy is distributed among the amino acids are contained in the normalized eigenvector
associated to the lowest eigenvalue. In Figure 2
are displayed the components of such an eigenvector in the case of the interaction matrix calculated in the native conformation of the designed sequence (corresponding to the solid curve in the upper part of Fig. 1
). The plots corresponding to the other two curves in the upper part of Figure 1
are similar. The plot displays six peaks in correspondence to the amino acids building the LESs (residues 3, 6, 11, 14, 27, and 30) and one (residue 16) that, although not participating in a LES, interacts strongly with the FN. Anyway, these seven sites are exactly those that, if mutated, lead to the denaturation of the protein, and are called hot and warm sites in the article (Tiana et al. 1998).
|
| Application to proteins |
|---|
|
|
|---|
-spectrin SH3 domain (57 residues), the src SH3 domain (60 residues), the IgG binding domain of protein G (herein called simply "protein G"; 56 residues), and Chymotrypsin inhibitor 2 (CI2; 65 residues). All of these systems have been studied in depth in terms of the mutations needed to (de) stabilize their folded states and to influence their folding kinetics. Most of these studies were based on an extensive experimental analysis analyzing the role of point mutations and on the characterization of the effects of every single mutation on the thermodynamic stability and on the kinetic properties of the protein. The ability to identify the mutation sites via a fast and simple computational method thus becomes an extremely appealing feature of all-atom molecular simulations. For the sake of simplicity in the discussion of the eigenvalue and of the eigenvector properties, the numbering followed here for each protein starts from residue 1. The four systems mentioned above were analyzed through 10-nsec-long all-atom MD simulations, using an explicit representation of the solvent and the Particle Mesh Ewald method to calculate electrostatic interactions, in order to avoid cut-off induced artifacts.
The lower part of the eigenvalue spectrum, consisting on the first 20 eigenvalues for each of the four proteins, is displayed in Figure 3
, AD. For all the proteins examined, the separation 
between eigenvalues
1 and
2 is much larger than the average spacing
between the other eigenvalues (Table 1
), as in the case of the lattice model protein discussed above. Only in the case of CI2, this property becomes marginal, and in particular, the first three eigenvalues are almost equally spaced.
|
|

and
is somewhat smaller for the four real proteins than for the idealized lattice protein are as follows: (1) the use of a general-purpose force field, which has not been optimally tuned for the particular fold of the proteins under study; (2) the noise present in MD simulations due to the numerical approximations in the integration steps; and (3) real proteins can concentrate energy in clusters of amino acids for reasons different than their folding and stabilization properties, such as for binding other molecules, for enzymatic activity, etc.
-Spectrin SH3 domain
|
|---|
|
|
|---|
-spectrin SH3 domain is reported in Figure 3A
= 20.9, which means five times the average spacing
= 4.2.
In Figure 4A
are displayed the components
of the normalized eigenvector associated with the lowest eigenvalue, components that indicate how the stabilization energy is distributed among the amino acids of the
-spectrin SH3 domain. The plot displays particularly large values of the components in the intervals 1020, 3439, and 4649 (in particular, the maximum of the peaks are at sites 10, 14, 18, 36, 47, and 48).
|

GN-U values ranging between 7.5 and 12.9 kJ/mole. These sites are indeed among those showing, in our analysis, a large value of the eigenvector component. This is a straightforward result, considering that the components of the eigenvector express the degree to which a given residue participates in the stability of the protein.
|
-values, Martinez et al. (1998) showed that the formation of the distal hairpin (residues 3848) and the anchoring of the second strand of the RT loop (residues 1819) are determinant in the kinetics of folding. The highest peak in Figure 4A
The connection between the energetic properties of a residue and its role in the folding kinetics is not unexpected and can be easily rationalized in terms of stabilization of LESs and their subsequent assembly into the FN (Tiana and Broglia 2001a). Such LESs, being the elements that lead the folding process, have to be strongly stabilized at the very early stages of the kinetics and, consequently, must display large values of the eigenvector components. In the case of
-spectrin SH3, the distal hairpin plays the role of LES and, interacting with the second strand of the RT loop, forms the FN (as discussed by Martinez et al. [1998]).
According to Figure 4A
, the first strand of the RT loop (residues 1015) also plays an important role, if not for the folding and stability of the whole protein at least for the stabilization of its N-terminal region.
Summing up, the eigenvector components are useful in locating hot sites. The performance of the method is assessed by evaluating how many hot sites it is able to identify. To provide a consistent definition of hot site on the basis of the components of the eigenvector (and, for comparison, of
-values), we calculate the most probable value of the eigenvector components distribution and define as hot all the sites of the protein displaying an eigenvector component larger than the most likely value. The distribution of eigenvector components displays a peak and a long tail, and consequently, the most probable value is not the average but the one associated to the top of the distribution. On the other hand, because
-values are not calculated in all sites, the statistics is low and it is difficult to define the most likely value from the
-values distribution. In this case, hot sites are defined as those displaying
-values above the average.
For
-spectrin SH3, the eigenvector method is able to identify three hot sites out of four, displaying two false positives (Table 3
). One of the two false positives (site 38) is at the borderline of being a hot site, displaying a
-value of 0.37, when the threshold value we take as threshold for this protein is 0.4. The other false positive (site T19) is directly bound to a correctly predicted hot site (site 18), with its motions and its interactions strongly coupled to those of residue 18. It is then conceivable that perturbing residues in the region of a hot spot should have definite consequences on the protein stability.
|
It is also interesting to study the energetics properties of the FN, defined as the set of sites with a numerical value
of the first eigenvector that is higher than the most probable value (Table 3
). This shares a stabilization energy of -1334 kJ/mole (Table 3
). This is equivalent to the 35% of the overall stabilization energy of the protein. The energy of the nucleus evaluated through the approximation of equation 2
is -1209 kJ/mole, equivalent to a relative error of 9%.
| src SH3 domain |
|---|
|
|
|---|
-spectrin domain discussed above, although its sequence homology is only 36%. Also, for src SH3 the lowest eigenvalue is well separated from the others, being 
= 18.7 kJ/mole and the average separation
= 4.0 kJ/mole (see Fig. 3C
|
The comparison of the eigenvectors associated with spectrin and src SH3 domains shows a remarkable similarity, although the sequences are rather different (only 36% homology). This fact indicates a stronger evolutionary relationship between the two proteins than what the mere comparison of sequences would indicate. In other words,
-spectrin and src SH3 domains have diverged enough to change 74% of their amino acids but not enough to mutate their topologies and energetic pattern and, consequently, their folding mechanism.
This aspect is reflected by the chemical nature of the residues at the hot sites of the two proteins. Both SH3 domains display peaks corresponding to acidic (D9) and hydrophobic (VI8 for
-spectrin SH3 and LI8 for src SH3) residues in the N-terminal region, and to sterically demanding hydrophobic (W36 for
-spectrin SH3, W37 for src SH3) together with aliphatic (V48 for cu-spectrin SH3, 150 for src SH3) at the C-terminal. Moreover, these hot sites are part of a mostly hydrophobic core coincident with the mechanical nucleus defined by Martinez et al. (1998), formed during the folding process of the two domains studied. These investigators showed that passing through the transition state barrier in SH3 domains requires the formation of a defined structure with little conformational variability in well-defined regions, such as the ones identified here. It is worth noting that the regions rich in hot sites, involving residues 3439 and 4649, are the peptide ligand binding regions, indicating that the conservation of the pattern also assumes a functional role. We can also conclude that the analysis of the energetic patterns of proteins sharing the same fold (topology) could be used to assess their evolutionary relationships in a way that is more efficient than is the comparison of the sequences, because proteins have some degree of freedom of mutating their amino acids (even the key amino acids) for functional reasons, whereas the energetic pattern has to remain similar.
A summary of the parameters providing information on the validity of the method is given in Table 3
. Repeating the same procedure as that used for the analysis of
-spectrin SH3, we are able to correctly identify 13 out of 17 hot sites. In this case there are six false positives, but three of them are consecutive to hot sites (17, 19, and 35), and this determines strong steric interactions with the hot spot regions, resulting in high component values in the first eigenvector. Also, in the case of this protein, it is most likely that the introduction of a mutation in the position of one false positive can be effective in changing the stability properties of the domain. The energy of the nucleus is -1661 kJ/mole (that is the 43% of the overall energy), which has to be compared with the approximated value of -1322 kJ/mole, equivalent to a relative error of 20%.
| Protein G |
|---|
|
|
|---|
The
-values of protein G are shown in Figure 6
and display a high peak in the interval 4552 and two small peaks around residues 3 and 22. Moreover, residues D22, A26, and F30 are important for the stability of the native state; their mutation to A causes a decrease of the stabilization free energy of 7.3, 12.3, and 5.9 kJ/mole, respectively.
|

= 15.7 kJ/mole, larger than the average distance between the others
= 4.1. The components of the associated eigenvector are also displayed in Figure 6
-values at sites 4552, corresponding to the second hairpin with a formation that is critical to the folding mechanism (Kim et al. 2000). Moreover, the other minor peaks in the eigenvector components at sites 37, D22, A26, and V29 correspond to the others sites relevant to the stability and to the kinetics (Fig. 6Some of the residues relevant for stabilization are also involved in the definition of the binding region of the IgG-binding domain from protein G. In fact, mutations of residues in the region 2732 and 4346 cause major variations in the binding constant of human immunoglobulin Fc fragment (Sloan and Hellinga 1999). In particular, the binding properties of the region 2732 explain the peak in the associated eigenvector components that, from the analysis of the mere folding behavior, has to be considered a false positive.
Also, for proteins G the outcome of the model is summarized in Table 3
. For protein G the method can identify five hot sites over seven with three false positives (sites 22, 45, and 52; the second and the third being consecutive to hot sites). The energy of the nucleus is -1001 kJ/mole (i.e., 28% of the overall energy), whereas the approximated value is -966 kJ/mole, equivalent to a relative error of 3%.
| Chymotrypsin inhibitor 2 |
|---|
|
|
|---|
-values of this protein are fractionary (Jackson et al. 1993).
This peculiarity of CI2 is underlined both by the fact that the difference between the first two lowest eigenvalues is 
= 13.2 kJ/mole, the average spacing being
= 3.9, and by the absence of well defined isolated peaks in the components of the eigenvector associated to the first eigenvalue.
Nonetheless, a careful inspection of Figure 7D
shows that the stabilizing regions of the molecule are located around residues 1020 and 4050, corresponding to the helix of the N-terminal region making contacts with residues located on a ß-sheet at the opposite end of the polypeptide. Mutations in these regions cause destabilization of the folded state between 4.0 and 20.6 kJ/mole. The long-range interactions between these regions are those required to stabilize the helix in peptide fragments of CI2, as shown by Gay et al. (1995a) in a study on the stability of the secondary structure of peptides isolated from CI2. Moreover, residues around position 16, 49, and 57 constitute a nucleation site, the disruption of which leads to loss of stability and fast unfolding of the protein at high-temperature conditions (Gay et al. 1995b; Kazmirski et al. 1995; Neira et al. 1997).
|
-values, that determines a wide network of intramolecular interactions in the folded state of the protein. The energy of the nucleus is predicted to be -1690 kJ/mole, with an error of 6% and equivalent to 44% of the overall stabilization energy. | Discussion |
|---|
|
|
|---|
|
Minimal model systems are ideally designed such that the same nucleus residues determine both the folding mechanism and the stability of the native over the unfolded state: The kinetic and thermodynamic determinants of folding are, as a consequence, coincident in these simplified representations.
In real proteins this might not necessarily be the case, as the pathways to the native state can be mediated by interactions with the solvent, by stereospecific interactions among the side chains, etc. However, at least in the case of single domain proteins such as the ones considered here,
-value analysis has proved that a localized cluster of residues is determinant for both the stability and the kinetics of folding. This is not to be considered a totally unexpected phenomenon, if we consider that folding transition states and intermediates show a high degree of structural similarity to the native states (Snow et al. 2002).
The analysis of the components of the eigenvector associated with the lowest eigenvalue of the native-state interaction matrix is helpful in determining such a cluster. Consequently, the sites individuated by this analysis play a major role in the kinetics and stability of the protein (Fig. 8
), as shown by the good agreement between the plot of the eigenvector components and the plots displaying
-values and change in stability upon mutations. From this analysis, it is not possible to tell which of the sites indicated by the eigenvector are connected with stability, which with kinetics, and which with both of them. The eigenvector contain a superposition of the two pieces of information, and only an analysis of the entropy of the system can separate them. From this point of view, the present analysis consists in an approximation of the free energy of the native state in which the entropic term is neglected, and consequently, it is quite obvious that the results it can give are partial.
To be noted that the average spacing between the peaks in the eigenvector component of all the proteins we analyzed is ~19 residues. This is in accordance with the findings of Berezovsky et al. (2001) and Berezovsky and Trifonov (2002), who analyze the size of closed loops of the backbone of a large number of proteins and observe an average ranging from 25 to 30 residues, a size that is optimal from the point of view of a fast search in conformational space. We suggest that the loops found by Berezovsky et al. (2001) and Berezovsky and Trifonov (2002) could be related to the loops between LESs discussed above (Tiana and Broglia 2001a).
Two caveats to the method presented here need to be highlighted. First, the MD simulations with explicit water we used produce an interaction matrix that accounts for the van der Waals energy, for the electrostatic energy, and for the hydrogen bonds energy, but misses the hydrophobic interaction. This is, in fact, an effective interaction that arises from the averaging of the solvent degrees of freedom and does not appear in the potential function of the system. The fact that we obtain good agreement between these model calculations and the experimental results indicates that the hydrophobic interaction does not play a major role in determining the LESs, but only in the generic collapse of the chain into a globule (which is not considered here). Another problem is the choice of the time duration of the simulations. They have to be long enough to average the uncertainties in the structure connected with the experimental determination of the native conformation and to be equilibrated. Running simulations in which multiple folding-unfolding events could be observed would give access to the entropic term that has been neglected in the analysis, but reaching the timescales required for that purpose is currently out of reach even for small proteins. We have nonetheless performed simulations ranging from 2 to 15 nsec, observing no qualitative differences between the different cases belonging to this range.
From the practical point of view, the analysis of the components of the eigenvector associated with the lowest eigenvalue of the native-state interaction matrix can be helpful in determining the positions or the regions where mutations have to be introduced in order to change the stability, and connected with it, the dynamical features of the folded protein. Once the hot sites have been identified, saturation mutagenesis experiments can be focused on that region to modify the properties of the protein under study. In many cases (Liebeton et al. 2000) hot sites have been identified through several cycles of error-prone mutagenesis and analysis, which relies on the random introduction of mutations in different parts of the protein, and subsequent saturation mutagenesis on the above defined hot sites. This procedure, despite being quite efficient and successful, is not based on a rational approach and might be time consuming. Previous MD studies (Li and Daggett 1994) have yielded a reasonable agreement with experimentally determined points of mutation in CI2 by using a geometrical approach based on the relative number of contacts of one particular residue in the native and in the transition state ensembles. This approach, however, relies on the generation via MD of the ensemble of transition state structures based on high-temperature unfolding simulations.
In the case of our method, the straightforward application of realistic MD simulations of the native state followed by the analysis of the eigenvectors of the energy interaction matrix can be a useful help to a rational protein design, without the need to generate transition state structural ensembles.
Conclusions
It is possible to get information on the sites of a protein that contribute most to the stability and to the accessibility of the native state by mean of a simple energetic analysis of MD simulations, simulations that can be performed even for large proteins just with the help of a personal computer. This allows us not only to gain insight into the folding pathways of a given protein but also to be able to predict sites where to focus mutagenesis experiments.
| Materials and methods |
|---|
|
|
|---|
-spectrin SH3 domain, 1FMK
[PDB]
.pdb for src SH3 domain, 1PGB
[PDB]
.pdb for protein G, and 2CI2
[PDB]
.pdb for CI2. The proteins were protonated to give a zwitterionic form (with N-terminal
and C-terminal COO- groups) with the carboxyl NH side-chain groups in their charged states. The total charges on the proteins resulted in +1 for
-spectrin SH3 domain, -4 for src SH3 domain, protein G, and -1 for CI2. The proteins were solvated with water in a octahedral box large enough to contain 1.2 nm of solvent around the peptide. The simple point charge (SPC) water model was used (Berendsen et al. 1987) was used to solvate each protein in the simulation box.
In all simulations, suitably charged counterions were used to yield an electrically neutral system. Each system was subsequently energy-minimized with a steepest descent method for 1000 steps. The calculation of electrostatic forces used the PME implementation of the Ewald summation method. The LINCS algorithm (Hess et al. 1997) was used to constrain all bond lengths. For the water molecules the SETTLE algorithm (Miyamoto and Kollman 1992) was used. A dielectric permittivity,
= 1, and a time step of 2 fsec were used. All atoms were given an initial velocity obtained from a Maxwellian distribution at the desired initial temperature of 300 K. The density of the system was adjusted, performing the first equilibration runs at NPT condition by weak coupling to a bath of constant pressure (P0 = one bar, coupling time
p = 0.5 psec; Berendsen et al. 1984).
In all simulations the temperature was maintained close to the intended values by weak coupling to an external temperature bath (Berendsen et al. 1984) with a coupling constant of 0.1 psec. The peptide and the rest of the system were coupled separately to the temperature bath. Each of the five MD simulations was extended to 10 nsec.
All simulations and analysis were carried out by using the GROMACS package (version 3.0; van der Spoel et al. 1994), using the GROMOS96 43A1 forcefield (van Gunsteren et al. 1998). All calculations were performed on clusters of personal computers, with the Linux operating system.
| Acknowledgments |
|---|
| References |
|---|
|
|
|---|
Amadei, A., Linssen, A.B.M., and Berendsen, H.J.C. 1993. Essential dynamics of proteins. Proteins 17: 412425.[CrossRef][Medline]
Baldwin, R.L. and Rose, G.D. 1999. Is protein folding hierarchic? I: Local structure and peptide folding. Trends Biochem. Sci. 24: 2633.[CrossRef][Medline]
Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F., Di Nola, A., and Haak, J.R. 1984. Molecular dynamics with coupling to an external bath. J. Chem. Phys. 81: 3684.[CrossRef]
Berendsen, H.J.C., Grigera, J.R., and Straatsma, T.P. 1987. The missing term in effective pair potentials. J. Phys. Chem. 91: 62696271.[CrossRef]
Berezovsky, I.N. and Trifonov, E.N. 2002. Loop fold structure of proteins: Resolution of Levinthals paradox. J. Biomol. Struct. Dyn. 20: 56.[Medline]
Berezovsky, I.N., Kirzhner, V.M., Kirzhner, A., and Trifonov, E.N. 2001. Protein folding: Looping from hydrophobic nuclei. Proteins 45: 346350.[CrossRef][Medline]
Broglia, R.A. and Tiana, G. 2001. Reading the three-dimensional structure of a protein from its amino acid sequence. Proteins 45: 421427.[CrossRef][Medline]
Bycroft, M., Ludvigsen, S., Fersht, A.R., and Poulsen, F.M. 1991. Determination of the 3-dimensional solution structure of barnase using nuclear magnetic resonance spectroscopy. Biochemistry 30: 86978701.[CrossRef][Medline]
Chan, H.S. and Dill, K. 1991. Polymer principles in protein structure and stability. Annu. Rev. Biophys. Biophys. Chem. 20: 447490.[CrossRef][Medline]
Colombo, G. and Merz Jr., K.M. 1999. Stability and activity of mesophilic subtilisin e and its thermophilic homolog: Insights from molecular dynamics simulations. J. Am. Chem. Soc. 121: 68956903.[CrossRef]
Dahiyat, B.I. and Mayo, S.L. 1997. De novo protein design: Fully automated sequence selection. Science 278: 8287.
Daura, X., Jaun, B., Seebach, D., van Gunsteren, W.F., and Mark, A.E. 1998. Reversible peptide folding in solution by molecular dynamics simulation. J. Mol. Biol. 280: 925932.[CrossRef][Medline]