1 Department of Biochemistry and Molecular Biophysics
2 Program in Applied Mathematics and Department of Mathematics, University of Arizona, Tucson, Arizona 85721, USA
(RECEIVED November 13, 2005; FINAL REVISION January 9, 2006; ACCEPTED January 12, 2006)
| Abstract |
|---|
Keywords: protein fold space; repeat proteins; fold evolution; polyhelix
| Introduction |
|---|
To investigate these problems, we introduce a representation of the fold of a protein as a continuous space curve that follows the path of the protein backbone in three dimensions. The fold is considered as a geometric object that is distinct from the atomic model of the protein that displays that fold. Differential geometry is capable of describing general curves and so represents a natural language for the exploration of the possible forms that protein folds may take, and the variation about such forms.
Continuous representations have played an important role in the study of nucleic acid structure and dynamics (Marko and Siggia 1994; Manning et al. 1996; Goriely and Tabor 1997; Klapper and Qian 1998). Although there are some notable exceptions (Maritan et al. 2000; Banavar et al. 2002; Trovato et al. 2005), in general the greater conformational diversity of proteins has limited the use of continuous models. However, the rich repertoire of tertiary forms attained by proteins invites a geometric inquiry. Despite the apparent spatial complexity of protein folds, a considerable simplification is possible using a local geometrical description.
In general, sufficiently smooth three-dimensional space curves are completely specified up to a rotation and translation by their curvature and torsion (referred to together as curvatures), which are the local properties that describe the twisting and bending of the curve at each point along its length. The local description in terms of curvatures and the global description in terms of spatial coordinates are equivalent: Given any curvature profile, a corresponding space curve can be constructed, and conversely, given any curve, the corresponding curvature profile can be obtained. We have developed methods to construct curvature profiles for curves that follow the path of protein backbones and to construct atomic coordinate models from such curves (A.C. Hausrath and A. Goriely, in prep.). In this report these methods are used to examine properties of the fold space of helical repeat proteins.
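As a worked illustration of the second direction (from a curve to its curvatures), the standard differential-geometry formulas for a curve $\mathbf{r}(s)$ parameterized by arc length can be written as follows. The symbols $R$, $h$, and $c$ for the circular helix are introduced here only for the example and are not notation taken from this report.

$$\kappa(s) = \lVert \mathbf{r}''(s) \rVert, \qquad \tau(s) = \frac{\bigl(\mathbf{r}'(s) \times \mathbf{r}''(s)\bigr) \cdot \mathbf{r}'''(s)}{\kappa(s)^{2}}$$

$$\mathbf{r}(s) = \Bigl(R\cos\tfrac{s}{c},\; R\sin\tfrac{s}{c},\; \tfrac{h\,s}{c}\Bigr),\quad c = \sqrt{R^{2} + h^{2}} \quad\Longrightarrow\quad \kappa = \frac{R}{R^{2} + h^{2}},\qquad \tau = \frac{h}{R^{2} + h^{2}}$$

Both quantities are constant along the helix, which is why a single (curvature, torsion) pair is enough to describe a helical arc of any length.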
| Results |
|---|
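A space curve $\mathbf{r}(s)$ parameterized by arc length $s$ carries at each point a Frenet frame of unit tangent, normal, and binormal vectors $(\mathbf{t}, \mathbf{n}, \mathbf{b})$, whose variation along the curve is governed by the Frenet equations in terms of the curvature $\kappa$ and torsion $\tau$:

$$\mathbf{r}' = \mathbf{t}, \qquad \mathbf{t}' = \kappa\,\mathbf{n}, \qquad \mathbf{n}' = -\kappa\,\mathbf{t} + \tau\,\mathbf{b}, \qquad \mathbf{b}' = -\tau\,\mathbf{n} \tag{1}$$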
Here ()' denotes differentiation with respect to s. The curvature and torsion can be extracted from the curve by repeated differentiation, and conversely, the curve can be constructed from the curvature and torsion profiles by integration of the Frenet equations. A convenient way to perform this integration is to introduce a 12-dimensional vector
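$$\mathbf{Y} = (t_1, n_1, b_1,\; t_2, n_2, b_2,\; t_3, n_3, b_3,\; r_1, r_2, r_3)^{T} \tag{2}$$

whose entries are the nine components of the three basis vectors in the Frenet frame as well as the three coordinates of a point on the curve.
Then, the Frenet equations can be written as a differential matrix equation

$$\mathbf{Y}' = M\,\mathbf{Y}, \qquad M = \begin{pmatrix} F & 0 & 0 & 0 \\ 0 & F & 0 & 0 \\ 0 & 0 & F & 0 \\ V_1 & V_2 & V_3 & 0 \end{pmatrix}, \qquad F = \begin{pmatrix} 0 & \kappa & 0 \\ -\kappa & 0 & \tau \\ 0 & -\tau & 0 \end{pmatrix} \tag{3}$$

where each $0$ denotes a 3 × 3 block of zeros and $V_i$ is the 3 × 3 matrix whose single nonvanishing entry is a 1 in row $i$, column 1.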
For arbitrary curvatures, Equation 3 cannot be solved exactly and numerical integration is required. However, if the curvatures are piecewise constant, Equation 3 is piecewise linear with constant coefficients and an exact analytical solution can be obtained. Such a curvature profile is specified by a list of triples $P = \{(\kappa^{(i)}, \tau^{(i)}, L^{(i)}),\ i = 1, \ldots, N\}$, with each triple corresponding to a segment. A curve with constant curvature and torsion is a helix. Therefore, a curve constructed from a piecewise constant curvature profile consists of a series of connected helical arcs and will be referred to as a polyhelix. The following polyhelix construction avoids the numerical integration of differential equations normally required to solve the Frenet equations. It is computationally efficient and enables both curve and coordinate model construction using only straightforward linear algebra of the kind well known in the structural biology community.
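For a single segment with curvature $\kappa$ and torsion $\tau$ starting at $s = 0$ and ending at $s = L$, the solution to Equation 3 is given by

$$\mathbf{Y}(s) = A(\kappa, \tau; s)\,\mathbf{Y}(0), \qquad 0 \le s \le L \tag{4}$$

where $\mathbf{Y}(0)$ defines the initial position and the orientation of the Frenet basis at $s = 0$, and $A(\kappa, \tau; s) = e^{sM}$ is the matrix exponential, which can be written

$$A(\kappa, \tau; s) = \begin{pmatrix} a & 0 & 0 & 0 \\ 0 & a & 0 & 0 \\ 0 & 0 & a & 0 \\ b_1 & b_2 & b_3 & I \end{pmatrix}, \qquad a = \frac{1}{\lambda^{2}} \begin{pmatrix} \tau^{2} + \kappa^{2}\cos\lambda s & \kappa\lambda\sin\lambda s & \kappa\tau(1 - \cos\lambda s) \\ -\kappa\lambda\sin\lambda s & \lambda^{2}\cos\lambda s & \tau\lambda\sin\lambda s \\ \kappa\tau(1 - \cos\lambda s) & -\tau\lambda\sin\lambda s & \kappa^{2} + \tau^{2}\cos\lambda s \end{pmatrix} \tag{5}$$

where $\lambda = \sqrt{\kappa^{2} + \tau^{2}}$, $I$ is the 3 × 3 identity, each $0$ is a 3 × 3 block of zeros, and the 3 × 3 submatrices $b_i$ have the single nonzero row $i$ with entries

$$\frac{1}{\lambda^{3}} \Bigl( \kappa^{2}\sin\lambda s + \tau^{2}\lambda s,\;\; \kappa\lambda\,(1 - \cos\lambda s),\;\; \kappa\tau\,(\lambda s - \sin\lambda s) \Bigr) \tag{6}$$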
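A polyhelix with $N$ segments is completely characterized by the list $P$ and an initial position and basis orientation $\mathbf{Y}(0)$. A parametric expression in arc length for the $j$th segment of the curve $\mathbf{r}(s)$ is given by the last three components of the vector $\mathbf{Y}^{(j)}(s)$:

$$\mathbf{Y}^{(j)}(s) = A\bigl(\kappa^{(j)}, \tau^{(j)};\, s - s_{j-1}\bigr)\,\mathbf{Y}^{(j-1)}(s_{j-1}), \qquad s_{j-1} \le s \le s_j \tag{7}$$

where $s_j = \sum_{i=1}^{j} L^{(i)}$ is the arc length at the end of segment $j$, with $s_0 = 0$ and $\mathbf{Y}^{(0)}(s_0) = \mathbf{Y}(0)$.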
The matrix A propagates both its corresponding helix and the associated Frenet frame from the initial frame to the frame at its endpoint, thereby supplying the initial frame for the subsequent segment. Joining these recursively defined parametric representations for each segment creates a parametric representation of the entire curve.
The nature of the polyhelix construction can be more clearly seen in the explicit expressions for matrix products specifying the first three segments:
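$$\begin{aligned} \mathbf{Y}^{(1)}(s) &= A\bigl(\kappa^{(1)}, \tau^{(1)};\, s\bigr)\,\mathbf{Y}(0) \\ \mathbf{Y}^{(2)}(s) &= A\bigl(\kappa^{(2)}, \tau^{(2)};\, s - L^{(1)}\bigr)\,A\bigl(\kappa^{(1)}, \tau^{(1)};\, L^{(1)}\bigr)\,\mathbf{Y}(0) \\ \mathbf{Y}^{(3)}(s) &= A\bigl(\kappa^{(3)}, \tau^{(3)};\, s - L^{(1)} - L^{(2)}\bigr)\,A\bigl(\kappa^{(2)}, \tau^{(2)};\, L^{(2)}\bigr)\,A\bigl(\kappa^{(1)}, \tau^{(1)};\, L^{(1)}\bigr)\,\mathbf{Y}(0) \end{aligned} \tag{8}$$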
Note that only the first matrix in the product contains the parametrization in s. The rest of the product amounts to a constant vector. This constant vector supplies the initial basis for that segment and is obtained as the endpoint of the previous segment. As an example, a three-segment polyhelix is shown in Figure 1.
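The recursion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`helix_maps`, `advance`, `polyhelix`) are hypothetical, the frame is stored as three row vectors (t, n, b) instead of a 12-component vector, and the closed-form coefficients are those obtained by exponentiating the Frenet equations for constant curvature and torsion.

```python
import math

def helix_maps(kappa, tau, s):
    """Closed-form pieces of the propagator for one helical arc.

    Returns (E, c): E is the 3x3 rotation that advances the frame rows
    (t, n, b) over arc length s; c holds the coefficients of the initial
    t, n, b in the displacement r(s) - r(0).
    """
    lam = math.hypot(kappa, tau)
    if lam == 0.0:  # degenerate case: a straight line
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [s, 0.0, 0.0]
    k, t = kappa / lam, tau / lam
    sn, cs = math.sin(lam * s), math.cos(lam * s)
    E = [[t * t + k * k * cs, k * sn, k * t * (1.0 - cs)],
         [-k * sn, cs, t * sn],
         [k * t * (1.0 - cs), -t * sn, k * k + t * t * cs]]
    c = [t * t * s + k * k * sn / lam, k * (1.0 - cs) / lam, k * t * (s - sn / lam)]
    return E, c

def advance(frame, r, kappa, tau, s):
    """Propagate (frame, r) a distance s along a helix with constant kappa, tau."""
    E, c = helix_maps(kappa, tau, s)
    new_frame = [[sum(E[i][m] * frame[m][x] for m in range(3)) for x in range(3)]
                 for i in range(3)]
    new_r = [r[x] + sum(c[m] * frame[m][x] for m in range(3)) for x in range(3)]
    return new_frame, new_r

def polyhelix(profile, s, frame0=None, r0=(0.0, 0.0, 0.0)):
    """Evaluate (frame, r) at arc length s for profile = [(kappa, tau, L), ...]."""
    frame = [list(v) for v in (frame0 or ((1, 0, 0), (0, 1, 0), (0, 0, 1)))]
    r = list(r0)
    for kappa, tau, length in profile:
        if s <= length:
            return advance(frame, r, kappa, tau, s)
        # The endpoint of this arc supplies the initial frame of the next one.
        frame, r = advance(frame, r, kappa, tau, length)
        s -= length
    raise ValueError("s exceeds the total arc length of the profile")
```

Only the last call to `advance` depends on `s`; the earlier calls use the fixed segment lengths, mirroring the observation that everything after the first matrix in the product is a constant.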
The first nine components of $\mathbf{Y}^{(j)}(s)$ give the components of the basis vectors for the local Frenet frame at $s$. These coordinate systems are particularly convenient for constructing atomic coordinate models from the curve. Selecting points on the curve separated by the length of a peptide plane gives the positions of the atoms in the model and defines a discrete set of corresponding arc-length values. The Frenet frames at positions $\{s_1, \ldots, s_N\}$ are the natural coordinate systems in which to express the atoms of the corresponding residue in the atomic model. A set of local coordinates $\mathbf{a} = (a_1, a_2, a_3)$ in the Frenet frame at $s$ represents the point $P_{\mathrm{ext}} = \mathbf{r}(s) + a_1\mathbf{t}(s) + a_2\mathbf{n}(s) + a_3\mathbf{b}(s)$ in the external coordinates. Conversely, any external point $P_{\mathrm{ext}}$ has local coordinates $\mathbf{a} = \{(P_{\mathrm{ext}} - \mathbf{r}(s)) \cdot \mathbf{t}(s),\ (P_{\mathrm{ext}} - \mathbf{r}(s)) \cdot \mathbf{n}(s),\ (P_{\mathrm{ext}} - \mathbf{r}(s)) \cdot \mathbf{b}(s)\}$ in the Frenet frame at $s$. The closed-form expression for $\mathbf{Y}^{(j)}(s)$ allows the coordinates of the atoms of the model, and therefore derived quantities such as energies, geometric quantities including bond geometries or buried surface areas, and also agreement with experimental data, to be the subject of analytical studies. Local coordinates used for construction of backbone atomic models (e.g., see Figure 4, below) are shown in Table 1. The backbone atoms are specified by the curvature profile. The construction of side chains can be accomplished in a similar manner, although information in addition to the curvature profile must be supplied (A.C. Hausrath and A. Goriely, in prep.).
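The two coordinate conversions in the preceding paragraph amount to a linear combination and three dot products. The sketch below illustrates them with hypothetical function names (`to_external`, `to_local`) and assumes the frame is supplied as three orthonormal vectors (t, n, b); the numbers in the usage lines are arbitrary and are not the Table 1 values.

```python
def to_external(r, frame, a):
    """Map local Frenet coordinates a = (a1, a2, a3) at curve point r to external coordinates."""
    t, n, b = frame
    return [r[x] + a[0] * t[x] + a[1] * n[x] + a[2] * b[x] for x in range(3)]

def to_local(r, frame, p_ext):
    """Project an external point onto the Frenet frame (t, n, b) attached at r."""
    d = [p_ext[x] - r[x] for x in range(3)]
    return [sum(d[x] * v[x] for x in range(3)) for v in frame]

# Usage with an arbitrary orthonormal frame and arbitrary local coordinates.
r = (1.0, 2.0, 3.0)
frame = ((0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
p = to_external(r, frame, (0.5, -1.2, 0.3))
a_back = to_local(r, frame, p)
```

Because the frame is orthonormal, `to_local` is the exact inverse of `to_external`, so a residue template stored once in local coordinates can be placed at every sampled arc-length position.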
| Discussion |
|---|
α-helical proteins. General helical protein folds can be described with such curvature profiles, although the difficulty of obtaining an accurate representation increases rapidly for more complex structures. In this article we focus on a simple family of such structures: the helical repeat proteins. As the fold of a repeat protein is a repetitive curve, its curvature profile is periodic. The periodicity greatly simplifies the curvature profile, as only one period need be specified to generate an extended regular structure.
Using the polyhelix construction, a two-helix repeat protein fold can be specified with six $\{\kappa, \tau, L\}$ triples: one for each of the two helices and two for each of the two turns. We use two helical arcs to connect successive helices because, in general, six parameters are required to specify the relative orientation of two rigid bodies, and two $\{\kappa, \tau, L\}$ triples supply exactly these six parameters.
In the α-helical segments, curvature and torsion are fixed, so each helix contributes only its length as a free parameter. The construction of a two-helix repeat therefore requires 14 parameters: two helix lengths plus six parameters for each of the two turns. The canonical form for each fold of this type corresponds to a single point in this 14-dimensional space. The space of parameters for a given architecture will be referred to as the curvature space. There is an exact correspondence between points in the curvature space and individual three-dimensional curves obtained using the polyhelix construction. Further, using the local coordinate systems appropriately placed along the curve, backbone atomic models may be constructed using these curvature parameters. The polyhelix construction represents a mathematically well-defined mapping between the Euclidean curvature space and the complicated fold space of helical repeat proteins.
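To make the parameter count concrete, the following sketch expands a 14-component parameter vector into the six triples of one period and repeats that period. The ordering of the 14 parameters and the function names are assumptions made for illustration only; the fixed helix values 0.38 and 0.15 are the rounded α-helical curvature and torsion quoted in Materials and Methods.

```python
ALPHA_KAPPA, ALPHA_TAU = 0.38, 0.15  # alpha-helix values quoted in Materials and Methods

def repeat_period(params):
    """Map 14 parameters to the six (kappa, tau, L) triples of one period.

    ASSUMED (hypothetical) ordering:
    [L_helix1, turn1 (6 values), L_helix2, turn2 (6 values)],
    where each turn is two consecutive (kappa, tau, L) arcs.
    """
    if len(params) != 14:
        raise ValueError("a two-helix repeat needs exactly 14 parameters")
    l1, turn1, l2, turn2 = params[0], params[1:7], params[7], params[8:14]

    def arcs(values):
        return [tuple(values[0:3]), tuple(values[3:6])]

    return ([(ALPHA_KAPPA, ALPHA_TAU, l1)] + arcs(turn1)
            + [(ALPHA_KAPPA, ALPHA_TAU, l2)] + arcs(turn2))

def periodic_profile(params, n_repeats):
    """Repeat one period n_repeats times to obtain an extended regular profile."""
    return repeat_period(params) * n_repeats

# Usage with placeholder numbers (not fitted values from any protein).
profile = periodic_profile([float(i) for i in range(1, 15)], 3)
```

The resulting list of triples is the only input a polyhelix construction needs, apart from the initial position and frame.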
As an example, Figure 2 shows curvature profiles for three different helical repeat proteins: yeast vesicular transport protein sec17 (PDB code 1qqe), human protein phosphatase 2A (1b3u), and bacterial transcription factor MalT (1hz4), which will be subsequently referred to by their PDB codes for brevity (Groves et al. 1999; Rice and Brunger 1999; Steegborn et al. 2001). The parameters defining the curvature profiles are provided in Table 2. Figure 3 shows stereo views of the three repeats of the repetitive curves specified by these profiles and the Cα trace to which they were fitted. The profiles were also used to construct the backbone models represented by ribbon diagrams in Figure 4B.
Protein folds represent a subset of the possible space curves. Our strategy to predict new folds is to search within a curvature space specifying a family of space curves and to identify instances that are compatible with protein geometry. Mathematically, this is accomplished by devising a function to score the potential of a curve to be realized as a protein. While it is unrealistic to expect that a single function could reliably confirm the existence of a protein whose fold conforms to a given curve, it is not difficult to create functions that can eliminate curves that are incompatible with realization as a protein. We refer to these functions as protein quality functions. Given an appropriately constructed protein quality function, a contour plot over a curvature space will have islands in regions that correspond to protein-like curves. These islands may correspond to either: (1) curves that resemble known folds; (2) "false-positives," curves that are not realizable as protein folds but that have not been eliminated from the search because of the limitations of the quality function; (3) folds that exist in nature but have not been experimentally observed; or (4) folds that could be realized but have not been utilized in nature. Once such a function has been constructed for a particular architecture, it is possible to seek new folds within this architecture and to examine relationships between the individual instances of folds having this architecture. The computational efficiency of the polyhelix construction allows the exploration of the entire curvature space. Therefore it is possible to examine properties of the continuum of possible forms, such as the density of folds in fold space or the connectedness of fold space (Shindyalov and Bourne 2000; Harrison et al. 2002; Hou et al. 2003).
Remarkably, simple, well-chosen quality functions are adequate to resolve coarse features of the fold space and to identify regions of interest. Energetic calculations on complete models built from the curve-derived scaffolds could then be applied to promising candidates in these regions. Actual proof of existence requires constructing a physical protein molecule with the correct fold and the experimental determination of its three-dimensional structure (Kuhlman et al. 2003). Modern protein design methods are capable of designing sequences compatible with a backbone scaffold model (Dahiyat and Mayo 1997; Dwyer and Hellinga 2004).
Here we use a simple protein quality function expressing a balance between curve compactness and self-avoidance. It is formulated as the ratio of a term that quantifies curve compactness and a term that penalizes a curve that approaches itself too closely. Given a set of points $\{p(s_k)\}$ on a curve, two points $p(s_i)$ and $p(s_j)$ are said to form a contact when they are within a prescribed contact distance $d$ in space. Applied to points on continuous curves, the contact order is the arc-length separation $|s_j - s_i|$ averaged over all contacts (Plaxco et al. 1998). Contact order is large for curves in which many pairs of points distant in arc length are close in space and so serves as a simple quantitative measure of compactness. Explicitly, the contact order is

$$\mathrm{CO}(d) = \frac{1}{L\,N} \sum_{\text{contacts}} |s_j - s_i| \tag{9}$$
where N is the total number of contacts, L is the number of points, and the sum runs over all contacts. However, a curve that is too compact will approach too close to itself. Defining a clash as a pair of points that are closer than a prescribed clash distance c in space, curves with more than a very few close self-approaches are severely penalized by using the quality function
|
| (10) |
where CO(d) is the contact order of the curve and M(c) the number of clashes. With this function, curves with no such self-approaches are not penalized and so are ranked only by their relative contact order. Other functions could certainly be devised, but as any effective functions should be in agreement about the "poor" regions of the curvature space, a very simple scheme such as this one is suitable for an initial survey.
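A minimal sketch of such a scoring scheme is given below. The contact order follows the verbal definition above (arc-length separation summed over contacts, divided by the number of contacts and the number of points). The clash penalty is an assumption: an exponential in the clash count is used here only because it equals 1 when there are no clashes and grows quickly afterwards, and it should not be read as the exact form of Equation 10. Counting every pair of points, including neighbors along the chain, is likewise a simplification, and all function names are hypothetical.

```python
import math

def pair_distances(points):
    """Yield (i, j, distance) for every unordered pair of points."""
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            yield i, j, math.dist(points[i], points[j])

def contact_order(points, s_values, d):
    """Arc-length separation averaged over contacts, scaled by the number of points."""
    seps = [abs(s_values[j] - s_values[i])
            for i, j, dist in pair_distances(points) if dist <= d]
    if not seps:
        return 0.0
    return sum(seps) / (len(points) * len(seps))

def clash_count(points, c):
    """Number of pairs closer than the clash distance c."""
    return sum(1 for _, _, dist in pair_distances(points) if dist < c)

def quality(points, s_values, d, c, penalty=math.exp):
    """Compactness divided by a clash penalty.

    ASSUMPTION: penalty(M) = exp(M) is an illustrative choice, not the
    published functional form; any penalty with penalty(0) == 1 leaves
    clash-free curves ranked by contact order alone.
    """
    return contact_order(points, s_values, d) / penalty(clash_count(points, c))

# Usage: four points on a square of side 3, sampled at unit arc-length steps.
square = [(0, 0, 0), (3, 0, 0), (3, 3, 0), (0, 3, 0)]
arc = [0.0, 1.0, 2.0, 3.0]
q_no_clash = quality(square, arc, d=3.5, c=1.0)
q_clash = quality(square, arc, d=3.5, c=3.5)
```

Passing a different `penalty` callable makes it easy to check that alternative penalties agree about which regions of a curvature space score poorly.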
The quality function $Q$ was used to investigate the curves parameterized by points in a subspace of the larger 14-dimensional curvature space. The three points corresponding to the profiles in Figure 2 define a plane and a coordinate system on that plane. Defining $\mathbf{p}_0$, $\mathbf{p}_1$, and $\mathbf{p}_2$ as the vectors of curvature parameters for 1qqe, 1hz4, and 1b3u, respectively, any point on this plane can be represented by an ordered pair $(a, b)$ corresponding to the position $\mathbf{p}_0 + a(\mathbf{p}_1 - \mathbf{p}_0) + b(\mathbf{p}_2 - \mathbf{p}_0)$ in the curvature space, so that the parameter vector for 1qqe is at (0, 0), 1hz4 is at (1, 0), and 1b3u is at (0, 1). We have plotted the value of the function $Q$ for $a$ and $b$ in the range −0.2 to 1.2 and displayed the results in Figure 4A.
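The plane through three parameter vectors can be sampled with a few lines of code. In this sketch the names `plane_point`, `axis_values`, and `scan_plane` are hypothetical, `score` stands for any quality function that accepts a parameter vector, and the two-component vectors in the usage lines are placeholders rather than the fitted 14-component profiles.

```python
def plane_point(p0, p1, p2, a, b):
    """Parameter vector at plane coordinates (a, b): p0 + a*(p1 - p0) + b*(p2 - p0)."""
    return [x0 + a * (x1 - x0) + b * (x2 - x0) for x0, x1, x2 in zip(p0, p1, p2)]

def axis_values(lo, hi, steps):
    """Evenly spaced values from lo to hi inclusive."""
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]

def scan_plane(p0, p1, p2, score, lo, hi, steps):
    """Evaluate score on a steps x steps grid; rows follow b, columns follow a."""
    axis = axis_values(lo, hi, steps)
    return [[score(plane_point(p0, p1, p2, a, b)) for a in axis] for b in axis]

# Usage with placeholder vectors and a toy score (sum of the parameters).
p0, p1, p2 = [1.0, 2.0], [3.0, 2.0], [1.0, 5.0]
table = scan_plane(p0, p1, p2, sum, 0.0, 1.0, 3)
```

The nested list returned by `scan_plane` is the kind of table from which a contour plot over the plane could be drawn.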
In this plane, representing a small subset of the curvature space, a variety of distinct forms can be found, including some whose "curve quality" compares favorably with the fitted curves we have used to represent natural proteins, yet which do not correspond to any known protein structure (ribbon diagrams of some examples are shown in Fig. 4C). This result suggests that there are a large number of curves consistent with protein geometry, which could be found and constructed by a more systematic search of curvature spaces, either of the two-helix repeats or of other protein architectures.
With the rapid increase in structural knowledge has come the realization that nature has made use of a limited set of protein folds (Chothia 1992; Zhang and DeLisi 1998). It is not clear to what extent the set of folds used in nature represents the structural repertoire attainable by a polypeptide chain. The number of possible amino acid sequences clearly means that nature has only sampled a tiny fraction of possible polypeptides, but a given fold may be compatible with a large number of sequences. The number of sequences associated with stable folds has been investigated using lattice models (Li et al. 1996). This study suggested that protein folds may be those tertiary forms with energetically unique states that also have unusually large numbers of compatible sequences. Lattice models are currently the only types of models for which it is possible to enumerate exhaustively the structures associated with all possible sequences, and, despite their simplicity, they appear to capture the essential characteristics of the folding problem. But lattice models are not intended to represent particular natural proteins, so they cannot predict natural protein folds.
Our method differs in that a continuum representation is used, and it can represent particular natural protein folds. It is not capable of considering all possible types of folds at once, but the method can be comprehensive within a given architecture. In contrast to the discrete lattice models, smooth and continuous deformations of the curve representation can be used to model both subtle and large-scale changes in such protein folds. Consideration of fitness or energetic properties could be superimposed on top of the representation, and the "quality function" Q is a first step in this direction. The key difference is that folds of particular natural proteins can be parameterized, and so in principle the method can be predictive. But the method does not address whether a sequence might exist that could confer such a fold.
It is interesting to consider the relationship between curvature space and sequence space. The evolutionary history of a protein sequence can often be reconstructed from phylogenies. Doing so creates the path through sequence space that the protein has followed during its evolution. The question arises as to whether it is possible to establish a correspondence between such paths in sequence space and paths in curvature space. John Maynard Smith (1970) articulated a model of protein evolution as a series of sitewise changes in sequence that ultimately led from one sequence to a completely different sequence, but all the while retaining the ability to fold and carry out a cellular function. (Larger scale modifications of sequence may happen that result in discontinuous changes in structure [Cui et al. 2002].) It is not known if the evolution of new folds can be accomplished by the stepwise process envisioned by Smith: Can the tertiary structure of proteins be changed by successive pointwise changes from one fold to another?
The helical repeat proteins provide an example where incremental tertiary structure changes are observed. The variation in sequence among individual repeats creates small differences in the relative orientations of successive repeat units in the structure. The overall superhelical character of the array of repeats is a consequence of such fine adjustments, especially when amplified by repetition. A dramatic example is provided by the two HEAT proteins importin-β and protein phosphatase 2A PR65/A subunit, which form right- and left-handed superhelical arrays, respectively; yet, the two proteins are derived from a common ancestral sequence (Cingolani et al. 1999; Groves et al. 1999; Andrade et al. 2001). Examination of the spatial relationships between the individual repeats of these and other HEAT repeat structures shows that there is considerable diversity. The large collection of these structures with small differences between them suggests that the helical repeats constitute a densely sampled continuum of tertiary forms.
A mathematically explicit example of a continuous change between natural forms within this continuum can be viewed in Figure 4B. The diagonal bridge in the quality function between coordinates (0,1) and (1,0) suggests that realizable protein forms parameterized by the points along this path through the curvature space may exist. More abstractly, two forms that are connected by a high-scoring path might be related in the sense that if an important cellular function originally resided on a polypeptide with a fold somewhere on this path, the process of evolution might conceivably allow continuous change in the tertiary structure of this protein to both points while retaining a specific folded form and thereby also retaining any capability dependent on that structure. Points that are not connected could not be related in this manner. For example, in Figure 4B, no path exists between (0,0) and (1,0), suggesting that these structures do not share a common precursor located in the portion of the curvature space sampled so far. Parametrization by curvatures is a means to investigate Smith's abstract protein space as an explicit mathematical object, and quality functions might be thought of as the fitness of forms inhabiting this landscape (Macken and Perelson 1989). Investigation of the connectivity of the level sets of quality functions on curvature spaces allows for insight into whether folds could be evolutionarily related or disjoint.
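Whether two points lie in the same connected region of a level set can be checked on a sampled grid with a breadth-first search. The sketch below is one simple way to do this and is not taken from the original analysis: the function name `connected`, the use of 8-neighbor adjacency (so that a diagonal bridge counts as a path), and the toy grids are all assumptions for illustration.

```python
from collections import deque

def connected(grid, start, goal, threshold):
    """True if start and goal (row, col) are joined by cells scoring >= threshold.

    Cells are treated as adjacent when they touch along an edge or a corner.
    """
    rows, cols = len(grid), len(grid[0])
    if grid[start[0]][start[1]] < threshold or grid[goal[0]][goal[1]] < threshold:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        i, j = queue.popleft()
        if (i, j) == goal:
            return True
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if (0 <= ni < rows and 0 <= nj < cols
                        and (ni, nj) not in seen and grid[ni][nj] >= threshold):
                    seen.add((ni, nj))
                    queue.append((ni, nj))
    return False

# Toy grids: a diagonal ridge, and two columns separated by a low-scoring gap.
ridge = [[1, 0, 0],
         [0, 1, 0],
         [0, 0, 1]]
gap = [[1, 0, 1],
       [1, 0, 1],
       [1, 0, 1]]
```

On a real scan the answer depends on both the threshold and the grid resolution, so a disconnected result on a coarse grid is suggestive rather than conclusive.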
| Materials and Methods |
|---|
α-helix using EDPDB (Zhang and Matthews 1995). A helix with curvature 0.3812 and torsion 0.1492 was created, and the set of points along this curve with a spatial separation of 3.8 Å (one peptide plane) was obtained. The polyglycine model was superimposed on the curve by overlaying its Cα positions on the corresponding points obtained from the curve. The Frenet frames at each Cα position were then used to express the coordinates of the remaining backbone atoms. A helix of 100 residues was used to minimize any error introduced by the superposition procedure or conversion process, and the local coordinates reported in Table 1 are the average of these 100 instances.
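The relation between these curvature values and familiar helix geometry can be checked numerically. The sketch below assumes the curvature and torsion are in Å⁻¹ and interprets the 3.8 Å spacing as a straight-line (chord) distance between successive points; the function names are hypothetical, and bisection is simply one convenient way to find the arc-length step.

```python
import math

def helix_radius_and_rise(kappa, tau):
    """Radius and axial rise per radian of turn for a helix with constant kappa, tau."""
    lam2 = kappa ** 2 + tau ** 2
    return kappa / lam2, tau / lam2

def chord(kappa, tau, ds):
    """Straight-line distance between two helix points separated by arc length ds."""
    radius, rise = helix_radius_and_rise(kappa, tau)
    theta = math.hypot(kappa, tau) * ds  # angle turned about the helix axis
    return math.hypot(2.0 * radius * math.sin(theta / 2.0), rise * theta)

def arc_step_for_chord(kappa, tau, target):
    """Arc-length step whose chord equals target, found by bisection."""
    lo, hi = 0.0, math.pi / math.hypot(kappa, tau)  # chord increases on this interval
    if chord(kappa, tau, hi) < target:
        raise ValueError("target chord is not reached within half a turn")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chord(kappa, tau, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Usage with the alpha-helix values quoted above.
radius, rise = helix_radius_and_rise(0.3812, 0.1492)
ds = arc_step_for_chord(0.3812, 0.1492, 3.8)
turn_degrees = math.degrees(math.hypot(0.3812, 0.1492) * ds)
```

With these inputs the radius comes out near 2.3 Å and the turn per step near 100°, which is in line with standard α-helix geometry.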
For curve construction, curvature profiles were devised by a combination of manual adjustment and least-squares minimization of the sum of pairwise distances between point sets obtained from curves with the spatial separation of 3.8 Å and Cα traces of coordinate models. Initial orientation vectors $\mathbf{Y}(0)$ were obtained by superposition of the Cα trace of the first α-helix from a coordinate set and a corresponding set of points from an α-helical curve. This transformation was then applied to the initial Frenet frame from the α-helical curve to obtain the initial basis for curve construction.
Each curvature profile was fitted to a three-repeat section from its corresponding coordinate set as follows. For each helix-turn-helix motif from the selected section of the coordinate model, an initial basis was created as above and a four-segment polyhelix, composed of two α-helical curves (with curvature 0.38 and torsion 0.15) connected by two general helical arcs, was fitted to its Cα coordinates. An average of the values for the curvatures in the turn regions obtained from this procedure supplied the starting values for a periodic curvature profile. This periodic curvature profile was fitted first against the coordinates of the first two repeats in the selected section and then against the coordinates of the three repeats.
Calculations were carried out using Mathematica (Mathematica 5.2, Wolfram Research Inc.), Maple (Maple 10, Maplesoft), or custom C programs. Figures were created using MOLSCRIPT (Kraulis 1991), Raster3D (Merritt and Bacon 1997), MOLMOL (Koradi et al. 1996), and Mathematica (Wolfram Research Inc.).
| Footnotes |
|---|
Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.051971106.
| Acknowledgments |
|---|
| References |
|---|
Banavar J.R., Maritan A., Micheletti C., Trovato A. 2002. Geometry and physics of proteins. Proteins 47: 315–322.
Binz H.K., Stumpp M.T., Forrer P., Amstutz P., Pluckthun A. 2003. Designing repeat proteins: Well-expressed, soluble and stable proteins from combinatorial libraries of consensus ankyrin repeat proteins. J. Mol. Biol. 332: 489–503.
Chothia C. 1992. Proteins: 1000 families for the molecular biologist. Nature 357: 543–544.
Cingolani G., Petosa C., Weis K., Muller C.W. 1999. Structure of importin-β bound to the IBB domain of importin-α. Nature 399: 221–229.
Cui Y., Wong W.H., Bornberg-Bauer E., Chan H.S. 2002. Recombinatoric exploration of novel folded structures: A heteropolymer-based model of protein evolutionary landscapes. Proc. Natl. Acad. Sci. 99: 809–814.
Dahiyat B.I. and Mayo S.L. 1997. De novo protein design: Fully automated sequence selection. Science 278: 82–87.
Dwyer M.A. and Hellinga H.W. 2004. Periplasmic binding proteins: A versatile superfamily for protein engineering. Curr. Opin. Struct. Biol. 14: 495–504.
Goriely A. and Tabor M. 1997. Nonlinear dynamics of filaments. 1. Dynamical instabilities. Physica D 105: 20–44.
Groves M.R. and Barford D. 1999. Topological characteristics of helical repeat proteins. Curr. Opin. Struct. Biol. 9: 383–389.
Groves M.R., Hanlon N., Turowski P., Hemmings B.A., Barford D. 1999. The structure of the protein phosphatase 2A PR65/A subunit reveals the conformation of its 15 tandemly repeated HEAT motifs. Cell 96: 99–110.
Harrison A., Pearl F., Mott R., Thornton J., Orengo C. 2002. Quantifying the similarities within fold space. J. Mol. Biol. 323: 909–926.
Hou J.T., Sims G.E., Zhang C., Kim S.H. 2003. A global representation of the protein fold space. Proc. Natl. Acad. Sci. 100: 2386–2390.
Klapper I. and Qian H. 1998. Remarks on discrete and continuous large-scale models of DNA dynamics. Biophys. J. 74: 2504–2514.
Kohl A., Binz H.K., Forrer P., Stumpp M.T., Pluckthun A., Grutter M.G. 2003. Designed to be stable: Crystal structure of a consensus ankyrin repeat protein. Proc. Natl. Acad. Sci. 100: 1700–1705.
Koradi R., Billeter M., Wuthrich K. 1996. MOLMOL: A program for display and analysis of macromolecular structures. J. Mol. Graph. 14: 51–55.
Kraulis P.J. 1991. MOLSCRIPT: A program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallogr. 24: 946–950.
Kuhlman B., Dantas G., Ireton G.C., Varani G., Stoddard B.L., Baker D. 2003. Design of a novel globular protein fold with atomic-level accuracy. Science 302: 1364–1368.
Li H., Helling R., Tang C., Wingreen N. 1996. Emergence of preferred structures in a simple model of protein folding. Science 273: 666–669.
Macken C.A. and Perelson A.S. 1989. Protein evolution on rugged landscapes. Proc. Natl. Acad. Sci. 86: 6191–6195.
Main E.R.G., Xiong Y., Cocco M.J., D'Andrea L., Regan L. 2003. Design of stable α-helical arrays from an idealized TPR motif. Structure 11: 497–508.
Manning R.S., Maddocks J.H., Kahn J.D. 1996. A continuum rod model of sequence-dependent DNA structure. J. Chem. Phys. 105: 5626–5646.
Maritan A., Micheletti C., Trovato A., Banavar J.R. 2000. Optimal shapes of compact strings. Nature 406: 287–290.
Marko J.F. and Siggia E.D. 1994. Bending and twisting elasticity of DNA. Macromolecules 27: 981–988.
Merritt E.A. and Bacon D.J. 1997. Raster3D: Photorealistic molecular graphics. Methods Enzymol. 277: 505–524.
Mosavi L.K., Minor D.L., Peng Z.Y. 2002. Consensus-derived structural determinants of the ankyrin repeat motif. Proc. Natl. Acad. Sci. 99: 16029–16034.
Plaxco K.W., Simons K.T., Baker D. 1998. Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol. 277: 985–994.
Rice L.M. and Brunger A.T. 1999. Crystal structure of the vesicular transport protein Sec17: Implications for SNAP function in SNARE complex disassembly. Mol. Cell 4: 85–95.
Shindyalov I.N. and Bourne P.E. 2000. An alternative view of protein fold space. Proteins 38: 247–260.
Smith J.M. 1970. Natural selection and concept of a protein space. Nature 225: 563–564.
Steegborn C., Danot O., Huber R., Clausen T. 2001. Crystal structure of transcription factor MalT domain III: A novel helix repeat fold implicated in regulated oligomerization. Structure 9: 1051–1060.
Stumpp M.T., Forrer P., Binz H.K., Pluckthun A. 2003. Designing repeat proteins: Modular leucine-rich repeat protein libraries based on the mammalian ribonuclease inhibitor family. J. Mol. Biol. 332: 471–487.
Trovato A., Hoang T.X., Banavar J.R., Maritan A., Seno F. 2005. What determines the structures of native folds of proteins? J. Phys. Condens. Matter 17: S1515–S1522.
Zhang C.O. and DeLisi C. 1998. Estimating the number of protein folds. J. Mol. Biol. 284: 1301–1305.
Zhang X.J. and Matthews B.W. 1995. EDPDB: A multifunctional tool for protein-structure analysis. J. Appl. Crystallogr. 28: 624–630.