1 Department of Biophysics and Biophysical Chemistry, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA
2 Jenkins Department of Biophysics, Johns Hopkins University, Baltimore, Maryland 21218, USA
Reprint requests to: George D. Rose, Department of Biophysics and Biophysical Chemistry, Johns Hopkins University School of Medicine, 725 N. Wolfe Street, Baltimore, MD 21205; e-mail: rose{at}grserv.med.jhmi.edu; fax: (410) 614-3971.
(RECEIVED June 20, 2001; FINAL REVISION November 8, 2001; ACCEPTED November 9, 2001)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.24701.
| Abstract |
|---|
Keywords: Protein evolution; protein domains; protein folding; protein topology; folding rules; hierarchy
| Introduction |
|---|
The application of elementary rules to generate structure from basic building blocks is an intrinsically hierarchical process (Crippen 1978; Rose 1979). Richardson's composition rules for ß-sheet are an example of this approach (Richardson 1977). Quoting from her concise formulation:
Assuming that the α helices and ß strands are already present at least statistically, let us say that each succeeding step in any possible folding pathway for a ß sheet must consist of either (1) forming a ±1 or ±1x connection between two ß strands adjacent in sequence, or (2) taking either a ß strand or a prefolded unit and laying it down next to a prefolded part of the sheet with which it is also contiguous in sequence.
Using her rules, consecutive ß-strands grow into larger hydrogen-bonded structures in successive steps, and blocks of strands obtained in this way coalesce, providing they are consecutive in the chain. Of course, it is tempting to hypothesize that such procedures are related to actual protein folding pathways (Richardson 1977; Stirk et al. 1992; Hutchinson and Thornton 1993; Zhang and Kim 2000).
How does one uncover the grammar of a language? Assuming that protein folds can be generated from a set of simple rules, how might such rules be discovered? Effective rules should have the potential to generate a diverse range of physically feasible folds, including previously unobserved structures. The ubiquitous occurrence of super-secondary structures (Levitt and Chothia 1976) across unrelated families indicates that there is a physical basis for their independent formation and motivates our choice of simple rules.
Formally, the rules are operators. Their operands are structures, and an operation results in a new operand. This is a familiar definition, similar to binary addition.
Motivated in large part by Richardson's early work (Richardson 1977), we propose four simple folding rules for all-ß proteins, corresponding to the four prevalent super-secondary structure ß-motifs: ß-hairpin, ß-ß-ß unit, jelly roll, and Greek key. As such, the rules embody physically based topological and hydrogen-bonding relationships between neighboring strands. Two strands are classified as neighbors when they either (1) are consecutive in sequence or (2) become juxtaposed in space from a previously applied folding rule. This latter relationship is identified via closure. When a folding rule juxtaposes two strands, they are classified as neighbors under closure, after which they become a valid object for subsequent applications of the folding rules. In general, closure results in new neighborhood relationships that incorporate the topology from previous folding steps, in a process that is intrinsically hierarchic.
A recursive domain is defined as a compact fold, or part of a fold, that can be generated by repeated application of the four folding rules, with closure. Whenever a protein fold can be generated entirely from the folding rules, it is composed of one recursive domain. Otherwise, it can be partitioned into multiple recursive domains, each of which is generated entirely by the rules. If the rules successfully capture the underlying folding process, then we might expect single-domain proteins to be comprised of a single recursive domain, with larger proteins comprised of only a small number of recursive domains. To test this expectation, the number of recursive domains was computed for each protein in two large all-ß test sets. One set consists of representatives of SCOP (structural classification of proteins) families, and the other consists of representatives of SCOP folds (Murzin et al. 1995). The second set is a subset of the first one, of course, but it was included as a control to ensure that folds which span many SCOP families do not bias the results.
The test is performed in a fully automatic way, using graph-theoretic tools. A recursive domain translates conveniently into the language of graph theory, as described in the Appendix.
We find that the majority of families (80%) of small ß-proteins correspond to a single recursive domain, whereas larger proteins are typically comprised of a small number of recursive domains. Specifically, >90% of all proteins, both families and folds, can be decomposed into at most three recursive components.
Is the ability to represent a protein by a small number of recursive domains a characteristic behavior for protein folds, or will the composition rules decompose any compact assembly of ß-strands into just a few recursive domains? To address this question, we tested all possible up-down 2 x 4 beta-sheet topologies. Only 14% of these topologies can be generated as a single recursive domain.
| The folding rules |
|---|
Specifically, the four folding rules were motivated by prevalent super-secondary structure motifs found in proteins. Each rule represents an observed topological relationship between/among neighboring strands.
With the exception of the interposed strand in the indirect ß-wind rule, these rules apply only to neighboring strands. By definition, neighboring strands are either sequentially consecutive or become neighbors on recursion. This condition, together with restrictions on the interposed strand in the indirect ß-wind rule, ensures that sequentially nonadjacent strands are not subject to a folding operation unless they have been brought together via rule-based iteration.
Along with the four folding rules, there is an explicit closure operation (Fig. 2). Without closure, decomposition would be arrested at the level of super-secondary structure and isolated ß-strands. This follows from the fact that super-secondary structure is local, in that it is comprised of consecutive elements of secondary structure. However, once identified, the presence of a unit of super-secondary structure imposes spatial restrictions on the remaining components of the fold. In particular, strands that are distant in sequence can be restricted to be close in space. In effect, the closure operation introduces a new virtual connection, like a shortcut through the sequence. As such, two secondary structures are classified as neighbors if they are consecutive in sequence or if they are linked by a virtual connection that is realized on application of closure. Correspondingly, the folding rules are applicable to both consecutive strands and nonconsecutive strands that become neighbors via these virtual connections.
Multistep hierarchic decomposition is illustrated for the jelly roll motif (Stirk et al. 1992) in Figure 3. A jelly roll is a ribbon of antiparallel strands, and it can be parsed by reiterating the hairpin rule. Initially, the hairpin rule is applied to strands s3s4 (Fig. 3a), the only two strands that are sequential neighbors. On closure, s2s5 become neighbors, and the hairpin rule is applied again (Fig. 3b), after which s1s6 become neighbors, followed by a final application of the hairpin rule (Fig. 3c). At this point, the motif has been reduced to three consecutive hairpins (Fig. 3d).
Ultimately, the jelly roll in Figure 3 can be reduced to a single recursive domain. Initially, each secondary structure is classified as a recursive domain, containing itself as a singleton. Strands that can be grouped by any folding rule become members of the same recursive domain. Accordingly, each application of a folding rule has the potential to merge two or more recursive domains into one, as illustrated for the jelly roll example in Figure 3. The hairpin folding rules identify three recursive domains, each annotated by a different color in Figure 3d. Clearly, the hairpin rule alone has a limited capacity to generate interesting recursive domains. However, the indirect ß-wind rule can be applied to strands s1s2s3s4, merging all three hairpins into one recursive domain. Alternatively, the antiparallel bridge rule could be applied twice: first to strands s2s3s4s5 and then to strands s1s2s5s6, again merging all three hairpins into one recursive domain. In general, the folding pathway obtained by successive applications of the folding rules is not unique.
Figure 4 follows the step-wise decomposition of desulfoferrodoxin (1dfx, residues 37–125) into a single recursive domain. Alternate panels document successive partitions into recursive components; these are interleaved with panels illustrating the pertinent folding rules.
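The reiteration of the hairpin rule on the jelly roll, with closure between applications, can be sketched in a few lines. This is only an illustrative model, not the authors' implementation: strands are numbered 1 to 6, the hypothetical function `reiterate_hairpin_rule` takes the set of hydrogen-bonded pairs as given, and closure is assumed to link the two strands that flank a newly formed hairpin.

```python
from typing import List, Set, Tuple

Pair = Tuple[int, int]


def reiterate_hairpin_rule(n_strands: int, h_bonded: Set[Pair]) -> List[Pair]:
    """Apply a toy hairpin rule repeatedly, with closure after each application.

    Assumption for illustration: after a hairpin forms on (i, j), closure makes
    the flanking strands (i - 1, j + 1) neighbors.
    """
    # Initially, only sequence-consecutive strands are neighbors.
    neighbors: Set[Pair] = {(i, i + 1) for i in range(1, n_strands)}
    hairpins: List[Pair] = []
    progress = True
    while progress:
        progress = False
        for pair in sorted(neighbors):  # sorted() copies, so adding below is safe
            if pair in h_bonded and pair not in hairpins:
                hairpins.append(pair)
                i, j = pair
                if i - 1 >= 1 and j + 1 <= n_strands:
                    neighbors.add((i - 1, j + 1))  # closure: virtual connection
                progress = True
    return hairpins


# Jelly roll: s3s4, s2s5, and s1s6 are the hydrogen-bonded pairs.
order = reiterate_hairpin_rule(6, {(3, 4), (2, 5), (1, 6)})
print(order)  # [(3, 4), (2, 5), (1, 6)]
```

If the s3s4 pair is left out of `h_bonded`, no hairpin forms at all in this model, because s2s5 and s1s6 become neighbors only through closure.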
| Technical details |
|---|
The DSSP (database of secondary structure in proteins) algorithm (Kabsch and Sander 1983), with minor modifications, was used to identify ß-strands in proteins of known structure. The three modifications are as follows: (1) A ß-strand must be at least two residues in length, but optionally, a strand of three or fewer residues can be treated as a loop; (2) if two consecutive ß-strands are hydrogen bonded to a third ß-strand, and directionality is preserved (i.e., both strands are parallel to the third strand or both are antiparallel to the third strand), they are classified as one distorted ß-strand; and (3) two ß-strands interrupted by a single residue are treated as one distorted ß-strand, provided they are approximately collinear.
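Modification (3) amounts to a merge over strand segments, sketched below. The representation is an assumption made for illustration: each strand is a hypothetical `(start, end)` pair of residue indices, and `is_collinear` is a placeholder for the geometric test, which the text does not specify.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # (first residue, last residue), inclusive


def merge_interrupted_strands(
    strands: List[Segment],
    is_collinear: Callable[[Segment, Segment], bool],
) -> List[Segment]:
    """Join strands separated by exactly one residue when they are collinear."""
    merged: List[Segment] = []
    for start, end in sorted(strands):
        if merged:
            prev_start, prev_end = merged[-1]
            # start - prev_end == 2 means a single residue lies between them.
            if start - prev_end == 2 and is_collinear((prev_start, prev_end), (start, end)):
                merged[-1] = (prev_start, end)
                continue
        merged.append((start, end))
    return merged


# Hypothetical segments; the always-true predicate stands in for the geometry.
print(merge_interrupted_strands([(1, 5), (7, 10), (15, 20)], lambda a, b: True))
# [(1, 10), (15, 20)]
```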
The ß-wind, indirect ß-wind, and bridge rules are applied only if the relevant fragments interact. Fragments are defined to interact when the contact area buried between them is at least 20% of the total area of the smaller fragment.
Finally, two strands are considered to be hydrogen bonded if their direction is similar (either parallel or antiparallel) and the contact area between their backbones is at least 25% of the total area of the shorter strand. In rare instances, this definition can classify two proximate strands as hydrogen bonded even if they lack explicit donor/acceptor interactions.
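The two area thresholds above reduce to simple predicates. The sketch below assumes the contact and surface areas have already been computed by some other means; the function and parameter names are hypothetical, and only the 20% and 25% cutoffs come from the text.

```python
def fragments_interact(buried_contact_area: float,
                       smaller_fragment_area: float,
                       threshold: float = 0.20) -> bool:
    """True if the buried contact area is at least 20% of the smaller fragment's area."""
    return buried_contact_area >= threshold * smaller_fragment_area


def strands_hydrogen_bonded(backbone_contact_area: float,
                            shorter_strand_area: float,
                            similar_direction: bool,
                            threshold: float = 0.25) -> bool:
    """True if the strands run parallel or antiparallel and the backbone contact
    area is at least 25% of the shorter strand's area."""
    return similar_direction and backbone_contact_area >= threshold * shorter_strand_area


# Hypothetical areas, in arbitrary but consistent units.
print(fragments_interact(30.0, 100.0))             # True
print(strands_hydrogen_bonded(30.0, 100.0, True))  # True
print(strands_hydrogen_bonded(30.0, 100.0, False)) # False
```

Because the second test uses contact area rather than explicit donor/acceptor geometry, it can accept proximate strands that lack actual hydrogen bonds, as noted in the text.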
| Results |
|---|
Summarizing these results for SCOP families, 62% of the folds can be fully generated by our folding rules as a single recursive domain, 83% by two such domains, and 89% by three such domains. For the subset of proteins restricted to at most 10 strands, 80% reduce to one recursive domain and 95% to two such domains. Similar results are seen for SCOP folds: the corresponding percentages are 56%, 81%, and 89%, respectively, for the full set, and 68% and 90%, respectively, for the restricted set.
As a control, we tested whether random assembly of strands can also be reduced to a small number of recursive domains, or whether instead this is a characteristic property of proteins. To this end, all combinatorially possible eight-strand sandwiches with up-down topology were generated, and the folding rules were applied to this control set. The distribution of recursive domains is shown in Figure 6 (upper bars). Only 18% of these topologies can be generated by a single recursive domain. Clearly, the distribution for a random assembly of ß-strands differs from that of authentic proteins, even when restricted to a ß-sandwich.
The set of exceptions is dominated by ß-propellers, in which the number of recursive domains equals the number of blades. None of our rules collapse the blades into a single domain, although it would be simple to devise such a rule; for example, two neighboring recursive domains can collapse along a common hydrophobic core. However, such a rule would be different in kind from the ones proposed here, and we did not consider such an extension at this stage. Of the remaining eight proteins, six are explained by an incorrect secondary structure assignment caused by a minor threshold violation.
The remaining two proteins from the fold test set are shown in Figure 7. One is an ISP domain, described in SCOP as a two-domain protein, in which one of the domains is a six-stranded sandwich or barrel. The representative of this fold chosen by ASTRAL is the ISP subunit of the mitochondrial cytochrome bc1-complex (1rie). Our rules partition this subunit into four recursive domains. Interestingly, another structurally similar member of this family (the ISP subunit from chloroplast cytochrome bf complex, 1rfs) can be generated as a single recursive domain using the rules.
Examining proteins in the family test set revealed one additional exception, a ß-trefoil (1wba). However, this fold is not an exception in the fold test set. Closer examination shows that the antiparallel bridge rule obtains for most of the other representatives of this fold, but this particular case is an exception.
Are there conceivable ß-folds that nature avoids? Our approach formalizes the intuition that observed folds are dominated by conformations in which chain connectivity avoids random hops between disparate points in space. Others have addressed such questions as well.
Richardson (1977) documented two such fold properties, one based on structural chirality and the other on the topology of the backbone. The first property is completely independent of ours. When applied after the fact to structures generated using our rules, about half can be rejected because they lack the correct orientation. Her second property is a noncrossing criterion, and it is related to ours. In the context of a two-layer ß-sandwich, the noncrossing property can be stated as follows: let the two ß-sheets be embedded on opposite sides of a cube, with interstrand connections that traverse the surface of the cube. Retain only those sheets for which no two loops cross. Eliminating structures that fail to satisfy the noncrossing property would further decrease the list of acceptable folds.
We tested the degree to which the set of folds generated by our rules can be captured by Richardson's noncrossing criterion (Fig. 6, lower bars), and we find that this criterion eliminates <37% of all combinatorially possible up-down eight-strand sandwiches. Furthermore, although the number of recursive components that satisfy the noncrossing restriction is reduced in comparison to a random assembly of ß-strands, their distribution is similar.
Naturally occurring ß-sheet topologies were also analyzed by Zhang and Kim (2000), who observed that among the 96 possible topologies for a four-stranded sheet, only 42 are observed. The investigators identified two characteristic properties of the underrepresented topologies. One group, G1, includes sheets in which two parallel strands are situated in opposition to two antiparallel strands. The second group, G2, includes sheets in which two sequentially consecutive strands occupy nonadjacent positions in the sheet, for example, the first and fourth positions. There is only one pair of strands in G2 that is consecutive in both sequence and structure.
The first of these two criteria is independent from ours and has the potential to be an additional screen for valid fold candidates. The second criterion involves the degree to which main-chain connectivity is free to hop at random in a four-stranded ß-sheet topology, and it is a special case of the property that we address in this paper.
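The notion of main-chain connectivity "hopping" within a four-stranded sheet can be made concrete with a small enumeration. The sketch below is an illustration under stated assumptions, not a reproduction of Zhang and Kim's procedure: it assumes that the 96 topologies arise as 12 spatial strand orders (an order and its mirror image counted once) times 8 parallel/antiparallel assignments for the three adjacent pairs. The helper names are hypothetical.

```python
from itertools import permutations, product


def strand_orders(n=4):
    """Spatial orders of n strands, counting an order and its reversal once."""
    return [p for p in permutations(range(1, n + 1)) if p[0] < p[-1]]


def nonadjacent_consecutive_pairs(order):
    """Sequence-consecutive strand pairs that are NOT side by side in the sheet."""
    position = {strand: idx for idx, strand in enumerate(order)}
    return [(s, s + 1) for s in range(1, len(order))
            if abs(position[s] - position[s + 1]) != 1]


orders = strand_orders(4)
# Assumed counting: each of the 3 adjacent pairs is parallel or antiparallel.
orientations = list(product(("parallel", "antiparallel"), repeat=3))
topologies = [(o, r) for o in orders for r in orientations]

print(len(orders), len(topologies))                 # 12 96
print(nonadjacent_consecutive_pairs((1, 2, 3, 4)))  # []
print(nonadjacent_consecutive_pairs((2, 4, 1, 3)))  # [(1, 2), (2, 3), (3, 4)]
```

An order such as 2-4-1-3, in which every sequence-consecutive pair is separated in space, is the extreme case of the random hopping that the recursive-domain formalism disfavors.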
Accordingly, we tested whether our recursive domain formalism can rationalize the absence of G2 topologies among observed folds. Given that four-stranded sheets do not occur in isolation, we adopted a broader test set consisting of all theoretically possible 2 x 4 up-down sandwiches. A count was made of the number of times each four-strand sheet topology in G2 is represented (1) in this test set and (2) in a reduced test subset that was restricted to include only sandwiches having one recursive domain. One expects that G2 topologies will occur rarely in the reduced test subset. Indeed, there are 5040 occurrences of folds from G2 in the unrestricted test set, but only 4.9% remain in the reduced subset. Moreover, when the test subset is further reduced by removing topologies that fail to satisfy the noncrossing property, only 1.9% remain. It follows that G2 topologies are selectively depleted in recursive domains.
In essence, our folding rules quantify the impression gleaned from visual inspection: ß-folds show a simple, underlying organization, with orderly patterns of chain connectivity.
| Discussion |
|---|
In the preceding, we introduced a grammar for all-ß protein domains, based on four simple composition rules. With this definition, a domain corresponds to a collection of ß-strands that can be lumped into a single structural unit on application of the rules, with closure. The rules were motivated by four types of commonly observed super-secondary structure. Using them, we showed that almost every all-ß fold can be iteratively decomposed into a small number of recursive domains, usually just one. Here, our goal has been to explore the rules, not to tinker with them, and we expect that an improved rule set could be devised with modification and/or extension. Also, the existence of similar rules that span α- and α/ß-proteins is anticipated.
The four simple rules provide a compact description of folding for all-ß proteins, and they give rise to the observed hierarchic organization of proteins (Crippen 1978; Rose 1979) quite naturally. Often, there are multiple ways to generate a given fold using the rules, and the absence of a unique parse tree is consistent with the existence of multiple folding paths. Richardson (1977) made a similar observation years ago.
Do these abstract rules have physical correlates in the actual mechanism of protein folding? We suspect so. The fact that unrelated proteins can be generated from the same set of simple rules is strongly suggestive, with the following as a plausible connection. At the cartoon level of resolution (5 Å), a protein structure can be described as a series of isodirectional segments (i.e., α-helices and ß-strands) interconnected by tight turns and larger loops (Rose and Seltzer 1977). This partitioning is already anticipated in the unfolded molecule by sterically imposed, conformational bias (Srinivasan and Rose 1999; Pappu et al. 2000). Segmental bias is then fortified on folding as water, a poor solvent for polypeptide chains, squeezes the protein from its midst, pushing it toward compactness. It seems likely that our rules, which were abstracted from observed structural motifs, are a reflection of this underlying process.
Evolution is the history of contingent experiments of nature (Gould 1989), recorded in life's molecules. Do the structures of these molecules evolve at random? Or, are there hidden constraints on their patterns (Banavar et al. 2002), scope, and complexity? The very existence of a grammar argues for the latter view. Some structures, albeit conceivable, are simply not valid sentences in the language of proteins. Further, if a grammar for proteins is anchored in the chemistry of polypeptide chains, then the set of valid folds is predetermined, and evolution can only fill in the blanks. It is our conjecture that the discovered grammar is an expression of nature, not just a coincidental post hoc invention that happens to be consistent with the facts of life.
| Appendix |
|---|
Domain edges connect strands within the same recursive domain. At the start of the procedure, before a folding rule is applied, there are no domain edges in the graph. On application of a folding rule, any strands grouped by the rule become members of the same recursive domain and are connected by a domain edge. In general, two strands belong to the same recursive domain if there is a path along domain edges connecting the vertex corresponding to one strand to the vertex corresponding to the second strand. The relation of belonging-to-the-same-recursive-domain is transitive.
Neighbor edges reflect information about neighboring strands. At the start of the procedure, the only neighbor edges are between pairs of strands that are consecutive in sequence. New neighbor edges can be introduced on closure. At each new iteration, the closure operation simply searches the graph for the existence of pairs of vertices corresponding to strands that are distant in sequence but close in space. Such pairs arise as a consequence of spatial restrictions that are imposed by the folding rules during the previous iteration. If such a pair is found, then a neighbor edge is introduced. In effect, the neighbor edge is a short-cut between the two vertices. Unlike domain edges, neighbor edges have a direction: the vertex corresponding to the strand that is closer to the N terminus is the beginning of the edge, and the vertex corresponding to the strand that is closer to the C terminus is the end of the edge.
To partition a protein into recursive domains, the folding rules are applied repeatedly until no further application is possible. At this point, the graph will have a number of domain edges. Next, recursive domains are determined. Only domain edges are pertinent for this step; neighbor edges are disregarded.
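The final step, reading recursive domains off the domain edges, is a connected-components computation. The sketch below uses a union-find structure as one standard way to do this; the text does not say which algorithm the authors used, and the function name and the strand numbering from 1 are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def recursive_domains(n_strands: int,
                      domain_edges: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group strands 1..n_strands into connected components of the domain edges.

    Neighbor edges are deliberately not an input: only domain edges matter here.
    """
    parent = list(range(n_strands + 1))  # index 0 is unused

    def find(x: int) -> int:
        # Follow parent links to the representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in domain_edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two recursive domains

    groups: Dict[int, List[int]] = {}
    for strand in range(1, n_strands + 1):
        groups.setdefault(find(strand), []).append(strand)
    return sorted(groups.values())


# Jelly roll after the hairpin rule alone: three recursive domains.
hairpin_edges = [(3, 4), (2, 5), (1, 6)]
print(recursive_domains(6, hairpin_edges))
# [[1, 6], [2, 5], [3, 4]]

# Hypothetical extra domain edges from a rule that groups s1, s2, s3, and s4.
print(recursive_domains(6, hairpin_edges + [(1, 2), (2, 3)]))
# [[1, 2, 3, 4, 5, 6]]
```

Transitivity of the belonging-to-the-same-recursive-domain relation is what allows a single additional rule application to merge all three hairpins.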
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| REFERENCES |
|---|
Banavar, J.R., Maritan, A., Micheletti, C., and Trovato, A. 2002. Geometry and physics of proteins. Proteins (in press).
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.
Brenner, S.E., Koehl, P., and Levitt, M. 2000. The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res. 28: 254–256.
Cormen, T.H., Leiserson, C.E., and Rivest, R.L. 1990. Introduction to algorithms. MIT Press, Cambridge, MA.
Crippen, G.M. 1978. The tree structural organization of proteins. J. Mol. Biol. 126: 315–332.
Doolittle, R.F. 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64: 287–314.
Efimov, A.V. 1993a. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60: 201–239.
Efimov, A.V. 1993b. Super-secondary structures involving triple-strand ß-sheets. FEBS Lett. 334: 253–256.
Efimov, A.V. 1996. A structural tree for α-helical proteins containing α-α-corners and its application to protein classification. FEBS Lett. 391: 167–170.
Efimov, A.V. 1997. A structural tree for proteins containing three ß-corners. FEBS Lett. 407: 37–41.
Gould, S.J. 1989. Wonderful life: The Burgess shale and the nature of history. W.W. Norton, New York, NY.
Hall, T.M., Porter, J.A., Beachy, P.A., and Leahy, D.J. 1995. A potential catalytic site revealed by the 1.7 Å crystal structure of the amino-terminal signaling domain of Sonic hedgehog. Nature 378: 212–216.
Hutchinson, E.G. and Thornton, J.M. 1993. The Greek key motif: Extraction, classification and analysis. Protein Eng. 6: 233–245.
Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
Lesk, A.M. 1995. Systematic representation of protein folding patterns. J. Mol. Graph. 13: 159–164.
Levitt, M. and Chothia, C. 1976. Structural patterns in globular proteins. Nature 261: 552–558.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.
Pappu, R.V., Srinivasan, R., and Rose, G.D. 2000. The Flory isolated-pair hypothesis is not valid for polypeptide chains: Implications for protein folding. Proc. Natl. Acad. Sci. 97: 12565–12570.
Richardson, J.S. 1977. ß-Sheet topology and the relatedness of proteins. Nature 268: 495–500.
Rose, G.D. 1979. Hierarchic organization of domains in globular proteins. J. Mol. Biol. 134: 447–470.
Rose, G.D. and Seltzer, J. 1977. A new algorithm for finding the peptide chain turns in a globular protein. J. Mol. Biol. 113: 153–164.
Srinivasan, R. and Rose, G.D. 1999. A physical basis for protein secondary structure. Proc. Natl. Acad. Sci. 96: 14258–14263.
Stickle, D.F., Presta, L.G., Dill, K.A., and Rose, G.D. 1992. Hydrogen bonding in globular proteins. J. Mol. Biol. 226: 1143–1159.
Stirk, H.J., Woolfson, D.N., Hutchinson, E.G., and Thornton, J.M. 1992. Depicting topology and handedness in jellyroll structures. FEBS Lett. 308: 1–3.
Woolfson, D.N., Evans, P.A., Hutchinson, E.G., and Thornton, J.M. 1993. Topological and stereochemical restrictions in ß-sandwich protein structures. Protein Eng. 6: 461–70.
Zhang, C. and Kim, S.H. 2000. The anatomy of protein ß-sheet topology. J. Mol. Biol. 299: 1075–89.