1 Biomolecular Structure and Design Program and
2 Department of Medicinal Chemistry, University of Washington, Seattle, Washington 98195, USA
Reprint requests to: Valerie Daggett, Biomolecular Structure and Design Program, Department of Medicinal Chemistry, University of Washington, Box 357610, Seattle, WA 98195, USA; e-mail: daggett{at}u.washington.edu; fax: (206) 685-3252.
(RECEIVED February 17, 2003; FINAL REVISION July 11, 2003; ACCEPTED July 14, 2003)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0306803.
| Abstract |
|---|
Keywords: CATH; Dali; SCOP; fold classification
| Introduction |
|---|
The problem of classifying folds can be considered in two steps. In the first step, a protein chain or complex must be broken into its constituent domains, and in the second step, these domains are assigned a fold. The methods for classifying folds range from a variety of fully automated procedures (for review, see Eidhammer et al. 2000) to a largely manual method on the basis of expert knowledge. Here, we consider three popular systems, all of which have used various methods of comparing folds in constructing hierarchies of protein similarity.
Structural Classification of Proteins, or SCOP, was among the earliest efforts to classify protein structures into folds. Protein domains with no obvious sequence homology to other domains are defined and classified manually (Murzin et al. 1995). In many ways, this database has been considered the standard for protein structure classification. The human mind is quite adept at recognizing slightly dissimilar objects as belonging to a common architecture. Thus, simplified representations of protein structures can be grouped by expert inspection.
The Class, Architecture, Topology, Homologous superfamily (CATH) database uses a combination of manual and automated procedures to define and classify protein domains (Orengo et al. 1997). CATH relies on the consensus of three automated classification methods to break protein chains into domains (Jones et al. 1998). This approach is effective in defining the domains of ~53% of the chains. The domains of chains for which consensus is not reached are defined manually. Homology is defined largely in terms of sequence, but distant sequence matches with high structural similarity may be defined as being homologous. Domains lacking homology to already defined folds are given a topology-level classification on the basis of the CORA (Orengo 1999) or SSAP (Taylor and Orengo 1989) algorithm.
The Dali Domain Dictionary is the product of a fully automated method of defining and classifying domains (Dietmann and Holm 2001). Domains are defined by a version of the PUU algorithm (Holm and Sander 1994) that has been updated to consider the recurrence of putative domains (Holm and Sander 1998). Structural similarities between domains are determined using Dali, which first uses a fast algorithm that represents secondary structural elements as vectors, and then a slower algorithm that compares residue centers to fine-tune matches. Dali scores have traditionally been presented only in the FSSP database as pairwise comparisons between structures that have not been divided into domains. In the Dali Domain Dictionary (Dietmann et al. 2001), the Dali scores have been used to create a hierarchical structure, allowing for direct comparison with SCOP and CATH.
Although several studies have compared various fold classification systems for a variety of purposes (Gerstein and Levitt 1998; Sauder et al. 2000; Shindyalov and Bourne 2000; Dengler et al. 2001; Getz et al. 2002), Hadley and Jones' work remains the only explicit comparison of the fold classifications produced by SCOP, CATH, and FSSP (Hadley and Jones 1999). This study found that the classification systems tend to agree on their classifications for chains, although many chains classified in one system were not present in others. More recent work by Dietmann and Holm explicitly compared the Dali Domain Definitions (version 3.1b) to SCOP (version 1.53) to validate their map of fold space (Dietmann and Holm 2001). They too found good agreement between the classification systems.
Our motivations for carrying out another comparison of these classification systems are as follows. First, the number of structures deposited in the PDB has grown and continues to grow exponentially. In their 1999 study, Hadley and Jones considered fewer than 10,000 chains from each classification system; there are now over 20,000 chains in each. Even in their 2001 study, Dietmann and Holm were comparing a version of the Dali Domain Dictionary containing 35,492 domains to an older version of SCOP containing only 26,219. The number of chains and domains classified in the versions of the three systems considered has increased dramatically (Table 1). It seems worthwhile, therefore, to see whether the classification systems can maintain agreement through such growth. The second reason for carrying out this study is the introduction of the Dali Domain Dictionary, referred to simply as Dali in this work. This allows explicit comparison of fold-level classifications of domains in all three classification systems. Previous work could only look at the fold classifications of SCOP and CATH explicitly. We have observed that the numerical domain identifiers given by SCOP, CATH, and Dali do not necessarily correspond to the same residue windows. Thus, unlike previous investigators, we have followed the lead of Dietmann and Holm in identifying which domains correspond among the classification systems.
| Results |
|---|
Fold pair comparisons
The level of agreement between two or more classification systems can be quantified by comparing the fold definitions of pairs of domains. Two domains having the same fold identifier in a given fold classification constitute a fold pair. The pairwise comparisons of the fold definitions in the three classification systems gave 5,328,268 pairs in SCOP, 12,174,184 pairs in CATH, and 7,325,286 in the Dali Domain Dictionary. The much larger number of pairs found in CATH is surprising in that CATH defines 1453 fold types compared with 783 and 1088 for SCOP and Dali, respectively. If similar distributions of domains into folds are assumed for the three classification systems, we would expect CATH, with many more fold types, to have fewer pairs. As it turns out, however, CATH populates its most common folds (Rossmann and immunoglobulin-like) more heavily than SCOP or Dali. Because the number of pairs within a fold increases as the square of its population, these heavily populated folds lead to a much larger number of pairs (Fig. 1). If both domains of a pair are also defined in another fold classification system, we have domain agreement. If both are also defined as having the same fold in that system, we have fold agreement. The fractions of pairs having domain agreement or fold agreement between the classification systems are given in Figures 2 and 3.
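The quadratic growth of pair counts can be illustrated with a minimal sketch. The function name `count_fold_pairs` and the toy domain and fold labels are hypothetical and are not taken from SCOP, CATH, or Dali; the sketch only assumes that a fold with n members contributes n(n - 1)/2 pairs.

```python
from collections import Counter

def count_fold_pairs(fold_of_domain):
    """Count domain pairs that share a fold: sum of n*(n-1)/2 over folds."""
    sizes = Counter(fold_of_domain.values())
    return sum(n * (n - 1) // 2 for n in sizes.values())

# Hypothetical toy data: ten domains and five folds in each case.
# Evenly populated: five folds of two domains each.
even = {f"d{i}": f"fold{i % 5}" for i in range(10)}
# Skewed: one fold of six domains plus four singleton folds.
skewed = {f"d{i}": ("big" if i < 6 else f"solo{i}") for i in range(10)}

print(count_fold_pairs(even))    # 5 folds x 1 pair each
print(count_fold_pairs(skewed))  # one fold of 6 gives 15 pairs
```

With the same number of domains and folds, the skewed distribution yields three times as many pairs, which is the effect attributed above to the heavily populated CATH folds.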
To determine how the agreement between these classification systems has changed as the PDB has grown, we carried out our analysis on the subset of structures classified in earlier SCOP releases (Fig. 3). The number of PDB files considered increased from 4391 in SCOP release 1.35 to 11,302 in SCOP release 1.57. Agreement at the level of domain and fold has remained remarkably constant over this period, although fold agreement does appear to be declining slightly faster than domain agreement.
Consensus folds
A consensus between SCOP, CATH, and Dali was determined on the basis of the fold classifications of a nonredundant subset of domains with the same classification by at least two of the three classification systems. These classifications formed a fold identifier. Domains were clustered on the basis of their fold identifiers. Fold identifier clusters that were identical in at least two classifications were combined into metafolds. The metafolds were ranked according to their population. This process is described in more detail in Materials and Methods.
The nonredundant set of 5720 domains clustered into 1130 metafolds. About half of these nonredundant domains were described by one of the top 30 folds. Over 1000 of the domains fall into the top two folds, corresponding to immunoglobulin-like folds and Rossmann folds. The top 30 metafolds, their populations, and the SCOP, CATH, and Dali codes of their members are given in Table 2 (all 1130 metafolds are posted on our Web site, www.dynameomics.org). Additionally, representative structures are given for these 30 metafolds in Figure 4 and Table 3. Two-dimensional topology diagrams (Westhead et al. 1999) of these representatives are also provided for easy comparison between folds (Fig. 4). Many well-characterized folding systems, including protein G, the SH3 domain, CspA, S6, and CheY, fall into these top 30, but others, such as barnase, chymotrypsin inhibitor 2, and protein A, have low-population folds.
| Discussion |
|---|
The differences in methods used by these classifications largely reflect differences in their initial purpose. The Dali Domain Dictionary was created in order to explicitly map a high-dimensional protein fold space with an eye toward determining evolutionary relationships between proteins and their functions. CATH is directed more toward defining and classifying the geometric similarities between proteins in order to support and guide structure determination studies and structural genomics. SCOP was originally intended to illuminate evolutionary relationships between proteins, but its curators recognize its utility in more general geometric classification.
We find that the agreement between SCOP and Dali is remarkably good, although there are considerable differences between the two, especially in the process of defining domains. Dali's domain definition algorithm breaks up domains by compactness. This method appears to break proteins into smaller domains than human inspection does. As the human mind classifies folds on a variety of criteria and heuristics beyond the contact matrices used in Dali, the level of fold agreement is quite impressive. The similarity score cutoff that defines a fold in Dali was chosen empirically, which allows some of the human mind's heuristics to come into the classifications. Additionally, Dali's comparison of secondary structure orientations probably mimics much of what the human mind uses in comparing structures.
There is much more agreement between the domain definitions of SCOP and CATH than between Dali and either SCOP or CATH. CATH's consensus method of defining domains requires human intervention in >50% of domain assignments. The similarities between the resulting domain definitions and the manually curated definitions of SCOP are probably due to this intervention.
The agreement between CATH's fold definitions and those of either of the other classification systems is not as good. The principal reason for this was termed the "fold overlap" problem by Hadley and Jones (1999). This problem arises when a fold defined in one system encompasses many folds defined in another. Specifically, the Rossmann and immunoglobulin-like folds in CATH are broken into many smaller folds in SCOP and Dali. The fold pairs defined in these smaller folds are also present in the Rossmann and Ig-like folds in CATH, so CATH's agreement with SCOP or Dali fold pairs appears quite good. Many of the pairs defined within CATH's Rossmann and Ig-like folds, however, are not found in SCOP or Dali, leading to a much lower level of agreement than is suggested by looking at pairs defined by SCOP or Dali. It is apparent that a much broader range of structures is consistent with the templates for these highly populated folds in CATH than the range deemed similar by inspection or direct comparison of structures. At the other end of the spectrum, there are many more folds in CATH than in either SCOP or Dali. Most of these cases are single proteins that match no template, suggesting that there are some folds that are not adequately represented by the templates.
Our simple majority-rules approach allowed a compromise to be reached between restrictive and broad fold definitions. Topology definitions that are overly broad in one classification system are broken into multiple, finer classifications, whereas relatively narrow topology definitions are combined to form broader classifications. Thus, for instance, the Rossmann fold of CATH is broken into many smaller metafolds, 3 of which appear in the 30 most populated folds, whereas 4 Dali folds and 13 SCOP folds fall into the most populated Rossmann metafold. In most cases, however, this process merely filters out borderline cases in which a fold definition is supported by only one classification system.
The differences in domain definition methods may cause our methods to overemphasize fold types populated by smaller, single-domain proteins. Nonetheless, our final list is quite reflective of the top folds in the individual classification systems and covers a wide variety of chain lengths and topologies. The relationships between proteins within metafolds are recognized by multiple comparison methods and are, as such, relatively independent of the comparison method. Additionally, the grouping of multiple folds into larger metafolds may be of use to researchers looking for a broader template in fold-recognition studies or more distant evolutionary relationships between proteins. We are now using this list to characterize the native dynamic behavior and folding/unfolding of representatives of all the major fold types in the PDB through molecular dynamics simulation, an effort we are calling dynameomics. A full list of metafolds is available on the Web at http://www.dynameomics.org.
| Materials and methods |
|---|
Pairwise comparisons
Agreement between the three classification systems on specific fold assignments can be measured by comparing the fold assignments of all domains within the classification system in a pairwise fashion. Any two domains that share the same fold classification are considered a pair. Once domains have been matched between the classification systems, this can be extended in order to determine how many of the pairs defined in one classification system are defined in another. Pairwise comparisons were performed at three consensus levels. First, they were performed at the level of the individual domain lists acquired directly from SCOP, CATH, and Dali, then at the level of the domain matches with one other classification system, and finally at the level of domains agreed upon by all three classification systems. Pairs were determined in a list containing only PDB identifiers common to all three classification systems. The time dependence of the pair agreements was determined by using lists containing only PDB identifiers from early SCOP releases.
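The pairwise measure described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used in this work: the function names `fold_pairs` and `agreement` are hypothetical, the domain and fold labels are toy values, and domains are assumed to have been matched already so that the same key refers to the same residue window in both systems.

```python
from collections import defaultdict
from itertools import combinations

def fold_pairs(assignments):
    """Return the set of unordered domain pairs that share a fold identifier."""
    members = defaultdict(list)
    for domain, fold in assignments.items():
        members[fold].append(domain)
    pairs = set()
    for domains in members.values():
        pairs.update(frozenset(p) for p in combinations(sorted(domains), 2))
    return pairs

def agreement(reference, other):
    """Fractions of reference pairs with domain agreement and fold agreement."""
    ref_pairs = fold_pairs(reference)
    if not ref_pairs:
        return 0.0, 0.0
    # Domain agreement: both members of the pair are defined in the other system.
    both_defined = [p for p in ref_pairs if all(d in other for d in p)]
    # Fold agreement: both are defined and assigned the same fold there.
    same_fold = [p for p in both_defined if len({other[d] for d in p}) == 1]
    return len(both_defined) / len(ref_pairs), len(same_fold) / len(ref_pairs)

# Hypothetical matched domain lists for two systems.
system_a = {"a": "X", "b": "X", "c": "X", "e": "X", "d": "Y"}
system_b = {"a": "1", "b": "1", "c": "2"}

domain_frac, fold_frac = agreement(system_a, system_b)
print(domain_frac, fold_frac)
```

In this toy case, six pairs are defined in the reference system, three of them involve only domains that the other system also defines, and one of those three shares a fold in the other system. The measure is not symmetric, so each system would be taken in turn as the reference.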
Nonredundancy
Redundant domains were considered to be those having >95% sequence identity to a previously counted domain. This was achieved by comparing the near-identical classification level in CATH and by comparing with the 95% sequence identity list of SCOP domains from the ASTRAL project (Brenner et al. 2000; Chandonia et al. 2002).
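A redundancy filter of this kind can be sketched as a single pass that keeps the first domain seen from each sequence-identity cluster. The sketch assumes that cluster labels at the 95% identity level are already available (for example, from precomputed lists); the function name `nonredundant`, the mapping `cluster_of`, and the labels are hypothetical.

```python
def nonredundant(domains, cluster_of):
    """Keep the first domain encountered from each 95%-identity cluster.

    domains: iterable of domain identifiers in counting order.
    cluster_of: dict mapping a domain to its precomputed cluster label;
                domains missing from the dict are treated as unique.
    """
    seen = set()
    kept = []
    for domain in domains:
        # Tag the key so a bare domain id can never collide with a cluster label.
        key = ("cluster", cluster_of[domain]) if domain in cluster_of else ("solo", domain)
        if key not in seen:
            seen.add(key)
            kept.append(domain)
    return kept

# Hypothetical example: d2 is >95% identical to the previously counted d1.
clusters = {"d1": "c1", "d2": "c1", "d3": "c2"}
print(nonredundant(["d1", "d2", "d3", "d4"], clusters))
```

Because the first member of each cluster is kept, the result depends on the order in which domains are counted.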
Clustering
Clustering of the nonredundant domain set was accomplished by defining a metafold for each domain record. This metafold consists of the fold or topology level classification of each domain in each of the two or three databases in which it is classified. A cluster consists of all members of the nonredundant domain set having the same metafold. Clusters of domains defined as being the same fold by two of three classification systems were combined. Clusters were ranked according to their population. Figure 6 shows how the metafolds created for the example domains in Figure 5 contribute to a given fold type.
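The clustering step can be sketched as follows. This is one possible reading of the procedure, not the code used in this work: each domain carries a three-part identifier (SCOP fold, CATH topology, Dali fold, with `None` where a system does not classify it), domains with identical identifiers form a cluster, and clusters whose identifiers agree in at least two systems are merged. Merging is done transitively with a union-find structure, which is an assumption. The function name `build_metafolds` and all codes in the example are illustrative placeholders.

```python
from collections import defaultdict
from itertools import combinations

def build_metafolds(identifiers):
    """Group domains by identifier, merge clusters that agree in >= 2 systems,
    and return the merged clusters ranked by population (largest first)."""
    clusters = defaultdict(list)
    for domain, ident in identifiers.items():
        clusters[ident].append(domain)

    # Union-find over cluster identifiers.
    parent = {ident: ident for ident in clusters}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(clusters, 2):
        shared = sum(1 for x, y in zip(a, b) if x is not None and x == y)
        if shared >= 2:
            parent[find(a)] = find(b)

    merged = defaultdict(list)
    for ident, members in clusters.items():
        merged[find(ident)].extend(members)
    return sorted(merged.values(), key=len, reverse=True)

# Placeholder identifiers: (SCOP fold, CATH topology, Dali fold).
ids = {
    "d1": ("c.2", "3.40.50", "D7"),
    "d2": ("c.2", "3.40.50", "D9"),
    "d3": ("c.3", "3.40.50", "D9"),
    "d4": ("b.1", "2.60.40", "D1"),
    "d5": ("b.1", "2.60.40", None),
    "d6": ("a.4", "1.10.10", "D3"),
}
for rank, members in enumerate(build_metafolds(ids), start=1):
    print(rank, sorted(members))
```

In the example, d1 and d3 agree in only one system but end up in the same metafold because each agrees with d2 in two systems. Whether such chained merges are desirable is a design choice; a stricter variant would require every pair of members to agree in two systems.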
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Berman, H.M., Bhat, T.N., Bourne, P.E., Feng, Z., Gilliland, G., Weissig, H., and Westbrook, J. 2000b. The Protein Data Bank and the challenge of structural genomics. Nat. Struct. Biol. 7: 957–959.
Brenner, S.E., Koehl, P., and Levitt, M. 2000. The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Res. 28: 254–256.
Chandonia, J.M., Walker, N.S., Lo Conte, L., Koehl, P., Levitt, M., and Brenner, S.E. 2002. ASTRAL compendium enhancements. Nucleic Acids Res. 30: 260–263.
Dengler, U., Siddiqui, A.S., and Barton, G.J. 2001. Protein structural domains: Analysis of the 3Dee domains database. Proteins 42: 332–344.
Dietmann, S. and Holm, L. 2001. Identification of homology in protein structure classification. Nat. Struct. Biol. 8: 953–957.
Dietmann, S., Park, J., Notredame, C., Heger, A., Lappe, M., and Holm, L. 2001. A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res. 29: 55–57.
Eidhammer, I., Jonassen, I., and Taylor, W.R. 2000. Structure comparison and structure patterns. J. Comp. Biol. 7: 685–716.
Ferrin, T.E., Huang, C.C., Jarvis, L.E., and Langridge, R. 1988. The MIDAS display system. J. Mol. Graphics 6: 13–27.
Gerstein, M. and Levitt, M. 1998. Comprehensive assessment of automatic structural alignment against a manual standard, the SCOP classification of proteins. Protein Sci. 7: 445–456.
Getz, G., Vendruscolo, M., Sachs, D., and Domany, E. 2002. Automated assignment of SCOP and CATH protein structure classifications from FSSP scores. Proteins 46: 405–415.
Hadley, C. and Jones, D.T. 1999. A systematic comparison of protein structure classifications: SCOP, CATH, and FSSP. Structure 7: 1099–1112.
Holm, L. and Sander, C. 1994. Parser for protein folding units. Proteins 19: 256–268.
Holm, L. and Sander, C. 1998. Dictionary of recurrent domains in protein structures. Proteins 33: 88–96.
Jones, S., Stewart, M., Michie, A., Swindells, M.B., Orengo, C., and Thornton, J.M. 1998. Domain assignment for protein structures using a consensus approach: Characterization and analysis. Protein Sci. 7: 233–242.
May, A.C.W. 1999. Toward more meaningful hierarchical classification of protein three-dimensional structures. Proteins 37: 20–29.
McGuffin, L.J., Bryson, K., and Jones, D.T. 2001. What are the baselines of protein fold recognition? Bioinformatics 17: 63–72.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.
Orengo, C.A. 1999. CORA: Topological fingerprints for protein structural families. Protein Sci. 8: 699–715.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATH: A hierarchic classification of protein domain structures. Structure 5: 1093–1108.
Sauder, J.M., Arthur, J.W., and Dunbrack, R.L. 2000. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins 40: 6–22.
Shindyalov, I.N. and Bourne, P.E. 2000. An alternative view of protein fold space. Proteins 38: 247–260.
Taylor, W.R. and Orengo, C.A. 1989. Protein structure alignment. J. Mol. Biol. 208: 1–22.
Westhead, D.R., Slidel, T.W.F., Flores, T.P.J., and Thornton, J.M. 1999. Protein structural topology: Automated analysis and diagrammatic representation. Protein Sci. 8: 897–904.