Protein Science
Protein Science (2003), 12:2150-2160.
Copyright © 2003 The Protein Society

A consensus view of fold space: Combining SCOP, CATH, and the Dali Domain Dictionary

Ryan Day1, David A.C. Beck1, Roger S. Armen1 and Valerie Daggett1,2

1 Biomolecular Structure and Design Program and
2 Department of Medicinal Chemistry, University of Washington, Seattle, Washington 98195, USA

Reprint requests to: Valerie Daggett, Biomolecular Structure and Design Program, Department of Medicinal Chemistry, University of Washington, Box 357610, Seattle, WA 98195, USA; e-mail: daggett@u.washington.edu; fax: (206) 685-3252.

(RECEIVED February 17, 2003; FINAL REVISION July 11, 2003; ACCEPTED July 14, 2003)

Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0306803.


    Abstract
We have determined consensus protein-fold classifications on the basis of three classification methods, SCOP, CATH, and Dali. These classifications make use of different methods of defining and categorizing protein folds that lead to different views of protein-fold space. Pairwise comparisons of domains on the basis of their fold classifications show that much of the disagreement between the classification systems is due to differing domain definitions rather than assigning the same domain to different folds. However, there are significant differences in the fold assignments between the three systems. These remaining differences can be explained primarily in terms of the breadth of the fold classifications. Many structures may be defined as having one fold in one system, whereas far fewer are defined as having the analogous fold in another system. By comparing these folds for a nonredundant set of proteins, the consensus method breaks up broad fold classifications and combines restrictive fold classifications into metafolds, creating, in effect, an averaged view of fold space. This averaged view requires that the structural similarities between proteins having the same metafold be recognized by multiple classification systems. Thus, the consensus map is useful for researchers looking for fold similarities that are relatively independent of the method used to compare proteins. The 30 most populated metafolds, representing the folds of about half of a nonredundant subset of the PDB, are presented here. The full list of metafolds is presented on the Web.

Keywords: CATH; Dali; SCOP; fold classification


    Introduction
The number of known protein structures deposited in the Protein Data Bank (PDB; Berman et al. 2000a) has grown exponentially over the last 30 years. This trend can be expected to continue as structural genomics projects gain momentum and techniques allowing higher throughput structure determination are developed (Berman et al. 2000b). There are currently some 17,000 protein structures comprising on the order of 35,000 domains in the PDB. It was recognized in the 1970s that the number of topologies that these structures assume is, however, relatively limited. Additionally, it is widely recognized that much, but not all, of the folding process is encoded by these topologies. By understanding the details of the folding processes of a relatively small but representative number of topologies, we can gain a broader view of how proteins fold. Here, we describe our efforts to compare three major databases of fold classifications in order to gain a consensus view of protein fold space. Additionally, we critically compare the level of agreement between these three classification systems.

The problem of classifying folds can be considered in two steps. In the first step, a protein chain or complex must be broken into its constituent domains, and in the second step, these domains are assigned a fold. The methods for classifying folds range from a variety of fully automated procedures (for review, see Eidhammer et al. 2000) to a largely manual method on the basis of expert knowledge. Here, we consider three popular systems, all of which have used various methods of comparing folds in constructing hierarchies of protein similarity.

Structural Classification of Proteins, or SCOP, was among the earliest efforts to classify protein structures into folds. Protein domains with no obvious sequence homology to other domains are defined and classified manually (Murzin et al. 1995). In many ways, this database has been considered the standard for protein structure classification. The human mind is quite adept at recognizing slightly dissimilar objects as belonging to a common architecture. Thus, simplified representations of protein structures can be grouped by expert inspection.

Class, Architecture, Topology, Homologous superfamily (CATH) makes use of a combination of manual and automated procedures in defining and classifying protein domains (Orengo et al. 1997). CATH relies on the consensus of three automated classification methods to break protein chains into domains (Jones et al. 1998). This approach is effective in defining the domains of ~53% of the chains. The domains of chains for which consensus is not reached are defined manually. Homology is defined largely in terms of sequence, but distant sequence matches with high structural similarity may be defined as being homologous. Domains lacking homology to already defined folds are given a topology-level classification on the basis of the CORA (Orengo 1999) or SSAP (Taylor and Orengo 1989) algorithm.

The Dali Domain Dictionary is the product of a fully automated method of defining and classifying domains (Dietmann and Holm 2001). Domains are defined by a version of the PUU algorithm (Holm and Sander 1994) that has been updated to consider the recurrence of putative domains (Holm and Sander 1998). Structural similarities between domains are determined using Dali, which first uses a fast algorithm that represents secondary structural elements as vectors, and then a slower algorithm that compares residue centers to fine tune matches. Dali scores have traditionally been presented only in the FSSP database as pairwise comparisons between structures that have not been divided into domains. In the Dali Domain Dictionary (Dietmann et al. 2001), the Dali scores have been used to create a hierarchical structure, allowing for direct comparison with SCOP and CATH.

Although several studies have compared various fold classification systems for a variety of purposes (Gerstein and Levitt 1998, Sauder et al. 2000, Shindyalov and Bourne 2000, Dengler et al. 2001, Getz et al. 2002), Hadley and Jones’ work remains the only explicit comparison of the fold classifications produced by SCOP, CATH, and FSSP (Hadley and Jones 1999). This study found that the classification systems tend to agree on their classifications for chains, although many chains classified in one system were not present in others. More recent work by Dietmann and Holm explicitly compared the Dali Domain Definitions (version 3.1b) to SCOP (version 1.53) to validate their map of fold space (Dietmann and Holm 2001). They too found good agreement between the classification systems. Our motivations for carrying out another comparison of these classification systems are as follows. Firstly, the number of structures deposited in the PDB has grown and continues to grow exponentially. In their 1999 study, Hadley and Jones considered fewer than 10,000 chains from each classification system; there are now over 20,000 chains in each. Even in their 2001 study, Dietmann and Holm were comparing a version of the Dali Domain Dictionary containing 35,492 domains to an older version of SCOP containing only 26,219. The number of chains and domains classified in the versions of the three systems considered has increased dramatically (Table 1). It seems worthwhile, therefore, to see whether the classification systems can maintain agreement through such growth. The second reason for carrying out this study is the introduction of the Dali Domain Dictionary, referred to simply as Dali in this work. This allows explicit comparison of fold level classifications of domains in all three classification systems. Previous work could only look at the fold classifications of SCOP and CATH explicitly. We have observed that the numerical domain identifiers given by SCOP, CATH, and Dali do not necessarily correspond to the same residue windows. Thus, unlike previous investigators, we have followed the lead of Dietmann and Holm in identifying which domains correspond among the classification systems.


Table 1. General statistics of the number of protein chains and domains as defined by SCOP, CATH, and Dali
 
The principal aim of this work is to create a map showing how specific fold classifications in SCOP, CATH, and Dali correspond to one another. As McGuffin et al. (2001) point out, those wishing to make use of fold classifications will probably be best served by focusing on the consensus of multiple classification methods. The definition of a fold is necessarily dependent on the method used to compare proteins. The different comparison algorithms described above give different views of fold space. Whereas the three views of fold space used in this study are the most widely used, no single method can be considered to be definitive. The consensus view described here finds fold relationships that are defined by multiple comparison methods. These fold relationships can be considered to be relatively independent of the method used to compare proteins.


    Results
Domain matching
SCOP, CATH, and Dali break chains into domains differently, with Dali tending to break chains into more domains than SCOP or CATH (Table 1). SCOP and CATH had the largest overlap for domain definitions. A total of 24,764 domains were classified as spanning the same residues in SCOP and CATH. Both CATH and SCOP overlapped to about the same degree with the Dali database. A total of 19,333 of the domains defined in SCOP and 19,354 CATH domains were defined in Dali. Combining these definitions led to a total of 31,141 domains defined in at least two classification systems, with 16,345 of these defined in all three. Only 5720 of the original 31,141 domains remained when those with 95% sequence identity were removed. SCOP, CATH, and Dali initially define 35,759 domains, 36,480 domains, and 35,492 domains, respectively.

Fold pair comparisons
The level of agreement between two or more classification systems can be quantified by comparing the fold definitions of pairs of domains. Two domains having the same fold identifier in a given fold classification constitute a fold pair. The pairwise comparisons of the fold definitions in the three classification systems gave 5,328,268 pairs in SCOP, 12,174,184 pairs in CATH, and 7,325,286 in the Dali Domain Dictionary. The much larger number of pairs found in CATH is surprising in that CATH defines 1453 fold types compared with 783 and 1088 for SCOP and Dali, respectively. If similar distributions of domains into folds are assumed for the three classification systems, we would expect CATH, with many more fold types, to have fewer pairs. As it turns out, however, CATH populates its most common folds (Rossmann and immunoglobulin-like) more than SCOP or Dali. The number of pairs increases as the square of the population, leading to a much larger number of pairs (Fig. 1). If both domains of a fold pair are defined in another fold classification system, we have domain agreement. If both are also defined as having the same fold in that system, we have fold agreement. The fractions of pairs having domain agreement or fold agreement between the classification systems are given in Figures 2 and 3.
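
The quadratic growth of pair counts with fold population can be illustrated with a minimal sketch. The function name and the toy fold labels below are hypothetical and are not taken from SCOP, CATH, or Dali; the sketch only shows why a system with more fold types can still yield more fold pairs when a few of its folds are heavily populated.

```python
from collections import Counter

def count_fold_pairs(fold_by_domain):
    """Count fold pairs: unordered pairs of domains sharing a fold identifier.

    A fold with n member domains contributes (n^2 - n) / 2 pairs.
    """
    populations = Counter(fold_by_domain.values())
    return sum(n * (n - 1) // 2 for n in populations.values())

# Hypothetical toy data: twelve domains classified two different ways.
# "skewed" has six fold types, one of them heavily populated.
skewed = {f"d{i}": "big" for i in range(7)}
skewed.update({f"d{i}": f"single{i}" for i in range(7, 12)})
# "even" has only four fold types, each with three domains.
even = {f"d{i}": f"fold{i % 4}" for i in range(12)}

print(len(set(skewed.values())), count_fold_pairs(skewed))  # 6 fold types, 21 pairs
print(len(set(even.values())), count_fold_pairs(even))      # 4 fold types, 12 pairs
```

In this toy case the skewed classification has more fold types and still produces more pairs, which mirrors the behavior described for CATH above.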



Figure 1. Distribution of domains into the most populated fold types in each classification system. (A) Number of domains in the 100 most populated folds. (B) Number of domains in the 25 most populated folds. (C) Number of pairs in the 100 most populated folds. (D) Number of pairs in the 25 most populated folds. The number of pairs goes as (n² − n)/2, in which n is the number of domains. The most populated folds in CATH are significantly more populated than the most populated folds in SCOP or Dali.

 


Figure 2. Agreement among the most recent versions of SCOP, CATH, and Dali expressed as a percentage of the total number of fold pairs in each system. The top charts represent the fraction of fold pairs in which both domains defining the pair are defined, and the bottom charts represent the fraction of fold pairs that are defined. (A) Fraction of SCOP fold pairs (5,328,268 total) also defined in CATH, Dali, or both. (B) Fraction of CATH fold pairs (12,174,184 total) also defined in SCOP, Dali, or both. (C) Fraction of Dali fold pairs (7,325,286 total) also defined in SCOP, CATH, or both. The number of fold pairs defined in all three systems is the same in all three bottom charts. Whereas the number of domains defined in all three systems would be the same in all three charts, here we are looking at the number of fold pairs in which both domains are defined, which can vary with different fold definitions. Thus, the ‘both’ categories in the top charts do not have the same number of pairs in all three charts.

 


Figure 3. Trends in agreement between the classification systems over time. Agreement at the levels of domain definition (open symbols) and fold assignment (closed symbols) is considered. (A) SCOP fold pairs defined in CATH (♦) or Dali (•). (B) CATH fold pairs defined in SCOP (♦) or Dali (•). (C) Dali fold pairs defined in SCOP (♦) or CATH (•). The gap between the lines is of considerable interest, as it represents pairs for which the domains are defined the same in the two systems but the fold is defined differently.

 
To get a true picture of the level of agreement between different methods of classifying folds, it is necessary to look at the difference between these levels of agreement. For instance, we see that only 47% of Dali’s fold pairs are also paired in SCOP, but we also see that only 63% of Dali’s fold pairs have both domains defined in SCOP (Fig. 2C, bottom and top, respectively). Thus, in the case in which Dali defines domains the same as SCOP, there is a 75% chance that the fold pair will be recognized. In this sense, there is a remarkable agreement between the fold classification methods of SCOP and Dali. The number of SCOP and Dali fold pairs found in CATH is quite large, suggesting good agreement. However, this is a bit misleading. CATH classifies many domains in a few broad fold definitions, leading to a much higher number of pairs than either SCOP or Dali. The fraction of CATH pairs defined in SCOP and Dali, thus, is a much more telling indicator of the agreement between CATH and the other two classification systems. This agreement is quite low even when differences in domain definitions are considered.
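
The 75% figure quoted above follows from dividing the fold agreement fraction by the domain agreement fraction. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the inputs are the rounded percentages quoted in the text rather than the underlying pair counts.

```python
def fold_agreement_given_domain_agreement(fold_pair_fraction, domain_pair_fraction):
    """Fraction of pairs with matching domain definitions that also share a fold.

    fold_pair_fraction:   fraction of fold pairs also paired in the other system
    domain_pair_fraction: fraction of fold pairs with both domains defined there
    """
    if domain_pair_fraction <= 0:
        raise ValueError("domain_pair_fraction must be positive")
    return fold_pair_fraction / domain_pair_fraction

# Dali fold pairs compared against SCOP: 47% fold agreement, 63% domain agreement.
rate = fold_agreement_given_domain_agreement(0.47, 0.63)
print(f"{rate:.0%}")  # 75%
```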

To determine how the agreement between these classification systems has changed as the PDB has grown, we carried out our analysis on the subset of structures classified in earlier SCOP releases (Fig. 3). The number of PDB files considered increased from 4391 in SCOP release 1.35 to 11,302 in SCOP release 1.57. Agreement at the level of domain and fold has remained remarkably constant over this period, although fold agreement does appear to be declining slightly faster than domain agreement.

Consensus folds
A consensus between SCOP, CATH, and Dali was determined on the basis of the fold classifications of a nonredundant subset of domains with the same classification by at least two of the three classification systems. These classifications formed a fold identifier. Domains were clustered on the basis of their fold identifiers. Fold identifier clusters that were identical in at least two classifications were combined into metafolds. The metafolds were ranked according to their population. This process is described in more detail in Materials and Methods.

The nonredundant set of 5720 domains clustered into 1130 metafolds. About half of these nonredundant domains were described by one of the top 30 folds. Over 1000 of the domains fall into the top two folds, corresponding to immunoglobulin-like folds and Rossmann folds. The top 30 metafolds, their populations, and the SCOP, CATH, and Dali codes of their members are given in Table 2 (all 1130 metafolds are posted on our Web site, www.dynameomics.org). Additionally, representative structures are given for these 30 metafolds in Figure 4 and Table 3. Two-dimensional topology diagrams (Westhead et al. 1999) of these representatives are also provided for easy comparison between folds (Fig. 4). Many well-characterized folding systems, including protein G, the SH3 domain, CspA, S6, and CheY, fall into these top 30, but others, such as barnase, chymotrypsin inhibitor 2, and protein A, have low population folds.


Table 2. SCOP, CATH, and Dali codes associated with the 30 most populated metafolds
 



Figure 4. Structures of the 30 representative domains. The metafold is given, followed by the PDB identifier and common name of the representative protein. Coloring goes from red (amino terminus) to blue (carboxyl terminus). (A) Structures in ribbon format created with Midas (Ferrin et al. 1988). (B) Two-dimensional topology diagrams created with TOPS (Westhead et al. 1999). Triangles indicate β-strands and circles indicate helices, with smaller symbols representing smaller β-strands (six residues or less) and smaller helices (five residues or less). The approximate direction of the corresponding secondary structure element is either up (out of the page) or down (into the page). These directions are indicated in the diagram by the way that connecting lines are drawn to the symbols; connections drawn to the edge of the symbol connect to its base, whereas those drawn to the center connect to the top.

 

Table 3. Protein representatives from the most populated metafolds
 
Two processes are seen at work in the clustering process. The first is a breakdown of folds that are highly populated in one classification system if the other two classification systems define them less broadly. The second is a bootstrap process by which many similar folds from one classification system are combined when their domains are defined by a smaller number of folds in the other classification systems. The Rossmann fold provides the most dramatic example of these processes. CATH defines some 3500 domains as being Rossmann folds (3.40.50). By adding in information from SCOP and Dali, the Rossmann folds are divided into 3 fold types in the 30 most populated folds. By not requiring strict consensus, however, this breakdown also involves bootstrapping many similar SCOP and Dali folds into one fold type.


    Discussion
The fold classification systems considered here are typically thought of biologically as illuminating the evolutionary relationships among proteins. However, we are more interested in understanding the relationships and similarities between protein topologies, and specifically, topologies that are commonly used in nature. When considered in this sense, there is no outside standard against which the classification systems may be judged, although they may be tested for internal consistency (May 1999). Experimentalists and theoreticians seeking to make use of fold classifications in protein structure prediction or other applications have traditionally chosen one or another classification system. Whereas the ability of the human eye to accurately discern similarities between varied objects has given the SCOP database a certain standing, computational methods have the advantage of being inherently self-consistent and reproducible. The size of the problem, classifying >35,000 domains into almost 1000 folds, with both numbers expected to grow quickly, suggests that automated solutions are necessary. Similarities between protein folds can be defined in many ways, however, and automated methods of classifying folds will give different solutions depending on the method used. Additionally, automated methods define continuous fold spaces that must be discretized in order to define individual folds. The two automated systems considered here use slightly different methods in comparing protein structures. Dali compares structures directly, first by defining orientations of secondary structure relative to one another, and then by directly comparing residue positions. The similarity scores generated by these comparisons are used to build up a map of fold space in an all-versus-all fashion. Proteins that cluster tightly in this fold space are considered to have the same fold. CATH starts with a direct comparison of protein structures using contact maps. Highly similar proteins are then used in creating fold templates, which are essentially reduced representations of the contact matrix. Proteins are then compared with these templates in order to classify them into folds.

The differences in methods used by these classifications largely reflect differences in their initial purpose. The Dali Domain Dictionary was created in order to explicitly map a high-dimensional protein fold space with an eye toward determining evolutionary relationships between proteins and their functions. CATH is directed more toward defining and classifying the geometric similarities between proteins in order to support and guide structure determination studies and structural genomics. SCOP was originally intended to illuminate evolutionary relationships between proteins, but its curators recognize its utility in more general geometric classification.

We find that the agreement between SCOP and Dali is remarkably good, although there are considerable differences between the two, especially in the process of defining domains. Dali’s domain definition algorithm breaks up domains by compactness. This method appears to break proteins into smaller domains than human inspection. As the human mind classifies folds on a variety of criteria and heuristics beyond the contact matrices used in Dali, the level of fold agreement is quite impressive. The similarity score cutoff that defines a fold in Dali was chosen empirically, which allows some of the human mind’s heuristics to come into the classifications. Additionally, Dali’s comparison of secondary structure orientations probably mimics much of what the human mind uses in comparing structures.

There is much more agreement between the domain definitions of SCOP and CATH than between Dali and either SCOP or CATH. CATH’s consensus method of defining domains requires human intervention in >50% of domain assignments. The similarities between the resulting domain definitions and the manually curated definitions of SCOP are probably due to this intervention.

The agreement between CATH’s fold definitions and either of the other classification systems is not as good. The principal reason for this was termed the fold overlap problem by Hadley and Jones (1999). This problem arises when a fold defined in one system encompasses many folds defined in another. Specifically, the Rossmann and immunoglobulin-like folds in CATH are broken into many smaller folds in SCOP and Dali. The fold pairs defined in these smaller folds are also present in the Rossmann and IG-like folds in CATH, so CATH’s agreement with SCOP or Dali fold pairs appears quite good. Many of the pairs defined within CATH’s Rossmann and IG-like folds, however, are not found in SCOP or Dali, leading to a much lower level of agreement than is suggested by looking at pairs defined by SCOP or Dali. It is apparent that a much broader range of structures is consistent with the templates for these highly populated folds in CATH than the range deemed similar by inspection or direct comparison of structures. At the other end of the spectrum, there are many more folds in CATH than in either SCOP or Dali. Most of these cases are single proteins that match no template, suggesting that there are some folds that are not adequately represented by the templates.

Our simple majority-rules approach allowed a compromise to be reached between restrictive and broad fold definitions. Topology definitions that are overly broad in one classification system are broken into multiple, finer classifications, whereas relatively narrow topology definitions are combined to form broader classifications. Thus, for instance, the Rossmann fold of CATH is broken into many smaller metafolds, 3 of which appear in the 30 most populated folds, whereas 4 Dali folds and 13 SCOP folds fall into the most populated Rossmann metafold. In most cases, however, this process merely filters out borderline cases that are supported by the fold definitions of only one classification system.

The differences in domain definition methods may cause our methods to overemphasize fold types populated by smaller, single-domain proteins. Nonetheless, our final list is quite reflective of the top folds in the individual classification systems and covers a wide variety of chain lengths and topologies. The relationships between proteins within metafolds are recognized by multiple comparison methods and are, as such, relatively independent of the comparison method. Additionally, the grouping of multiple folds into larger metafolds may be of use to researchers looking for a broader template in fold-recognition studies or more distant evolutionary relationships between proteins. We are now using this list to characterize the native dynamic behavior and folding/unfolding of representatives of all the major fold types in the PDB through molecular dynamics simulation, an effort we are calling dynameomics. A full list of metafolds is available on the Web at http://www.dynameomics.org.


    Materials and methods
The three classification systems (SCOP, CATH, and Dali) use different methods of breaking protein chains into their constituent domains and classifying these domains into folds. We have subjected protein domains to a democratic process in which at least two of three classification systems must agree in order to classify the protein fold. This method involves first comparing the classification systems in a pairwise fashion, keeping those domains that were defined as spanning essentially the same residues in both classification systems. The pairwise lists were then combined into one large list, with domains in which all three classification systems agree being reduced to one record. Each fold identifier was assigned a population on the basis of the number of nonredundant domains having that fold. As a final step in our democratic process, fold identifiers that differed by the domain’s classification in only one of the three classification systems were combined into larger metafolds. This process of matching and clustering is followed in Figures 5 and 6. Except where noted otherwise, we have used SCOP version 1.57, CATH version 2.4, and version 3.1b of the Dali Domain Dictionary in this work.



Figure 5. Example domain entries from SCOP, CATH, and Dali. Matching domains are given in bold-faced type. (A) Ubiquitin, a trivial case. The chain is defined as a single domain in all three systems. (B) Glutamine synthase, a less trivial case. The chain consists of multiple domains. SCOP and CATH break it into the same two domains (although they number them differently), whereas Dali breaks the chain into three domains, none of which meet the required 80% sequence overlap with the SCOP or CATH domains. The amino-terminal domain of glutamine synthase forms a fold pair with ubiquitin in both SCOP and CATH. If the amino-terminal domain of glutamine synthase were defined by Dali (italic type) to be one or two residues longer, such that it made the domain match cutoff, it still would not form a fold pair with ubiquitin.

 


Figure 6. Individual members of the β-grasp metafold. To the left of the arrow, fold identifiers for ubiquitin and glutamine synthase are determined on the basis of their fold types in those classification systems in which they are defined (Fig. 5). As two classification systems (SCOP and CATH) agree that the amino-terminal domain of glutamine synthase has the same fold (bold text) as ubiquitin, this fold identifier is bootstrapped into the β-grasp metafold to the right of the arrow. Other fold identifiers that are considered part of the β-grasp metafold are given in regular type. Populations are based on a sequence-unique (<95% sequence identity) subset of domains.

 
Matching domains
To map the domains of the three classification systems onto one another, we have used the criteria put forth by Dietmann and Holm in mapping Dali onto SCOP (supplemental material in Dietmann and Holm 2001). Essentially, the method involves determining the fraction of a given domain as defined by one classification system that is contained within the analogous domain in another classification system. We have adopted the 80% cutoff for this fraction used by Dietmann and Holm. Thus, to be considered in our clustering, 80% of the sequence defining a domain in SCOP must be present in a Dali domain definition, 80% of the Dali domain must be present in the SCOP definition, and so on for the other pairwise combinations of classification systems. We also considered 70% and 90% cutoffs. As one would expect, the number of domain matches was significantly affected by the cutoff used. The level of fold agreement among recognized domains did not, however, change significantly with an increase or decrease of the domain cutoff. In comparing the number of CATH pairs recognized by SCOP with an 80% versus 90% domain cutoff, for instance, we see the fraction of recognized domain pairs drop from 70% to 64%. The fraction of these domain pairs that are also fold pairs rises only slightly from 50% to 52%. Figure 5 gives examples of a trivial and a less trivial case for domain matching.
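
The reciprocal overlap criterion can be sketched as follows. This is an illustration only, not the code used in this work: it assumes that each domain is described by a list of inclusive (start, end) residue segments on the same chain, and the function names are hypothetical.

```python
def residues(segments):
    """Expand inclusive (start, end) residue segments into a set of residue numbers."""
    covered = set()
    for start, end in segments:
        covered.update(range(start, end + 1))
    return covered

def domains_match(segments_a, segments_b, cutoff=0.80):
    """Return True if each domain contains at least `cutoff` of the other.

    Assumption: both domains lie on the same chain and use the same numbering.
    """
    a, b = residues(segments_a), residues(segments_b)
    if not a or not b:
        return False
    shared = len(a & b)
    return shared / len(a) >= cutoff and shared / len(b) >= cutoff

# Hypothetical domain boundaries on one chain.
print(domains_match([(1, 100)], [(1, 90)]))               # True: 90% and 100% overlap
print(domains_match([(1, 100)], [(1, 75)]))               # False: only 75% of the first
print(domains_match([(1, 100)], [(1, 75)], cutoff=0.70))  # True at the 70% cutoff
```

Using residue sets rather than a single start and end allows discontinuous domains to be compared in the same way.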

Pairwise comparisons
Agreement between the three classification systems on specific fold assignments can be measured by comparing the fold assignments of all domains within the classification system in a pairwise fashion. Any two domains that share the same fold classification are considered a pair. Once domains have been matched between the classification systems, this can be extended in order to determine how many of the pairs defined in one classification system are defined in another. Pairwise comparisons were performed at three consensus levels. First, they were performed at the level of the individual domain lists acquired directly from SCOP, CATH, and Dali, then at the level of the domain matches with one other classification system, and finally at the level of domains agreed upon by all three classification systems. Pairs were determined in a list containing only PDB identifiers common to all three classification systems. The time dependence of the pair agreements was determined by using lists containing only PDB identifiers from early SCOP releases.
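
A minimal sketch of the pair comparison between two classification systems is given below. It assumes that domain matching has already been carried out, so that matched domains share a common key in both systems; the function name, keys, and fold labels are hypothetical. The explicit loop over all pairs is quadratic and is meant for illustration; for millions of pairs, grouping domains by fold first would be preferable.

```python
from itertools import combinations

def pair_agreement(folds_a, folds_b):
    """Compare the fold pairs of system A against system B.

    folds_a, folds_b: dicts mapping a shared domain key to a fold identifier.
    Returns (pairs in A, pairs with both domains defined in B,
             pairs that are also a fold pair in B).
    """
    total = domain_agree = fold_agree = 0
    for d1, d2 in combinations(sorted(folds_a), 2):
        if folds_a[d1] != folds_a[d2]:
            continue  # not a fold pair in system A
        total += 1
        if d1 in folds_b and d2 in folds_b:
            domain_agree += 1  # domain agreement
            if folds_b[d1] == folds_b[d2]:
                fold_agree += 1  # fold agreement
    return total, domain_agree, fold_agree

# Hypothetical toy data: d4 has no matching domain in system B.
folds_a = {"d1": "X", "d2": "X", "d3": "X", "d4": "X", "d5": "Y"}
folds_b = {"d1": "P", "d2": "P", "d3": "R", "d5": "Q"}
print(pair_agreement(folds_a, folds_b))  # (6, 3, 1)
```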

Nonredundancy
Redundant domains were considered to be those having >95% sequence identity to a previously counted domain. Redundancy was identified by using the near-identical classification level in CATH and the 95% sequence identity list of SCOP domains from the ASTRAL project (Brenner et al. 2000; Chandonia et al. 2002).
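A first-seen filter of this kind can be sketched in Python. The sketch assumes that each domain has already been assigned a redundancy group identifier (for instance, derived from a 95% identity list); the function name, the identifiers, and the grouping are hypothetical, and the way the two redundancy sources were combined is not modeled here.

```python
from typing import Dict, Iterable, List


def nonredundant(domains: Iterable[str], group_of: Dict[str, str]) -> List[str]:
    """Keep the first domain seen from each redundancy group, in input order."""
    seen = set()
    kept = []
    for domain in domains:
        # Assumption: a domain with no group entry forms a group of its own.
        group = group_of.get(domain, domain)
        if group not in seen:
            seen.add(group)
            kept.append(domain)
    return kept


# Hypothetical identifiers: d1 and d2 are assumed to share a >95% identity group.
groups = {"d1": "g1", "d2": "g1", "d3": "g2"}
kept = nonredundant(["d1", "d2", "d3", "d4"], groups)
print(kept)  # ['d1', 'd3', 'd4']
```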

Clustering
Clustering of the nonredundant domain set was accomplished by defining a metafold for each domain record. This metafold consists of the fold- or topology-level classification of the domain in each of the two or three databases in which it is classified. A cluster consists of all members of the nonredundant domain set having the same metafold. Clusters of domains defined as having the same fold by two of the three classification systems were combined. Clusters were ranked according to their population. Figure 6 shows how the metafolds created for the example domains in Figure 5 contribute to a given fold type.
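The clustering steps can be sketched in Python under stated assumptions. A metafold is represented here as a (SCOP, CATH, Dali) tuple of fold labels, with None where a domain is not classified. The text does not specify how clusters were combined beyond agreement by two of three systems, so the transitive merging below (via a small union-find) is an assumption, and all identifiers and labels are invented.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Optional, Tuple

Metafold = Tuple[Optional[str], Optional[str], Optional[str]]  # (SCOP, CATH, Dali)


def cluster_by_metafold(metafold_of: Dict[str, Metafold]) -> Dict[Metafold, List[str]]:
    """Group domains that have exactly the same metafold."""
    clusters = defaultdict(list)
    for domain, metafold in metafold_of.items():
        clusters[metafold].append(domain)
    return dict(clusters)


def agree_in_two(a: Metafold, b: Metafold) -> bool:
    """True if at least two systems assign both metafolds the same fold."""
    return sum(x is not None and x == y for x, y in zip(a, b)) >= 2


def combine_and_rank(clusters: Dict[Metafold, List[str]]) -> List[List[str]]:
    """Merge clusters agreeing in two of three systems, then rank by population."""
    keys = list(clusters)
    parent = {k: k for k in keys}

    def find(k: Metafold) -> Metafold:
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path compression
            k = parent[k]
        return k

    for a, b in combinations(keys, 2):
        if agree_in_two(a, b):
            parent[find(a)] = find(b)

    merged = defaultdict(list)
    for k in keys:
        merged[find(k)].extend(clusters[k])
    return sorted(merged.values(), key=len, reverse=True)


# Hypothetical metafolds for six nonredundant domains.
metafolds = {
    "d1": ("s1", "c1", "x1"),
    "d2": ("s1", "c1", "x1"),
    "d3": ("s1", "c1", "x2"),  # same SCOP and CATH fold as d1, different Dali fold
    "d4": ("s2", "c2", "x3"),
    "d5": ("s2", "c2", None),  # classified in only two of the databases
    "d6": ("s3", "c1", "x4"),  # shares only the CATH fold with d1
}
ranked = combine_and_rank(cluster_by_metafold(metafolds))
print([sorted(c) for c in ranked])  # [['d1', 'd2', 'd3'], ['d4', 'd5'], ['d6']]
```

Other merging rules are possible, for example merging only into the most populated cluster; the choice affects how broad the resulting metafold clusters become.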


    Acknowledgments
 
We are grateful for financial support provided by the NIH (GM 50789 to V.D.) and an institutional NIH training grant for molecular biophysics (National Research Service Award 5 T32 GM 08268 to R.D., D.A.C.B., and R.S.A.).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.


    References
 
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000a. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.

Berman, H.M., Bhat, T.N., Bourne, P.E., Feng, Z., Gilliland, G., Weissig, H., and Westbrook, J. 2000b. The Protein Data Bank and the challenge of structural genomics. Nat. Struct. Biol. 7: 957–959.

Brenner, S.E., Koehl, P., and Levitt, M. 2000. The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Res. 28: 254–256.

Chandonia, J.M., Walker, N.S., Lo Conte, L., Koehl, P., Levitt, M., and Brenner, S.E. 2002. ASTRAL compendium enhancements. Nucleic Acids Res. 30: 260–263.

Dengler, U., Siddiqui, A.S., and Barton, G.J. 2001. Protein structural domains: Analysis of the 3Dee domains database. Proteins 42: 332–344.

Dietmann, S. and Holm, L. 2001. Identification of homology in protein structure classification. Nat. Struct. Biol. 8: 953–957.

Dietmann, S., Park, J., Notredame, C., Heger, A., Lappe, M., and Holm, L. 2001. A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res. 29: 55–57.

Eidhammer, I., Jonassen, I., and Taylor, W.R. 2000. Structure comparison and structure patterns. J. Comp. Biol. 7: 685–716.

Ferrin, T.E., Huang, C.C., Jarvis, L.E., and Langridge, R. 1988. The MIDAS display system. J. Mol. Graphics 6: 13–27.

Gerstein, M. and Levitt, M. 1998. Comprehensive assessment of automatic structural alignment against a manual standard, the SCOP classification of proteins. Protein Sci. 7: 445–456.

Getz, G., Vendruscolo, M., Sachs, D., and Domany, E. 2002. Automated assignment of SCOP and CATH protein structure classifications from FSSP scores. Proteins 46: 405–415.

Hadley, C. and Jones, D.T. 1999. A systematic comparison of protein structure classifications: SCOP, CATH, and FSSP. Structure 7: 1099–1112.

Holm, L. and Sander, C. 1994. Parser for protein folding units. Proteins 19: 256–268.

———. 1998. Dictionary of recurrent domains in protein structures. Proteins 33: 88–96.

Jones, S., Stewart, M., Michie, A., Swindells, M.B., Orengo, C., and Thornton, J.M. 1998. Domain assignment for protein structures using a consensus approach: Characterization and analysis. Protein Sci. 7: 233–242.

May, A.C.W. 1999. Toward more meaningful hierarchical classification of protein three-dimensional structures. Proteins 37: 20–29.

McGuffin, L.J., Bryson, K., and Jones, D.T. 2001. What are the baselines of protein fold recognition? Bioinformatics 17: 63–72.

Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.

Orengo, C.A. 1999. CORA—Topological fingerprints for protein structural families. Protein Sci. 8: 699–715.

Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATH—A hierarchic classification of protein domain structures. Structure 5: 1093–1108.

Sauder, J.M., Arthur, J.W., and Dunbrack, R.L. 2000. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins 40: 6–22.

Shindyalov, I.N. and Bourne, P.E. 2000. An alternative view of protein fold space. Proteins 38: 247–260.

Taylor, W.R. and Orengo, C.A. 1989. Protein structure alignment. J. Mol. Biol. 208: 1–22.

Westhead, D.R., Slidel, T.W.F., Flores, T.P.J., and Thornton, J.M. 1999. Protein structural topology: Automated analysis and diagrammatic representation. Protein Sci. 8: 897–904.

