Protein Science
Protein Science (2003), 12:2239-2251.
Copyright © 2003 The Protein Society

Finding evolutionary relations beyond superfamilies: Fold-based superfamilies

Keiko Matsuda1, Takaaki Nishioka2, Kengo Kinoshita3,4, Takeshi Kawabata1 and Nobuhiro Go1,5

1 Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, 630-0101, Japan
2 Graduate School of Agriculture, Kyoto University, Kyoto 606-8502, Japan
3 Graduate School of Integrated Science, Yokohama City University, Yokohama 230-0045, Japan
4 Structure and Function of Biomolecules, PRESTO, Japan Science and Technology Corporation, Kawaguchi, Saitama 332-0012, Japan
5 Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute, Kizu, Souraku, Kyoto, 619-0215, Japan

Reprint requests to: Takaaki Nishioka, Graduate School of Agriculture, Kyoto University, Kyoto 606-8502, Japan; e-mail: nishioka{at}scl.kyoto-u.ac.jp; fax: 81-75-753-6408.

(RECEIVED May 7, 2003; FINAL REVISION July 8, 2003; ACCEPTED July 8, 2003)

Supplemental material: See www.proteinscience.org

Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0383603.


    Abstract
 
Superfamily classifications are based variably on similarity of sequences, global folds, local structures, or functions. We have examined the possibility of defining superfamilies purely from the viewpoint of the global fold/function relationship. For this purpose, we first classified protein domains according to the ß-sheet topology. We then introduced the concept of kinship relations among the classified ß-sheet topologies by assuming that the major elementary event leading to creation of a new ß-sheet topology is either an addition or deletion of one ß-strand at the edge of an existing ß-sheet during molecular evolution. Based on this kinship relation, a network of protein domains was constructed so that the distance between a pair of domains represents the number of evolutionary events that lead from one domain to the other. We then mapped onto it all known domains with a specific core chemical function (here taken, as an example, to be that involving ATP or its analogs). Careful analyses revealed that the domains are distributed on the network as >20 mutually disjoint clusters. The proteins in each cluster are defined to form a fold-based superfamily. The results indicate that >20 ATP-binding protein superfamilies have been invented independently in the process of molecular evolution, and that the conservative evolutionary diffusion of global folds and functions is the origin of the relationship between them.

Keywords: ATP-binding domains; kinship relations of global folds; purine biosynthesis; structure/function relationship


    Introduction
 
Empirical relationships between the three-dimensional structures of proteins and their functions provide powerful tools for deducing functions from structures. Two types of such relationships appear to exist; the first is between local structures and functions (Wallace et al. 1996; Fetrow and Skolnick 1998), and the other is between global folds and functions (Orengo et al. 1994; Lo Conte et al. 2000). Although the mechanisms of enzymatic reaction are likely to explain the former type of relationship, such direct mechanisms to explain the latter type are rather difficult to elucidate. The latter type of relationship may have its origin in molecular evolution; that is, both global folds and functions are conserved during the molecular evolution within certain ranges of variation (Todd et al. 2001), and hence, a relationship exists between global folds and functions both with certain ranges of variation. Mechanisms should exist only indirectly behind the conservation of global folds and their associated functions.

The validity of this picture can be attested if clear clusters of superfamilies of proteins with respective common core functions can be identified in a fold space of proteins with a certain measure of fold similarity. Each such cluster, if identified, deserves to be called a fold-based superfamily. The aim of this article is to show that this is actually possible with an objective procedure.

This aim was achieved by first constructing a network of kinship relations of global folds of protein domains. To limit the scope of the study, we focused on proteins with functions involving ATP or its analogs. This choice of proteins was made because of their functional importance and abundance of three-dimensional structures. We mapped all of the known ATP-binding domains on the constructed network. If the mapped domains formed mutually disjointed clear clusters, each with a similar core chemical function, then such clusters are examples of fold-based superfamilies. The fold-based superfamilies thus identified in this article will lead us to some interesting new findings.


    Results and Discussion
 
Construction of a global fold kinship relations network
First, we must construct a fold space with a proper measure of fold similarity. Because domains rather than entire protein molecules are thought to be the units of long-range evolution and because we are interested in the extent of global fold diversity, we should construct a fold space in which protein domains with different sets of secondary structural elements can be mutually related. We introduce the concept of a network of kinship relations, which is intended to locate essentially all ß-sheet-containing protein domains. We focus our attention on ß-sheet topology (Richardson 1977; Ptitsyn and Finkelstein 1980), that is, the directions of individual ß-strands and their connectivity (Fig. 1A). The kinship relations among various ß-sheet topologies are defined based on the assumption that a predominant elementary event leading to the creation of a new ß-sheet topology during molecular evolution is either the addition or deletion of one ß-strand at the edge of an existing ß-sheet (Fig. 1B). ß-Sheets are occasionally in contact with one or two α-helix layers. However, we disregarded α-helices in this article.



 
Figure 1. Definitions of ß-sheet topology and kinship relation of ß-topology groups. (A) Definition of ß-sheet topology. ß-Strands are numbered according to the order of the amino acid sequence. The underlined numbers represent ß-strands that are antiparallel to strand 1. The first characters (P, A, and M) refer to parallel-, antiparallel-, and mixed-type ß-sheets, respectively. The same ß-sheet can be expressed in two ways, for example, M15423 or M32451, depending on the face of the ß-sheet from which the sheet is viewed. However, we usually were not concerned with the face of the ß-sheet. In such cases, we used the convention that the strand number at the left edge is lower than that at the right edge; for example, the ß-topology group of the example is M15423, but not M32451. (B) Kinship 1 relation among ß-topology groups. The pair of M15423 and M1432 is an example of a kinship 1 relation. Deleting strand 3, the edge strand, from M15423 and renumbering the rest of the strands gives M1432. (C) Examples of a network of kinship 1 relations. M15423 and A2134, M15423 and M1432, M15423 and M216534, and A21345 and A2134 are kinship 1 pairs. M15423 and A21345 are connected by two kinship 1 distances, because both of them form a kinship 1 pair with A2134. M216534 is connected to A21345 by three kinship 1 distances.

 
Each domain from the 1011 nonhomologous ß-sheet-containing domains within the CATH database (Orengo et al. 1997) was classified by focusing our attention on the topology of the largest, and sometimes the second largest, ß-sheets. Domains with the same ß-sheet topology but with different α-helix arrangements were collected into one group (Supplementary Table 1). Such a group is defined here as a ß-topology group. We classified the global folds of the nonhomologous domains into 428 ß-topology groups (Supplementary Table 2).


 
Table 1. Distribution of domains, ß-topology groups, and kinship relation
 

 
Table 2. List of ß-topology groups that might be created by a fusion of two ß-sheets and examples of proteins with their PDB entry codes
 
Note that a ß-topology group can be a mixture of evolutionarily related and unrelated domains. A pair of domains, which are closely related evolutionarily, should belong to the same ß-topology group. However, an evolutionarily unrelated pair of domains with an accidentally identical ß-sheet may also be members.

Then, we introduce the kinship relations among the ß-topology groups, based on the assumption described above. A pair of ß-topology groups, related by the deletion or addition of one ß-strand at the edge of a ß-sheet, is defined as having the first degree of kinship, kinship 1 (Fig. 1B). We note here that a kinship relation between a pair of ß-topology groups can, in some cases, stand for two or more different evolutionary lineages.
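The kinship 1 relation can be illustrated with a minimal Python sketch. It is not the procedure used in this study; it only encodes the naming convention of Figure 1 under stated assumptions. The function names are hypothetical, and the strand orientations assigned to M15423 are an assumption chosen so that the two edge deletions reproduce the kinship 1 partners named in Figure 1 (A2134 and M1432). Names built by concatenating digits are only unambiguous for sheets with fewer than 10 strands.

```python
from typing import Sequence, Set, Tuple

# A sheet is a left-to-right sequence of (strand number, orientation) pairs.
# Orientation is True when the strand runs parallel to strand 1.
Strand = Tuple[int, bool]
Sheet = Tuple[Strand, ...]


def canonical(strands: Sequence[Strand]) -> Sheet:
    """Renumber strands by sequence order and apply the left-edge convention."""
    rank = {n: i + 1 for i, n in enumerate(sorted(n for n, _ in strands))}
    renumbered = [(rank[n], up) for n, up in strands]
    # Orientations are expressed relative to the (possibly new) strand 1.
    ref = next(up for n, up in renumbered if n == 1)
    oriented = [(n, up == ref) for n, up in renumbered]
    # Convention: the strand number at the left edge is the lower one.
    if oriented[0][0] > oriented[-1][0]:
        oriented.reverse()
    return tuple(oriented)


def name(sheet: Sheet) -> str:
    """Return the P/A/M prefix followed by the strand order."""
    same = [a[1] == b[1] for a, b in zip(sheet, sheet[1:])]
    prefix = "P" if all(same) else "A" if not any(same) else "M"
    return prefix + "".join(str(n) for n, _ in sheet)


def deletion_neighbours(sheet: Sheet) -> Set[str]:
    """Names of the topologies obtained by deleting either edge strand."""
    return {name(canonical(sheet[1:])), name(canonical(sheet[:-1]))}


# Assumed orientations: only the strand pair 1-5 is parallel.
m15423 = canonical([(1, True), (5, True), (4, False), (2, True), (3, False)])
print(name(m15423), sorted(deletion_neighbours(m15423)))
# M15423 ['A2134', 'M1432']
```

Deleting the left edge strand (strand 1) yields A2134, and deleting the right edge strand (strand 3) yields M1432. Additions are the reverse operation, so the same function applied to a larger sheet recovers them.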

When the kinship 1 relationship was introduced into the 428 ß-topology groups, the majority of them were connected into one large network. The remaining ß-topology groups were found either as isolated orphans or as members of small networks with up to three members. However, a large fraction of these orphans and small networks can be connected to the one large network via a ß-topology group that is not a member of the 428 ß-topology groups. There are 101 such ß-topology groups that can act as a bridge, called bridging ß-topology groups. We added these 101 ß-topology groups to the 428 ß-topology groups in the following analyses. The resulting 529 ß-topology groups now belonged to one major network, except for 36 orphans and seven groups in three isolated small networks (Table 1; Supplementary Table 2). Even though the majority of the ß-topology groups were connected into one main network, this does not necessarily mean that they are all mutually evolutionarily related. This is because a kinship relation can stand for two or more different evolutionary lineages. In fact, later in this article, we will describe how the major network can be decomposed into ≥20 disjoint clusters, by an analysis that dissects different evolutionary lineages.
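How kinship distances and orphans follow from the kinship 1 pairs can be sketched with a breadth-first search. The edges below are the kinship 1 pairs listed in the Figure 1C caption; the node "ORPHAN" and all function names are hypothetical and serve only to illustrate an isolated group. This is a sketch under those assumptions, not the software used for the analysis.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

# Kinship 1 pairs taken from the Figure 1C caption.
EDGES = [("M15423", "A2134"), ("M15423", "M1432"),
         ("M15423", "M216534"), ("A21345", "A2134")]
# "ORPHAN" is a made-up group with no kinship 1 partner.
NODES = {n for edge in EDGES for n in edge} | {"ORPHAN"}


def build_graph(nodes: Iterable[str],
                edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected adjacency sets for the kinship 1 network."""
    graph: Dict[str, Set[str]] = {n: set() for n in nodes}
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def distances_from(graph: Dict[str, Set[str]], start: str) -> Dict[str, int]:
    """Kinship distance (number of kinship 1 steps) from start to each reachable group."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def components(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Connected networks; a component of size 1 is an orphan."""
    seen: Set[str] = set()
    parts: List[Set[str]] = []
    for node in sorted(graph):
        if node not in seen:
            part = set(distances_from(graph, node))
            seen |= part
            parts.append(part)
    return parts


graph = build_graph(NODES, EDGES)
print(distances_from(graph, "A21345")["M15423"])   # 2
print(distances_from(graph, "A21345")["M216534"])  # 3
print(sorted(len(p) for p in components(graph)))   # [1, 5]
```

The two distances agree with the Figure 1C caption: M15423 and A21345 are two kinship 1 steps apart, and M216534 and A21345 are three steps apart.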

The 36 orphans discussed above are those lacking any intermediates connecting them to the network. This lack of an intermediate group might be due to the limited number of known structures. The occurrence of orphans is rare in ß-topology groups of small ß-sheets; they comprise only 3.5% (13 of 376) of the ß-topology groups of two- to eight-stranded ß-sheets (Table 1). This rarity supports the validity of the assumption that the addition or deletion of one ß-strand at the edge of an existing ß-sheet is the major elementary event that took place during evolution. The lack of intermediates is due either to a paucity of known structures or to minor events during evolution, such as domain swapping (Fukami-Kobayashi et al. 1999). In contrast, orphans are found among almost half (44% = 23 of 52) of the ß-topology groups of 9- to 18-stranded ß-sheets. Some of these orphans may have evolved by the fusion of two smaller ß-sheets rather than by the stepwise addition of a ß-strand (Table 2).

Basic features of the network of global fold kinship relations
On the network, parallel and antiparallel ß-topology groups are not separated from each other but are connected via mixed-type ß-topology groups (Fig. 2); for example, the P12 ß-topology group is connected to the A12 ß-topology group by six mixed-type, three-stranded ß-topology groups such as M123.



 
Figure 2. Network of global fold kinship relations. This figure is the network of ß-sheets with two to five strands. Circles are two-stranded ß-sheets; triangles, three-stranded ß-sheets; squares, four-stranded ß-sheets; and pentagons, five-stranded ß-sheets. The symbols are colored red, blue, or green, indicating a parallel, antiparallel, or mixed ß-sheet, respectively. The connecting lines are kinship 1 relations. This graph was drawn by using Graphviz (http://www.research.att.com/sw/tools/graphviz/).

 
Most of the ß-topology groups, ~79% (337 of 428), have only one or two domains in each group. Some ß-topology groups are more populated. The most populous ß-topology groups are A12 (mainly ß-hairpin; 91), A123 (mainly meander; 88), A2134 (55), A1234 (51), A1243 (50), P12345678 (TIM-barrel; 48), A132 (34), A1423 (32), and A213 (27), in which the number of domains classified in each ß-topology group is given in parentheses. The total number of domains that can be classified in these nine groups is about one third (476 of 1444) of all of the domains analyzed in the present study. These populated ß-topology groups apparently tolerate various kinds of amino acid sequences and, therefore, should be highly flexible for creating new functions. The well-populated ß-topology groups also frequently appear as substructures of larger sheets; for example, A123 also appears in A2134, A1234, and A1243. The TIM-barrel P12345678, which is one of the nine most populated groups, is an exception and does not occur as a substructure of other larger sheets.

About 44% (188 of 428) of the ß-topology groups are connected to the network by a single kinship relation, whereas the rest are multiply connected by more than one kinship relation.
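The counts quoted in this subsection can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the variable names are illustrative.

```python
# Domain counts of the nine most populous ß-topology groups, as quoted in the text.
counts = {"A12": 91, "A123": 88, "A2134": 55, "A1234": 51, "A1243": 50,
          "P12345678": 48, "A132": 34, "A1423": 32, "A213": 27}

total_top9 = sum(counts.values())
share_top9 = round(100 * total_top9 / 1444)   # share of all 1444 analyzed domains
share_small = round(100 * 337 / 428)          # groups with only one or two domains
share_single = round(100 * 188 / 428)         # groups with a single kinship relation

print(total_top9, share_top9)   # 476 33
print(share_small)              # 79
print(share_single)             # 44
```

The nine groups sum to 476 domains, which is about one third of 1444, in agreement with the figures given above.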

Mapping ATP-binding domains on the network
The next step is to analyze the distribution of the ß-topology groups of the ATP-binding domains on this network. For this purpose, we systematically collected as many three-dimensional structures of ATP-binding proteins as possible, by including those within the CATH nonhomologous domain set. According to the LIGAND chemical database for enzymatic reactions in KEGG (Goto et al. 2000), 406 enzymes use ATP as a substrate or an effector. For 75 among these enzymes, crystal structures have been registered in the Protein Data Bank (PDB, April 2000; Berman et al. 2000). Their ATP-binding domains belong to 32 different Structural Classification of Proteins (SCOP) superfamilies (Hubbard et al. 1997). To enrich the analysis of the distribution, we extended the data from the 75 enzymes to 182 proteins belonging to the 32 superfamilies (see Materials and Methods). In this process of enrichment, some GTP-binding domains were also included. Finally, we obtained 93 ß-topology groups from the PDB structures of 285 ATP-binding (and exceptionally some GTP-binding) domains (Table 3). Some of the SCOP domains consist of two substructures, which are each defined as a domain in our treatment, for example, the superfamily "Phosphofructokinase". Some of the structures were determined after the collection of the CATH data set. For this reason, their ß-topology groups might not be found in the network.


 
Table 3. Summary of ATP-binding domains: their SCOP superfamilies (S) and families (F), and ß-topology groups
 
Upon the network of kinship relations, we mapped the 93 ß-topology groups of the ATP-binding domains. Of them, 59 ß-topology groups were found on the network. Another 27 groups were not found on the network but were connected to the network either by a kinship 1 relation or through a bridging ß-topology group. A total of 86 (59 + 27) ß-topology groups were widely distributed on the network (Fig. 3A,B). The remaining seven groups were orphans to the network and were not included in the following analysis.




 
Figure 3. Distribution and clusters of ATP-binding domains on the network of kinship relations. Two figures (A and B) show the part of the kinship relation network that covers the range of distributions of 93 ß-topology groups of ATP-binding (and exceptionally some GTP-binding) domains. For example, there are 43 four-stranded ß-topology groups on the network, but only 14 are shown in these two figures. For ease of perception, the network is divided into two groups: one (A) is mainly for antiparallel and the other (B) is mainly for parallel ß-topology groups. The main components of these figures are the ß-topology groups and the lines connecting them. If a domain of a ß-topology group is involved in ATP-binding, then its name is enclosed by a box. The names of ß-topology groups in parentheses are those that are not in the network of kinship relations (mainly due to their absence from the CATH S-reps data set) but were inserted to help trace the kinship relations. The network is arranged to show the four-stranded ß-topology groups in the central area of each figure, and the other ß-topology groups with more strands are in the upper and lower areas. The black lines connect a pair of kinship 1 ß-topology groups. Chemical categories are indicated by colors: yellow, red, blue, green, and white correspond to categories 1 to 5, respectively (see Table 4). The color of the box indicates the chemical category. Thick colored lines connect ß-topology groups that form a fold-based superfamily. Numbers in circles are those given to each of the fold-based superfamilies in Table 4.

 
With the exception of just two cases, the "motor proteins" family in the "P-loop–containing nucleotide triphosphate hydrolases" SCOP superfamily and the "SAICAR synthase-like" SCOP superfamily (S1F8 and S25, respectively, in Table 3), all of the ß-topology groups in the same SCOP superfamily (and domain, if applicable) are connected within three kinship 1 distances to form a cluster. This verifies the assumption that each global fold evolves by adding or deleting one ß-strand at the edge of an existing ß-sheet. The above two exceptions are probably due to the lack of known structures that would connect them to the other members within three kinship 1 distances, or due to incorrect classification of evolutionarily unrelated proteins into a SCOP superfamily.

Chemical reactions catalyzed by ATP-binding domains
To study the relationship between the global folds and the molecular functions, we introduce five coarse-grained chemical categories of ATP-binding proteins (given in Table 4). The reactions are first classified into two types: phosphoryl-transfer reactions on the γ-phosphorus atom, and nucleotidyl-transfer reactions on the α-phosphorus atom. Both types of reaction are further classified into chemical energy-retaining and -releasing reactions. Chemical energy-retaining reactions produce high-energy compounds, such as acyl phosphates, as a product or reaction intermediate, whereas chemical energy-releasing ones proceed by ATP hydrolysis or produce low-energy phosphate compounds (Walsh 1979; Voet et al. 1999; Berg et al. 2001). We included a fifth category, "other," for the various reactions that occur on atoms other than the α- and γ-phosphorus atoms and for ATP binding as an effector.
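The two-level classification described above can be written as a small decision function. This is a minimal sketch: the enum member names, the function name, and the string labels "alpha" and "gamma" are illustrative assumptions, and the numbering of the categories used in Table 4 is not assumed here.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    PHOSPHORYL_RETAINING = "phosphoryl transfer, energy-retaining"
    PHOSPHORYL_RELEASING = "phosphoryl transfer, energy-releasing"
    NUCLEOTIDYL_RETAINING = "nucleotidyl transfer, energy-retaining"
    NUCLEOTIDYL_RELEASING = "nucleotidyl transfer, energy-releasing"
    OTHER = "other"


def categorize(attacked_atom: Optional[str], retains_energy: bool) -> Category:
    """Assign a coarse-grained chemical category.

    attacked_atom is "gamma" or "alpha" for reactions on those phosphorus
    atoms; any other value (or None, e.g. ATP bound as an effector) is "other".
    """
    if attacked_atom == "gamma":
        return (Category.PHOSPHORYL_RETAINING if retains_energy
                else Category.PHOSPHORYL_RELEASING)
    if attacked_atom == "alpha":
        return (Category.NUCLEOTIDYL_RETAINING if retains_energy
                else Category.NUCLEOTIDYL_RELEASING)
    return Category.OTHER


# A reaction forming an acyl phosphate intermediate on the gamma-phosphorus:
print(categorize("gamma", True).value)   # phosphoryl transfer, energy-retaining
# ATP bound as an effector, with no reaction on either phosphorus atom:
print(categorize(None, False).value)     # other
```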


 
Table 4. Chemical categories and fold-based superfamilies of ATP-binding domains
 
Identifying fold-based superfamilies
The distribution of ATP-binding domains was examined by assigning the category and the topological position of the binding site to each domain (shown in Fig. 3 by a color-coded distribution). If a clear cluster with common chemical categories was identified on this network, then we define it as a fold-based superfamily. We operationally define each cluster connected within three kinship 1 distances (cutoff kinship distance = 4) as a fold-based superfamily. Furthermore, the mode of ATP binding to the fold is assumed to be conserved in each fold-based superfamily, at least at the level of the topological position of the binding site with respect to the plane of the ß-sheet, for example, whether it is on one particular face or edge of the sheet out of the two possible faces or edges.

This definition is based on the assumption that no pair of proteins of the same evolutionary origin catalyzes reactions belonging to different categories. Although we think that this is a reasonable assumption, it remains an assumption in this article and needs to be critically examined in the future. Todd et al. (2001) found that a small number of proteins changed their reaction chemistry during evolution. However, no ATP-binding domains collected in the present study changed the chemical category during evolution. Chemical category is definable at different levels of granularity. Reaction chemistry in the study by Todd et al. (2001) was defined at the level of EC-number classification, but core chemical function in our present study is more roughly defined at the level of superfamily classification.

The fold-based superfamilies thus obtained depend on the cutoff kinship distance used. When we tried to form clusters with different cutoff kinship distances, from three to six, a few different clusters appeared, depending on the cutoff kinship distance. Nevertheless, almost the same picture for the evolution of ATP-binding domains emerged. In the following, we describe the 29 fold-based superfamilies (Table 4), corresponding to the case with the cutoff kinship distance 4. Each fold-based superfamily often includes more than one SCOP superfamily. Two of the 29 fold-based superfamilies are orphans to the network.
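The dependence of the clusters on the cutoff can be illustrated with a toy example. The sketch below makes several assumptions that are not stated in the text: clusters are formed by single linkage (two ATP-binding groups of the same category are joined when their kinship distance is smaller than the cutoff, and clusters are the connected components of these joins), the network is a hypothetical nine-node chain g0 to g8, and the category labels are made up. The topological position of the binding site is ignored here; it could be folded into the category label.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Set


def bfs(graph: Dict[str, Set[str]], start: str) -> Dict[str, int]:
    """Kinship distances from start to every reachable group."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def fold_based_clusters(graph: Dict[str, Set[str]],
                        category: Dict[str, int],
                        cutoff: int) -> List[List[str]]:
    """Single-linkage clusters of same-category groups closer than cutoff."""
    dist = {g: bfs(graph, g) for g in category}
    parent = {g: g for g in category}

    def find(g: str) -> str:
        while parent[g] != g:
            g = parent[g]
        return g

    for a, b in combinations(category, 2):
        # Unreachable pairs default to the cutoff and are never joined.
        close = dist[a].get(b, cutoff) < cutoff
        if close and category[a] == category[b]:
            parent[find(a)] = find(b)

    clusters: Dict[str, Set[str]] = {}
    for g in category:
        clusters.setdefault(find(g), set()).add(g)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical chain network g0 - g1 - ... - g8.
names = [f"g{i}" for i in range(9)]
graph: Dict[str, Set[str]] = {n: set() for n in names}
for a, b in zip(names, names[1:]):
    graph[a].add(b)
    graph[b].add(a)

# Hypothetical ATP-binding groups and their chemical categories.
category = {"g0": 1, "g2": 1, "g7": 1, "g3": 2}

print(fold_based_clusters(graph, category, cutoff=4))
# [['g0', 'g2'], ['g3'], ['g7']]
print(fold_based_clusters(graph, category, cutoff=6))
# [['g0', 'g2', 'g7'], ['g3']]
```

In this toy case a larger cutoff merges two clusters of the same category, while the group of a different category stays separate at any cutoff, which mirrors the role the chemical categories play in dissecting the network.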

Limitations of fold-based superfamilies
Some pairs of fold-based superfamilies have similar chemical functions, although they are separated by long kinship distances. For instance, the fold-based superfamilies "nucleotidyl polymerase" and "mononucleotidyl transferase" are functionally similar, but they cannot be merged by small kinship distances. Such a united cluster must involve the two-stranded ß-topology group A12 (Fig. 2). However, this is impossible, because the ATP-binding domains invariably contain a ß-sheet consisting of at least four ß-strands. There are at least 10 such clear cases of separation, including the pair of class I aminoacyl-tRNA synthetase and class II aminoacyl-tRNA synthetase fold-based superfamilies (Fig. 3A,B).

We should reexamine the identification process for fold-based superfamilies. The categorization of reactions was motivated by the need to dissect the complicated network of global fold kinship relations. If we had a more refined fold space than the network used in this work (which would probably also involve information contained in conventional fold representations based on atom positions), then we would be able to identify the fold-based superfamilies as mutually disjoint clear clusters, without resorting to a classification of functions. If separation into two clusters is really impossible in any fold space, then it means that the two should in fact be regarded as defining one cluster and, therefore, as forming one fold-based superfamily with reactions of both categories. Our research has not yet attained this horizon. At the level of the coarse-grained fold space of the network of kinship relations, the categorization of reactions is a very powerful method to complement its coarse-grained nature.

New findings in the fold-based superfamilies
The glutathione synthetase fold-based superfamily (2 in Table 4) contains two SCOP single-domain superfamilies (S14 and S15) and three domains, each of which is one CATH domain of the SCOP double-domain superfamilies S12Domain2, S25F1Domain2, and S31Domain2. Each of these SCOP superfamilies is characterized by a reaction that proceeds via a carboxyphosphate or phosphohistidine intermediate, by an energy-retaining phosphoryl-transfer mechanism. One isolated domain, S25F1Domain1, and the histidine kinase fold-based superfamily 7 are in the same chemical reaction category and within a cutoff distance of four from each other. However, these two are treated as independent fold-based superfamilies, because ATP binds to the ß-sheet differently.

The finding that domains related to the five SCOP superfamilies are in fact members of one fold-based superfamily leads to new insight into the evolution of enzymes in de novo purine biosynthesis. Purine is synthesized in microorganisms by a pathway involving 10 successive enzymes (Buchanan 1973): PurF, PurD, PurT, PurL, PurM, PurK, PurE, PurC, PurB, and PurH, of which six (PurD, PurT, PurL, PurM, PurK, and PurC) use ATP as a substrate. The three-dimensional structures have been determined for all of these ATP-using enzymes, except PurL. A PSI-BLAST sequence analysis of PurL revealed PurM as its sequence homolog. The structures of PurD (Wang et al. 1998), PurK (Thoden et al. 1999), PurT (Thoden et al. 2000), PurC (Levdikov et al. 1998), and PurM (and by inference, also PurL; Li et al. 1999) are distributed in three different SCOP double-domain superfamilies (S12, S25F1, and S31 in Table 3). The ATP-binding cleft in each enzyme is formed by the two domains. One domain in each of these SCOP double-domain superfamilies was found in our analysis to belong to one common glutathione synthetase fold-based superfamily. The ß-strands in these common domains are shown in red in Figure 4. The fact that these domains have a common evolutionary origin implies that an ancestral form of this domain alone was capable of catalyzing the chemical reaction of the same category. In fact, a single-domain enzyme, glutamine synthetase, which belongs to the SCOP single-domain superfamily S14 in Table 3 and catalyzes the chemical reaction of the same category (Gill and Eisenberg 2001), is a member of our glutathione synthetase fold-based superfamily. Our present analysis indicates that all six of the ATP-using enzymes in de novo purine biosynthesis have evolved from an ancestral glutamine synthetase as the common evolutionary origin, by combining with another domain to improve their catalytic specificities.



 
Figure 4. Five de novo purine biosynthesis enzymes and glutamine synthetase, all in the glutathione synthetase fold-based superfamily. The ATP-binding cleft in each de novo purine biosynthesis enzyme is formed by two domains: domain 1 on the right and domain 2 on the left. Only domain 2, which is shown by ß-strands in red, is a member of the glutathione synthetase fold-based superfamily. The ATP-binding cleft in glutamine synthetase is formed by a single domain. Their PDB entry codes and SCOP classification codes (Table 3) are as follows: PurT (1ez1, S12F2), PurK (1b6s, S12F2), PurD (1gso, S12F2), PurM (1cli, S31), PurC (1a48, S25F1), and glutamine synthetase (1lgr, S14). Structures were drawn by MOLSCRIPT (Kraulis 1991).

 
This is the first evidence that as many as six enzymes catalyzing the reactions in one metabolic pathway are evolutionarily related to each other. Previous structure and function analyses of other enzymes indicated that they could evolve from an ancestral enzyme by retaining its core chemical reaction and adjusting its substrate specificity flexibly (Petsko et al. 1993; Babbitt and Gerlt 1997; Gerlt and Babbitt 1998; Hasson et al. 1998). However, these cases have mostly applied to only two or three enzymes in one metabolic pathway (Farber and Petsko 1990; Fani et al. 1995).

The green-colored fold-based superfamily 19 in Table 4, the nucleotidyl polymerase fold-based superfamily, includes two SCOP single-domain superfamilies, S20 and S21 in Table 3, and one domain, S13Domain2 in Table 3, from one SCOP double-domain superfamily. This fold-based superfamily includes all of the enzymes with known structures involved in nucleotidyl polymerization, transcription and capping of polynucleotides, and cyclization of mononucleotides. DNA polymerase ß (S19 in Table 3), which is incorrectly named as a polymerase but is actually a repair enzyme (Sawaya et al. 1997; Arndt et al. 2001), is clustered into a separate fold-based superfamily.

Phosphorylation of sugars is catalyzed by kinases that have been classified into four SCOP superfamilies (S1F3, F5, F12, F13; S4F1; S6Domain1; S8F1 in Table 3). The domains in these superfamilies are now found to belong to two different fold-based superfamilies, superfamilies 8 and 9 in Table 4. These are treated here as separate fold-based superfamilies, because of the different topological positions of the binding sites.

The SCOP superfamily P-loop–containing nucleotide triphosphate hydrolases (S1 in Table 3) now appears to be a mixture of six evolutionarily unrelated groups of proteins, belonging to the adenylate kinase (1), phosphosugar hydroxykinase I (8), GTPase I (10), GTPase II (12), myosin (11), and PAPS sulfotransferase (22) fold-based superfamilies (Table 4). On the other hand, the adenylate kinase fold-based superfamily 1 in Table 4 includes S1F1, S1F6, S1F9, and S30 in Table 3, among which S30 was not assigned to S1 in SCOP, even though they share the P-loop motif (Saraste et al. 1990; Bertrand et al. 1997).

Some of the interesting new findings are described above. Descriptions of the 29 fold-based superfamilies are given in Table 4, which contains numerous new findings not described above.

Implications for the evolution of proteins
An important aspect of understanding biological systems at the molecular level is not only to know the structure and function of enzymes but also to be able to relate enzymes to each other, from an evolutionary viewpoint. In this work, we found that there are roughly 20 or more independently invented ATP-binding protein families, each with its specific core chemical reaction. During the course of evolution, each fold-based superfamily diverged somewhat in both the fold space and the function space, as summarized in Figure 3 and Table 4. Even during the course of divergence, the specific core chemical reaction was maintained. Because the use of ATP should always be necessary for the core chemical reaction, we conclude that the mechanisms of ATP binding and ATP usage in the core chemical reaction were both created at the conception of a fold-based superfamily.

By closely examining the process of divergence, we should be able to establish a phylogeny of proteins in each fold-based superfamily. Through such a process, we could reexamine the currently popular, but often very confusing, concepts of family and superfamily.


    Materials and methods
 
The CATH S-reps data set (June 1999), which is the set of nonhomologous representatives with <35% sequence identities, consists of 1173 domains that were members of the mainly-β and the α-β classes. We selected 1011 domains by removing those that were low in resolution or determined by NMR. The topology of each β-sheet assigned by TOPS (Flores et al. 1994) was examined by inspecting the structures on a graphic display. Topologies of barrels and sandwich layers were assigned by assuming that the corresponding flat β-sheet is first formed and then bent. When one β-sheet is composed of strands from two or more different chains, we separated it into β-sheets that are composed of the strands from the same peptide and assigned the topology for each sheet. Among the 1011 domains, 77% consisted of two or more β-sheets. When the largest β-sheet in each domain differed from the remaining ones by only one β-strand, the same domain was analyzed multiple times from the viewpoints of the individual β-sheets. Because of this treatment, our set of 1011 domains now appears to consist of 1444 domains (Supplementary Table 1).
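The multiple counting of domains described above can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used: the `Domain` class, the toy strand counts, and the reading of "differed by only one β-strand" as "at most one strand smaller than the largest sheet" are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Domain:
    name: str               # hypothetical domain identifier
    sheet_sizes: List[int]  # number of strands in each beta-sheet of the domain


def sheets_to_analyze(domain: Domain) -> List[int]:
    """Return the strand counts of the sheets analyzed for one domain.

    The largest sheet is always analyzed. Any other sheet that is at most
    one strand smaller than the largest is analyzed as well (assumed rule).
    """
    if not domain.sheet_sizes:
        return []
    ordered = sorted(domain.sheet_sizes, reverse=True)
    largest = ordered[0]
    return [largest] + [n for n in ordered[1:] if largest - n <= 1]


def count_entries(domains: List[Domain]) -> int:
    """Total number of (domain, sheet) entries after multiple counting."""
    return sum(len(sheets_to_analyze(d)) for d in domains)


# Toy data, invented for illustration only.
domains = [
    Domain("dom_a", [5]),        # single sheet: one entry
    Domain("dom_b", [6, 5, 3]),  # sheets of 6 and 5 strands: two entries
    Domain("dom_c", [4, 4]),     # two equal sheets: two entries
    Domain("dom_d", [7, 4]),     # second sheet much smaller: one entry
]

print(len(domains), "domains ->", count_entries(domains), "entries")
```

In this toy set, four domains yield six entries, in the same way that the 1011 selected domains yield 1444 entries in Supplementary Table 1.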

The 32 SCOP superfamilies contain 83 SCOP families and 182 proteins (Supplementary Table 3). When a protein had two or more PDB structures derived from different biological species, we selected one PDB structure from each species. In this way, we selected 254 PDB structures from the 182 proteins. When the ATP-binding site was composed of two domains, two ATP-binding domains were selected from one PDB structure. Thus, we obtained a total of 263 ATP- and 22 GTP-binding domains from 254 PDB structures. The identification of the ATP-binding domains was confirmed by the references cited in the PDB database and by the distribution of the residues interacting with the bound ATP or its analog. Using LIGPLOT (Wallace et al. 1995), we further confirmed that such interacting residues are mainly located on the β-strands of the largest β-sheet in the ATP-binding domain or on the loops connecting the β-strands.
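The selection of one structure per species, and the bookkeeping behind the domain counts, can be sketched as follows. The record layout, the placeholder identifiers, and the rule of keeping the first structure listed for each species are assumptions; the text does not state how a structure was chosen when several were available for the same species. The final count also assumes that no structure contributed more than two domains.

```python
def one_structure_per_species(records):
    """Keep the first PDB entry listed for each (protein, species) pair."""
    chosen = {}
    for protein, species, pdb_id in records:
        chosen.setdefault((protein, species), pdb_id)
    return list(chosen.values())


# Placeholder records (protein, species, structure id), invented for illustration.
records = [
    ("protein_A", "species_1", "struct_01"),
    ("protein_A", "species_1", "struct_02"),  # same species: skipped
    ("protein_A", "species_2", "struct_03"),  # different species: kept
    ("protein_B", "species_1", "struct_04"),
]
selected = one_structure_per_species(records)
print(selected)

# Counts quoted in the text.
atp_domains = 263
gtp_domains = 22
pdb_structures = 254

total_domains = atp_domains + gtp_domains
# If every structure contributes one or two domains (assumption), the surplus
# of domains over structures equals the number of two-domain structures.
structures_with_two_domains = total_domains - pdb_structures
print(total_domains, structures_with_two_domains)
```

Under these assumptions, the 285 nucleotide-binding domains would correspond to 31 structures that contribute two domains each and 223 that contribute one.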


    Acknowledgments
 
We thank Dr. K. Yura for helpful discussions. We also thank Dr. G. Basu for reading the manuscript carefully. This work was partly supported by the Special Coordination Funds Promoting Science and Technology from the MEXT (Ministry of Education, Culture, Sports, Science, and Technology, Japan).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.


    References
Arndt, J.W., Gong, W., Zhong, X., Showalter, A.K., Liu, J., Dunlap, C.A., Lin, Z., Paxson, C., Tsai, M.D., and Chan, M.K. 2001. Insight into the catalytic mechanism of DNA polymerase beta: structures of intermediate complex. Biochemistry 40: 5368–5375.

Babbitt, P.C. and Gerlt, J.A. 1997. Understanding enzyme superfamilies: Chemistry as the fundamental determinant in the evolution of new catalytic activities. J. Biol. Chem. 272: 30591–30594.

Berg, J.M., Tymoczko, J.L., and Stryer, L. 2001. Biochemistry. W.H. Freeman and Co., New York.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.

Bertrand, J.A., Auger, G., Fanchon, E., Martin, L., Blanot, D., van Heijenoort, J., and Dideberg, O. 1997. Crystal structure of UDP-N-acetylmuramoyl-L-alanine:D-glutamate ligase from Escherichia coli. EMBO J. 16: 3416–3425.

Buchanan, J.M. 1973. The amidotransferases. Adv. Enzymol. Relat. Areas Mol. Biol. 39: 91–183.

Fani, R., Lio, P., and Lazcano, A. 1995. Molecular evolution of the histidine biosynthetic pathway. J. Mol. Evol. 41: 760–774.

Farber, G.K. and Petsko, G.A. 1990. The evolution of α/β barrel enzymes. Trends Biochem. Sci. 15: 228–234.

Fetrow, J.S. and Skolnick, J. 1998. Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. J. Mol. Biol. 281: 949–968.

Flores, T.P., Moss, D.S., and Thornton, J.M. 1994. An algorithm for automatically generating protein topology cartoons. Protein Eng. 7: 31–37.

Fukami-Kobayashi, K., Tateno, Y., and Nishikawa, K. 1999. Domain dislocation: A change of core structure in periplasmic binding proteins in their evolutionary history. J. Mol. Biol. 286: 279–290.

Gerlt, J.A. and Babbitt, P.C. 1998. Mechanistically diverse enzyme superfamilies: The importance of chemistry in the evolution of catalysis. Curr. Opin. Chem. Biol. 2: 607–612.

Gill, H.S. and Eisenberg, D. 2001. The crystal structure of phosphinothricin in the active site of glutamine synthetase illuminates the mechanism of enzymatic inhibition. Biochemistry 40: 1903–1912.

Goto, S., Nishioka, T., and Kanehisa, M. 2000. LIGAND: Chemical database of enzyme reactions. Nucleic Acids Res. 28: 380–382.

Hasson, M.S., Schlichting, I., Moulai, L., Taylor, K., Barrett, W., Kenyon, G.L., Babbitt, P.C., Gerlt, J.A., Petsko, G.A., and Ringe, D. 1998. Evolution of an enzyme active site: The structure of a new crystal form of muconate lactonizing enzyme compared with mandelate racemase and enolase. Proc. Natl. Acad. Sci. 95: 10396–10401.

Hubbard, T.J.P., Murzin, A.G., Brenner, S.E., and Chothia, C. 1997. SCOP: A structural classification of proteins database. Nucleic Acids Res. 25: 236–239.

Kraulis, P.J. 1991. MOLSCRIPT: A program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallogr. 24: 946–950.

Levdikov, V.M., Barynin, V.V., Grebenko, A.I., Melik-Adamyan, W.R., Lamzin, V.S., and Wilson, K.S. 1998. The structure of SAICAR synthase: An enzyme in the de novo pathway of purine nucleotide biosynthesis. Structure 6: 363–376.

Li, C., Kappock, T.J., Stubbe, J.A., Weaver, T.M., and Ealick, S.E. 1999. X-ray crystal structure of aminoimidazole ribonucleotide synthetase (PurM), from the Escherichia coli purine biosynthetic pathway at 2.5 Å resolution. Structure 7: 1155–1166.

Lo Conte, L., Ailey, B., Hubbard, T.J.P., Brenner, S.E., Murzin, A.G., and Chothia, C. 2000. SCOP: A structural classification of proteins database. Nucleic Acids Res. 28: 257–259.

Orengo, C.A., Jones, D.T., and Thornton, J.M. 1994. Protein superfamilies and domain superfolds. Nature 372: 631–634.

Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATH—A hierarchic classification of protein domain structures. Structure 5: 1093–1108.

Petsko, G.A., Kenyon, G.L., Gerlt, J.A., Ringe, D., and Kozarich, J.W. 1993. On the origin of enzymatic species. Trends Biochem. Sci. 18: 372–376.

Ptitsyn, O.B. and Finkelstein, A.V. 1980. Similarities of protein topologies: Evolutionary divergence, functional convergence or principles of folding? Q. Rev. Biophysics 13: 339–386.

Richardson, J.S. 1977. β-Sheet topology and the relatedness of proteins. Nature 268: 495–500.

Saraste, M., Sibbald, P.R., and Wittinghofer, A. 1990. The P-loop: A common motif in ATP- and GTP-binding proteins. Trends Biochem. Sci. 15: 430–434.

Sawaya, M.R., Prasad, R., Wilson, S.H., Kraut, J., and Pelletier, H. 1997. Crystal structures of human DNA polymerase β complexed with gapped and nicked DNA: Evidence for an induced fit mechanism. Biochemistry 36: 11205–11215.

Thoden, J.B., Kappock, T.J., Stubbe, J., and Holden, H.M. 1999. Three-dimensional structure of N5-carboxyaminoimidazole ribonucleotide synthetase: A member of the ATP grasp protein superfamily. Biochemistry 38: 15480–15492.

Thoden, J.B., Firestine, S., Nixon, A., Benkovic, S.J., and Holden, H.M. 2000. Molecular structure of Escherichia coli purt-encoded glycinamide ribonucleotide transformylase. Biochemistry 39: 8791–8802.

Todd, A.E., Orengo, C.A., and Thornton, J.M. 2001. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307: 1113–1143.

Voet, D., Voet, J.G., and Pratt, C.W. 1999. Fundamentals of biochemistry. John Wiley & Sons, New York.

Wallace, A.C., Laskowski, R.A., and Thornton, J.M. 1995. LIGPLOT: A program to generate schematic diagrams of protein–ligand interactions. Protein Eng. 8: 127–134.

Wallace, A.C., Laskowski, R.A., and Thornton, J.M. 1996. Derivation of three-dimensional coordinate templates for searching structural databases. Protein Sci. 5: 1001–1013.

Walsh, C. 1979. Enzymatic reaction mechanisms. W.H. Freeman and Co., San Francisco.

Wang, W., Kappock, T.J., Stubbe, J., and Ealick, S.E. 1998. X-ray crystal structure of glycinamide ribonucleotide synthetase from Escherichia coli. Biochemistry 37: 15647–15652.

