Protein Science
Protein Science (2002), 11:1101-1116.
Copyright © 2002 The Protein Society

Structural similarity to link sequence space: New potential superfamilies and implications for structural genomics

Patrick Aloy1, Baldomero Oliva2, Enrique Querol2, Francesc X. Aviles2 and Robert B. Russell1,3

1 EMBL, Meyerhofstrasse, 1. D-69117, Heidelberg, Germany
2 Institut de Biotecnologia i Biomedicina, Universitat Autonoma de Barcelona, Bellaterra 08193, Barcelona, Spain

(RECEIVED September 28, 2001; FINAL REVISION January 29, 2002; ACCEPTED January 29, 2002)

3 Reprint requests to: Robert B. Russell, EMBL, Biocomputing, Meyerhofstrasse, 1. D-69117, Heidelberg, Germany; e-mail: russell{at}embl-heidelberg.de; fax: 49-6221-387-517.

Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.3950102.


    Abstract
 
The current pace of structural biology now means that protein three-dimensional structure can be known before protein function, making methods for assigning homology via structure comparison of growing importance. Previous research has suggested that sequence similarity after structure-based alignment is one of the best discriminators of homology and often functional similarity. Here, we exploit this observation, together with a merger of protein structure and sequence databases, to predict distant homologous relationships. We use the Structural Classification of Proteins (SCOP) database to link sequence alignments from the SMART and Pfam databases. We thus provide new alignments that could not be constructed easily in the absence of known three-dimensional structures. We then extend the method of Murzin (1993b) to assign statistical significance to sequence identities found after structural alignment and thus suggest the best link between diverse sequence families. We find that several distantly related protein sequence families can be linked with confidence, showing the approach to be a means for inferring homologous relationships and thus possible functions when proteins are of known structure but of unknown function. The analysis also finds several new potential superfamilies, where inspection of the associated alignments and superimpositions reveals conservation of unusual structural features or co-location of conserved amino acids and bound substrates. We discuss implications for Structural Genomics initiatives and for improvements to sequence comparison methods.

Keywords: Protein structure; sequence; function; homology; structural genomics

Abbreviations: 3D, three dimensional • Ig, immunoglobulin • RMSD, root mean square deviation • PDB, Protein Data Bank • ATP, adenosine triphosphate • SCOP, structural classification of proteins • NCBI, National Center for Biotechnology Information • URL, universal resource locator


    Introduction
 
Structural genomics initiatives attempt to infer details of protein function by way of 3D structure determination (e.g., Eisenberg et al. 2000; Shapiro and Harris 2000), and efforts already have produced structures for proteins of unknown function (e.g., Yang et al. 1998; Zarembinski et al. 1998; Boggon et al. 1999; Christendat et al. 2000). Structure can provide insights into function in a number of different ways. For example, aspects of molecular function can be revealed if the act of solving the structure reveals details of bound non-protein atoms (e.g., Zarembinski et al. 1998). Alternatively, if a new protein structure adopts a previously observed fold, then it sometimes is possible to infer functional details by considering the function of other proteins adopting the same fold (e.g., Murzin et al. 1995; Artymiuk et al. 1997; Holm and Sander 1997). If fold similarities are ambiguous (e.g., the fold performs many functions) or if a protein adopts a new fold, it still is possible to infer function by comparison of key active site residues (e.g., Wallace et al. 1997; Russell 1998; Aloy et al. 2001) or by similarities between protein-binding sites or surfaces (e.g., Russell et al. 1998; Boggon et al. 1999).

For instances where a new protein structure adopts a previously observed fold, a major goal of structural genomics initiatives is to provide a structural link between sequence families that are not detectably similar when only sequences are compared (e.g., Brenner et al. 1998). A key issue is whether protein structure similarity per se can be used to infer that proteins are homologous and/or whether they are likely to show similarities in their molecular function. If proteins sharing similar 3D structures can be said, with confidence, to share a common ancestor, then a similarity in molecular function is more likely (e.g., Hegyi and Gerstein 1999).

For groups of proteins sharing a common fold in the absence of strong sequence similarity, previous work has focussed on the distinction between "remote homologs" and "analogs" (proteins sharing a similar fold in the absence of convincing evidence of a common ancestor). Studies were aided greatly by the creation of structural classification databases, which produced a reliable set of homologous proteins lacking significant sequence identity. Proteins now are generally classified as belonging to "superfamilies" if homology is supported by evidence other than structure similarity. Such evidence can include the conservation of key active site or structural residues, common functions, or unusual structural characteristics unlikely to have arisen by chance (e.g., left-handed βαβ motifs). Even in the absence of detectable sequence similarity, proteins are considered to share a common ancestor based on such evidence. Probably the best source of these data is the Structural Classification of Proteins (SCOP) database (Murzin et al. 1995). Here, proteins that adopt a similar fold with little or no sequence similarity are placed in the same superfamily if there is evidence that they share a common ancestor. SCOP has been used in numerous studies on the distinction between remote homology and analogy (e.g., Russell and Barton 1994; Russell et al. 1997; Matsuo and Bryant 1999). Although an automated means to predict superfamily relationships remains elusive, limited success in discerning homology has come from analysis of features such as sequence similarity calculated from structure-based alignments (Murzin 1993b; Russell et al. 1997; Ponting and Russell 2000), structurally conserved core residues (Matsuo and Bryant 1999), or combinations of multiple features (e.g., Holm 1998; Dietmann and Holm 2001).

In parallel with the above developments has been the emergence of databases that classify protein sequences. Genome projects have produced vast numbers of protein sequences, which are frequently grouped into aligned domain families. Accurate alignment of protein sequences permits the construction of sensitive profiles, or hidden Markov models (HMMs; e.g., Eddy 1998) that can be used to detect other remote members of the family. In addition, sequence comparisons enable long protein sequences to be divided into discrete functional or structural domains. Simple modular architecture research tool (SMART; Schulz et al. 2000), protein families (Pfam; Bateman et al. 2000), protein fingerprints (PRINTS; Attwood et al. 2000), clusters of orthologous groups (COGs; Tatusov et al. 2000) and BLOCKS (Henikoff et al. 2000) are examples of protein domain sequence alignment databases.

We describe here a method to detect superfamily relationships based on the statistical significance of an inferred structure-based sequence alignment. We first construct a database that merges sequence alignments from SMART and Pfam according to folds and superfamilies defined within the SCOP database. After finding overlapping sequences within the structure and sequence databases, we construct structural alignments and use these to merge different SMART/Pfam alignments. We then apply the statistical P-value described by Murzin (1993b) to assess the significance of sequence identities between pairs of sequences aligned by structure comparison. We discuss interesting new potential superfamilies in addition to implications for protein sequence comparison and structural genomics.


    Results
 
Linking SMART domains via structure
A total of 193 out of 419 SMART domains could be matched to one or more domains in SCOP, and a total of 20,447 sequences out of 30,050 in the database can be matched in whole or part to a domain of known three-dimensional structure. There were several partial overlaps between the databases implying differences in the way in which they define domains. This is perhaps not surprising as SCOP frequently divides protein structures into domains that could not readily be detected by sequence comparison, and which are only apparent upon analysis of the 3D structure. SMART also sometimes divides domains to account for domain insertions that can hinder sequence analysis (e.g., Ponting and Russell 1998).
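Partial overlaps of the kind described above can be quantified once both domain definitions are mapped onto the same sequence numbering. The following is a minimal sketch, not the procedure used in this work: the function names, the choice of the shorter range as denominator, and the residue boundaries are all assumptions made for illustration.

```python
def overlap_length(start_a, end_a, start_b, end_b):
    """Number of residues shared by two inclusive residue ranges (0 if disjoint)."""
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


def overlap_fraction(start_a, end_a, start_b, end_b):
    """Shared residues as a fraction of the shorter of the two ranges."""
    shorter = min(end_a - start_a + 1, end_b - start_b + 1)
    return overlap_length(start_a, end_a, start_b, end_b) / shorter


# Hypothetical domain boundaries on one sequence (inclusive residue numbers).
sequence_domain = (10, 120)    # boundaries from a sequence-based domain database
structure_domain = (60, 150)   # boundaries from a structure-based domain database

shared = overlap_length(*sequence_domain, *structure_domain)
fraction = overlap_fraction(*sequence_domain, *structure_domain)
print(f"{shared} shared residues, {fraction:.2f} of the shorter domain")
```

A fraction well below 1 flags a pair of definitions that only partly agree on where the domain lies.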

A total of 30 SCOP superfamilies could be matched to more than one SMART domain, producing a total of 183 potential pairings of SMART domains via structurally similar superfamilies. If one considers "fold" level similarities within SCOP (i.e., where proteins are in the same fold, but in different superfamilies; where evidence for an evolutionary relationship has not yet been found), an additional 120 pairings are added from a total of 11 folds. Details of all of the pairings are given in Table 1. A table showing the results for all the potential pairings, at both superfamily and fold level, can be found at http://www.embl-heidelberg.de/~aloy/struct_align.


Table 1. Potential pairings for SMART domains at fold and superfamily level
 
Superfamily links
For the superfamily pairs, 82 out of 183 had P3D ≤ 5 × 10⁻³ (see Materials and Methods), 31 had P3D ≤ 10⁻⁵, and 13 had P3D ≤ 10⁻¹⁰. These are instances where SMART domains are clearly sequence similar, but have been split to aid analysis of new protein sequences (C. Ponting, pers. comm.) or where similarities that are known are yet to be merged into a single SMART domain. Considering the P values corrected for multiple observations, 52 still showed MP3D ≤ 5 × 10⁻³. At least one significant link was found for 20 out of the 30 SCOP superfamilies assigned to more than one SMART domain. It is important to emphasize that many of the similarities are not easily found by sequence methods, with evolutionary relationships only inferred once 3D structures are available.
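The counts above are cumulative: a pair that passes the strictest cutoff also passes the looser ones. A minimal sketch of that tally follows. The three cutoffs are the ones quoted in the text; the function name and the example P-values are invented for illustration and are not results from this study.

```python
# Cutoffs quoted in the text, from loosest to strictest.
THRESHOLDS = (5e-3, 1e-5, 1e-10)


def count_at_or_below(p_values, thresholds=THRESHOLDS):
    """For each cutoff, count the P-values that are less than or equal to it."""
    return {t: sum(1 for p in p_values if p <= t) for t in thresholds}


# Hypothetical P-values for six domain pairs (not data from the paper).
example_p_values = [2e-2, 4e-3, 3e-4, 8e-6, 1e-7, 5e-12]

counts = count_at_or_below(example_p_values)
for cutoff in THRESHOLDS:
    print(f"P <= {cutoff:g}: {counts[cutoff]} pairs")
```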

The SCOP P-loop containing nucleotide triphosphate hydrolases (e.g., Gay and Walker 1983; Saraste et al. 1990) form a large superfamily containing 11 SMART domains. Similarities between ATP/GTP-binding and other motifs provided significant links for 38 pairs out of 55. This is despite the fact that proteins within this superfamily frequently show topological differences, such as variations in β-strand order and direction (Murzin et al. 1995).

We also linked several ancient DNA-binding protein families. For example, the nucleosome core histone superfamily in SCOP contains at least five different SMART domains, and six of the 10 possible pairings were found to be significant. Similarly, the winged helix DNA-binding proteins, characterized by a 3-helix DNA-binding bundle and a small β-sheet (wing), contain six SMART domains and 15 potential pairs, five of which could be linked with confidence.
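The numbers of potential pairings quoted for each superfamily or fold follow from counting unordered pairs, n(n-1)/2 for n families. The short check below reproduces the counts given in the Results; the function name is ours.

```python
from math import comb


def potential_pairings(n_families):
    """Number of unordered pairs among n families: n * (n - 1) / 2."""
    return comb(n_families, 2)


# Family counts quoted in the Results, with the pairings they imply.
for n in (5, 6, 11, 17, 38):
    print(f"{n} families -> {potential_pairings(n)} potential pairings")
```

The same arithmetic gives the 136 pairs quoted for the 17 Pfam families with Rossmann-fold domains and the 703 quoted for the 38 Pfam families in the TIM-barrel fold.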

The method also links the tumor necrosis factor (TNF) and collagen-1q (C1Q) domains from SMART. This similarity was first reported after the structure of ACRP30, a homolog of C1Q domains, was determined by Shapiro and Scherer (1998), who inferred an evolutionary relationship based on similarities in key motifs and in the trimeric structures. Here, the low P3D value links human CD40 ligand (PDB code 1aly; a distant TNF homolog) and human C1q B (C1QB_HUMAN). The alignment of TNFs with C1qs shows that there are numerous conserved hydrophobic positions as well as conservation of key residues around the trimer interface, as has been discussed previously (Shapiro and Scherer 1998). In the absence of functional knowledge for one of these families, a functional similarity might be inferred by this P3D calculation.

Fold links
The situation is different for fold-level similarities, where the SCOP database does not consider the pairs necessarily to be homologous. Here, 35 of a possible 120 pairings had P3D ≤ 5 × 10⁻³, five had P3D ≤ 10⁻⁵, and none had P3D ≤ 10⁻¹⁰. However, six out of the 11 different SCOP folds assigned to more than one SMART domain contain at least one significant pair, but only 10 links remain significant when the MP3D is calculated. Figure 1 shows the SMART domains linked at fold level.



Fig. 1. Significant links (P3D ≤ 5 × 10⁻³) at fold level between SMART domains identified by the method discussed in the text. Thick continuous lines indicate MP3D ≤ 5 × 10⁻³ and dashed lines MP3D > 5 × 10⁻³.

 
We found significant links, at either superfamily or fold level, for all the possible pairs within the DNA/RNA-binding 3-helical bundle fold, thus all nine SMART domains adopting this fold could be linked together. The most significant links (Fig. 1) join the Paired Box (PAX), Arsenical Resistance Operon Repressor, helix-turn-helix (HTH_ARSR), SWI3 ADA2 N-CoR TFIIIB (SANT), homeodomain (HOX), and helix-turn-helix and cAMP regulatory protein (HTH_CRP) domains, all with P3D ≤ 10⁻⁴ and MP3D ≤ 5 × 10⁻³. This suggests that they may share a common origin, like other proteins adopting this fold.

The method found several links between proteins adopting an Ig-like fold. Common ancestors have been proposed for many of these, particularly between the Ig and fibronectin domains (e.g., Bazan 1990). The most significant link is that between the Cadherin (CA) and polycystic kidney disease (PKD) domains (P3D-value = 3.9 × 10⁻⁵, MP3D = 1.1 × 10⁻³). Both of these domains are extracellular and both are thought to be involved in cell-cell contacts (Hughes et al. 1995; Shapiro et al. 1995), thus the P3D-value could be indicating a common ancestor that also is associated with a common function.

Within the OB fold (Murzin 1993c), the method found links between two SMART domains in the superfamily containing ribosomal protein S1-like RNA-binding domain (S1) and cold-shock proteins (CSP) and the superfamily containing staphylococcal nuclease (SNc). The MP3D values between these families were also low (≤ 5.7 × 10⁻³). Figure 2 shows a superimposition and structure-based sequence alignment of the known structures and the sequences providing the best link between these two families. For SNc, the location of the nucleotide-binding site is known (the location of a bound nucleotide analog inhibitor thymidine 3′,5′-bisphosphate is shown in Fig. 2). Several of the residues within the nucleases that are in contact with the nucleotide also are found in S1 and comprise a D[RK]xxGR motif that shows good (but not perfect) conservation in both families. These residues occur in an unusual loop within the most C-terminal β-hairpin found in the OB fold. In addition, an aspartate from a different region in both structures that is in contact with a bound calcium atom within SNc also shows good conservation within each of these diverse families.
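The D[RK]xxGR motif can be written as a regular expression, with x standing for any residue. The sketch below scans a sequence for exact matches and reports 1-based positions. The toy sequence is invented and is not taken from either family. Because the motif is described as well but not perfectly conserved, an exact pattern search of this kind would miss family members that carry substitutions.

```python
import re

# D, then R or K, then any two residues, then G and R.
MOTIF = re.compile(r"D[RK]..GR")


def find_motif(sequence):
    """Return (1-based start position, matched segment) for each exact motif match."""
    return [(m.start() + 1, m.group()) for m in MOTIF.finditer(sequence)]


# Invented toy sequence containing two motif instances.
toy_sequence = "MKTADKLAGRVVSDRTTGRLE"
hits = find_motif(toy_sequence)
for position, segment in hits:
    print(position, segment)
```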



Fig. 2. (a) Molscript (Kraulis 1991) figures showing Staphylococcus aureus nuclease (staphylococcal nuclease; left; PDB code 1kdc) and Escherichia coli S1 RNA-binding protein (right; 1sro) in a similar orientation. Structurally equivalent regions (identified by the method of Russell and Barton 1992) are labeled with arrows (β-strands) or ribbons (α-helices) or coils, with nonequivalent regions shown as Cα trace. Residues common to both structures are shown in ball-and-stick format. The coloring scheme moves through the spectrum from blue to red from N- to C-terminus. Linkage details: RMSD = 2.0 Å in 38 Cα atoms; 13 identities in 32 equivalent residues; P3D-value = 4.7 × 10⁻⁶; MP3D = 8.2 × 10⁻⁵. (b) Alscript (Barton 1993) figure showing the structural alignment of the two structures in a with secondary structures (below). The best linking sequence is also shown for the nucleases (NUC_SFLX; S. flexneri nuclease); the best link was between this sequence and the E. coli S1 RNA-binding domain. Positions within the alignment showing conservation of residue character are colored as follows: yellow background, conserved hydrophobic; blue background, conserved small; red text, conserved polar. Identical residues are boxed. Secondary structures are shown as arrows (β-strands) and cylinders (α-helices) below the alignment and colored as for a.

 
Links were found within the β-grasp fold between ubiquitin (UBQ) and two members of the Ras-binding superfamily: the Ras association domain (RA) and the Raf-like Ras-binding domain (RBD). Ubiquitin proteolysis plays a crucial role in protein degradation that controls the timed destruction of cellular regulatory proteins, including cyclins, tumor suppressor p53, or transcription factors. RA and RBD families bind Ras-like proteins and are involved in signaling processes. The related functions (signaling and cell-cycle control) of these superfamilies together with the low P3D-values may indicate that they share a common ancestor.

The procedure also found low P3D-values linking several members of the cysteine-knot fold, including the insulin growth-factor–binding protein (IB; e.g., Clemmons 1993), chitin-binding domains (ChtBD), and the epidermal growth factor. Although functional similarities have been reported for the EGF family (Mas et al. 1998; Blanco-Aparicio et al. 1998), there is no obvious similarity in the functions of these domains, and the low P3D-values may be an artefact of being small cysteine-rich domains. The similarities involve only a few structurally equivalent residues (<10) and the identities consist mostly of the cysteines forming the disulfides that define the fold.

The method also found linkages within the β-trefoil fold. Previously, analysis of 3D structures and sequences showed that the interleukin-1 (IL1), fibroblast growth factor (FGF), Ricin-type (RICIN), and the soybean trypsin inhibitor (Kunitz) families share a degree of sequential and functional similarities (Ponting and Russell 2000).

Linking Pfam families via structure
A total of 711 out of 2008 Pfam families could be assigned to one or more SCOP domains (117,017 out of 178,110 sequences). As for SMART, there were some disagreements between domain definitions. A total of 61 Pfam domains could be aligned to multiple SCOP domains in nonoverlapping regions. Table 2 shows the 113 SCOP superfamilies and 26 SCOP folds that could be matched to more than one Pfam family. Associated with this are 954 potential pairings within superfamilies and 1406 potential pairings within folds. A table showing the results for all the significant pairings at superfamily and fold level can be found at http://www.embl-heidelberg.de/~aloy/struct_align.


Table 2. Potential pairings for Pfam families at fold and superfamily level
 
Superfamily links
We found P3D ≤ 5 × 10⁻³ for 231 out of the 954 potential pairs within superfamilies. Eighty-five of the pairings had P3D ≤ 10⁻⁵ and four had P3D ≤ 10⁻¹⁰. The last were clear cases of sequence-detectable homologs residing in separate domains within Pfam. In addition, 157 of the significant links showed corrected MP3D ≤ 5 × 10⁻³. Structural alignment was able to identify at least one significant pair for 81 out of the 113 SCOP superfamilies assigned to more than one Pfam family.

The results are fully consistent with those obtained with SMART (above). For example, the P-loop nucleotide triphosphate hydrolases superfamily also contains the most different Pfam families. However, the wider coverage of Pfam means that many more relationships were found, several of which are discussed below.

For example, within the NAD(P)-binding Rossmann-fold domains we found significant links for 24 of the 136 potential pairs and a total of 12 out of the 17 Pfam families belonging to this superfamily. This NAD/NADP-binding domain is present in a large number of families, including dehydrogenases, synthetases, reductases, and methylases, and the difficulty of detecting similarities using only sequence analysis has been discussed recently (Kunin et al. 2001).

We also found several links within the four helical cytokine superfamily. At least one pair with a significant P3D-value was found for eight out of the 10 Pfam families in this superfamily. The relationship between these proteins is well known, though the similarities frequently are not detectable by sequence comparison alone.

Fold links
The situation is again different for families residing in the same fold but in different superfamilies. With the same cutoff (P3D ≤ 5 × 10⁻³) we found 152 significant pairs out of the potential 1406. Eleven pairs had P3D ≤ 10⁻⁵ and only one had P3D ≤ 10⁻¹⁰. However, we found at least one significant link for 16 out of the 26 different SCOP folds containing two or more Pfam families. When the family diversity score is considered, 51 out of the 152 significant links still show MP3D ≤ 5 × 10⁻³. Figure 3 shows the Pfam families linked at fold level. As for the superfamily links, all overlapping links found in SMART domains also were present involving the Pfam equivalents, and again, because of the larger coverage of sequence space by Pfam, additional links were found between families not present in SMART, many of which are discussed below.



Fig. 3. Significant links at fold level between Pfam domains identified by the method discussed in the text. Lines are drawn as for Figure 1.

 
The (βα)8 (TIM) barrel fold is the largest, comprising a total of 38 different Pfam families. Of the 703 potential pairings, 84 were significant. These pairings effectively mean that 33 out of the 38 families assigned to this fold can be linked. Moreover, the five families that were not linked to the others (Glyco_hydro_1, Glyco_hydro_14, PI-PLC-X, ALAD, and AP_endonucleas2) belong to the same SCOP superfamily as other linked families, meaning that there are homologous links within SCOP between these and the others. The results are consistent with the work of Copley and Bork (2000), who found that all but one enzyme family adopting a TIM-barrel fold can be linked to the others using sequence search methods like PSI-BLAST (Altschul et al. 1990), suggesting a common ancestor for this fold that performs many different biochemical functions. They were unable to link the dihydroorotase family (proposed to adopt a TIM barrel by Holm and Sander 1997). No structures of the dihydroorotase family are known, meaning that our study could not suggest an evolutionary link between this family and any of the others in this fold.
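Statements such as "33 out of the 38 families can be linked" treat significant pairs as edges in a graph and ask which families end up in the same connected group, including families joined only through intermediates. The sketch below illustrates that idea with a plain depth-first search. It is our illustration rather than the procedure used in this work, and the family names and links are placeholders.

```python
from collections import defaultdict


def linked_groups(families, significant_pairs):
    """Group families into connected components defined by significant links."""
    neighbours = defaultdict(set)
    for a, b in significant_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    groups = []
    for family in families:
        if family in seen:
            continue
        # Depth-first search outward from this family.
        stack, group = [family], set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(group)
    return groups


# Placeholder families and links (not data from the paper).
families = ["fam_A", "fam_B", "fam_C", "fam_D", "fam_E"]
significant_pairs = [("fam_A", "fam_B"), ("fam_B", "fam_C")]

groups = linked_groups(families, significant_pairs)
print([sorted(g) for g in groups])
```

Here fam_A and fam_C fall in the same group although no direct link joins them, which is how a modest number of significant pairs can connect most families in a fold.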

Six out of the eight Pfam families within the flavodoxin-like fold could be linked into a single group, with the most significant link being that between Response regulator receiver domain (response_reg) and flavodoxin families (P3D-value = 3.85 × 10⁻⁵; MP3D = 1.1 × 10⁻⁵). The Response regulator family includes CheY (Volz 1993), the receiver domain among two-component signal transduction systems. In response to a specific stimulus, these proteins are phosphorylated, leading to a conformational change that is detected by an effector domain. The flavodoxins (Vervoort et al. 1994) are small proteins that bind FMN and serve as redox centers and electron transport proteins. Two of the other three linked families (DHquinase_II and Ligase-CoA) are ATP/GTP-binding proteins. Our results indicate that the function of the five families may have diverged from a common nucleotide-binding site. Though the CheY family does not bind nucleotides, structural and functional analogies between the CheY-like receiver domains and small GTP-binding proteins have been noted and a common ancestry has been proposed (Artymiuk et al. 1990; Lukat et al. 1991).

We found several links between superfamilies within the SCOP ferredoxin-like fold, some of which appeared to be associated with functional similarities (see below). This fold comprises a repeat of a split αβα motif that forms an anti-parallel β-sheet flanked on one side by two α-helices. It is one of the most populated in SCOP, with 31 different superfamilies, and also performs a large number of different functions. In total, 12 out of the 17 Pfam families assigned to this fold were linked with significant P3D-values.

One of the more intriguing similarities within the ferredoxin-like fold is that between the RNA recognition motif (rrm) and heavy-metal-associated (HMA) domain, with a P3D-value = 4.7 × 10⁻⁵ and MP3D = 9.0 × 10⁻⁴. The RNA recognition motifs are found in a variety of proteins, including proteins implicated in regulation of alternative splicing, and function to bind single-stranded RNA. The heavy-metal-associated proteins are implicated in the regulation of cytoplasmic metal concentration, and although they are known to be ATP dependent, the location of the ATP-binding site has not yet been determined. Figure 4 shows two representative structures from these superfamilies and the associated alignment of the best link. The conserved residues are found in the nucleotide-binding site of the RNA recognition protein, which suggests that it also may correspond to the ATP-binding site for the HMA proteins. Interestingly, the location of this binding site corresponds with the ferredoxin-like fold "supersite" proposed by Russell et al. (1998). This fold shows a tendency to bind a diversity of different ligands at a common location, which is on the side of the β-sheet not flanked by α-helices. It is worth noting that both rrm and HMA domains occur as tandem repeats, and an oligomeric structure has been proposed for the rrm domains to aid the binding of single-stranded RNA.



Fig. 4. (a) Molscript (Kraulis 1991) figures showing the Poly(A)-binding protein (left; 1cvj, chain F) and the Copper transporter ATPase (right; 1aw0) in a similar orientation. Details for the figures are as for Figure 2. Linkage details: RMSD = 1.8 Å in 39 Cα atoms; seven identities in 12 equivalent residues; P3D-value = 4.7 × 10⁻⁵; MP3D = 9.1 × 10⁻⁴. (b) Alscript (Barton 1993) figure showing the structural alignment of the two structures in a with secondary structures (below). The best linking sequences are also shown (COPA_ENTHR and RO33_NICSY). Conserved positions and secondary structures are shown as described in Figure 2.

 
We detected links between different superfamilies within the barrel-sandwich hybrid fold, the best being that between the glycine cleavage H protein (GCV_H) and phosphoenolpyruvate-dependent sugar phosphotransferase system, EIIA 1 (PTS_EIIA_1). The core of this fold comprises seven or eight strands in two β-sheets, and structural representatives from these two families can be superimposed such that seven β-strands are equivalent (35 Cα atoms superimpose with RMS = 2.0 Å and percent sequence identity of 4%). However, the initial structural alignment identified a shorter region of high structure and local sequence similarity (Fig. 5), which identifies a repeat found at the N-terminus of transit peptide H protein of the Gly cleavage system (20–65 of PDB 1hpca) and toward the C-terminus of glucose permease domain IIA (75–120 of 1gpr). Investigating the similarity further reveals that these families are better superimposed if one considers a circular permutation (45 Cαs; RMS = 1.3; %I = 33%), as is shown in Figure 5. This similarity includes two unusual loop structures containing several identities that are absent from the unpermuted alignment. This fold is composed of repeats of three β-strands, of which the Gly cleavage system protein contains two and the permease contains four (bottom of Fig. 5). The permuted structure may have been the result of a partial duplication of a longer protein, as has been described previously (Ponting and Russell 1998). The repeat structure within the Gly cleavage system family has been discussed recently (Anantharaman et al. 2001).
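A circular permutation means that the same structural elements occur in both proteins in the same cyclic order but starting at a different point in the chain. The toy sketch below tests this on ordered lists of element labels. It is only an illustration of the concept: the permutation described above was identified by structural superimposition, and the strand labels here are placeholders.

```python
def permutation_shift(order_a, order_b):
    """Return the rotation of order_a that reproduces order_b, or None if there is none."""
    a, b = list(order_a), list(order_b)
    if len(a) != len(b):
        return None
    for shift in range(max(len(a), 1)):
        # Move the first `shift` elements of a to the end and compare.
        if a[shift:] + a[:shift] == b:
            return shift
    return None


# Placeholder strand labels: b starts halfway through a's order.
strands_a = ["s1", "s2", "s3", "s4", "s5", "s6"]
strands_b = ["s4", "s5", "s6", "s1", "s2", "s3"]

shift = permutation_shift(strands_a, strands_b)
print("rotation needed:", shift)
```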



Fig. 5. (a) Molscript (Kraulis 1991) figures showing the N-terminus of transit peptide H protein of the Gly cleavage system (left; 1hpc, chain a) and the C-terminus of glucose permease domain IIA (right; 1gpr) in a similar orientation. Details for the figures are as for Figure 2. Linkage details: RMSD = 1.4 Å in 45 Cα atoms; 15 identities in 45 equivalent residues; P3D-value = 1.3 × 10⁻⁴; MP3D = 2.5 × 10⁻³. (b) Alscript (Barton 1993) figure showing the structural alignment of the two structures in a with secondary structures (below). The best linking sequences are also shown (O59049 and PTBA_ERWCH). Conserved positions and secondary structures are shown as described in Figure 2. The numbers within the alignment denote the start and end of the aligned segments (note, in particular, that 1hpc is permuted relative to 1gpr). (c) Figure showing the similarity in a topology diagram. β-strands are denoted as triangles; α-helices as circles, colored in an analogous fashion to a.
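The linkage details in the figure legends report identities over structurally equivalent residues. A minimal sketch of how such numbers can be derived from a gapped pairwise alignment is given below. The function names and the toy alignment are ours, and treating every column without a gap as "equivalent" is a simplifying assumption, not the equivalence criterion used in this work.

```python
def identity_stats(aligned_a, aligned_b, gap="-"):
    """Return (identities, equivalent positions) for two equal-length aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identities = equivalent = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # columns with a gap are not counted as equivalent
        equivalent += 1
        if x == y:
            identities += 1
    return identities, equivalent


def percent_identity(identities, equivalent):
    """Identities as a percentage of the equivalent positions."""
    return 100.0 * identities / equivalent


# Invented toy alignment (not from the paper).
toy_stats = identity_stats("DK-LAGRV", "DRTL-GRI")
print(toy_stats, f"{percent_identity(*toy_stats):.1f}%")

# Counts quoted in the legends: Fig. 5 (15 in 45) and Fig. 2 (13 in 32).
print(f"{percent_identity(15, 45):.1f}%", f"{percent_identity(13, 32):.1f}%")
```

The 15 identities in 45 equivalent residues quoted for Figure 5 correspond to the 33% identity given in the text for the permuted alignment.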

 
Hypothetical protein ybl036c: Output from a Structural Genomics initiative
A result from a Structural Genomics initiative provides an interesting test for the method. The three-dimensional structure of a member of the UPF0001 Pfam family (hypothetical protein ybl036c; PDB codes 1ct5, 1b54) was solved recently as part of the BNL Human Proteome Project (http://proteome.bnl.gov), and was found to adopt a (βα)8–TIM-barrel structure. The 3D structure also showed the protein to bind pyridoxal-phosphate (PLP).

The structure shows a striking similarity to members of the alanine racemase family (Ala_racemase). This similarity is detectable by sequence comparison alone (e.g., PSI-BLAST with an E-value of 7 × 10⁻³¹) and also shows a very significant pairwise P3D (1 × 10⁻⁴⁴), meaning that they are placed within the same SCOP superfamily. Ignoring this obvious similarity, the method also finds significant links to five other families adopting the (βα)8–(TIM)-barrel fold: aldo/keto reductases (aldo_ket_red), indole-3-glycerol phosphate synthase (IGPS), dihydrodipicolinate synthetase (DHDPS), dihydropteroate synthase (DHPS), and fructose-bisphosphate aldolase class II (F_bp_aldolase). As discussed above, this fold performs many different biochemical functions, making attempts to assign function from structure difficult. Moreover, there also is growing evidence that many of the (βα)8-barrels have evolved from a common ancestor (see above; Copley and Bork 2000).

Inspection of the alignments and superimpositions with the Pfam families having the most significant P3D-values suggests that the best functional match is probably not the most significant link, the aldo/keto reductases (aldo_ket_red; P3D-value = 4.84 × 10⁻⁴), but the second most significant, the IGPS Pfam family (P3D-value = 1.94 × 10⁻³). The structurally equivalent regions are longer when comparing UPF0001 and IGPS, and they overlap many of the residues involved in function in IGPS and PRAI (Wilmanns et al. 1992), some of which are fully conserved in the merged alignments (K55 and G236). IGPS, together with the PRAI and Trp_synthase families, belongs to the Ribulose phosphate-binding barrel superfamily. Most of the enzymes in this superfamily are involved in amino-acid synthesis pathways, and some also bind PLP. Moreover, three of the other Pfam families that could be linked to UPF0001 are involved in tryptophan or lysine synthesis. Together, these results suggest that the 26 members of the UPF0001 Pfam family could play a role in amino-acid synthesis.

Implications for structural genomics
The results above show that the method is often able to identify superfamily relationships, and possible functional similarities, between domains that have been linked by structural similarity. The accuracy of identifying such relationships is important for Structural Genomics projects that target domains of unknown structure from databases such as SMART or Pfam. Using P3D ≤ 5 × 10⁻³ to predict superfamily relationships gives a sensitivity (the percentage of correct relationships identified) of 45% for SMART domains and 24% for those from Pfam. For the more stringent MP3D ≤ 5 × 10⁻³ the values are 27% and 16%, respectively. We suspect that the lower sensitivity for the Pfam domains is because the alignments are less diverse than their SMART counterparts, meaning that one is less likely to find a significant link when considering all sequences.

Quoting an associated specificity (the percentage of incorrect relationships identified as such) is problematic, as there is no definitive set of "false positives." Ideally false positives would consist of proteins with different folds, but for these it would not be possible to build meaningful structure-based alignments. An alternative is to use pairs of proteins in the same fold that are definitely not in the same superfamily (i.e., fold level links). Though many folds in SCOP contain multiple superfamilies, new evidence often emerges (e.g., the examples above) that permits them to be merged. Thus calculating specificity in this way will give an underestimate of the correct value. With this caveat in mind, the specificities for P3D ≤ 5 × 10⁻³ are 80% for SMART domains and 89% for Pfam, and those for MP3D ≤ 5 × 10⁻³ are 95% and 96%, respectively.
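The bookkeeping behind these two figures can be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes specificity means the percentage of fold-level (non-superfamily) links that fail the threshold, and the function names and the toy input are hypothetical.

```python
from typing import Iterable, Tuple

def count_outcomes(links: Iterable[Tuple[float, bool]], threshold: float = 5e-3):
    """Tally predictions for (P-value, same_superfamily) pairs.

    A link is predicted to be a superfamily relationship when its P-value
    (P3D or MP3D) is at or below the threshold.
    """
    tp = fn = tn = fp = 0
    for p_value, same_superfamily in links:
        predicted = p_value <= threshold
        if same_superfamily:
            if predicted:
                tp += 1
            else:
                fn += 1
        else:
            if predicted:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp

def sensitivity(tp: int, fn: int) -> float:
    """Percentage of genuine superfamily links that pass the threshold."""
    return 100.0 * tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Percentage of fold-level links that fail the threshold (assumed definition)."""
    return 100.0 * tn / (tn + fp)

# Toy input only; these are not values from the study.
toy_links = [(1e-4, True), (2e-2, True), (1e-3, False), (0.5, False), (0.3, False)]
tp, fn, tn, fp = count_outcomes(toy_links)
print(sensitivity(tp, fn), specificity(tn, fp))
```

Because some fold-level links may later prove to be genuine superfamily relationships, a few of the "false positives" counted this way are probably correct, which is why the resulting specificity is an underestimate.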

These values of sensitivity and specificity would be applicable for situations where a new structure has been determined for a domain from SMART or Pfam and adopts a fold known previously. We anticipate that our method will often be able to identify many superfamily relationships and thus place a new structure into the correct evolutionary and often functional context.


    Discussion
 
We have demonstrated how a merger of protein structure and sequence databases can suggest likely evolutionary links between protein domain families. We have also identified potentially new homologous relationships that may be associated with similarities in molecular function.

This study has gone some way toward quantifying how Structural Genomics projects will gradually link sequence families together, ultimately providing 3D structure and additional functional information for all protein families with at least one member amenable to structure determination by X-ray crystallography or nuclear magnetic resonance. A protocol like that described here will permit evolutionary and functional similarities to be uncovered automatically as the number of known structures and sequences continues to increase.

It is intriguing that current sensitive sequence searching methods apparently fail to detect some similarities that are quite clearly associated with a degree of sequence conservation (e.g., TNF/C1Q). It may be that sequence profiles are too specific to one family to detect more distantly related sequences, even when key sequence motifs are conserved. Another possible explanation comes from inspection of aligned segment lengths. We found that aligned segments (i.e., those not containing a gap in any sequence) are typically shorter within the SCOP-linked alignments than in alignments derived only by a comparison of sequence (results not shown). This may mean that the model for aligned segments and gaps currently used by sequence comparison methods is too strict to permit alignments such as those obtained by structure comparison.

The P-value first described by Murzin (1993b) attempts to assess the likelihood that a pair of proteins, aligned based on their three-dimensional structures, will have a certain degree of sequence similarity. The prior probability of amino acid identity is based on the abundance of the amino acids, and accommodates the assumption that certain features of common protein structures, such as burial in the hydrophobic core or surface exposure, will increase the chances of amino-acid identities. What it does not take into account is the possibility that certain folds will have strict requirements for particular amino acids at certain positions, which could well be the result of convergent evolution to a stable fold. It has been argued that this is the case for the β-trefoils (including the FGFs and IL-1s; Murzin 1993a; Ponting and Russell 2000), and could well be the case for other folds. It is impossible to discern such occurrences at present, so this possibility should be kept in mind when considering the links proposed.

Structural Genomics initiatives provide structures for proteins that often are of unknown function. In the absence of further experiments, the ability to place a new structure in the correct evolutionary context is currently the best method for predicting details regarding molecular function. Methods such as that described here and elsewhere (Copley and Bork 2000; Todd et al. 2001; Landgraf et al. 2001; Aloy et al. 2001) will thus be of growing importance in this new age of 3D structural annotation.


    Materials and methods
 
Data
We obtained aligned sequence data from the SMART (version 4.1, http://smart.embl-heidelberg.de/) and Pfam (release 5.0, http://www.sanger.ac.uk/Software/Pfam/) world wide web (WWW) pages. We extracted SCOP classifications from the WWW page (release 1.50, http://scop.mrc-lmb.cam.ac.uk/scop/), and converted them into protein sequence data via the STAMP package (Russell and Barton 1992; http://barton.ebi.ac.uk/manuals/stamp.html).

Merging SMART/Pfam alignments with SCOP sequences
SMART and Pfam contain cross-references to appropriate PDB identifiers. However, the PDB identifier alone is not sufficient to identify a domain in SCOP, since such identifiers can contain both multiple polypeptide chains and multiple domains. Accordingly, we constructed a hidden Markov model for each SMART and Pfam alignment using the HMMBUILD program from the HMMer package (S.R. Eddy, unpubl.; http://hmmer.wustl.edu), and used this to search the SCOP database (HMMSEARCH) to provide links between the two databases. SCOP sequences were considered to reside in the SMART or Pfam domain if they had HMMSEARCH E-values ≤ 10⁻³ and if the alignment covered 60% of either the SCOP sequence or the SMART/Pfam domain. Once identified, we aligned the sequences for SCOP entries with those from SMART/Pfam using HMMALIGN. This resulted in a set of alignments containing the original sequences from SMART/Pfam in addition to those from SCOP.
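The acceptance rule for a SCOP sequence can be written as a small predicate. The sketch below is illustrative only: it assumes that "covered 60%" means at least 60% of the length in residues, and the function and parameter names are hypothetical rather than part of the HMMer package.

```python
def resides_in_domain(e_value: float,
                      aligned_length: int,
                      scop_length: int,
                      domain_length: int,
                      max_e_value: float = 1e-3,
                      min_coverage: float = 0.6) -> bool:
    """Decide whether a SCOP sequence is assigned to a SMART/Pfam domain.

    e_value        -- E-value reported for the hit against the family model
    aligned_length -- number of residues covered by the alignment
    scop_length    -- length of the SCOP domain sequence
    domain_length  -- length of the SMART/Pfam domain model
    """
    if e_value > max_e_value:
        return False
    # Coverage of either partner is sufficient (assumed "at least 60%").
    covers_scop = aligned_length >= min_coverage * scop_length
    covers_domain = aligned_length >= min_coverage * domain_length
    return covers_scop or covers_domain

# A 70-residue alignment covering most of a 100-residue SCOP domain is accepted
# even though it covers only part of a 200-residue family model.
print(resides_in_domain(1e-5, 70, 100, 200))
```

Accepting coverage of either partner allows a short SCOP domain to match part of a longer family model, and vice versa.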

Combining SMART/Pfam alignments from the same SCOP classifications
When we found that different SMART/Pfam entries contained domains from the same SCOP fold or superfamily, we merged the alignments via an alignment of structures. We aligned structures using the STAMP package for protein structure alignment and superimposition. All alignments were checked and, if necessary, edited manually to avoid situations where structural alignment was ambiguous, or led to erroneous results owing to distortions of the structure as a result of bound substrates or poorly determined/missing residues. In two cases, STAMP found several alignments with good scores. These comparisons were manually edited and the best alignment was selected. The structural alignments were then used to merge all associated sequence data. A summary of these linkages can be found in Table 1. For folds/superfamilies containing more than two SMART/Pfam families, we constructed all pairwise alignments. This was done to ensure maximum alignment quality.

It is difficult to compare alignments derived from protein sequence alone directly with those derived from three-dimensional structure. The main difficulty is that structural alignment methods either do not necessarily give a meaningful alignment of those regions that differ between protein structures, or do not attempt to align them at all. Accordingly, we processed the structural alignments prior to merging them with sequence alignments. Sequences outside of the structurally conserved regions were shortened to the minimum possible length. This mimics what would likely happen during a sequence alignment of the same proteins, assuming that, in the best possible sequence alignment for the proteins, only the structurally equivalent regions would be accurately aligned.

All alignments are available via the WWW (http://www.embl-heidelberg.de/~aloy/struct_align).

Statistical significance of SMART/Pfam families linked by SCOP structures
Murzin (1993b) proposed a P-value to suggest the likelihood that a "sequence" identity found after "structure"-based alignment could occur by chance (hereafter called P3D). Given n structurally conserved sites (i.e., Cα positions) between two similar protein 3D structures, he suggested that the probability that m of these sites would contain identical amino acids would be:

$$P(m) = \binom{n}{m}\, p^{m} (1-p)^{n-m} \approx \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(m-m_0)^{2}}{2\sigma^{2}}\right)$$

where p is the mean probability of finding identical residues at structurally equivalent sites, m0 = np (where the binomial has its maximum), and σ = √(np(1 − p)) (the half-width of the approximating distribution). All that is required is to approximate the probability that structurally equivalent sites in two similar protein structures will have identical residues by chance. Murzin suggested that the value of p would be larger than 1/20, probably about 1/15 but certainly smaller than 1/10. These values attempt to account for the tendency for buried residues to be hydrophobic and exposed residues to be hydrophilic. In other words, the probability would be >1/20 as buried or exposed sites are more likely to contain a smaller subset of the 20 amino acids (i.e., hydrophobic and polar residues, respectively). This calculation was originally applied to the cystatin-monellin similarity, where an evolutionary relationship was inferred based on a P3D of ~10⁻³. For more details, we refer the reader to Murzin (1993b).
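As a rough numerical illustration, the sketch below evaluates a binomial probability of m identities in n sites together with the Gaussian form implied by the binomial maximum and half-width mentioned above. It assumes P3D is the point probability of exactly m identities; the function names are hypothetical, and Murzin (1993b) should be consulted for the original formulation.

```python
import math

def p3d_binomial(n: int, m: int, p: float = 1 / 10) -> float:
    """Binomial probability of exactly m identities in n equivalent sites."""
    return math.comb(n, m) * p ** m * (1 - p) ** (n - m)

def p3d_gaussian(n: int, m: int, p: float = 1 / 10) -> float:
    """Gaussian approximation centred on m0 = n*p with sigma = sqrt(n*p*(1-p))."""
    m0 = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return math.exp(-((m - m0) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Illustrative call: 15 identities in 45 sites under two candidate priors.
for prior in (1 / 10, 1 / 15):
    print(prior, p3d_binomial(45, 15, prior), p3d_gaussian(45, 15, prior))
```

For more identities than expected by chance, a larger prior p makes the observation less surprising and so gives a larger (more conservative) probability. The Gaussian form agrees closely with the binomial near m0 but can differ substantially far from it, so the exact binomial, which is cheap to evaluate, may be preferable for the small probabilities of interest.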

Here, we calculated P3D for all pairs of sequences coming from different SMART/Pfam alignments as aligned according to the structures of one or more members of the alignment. We defined structurally equivalent regions by the method of Russell and Barton (1992), and extrapolated these positions to all sequences in the aligned SMART/Pfam families. This calculation assumes that the alignment of sequences is correct and that the known structures for SMART/Pfam families are good models for the remaining sequences. Owing to the high quality of both the sequence alignments within individual SMART/Pfam domains and the structure-based alignments, we do not suspect that the calculation would differ greatly if all proteins were of known structure.

Here, we used the most stringent value of 1/10, meaning that P3D values are higher than if we had used 1/15. We considered links with P3D ≤ 5 × 10⁻³ to be significant. This value is an order of magnitude lower than those calculated, and argued to be biologically significant, with a more lenient P3D calculation (1/15) for β-trefoil proteins (Ponting and Russell 2000). Thus, we are confident that all the linkages reported are biologically relevant when considering single pairs of protein structures. Selecting the more lenient value of 1/15 identifies more of both fold and superfamily links, which we suspect lowers the specificity of the approach, as many more links between folds are found that may not be true superfamily relationships. For more details, see Implications for structural genomics in Results, above.

P3D was originally described for the comparison of a single pair of protein structures. Because here we are looking for the minimum value from a (sometimes large) number of pairs of proteins, low values are more likely to arise by chance (i.e., akin to the difference between a P-value and E-value in database searches such as BLAST; Altschul et al. 1990). A drastic overestimate of the correction needed would be to multiply the lowest pairwise P-value by the number of possible pairs. However, this assumes that all observed pairs are independent observations, which is certainly not the case for sequences that are highly similar. We thus sought a quantity that measures the "effective sequence number," giving more weight to unique sequences (i.e., those without close homologs in the alignments) and less to those with many similar sequences. We used a diversity score for a multiple alignment described by Rychlewski et al. (2000):

$$D = \sum_{i=1}^{n} \frac{1}{\sum_{j=1}^{n} S_{i,j}}$$

where Si,j is the sequence identity between sequences i and j, and n is the number of sequences in the alignment.
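A diversity score of this kind can be sketched in a few lines. The form below is an assumption: each sequence contributes the reciprocal of its summed fractional identity to every sequence in the alignment, itself included, which yields one for a set of identical sequences and n for a set of unrelated ones. The exact expression of Rychlewski et al. (2000) may differ, and the function name is hypothetical.

```python
from typing import Sequence

def diversity_score(identity: Sequence[Sequence[float]]) -> float:
    """Effective number of sequences from a pairwise identity matrix.

    identity[i][j] is the fractional sequence identity (0 to 1) between
    sequences i and j, with identity[i][i] == 1.
    """
    n = len(identity)
    return sum(1.0 / sum(identity[i][j] for j in range(n)) for i in range(n))

# Three identical sequences count as one; three unrelated sequences count as three.
identical = [[1.0, 1.0, 1.0]] * 3
unrelated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(diversity_score(identical), diversity_score(unrelated))
```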

If all sequences in the alignment are very similar, D tends to one; otherwise it increases as a function of the diversity of the sequences, with the total number of sequences in the alignment as the upper limit. We thus define the P3D for a multiple set of sequences between two families (MP3D) as:

$$MP_{3D} = \min(P_{3D}) \cdot D_A \cdot D_B$$

where min(P3D) is the lowest pairwise P3D between the two families, and DA and DB are the diversity scores for alignments A and B.
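The correction can be illustrated as follows, assuming that the lowest pairwise P3D is multiplied by the two diversity scores in place of the raw number of sequence pairs. The function name, the cap at one, and the numbers in the example are hypothetical and are not values from the study.

```python
from typing import Iterable

def mp3d(pairwise_p3d: Iterable[float], diversity_a: float, diversity_b: float) -> float:
    """Lowest pairwise P3D scaled by the effective number of sequence pairs.

    diversity_a and diversity_b are the diversity scores of the two aligned
    families. The result is capped at 1 so that it can still be read as a
    probability (the cap is an addition made for this sketch).
    """
    return min(1.0, min(pairwise_p3d) * diversity_a * diversity_b)

# Illustrative values only: best pairwise P3D of 1.3e-4, diversities of 4 and 5.
print(mp3d([1e-3, 1.3e-4, 0.2], 4.0, 5.0))
```

Using the diversity scores rather than the raw sequence counts avoids over-penalizing alignments that contain many near-identical sequences.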


    Acknowledgments
 
We thank Chris Ponting (Functional Genetics Unit, Oxford) and Rich Copley (EMBL, Heidelberg) for helpful discussions. This work was supported by grants BIO2000–0647, BIO2001–2064, BIO98–0362, and FEDER-2FD97–0872 from the CICYT, by CERBA and by C4-CESCA (Barcelona, Spain).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.


    References
 
Aloy, P., Querol, E., Aviles, F.X., and Sternberg, M.J.E. 2001. Automated structure-based prediction of functional sites in proteins—Application to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J. Mol. Biol. 311: 395–408.

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.

Anantharaman, V., Koonin, E.V., and Aravind, L. 2001. Regulatory potential, phyletic distribution and evolution of ancient, intracellular small-molecule-binding domains. J. Mol. Biol. 307: 1271–1292.

Artymiuk, P.J., Rice, D.W., Mitchell, E.M., and Willett, P. 1990. Structural resemblance between the families of bacterial signal-transduction proteins and G proteins revealed by graph theoretical techniques. Protein Eng. 4: 39–43.

Artymiuk, P.J., Poirrette, A.R., Rice, D.W., and Willett, P. 1997. A polymerase I palm in adenylyl cyclase? Nature 388: 33–34.

Attwood, T.K., Croning, M.D., Flower, D.R., Lewis, A.P., Mabey, J.E., Scordis, P., Selley, J.N., and Wright, W. 2000. PRINTS-S: The database formerly known as PRINTS. Nucleic Acids Res. 28: 225–227.

Bairoch, A. and Apweiler, R. 1999. The SWISSPROT protein sequence data bank and its new supplement TrEMBL in 1999. Nucleic Acids Res. 27: 49–54.

Barton, G.J. 1993. ALSCRIPT: A tool to format multiple sequence alignments. Protein Eng. 6: 37–40.

Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Howe, K.L., and Sonnhammer, E.L. 2000. The Pfam protein families database. Nucleic Acids Res. 28: 263–266.

Bazan, J.F. 1990. Structural design and molecular evolution of a cytokine receptor superfamily. Proc. Natl. Acad. Sci. 87: 6934–6938.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.

Blanco-Aparicio, C., Molina, M.A., Fernandez-Salas, E., Frazier, M.L., Mas, J.M., Querol, E., Aviles, F.X., and de Llorens, R. 1998. Potato carboxypeptidase inhibitor, a T-knot protein, is an epidermal growth factor antagonist that inhibits tumor cell growth. J. Biol. Chem. 273: 12370–12377.

Boggon, T.J., Shan, W.S., Santagata, S., Myers, S.C., and Shapiro, L. 1999. Implication of tubby proteins as transcription factors by structure-based functional analysis. Science 286: 2119–2125.

Brenner, S.E., Chothia, C., and Hubbard, T.J. 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. 95: 6073–6078.

Christendat, D., Yee, A., Dharamsi, A., Kluger, Y., Savchenko, A., Cort, J.R., Booth, V., Mackereth, C.D., Saridakis, V., Ekiel, I., Kozlov, G., Maxwell, K.L., Wu, N., McIntosh, L.P., Gehring, K., Kennedy, M.A., Davidson, A.R., Pai, E.F., Gerstein, M., Edwards, A.M., and Arrowsmith, C.H. 2000. Structural proteomics of an archaeon. Nat. Struct. Biol. 7: 903–909.

Clemmons, D.R. 1993. IGF binding proteins and their functions. Mol. Reprod. Dev. 35: 368–374.

Copley, R.R. and Bork, P. 2000. Homology among (β/α)₈ barrels: Implications for the evolution of metabolic pathways. J. Mol. Biol. 303: 627–641.

Dietmann, S. and Holm, L. 2001. Identification of homology in protein structure classification. Nat. Struct. Biol. 8: 953–957.

Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14: 755–763.

Eisenberg, D., Marcotte, E.M., Xenarios, I., and Yeates, T.O. 2000. Protein function in the post-genomic era. Nature 405: 823–826.

Flores, T.P., Orengo, C.A., Moss, D.S., and Thornton, J.M. 1993. Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci. 2: 1811–1826.

Gay, N.J. and Walker, J.E. 1983. Homology between human bladder carcinoma oncogene product and mitochondrial ATP-synthase. Nature 301: 262–264.

Hegyi, H. and Gerstein, M. 1999. The relationship between protein structure and function: A comprehensive survey with application to the yeast genome. J. Mol. Biol. 288: 147–164.

Henikoff, J.G., Greene, E.A., Pietrokovski, S., and Henikoff, S. 2000. Increased coverage of protein families with the blocks database servers. Nucleic Acids Res. 28: 228–230.

Holm, L. 1998. Unification of protein families. Curr. Opin. Struct. Biol. 8: 372–379.

Holm, L. and Sander, C. 1997. Decision support system for the evolutionary classification of protein structures. Ismb 5: 140–246.

Hughes, J., Ward, C.J., Peral, B., Aspinwall, R., Clark, K., San Millan, J.L., Gamble, V., and Harris, P.C. 1995. The polycystic kidney disease 1 (PKD1) gene encodes a novel protein with multiple cell recognition domains. Nature Genet. 10: 151–160.

Kraulis, P.J. 1991. MOLSCRIPT: A program to produce both detailed and schematic plots of protein structures. J. Appl. Cryst. 24: 946–950.

Kunin, V., Chan, B., Sitbon, E., Lithwick, G., and Pietrokovski, S. 2001. Consistency analysis of similarity between multiple alignments: Prediction of protein function and fold structure from analysis of local sequence motifs. J. Mol. Biol. 307: 939–949.

Landgraf, R., Xenarios, I., and Eisenberg, D. 2001. Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J. Mol. Biol. 307: 1487–1502.

Lukat, G.S., Lee, B.H., Mottonen, J.M., Stock, A.M., and Stock, J.B. 1991. Roles of the highly conserved aspartate and lysine residues in the response regulator of bacterial chemotaxis. J. Biol. Chem. 266: 8348–8354.

Mas, J.M., Aloy, P., Marti-Renom, M.A., Oliva, B., Blanco-Aparicio, C., Molina, M.A., de Llorens, R., Querol, E., and Aviles, F.X. 1998. Protein similarities beyond disulphide bridge topology. J. Mol. Biol. 284: 541–548.

Matsuo, Y. and Bryant, S.H. 1999. Identification of homologous core structures. Proteins 35: 70–79.

Murzin, A.G. 1993a. Can homologous proteins evolve different enzymatic activities? Trends Biochem. Sci. 18: 403–405.

Murzin, A.G. 1993b. Sweet-tasting protein monellin is related to the cystatin family of thiol proteinase inhibitors. J. Mol. Biol. 230: 689–694.

Murzin, A.G. 1993c. OB (oligonucleotide/oligosaccharide binding)-fold: Common structural and functional solution for non-homologous sequences. EMBO J. 12: 861–867.

Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.

Ponting, C.P. and Russell, R.B. 1998. Protein fold irregularities that hinder sequence analysis. Curr. Opin. Struct. Biol. 8: 364–371.

Ponting, C.P. and Russell, R.B. 2000. Identification of distant homologues of FGFs suggests a common ancestor for all β-trefoil proteins. J. Mol. Biol. 302: 1041–1047.

Russell, R.B. 1998. Detection of protein three-dimensional side-chain patterns: New examples of convergent evolution. J. Mol. Biol. 279: 1211–1227.

Russell, R.B. and Barton, G.J. 1992. Multiple protein sequence alignment from tertiary structure comparison: Assignment of global and residue confidence levels. Proteins 14: 309–323.

Russell, R.B. and Barton, G.J. 1994. Structural features can be unconserved in proteins with similar folds. An analysis of side-chain to side-chain contacts secondary structure and accessibility. J. Mol. Biol. 244: 332–350.

Russell, R.B., Sasieni, P.D., and Sternberg, M.J.E. 1998. Supersites within superfolds. Binding site similarity in the absence of homology. J. Mol. Biol. 282: 903–918.

Russell, R.B., Saqi, M.A., Sayle, R.A., Bates, P.A., and Sternberg, M.J.E. 1997. Recognition of analogous and homologous protein folds: Analysis of sequence and structure conservation. J. Mol. Biol. 269: 423–439.

Rychlewski, L., Jaroszewski, L., Weizhong, L.I., and Godzik, A. 2000. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci. 9: 232–241.

Saraste, M., Sibbald, P.R., and Wittinghofer, A. 1990. The P-loop—a common motif in ATP- and GTP-binding proteins. Trends Biochem. Sci. 15: 430–434.

Schultz, J., Copley, R.R., Doerks, T., Ponting, C.P., and Bork, P. 2000. SMART: A web-based tool for the study of genetically mobile domains. Nucleic Acids Res. 28: 231–234.

Shapiro, L. and Harris, T. 2000. Finding function through Structural Genomics. Curr. Opin. Biotechnol. 11: 31–35.

Shapiro, L. and Scherer, P.E. 1998. The crystal structure of a complement-1q family protein suggests an evolutionary link to tumor necrosis factor. Curr. Biol. 8: 335–338.

Shapiro, L., Kwong, P.D., Fannon, A.M., Colman, D.R., and Hendrickson, W.A. 1995. Considerations on the folding topology and evolutionary origin of cadherin domains. Proc. Natl. Acad. Sci. 92: 6793–6797.

Tatusov, R.L., Galperin, M.Y., Natale, D.A., and Koonin, E.V. 2000. The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28: 33–36.

Todd, A.C., Orengo, C.A., and Thornton, J.M. 2001. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307: 1113–1143.

Vervoort, J., Heering, D., Peelen, S., and van Berkel, W. 1994. Flavodoxins. Methods Enzymol. 243: 188–203.

Volz, K. 1993. Structural conservation in the CheY superfamily. Biochemistry 32: 11741–11753.

Wallace, A.C., Borkakoti, N., and Thornton, J.M. 1997. TESS: A geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci. 6: 2308–2323.

Wilmanns, M., Priestle, J.P., Niermann, T., and Jansonius, J.N. 1992. Three-dimensional structure of the bifunctional enzyme phosphoribosylanthranilate isomerase: Indoleglycerolphosphate synthase from Escherichia coli refined at 2.0 Å resolution. J. Mol. Biol. 223: 477–507.

Yang, F., Gustafson, K.R., Boyd, M.R., and Wlodawer, A. 1998. Crystal structure of Escherichia coli HdeA. Nat. Struct. Biol. 5: 763–764.

Zarembinski, T.I., Hung, L.W., Mueller-Dieckmann, H.J., Kim, K.K., Yokota, H., Kim, R., and Kim, S.H. 1998. Structure-based assignment of the biochemical function of a hypothetical protein: A test case of Structural Genomics. Proc. Natl. Acad. Sci. 95: 15189–15193.

