1 EMBL, Meyerhofstrasse, 1. D-69117, Heidelberg, Germany
2 Institut de Biotecnologia i Biomedicina, Universitat Autonoma de Barcelona, Bellaterra 08193, Barcelona, Spain
(RECEIVED September 28, 2001; FINAL REVISION January 29, 2002; ACCEPTED January 29, 2002)
3 Reprint requests to: Robert B. Russell, EMBL, Biocomputing, Meyerhofstrasse, 1. D-69117, Heidelberg, Germany; e-mail: russell{at}embl-heidelberg.de; fax: 49-6221-387-517.
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.3950102.
| Abstract |
|---|
Keywords: Protein structure; sequence; function; homology; structural genomics
Abbreviations: 3D, three dimensional; Ig, immunoglobulin; RMSD, root mean square deviation; PDB, Protein Data Bank; ATP, adenosine triphosphate; SCOP, structural classification of proteins; NCBI, National Center for Biotechnology Information; URL, uniform resource locator
| Introduction |
|---|
For instances where a new protein structure adopts a previously observed fold, a major goal of structural genomics initiatives is to provide a structural link between sequence families that are not detectably similar when only sequences are compared (e.g., Brenner et al. 1998). A key issue is whether protein structure similarity per se can be used to infer that proteins are homologous and/or whether they are likely to show similarities in their molecular function. If proteins sharing similar 3D structures can be said, with confidence, to share a common ancestor, then a similarity in molecular function is more likely (e.g., Hegyi and Gerstein 1999).
For groups of proteins sharing a common fold in the absence of strong sequence similarity, previous work has focussed on the distinction between "remote homologs" and "analogs" (proteins sharing a similar fold in the absence of convincing evidence of a common ancestor). Studies were aided greatly by the creation of structural classification databases, which produced a reliable set of homologous proteins lacking significant sequence identity. Proteins now are generally classified as belonging to "superfamilies" if evidence for homology comes from sources other than structure similarity. Such evidence can include the conservation of key active site or structural residues, common functions, or unusual structural characteristics unlikely to have arisen by chance (e.g., left-handed βαβ motifs). Even in the absence of detectable sequence similarity, proteins are considered to share a common ancestor based on such evidence. Probably the best source of these data is the Structural Classification of Proteins (SCOP) database (Murzin et al. 1995). Here, proteins that adopt a similar fold with little or no sequence similarity are placed in the same superfamily if there is evidence that they share a common ancestor. SCOP has been used in numerous studies on the distinction between remote homology and analogy (e.g., Russell and Barton 1994; Russell et al. 1997; Matsuo and Bryant 1999). Although an automated means to predict superfamily relationships remains elusive, limited success in discerning homology has come from analysis of features such as sequence similarity calculated from structure-based alignments (Murzin 1993b; Russell et al. 1997; Ponting and Russell 2000), structurally conserved core residues (Matsuo and Bryant 1999) or combinations of multiple features (e.g., Holm 1998; Dietmann and Holm 2001).
In parallel with the above developments has been the emergence of databases that classify protein sequences. Genome projects have produced vast numbers of protein sequences, which are frequently grouped into aligned domain families. Accurate alignment of protein sequences permits the construction of sensitive profiles, or hidden Markov models (HMMs; e.g., Eddy 1998) that can be used to detect other remote members of the family. In addition, sequence comparisons enable long protein sequences to be divided into discrete functional or structural domains. Simple modular architecture research tool (SMART; Schulz et al. 2000), protein families (Pfam; Bateman et al. 2000), protein fingerprints (PRINTS; Attwood et al. 2000), clusters of orthologous groups (COGs; Tatusov et al. 2000) and BLOCKS (Henikoff et al. 2000) are examples of protein domain sequence alignment databases.
We describe here a method to detect superfamily relationships based on the statistical significance of an inferred structure-based sequence alignment. We first construct a database that merges sequence alignments from SMART and Pfam according to folds and superfamilies defined within the SCOP database. After finding overlapping sequences within the structure and sequence databases, we construct structural alignments and use these to merge different SMART/Pfam alignments. We then apply the statistical P-value described by Murzin (1993b) to assess the significance of sequence identities between pairs of sequences aligned by structure comparison. We discuss interesting new potential superfamilies in addition to implications for protein sequence comparison and structural genomics.
| Results |
|---|
A total of 30 SCOP superfamilies could be matched to more than one SMART domain, producing a total of 183 potential pairings of SMART domains via structurally similar superfamilies. If one considers "fold" level similarities within SCOP (i.e., where proteins are in the same fold, but in different superfamilies; where evidence for an evolutionary relationship has not yet been found), an additional 120 pairings are added from a total of 11 folds. Details of all of the pairings are given in Table 1. A table showing the results for all the potential pairings, at both superfamily and fold level, can be found at http://www.embl-heidelberg.de/~aloy/struct_align.
[…] P3D ≤ 5×10⁻³ (see Materials and Methods), 31 had P3D ≤ 10⁻⁵ and 13 had P3D ≤ 10⁻¹⁰. The last are instances where SMART domains are clearly sequence similar, but have been split to aid analysis of new protein sequences (C. Ponting, pers. comm.) or where similarities that are known are yet to be merged into a single SMART domain. Considering the P values corrected for multiple observations, 52 still showed MP3D ≤ 5×10⁻³. At least one significant link was found for 20 out of the 30 SCOP superfamilies assigned to more than one SMART domain. It is important to emphasize that many of the similarities are not easily found by sequence methods, with evolutionary relationships only inferred once 3D structures are available. The SCOP P-loop containing nucleotide triphosphate hydrolases (e.g., Gay and Walker 1983; Saraste et al. 1990) form a large superfamily containing 11 SMART domains. Similarities between ATP/GTP-binding and other motifs provided significant links for 38 pairs out of 55. This is despite the fact that proteins within this superfamily frequently show topological differences, such as variations in β-strand order and direction (Murzin et al. 1995).
We also linked several ancient DNA-binding protein families. For example, the nucleosome core histone superfamily in SCOP contains at least five different SMART domains, and six of the 10 possible pairings were found to be significant. Similarly, the winged helix DNA-binding proteins, characterized by a 3-helix DNA-binding bundle and a small β-sheet (wing), contain six SMART domains and 15 potential pairs, five of which could be linked with confidence.
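The pairing counts quoted for these superfamilies follow from counting unordered pairs of families. The sketch below simply reproduces that arithmetic; the function name is hypothetical and is not part of the published method.

```python
from math import comb


def potential_pairings(n_families: int) -> int:
    """Number of unordered family pairs within one superfamily or fold."""
    return comb(n_families, 2)


# Family counts taken from the text above.
examples = {
    "P-loop NTP hydrolases (11 SMART domains)": 11,
    "nucleosome core histones (5 SMART domains)": 5,
    "winged helix DNA-binding proteins (6 SMART domains)": 6,
}

for label, n in examples.items():
    print(f"{label}: {potential_pairings(n)} potential pairings")
```

The same count applies to any superfamily or fold: n families always give n(n - 1)/2 candidate links to test.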
The method also links the tumor necrosis factor (TNF) and collagen-1q (C1Q) domains from SMART. This similarity was first reported after the structure of ACRP30, a homolog of C1Q domains, was determined by Shapiro and Scherer (1998), who inferred an evolutionary relationship based on similarities in key motifs and in the trimeric structures. Here, the low P3D value links human CD40 ligand (PDB code 1aly; a distant TNF homolog) and human C1q B (C1QB_HUMAN). The alignment of TNFs with C1qs shows that there are numerous conserved hydrophobic positions as well as conservation of key residues around the trimer interface, as has been discussed previously (Shapiro and Scherer 1998). In the absence of functional knowledge for one of these families, a functional similarity might be inferred by this P3D calculation.
Fold links
The situation is different for fold-level similarities, where the SCOP database does not consider the pairs necessarily to be homologous. Here, 35 of a possible 120 pairings had P3D ≤ 5×10⁻³, five had P3D ≤ 10⁻⁵, and none had P3D ≤ 10⁻¹⁰. Six out of the 11 different SCOP folds assigned to more than one SMART domain contain at least one significant pair, but only 10 links remain significant when the MP3D is calculated. Figure 1 shows the SMART domains linked at fold level.
[…] 10⁻⁴ and MP3D ≤ 5×10⁻³. This suggests that they may share a common origin, like other proteins adopting this fold. The method found several links between proteins adopting an Ig-like fold. Common ancestors have been proposed for many of these, particularly between the Ig and fibronectin domains (e.g., Bazan 1990). The most significant link is that between the Cadherin (CA) and polycystic kidney disease (PKD) domains (P3D-value = 3.9×10⁻⁵, MP3D = 1.1×10⁻³). Both of these domains are extracellular and both are thought to be involved in cell-cell contacts (Hughes et al. 1995; Shapiro et al. 1995); thus the P3D-value could be indicating a common ancestor that also is associated with a common function.
Within the OB fold (Murzin 1993c), the method found links between two SMART domains in the superfamily containing ribosomal protein S1-like RNA-binding domain (S1) and cold-shock proteins (CSP) and the superfamily containing staphylococcal nuclease (SNc). The MP3D values between these families were also low (≤ 5.7×10⁻³). Figure 2 shows a superimposition and structure-based sequence alignment of the known structures and the sequences providing the best link between these two families. For SNc, the location of the nucleotide-binding site is known (the location of a bound nucleotide analog inhibitor, thymidine 3′,5′-bisphosphate, is shown in Fig. 2). Several of the residues within the nucleases that are in contact with the nucleotide also are found in S1 and comprise a D[RK]xxGR motif that shows good (but not perfect) conservation in both families. These residues occur in an unusual loop within the most C-terminal β-hairpin found in the OB fold. In addition, an aspartate from a different region in both structures, which is in contact with a bound calcium atom within SNc, also shows good conservation within each of these diverse families.
The procedure also found low P3D-values linking several members of the cysteine-knot fold, including the insulin growth factor-binding protein (IB; e.g., Clemmons 1993), chitin-binding domains (ChtBD), and the epidermal growth factor (EGF). Although functional similarities have been reported for the EGF family (Mas et al. 1998; Blanco-Aparicio et al. 1998), there is no obvious similarity in the functions of these domains, and the low P3D-values may be an artefact of their being small cysteine-rich domains. The similarities involve only a few structurally equivalent residues (<10) and the identities consist mostly of the cysteines forming the disulfides that define the fold.
The method also found linkages within the β-trefoil fold. Previously, analysis of 3D structures and sequences showed that the interleukin-1 (IL1), fibroblast growth factor (FGF), Ricin-type (RICIN), and the soybean trypsin inhibitor (Kunitz) families share a degree of sequence and functional similarity (Ponting and Russell 2000).
Linking Pfam families via structure
A total of 711 out of 2008 Pfam families could be assigned to one or more SCOP domains (117,017 out of 178,110 sequences). As for SMART, there were some disagreements between domain definitions. A total of 61 Pfam domains could be aligned to multiple SCOP domains in nonoverlapping regions. Table 2 shows the 113 SCOP superfamilies and 26 SCOP folds that could be matched to more than one Pfam family. Associated with this are 954 potential pairings within superfamilies and 1406 potential pairings within folds. A table showing the results for all the significant pairings at superfamily and fold level can be found at http://www.embl-heidelberg.de/~aloy/struct_align.
[…] P3D ≤ 5×10⁻³ for 231 out of the 954 potential pairs within superfamilies. Eighty-five of the pairings had P3D ≤ 10⁻⁵ and four had P3D ≤ 10⁻¹⁰. The last were clear cases of sequence-detectable homologs residing in separate domains within Pfam. In addition, 157 of the significant links showed corrected MP3D ≤ 5×10⁻³. Structural alignment was able to identify at least one significant pair for 81 out of the 113 SCOP superfamilies assigned to more than one Pfam family. The results are fully consistent with those obtained with SMART (above). For example, the P-loop nucleotide triphosphate hydrolases superfamily also contains the largest number of different Pfam families. However, the wider coverage of Pfam means that many more relationships were found, several of which are discussed below.
For example, within the NAD(P)-binding Rossmann-fold domains we found significant links for 24 of the 136 potential pairs and a total of 12 out of the 17 Pfam families belonging to this superfamily. This NAD/NADP-binding domain is present in a large number of families, including dehydrogenases, synthetases, reductases, and methylases, and the difficulty of detecting similarities using only sequence analysis has been discussed recently (Kunin et al. 2001).
We also found several links within the four helical cytokine superfamily. Eight of the 10 Pfam families in this superfamily take part in at least one pair with a significant P3D-value. The relationship between these proteins is well known, though the similarities frequently are not detectable by sequence comparison alone.
Fold links
The situation is again different for families residing in the same fold but in different superfamilies. With the same cutoff (P3D ≤ 5×10⁻³) we found 152 significant pairs out of the potential 1406. Eleven pairs had P3D ≤ 10⁻⁵ and only one had P3D ≤ 10⁻¹⁰. However, we found at least one significant link for 16 out of the 26 different SCOP folds containing two or more Pfam families. When the family diversity score is considered, 51 out of the 152 significant links still show MP3D ≤ 5×10⁻³. Figure 3 shows the Pfam families linked at fold level. As for the superfamily links, all overlapping links found in SMART domains also were present for the Pfam equivalents, and again, because of the larger coverage of sequence space by Pfam, additional links were found between families not present in SMART, many of which are discussed below.
The (βα)8 (TIM) barrel fold is the largest, comprising a total of 38 different Pfam families. Of the 703 potential pairings, 84 were significant. These pairings effectively mean that 33 out of the 38 families assigned to this fold can be linked. Moreover, the five families that were not linked to the others (Glyco_hydro_1, Glyco_hydro_14, PI-PLC-X, ALAD, and AP_endonucleas2) belong to the same SCOP superfamily as other linked families, meaning that there are homologous links within SCOP between these and the others. The results are consistent with the work of Copley and Bork (2000), who found that all but one enzyme family adopting a TIM-barrel fold can be linked to the others using sequence search methods like PSI-BLAST (Altschul et al. 1990), suggesting a common ancestor for this fold that performs many different biochemical functions. They were unable to link the dihydroorotase family (proposed to adopt a TIM barrel by Holm and Sander 1997). No structures of the dihydroorotase family are known, meaning that our study could not suggest an evolutionary link between this family and any of the others in this fold.
Six out of the eight Pfam families within the flavodoxin-like fold could be linked into a single group, with the most significant link being that between the Response regulator receiver domain (response_reg) and flavodoxin families (P3D-value = 3.85×10⁻⁵; MP3D = 1.1×10⁻⁵). The Response regulator family includes CheY (Volz 1993), the receiver domain among two-component signal transduction systems. In response to a specific stimulus, these proteins are phosphorylated, leading to a conformational change that is detected by an effector domain. The flavodoxins (Vervoort et al. 1994) are small proteins that bind FMN and serve as redox centers and electron transport proteins. Two of the other three linked families (DHquinase_II and Ligase-CoA) are ATP/GTP-binding proteins. Our results indicate that the function of the five families may have diverged from a common nucleotide-binding site. Though the CheY family does not bind nucleotides, structural and functional analogies between the CheY-like receiver domains and small GTP-binding proteins have been noted and a common ancestry has been proposed (Artymiuk et al. 1990; Lukat et al. 1991).
We found several links between superfamilies within the SCOP ferredoxin-like fold, some of which appeared to be associated with functional similarities (see below). This fold comprises a repeat of a split βαβ motif that forms an anti-parallel β-sheet flanked on one side by two α-helices. It is one of the most populated in SCOP, with 31 different superfamilies, and also performs a large number of different functions. In total, 12 out of the 17 Pfam families assigned to this fold were linked with significant P3D-values.
One of the more intriguing similarities within the ferredoxin-like fold is that between the RNA recognition motif (rrm) and heavy-metal-associated (HMA) domain (P3D-value = 4.7×10⁻⁵; MP3D = 9.0×10⁻⁴). The RNA recognition motifs are found in a variety of proteins, including proteins implicated in regulation of alternative splicing, and function to bind single-stranded RNA. The heavy-metal-associated proteins are implicated in the regulation of cytoplasmic metal concentration, and although they are known to be ATP dependent, the location of the ATP-binding site has not yet been determined. Figure 4 shows two representative structures from these superfamilies and the associated alignment of the best link. The conserved residues are found in the nucleotide-binding site of the RNA recognition protein, which suggests that it also may correspond to the ATP-binding site for the HMA proteins. Interestingly, the location of this binding site corresponds with the ferredoxin-like fold "supersite" proposed by Russell et al. (1998). This fold shows a tendency to bind a diversity of different ligands at a common location, which is on the side of the β-sheet not flanked by α-helices. It is worth noting that both rrm and HMA domains occur as tandem repeats, and an oligomeric structure has been proposed for the rrm domains to aid the binding of single-stranded RNA.
[…] Cα atoms superimpose with RMS = 2.0 Å and percent sequence identity of 4%). However, the initial structural alignment identified a shorter region of high structure and local sequence similarity (RMS = 1.3; %I = 33%), as is shown in Figure 5.
[…] (βα)8 TIM-barrel structure. The 3D structure also showed the protein to bind pyridoxal-phosphate (PLP).
The structure shows a striking similarity to members of the alanine racemase family (Ala_racemase). This similarity is detectable by sequence comparison alone (e.g., PSI-BLAST with an E-value of 7×10⁻³¹) and also shows a very significant pairwise P3D (1×10⁻⁴⁴), meaning that they are placed within the same SCOP superfamily. Ignoring this obvious similarity, the method also finds significant links to five other families adopting the (βα)8 (TIM)-barrel fold: aldo/keto reductases (aldo_ket_red), indole-3-glycerol phosphate synthase (IGPS), dihydrodipicolinate synthetase (DHDPS), dihydropteroate synthase (DHPS), and fructose-bisphosphate aldolase class II (F_bp_aldolase). As discussed above, this fold performs many different biochemical functions, making attempts to assign function from structure difficult. Moreover, there also is growing evidence that many of the (βα)8-barrels have evolved from a common ancestor (see above; Copley and Bork 2000).
Inspection of the alignments and superimpositions with the Pfam families having the most significant P3D-values suggests that the best functional match is probably the family with the second lowest value, IGPS (P3D-value = 1.94×10⁻³), rather than the aldo/keto reductases (aldo_ket_red; P3D-value = 4.84×10⁻⁴). The structurally equivalent regions are longer when comparing UPF0001 and IGPS, and they overlap many of the residues involved in function in IGPS and PRAI (Wilmanns et al. 1992), some of which are fully conserved in the merged alignments (K55 and G236). IGPS, together with the PRAI and Trp_synthase families, belongs to the Ribulose phosphate-binding barrel superfamily. Most of the enzymes in this superfamily are involved in amino-acid synthesis pathways, and some also bind PLP. Moreover, three of the other Pfam families that could be linked to UPF0001 are involved in tryptophan or lysine synthesis. All these results indicate that the 26 members of the UPF0001 Pfam family could play a role in amino-acid synthesis.
Implications for structural genomics
The results above show that the method often is able to identify superfamily relationships, and possible functional similarities between domains that have been linked by a structure similarity. The accuracy of identifying such relationships is important for Structural Genomics projects that target domains of unknown structure from databases such as SMART or Pfam. Using P3D ≤ 5×10⁻³ to predict superfamily relationships gives a sensitivity (the percentage of correct relationships identified) of 45% for SMART domains and 24% for those from Pfam. For the more stringent MP3D ≤ 5×10⁻³ the values are 27% and 16%, respectively. We suspect that the lower sensitivity for the Pfam domains is because the alignments are less diverse than their SMART counterparts, meaning that one is less likely to find a significant link when considering all sequences.
Quoting an associated specificity (the percentage of incorrect relationships correctly rejected) is problematic, as there is no definitive set of "false positives." Ideally false positives would consist of proteins with different folds, but for these it would not be possible to build meaningful structure-based alignments. An alternative is to use pairs of proteins in the same fold that are definitely "not" in the same superfamily (i.e., fold level links). Though many folds in SCOP contain multiple superfamilies, new evidence often emerges (e.g., the examples above) that permits them to be merged together. Thus calculating specificity in this way will give an underestimate of the correct value. With this caveat in mind, the specificities for P3D ≤ 5×10⁻³ are 80% for SMART domains and 89% for Pfam, and those for MP3D ≤ 5×10⁻³ are 95% and 96%, respectively.
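As a worked check, the Pfam percentages can be recomputed from the pair counts given in Results. The sketch below treats superfamily-level pairs as true relationships and fold-level pairs as the negative set, as described above. It assumes specificity is the fraction of fold-level pairs that were not linked, which is the reading that reproduces the quoted Pfam figures; the function names are hypothetical.

```python
def sensitivity(links_found: int, true_pairs: int) -> float:
    """Fraction of superfamily-level (assumed true) pairs that were linked."""
    return links_found / true_pairs


def specificity(links_found: int, negative_pairs: int) -> float:
    """Fraction of fold-level (assumed false) pairs that were NOT linked."""
    return 1.0 - links_found / negative_pairs


# Pfam counts quoted in Results: 954 superfamily pairs, 1406 fold pairs.
pfam_sens_p3d = sensitivity(231, 954)     # P3D cutoff
pfam_sens_mp3d = sensitivity(157, 954)    # MP3D cutoff
pfam_spec_p3d = specificity(152, 1406)    # P3D cutoff
pfam_spec_mp3d = specificity(51, 1406)    # MP3D cutoff

for name, value in [
    ("sensitivity, P3D", pfam_sens_p3d),
    ("sensitivity, MP3D", pfam_sens_mp3d),
    ("specificity, P3D", pfam_spec_p3d),
    ("specificity, MP3D", pfam_spec_mp3d),
]:
    print(f"Pfam {name}: {100 * value:.0f}%")
```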
These values of sensitivity and specificity would be applicable for situations where a new structure has been determined for a domain from SMART or Pfam and adopts a fold known previously. We anticipate that our method will often be able to identify many superfamily relationships and thus place a new structure into the correct evolutionary and often functional context.
| Discussion |
|---|
This study has gone some way toward quantifying how Structural Genomics projects will gradually link sequence families together, ultimately providing 3D structure and additional functional information for all protein families with at least one member amenable to structure determination by X-ray crystallography or nuclear magnetic resonance. A protocol like that described here will permit evolutionary and functional similarities to be uncovered automatically as the number of known structures and sequences continues to increase.
It is intriguing that current sensitive sequence searching methods apparently fail to detect some similarities that are quite clearly associated with a degree of sequence conservation (e.g., TNF/C1Q). It may be that sequence profiles are too specific to one family to detect more distantly related sequences, even when key sequence motifs are conserved. Another possible explanation comes from inspection of aligned segment lengths. We found that aligned segments (i.e., those not containing a gap in any sequence) are typically shorter within the SCOP-linked alignments than in alignments derived only by a comparison of sequence (results not shown). This may mean that the model for aligned segments and gaps currently used by sequence comparison methods is too strict to permit alignments such as those obtained by structure comparison.
The P-value first described by Murzin (1993b) attempts to assess the likelihood that a pair of proteins, aligned based on their three-dimensional structures, will have a certain degree of sequence similarity. The prior probability of amino acid identity is based on the abundance of the amino acids, and accommodates the assumption that certain features of common protein structures, such as burial in the hydrophobic core or surface exposure, will increase the chances of amino-acid identities. What it does not take into account is the possibility that certain folds will have strict requirements for particular amino acids at certain positions, which could well be the result of convergent evolution to a stable fold. It has been argued that this is the case for the β-trefoils (including the FGFs and IL-1s; Murzin 1993a; Ponting and Russell 2000), and could well be the case for other folds. It is impossible to discern such occurrences at present, so this possibility should be remembered when considering the links proposed.
Structural Genomics initiatives provide structures for proteins that often are of unknown function. In the absence of further experiments, the ability to place a new structure in the correct evolutionary context is currently the best method for predicting details regarding molecular function. Methods such as that described here and elsewhere (Copley and Bork 2000; Todd et al. 2001; Landgraf et al. 2001; Aloy et al. 2001) will thus be of growing importance in this new age of 3D structural annotation.
| Materials and methods |
|---|
Merging SMART/Pfam alignments with SCOP sequences
SMART and Pfam contain cross-references to appropriate PDB identifiers. However, the PDB identifier alone is not sufficient to identify a domain in SCOP, since such identifiers can contain both multiple polypeptide chains and multiple domains. Accordingly, we constructed a hidden Markov model for each SMART and Pfam alignment using the HMMBUILD program from the HMMer package (S.R. Eddy, unpubl.; http://hmmer.wustl.edu), and used this to search the SCOP database (HMMSEARCH) to provide links between the two databases. SCOP sequences were considered to reside in the SMART or Pfam domain if they had HMMSEARCH E-values ≤ 10⁻³ and if the alignment covered at least 60% of either the SCOP sequence or the SMART/Pfam domain. Once identified, we aligned the sequences for SCOP entries with those from SMART/Pfam using HMMALIGN. This resulted in a set of alignments containing the original sequences from SMART/Pfam in addition to those from SCOP.
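The acceptance rule for linking a SCOP sequence to a SMART/Pfam domain can be sketched as a simple filter. This is a minimal illustration, not the authors' code: the `Hit` record and its field names are hypothetical, no HMMER output parsing is shown, and the thresholds are read as "E-value at most 10⁻³" and "coverage at least 60%", which is an assumption about the comparison operators.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one profile-versus-SCOP-sequence match."""
    evalue: float
    aligned_on_scop: int     # SCOP residues covered by the alignment
    scop_length: int         # full length of the SCOP domain sequence
    aligned_on_domain: int   # profile positions covered by the alignment
    domain_length: int       # full length of the SMART/Pfam profile


def accept(hit: Hit, max_evalue: float = 1e-3, min_coverage: float = 0.60) -> bool:
    """Keep a hit if it is significant and covers enough of either partner."""
    scop_cov = hit.aligned_on_scop / hit.scop_length
    domain_cov = hit.aligned_on_domain / hit.domain_length
    return hit.evalue <= max_evalue and max(scop_cov, domain_cov) >= min_coverage


hits = [
    Hit(1e-5, 80, 100, 40, 200),   # covers 80% of the SCOP sequence
    Hit(1e-2, 90, 100, 90, 100),   # E-value too high
    Hit(1e-5, 50, 100, 50, 100),   # covers only 50% of both
]
print([accept(h) for h in hits])
```

Taking the larger of the two coverages mirrors the "either ... or" wording: a short SCOP domain fully inside a long profile, or the reverse, still qualifies.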
Combining SMART/Pfam alignments from the same SCOP classifications
When we found that different SMART/Pfam entries contained domains from the same SCOP fold or superfamily, we merged the alignments via an alignment of structures. We aligned structures using the STAMP package for protein structure alignment and superimposition. All alignments were checked and, if necessary, edited manually to avoid situations where structural alignment was ambiguous, or led to erroneous results owing to distortions of the structure as a result of bound substrates or poorly determined/missing residues. In two cases, STAMP found several alignments with good scores. These comparisons were manually edited and the best alignment was selected. The structural alignments then were used to merge all associated sequence data. A summary of these linkages can be found in Table 1. For folds/superfamilies containing more than two SMART/Pfam families, we constructed all "pairwise" alignments. This was done to ensure maximum alignment quality.
It is difficult to compare alignments derived from protein sequence alone directly with those derived from three-dimensional structure. The main difficulty is that structural alignment methods either do not necessarily give a meaningful alignment of those regions that differ between protein structures, or do not attempt to align them at all. Accordingly, we processed the structural alignments prior to merging them with sequence alignments. Sequences outside of the structurally conserved regions were shortened to the minimum possible length. This mimics what would likely happen during a sequence alignment of the same proteins, assuming that, in the best possible sequence alignment for the proteins, only the structurally equivalent regions would be accurately aligned.
All alignments are available via the WWW (http://www.embl-heidelberg.de/~aloy/struct_align).
Statistical significance of SMART/Pfam families linked by SCOP structures
Murzin (1993b) proposed a P-value to suggest the likelihood that a "sequence" identity found after "structure"-based alignment could occur by chance (hereafter called P3D). Given n structurally conserved sites (i.e., Cα positions) between two similar protein 3D structures, he suggested that the probability that m of these sites would contain identical amino acids would be:
$$P_{3D}(m) \approx \frac{1}{\sigma\sqrt{2\pi}}\,\exp\left[-\frac{(m-m_0)^2}{2\sigma^2}\right]$$
where $\bar{p}$ is the mean probability of finding identical residues at structurally equivalent sites, $m_0 = n\bar{p}$ (where the binomial has its maximum), and $\sigma = \sqrt{n\bar{p}(1-\bar{p})}$ (the half-width of the approximating distribution). All that is required is to approximate the probability that structurally equivalent sites in two similar protein structures will have identical residues by chance. Murzin suggested that the value would be larger than 1/20, probably about 1/15 but certainly smaller than 1/10. These values attempt to account for the tendency for buried residues to be hydrophobic and exposed residues to be hydrophilic. In other words, the probability would be >1/20 as buried or exposed sites are more likely to contain a smaller subset of the 20 amino acids (i.e., hydrophobic and polar residues, respectively). This calculation was originally applied to the cystatin-monellin similarity, where an evolutionary relationship was inferred based on a P3D of 10⁻³. For more details, we refer the reader to Murzin (1993b). Here, we calculated P3D for all pairs of sequences coming from different SMART/Pfam alignments as aligned according to the structures of one or more members of the alignment. We defined structurally equivalent regions by the method of Russell and Barton (1992), and extrapolated these positions to all sequences in the aligned SMART/Pfam families. This calculation assumes that the alignment of sequences is correct and that the known structures for SMART/Pfam families are good models for the remaining sequences. Owing to the high quality of both the sequence alignments within individual SMART/Pfam domains and the structure-based alignments, we do not suspect that the calculation would differ greatly if all proteins were of known structure.
Here, we used the most stringent value of 1/10, meaning that P3D values are higher than if we had used 1/15. We considered links significant if P3D ≤ 5×10⁻³. This value is an order of magnitude lower than those calculated, and argued to be biologically significant, with a more lenient P3D calculation (1/15) for β-trefoil proteins (Ponting and Russell 2000). Thus, we are confident that all the linkages reported are biologically relevant when considering single pairs of protein structures. Selecting the more lenient value of 1/15 has the effect of identifying more of both fold and superfamily links, which we suspect lowers the specificity of the approach, as many more links between folds are found that may not be true superfamily relationships. For more details, see Implications for structural genomics in Results, above.
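The calculation can be illustrated with a short sketch. It assumes P3D is the Gaussian approximation to the binomial implied by the definitions of m0 and σ above, with σ taken as the binomial standard deviation; that reading, the function name `p3d`, and the example numbers are assumptions rather than the authors' implementation.

```python
import math


def p3d(n_sites: int, n_identical: int, p_bar: float = 1 / 10) -> float:
    """Gaussian approximation to the chance of n_identical identities
    among n_sites structurally equivalent positions (assumed form)."""
    m0 = n_sites * p_bar                                # binomial maximum
    sigma = math.sqrt(n_sites * p_bar * (1 - p_bar))    # distribution half-width
    z = (n_identical - m0) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical pair: 100 equivalent sites, 25 identical residues.
strict = p3d(100, 25, p_bar=1 / 10)
lenient = p3d(100, 25, p_bar=1 / 15)
print(f"p = 1/10: {strict:.2e}   p = 1/15: {lenient:.2e}")
print("significant at 5e-3:", strict <= 5e-3)
```

For the same alignment the value computed with 1/10 is larger than with 1/15, which is why 1/10 is the more stringent choice when applying a fixed cutoff.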
P3D was originally described for the comparison of a single pair of protein structures. Because here we are looking for the minimum value from a (sometimes large) number of pairs of proteins, low values are more likely to arise by chance (akin to the difference between a P-value and an E-value in database searches such as BLAST; Altschul et al. 1990). A drastic overestimation of the correction needed would be to multiply the lowest pairwise P-value by the number of possible pairs. However, this assumes that all observed pairs are independent observations, which is certainly not the case for sequences that are highly similar. We thus sought a quantity that measures the "effective sequence number," giving more weight to unique sequences (i.e., those without close homologs in the alignments) and less to those with many similar sequences. We used a diversity score for a multiple alignment described by Rychlewski et al. (2000):
[Equation not recovered: diversity score D of Rychlewski et al. (2000)]
If all sequences in the alignment are very similar, D tends to one; otherwise it increases with the diversity of the sequences, with the total number of sequences in the alignment as the upper limit. We thus define the P3D for a multiple set of sequences between two families as:
[Equation not recovered: P3D for multiple sequences between two families]
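The two corrections discussed here can be contrasted in a short sketch. It assumes that the diversity-based correction scales the minimum pairwise P3D by the product of the two families' diversity scores, in place of the raw number of pairs; that form, the cap at 1, and all function names are assumptions made for illustration, and the diversity scores are taken as inputs rather than computed.

```python
def naive_corrected_p(pairwise_p, n_seqs_a, n_seqs_b):
    """Overestimated correction: minimum P times the number of possible pairs."""
    return min(1.0, min(pairwise_p) * n_seqs_a * n_seqs_b)


def diversity_corrected_p(pairwise_p, diversity_a, diversity_b):
    """Assumed form: minimum P times the product of effective sequence numbers.

    Each diversity score should lie between 1 (all sequences very similar)
    and the number of sequences in that alignment.
    """
    if diversity_a < 1 or diversity_b < 1:
        raise ValueError("diversity scores are expected to be at least 1")
    return min(1.0, min(pairwise_p) * diversity_a * diversity_b)


# Hypothetical values: two families of two sequences each, so four pairs.
pairwise_p = [2e-4, 3e-3, 0.04, 0.5]
naive = naive_corrected_p(pairwise_p, n_seqs_a=2, n_seqs_b=2)
corrected = diversity_corrected_p(pairwise_p, diversity_a=1.2, diversity_b=1.5)
```

With diversity scores of 1 the assumed correction leaves the minimum pairwise value unchanged, and with scores equal to the family sizes it coincides with the naive multiplication, matching the limits described for D.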
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
Anantharaman, V., Koonin, E.V., and Aravind, L. 2001. Regulatory potential, phyletic distribution and evolution of ancient, intracellular small-molecule-binding domains. J. Mol. Biol. 307: 1271–1292.
Artymiuk, P.J., Rice, D.W., Mitchell, E.M., and Willett, P. 1990. Structural resemblance between the families of bacterial signal-transduction proteins and G proteins revealed by graph theoretical techniques. Protein Eng. 4: 39–43.
Artymiuk, P.J., Poirrette, A.R., Rice, D.W., and Willett, P. 1997. A polymerase I palm in adenylyl cyclase? Nature 388: 33–34.
Attwood, T.K., Croning, M.D., Flower, D.R., Lewis, A.P., Mabey, J.E., Scordis, P., Selley, J.N., and Wright, W. 2000. PRINTS-S: The database formerly known as PRINTS. Nucleic Acids Res. 28: 225–227.
Bairoch, A. and Apweiler, R. 1999. The SWISSPROT protein sequence data bank and its new supplement TrEMBL in 1999. Nucleic Acids Res. 27: 49–54.
Barton, G.J. 1993. ALSCRIPT: A tool to format multiple sequence alignments. Protein Eng. 6: 37–40.
Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Howe, K.L., and Sonnhammer, E.L. 2000. The Pfam protein families database. Nucleic Acids Res. 28: 263–266.
Bazan, J.F. 1990. Structural design and molecular evolution of a cytokine receptor superfamily. Proc. Natl. Acad. Sci. 87: 6934–6938.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.
Blanco-Aparicio, C., Molina, M.A., Fernandez-Salas, E., Frazier, M.L., Mas, J.M., Querol, E., Aviles, F.X., and de Llorens, R. 1998. Potato carboxypeptidase inhibitor, a T-knot protein, is an epidermal growth factor antagonist that inhibits tumor cell growth. J. Biol. Chem. 273: 12370–12377.
Boggon, T.J., Shan, W.S., Santagata, S., Myers, S.C., and Shapiro, L. 1999. Implication of tubby proteins as transcription factors by structure-based functional analysis. Science 286: 2119–2125.
Brenner, S.E., Chothia, C., and Hubbard, T.J. 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. 95: 6073–6078.
Christendat, D., Yee, A., Dharamsi, A., Kluger, Y., Savchenko, A., Cort, J.R., Booth, V., Mackereth, C.D., Saridakis, V., Ekiel, I., Kozlov, G., Maxwell, K.L., Wu, N., McIntosh, L.P., Gehring, K., Kennedy, M.A., Davidson, A.R., Pai, E.F., Gerstein, M., Edwards, A.M., and Arrowsmith, C.H. 2000. Structural proteomics of an archaeon. Nat. Struct. Biol. 7: 903–909.
Clemmons, D.R. 1993. IGF binding proteins and their functions. Mol. Reprod. Dev. 35: 368–374.
Copley, R.R. and Bork, P. 2000. Homology among (β/α)8 barrels: Implications for the evolution of metabolic pathways. J. Mol. Biol. 303: 627–641.
Dietmann, S. and Holm, L. 2001. Identification of homology in protein structure classification. Nat. Struct. Biol. 8: 953–957.
Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14: 755–763.
Eisenberg, D., Marcotte, E.M., Xenarios, I., and Yeates, T.O. 2000. Protein function in the post-genomic era. Nature 405: 823–826.
Flores, T.P., Orengo, C.A., Moss, D.S., and Thornton, J.M. 1993. Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci. 2: 1811–1826.
Gay, N.J. and Walker, J.E. 1983. Homology between human bladder carcinoma oncogene product and mitochondrial ATP-synthase. Nature 301: 262–264.
Hegyi, H. and Gerstein, M. 1999. The relationship between protein structure and function: A comprehensive survey with application to the yeast genome. J. Mol. Biol. 288: 147–164.
Henikoff, J.G., Greene, E.A., Pietrokovski, S., and Henikoff, S. 2000. Increased coverage of protein families with the blocks database servers. Nucleic Acids Res. 28: 228–230.
Holm, L. 1998. Unification of protein families. Curr. Opin. Struct. Biol. 8: 372–379.
Holm, L. and Sander, C. 1997. Decision support system for the evolutionary classification of protein structures. ISMB 5: 140–246.
Hughes, J., Ward, C.J., Peral, B., Aspinwall, R., Clark, K., San Millan, J.L., Gamble, V., and Harris, P.C. 1995. The polycystic kidney disease 1 (PKD1) gene encodes a novel protein with multiple cell recognition domains. Nature Genet. 10: 151–160.
Kraulis, P.J. 1991. MOLSCRIPT: A program to produce both detailed and schematic plots of protein structures. J. Appl. Cryst. 24: 946–950.
Kunin, V., Chan, B., Sitbon, E., Lithwick, G., and Pietrokovski, S. 2001. Consistency analysis of similarity between multiple alignments: Prediction of protein function and fold structure from analysis of local sequence motifs. J. Mol. Biol. 307: 939–949.
Landgraf, R., Xenarios, I., and Eisenberg, D. 2001. Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J. Mol. Biol. 307: 1487–1502.
Lukat, G.S., Lee, B.H., Mottonen, J.M., Stock, A.M., and Stock, J.B. 1991. Roles of the highly conserved aspartate and lysine residues in the response regulator of bacterial chemotaxis. J. Biol. Chem. 266: 8348–8354.
Mas, J.M., Aloy, P., Marti-Renom, M.A., Oliva, B., Blanco-Aparicio, C., Molina, M.A., de Llorens, R., Querol, E., and Aviles, F.X. 1998. Protein similarities beyond disulphide bridge topology. J. Mol. Biol. 284: 541–548.
Matsuo, Y. and Bryant, S.H. 1999. Identification of homologous core structures. Proteins 35: 70–79.
Murzin, A.G. 1993a. Can homologous proteins evolve different enzymatic activities? Trends Biochem. Sci. 18: 403–405.
Murzin, A.G. 1993b. Sweet-tasting protein monellin is related to the cystatin family of thiol proteinase inhibitors. J. Mol. Biol. 230: 689–694.
Murzin, A.G. 1993c. OB (oligonucleotide/oligosaccharide binding)-fold: Common structural and functional solution for non-homologous sequences. EMBO J. 12: 861–867.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.
Ponting, C.P. and Russell, R.B. 1998. Protein fold irregularities that hinder sequence analysis. Curr. Opin. Struct. Biol. 8: 364–371.
Ponting, C.P. and Russell, R.B. 2000. Identification of distant homologues of FGFs suggests a common ancestor for all β-trefoil proteins. J. Mol. Biol. 302: 1041–1047.
Russell, R.B. 1998. Detection of protein three-dimensional side-chain patterns: New examples of convergent evolution. J. Mol. Biol. 279: 1211–1227.
Russell, R.B. and Barton, G.J. 1992. Multiple protein sequence alignment from tertiary structure comparison: Assignment of global and residue confidence levels. Proteins 14: 309–323.
Russell, R.B. and Barton, G.J. 1994. Structural features can be unconserved in proteins with similar folds. An analysis of side-chain to side-chain contacts secondary structure and accessibility. J. Mol. Biol. 244: 332–350.
Russell, R.B., Sasieni, P.D., and Sternberg, M.J.E. 1998. Supersites within superfolds. Binding site similarity in the absence of homology. J. Mol. Biol. 282: 903–918.
Russell, R.B., Saqi, M.A., Sayle, R.A., Bates, P.A., and Sternberg, M.J.E. 1997. Recognition of analogous and homologous protein folds: Analysis of sequence and structure conservation. J. Mol. Biol. 269: 423–439.
Rychlewski, L., Jaroszewski, L., Li, W., and Godzik, A. 2000. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci. 9: 232–241.
Saraste, M., Sibbald, P.R., and Wittinghofer, A. 1990. The P-loop – a common motif in ATP- and GTP-binding proteins. Trends Biochem. Sci. 15: 430–434.
Schultz, J., Copley, R.R., Doerks, T., Ponting, C.P., and Bork, P. 2000. SMART: A web-based tool for the study of genetically mobile domains. Nucleic Acids Res. 28: 231–234.
Shapiro, L. and Harris, T. 2000. Finding function through Structural Genomics. Curr. Opin. Biotechnol. 11: 31–35.
Shapiro, L. and Scherer, P.E. 1998. The crystal structure of a complement-1q family protein suggests an evolutionary link to tumor necrosis factor. Curr. Biol. 8: 335–338.
Shapiro, L., Kwong, P.D., Fannon, A.M., Colman, D.R., and Hendrickson, W.A. 1995. Considerations on the folding topology and evolutionary origin of cadherin domains. Proc. Natl. Acad. Sci. 92: 6793–6797.
Tatusov, R.L., Galperin, M.Y., Natale, D.A., and Koonin, E.V. 2000. The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28: 33–36.
Todd, A.E., Orengo, C.A., and Thornton, J.M. 2001. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307: 1113–1143.
Vervoort, J., Heering, D., Peelen, S., and van Berkel, W. 1994. Flavodoxins. Methods Enzymol. 243: 188–203.
Volz, K. 1993. Structural conservation in the CheY superfamily. Biochemistry 32: 11741–11753.
Wallace, A.C., Borkakoti, N., and Thornton, J.M. 1997. TESS: A geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci. 6: 2308–2323.
Wilmanns, M., Priestle, J.P., Niermann, T., and Jansonius, J.N. 1992. Three-dimensional structure of the bifunctional enzyme phosphoribosylanthranilate isomerase: Indoleglycerolphosphate synthase from Escherichia coli refined at 2.0 Å resolution. J. Mol. Biol. 223: 477–507.
Yang, F., Gustafson, K.R., Boyd, M.R., and Wlodawer, A. 1998. Crystal structure of Escherichia coli HdeA. Nat. Struct. Biol. 5: 763–764.
Zarembinski, T.I., Hung, L.W., Mueller-Dieckmann, H.J., Kim, K.K., Yokota, H., Kim, R., and Kim, S.H. 1998. Structure-based assignment of the biochemical function of a hypothetical protein: A test case of Structural Genomics. Proc. Natl. Acad. Sci. 95: 15189–15193.