Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, London NW7 1AA, UK
Requests for reprints to: Alex C.W. May, Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK; e-mail: amay{at}nimr.mrc.ac.uk; fax: 44 (0) 20 8816 2460.
(RECEIVED April 18, 2002; FINAL REVISION August 6, 2002; ACCEPTED September 23, 2002)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0211202.
| Abstract |
|---|
Keywords: Sequence motif; multiple sequence alignment; optimal segmentation; protein homologous family; protein superfamily
| Importance of sequence alignment |
|---|
| Relationships between protein sequence and 3D structure |
|---|
| Earlier methods to partition sequence alignments |
|---|
| Results and Discussion |
|---|
0.003042). This is surprising, given that, by definition, the HOMSTRAD families comprise closely related proteins while, of course, the CAMPASS superfamilies consist of distantly related ones.
|
0.5459 (Fig. 1
0.7125 (Fig. 2
0.537).
0.1415 (Fig. 3
0.3658 (Fig. 4
0.131).
Consider a pairwise alignment comprising 10 positions of alternating nonidentity and identity. The information-theoretical entropy (Shenkin et al. 1991) profile thus consists of alternating values of 1.00 and 0.00, respectively. According to the criterion for choice of optimal segmentation specified in Materials and Methods, the optimal number of partitions for this contrived sequence alignment is 10 (data not shown). It therefore seems likely that the HOMSTRAD and CAMPASS optimal segmentation data (Table 1) are meaningful.
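The contrived entropy profile can be reproduced with a short sketch. It assumes that the entropy at each aligned position is the Shannon entropy, in bits, of the residue frequencies in that column, which gives 1.00 for two different residues and 0.00 for two identical ones. The function names and the toy sequences are hypothetical, and gap handling is left out.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one aligned column."""
    counts = Counter(column)
    total = len(column)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def entropy_profile(sequences):
    """Entropy at each aligned position of equal-length aligned sequences."""
    return [column_entropy(column) for column in zip(*sequences)]


# Hypothetical pairwise alignment: nonidentity and identity alternate.
seq_a = "AGAGAGAGAG"
seq_b = "GGGGGGGGGG"
profile = entropy_profile([seq_a, seq_b])
print(profile)
```

For a pairwise alignment the entropy can only take the two values seen here; with more sequences per column it varies continuously between 0 and the logarithm of the number of distinct residues.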
The "jumbling" test is a standard approach to estimate the significance of the optimal alignment score for two protein sequences (for a review, see Doolittle 1986). The sequences are repeatedly randomly reordered ("jumbled") and then aligned to generate a distribution of scores for the pair. The significance of the score of the real alignment can then be expressed as the familiar Z-score, that is, how far (in units of the SD of the distribution) and in what direction the score of the real alignment lies from the mean of the distribution. The "jumbling" test for optimal segmentation of an alignment consists here of 100 random reorderings of the aligned position entropies, each followed by optimal segmentation. The idea is to test whether there is an intrinsic algorithmic bias for the optimal number of segments for an alignment irrespective of the linear order of aligned positions. The linear association between the optimal number of segments for the real alignment and the mean optimal number of segments for the 100 "jumbled" alignments is poor (HOMSTRAD linear correlation coefficient r = -0.0108, P ≤ 0.8767 [Fig. 5]; CAMPASS linear correlation coefficient r = 0.3033, P ≤ 0.01134 [Fig. 6]; there is no significant difference between the values of r, P ≤ 0.022).

(The smallest mean optimal number of segments for the 100 "jumbled" alignments for a HOMSTRAD family = 2.8, while that for a CAMPASS superfamily = 3.0. The largest mean optimal number of segments for the 100 "jumbled" alignments for a HOMSTRAD family = 3.5, while that for a CAMPASS superfamily = 3.6. The coefficient of variation (CV) is a measure of relative spread: it is defined as the SD expressed as a percentage of the mean. The smallest CV of the optimal number of segments for the 100 "jumbled" alignments for a HOMSTRAD family = 17.2, while that for a CAMPASS superfamily = 20.0. The largest CV for a HOMSTRAD family = 70.6, while that for a CAMPASS superfamily = 83.3. The mean [±SD] CV for a HOMSTRAD family = 35.1 [7.1], while that for a CAMPASS superfamily = 36.1 [11.9]. Clearly, there is usually a wide dispersion in the optimal number of segments for the 100 "jumbled" alignments for both a HOMSTRAD family and a CAMPASS superfamily.)

The linear order of aligned positions is indeed found to be important for HOMSTRAD and CAMPASS, so optimal partitioning of their alignments (Table 1) is useful.
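The mechanics of the "jumbling" test, the Z-score and the CV can be sketched as follows. This is only an illustration under stated assumptions: the statistic used here (`roughness`, the sum of squared differences between neighbouring positions) is a hypothetical stand-in for the optimal number of segments, the sample SD is assumed, and the fixed seed is there only to make the run repeatable.

```python
import random
import statistics


def jumble_scores(values, statistic, n_shuffles=100, seed=0):
    """Apply `statistic` to `n_shuffles` random reorderings of `values`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_shuffles):
        shuffled = list(values)
        rng.shuffle(shuffled)
        scores.append(statistic(shuffled))
    return scores


def z_score(real_score, scores):
    """Distance of the real score from the jumbled mean, in SD units."""
    return (real_score - statistics.mean(scores)) / statistics.stdev(scores)


def coefficient_of_variation(scores):
    """SD expressed as a percentage of the mean."""
    return 100.0 * statistics.stdev(scores) / statistics.mean(scores)


def roughness(values):
    """Stand-in statistic (an assumption): squared steps between neighbours."""
    return sum((b - a) ** 2 for a, b in zip(values, values[1:]))


# Hypothetical entropy profile with one conserved and one variable block.
profile = [0.0] * 10 + [1.0] * 10
scores = jumble_scores(profile, roughness)
z = z_score(roughness(profile), scores)
cv = coefficient_of_variation(scores)
print(round(z, 2), round(cv, 1))
```

In practice the `statistic` argument would be a function that segments the reordered profile optimally and returns the chosen number of segments.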
Protein 3D structure clues from the tempo of sequence diversity across alignments
Anfinsen and colleagues first demonstrated that the information needed to specify the 3D structure of a protein resides in its amino acid sequence (the principle of self-assembly; for a review, see Anfinsen 1973). This means that the tempo and organization of sequence variability across a meaningful multiple alignment will hold clues as to the nature of the shared 3D structure. On this basis, it seems useful to cluster protein families (superfamilies) with the same number of segments after optimal partitioning (Table 1) according to similarity in corresponding segment relative length, segment mean relative sequence variability and a measure of the spread of segment relative sequence variability, SD. Each family (superfamily) is represented as a vector, the number of components of which is (optimal number of segments) × 3. In other words, each segment is represented by the three values above. The dissimilarity between a pair of families is defined as the Euclidean distance between the respective vectors. The most widely used algorithm for obtaining a hierarchical classification (Romesburg 1984), the unweighted pair-group method using arithmetic averages (UPGMA), is used here. As in May (1999a, 1999b), the support for the clusters defined by a tree is assessed with a jackknife test (Lanyon 1985). Thus, reliable internal nodes of a tree are identified (Table 2). Two questions can be asked about the meaningful family (superfamily) groupings. First, how many of them comprise families (superfamilies) of a common structural class? Second, how many of them consist of families (superfamilies) with a common fold? Class and fold assignments are according to SCOP (Murzin et al. 1995). Twenty of the 209 HOMSTRAD families comprise proteins consisting of more than one SCOP domain while all 69 CAMPASS superfamilies comprise single SCOP domains.
|
α and β proteins [α/β]; one all-β proteins); the other consists of three families (all-β proteins). (Twenty-two of the 87 families here are α and β proteins [α/β] while 12 are all-β proteins.) The single reliable family cluster with the same fold is made up of two families: immunoglobulin domain–C1 set: constant nonimmunoglobulin (N = 5; number of aligned positions = 102; vector = [0.24, 0.56, 0.21, 0.76, 0.44, 0.24]) and Cu/Zn superoxide dismutase (N = 7; number of aligned positions = 169; vector = [0.25, 0.57, 0.21, 0.75, 0.42, 0.26]). The Euclidean distance between the two vectors is 0.03. The common fold between these all-β proteins is the immunoglobulin-like β-sandwich (4 of the 87 families here share this fold). The most typical sequence within a family can be defined as that sequence accorded the lowest weight by a sequence weighting method (May 2001). (A standard weighting scheme, that of Henikoff and Henikoff [1994], is used here.) A pairwise global sequence alignment between the most typical sequences of the two families (1hsaa and 1cbja, respectively; protein names are PDB codes) yields only 15.8% identity (data not shown). Clearly, there is only a distant sequence relationship between the two families.
The only other instance of a reliable family cluster comprising families with a common fold is that obtained after hierarchical classification of the 92 HOMSTRAD families partitioned optimally into three segments (Table 2). This group comprises two families: FMN-linked oxidoreductases (N = 2; number of aligned positions = 378; vector = [0.79, 0.66, 0.48, 0.08, 0.21, 0.41, 0.13, 0.73, 0.45]), and glycosyl hydrolases family 17 (N = 2; number of aligned positions = 310; vector = [0.74, 0.54, 0.50, 0.19, 0.24, 0.43, 0.08, 0.67, 0.48]). The Euclidean distance between the two vectors is 0.19. The shared fold between these α and β proteins (α/β) is the TIM β/α-barrel (4 of the 92 families here share this fold). By definition, a most typical sequence cannot be identified for a pair of sequences. However, the maximum identity obtained for the four interfamily pairwise global sequence alignments is only 17.3% (data not shown). Again, then, there is only a remote sequence relationship between the two families.
Seventeen of the 21 superfamily clusters are reliable for the 22 CAMPASS superfamilies partitioned optimally into two segments (Table 2). Of these 17 groups, one consists of superfamilies of the same class (α and β proteins [α/β]; 10 of the 22 superfamilies here are α and β proteins [α/β]). The single reliable superfamily cluster with the same class comprises two superfamilies: flavodoxin-like (N = 7; number of aligned positions = 192) and periplasmic binding type I (domain 2) (N = 6; number of aligned positions = 196). Although the folds of the two superfamilies do not coincide in SCOP, both are three-layer (α/β/α) sandwiches, that is, α-helices on both sides of a β-sheet. Both β-sheets are parallel, consisting of five (order 21345) and six strands (order 213456), respectively. A similar situation occurs for one of the two reliable superfamily clusters of the same class for the 35 CAMPASS superfamilies segmented optimally into three segments (Table 2): β-glucosyltransferase–glycogen phosphorylase (N = 2; number of aligned positions = 830) and cytidine deaminase domains (N = 2; number of aligned positions = 120). The β-glucosyltransferase–glycogen phosphorylase fold comprises two nonsimilar domains with three layers (α/β/α) each, while that for the cytidine deaminase domains is a single three-layer (α/β/α) sandwich (10 of the 35 superfamilies here are α and β proteins [α/β]).
A further test of the significance of the two HOMSTRAD reliable family clusters with a common fold is to combine the HOMSTRAD and CAMPASS data for each optimal number of segments (Table 1). One of the 22 superfamilies with optimal segmentation into two has the immunoglobulin-like β-sandwich fold: the immunoglobulins. Thus, the combined data for optimal number of segments = 2 comprise 109 alignments (Table 1), of which five have the immunoglobulin-like β-sandwich fold. Fifty-three of the 108 family/superfamily clusters here are reliable (Lanyon 1985), including that described above comprising two HOMSTRAD families sharing the immunoglobulin-like β-sandwich fold (data not shown). Two of the 35 superfamilies with optimal partitioning into three have the TIM β/α-barrel fold. Thus, the combined data for optimal number of segments = 3 consist of 127 alignments (Table 1), of which six have the TIM β/α-barrel fold. Fifty-four of the 126 family/superfamily clusters here are reliable (Lanyon 1985), including that described above consisting of two HOMSTRAD families sharing the TIM β/α-barrel fold (data not shown).
| Conclusion |
|---|
0.003042). Further, optimal segmentation identifies an unusual protein superfamily: nitrogenase iron protein-like (N = 3). Here, the optimal segmentation of the 437 aligned positions is into two: a short N-terminal partition with high mean relative variability, and a C-terminal long one with lower mean relative variability.
Of course, it must be remembered that selection of members for protein families within HOMSTRAD and that for protein superfamilies within CAMPASS is by hand (for a discussion of this problem, see May 1999b). So, there is sampling bias due to human choice of proteins. Sequence weighting (for a review, see Durbin et al. 1998) could be used to address this problem. In fact, lack of sequence weighting might explain why there are only two reliable HOMSTRAD family clusters with the same class and same fold (Table 2). After all, a guiding principle of comparative sequence analysis is that it is possible to obtain protein 3D structure clues from the tempo of sequence diversity across alignments. (Similarly, it will be useful to apply this method to many more sequence alignments of multi-domain proteins to examine relationships between sequence and domain organization for different numbers of domains.) A related issue is the stability of the optimal segmentations (Table 1): a bootstrap or jackknife method could be used to assess this.
This article highlights the need to look further than a "headline" figure of percentage sequence identity and to ask the question: What is the underlying distribution of the identical amino acids within a given sequence alignment?
It is proposed that use of the automatic method described here to partition optimally sequence alignments will add value to comparative sequence analysis. Further, given a protein that undergoes a conformational change (e.g., that which might occur on ligand binding) and the availability of 3D structures of the relevant states, the approach here could be used to decompose the protein into elements that move essentially as rigid bodies during the conformational change. Here the data for optimal zonation would be the inter-CA atom distances after rigid-body superposition of the 3D structures. Of course, such an approach might also be useful to monitor the dynamics of a molecule during a computer simulation. Another application lies in the analysis of the mobility of the atoms within a protein 3D structure determined by X-ray crystallography. Regions of a protein molecule with low/high mobility are usually identified by eye from a plot of mean temperature factor for all atoms in each amino acid versus residue number. Automation of such an approach in this way would have obvious advantages. Evidently, there are many possible areas of comparative sequence and 3D structure analysis where this method could be helpful.
The results of this analysis are available on the Web at http://mathbio.nimr.mrc.ac.uk/
amay.
| Materials and methods |
|---|
The problem is this: how to partition optimally the N aligned positions within a sequence alignment into all predefined numbers of subsets k, where 1 ≤ k ≤ N. A measure of the variability of a segment is the sum of squared deviations of the constituent points about their mean. Thus, a homogeneous segment has a low sum of squared deviations about its mean. Defining the total within-segment variability W by summation for k segments, the globally optimal partitioning into k segments is that which minimizes W. The Hawkins and Merriam (1973) method identifies all optimal sets of k segments covering all N points. For a segment comprising points i, i + 1, . . ., j let r(i, j) denote the sum of squared deviations of these points about their mean. Let Fj(m) be the within-segment sum of squared deviations when an optimal j-segment covering of points 1 to m is made. Of course, the 1-segment coverings are unique so:
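$$F_1(m) = r(1, m), \qquad 1 \le m \le N$$

and, for more than one segment,

$$F_j(m) = \min_{j \le n \le m} \left[ F_{j-1}(n - 1) + r(n, m) \right]$$

The formula printed by Hawkins and Merriam (1973) is inconsistent with the description of their algorithm. In fact, Hawkins agrees, and has confirmed that the correct equation is as above (personal communication).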
Thus, the value of W is deduced as Fk(N). The boundaries of the optimal segments are obtained from a "trace-back". Of course, the most appropriate value of k for an alignment is not known a priori. However, k can be inferred automatically in the following way. The improvement in W on addition of another segment is expressed in terms of "explained" variation: the percent reduction in W with respect to W when k = 1, that is, no segmentation. The optimal value of k is taken as that value of k that shows the largest increase in "explained" variation with respect to k - 1 (Everitt 1993). By this criterion, the optimal value of k for the alignment in Figure 7 is k = 3 (Table 3).
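The procedure described above can be sketched as a small dynamic program. This is a minimal illustration under stated assumptions, not the author's implementation: the recursion used is the standard one for optimal contiguous partitioning (the last segment starts at point n and the preceding j - 1 segments optimally cover points 1 to n - 1), all function names are hypothetical, ties are broken in favour of the smaller k, and k = 1 is returned only when the profile has no variation at all.

```python
import math


def make_r(values):
    """Return r(i, j): sum of squared deviations of points i..j (1-based)."""
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def r(i, j):
        total = s1[j] - s1[i - 1]
        # Clamp tiny negative values caused by floating-point rounding.
        return max(0.0, (s2[j] - s2[i - 1]) - total * total / (j - i + 1))

    return r


def optimal_partitions(values):
    """Fill F[j][m] and the trace-back table for every j and m."""
    N = len(values)
    r = make_r(values)
    F = [[math.inf] * (N + 1) for _ in range(N + 1)]
    start = [[0] * (N + 1) for _ in range(N + 1)]
    for m in range(1, N + 1):
        F[1][m] = r(1, m)  # the 1-segment covering is unique
        start[1][m] = 1
    for j in range(2, N + 1):
        for m in range(j, N + 1):
            for n in range(j, m + 1):  # n = first point of the last segment
                cost = F[j - 1][n - 1] + r(n, m)
                if cost < F[j][m]:
                    F[j][m] = cost
                    start[j][m] = n
    return F, start


def trace_back(start, k, N):
    """Recover the (first, last) points of the k optimal segments."""
    segments = []
    m = N
    for j in range(k, 0, -1):
        n = start[j][m]
        segments.append((n, m))
        m = n - 1
    return segments[::-1]


def choose_k(F, N):
    """Pick k with the largest gain in 'explained' variation over k - 1."""
    W1 = F[1][N]
    if N < 2 or W1 == 0.0:
        return 1
    explained = {k: 100.0 * (W1 - F[k][N]) / W1 for k in range(1, N + 1)}
    return max(range(2, N + 1), key=lambda k: explained[k] - explained[k - 1])


# Contrived profile from Results: alternating nonidentity and identity.
alternating = [1.0, 0.0] * 5
F_alt, _ = optimal_partitions(alternating)
k_alt = choose_k(F_alt, len(alternating))

# A hypothetical two-block profile.
step = [0.0] * 4 + [1.0] * 4
F_step, start_step = optimal_partitions(step)
k_step = choose_k(F_step, len(step))
segments_step = trace_back(start_step, k_step, len(step))
print(k_alt, k_step, segments_step)
```

As written the sketch takes time proportional to N cubed, which is adequate for alignments of a few hundred positions. For the alternating profile it selects 10 segments, in line with the contrived example discussed in Results.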
Comparison of protein homologous families and superfamilies
Protein sequences are usually grouped into homologous families (i.e., proteins related by divergent evolution from a common ancestor) on the basis of "significant" sequence identity. For example, the SCOP classification (Murzin et al. 1995) defines family membership operationally on the basis of sequence identity ≥ 30%. Of course, homologous proteins also share a common fold and usually function. Superfamilies consist of proteins that, despite having "low" sequence identities, are probably distantly related by descent with modification. Similar 3D structure and function suggest common ancestry here. An advantage of sequence identity as a similarity measure between proteins is its transparency. By contrast, percent sequence similarity values are often quoted without stating the amino acid classification on which they are based. Sequence identity is unambiguous in that way. However, it must be remembered that sequence identity is only a count of residue identities in two sequences (cf. the Hamming distance), and does not say anything about the distribution of the matches as specified by an alignment.
Reliable 3D structure-based sequence alignments: HOMSTRAD and CAMPASS
The Blundell group has compiled two databases of protein 3D structure alignments. The first, HOMSTRAD (Mizuguchi et al. 1998), is a database of aligned 3D structures of homologous proteins. The HOMSTRAD families comprise what the authors describe as "representative members." The second, CAMPASS (Sowdhamini et al. 1998), is a database of aligned 3D structures of superfamilies. An advantage of these databases is that not only are the sequence alignments based on 3D structure but also they are of very high quality. The availability of these alignments allows investigation of the tempo of sequence diversity across meaningful alignments of proteins related at different levels of sequence identity. Here I use the November 1999 full release of HOMSTRAD (comprising 1193 proteins organized into 209 families; the mean [±SD] number of structures per family is 5.7 [4.9]), and the currently available release of CAMPASS (consisting of 288 protein domains grouped into 69 superfamilies; the mean [±SD] number of structures per superfamily is 4.2 [2.9]). Interestingly, the HOMSTRAD and CAMPASS family (superfamily) size distributions are very similar (linear correlation coefficient r = 0.9363, P ≤ 0.0000001; data not shown). The mean (±SD) number of aligned positions for the 209 HOMSTRAD families is 249.7 (178.3), while that for the 69 CAMPASS superfamilies is 248.8 (160.5). One assumption made by Student's t-test for a significant difference between two means is that the two distributions share the same variance. Clearly, the variances here are not identical and so Student's t-test is not appropriate. A nonparametric alternative is the Wilcoxon signed rank test that, by definition, makes no assumptions about the two distributions. The Wilcoxon test tests two distributions for equality. According to the Wilcoxon test, there is not a significant difference between the HOMSTRAD and CAMPASS size distributions (P ≤ 0.6572). Two hundred of the 209 HOMSTRAD families have a mean sequence identity of >30% (the mean [±SD] mean sequence identity is 41.18% [9.51]) while all 69 CAMPASS superfamilies have a mean sequence identity of <25% (the mean [±SD] mean sequence identity is 15.00% [3.21]). (Sequence identity is defined here over the multiply aligned positions only. This is because if a gap were considered as a different residue type then it is not clear how to treat matching gap characters in terms of sequence identity for a given pair; May 2001.) However defined, sequence identity does not indicate the nature of the gaps within an alignment. The ratio sequence length:alignment length reflects the relative number of gaps included in a sequence for alignment. As expected, although the mean numbers of aligned positions for HOMSTRAD and CAMPASS are very similar, there is a difference between the two in terms of gap density within the alignments. In general, the CAMPASS alignments contain more gaps: only 5.3% (11) of the 209 HOMSTRAD families have a mean value of the ratio sequence length:alignment length <0.7 while 33.3% (23) of the 69 CAMPASS superfamilies do.
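The two alignment summaries used above, mean sequence identity and the ratio sequence length:alignment length, can be sketched as follows. The sketch assumes that "multiply aligned positions" means columns in which no sequence has a gap, that gaps are written as `-`, and that family identity is the mean over all sequence pairs; the function names and the toy alignment are hypothetical.

```python
from itertools import combinations

GAP = "-"


def mean_pairwise_identity(alignment):
    """Mean percent identity over columns in which no sequence has a gap."""
    columns = [col for col in zip(*alignment) if GAP not in col]
    if len(alignment) < 2 or not columns:
        raise ValueError("need at least two sequences and one ungapped column")
    identities = []
    for i, j in combinations(range(len(alignment)), 2):
        matches = sum(col[i] == col[j] for col in columns)
        identities.append(100.0 * matches / len(columns))
    return sum(identities) / len(identities)


def length_ratio(aligned_sequence):
    """Ratio sequence length : alignment length for one aligned sequence."""
    residues = sum(ch != GAP for ch in aligned_sequence)
    return residues / len(aligned_sequence)


# Hypothetical three-sequence alignment with two gapped columns.
toy_alignment = ["ACDE-G", "ACDEFG", "A-DQFG"]
identity = mean_pairwise_identity(toy_alignment)
ratios = [length_ratio(seq) for seq in toy_alignment]
print(round(identity, 1), [round(x, 2) for x in ratios])
```

A ratio well below 1 indicates that a sequence carries many gap characters in the alignment, which is the sense in which the CAMPASS alignments are described as more gapped.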
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Anfinsen, C.B. 1973. Principles that govern the folding of protein chains. Science 181: 223–230.
Axe, D.D. 2000. Extreme functional sensitivity to conservative amino acid changes on enzyme exteriors. J. Mol. Biol. 301: 585–595.
Bement, T.R. and Waterman, M.S. 1977. Locating maximum variance segments in sequential data. Math. Geol. 9: 55–61.
Blundell, T.L., Sibanda, B.L., Sternberg, M.J.E., and Thornton, J.M. 1987. Knowledge-based prediction of protein structures and the design of novel molecules. Nature 326: 347–352.
Chothia, C. and Lesk, A.M. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5: 823–826.
Doolittle, R.F. 1986. Of urfs and orfs: A primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley, CA.
Durbin, R., Eddy, S.R., Krogh, A., and Mitchison, G. 1998. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK.
Everitt, B. 1993. Cluster analysis. E. Arnold, London.
Gordon, A.D. 1996. A survey of constrained classification. Comput. Stat. Data Anal. 21: 17–29.
Hawkins, D.M. and Merriam, D.F. 1973. Optimal zonation of digitized sequential data. Math. Geol. 5: 389–395.
Henikoff, S. and Henikoff, J.G. 1994. Position-based sequence weights. J. Mol. Biol. 243: 574–578.
Herrmann, G., Schon, A., Brack-Werner, R., and Werner, T. 1996. CONRAD: A method for identification of variable and conserved regions within proteins by scale-space filtering. Comput. Appl. Biosci. 12: 197–203.
Heymann, J.B. and Engel, A. 2000. Structural clues in the sequences of the aquaporins. J. Mol. Biol. 295: 1039–1053.
Hofmann, K., Bucher, P., Falquet, L., and Bairoch, A. 1999. The PROSITE database, its status in 1999. Nucleic Acids Res. 27: 215–219.
Johnson, M.S., May, A.C.W., Rodionov, M.A., and Overington, J.P. 1996. Discrimination of common protein folds: Application of protein structure to sequence/structure comparisons. Methods Enzymol. 266: 575–598.
Jonassen, I., Eidhammer, I., and Taylor, W.R. 1999. Discovery of local packing motifs in protein structures. Proteins 34: 206–219.
Lanyon, S.M. 1985. Detecting internal inconsistencies in distance data. Syst. Zool. 34: 397–403.
May, A.C.W. 1996. Pairwise iterative superposition of distantly related proteins and assessment of the significance of 3-D structural similarity. Protein Eng. 9: 1093–1101.
May, A.C.W. 1999a. A cautionary note on interpretation of hierarchical classifications of protein folds. Struct. Fold. Des. 7: R213.
May, A.C.W. 1999b. Toward more meaningful hierarchical classification of protein three-dimensional structures. Proteins 37: 20–29.
May, A.C.W. 2001. Optimal classification of protein sequences and selection of representative sets from multiple alignments: Application to homologous families and lessons for structural genomics. Protein Eng. 14: 209–217.
Mizuguchi, K., Deane, C.M., Blundell, T.L., and Overington, J.P. 1998. HOMSTRAD: A database of protein structure alignments for homologous families. Protein Sci. 7: 2469–2471.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.
Romesburg, H.C. 1984. Cluster analysis for researchers. Lifetime Learning Publications, Belmont, CA.
Shenkin, P.S., Erman, B., and Mastrandrea, L.D. 1991. Information-theoretical entropy as a measure of sequence variability. Proteins 11: 297–313.
Smith, T.F. 1999. The art of matchmaking: Sequence alignment methods and their structural implications. Struct. Fold. Des. 7: R7–R12.
Sowdhamini, R., Burke, D.F., Huang, J.F., Mizuguchi, K., Nagarajaram, H.A., Srinivasan, N., Steward, R.E., and Blundell, T.L. 1998. CAMPASS: A database of structurally aligned protein superfamilies. Structure 6: 1087–1094.
Stojanovic, N., Florea, L., Riemer, C., Gumucio, D., Slightom, J., Goodman, M., Miller, W., and Hardison, R. 1999. Comparison of five methods for finding conserved sequences in multiple alignments of gene regulatory regions. Nucleic Acids Res. 27: 3899–3910.