1 Department of Mathematics, Rutgers University, Piscataway, New Jersey 08854, USA
2 Institute of Mathematical Problems of Biology, RAS, Pushchino, Moscow Region 142292, Russia
3 MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, United Kingdom
4 Oncological Scientific Center of Russia, Moscow 115478, Russia
Reprint requests to: Alexander E. Kister, Department of Mathematics, Rutgers University, Piscataway, NJ 08854, USA; e-mail: akister{at}math.rutgers.edu; fax: 732-445-55-30.
(RECEIVED August 29, 2000; FINAL REVISION January 23, 2001; ACCEPTED June 11, 2001)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1101/ps.37001.
| Abstract |
|---|
Keywords: Classic cadherins; cell adhesion molecules; method for protein family recognition; sequence comparison/classification
| Introduction |
|---|
Cadherins are a group of proteins essential for the formation of stable specialized cell-cell contacts, that is, adherent contacts in various tissues, and therefore for the organization of these tissues and organs. Cadherins are found in many types of animals, ranging from nematodes to humans. Humans and other vertebrates have several classes of cadherins, each class being characteristic of a group of tissues (Takeichi 1991, 1995; Gumbliner 1996; Suzuki 1996; Gallin 1998; Shapiro and Colman 1999). For example, E-cadherins are specific for epithelial tissues, P-cadherins are found in placenta and other tissues, and N-cadherins are typical of neural and mesenchymal tissues.
The cadherin-like family comprises five subfamilies: classic cadherins of types I and II, desmosomal cadherins, protocadherins, and cadherin-related proteins (Koch et al. 1999). In this work, we focus on the classic cadherins. The classic cadherins are transmembrane glycoproteins with five extracellular domains, a single membrane-spanning domain, and a single cytoplasmic domain, which is linked to actin microfilaments via several linker proteins such as β-catenin and α-catenin. Cell-cell contacts are formed by homophilic adhesion of the external N-terminal domains of cadherin molecules on the surface of one cell with the corresponding domains of cadherin molecules on another cell. Cadherin adhesion is calcium dependent. Within the extracellular region of cadherins, Ca2+ ions bind between domains to produce a rigid link. In the absence of calcium, these domains display excessive motions relative to one another and stable adhesions cannot be formed.
The goal of this work is to find the sequence determinants: the residues that occupy the conserved positions in classic cadherins. To describe the sequence determinants, we extend here the methods of sequence and structural analysis that were developed in our previous works (Gelfand and Kister 1995; Chothia et al. 1998). We show that the sequence determinants can serve as patterns of the classic cadherins. We suggest a new method of protein identification that is based on pattern recognition in sequences. Using this method, we were able to distinguish sequences of the classic cadherins in the SWISS-PROT database.
The currently known structures for the first and the second domains show that they have the same overall immunoglobulin-like fold (Shapiro et al. 1995; Overduin et al. 1995; Nagar et al. 1996; Pertz et al. 1999). However, the three-dimensional structures of the third, fourth, and fifth domains are unknown. The multialignment of the sequences of all five domains revealed the common conserved positions for the extracellular part of the classic cadherins. The discovery of common sequence determinants supports the idea that all the extracellular domains share the immunoglobulin-like structure of the N-terminal domain.
In the second part of this work, we show the possibility of predicting the secondary structure of proteins based on the results of the sequence multialignment. We focus on the analysis of the cytoplasmic part of cadherins, whose X-ray structures are unknown. The multialignment of sequences of a protein family that have no strong homology forces one to introduce insertions and deletions to make the sequences align. As a rule, these gaps in sequences correspond to the beginning or end of secondary structural units: strands, helices, or loops. On the basis of this observation and of the results of the sequence multialignment of the cytoplasmic part, we propose a model for the secondary structures of the cytoplasmic domains of cadherins.
| Methods and Results |
|---|
90-100 amino acids, which form seven β-strands. According to the accepted classification of the immunoglobulin fold, the seven successive strands are termed A', B, C, D, E, F, and G, and the loops between them are named, respectively, A'B, BC, CD, DE, EF, EF', and FG (Chothia and Jones 1997). Strands B, E, and D make up one sheet, and strands A', C, F, and G make up another (Fig. 1).
coordinates, H bonds, and accessibility values (for details, see Gelfand and Kister 1995). As long as multialignments performed in several ways give the same results, we can be retroactively assured that the division of sequences into secondary structure units was essentially accurate. Nonetheless, it is clear that one cannot be absolutely sure where the border between two secondary structure units lies. We therefore separately studied such borderline regions for the presence of conserved positions. Our analysis shows that conserved positions are rarely, if ever, found at the very periphery of strands or loops (Gelfand and Kister 1997). It appears therefore that the lack of absolute precision in secondary structure definition has very little effect on the final result.
To classify the conservation of residues, we collected from the various structures all the amino acid fragments that correspond to each of the strands or loops. Alignment was conducted separately for each set of amino acid fragments that describe a particular strand or loop. In our approach, the amino acid sequences of the aligned fragments are given the term "word" (Gelfand and Kister 1995). From this alignment, each residue in a sequence is assigned to a position in a word. Residues in sequences are referred to by an index that contains the letter code of the word and the position of the residue therein. For example, A'1 is the address of the first residue in the A' word. Describing residues with the two-part index gives us a common system of numbering for various cadherin sequences. It allows us to compare residue occupation at each position for various sequences and to determine residue conservation at all positions.
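The two-part index can be illustrated with a short Python sketch. The word lengths and the toy fragment below are hypothetical placeholders rather than the alignment used in this work; only the addressing convention (word name followed by a 1-based position) follows the text, and gaps within words are ignored for simplicity.

```python
def assign_addresses(sequence, words):
    """Map each residue of `sequence` to a two-part address such as "A'1".

    `words` is an ordered list of (word_name, length) pairs that together
    cover the sequence from its first residue. Positions inside a word
    are 1-based, as in the text.
    """
    addresses = {}
    offset = 0
    for name, length in words:
        for pos in range(1, length + 1):
            addresses[f"{name}{pos}"] = sequence[offset + pos - 1]
        offset += length
    if offset != len(sequence):
        raise ValueError("word lengths must cover the whole sequence")
    return addresses


# Hypothetical toy input: nine residues split into a strand, a loop, a strand.
toy = assign_addresses("VSENEKGPF", [("A'", 5), ("A'B", 2), ("B", 2)])
print(toy["A'1"], toy["A'5"], toy["B2"])
```

With such a mapping, residues from different sequences that share an address can be compared directly, which is what the frequency analysis in the next section relies on.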
Residue conservation: Patterns of strands and loops of the N-terminal domain
The first step toward defining characteristic patterns of cadherin strands and loops consists of the analysis of residue frequencies at all positions of words. This analysis reveals the nature and extent of residue conservation at each position. Following the classification of residue conservation in immunoglobulins suggested in our previous article (Chothia et al. 1998), we divided residues into three groups: (1) V, L, I, M, A, F, W, and C; (2) R, K, E, D, Q, and N; and (3) P, H, Y, G, S, and T. This classification is based on two properties: hydrophobicity and the tendency to be on the surface or in the interior of a protein.
Inspection of residue frequencies showed that six positions are occupied by a single residue in almost all sequences, and 23 have only a few chemically similar residues from the same group (Table 2). For example, a Glu (E) residue is found at position A'5 in all known cadherin sequences (Table 1). These 29 positions are considered to be the conserved positions. The other 66 positions in sequences are variable. They can be occupied by residues from various groups.
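The classification of positions described above can be sketched in Python as follows. The three residue groups are taken from the text; the function names, the toy columns, and the 0.9 cutoff standing in for "almost all sequences" are assumptions made for illustration only.

```python
from collections import Counter

# The three residue groups listed in the text.
GROUPS = {
    1: set("VLIMAFWC"),
    2: set("RKEDQN"),
    3: set("PHYGST"),
}


def group_of(residue):
    """Return the group number (1, 2, or 3) of a one-letter residue code."""
    for label, members in GROUPS.items():
        if residue in members:
            return label
    raise ValueError(f"unknown residue: {residue!r}")


def classify_position(column, threshold=0.9):
    """Classify one alignment column (a string of residues, one per sequence).

    `threshold` is an assumed cutoff for "almost all sequences".
    """
    top_residue, top_count = Counter(column).most_common(1)[0]
    if top_count / len(column) >= threshold:
        return f"conserved residue {top_residue}"
    top_group, group_count = Counter(group_of(r) for r in column).most_common(1)[0]
    if group_count / len(column) >= threshold:
        return f"conserved group {top_group}"
    return "variable"


# Hypothetical toy columns, ten sequences each.
print(classify_position("EEEEEEEEEE"))
print(classify_position("VLIVVLIMVA"))
print(classify_position("VKGEPLTRSA"))
```

A column dominated by one residue is reported first, a column confined to one group second, and everything else is treated as variable.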
Secondary and three-dimensional structure prediction for five extracellular domains
For most molecules in the cadherin family, the three-dimensional structure is unknown. However, for these proteins it is possible to make secondary structure predictions for all extracellular domains based on knowledge of the patterns of words in the first two domains. To determine the secondary structures of cadherin chains in all domains, we matched the patterns of domains I and II with the sequences of domains III, IV, and V. This analysis showed that the patterns of the N-terminal domains fit the sequences of all domains, which allowed us to divide the sequences of domains III, IV, and V into words. Because words describe secondary structural units, dividing a sequence of amino acids into words permits us to predict the secondary structure of a protein.
Because the alignment of cadherins was based on both sequence and structural information, it follows that residues at identical positions of the words have the same structural role in various molecules. Analysis of the structural role of residues involves determining residue-residue interactions, residue exposure on the surface, and residue coordinates in a coordinate system unified for a given protein family. As this preferred coordinate system we can use, for example, the coordinate system of any known structure of a cadherin molecule. Thus, it is possible to identify coordinates of residues for the extracellular domains of all analyzed cadherins. We suppose that the Cα atoms of the residues at the same positions of the words in various domains can be superimposed on each other.
Conserved positions in the strands and loops in five extracellular domains
Inspection of the sequences of different cadherins shows that the nature of residues and the extent of conservation vary greatly at various positions. For example, comparison of the sequences of human E- and K-cadherins shows that in domain I 32% of the residues are identical. Domains I and II of E-cadherins share only 25% identity. In comparison with the sequences of domains I and II, the sequences of domains III, IV, and V show no significant similarity (<20%).
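As a rough illustration of how such identity percentages can be obtained from a pairwise alignment, consider the Python sketch below. The text does not state how identity was normalized, so counting only columns in which neither sequence has a gap is an assumption, and the two aligned strings are hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percent of identical residues between two aligned sequences.

    Assumption: only columns where neither sequence has a gap are counted.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical aligned fragments: 7 identical residues in 9 compared columns.
print(round(percent_identity("VSENE-KGPF", "VTENEAKAPF"), 1))
```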
The alignment of the words allowed us to calculate the frequency of residues at every position in the words. Analysis of the residue frequency in the various domains showed that there are no positions that are occupied by a single type of residue in all domains. However, there are many positions where residue conservation was found in one or several domains but not in all five domains. For example, position A'5 is occupied by Glu in all sequences of domains I, II, and III, whereas in the sequences of domains IV and V Glu shares this position with Gln and Asp residues. Residues at the A'1 position are hydrophobic in all sequences of the first domain whereas in the second domain Gly and Ala are the most common residues. The D1 position can be considered as a conserved hydrophobic position in the first domain and conserved hydrophobic and aromatic position in domains II and IV, but a variable position in domains III and V. The residue conservation in the fifth domain differs in many cases from residue variations in the other domains.
The residues at the conserved positions for all strands and EF' loops in five extracellular domains are presented in Table 2. The comparison of the conserved positions in various domains revealed 15 extracellular conserved positions. All positions except one are occupied by hydrophobic residues in all five domains. The polar and charged residues are found at the A'5 position.
Buried and surface positions in cadherins
The role of residues at each position was determined from the examination of their accessible surface areas. To give an overview of the positions of residues, we calculated the accessible surface area (ASA) of residues in three structures: domain 1 of N-cadherins and domains 1 and 2 of E-cadherins (Table 3). ASA values are divided into groups 0, 1, 2, ..., 9, where 0 indicates an ASA in the range 0-9 Å², 1 indicates 10-19 Å², etc. Residues at 12 positions are buried in the protein interior in all structures (ASA groups in the range 0-2). Eight of these positions (A'3, B6, C3, D4, E3, EF''1, F3, F5) are hydrophobic and aromatic conserved positions at the center of the structure.
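The ASA grouping can be written down as a small Python sketch. The bin width of 10 Å² and the "groups 0-2 in all structures" criterion for buried positions come from the text; placing every value of 100 Å² or more in group 9 is an assumption, as are the function names and sample values.

```python
def asa_group(asa):
    """Return the ASA group (0-9) for an accessible surface area in Å².

    Group 0 covers 0-9 Å², group 1 covers 10-19 Å², and so on.
    Assumption: values of 100 Å² or more are placed in group 9.
    """
    if asa < 0:
        raise ValueError("ASA cannot be negative")
    return min(int(asa // 10), 9)


def is_buried(asa_values, max_group=2):
    """True if the residue falls in groups 0..max_group in every structure."""
    return all(asa_group(v) <= max_group for v in asa_values)


# Hypothetical ASA values for one position in three structures.
print(asa_group(12.5), is_buried([3.0, 12.5, 27.0]), is_buried([3.0, 35.0]))
```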
To assign a query sequence to its proper protein family, we need to find a match between residues at positions in the query sequences and the residues in the patterns of the words of family members. In fact, we need not know residues at all positions in the query sequence. The advantage of our approach is that it allows one to find a few of the class-determining positions that uniquely determine a family. We developed a new approach for assigning a protein to a protein family, which we applied for identification of classic cadherins.
Algorithm
A sequence in a protein family can be defined in terms of an ordered set of patterns of words. For each pattern of a word the following are determined: (1) number of positions in a given word; (2) conserved positions and the various sets of residues that can occupy these positions; (3) interval (a possible range of residues) to the next word in the sequence.
In the search procedure, we matched the patterns of words with a query sequence. To do this, we implemented an algorithm based on an appropriate modification of dynamic programming. The algorithm works as follows: patterns of all or several secondary structural units are matched with a query sequence in consecutive order, starting from the first pattern. First, we pick out those sequences of the database that contain a fragment that fits one of the known basic patterns describing the first (A') fragment of cadherins. Then, we again search the entire database, this time using patterns for the B fragment as our query patterns, and select sequences containing one of the B patterns. We continue this procedure with the patterns of the other words.
Results of the analysis are formulated in the following way: how many words (more precisely: fragments describable by cadherin patterns) are found in a given sequence. If in a sequence in question fragments are found that match with patterns of all, or almost every, cadherin word, then that sequence is considered to belong to the cadherin family.
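The search described above can be sketched in Python. This is a simplified greedy left-to-right scan and not the dynamic programming implementation used in this work; the class name, the toy patterns, and the toy sequences are hypothetical. Each pattern carries the three items listed earlier: word length, conserved positions with their allowed residues, and the allowed interval to the next word.

```python
from dataclasses import dataclass


@dataclass
class WordPattern:
    name: str
    length: int
    conserved: dict                # 1-based position -> allowed residues
    gap_to_next: tuple = (0, 10)   # (min, max) residues before the next word


def fits(pattern, seq, start):
    """True if the fragment of `seq` beginning at `start` matches the pattern."""
    if start + pattern.length > len(seq):
        return False
    return all(seq[start + pos - 1] in allowed
               for pos, allowed in pattern.conserved.items())


def count_matched_words(seq, patterns):
    """Count how many word patterns are found in `seq` in consecutive order.

    Simplification: the earliest fitting start is taken for each word, and
    the interval constraint is dropped after a word that was not found.
    """
    matched = 0
    lo, hi = 0, len(seq)
    for p in patterns:
        last_start = min(hi, len(seq) - p.length)
        hit = next((s for s in range(lo, last_start + 1) if fits(p, seq, s)), None)
        if hit is None:
            hi = len(seq)
            continue
        matched += 1
        end = hit + p.length
        lo, hi = end + p.gap_to_next[0], end + p.gap_to_next[1]
    return matched


# Hypothetical toy patterns and query sequences.
toy_patterns = [
    WordPattern("A'", 5, {5: "EQD"}, (0, 3)),
    WordPattern("B", 4, {2: "VLI", 4: "FY"}, (0, 5)),
]
print(count_matched_words("VSENEKGAVGFDD", toy_patterns))
print(count_matched_words("VSENEKGAAGGDD", toy_patterns))
```

A sequence would then be assigned to the family when the returned count reaches a chosen number of words. A dynamic programming search differs from this sketch in that it can revisit an early choice of start position when a later word fails to fit.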
Results of the analysis of sequences in SWISS-PROT database
We used patterns of eight words (A', B, C, D, E, EF', F, and G) of the first domain of classic cadherins in the search procedure (Table 2). The goal of this test is to show that these patterns are sufficient to identify the classic cadherins. We analyzed the sequences in SWISS-PROT (release 38, with 79,909 entries). The results of the analysis are presented in Table 4. Thirty sequences were found to contain all eight cadherin patterns; that is, there are eight fragments within these sequences that sequentially match the A', B, C, D, E, EF', F, and G patterns (the first row in the table). According to the description in SWISS-PROT, all of these proteins are classic cadherins.
In total, there are 43 sequences (30 + 6 + 7) in which the patterns of at least six words were found. Thirty-seven of these proteins are classic cadherins. The six other proteins, in which six or seven patterns were found, can be called false positives. These proteins are identified in SWISS-PROT as desmogleins and desmocollins. They are not classic cadherins but belong to the cadherin family, and they have sequence homology with classic cadherins. However, the patterns of the classic cadherins developed in this work mainly allow us to distinguish the classic cadherins from other cadherin-like proteins.
Thus, the result of a search of the cadherin sequences shows that patterns of at least six words allow us to find all classic cadherins in the database. This gives us a new tool for identifying proteins: if the patterns of eight, seven, or six words are observed in a sequence in question, then there is a high probability that the sequence is a classic cadherin. Because in total there are 27 conserved positions in the patterns of eight words, we can classify a protein sequence if we know the residues at no more than 27 conserved positions.
Comparison of secondary structural units with the results of sequence multialignment
The comparison of sequence and structural multialignments shows that gaps (deletions and insertions) in the sequences are almost never found in the middle of strands or helices, but rather at their borders. This observation could help us to predict the secondary structure of proteins with unknown three-dimensional structure. As an example, we present the results of the multialignment of seven cadherin sequences of domain I in Table 5. The gaps divide the sequences into 10 ungapped fragments (Table 5a). For example, there are two ungapped fragments at the beginning of the Xenopus laevis E-cadherin sequence (E-CAD_X): VSENE (fragment 1) and KGPFP (fragment 2).
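The division of a multialignment into ungapped fragments can be sketched in Python as follows. The sketch assumes that a fragment is a maximal run of alignment columns in which no sequence has a gap. Only the first row reuses the VSENE and KGPFP fragments quoted above; the other two rows and the position of the gap are hypothetical.

```python
def ungapped_blocks(alignment, gap="-"):
    """Return (start, end) column ranges, end exclusive, of ungapped fragments.

    Assumption: a fragment is a maximal run of columns in which no
    sequence of the alignment has a gap character.
    """
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    blocks = []
    start = None
    for col in range(width):
        has_gap = any(row[col] == gap for row in alignment)
        if not has_gap and start is None:
            start = col
        elif has_gap and start is not None:
            blocks.append((start, col))
            start = None
    if start is not None:
        blocks.append((start, width))
    return blocks


# Toy alignment: the first row contains the two fragments quoted in the text.
rows = ["VSENE-KGPFP", "VSEDEAKGPLP", "ISENE-RGPFP"]
blocks = ungapped_blocks(rows)
print(blocks)
print([rows[0][s:e] for s, e in blocks])
```

Under the observation made in this section, the borders of the resulting blocks are candidate borders of secondary structural units.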
It is obvious that the greater the number of sequences we consider for multialignment, the greater the accuracy in predicting the secondary structure. The classic cadherins give us a good example of this. We have analyzed 37 sequences, involving 14 types of cadherins (Table 1). Thus, we propose that the results of sequence multialignment give a reliable basis for predicting secondary structure.
Sequence multialignment gives important information about three-dimensional structure as well. Residues of molecules that are aligned with each other have approximately the same structural characteristics, such as H bonds between main chain atoms, approximately the same residue-residue contacts, or equal values of accessibility. This observation has been made in the analysis of proteins (see, e.g., Lesk et al. 1987).
Classic cadherins: Cytoplasmic part
In this part, we describe the results of our investigation of the cytoplasmic domain of cadherins. Currently, there is no structural information about the intracellular domains. We analyzed the amino acid sequences of 36 cytoplasmic domains. They consist of 120 amino acids. Because we have found a relationship between sequence and structural alignment for the extracellular domains, it is likely that the sequence alignment can give some information about the secondary structures of the cytoplasmic domains. The multialignment of 36 sequences resulted in 14 ungapped fragments. We can speculate that these fragments correspond to some extent to the helices or strands and loops in this part of cadherins.
The residue frequency was calculated at each position of the sequences. It was found that 71 of 120 positions are occupied by only one residue or by very similar residues in all or almost all sequences (Table 6). This observation shows that, unlike the extracellular domain, the cytoplasmic part is characterized by a high degree of residue conservation. Twenty-four positions are occupied by hydrophobic and/or aromatic residues. The polar and charged amino acids are found in 26 positions, and hydrophilic and neutral residues are found in 21 positions. The conserved positions are mostly found near the N and C termini of the sequences. Fragments 4 and 14 have the most conserved positions (13 and 18 positions, respectively), whereas the ungapped fragments in the middle of the cytoplasmic part (fragments 6, 7, 8, 9, and 10) have one conserved position each. (Note that residues in fragment 4 are involved in binding with β-catenin.)
| Discussion |
|---|
We propose another approach for the classification of protein families. An essential feature of the method is that it combines sequence and structural data. Putting together the results of the sequence and structural multialignments, we are able to give a description of the major structural units in a protein family. Patterns of strands and loops serve as defining characteristics of a protein family. In this work, we applied this method to one particular protein family, the cadherins. The results of this analysis showed that, on the basis of defining characteristics, one could unequivocally select all members of the cadherin family from about 80,000 proteins. Qualitatively specific patterns are characteristic of both the extracellular and the cytoplasmic domains. We can use the patterns of either of these parts independently. Notably, the sequence of the cytoplasmic tail is especially specific: the pattern of one unit is sufficient to determine a family. In contrast, patterns of transmembrane parts cannot assign proteins to a proper family, because they were found in >2000 proteins. These results confirm that defining patterns can be successfully used for reliable assignment of proteins to a proper protein family. We plan to expand the investigation of defining characteristics to protein families of the β fold.
In this work, we found that the gaps in sequences of cadherins obtained as the result of insertions and deletions in the sequence multialignment divide the sequences into structural units (strands and loops). Thus, sequence multialignments may give us a clue about secondary structure. The assignment of sequence units to a secondary structure has, however, some limitations. The multialignment of sequences with homology results in long ungapped fragments that include several structural units. To obtain a more reliable secondary structural assignment in a protein family, we need to use as many diverse sequences as possible. In our further analysis of other protein families, we plan to test the hypothesis about the relationship between the sequence and structural alignments.
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Chothia, C. and Jones, E.Y. 1997. The molecular structure of cell adhesion molecules. Annu. Rev. Biochem. 66: 823-862.
Chothia, C., Gelfand, I.M., and Kister, A.E. 1998. Structural determinants in the sequences of immunoglobulin variable domain. J. Mol. Biol. 278: 457-479.
Galitsky, B., Gelfand, I.M., and Kister, A.E. 1998. Predicting amino acids sequences of antibody human VH chains from its first several residues. Proc. Natl. Acad. Sci. 95: 5193-5198.
Galitsky, B., Gelfand, I.M., and Kister, A.E. 1999. Class-defining characteristics in the mouse heavy chains of variable domains. Protein Eng. 12: 101-107.
Gallin, W.J. 1998. Evolution of the classical cadherin family of cell adhesion molecules in vertebrates. Mol. Biol. Evol. 15: 1099-1107.
Gelfand, I.M. and Kister, A.E. 1995. Analysis of the relation between the sequence and secondary and three dimensional structures of immunoglobulin molecules. Proc. Natl. Acad. Sci. 92: 10884-10888.
Gelfand, I.M. and Kister, A.E. 1997. A very limited number of keywords (main patterns) describes all sequences of the human variable heavy (VH) and κ (Vκ) domains. Proc. Natl. Acad. Sci. 94: 12562-12567.
Gumbliner, B.M. 1996. Cell adhesion: The molecular basis of tissue architecture and morphogenesis. Cell 84: 345-357.
Gusfield, D. 1997. Algorithms on strings, trees and sequences: Computer science and computational biology. Cambridge University Press, New York.
Eddy, S.R. 1996. Hidden Markov models. Curr. Opin. Struct. Biol. 6: 361-365.
Hill, E., Broadbent, I., Chothia, C., and Pettitt, J. 2001. Cadherin superfamily proteins in Caenorhabditis elegans and Drosophila melanogaster. J. Mol. Biol. 305: 1011-1024.
Hofmann, K., Bucher, P., Falquet, L., and Bairoch, A. 1999. The PROSITE database, its status in 1999. Nucleic Acids Res. 27: 215-219.
Jensen, P.H., Soroka, V., Thomsen, N.K., Ralets, I., Berezin, V., Bock, E., and Poulsen, F.M. 1999. Structure and interactions of Ncam modules 1 and 2, basic elements in neural cell adhesion. Nat. Struct. Biol. 6: 486-493.
Koch, A.W., Bozic, D., Pertz, O., and Engel, J. 1999. Homophilic adhesion by cadherins. Curr. Opin. Struct. Biol. 9: 275-281.
Lesk, A.M., Levitt, M., and Chothia, C. 1987. Alignment of the amino acid sequences of distantly related proteins using variable gap penalties. Protein Eng. 1: 77-78.
Nagar, B., Overduin, M., Ikura, M., and Rini, J.M. 1996. Structural basis of calcium-induced E-cadherin rigidification and dimerization. Nature 380: 360-364.
Overduin, M., Harvey, T.S., Bagby, S., Tong, K.I., Yau, P., Takeichi, M., and Ikura, M. 1995. Solution structure of the epithelial cadherin domain responsible for selective cell adhesion. Science 267: 386-389.
Pearson, W.R. 1996. Effective protein sequence comparison. Methods Enzymol. 266: 227-258.
Pertz, O., Bozic, D., Koch, A.W., Fauser, C., Brancaccio, A., and Engel, J. 1999. A new crystal structure, Ca2+ dependence and mutational analysis reveal molecular details of E-cadherin homoassociation. EMBO J. 18: 1738-1747.
Takeichi, M. 1991. Cadherin cell adhesion receptors as a morphogenetic regulator. Science 251: 1451-1455.
Takeichi, M. 1995. Morphogenetic roles of classic cadherins. Curr. Opin. Cell Biol. 7: 619-627.
Shapiro, L. and Colman, D.R. 1999. The diversity of cadherins and implications for a synaptic adhesive code in the CNS. Neuron 23: 427-430.
Shapiro, L., Fannon, A.M., Kwong, P.D., Thompson, A., Lehmann, M.S., Grubel, G., Legrand, J-F., Als-Nielsen, J., Colman, D.R., and Hendrickson, W.A. 1995. Structural basis of cell-cell adhesion by cadherins. Nature 374: 327-336.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195-197.
Suzuki, S.T. 1996. Structural and functional diversity of cadherin superfamily: Are new members of cadherin superfamily involved in signal transduction pathway? J. Cell. Biochem. 61: 531-542.