1 Computational Biology and Bioinformatics Program, and 2 Developmental Biology Program, Victor Chang Cardiac Research Institute, Sydney, NSW 2010, Australia; 3 Bioinformatics and Pattern Discovery Group, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA; 4 Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA; 5 Schools of Biotechnology & Biomolecular Sciences, and Medical Sciences, and 6 St. Vincent's Clinical School, University of New South Wales, NSW 2052, Australia
Reprint requests to: Merridee A. Wouters, Victor Chang Cardiac Research Institute, 384 Victoria St., Darlinghurst 2010, Sydney, NSW, Australia; e-mail: m.wouters{at}victorchang.unsw.edu.au; fax: +61-2-9295-8501.
(RECEIVED November 9, 2004; FINAL REVISION December 21, 2004; ACCEPTED December 22, 2004)
| Abstract |
|---|
Keywords: EGF; bidirectional signaling; fucosylation; RIP; shedding; HER tyrosine receptor kinase ligand; hydroxylation
Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.041207005.
| Introduction |
|---|
, are shed from the surface by extracellular proteases, resulting in longer range extracellular signaling (Harris et al. 2003). Some transmembrane proteins, such as Spitz, undergo cleavage by intramembrane proteases, resulting in intracellular signaling cascades that include release of transcription factors (Urban and Freeman 2002). In the case of the Notch receptor, ligand-stimulated intracellular cleavage results in release of a transcription factor (Lai 2004). The ligand can also be cleaved, releasing a domain that localizes to the nucleus and inhibits Notch signaling (Kiyota and Kinoshita 2004). EGF domains are also found in components of the blood coagulation system, including factors VII, IX, and X, protein C, and thrombomodulin, where they may mediate interactions between the various components and thus have an adhesive function (Stenflo 1991). Because of their prevalence in long arrays, often combined with other domains in a mosaic fashion, it has been suggested that EGF domains also play an important structural role as spacers at the cellular level (Campbell and Bork 1993).
Structurally, the EGF domain is typically described as a small domain of 30–40 amino acids primarily stabilized by three disulfides with disulfide connectivity ababcc (1–3, 2–4, 5–6) (Fig. 1A). The domain consists of two β-sheets, usually referred to as the major (N-terminal) and minor (C-terminal) sheets. The half-cystines of the abc motif are arranged in a triangle on the major sheet (Fig. 1B).
β-hairpin that comprises the minor sheet (Fig. 1C). In addition to these three-disulfide EGFs, crystal structures of two four-disulfide EGF domains, laminin and integrin, have been solved (Stetefeld et al. 1996; Xiong et al. 2001). Four-disulfide EGF domains have an additional interdomain disulfide (disulfide d) as well as the three intradomain disulfides. In both laminin and integrin, the N-terminal half of disulfide d (dN) is located two or three residues C-terminal to half-cystine cC. Laminin and integrin differ in the location of the C-terminal half-cystine of disulfide d (dC) within the adjacent C-terminal domain. Tandem arrays of laminin EGF modules form stiff rods with each subunit adding 30 Å to the rod (Yurchenco and Cheng 1993; Yurchenco 1994).
Here we describe a structural analysis of EGF domains which compares the two major types of three-disulfide EGF domains to more recently acquired structures of four-disulfide EGF domains. We have derived sequence descriptors for the two major types of three-disulfide EGFs which allow automated detection in sequence databases. By re-annotating Swiss-Prot and correlating the results with experimental data, we present evidence that suggests that the divergence of EGF subtypes has been accompanied by functional specialization.
| Results |
|---|
β-turn of the minor sheet, whereas in cEGFs half-cystine cC is located on the second strand of the sheet itself. Laminin and integrin EGFs resemble hEGFs both in their tertiary structure and the location of half-cystine cC in the β-turn of the minor sheet (Figs. 1C
An all-against-all structural comparison showed that ~60% of EGF domains of known structure cluster into these three groups (hEGF, cEGF-1, and cEGF-2) when superimposed semi-automatically. EGFs that fall outside these clusters generally have atypical loop lengths that unduly influence the superposition due to the lack of a true hydrophobic core in these tiny domains. An alignment of these EGFs is shown in Figure 2A. For the hEGF-1 cluster, the Cα atoms of 23 residues in eight structures can be aligned with a root-mean-square deviation of 1.6 Å (PDB codes 1tpg, 1eqg, 1edm, 1dan1, 1g1t, 1fsb, 1jl9, 1xdt).
Evolutionary and functional implications of subtypes
In sequence-based alignments of EGF domains, all six of the highly conserved cysteines are aligned, implying homology between these cysteines (e.g., Bersch et al. 1998). On the basis of these alignments one might assume that the two groups of three-disulfide EGF subunits have diverged from insertion or deletion of residues in the β-turn of the minor sheet. If hydrogen-bonding residues in the β-sheet are maintained and the cysteines are homologous, this must have occurred as two separate events: insertion/deletion of residues N-terminal to half-cystine cC in the β-turn and insertion/deletion of a residue between half-cystine cC and the hydrogen-bonding residue of the second strand of the sheet. Given the presence of half-cystine cC in alternate registers in the two types of three-disulfide EGFs, it is also possible that these cysteines may not be homologous and the evolutionary situation is reflected more correctly by a structure-based sequence alignment based on hydrogen-bonding in the minor sheet (Fig. 3A). Furthermore, the presence of half-cystine cC in alternate registers in the two types of three-disulfide EGFs and the location of the nearby half-cystine dN in four-disulfide EGFs suggests an intriguing evolutionary scenario: that generation of the cEGF group from a four-disulfide ancestor with an hEGF-like minor sheet conformation involved a single event where half-cystine cC was lost and the nearby half-cystine dN was captured as the half-cystine cN pair partner. Such an event would simultaneously generate the longer loop length of the cEGF group and, if the β-sheet register was maintained, put the new half-cystine cC (cC') in the alternate register (Fig. 2B). This disulfide-capture model requires that a four-disulfide EGF module is an ancestral form of the EGF domain. The hEGFs and cEGF types are derived from the ancestor by selective loss of two cysteines of the ancestral sequence. The hEGF type retains the ancestral connectivity of disulfide c and the conformation of the minor sheet, with disulfide d being lost. In the cEGF type, half-cystine dN subsumes the role of half-cystine cC, with the other halves of disulfide c and d being lost (Fig. 3B). As a result, the hairpin loop is lengthened and the novel half-cystine cC, while retaining its original registration in the β-sheet, appears to adopt a new registration. In addition to the structural data, the disulfide-capture model is further supported by the bimodal nature of the distribution of disulfide c loop lengths found in EGF modules (Fig. 3E).
A proposed evolutionary model originating with a four-disulfide model is presented in Figure 3C. As hEGFs seem to be exclusively of subtype 1, it is likely the divergence of the cEGF-1 and 2 subtypes occurred after the hEGF split (Fig. 3C). This hypothesis could be tested by examining the appearance of EGF subtypes in complete genomes. However, all extant three and four-disulfide subtypes are represented in Caenorhabditis elegans, the earliest diverging metazoan with a completely sequenced genome. No EGF domains have been annotated in plant genomes such as Arabidopsis, although there are reports in the literature of plant EGFs. Integrin β-chains containing integrin EGFs have been reported in Arabidopsis and sponges, an earlier diverging metazoan than C. elegans. It would appear that EGFs are very ancient, and that genomes of earlier diverging multicellular organisms are required to test the hypothesis and date the appearance of the various EGF modules.
The identified subtypes correlate with functional data. The bNaC loop identified in the structural analysis as an important discriminator of the 1 and 2 subtypes is fucosylated in some EGF domains. Indeed, EGF domains undergo several unusual forms of post-translational modification including hydroxylation of aspartate and asparagine residues and two rare forms of O-glycosylation.
O-fucose modifications of EGF domains have been demonstrated to modulate Notch, TGF-β family (nodal) and urinary-type plasminogen activator (uPA) signal transduction (Haltiwanger 2002). We surveyed all of the O-glycosylation modifications of the bNaC loop reported in the literature and found they have been reported in hEGF domains only. In addition to the requirement for a serine or threonine at the aC − 1 position, investigation of sequence determinants of modification suggests fucosylation is dependent on the presence of five residues in the bNaC loop (Fig. 2A). Several fucosylated sequences were found to conform to a CXXGG(T/S)C motif (Harris and Spellman 1993). However, additional studies suggest that alanine is also permissible at position 5 and several other variations (D, Q) are allowable at position 4 (Panin et al. 2002).
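The sequence constraints described above lend themselves to a simple scan. The following is a minimal sketch under stated assumptions, not the procedure used in this study: it assumes that positions in the CXXGG(T/S)C motif are numbered from the first cysteine (so that position 4 is the first glycine and position 5 the second), and the function name, the `relaxed` flag, and the toy input string are hypothetical.

```python
import re

# Strict motif from the text: C-X-X-G-G-(T/S)-C.
# A lookahead is used so that overlapping matches are not skipped.
STRICT_MOTIF = re.compile(r"(?=(C..GG[ST]C))")

# Relaxed motif (ASSUMED numbering, counting the first Cys as position 1):
# D or Q tolerated at position 4, A tolerated at position 5.
RELAXED_MOTIF = re.compile(r"(?=(C..[GDQ][GA][ST]C))")


def find_candidate_fucosylation_sites(sequence, relaxed=False):
    """Return (index, match) pairs; index is the 0-based Ser/Thr position."""
    pattern = RELAXED_MOTIF if relaxed else STRICT_MOTIF
    seq = sequence.upper()
    return [(m.start() + 5, m.group(1)) for m in pattern.finditer(seq)]


if __name__ == "__main__":
    toy = "AACPSGGTCAACAADATCAA"  # invented string, not a real protein
    print(find_candidate_fucosylation_sites(toy))
    print(find_candidate_fucosylation_sites(toy, relaxed=True))
```

A sequence match of this kind identifies only candidate sites; as the following paragraph discusses, proper folding and the loop conformation also appear to matter for modification.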
As it has already been demonstrated that proper folding of the EGF domain is a requirement for O-fucose modification (Wang et al. 1996; Wang and Spellman 1998), we investigated the conformations of five-residue loops in known structures. A superposition of hEGF structures with five residue bNaC loops shows that all structures solved to date adopt one of two conformers (data not shown). Conformer 1, which is a type I' β-turn, is shared by tPA (1tpg), Cox-1, the N-terminal EGF domains of factors VII, IX, and X, neuregulin, and heparin-binding growth factor, while conformer 2 is adopted by E-selectin. All structures that are able to be fucosylated adopt conformer 1, suggesting that the structure of the epitope may be an important determinant of fucosylation.
β-Hydroxylation of an aspartate or asparagine residue at the aC + 2 position of the aCbC loop has been demonstrated in over 25 calcium-binding EGF modules (e.g., Przysiecki et al. 1987; Stenflo et al. 2000). The biological role for this post-translational modification, which appears to be restricted to EGF domains, is unclear. However, knockout mice which lack the genomic locus containing the enzyme catalyzing aspartyl β-hydroxylation have developmental defects and an increased incidence of intestinal neoplasia (Dinchuk et al. 2002). The consensus motif previously determined for β-hydroxylation is CX[DN]4X[FY]XCXC (PROSITE signature PS00010) (Stenflo et al. 1988), where the hydroxylated residue is the D or N at the third position, X is any residue, 4X denotes four consecutive residues of any type, and residues within square brackets represent possible options for that amino acid position. A review of the 25+ instances of β-hydroxylation which have been confirmed experimentally shows there is a one-to-one correspondence between hydroxylation of aspartic acid and the hEGF type, and between hydroxylation of asparagine and the cEGF type. This suggests the consensus motif for cEGF hydroxylation may be expressed as CXN4X[F,Y]XCXC, whereas the motif for hEGF hydroxylation is CXD4X[F,Y]XCXC.
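The three consensus motifs above translate directly into regular expressions. The sketch below is illustrative only: it reads "4X" as four consecutive residues of any type, and the dictionary keys, function name, and toy sequences are hypothetical rather than part of the published method.

```python
import re

# Regular-expression forms of the consensus motifs quoted in the text.
# Lookaheads allow overlapping occurrences to be reported.
MOTIFS = {
    "general": re.compile(r"(?=(C.[DN].{4}[FY].C.C))"),   # CX[DN]4X[FY]XCXC
    "hEGF_Asp": re.compile(r"(?=(C.D.{4}[FY].C.C))"),     # CXD4X[F,Y]XCXC
    "cEGF_Asn": re.compile(r"(?=(C.N.{4}[FY].C.C))"),     # CXN4X[F,Y]XCXC
}


def hydroxylation_candidates(sequence, motif="general"):
    """Return (index, residue) for each candidate hydroxylated Asp/Asn.

    The index is 0-based and points at the third motif position.
    """
    seq = sequence.upper()
    hits = []
    for m in MOTIFS[motif].finditer(seq):
        pos = m.start() + 2
        hits.append((pos, seq[pos]))
    return hits


if __name__ == "__main__":
    toy_asp = "GGCADAAAAFACACGG"  # invented string with an Asp-type motif
    toy_asn = "CANAAAAYACAC"      # invented string with an Asn-type motif
    print(hydroxylation_candidates(toy_asp))
    print(hydroxylation_candidates(toy_asn, motif="cEGF_Asn"))
```

Counting hits for `"general"` against hits for `"hEGF_Asp"` across the EGF repeats of a large protein gives the kind of comparison between the general and the type-specific consensus that the type-specific motifs are meant to support.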
Development of sequence descriptors for the two EGF types
Differential post-translational processing of these EGF domain subtypes suggested possible functional specialization to us, so we wished to search specifically for the different EGF types in Swiss-Prot and correlate the EGF type with functional information about the proteins. The sequence motif databases do not differentiate between the two types, so it was first necessary to construct our own sequence descriptors.
At the sequence level, the two major types of three-disulfide EGFs can be differentiated by a number of features. hEGF subunits almost always have eight residues between the two half-cystines of disulfide c (Fig. 2A). The integrin β-4 subunit and prostaglandin H-synthase are the only structurally characterized hEGF subunits that do not comply with this rule, having nine residues instead. cEGF subunits typically have 10–13 residues separating the half-cystines of disulfide c (Fig. 2A). These length differences, and the fact that different secondary structures have different sequence preferences, allow the construction of more specific sequence descriptors through pattern discovery methods (Materials and Methods).
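The loop-length rule alone already gives a crude first-pass classifier. The sketch below is a deliberate simplification and is not the pattern-discovery descriptor derived in this study: it assumes the input is a single three-disulfide EGF domain containing exactly six cysteines, and the function names are hypothetical.

```python
def disulfide_c_loop_length(domain):
    """Number of residues between the 5th and 6th cysteines (disulfide c)."""
    cys = [i for i, aa in enumerate(domain.upper()) if aa == "C"]
    if len(cys) != 6:
        raise ValueError(f"expected 6 cysteines, found {len(cys)}")
    return cys[5] - cys[4] - 1


def classify_three_disulfide_egf(domain):
    """Classify by disulfide c loop length only (a rough heuristic)."""
    n = disulfide_c_loop_length(domain)
    if n == 8:
        return "hEGF"
    if 10 <= n <= 13:
        return "cEGF"
    # Nine-residue loops occur in a few hEGFs, so they are left unassigned
    # here rather than forced into either group.
    return "unclassified"


if __name__ == "__main__":
    # Invented toy domain: six cysteines with an 8-residue final loop.
    toy = "CAAACAAAACAAAAACAC" + "A" * 8 + "C"
    print(disulfide_c_loop_length(toy), classify_three_disulfide_egf(toy))
```

Loop length is only one of the distinguishing features mentioned above, so a classifier like this would be expected to be less specific than descriptors that also capture residue preferences.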
In order to evaluate the sensitivity and specificity of our pattern-based approach, we referred to InterPro (Apweiler et al. 2001), which amalgamates several sequence detection databases including PROSITE, PRINTS, SMART, Pfam, and ProDom. A search of Swiss-Prot 43 for the relevant InterPro signatures shows the relative performance of several EGF detection methods (Table 2). The amalgamated results from these databases detect matches in 640 proteins, which are grouped as InterPro IPR006209. The two PROSITE signatures EGF_1 and EGF_2 represent the best single method for EGF detection and, in addition to the Swiss-Prot annotations, we use these as our benchmark.
Another measure of the quality of our cEGF and hEGF pattern-descriptors is their ability to find EGF domains that are already annotated in the Swiss-Prot database. Our pattern-based scheme was able to find 92% of the 2662 annotated three-disulfide EGF domains in Swiss-Prot compared to 85% for the combined PROSITE motifs EGF_1 and EGF_2. Our scheme failed to identify 222 sequences in Swiss-Prot, or 8% of the annotated EGFs: These false negatives seem to represent additional heterogeneity in EGF sequences. For example, 14 of these sequences carried additional annotations such as "atypical," "incomplete," and "truncated." Other detected anomalies included 20 EGF domains with an atypical number of cysteines (usually odd, suggesting one Cys exists as a free thiol), and 56 domains with either atypically short (fewer than eight residues) or long (> 15) disulfide c loops.
More importantly, our method was able to detect a number of novel EGF domains. Supporting evidence for the EGF identification was available for 169 of these. Firstly, 14 cEGFs in plasmodial merozoite surface proteins (Chitarra et al. 1999; Morgan et al. 1999) have been confirmed by recent structures or homology to them. These plasmodial sequences are likely to have been transferred from a host organism in a single lateral transfer event. Secondly, the discovered potential EGFs are in proteins homologous to those with annotated EGFs. These include four novel hEGFs in adamalysins. Sixty-two additional EGF domains were detected in proteins that contain EGF domains or that contain domains which themselves are often associated with EGF domains. These include one additional hEGF domain in human and mouse netrin G2; mouse netrin G1; human and mouse tenascin N; the long form of human LTBP-1; C. elegans LIN-12; human, mouse, and rat Jagged-1, as well as zebrafish Jagged-3; and electric ray agrin. Two additional domains were found in the long form of mouse LTBP-1; mouse perlecan; the human and mouse scavenger receptor 2; and human slit 2. Three additional EGFs were found in human and mouse attractin; UNC-52, the C. elegans homolog of perlecan; and the human scavenger receptor. One additional cEGF domain was found in bovine, human, and rat thrombomodulin; Drosophila cadherin N 1 and 2; the cattle tick protective antigen BM86; the C. elegans proteins C14orf27 and ZK112.7; human and mouse LTBP-3; human, pig, and rabbit zonadhesin; bovine, human, pig, and mouse fibronectin 1; and human fibronectin 2. Two additional cEGF domains were found in mouse fibronectin 2. Some proteins had additional hEGFs and cEGFs, including human crumbs homolog (hEGF, cEGF) and Drosophila starry night protein (hEGF, 2 x cEGFs).
Additional novel domains suggested by the less specific hidden Markov models and confirmed by X-ray structures include 85 I-EGF domains in 28 integrin β-chains (Xiong et al. 2001); and four hEGF domains in variants of alliinase, a protein attributed with antibacterial properties in garlic (Kuettner et al. 2002). In both cases, these deviate from the cEGF and hEGF templates. The alliinase EGF domain is very atypical (Fig. 3A). It lacks disulfide a, and contains an additional disulfide joining the C-terminus of the EGF domain to the N-terminus of the minor sheet. On the basis of the cNcC loop length and the location of half-cystine cC, garlic alliinase belongs to either the hEGF or four-disulfide group. In addition, it has a C-terminal β-turn, formed by residues Gln 55 and Gly 56, reminiscent of the β-turn in the four-disulfide EGF laminin. Several other hits in plants suggested by the hidden Markov models were not confirmed by the more specific patterns. However, they shared some common features such as association with BULB lectin domains (InterPro: IPR001480) (Apweiler et al. 2001). These plant EGF domains may be false positives, or alternatively, the failure of the more specific patterns to detect them may suggest EGF domains have diverged significantly in the plant kingdom, a view supported by the structure of garlic alliinase.
Analysis of EGFs in Swiss-Prot based on cEGF/hEGF grouping
Using the cEGF and hEGF sequence descriptors, we reclassified the three-disulfide EGF domains in Swiss-Prot into the two groups, observed the occurrence of the two groups in mosaic proteins, and correlated this information with functional data.
Many mosaic proteins are homogeneous with respect to EGF type. For example, many developmentally important proteins such as Notch and Delta as well as EGFs that are mitogenic contain only hEGFs (Fig. 4). Proteins that contain solely cEGFs, on the other hand, include thrombomodulin and the LDL receptor. However, there are a significant number of mosaic proteins that contain both types. For these mixed EGF proteins, a bipartite structure where the different EGF types are grouped together is the most common, but other interleaved arrangements are also found (Fig. 4). For example, most of the proteins involved in blood coagulation are a mixture of hEGFs and cEGFs with the hEGF always N-terminal to the cEGF (Fig. 4). Fibrillin and LTBP-1, components of the extracellular matrix (ECM), have a similar arrangement. They predominantly consist of the cEGF type but are predicted to have one to three hEGFs at the N-terminus. In contrast, the LDL receptor-related protein 1 (LRP1), nidogen, the transmembrane receptor adhesion protein MUA3, and proEGF also contain predominantly cEGFs but have the opposite arrangement, with the few hEGF subunits disposed toward the C-terminus near the membrane. In addition, mosaic proteins that contain laminin EGFs contain hEGFs only. Perlecan and agrin are examples (Fig. 4).
Functionally distinct domains of LTBP-1 contain distinct EGF types
Proteins with the cEGF and hEGF domains disposed in a bipartite fashion are particularly illuminating with regard to specific functions for hEGF and cEGF subunits. An example is provided by LTBP-1, in which the two EGF groups seem to have different roles in the function of the protein: the hEGFs in targeting the assembly to the ECM; and the cEGFs in some unspecified role in TGF-β activation after its separation from the hEGF subunits. LTBP-1 is a binding protein that anchors TGF-β to the ECM in its latent form until it is required. Association with the ECM is effected by the N-terminus. LTBP-1 is expressed tissue specifically in two alternatively-spliced forms: a short form which has a single hEGF domain at the N-terminus, and a long form which has an additional hEGF domain. The long form associates more efficiently with the ECM (Olofsson et al. 1995), suggesting a role for hEGFs in targeting to the ECM. TGF-β is attached to LTBP-1 via a cysteine-rich domain interleaved between tandem cEGF subunits. The ability of most of these cEGF subunits to bind calcium suggests the region probably forms a stiff rod in its presence. All but one of the C-terminal cEGFs are separated from the N-terminus by a hinge region that can be cleaved by various proteases (Sato and Rivkin 1989; Yu and Stamenkovic 2000), resulting in release of the tandem cEGF substructure from the ECM and activation of TGF-β. Thus, the hEGF and cEGF subunits seem to function at different stages during TGF-β's deployment.
A similar pattern of proteolytic separation of the hEGF and cEGF subunits is apparent for proEGF and LRP1 (Fig. 4). For proEGF, the soluble mature 6-kD EGFR ligand is derived from the most distal repeat, the only hEGF subunit (Dempsey et al. 1997). LRP1 is proteolysed in a series of cleavage events culminating in the release of its intracellular domain by the intramembrane protease γ-secretase. The first of these cleavages, which is catalyzed by furin in a late secretory compartment, separates the bulk of the cEGF subunits from the hEGF subunit (Willnow et al. 1996). The two halves of the protein subsequently remain noncovalently associated. A second cleavage is believed to occur close to the plasma membrane prior to γ-secretase cleavage.
Differential glycosylation and hydroxylation of EGF types
Other indicators of a structure/function relationship are provided by differential post-translational modification of EGF subtypes. There is a clear differentiation between the two types in terms of glycosylation and proteolytic processing. EGFs are glycosylated in unusual ways which, to date, have been detected on few other protein domains. Only hEGFs have been reported to undergo O-glycosylation of the aNbN and bNaC loops. Fucosylation seems to be restricted to a subset of hEGFs, having been identified on uPA (Kentzer et al. 1990), tPA (Harris et al. 1991), and coagulation factors VII (Bjoern et al. 1991), IX (Nishimura et al. 1992), and XII (Harris et al. 1992), as well as components of the Notch/Delta system (Shao et al. 2003). O-fucose modifications on EGF repeats have recently been shown to play significant roles in several signal transduction pathways. O-fucose on uPA was shown to be required for activation of the uPA receptor (Rabbani et al. 1992). O-fucose on the EGF repeat of Cripto was demonstrated to be essential for Cripto to mediate Nodal-dependent signaling (Schiffer et al. 2001; Yan et al. 2002). Furthermore, it is now clear that Fringe modulates Notch function by altering O-fucose structures on Notch (Bruckner et al. 2000; Hicks et al. 2000; Okajima and Irvine 2002). Similarly, Notch ligands also undergo O-fucosylation (Panin et al. 2002).
Hydroxylation is also dependent on EGF type. The details of the functional significance of β-hydroxylation remain to be determined but β-hydroxylase knockout mice have multiple developmental defects including craniofacial abnormalities, mild palatal defects, and soft tissue syndactyly (Dinchuk et al. 2002). These developmental abnormalities resemble those seen in mutants of Jagged-2, an EGF domain-containing protein in the Notch signaling pathway. This suggests that hydroxylation may be important in signaling pathways. Given the proximity of the O-fucosylation and β-hydroxylation sites in some EGF domains (Fig. 3A), it has been suggested that these modifications may influence each other (Dinchuk et al. 2002). Differentiating between the cEGF and hEGF types suggests more specific motifs for hydroxylation. These new motifs should be useful for restricting searches for potential β-hydroxylated residues in large proteins such as Notch. For example, 22 of the 36 Drosophila Notch EGF domains contain the β-hydroxylation consensus sequence, but only 18 have the Asp hydroxylation consensus sequence which seems to be necessary for hEGF β-hydroxylation.
Role of hEGFs in shedding
Further intriguing evidence for specific functions for hEGFs and cEGFs is suggested by proteins that undergo shedding from the membrane and/or regulated intramembrane proteolysis (RIP). Most of these proteins contain at least one hEGF domain. Those that contain a single domain are always of the hEGF type. These include L-selectin; the EGFR ligands TGF-α, amphiregulin, epiregulin, betacellulin, and hbEGF; the EGFR-related HER4 ligand, neuregulin (Harris et al. 2003); E and N-cadherin in Drosophila; as well as the Drosophila EGF ligands, Spitz, Gurken, and Keren. Shed proteins that contain multiple EGFs either consist of tandem hEGFs like Notch, Delta and Jagged; or contain a mixture of hEGF and cEGF subunits with the hEGFs arranged nearest to the membrane. Given the predominance of juxtamembrane hEGF subunits in proteins that undergo shedding or RIP, it is likely that they have some role in recognition or regulation of this penultimate cleavage step. In support of this, the hEGF domain in L-selectin has been demonstrated to be crucial in regulated shedding of L-selectin from the membrane (Zhao et al. 2001).
The role of hEGFs in shedding may extend to the sheddases. Regulated shedding has been extensively attributed to zinc metalloproteases of the ADAM family (Sahin et al. 2004). These proteases also contain a hEGF domain (Fig. 4), suggesting a like-recognizes-like recognition mode between the hEGF ADAM sheddases and hEGFs of the proteins being shed.
In summary, data from LTBP-1 suggest separate functions for cEGF and hEGF subunits in the protein. Data on EGFs that undergo functionally related post-translational modifications demonstrate the modifications are dependent on EGF type.
| Discussion |
|---|
An alternative classification system of EGF domains based on the size of the linking regions between EGFs (Downing et al. 1996) is not inconsistent with the hEGF/cEGF classification system. The Downing classification system recognized two major classes of EGF tandem pairs: class I pairs, where tandem EGF domains are separated by one linker residue, and class II pairs, separated by two linker residues. Thus, for class I pairs the last cysteine (half-cystine [cC]1) of the N-terminal EGF is separated from the first cysteine (half-cystine [aN]2) of the C-terminal EGF by five residues, whereas for class II pairs the cysteines are separated by six residues. These numbers are consistent with separations for cEGF and hEGF pairs, respectively. Downing et al. (1996) also noted that class I and II pairs have striking differences in the length of the loops connecting the 5th and 6th half-cystines (disulfide c). Thus class I pairs correspond to a pair of cEGF subunits and class II pairs are two hEGF subunits. Given the difference in registration of the cysteines with respect to the β-sheet (Fig. 3A), it can be seen that the number of residues between domain pairs of the two different groups is effectively the same. Allowing for the difference in registration of the cysteines between the two groups also brings the calcium-binding residues in the linker into alignment without gaps (Fig. 3D). This suggests that a pair of hEGFs could be effectively modeled by a cEGF pair if the structural nonequivalence of the last cysteine in each module is taken into account. While there are structures of cEGF pairs in the structure database, to date no pair of hEGF domains has been solved. However, the previous observations suggest that the minor sheet of the N-terminal EGF and the major sheet of the C-terminal EGF should superimpose well. Differences in the pitch of the screw axis of multiple tandem hEGF or cEGF pairs will arise from the different orientation of the major sheet with respect to the minor sheet within hEGF and cEGF domains.
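The correspondence between the Downing classes and inter-cysteine spacing can be expressed as a short lookup. This is a minimal sketch assuming a tandem pair of three-disulfide EGF domains with exactly twelve cysteines and no additional cysteines in the linker; the function names and the labels returned are hypothetical.

```python
def inter_domain_spacing(tandem_pair):
    """Residues between the last Cys of EGF 1 and the first Cys of EGF 2."""
    cys = [i for i, aa in enumerate(tandem_pair.upper()) if aa == "C"]
    if len(cys) != 12:
        raise ValueError(f"expected 12 cysteines, found {len(cys)}")
    # cys[5] is half-cystine cC of domain 1; cys[6] is aN of domain 2.
    return cys[6] - cys[5] - 1


def downing_class(spacing):
    """Map the cC(1)-to-aN(2) spacing onto the classes described in the text."""
    if spacing == 5:
        return "class I (cEGF pair)"
    if spacing == 6:
        return "class II (hEGF pair)"
    return "other"


if __name__ == "__main__":
    # Invented toy domain with six cysteines; real domains will differ.
    domain = "CAAACAAAACAAAAACACAAAAAAAAC"
    pair = domain + "A" * 5 + domain
    print(downing_class(inter_domain_spacing(pair)))
```

Because the two classes differ by a single residue in this spacing, such a lookup is sensitive to any misassignment of domain boundaries and is best treated as a consistency check rather than a classifier.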
Downing et al. (1996) concluded that proteins with tandem EGF domains consist entirely of either hEGFs (class II) or cEGFs (class I) and cited a single exception to this rule: that of protein S. The more extensive data available to our study show there are many more exceptions. Although many proteins are homogeneous with respect to EGF type, there is a significant number that contain both types. We identified four distinct combinations of EGF types in mosaic proteins: those that contain hEGFs only; those that contain cEGFs only; mixed cEGF/hEGF proteins both of a bipartite and interleaved nature; and mixed hEGF/laminin proteins. An interesting feature of many of the bipartite cEGF/hEGF proteins is the existence of a cleavage site in the vicinity of the hEGF/cEGF boundary. Examples include proEGF, LTBP-1, and LRP1. In the case of LTBP-1, there is good evidence that the two halves of the protein encode distinct functions. We speculate this may be a general feature of cleaved bipartite EGFs.
This study raises some interesting questions about the evolution of EGF domains. Domains of both types have in the past been aligned such that all six cysteines are in register, implying all six cysteines are homologous and the difference between the two groups is merely a difference in the length of a variable loop. However, structural analysis of EGF domains suggests that the sixth cysteine of the two groups may not be homologous. Here we have proposed a disulfide-capture model based on observed differences between EGF types. This model postulates derivation of EGF types from a four-disulfide EGF progenitor. Alternatively, the observed conformational similarities of the minor sheet between hEGFs and four-disulfide EGFs may suggest these groups are the more closely related, with the cEGF group being the outlier. Further evolutionary studies should elucidate this question. Finally, the existence of proteins containing both hEGFs and cEGFs also suggests these tandem EGF proteins contain higher order structures and have not evolved simply from multiple EGF duplication events.
In summary, for the first time, these structurally different types of three-disulfide EGF domains have been related to experimental data suggesting EGF subtypes may have distinct functional roles. Additionally, very sensitive and specific in silico detection of structural and functional domains in protein sequences is now feasible. The ability to detect instances of the two EGF types in an automated manner should enhance further investigation into structure/function relationships.
| Materials and methods |
|---|
Improved motifs to specifically extract EGF domains of the hEGF and cEGF types from sequence databases were generated using structure-based alignments of the different C-termini of the two types. The motifs were created and refined using an iterative technique which involved initial extraction of hEGFs and cEGFs by regular expression, generation of hidden Markov models, and finally generation of the final motifs using pattern discovery. EGF domains were initially extracted from Swiss-Prot (Boeckmann et al. 2003) by an iterative heuristic method using the program Find-Patterns (GCG, Wisconsin Group). The initial regular expressions were based on the sequences of the two types from the structural analysis. The output was compared to Swiss-Prot annotations and the regular expressions were modified until all sequences fitting the structural templates were extracted. False positives were then removed from the lists based on several criteria. Firstly, as EGFs have to date only been confirmed in metazoans and plasmodia, all bacterial and fungal sequences were removed. Secondly, sequences in which the detected EGF was alternatively annotated as a different type of domain were removed. This group included a large number of zinc fingers. Thirdly, sequences were removed on the basis of function and cellular location. For example, proteins annotated as transcription factors were removed, as well as proteins annotated as nuclear or cytosolic. Fourthly, the inability to detect a majority of homologs was also used as a criterion for exclusion, since it indicated that the features detected were not conserved. The remaining true positives belonging to the hEGF and cEGF groups were aligned with ClustalW; hidden Markov models (HMMs) were then constructed for each of the two types using HMMER (Eddy 1996) and used to search Swiss-Prot for members of the two groups.
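The four exclusion criteria described above can be pictured as a sequence of filters applied to each candidate hit. The sketch below is purely illustrative: the `Candidate` record, its field names, and the 0.5 threshold used for "a majority of homologs" are assumptions introduced here, and the original filtering was performed against Swiss-Prot annotations rather than with this code.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one regular-expression hit."""
    accession: str
    taxon: str                    # e.g. "metazoa", "plasmodium", "bacteria", "fungi"
    annotated_other_domain: bool  # hit overlaps a different annotated domain
    transcription_factor: bool
    location: str                 # e.g. "extracellular", "nuclear", "cytosolic"
    homolog_hit_fraction: float   # fraction of homologs in which the hit recurs


def keep_candidate(c):
    """Apply the four exclusion criteria in the order given in the text."""
    if c.taxon in {"bacteria", "fungi"}:
        return False  # criterion 1: taxonomic origin
    if c.annotated_other_domain:
        return False  # criterion 2: annotated as another domain type
    if c.transcription_factor or c.location in {"nuclear", "cytosolic"}:
        return False  # criterion 3: function and cellular location
    if c.homolog_hit_fraction <= 0.5:
        return False  # criterion 4: not detected in a majority of homologs
    return True


def filter_candidates(candidates):
    """Return only the candidates that pass every criterion."""
    return [c for c in candidates if keep_candidate(c)]
```

Ordering the checks from cheapest to most expensive mirrors the order in the text; the homolog criterion needs additional searches, so it is applied last.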
As the results of the HMM search included a substantial fraction of false positives, they were filtered manually. Only known positives and potential EGFs verified through BLAST (Altschul et al. 1990) searches were retained, and all four-disulfide EGFs detected by the hEGF-HMM were removed from the hEGF training set.
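The filtering described here was manual. Purely as an illustration of how a first-pass screen might assist it, the sketch below flags HMM hits whose matched region contains eight or more cysteines, the number needed for four disulfide bonds. The threshold, the hit layout, and the function name are assumptions, and a cysteine count alone cannot confirm a domain type.

```python
def flag_possible_four_disulfide(hits, min_cysteines=8):
    """Split hits into (kept, flagged) lists of hit ids.

    `hits` is assumed to be an iterable of (hit_id, matched_region) pairs.
    A region with at least `min_cysteines` cysteines is flagged for manual
    review as a possible four-disulfide domain; this is a crude heuristic,
    not the manual procedure used in the study.
    """
    kept, flagged = [], []
    for hit_id, region in hits:
        if region.upper().count("C") >= min_cysteines:
            flagged.append(hit_id)
        else:
            kept.append(hit_id)
    return kept, flagged
```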
The resulting collection of hEGF and cEGF instances was subsequently used as a training set in conjunction with a pattern discovery algorithm (Rigoutsos and Floratos 1998a, b) to derive sequence descriptors that were both sensitive and specific for each of the two groups. These pattern-based descriptors were subsequently used to process Swiss-Prot anew and identify more instances of the two groups. The instances were finally correlated with functional data from the literature and the contents of the Swiss-Prot annotations.
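How sensitivity and specificity of a set of sequence descriptors might be assessed against a labelled collection can be sketched as below. The descriptors are treated as ordinary regular expressions, which is a simplification; the toy pattern, the sequence ids, and the function names are hypothetical and do not reproduce the TEIRESIAS-derived descriptors.

```python
import re


def scan(sequences, patterns):
    """Return the ids of sequences matched by at least one pattern.

    `sequences` maps id -> amino acid string; `patterns` is a list of
    regular-expression strings standing in for pattern-based descriptors.
    """
    compiled = [re.compile(p) for p in patterns]
    return {sid for sid, seq in sequences.items() if any(c.search(seq) for c in compiled)}


def sensitivity_specificity(predicted, positives, all_ids):
    """Compute (sensitivity, specificity) from sets of sequence ids."""
    negatives = all_ids - positives
    tp = len(predicted & positives)   # true members that were detected
    fn = len(positives - predicted)   # true members that were missed
    fp = len(predicted & negatives)   # non-members that were detected
    tn = len(negatives - predicted)   # non-members correctly ignored
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec
```

A descriptor set is useful for database-wide annotation only when both values are high, since a search across all of Swiss-Prot is dominated by non-members.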
| Electronic supplemental material |
|---|
| Footnotes |
|---|
| Acknowledgments |
|---|
| References |
|---|
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D., et al. 2001. InterPro: an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29: 37–40.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.
Bersch, B., Hernandez, J.F., Marion, D., and Arlaud, G.J. 1998. Solution structure of the epidermal growth factor (EGF)-like module of human complement protease C1r, an atypical member of the EGF family. Biochemistry 37: 1204–1214.
Bjoern, S., Foster, D.C., Thim, L., Wiberg, F.C., Christensen, M., Komiyama, Y., Pedersen, A.H., and Kisiel, W. 1991. Human plasma and recombinant factor VII. Characterization of O-glycosylations at serine residues 52 and 60 and effects of site-directed mutagenesis of serine 52 to alanine. J. Biol. Chem. 266: 11051–11057.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., et al. 2003. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31: 365–370.
Bruckner, K., Perez, L., Clausen, H., and Cohen, S. 2000. Glycosyltransferase activity of Fringe modulates Notch–Delta interactions. Nature 406: 411–415.
Campbell, I.D. and Bork, P. 1993. Epidermal growth factor-like modules. Curr. Opin. Struct. Biol. 3: 385–392.
Chan, A.W., Hutchinson, E.G., Harris, D., and Thornton, J.M. 1993. Identification, classification, and analysis of β-bulges in proteins. Protein Sci. 2: 1574–1590.
Chitarra, V., Holm, I., Bentley, G.A., Petres, S., and Longacre, S. 1999. The crystal structure of C-terminal merozoite surface protein 1 at 1.8 Å resolution, a highly protective malaria vaccine candidate. Mol. Cell 3: 457–464.
Dempsey, P.J., Meise, K.S., Yoshitake, Y., Nishikawa, K., and Coffey, R.J. 1997. Apical enrichment of human EGF precursor in Madin-Darby canine kidney cells involves preferential basolateral ectodomain cleavage sensitive to a metalloprotease inhibitor. J. Cell Biol. 138: 747–758.
Dinchuk, J.E., Focht, R.J., Kelley, J.A., Henderson, N.L., Zolotarjova, N.I., Wynn, R., Neff, N.T., Link, J., Huber, R.M., Burn, T.C., et al. 2002. Absence of post-translational aspartyl beta-hydroxylation of epidermal growth factor domains in mice leads to developmental defects and an increased incidence of intestinal neoplasia. J. Biol. Chem. 277: 12970–12977.
Downing, A.K., Knott, V., Werner, J.M., Cardy, C.M., Campbell, I.D., and Handford, P.A. 1996. Solution structure of a pair of calcium-binding epidermal growth factor-like domains: Implications for the Marfan syndrome and other genetic disorders. Cell 85: 597–605.
Eddy, S.R. 1996. Hidden Markov models. Curr. Opin. Struct. Biol. 6: 361–365.
Haltiwanger, R.S. 2002. Regulation of signal transduction pathways in development by glycosylation. Curr. Opin. Struct. Biol. 12: 593–598.
Harris, R.J. and Spellman, M.W. 1993. O-linked fucose and other post-translational modifications unique to EGF modules. Glycobiology 3: 219–224.
Harris, R.J., Leonard, C.K., Guzzetta, A.W., and Spellman, M.W. 1991. Tissue plasminogen activator has an O-linked fucose attached to threonine-61 in the epidermal growth factor domain. Biochemistry 30: 2311–2314.
Harris, R.J., Ling, V.T., and Spellman, M.W. 1992. O-linked fucose is present in the first epidermal growth factor domain of factor XII but not protein C. J. Biol. Chem. 267: 5102–5107.
Harris, R.C., Chung, E., and Coffey, R.J. 2003. EGF receptor ligands. Exp. Cell Res. 284: 2–13.
Hicks, C., Johnston, S.H., diSibio, G., Collazo, A., Vogt, T.F., and Weinmaster, G. 2000. Fringe differentially modulates Jagged1 and Delta1 signaling through Notch1 and Notch2. Nat. Cell Biol. 2: 515–520.
Hutchinson, E.G. and Thornton, J.M. 1990. HERA: A program to draw schematic diagrams of protein secondary structures. Proteins 8: 203–212.
Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
Kentzer, E.J., Buko, A., Menon, G., and Sarin, V.K. 1990. Carbohydrate composition and presence of a fucose-protein linkage in recombinant human prourokinase. Biochem. Biophys. Res. Commun. 171: 401–406.
Kiyota, T. and Kinoshita, T. 2004. The intracellular domain of X-Serrate-1 is cleaved and suppresses primary neurogenesis in Xenopus laevis. Mech. Dev. 121: 573–585.
Kuettner, E.B., Hilgenfeld, R., and Weiss, M.S. 2002. The active principle of garlic at atomic resolution. J. Biol. Chem. 277: 46402–46407.
Lai, E.C. 2004. Notch signaling: Control of cell communication. Development 131: 965–973.
Morgan, W.D., Birdsall, B., Frenkiel, T.A., Gradwell, M.G., Burghaus, P.A., Syed, S.E., Uthaipibull, C., Holder, A.A., and Feeney, J. 1999. Solution structure of an EGF module pair from the Plasmodium falciparum merozoite surface protein 1. J. Mol. Biol. 289: 113–122.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.
Nishimura, H., Takao, T., Hase, S., Shimonishi, Y., and Iwanaga, S. 1992. Human factor IX has a tetrasaccharide O-glycosidically linked to serine 61 through the fucose residue. J. Biol. Chem. 267: 17520–17525.
Okajima, T. and Irvine, K.D. 2002. Regulation of Notch signaling by O-linked fucose. Cell 111: 893–904.
Olofsson, A., Ichijo, H., Moren, A., ten Dijke, P., Miyazono, K., and Heldin, C.H. 1995. Efficient association of an amino-terminally extended form of human latent transforming growth factor-β binding protein with the extra-cellular matrix. J. Biol. Chem. 270: 31294–31297.
Panin, V.M., Shao, L., Lei, L., Moloney, D.J., Irvine, K.D., and Haltiwanger, R.S. 2002. Notch ligands are substrates for protein O-fucosyltransferase-1 and Fringe. J. Biol. Chem. 277: 29945–29952.
Przysiecki, C.T., Staggers, J.E., Ramjit, H.G., Musson, D.G., Stern, A.M., Bennett, C.D., and Friedman, P.A. 1987. Occurrence of β-hydroxylated asparagine residues in non-vitamin K-dependent proteins containing epidermal growth factor-like domains. Proc. Natl. Acad. Sci. 84: 7856–7860.
Rabbani, S.A., Mazar, A.P., Bernier, S.M., Haq, M., Bolivar, I., Henkin, J., and Goltzman, D. 1992. Structural requirements for the growth factor activity of the amino-terminal domain of urokinase. J. Biol. Chem. 267: 14151–14156.
Rigoutsos, I. and Floratos, A. 1998a. Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14: 55–67.
Rigoutsos, I. and Floratos, A. 1998b. Motif discovery without alignment or enumeration. Proceedings 2nd International Conference on Computational Molecular Biology (RECOMB '98), pp. 221–227. ACM Press, New York, NY.
Russell, R.B. and Barton, G.J. 1992. Multiple protein sequence alignment from tertiary structure comparison: Assignment of global and residue confidence levels. Proteins 14: 309–323.
Sahin, U., Weskamp, G., Kelly, K., Zhou, H.M., Higashiyama, S., Peschon, J., Hartmann, D., Saftig, P., and Blobel, C.P. 2004. Distinct roles for ADAM10 and ADAM17 in ectodomain shedding of six EGFR ligands. J. Cell Biol. 164: 769–779.
Sato, Y. and Rifkin, D.B. 1989. Inhibition of endothelial cell movement by pericytes and smooth muscle cells: Activation of a latent transforming growth factor-β1-like molecule by plasmin during co-culture. J. Cell Biol. 109: 309–315.
Schiffer, S.G., Foley, S., Kaffashan, A., Hronowski, X., Zichittella, A.E., Yeo, C.Y., Miatkowski, K., Adkins, H.B., Damon, B., Whitman, M., et al. 2001. Fucosylation of Cripto is required for its ability to facilitate nodal signaling. J. Biol. Chem. 276: 37769–37778.
Shao, L., Moloney, D.J., and Haltiwanger, R. 2003. Fringe modifies O-fucose on mouse Notch1 at epidermal growth factor-like repeats within the ligand-binding site and the Abruptex region. J. Biol. Chem. 278: 7775–7782.
Stenflo, J. 1991. Structure–function relationships of epidermal growth factor modules in vitamin K-dependent clotting factors. Blood 78: 1637–1651.
Stenflo, J., Ohlin, A.K., Owen, W.G., and Schneider, W.J. 1988. β-Hydroxyaspartic acid or β-hydroxyasparagine in bovine low density lipoprotein receptor and in bovine thrombomodulin. J. Biol. Chem. 263: 21–24.
Stenflo, J., Stenberg, Y., and Muranyi, A. 2000. Calcium-binding EGF-like modules in coagulation proteinases: Function of the calcium ion in module interactions. Biochim. Biophys. Acta 1477: 51–63.
Stetefeld, J., Mayer, U., Timpl, R., and Huber, R. 1996. Crystal structure of three consecutive laminin-type epidermal growth factor-like (LE) modules of laminin γ1 chain harboring the nidogen binding site. J. Mol. Biol. 257: 644–657.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673–4680.