Department of Biomathematical Sciences, Mount Sinai School of Medicine, New York, New York 10029, USA
Reprint requests to: S. Rackovsky, Department of Biomathematical Sciences, Mount Sinai School of Medicine, Box 1023, 1 Gustave L. Levy Place, New York, NY 10029, USA; e-mail: shelly{at}camelot.mssm.edu; fax: (212) 860-4630.
(RECEIVED May 15, 2003; FINAL REVISION July 16, 2003; ACCEPTED July 18, 2003)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.03209703.
| Abstract |
|---|
(φ, ψ) dihedral angles associated with each amino acid type. This index was used to analyze the local structural propensity of both SAPs and the sequence fragments contiguous to them. We also analyzed type-specific amino acid composition, solvent accessibility, and overall structural properties of SAPs and their sequence context. We show that each type of SAP has an unusual, type-specific amino acid composition and, as a result, simultaneous intrinsic preferences for two distinct types of backbone conformation. All types of SAPs have lower sequence complexity than average. Fragments that adopt helical conformation in one protein and sheet conformation in another have the lowest sequence complexity and are sampled from a relatively limited repertoire of possible residue combinations. A statistically significant difference between two distinct conformations of the same SAP is observed not only in the overall structural properties of proteins harboring the SAP but also in the properties of its flanking regions and in the pattern of solvent accessibility. These results have implications for protein design and structure prediction.

Keywords: Chameleon sequence; conformational variability; Ramachandran map; intrinsic propensity; local context; secondary structure; entropy
| Introduction |
|---|
α-helical or β-sheet conformations, and a limited number of substitutions can convert a helical protein to a predominantly β-sheet protein (Minor Jr. and Kim 1996; Dalal and Regan 2000). It has also been shown that many short sequence fragments are associated with a broad distribution of local structures rather than with a well-defined structural motif (Rackovsky 1993; Han and Baker 1996). A better understanding of the contribution of intrinsic propensity, sequence context, and other environmental factors to the conformational preference of such structurally ambivalent sequence fragments is important for reliable local structure prediction. Most scoring functions used in fold recognition use secondary structure propensity encoded in a variety of forms (Bowie et al. 1991; Bahar et al. 1997; Rost et al. 1997; Kuznetsov and Rackovsky 2002). Structurally ambivalent fragments therefore pose a challenge for fold recognition methods. Identification of structurally ambivalent fragments, and determination of the rules governing their conformational preferences, is also important for understanding the mechanism of conformational changes observed in misfolding diseases. One of the best-known examples of misfolding diseases is the group of neurodegenerative disorders (Creutzfeldt-Jakob disease [CJD], scrapie, mad cow disease, etc.) caused by the prion protein, which undergoes a conformational transition from a normal, mostly helical form to a β-sheet-rich pathogenic conformation (Prusiner 1998).
To address the problem of structural variability, one needs to identify conformationally ambivalent segments and to measure the degree of conformational variability of each sequence fragment. A great deal of attention has been paid to systematic identification and classification of sequence patterns with strong conformational preferences (Rooman et al. 1992; Bystroff et al. 2000; Jonassen et al. 2000), whereas the complementary analysis of structurally ambivalent segments has received less attention. The analyses that are available are mainly directed to the classification of loops used in homology modeling (Kwasigroch et al. 1996; Rufino et al. 1997) and macromolecular motions observed in proteins with known structure (Gerstein and Krebs 1998). It has also been shown that sequentially identical peptides of length up to nine residues can be found in the Protein Data Bank (PDB) in two different conformations (Kabsch and Sander 1984; Argos 1987; Cohen et al. 1993; Mezei 1998; Sudarsanam 1998; Zhou et al. 2000). Attempts to identify the environmental factors that modulate the intrinsic propensity of chameleon fragments led to the conclusion that the main factor is the overall structural class of the protein (Cohen et al. 1993; Zhou et al. 2000). No attempts have been made to systematically and exhaustively classify structurally ambivalent peptides (SAPs) or to carry out a statistical analysis of their properties and sequence context.
Recently, a number of empirical methods for identification of long conformationally flexible fragments have been suggested. One of them, a neural network predictor of long disordered segments, is based on amino acid composition (Romero et al. 2001). Another uses secondary structure prediction to identify long fragments in coil conformation (Liu et al. 2002). Application of these methods showed that highly flexible disordered regions are quite common in protein sequences, especially in eukaryotic ones, and that they have low sequence complexity (Dunker et al. 2002; Liu et al. 2002). These segments are presumed to be involved in binding and signal transduction (Wright and Dyson 1999; Dunker et al. 2002). Another approach is based on the assumption that the conformations of sequence fragments that do not have a strong intrinsic propensity for a particular type of local structure can be altered by changes in their environment. This approach uses conformational propensities of individual residues to find structurally ambivalent sequence fragments that exhibit strong preference for both α-helical and β-sheet conformation, and this approach was able to identify certain parts of known proteins that undergo conformational switches (Young et al. 1999). However, it is not known whether low sequence complexity holds for different types of structurally ambivalent fragments and what type-specific factors contribute to their final conformation.
In this article, we do not rely on empirical methods for the prediction of conformationally flexible segments in protein sequences. Rather, we analyze the properties of ensembles of SAPs in known protein structures, and we identify statistically significant differences in the context in which distinct conformations of these peptides are found. We address the following points:
α-helix and β-sheet conformation. We also identify other possible types of SAPs that exist in at least two distinct conformations.

This work has several novel aspects. First, we use a strict length-dependent threshold to identify significantly different conformations of the same sequence fragment and to classify SAPs into five nonredundant types based on their local structure. Furthermore, we analyze not only the properties of SAPs themselves but also those of the structural context in which these fragments are found, and we compare these properties to the average properties of reference fragments. This approach allows us to reliably detect even subtle differences between alternative conformations of SAPs and to identify type-specific factors modulating their structural propensities.
| Results |
|---|
(φ, ψ) dihedral angles associated with the central residue X in a tripeptide iXj, glp(iXj). The central residue X is described by a full 20-letter alphabet, whereas the flanking residues i and j are collapsed into three groups (for details, see Materials and Methods). A positive value of glp(iXj) indicates that the amino acid type X in the tripeptide iXj has conformational variability that is lower than average. A negative value of glp(iXj) indicates that the amino acid type X has higher conformational variability than average.
As expected, Gly is the most flexible residue, with the highest entropy of the distribution of dihedral angles, whereas Pro is the least flexible (Fig. 1). Because Gly lacks a side chain, it has high conformational variability even in the Pro-Gly-Pro tripeptide (glp[Pro-Gly-Pro] = -0.444). Asn and Asp are also conformationally flexible residues with glp(iXj) < 0. Pro has a positive value of glp(iXj), even in Gly-Pro-Gly (glp[Gly-Pro-Gly] = 0.922). This indicates that the behavior of the GLP is consistent with that expected from the physicochemical properties and conformational preferences of the conformationally significant amino acids, Gly, Pro, Asn, and Asp (Solis and Rackovsky 2000, 2002). Figure 1 shows the GLP for each of the 20 amino acids, plotted versus propensities for the α-helical and coil regions of the Ramachandran map. One characteristic feature of these plots is that the conformationally flexible Gly, Asn, and Asp, as well as the inflexible Pro, are outliers and behave differently from the other amino acids. The GLP is highly correlated with the propensity for the area outside the α-helical and extended regions of the Ramachandran map (the coil region) for Gly, Asn, Asp, and Pro (R = -0.93), whereas the correlation for the rest of the amino acids is low (R = -0.58). The GLP also shows a moderate correlation with α-helical propensity (R = 0.69 if Gly, Pro, Asn, and Asp are excluded), and no correlation with β-sheet propensity (R = 0.20 if Gly, Pro, Asn, and Asp are excluded).
α-helical and/or extended regions, and all have similar low coil propensity, although their GLP may vary. Correlation between the GLP and coil propensity computed for this group of amino acids is therefore low. Pro occurs only in a small area of the extended region and shows the highest GLP and lowest propensity for the coil region. The relationship between the GLP and α-helical propensity is ambiguous, because amino acids with similar α-helical propensities may be distributed differently outside the α-helical region and, as a result, may have significantly different GLP (Fig. 1B).
Residues that demonstrate negative or very small positive values of glp(iXj) do not have a preference for particular areas on the Ramachandran map and thus can adopt a broad range of backbone conformations. We will refer to them as residues with low local coding propensity. Residues that demonstrate large positive values of glp(iXj) do prefer particular areas on the Ramachandran map, and we will refer to them as residues with high local coding propensity. We will use the data obtained in this section to analyze the local coding propensity of SAPs and their flanks by using equations 3 through 5.
SAPs observed in the PDB and their sequence properties
In this section, we identify all types of SAPs observed in the PDB, classify them into five distinct groups based on the types of structure they adopt in two different proteins (Helix-Extended [chameleon fragments]; Helix-Irregular; Extended-Irregular; Irregular-Irregular; and Mixed), and analyze their sequence properties. Table 1 shows the total number of different types of SAPs observed in the PDB. The longest chameleon k-mer observed in the PDB, VLYVKLHN, is eight residues long, and the longest SAPs (KGVVPQLVK and SHHHHHHGS) are nine residues long and belong to the mixed type. It should be noted that EI k-mers are considerably less frequent than are other types. Fragments of sizes 5 and 6 are most frequent, and longer fragments are extremely rare. This observation is consistent with a previous conclusion that sequence fragments greater than size 7 are not well represented in the database of known structures, and there is no prospect for obtaining a large enough database to adequately represent fragments longer than eight residues (Fidelis et al. 1994).
α-helix formers with low β-sheet propensity (Ala, Arg, Glu, and Gln), strong β-sheet formers with low α-helical propensity (Val, Ile, Phe, and Tyr), and amino acids with high propensity for irregular conformation (Gly, Pro, Asn, and Asp). An analysis of the frequency of occurrence of residue pairs shows that the majority of chameleon k-mers (77.2% of all hexamers) have at least one pair consisting of a helix former with low sheet propensity and a sheet former with low helix propensity (Fig. 3).
α-helix and β-sheet propensities). SAPs can be arranged in the following order based on their average GLP: chameleon k-mers (0.43) > helical-irregular k-mers (0.30) > β-sheet-irregular k-mers (0.153) > irregular-irregular k-mers (0.101). Pairwise comparison of SAP types shows that, with one exception, the average GLP of each type is significantly different from that of other types (p < 10⁻⁴) and, therefore, type-specific. The exception is the pair EI-II, which does not show a significant difference in GLP. Among all types of SAPs, the chameleon k-mers have the highest average GLP per residue, 0.43, a value that is higher than the average GLP of all α-helical (0.39) and β-sheet (0.34) fragments observed in the PDB (Table 2).
α-helical and, especially, β-sheet fragments. As seen in Table 2
α-helix and β-sheet propensities of the same magnitude, equal to the average secondary structure propensities of all α-helical and β-sheet fragments observed in the PDB. These data indicate that due to a concerted occurrence of strong α-helix and β-sheet formers, as well as lack of amino acids with a strong propensity for areas outside the α-helical and extended regions of the Ramachandran map, chameleon k-mers have a strong preference for both α-helical and β-sheet conformations but not for an irregular conformation.
α-helix, β-sheet, or irregular). To avoid this bias, we compare the amino acid composition of SAPs with that of fragments with corresponding secondary structure. For instance, the composition of chameleon k-mers was compared with the average composition of α-helical and β-sheet k-mers of the same length. The results of this comparison are shown in Figure 4.
α-helical k-mer. They also contain more Ala and Leu and more of the hydrophilic helix-former Glu than does the average β-sheet.
Irregular k-mers in our classification include turns ("T" in the DSSP assignment). In many cases, residues in turn conformation (such as type II turns) are located in the helical region of the Ramachandran map and thus have (φ, ψ) angles similar to those of helical residues. Automatic secondary structure classification may erroneously assign the turn conformation to short helices or vice versa. We therefore need to check whether k-mers of HI type in irregular conformation contain a high percentage of turns. We compare the secondary structure content of HI, EI, and II k-mers in the irregular conformation. Surprisingly, HI k-mers have the lowest percentage of residues in turn conformations and the largest percentage in the coil state (unclassified conformation in the DSSP assignment). This observation confirms that HI k-mers indeed undergo a helix-coil conformational transition.
The SAPs can be arranged in order of increasing average sequence complexity, Q (equation 6), as follows: HE (Q = 3.79) < II (Q = 3.87) < EI (Q = 3.91) < HI (Q = 3.95). The average sequence complexity of the data set from which the SAPs were sampled is 4.19. The observation that all types of SAP have sequence complexity below average, and are constructed using a limited number of "preferred" residues, implies that the number of different short flexible k-mers (especially of chameleon k-mers, which have the lowest sequence complexity) should be limited. We check this assumption by asking how many longer chameleon k-mers can be identified by matching to shorter ones. We used a selection method based on local sequence alignment provided by the PISCES Web site (see Materials and Methods) to construct a nonredundant subset of PDB structures (both high-resolution X-ray and NMR structures) such that all proteins in this subset have pairwise sequence identity <50%. All k-mers with chameleon properties were identified in this subset. The nonredundant nature of this subset ensures that all SAPs, regardless of their length, are separated from each other by a significant evolutionary distance. Matching known chameleon 5-mers against 6-mers (only exact matches were considered) shows that 68% of all chameleon 6-mers from our data set are identified. Similarly, 91% of all 7- and 8-mers in our data set are identified by matching them against 5- and 6-mers. These results indicate that the repertoire of chameleon k-mers is indeed limited to certain combinations of residues, and during the course of protein evolution, longer k-mers are constructed by adding a few residues with appropriate local propensity to shorter ones (or, conversely, that shorter k-mers are constructed by trimming a few residues from longer ones).
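The exact-match comparison of shorter against longer chameleon k-mers can be sketched as follows. This is only an illustration of the idea: the function name `fraction_matched` is hypothetical, and the 5-mer `LYVKL` is used purely as an example input (it is not taken from the data set; only the 8-mer VLYVKLHN is quoted in the text).

```python
def fraction_matched(longer_kmers, shorter_kmers):
    """Fraction of longer k-mers that contain at least one known shorter
    k-mer as an exact, contiguous substring (hypothetical helper)."""
    shorter = set(shorter_kmers)
    lengths = {len(s) for s in shorter}

    def contains_known(kmer):
        # Slide a window of each known length along the longer k-mer.
        return any(
            kmer[i:i + n] in shorter
            for n in lengths
            for i in range(len(kmer) - n + 1)
        )

    matched = [k for k in longer_kmers if contains_known(k)]
    return len(matched) / len(longer_kmers)


# Illustrative input only: one 8-mer from the text and one made-up 8-mer.
print(fraction_matched(["VLYVKLHN", "AAAAAAAA"], ["LYVKL"]))
```

Only exact substring matches are counted here, mirroring the statement that only exact matches were considered.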
Factors that may affect the conformational preferences of SAPs
The amino acid composition of SAPs allows them to adopt at least two different backbone conformations. We now ask what factors govern conformational choice in these peptides. To gain insights into this problem, we analyze secondary structure content, solvent accessibility, and local structural propensity in the flanking regions of SAPs, as well as the overall properties of protein sequences in which SAPs were found. Our goal is to find out whether each of two alternative conformations of a given type of SAP is associated with specific properties of its flanking regions (local context) or of the entire sequence (global context).
To compare the local sequence and structural context of two alternative conformations of the same SAP, we first need to determine how many adjacent residues we must take into account. To do this, we compared average local propensity, solvent accessibility, and secondary structure content computed for flanking windows of size 4 as a function of the sequence separation between the k-mer and the windows (equation 4). The results of this comparison for the GLP and solvent accessibility are shown in Figures 5 and 6. The difference in local propensity and solvent accessibility between flanks of two distinct conformations of the same SAP rapidly decreases to zero when sequence separation reaches four to six residues. It should be remarked, however, that we did not find a significant difference in hydrophobic properties between flanks of two conformations of the same SAP. The difference in α-helical and β-sheet propensity decays to zero faster than that in the GLP (data not shown). The difference in secondary structure content also rapidly decreases (data not shown) and reaches the average difference (Table 3) observed between two groups of proteins that contain distinct conformations of the same SAP type. The length of flanking regions that maximizes the average difference between two distinct conformations of the same SAP type is six residues.
37% of the time. Additional information about the local context of chameleon fragments can be obtained by analyzing how often they occur in the middle of a helix or β-sheet. We define a fragment as being located in the middle of a regular secondary structure if at least two residues immediately adjacent to both ends of the fragment have the same regular structure. Chameleon hexamers in the helical conformation are located in the middle of a helix 59% of the time, whereas only 15% of chameleon hexamers in the β-sheet conformation are located in the middle of a β-sheet. We conclude that strong local helical propensity in the flanking residues forces chameleon k-mers to adopt a helical conformation. In the absence of flanks with high local coding propensity, chameleon k-mers adopt the more energetically favorable extended conformation. A factor of lesser importance may be interaction with solvent: The chameleon k-mers in the helical conformation are more accessible to solvent, whereas their flanking residues are less accessible than those in the β-sheet conformation. The chameleon k-mers in both conformations are slightly less accessible than are average helical and sheet fragments. However, it should be noted that the flanks of chameleon k-mers in the helical conformation do not show unusual properties. Their average local propensity and solvent accessibility are similar to those observed in flanks of all helical fragments in general (Tables 2, 4).
| Discussion |
|---|
What features of the protein environment modulate the final conformation of a SAP? A significant difference between two distinct conformations of the same SAP is observed both in the overall sequence and structural properties of the proteins harboring the SAP (global context) and in the sequence and structural properties of its flanking regions (local context), as well as in the per-fragment and per-flank pattern of solvent accessibility. The importance of the global protein environment in determining the final conformation of SAPs has been pointed out previously (Cohen et al. 1993). The data reported here indicate that local sequence context is another major determinant of the final conformation of these fragments, with solvent accessibility also playing an important role, especially in the case of irregular conformations. These data do not support the earlier conclusion that most SAPs have similar patterns of solvent accessibility (Cohen et al. 1993). A likely explanation of this disagreement is that previous studies were performed on a much smaller data set (59 hexapeptides), which did not allow observation of fine differences. The amino acid composition and local propensity of flanking regions of SAPs are similar to those of average fragments (Fig. 8, Table 2). This observation indicates that there exist general conformation-specific signals in the local context of all helical, sheet, and irregular fragments that can be used in structure prediction to distinguish between alternative conformations of the same SAP. The difference in local context between alternative conformations of most SAPs becomes statistically insignificant for residues separated by more than 6 to 10 positions from the SAP. Many secondary structure prediction methods use six to eight adjacent residues to predict conformation for the central residue (Rost and Sander 1994; Garnier et al. 1996) and therefore seem to extract nearly the maximal possible amount of information from local sequence context. However, it is worth noting that in the case of EI k-mers, a significant difference in local propensity and solvent accessibility of flanking regions is observed up to 11 adjacent residues. This indicates that a larger window size is required in order to capture all the contextual differences between β-sheet and irregular conformations.
Our results indicate that both local and global context modulate the final conformation of SAPs. What is the relative contribution of each of these factors during the folding process? It is reasonable to assume that when a HE or HI k-mer is flanked by strong helix formers, it undergoes a fast cooperative transition to an α-helical conformation initiated by adjacent residues, which may adopt native-like helical conformation during the early stages of protein folding (Baldwin and Rose 1999). Low β-sheet content in a mainly α protein or in a mainly α-helical domain of an α/β protein may facilitate this process, because of the absence of complementary β-strands that might form long-range hydrogen bonds with the SAP and interfere with helix formation. On the other hand, a HE or EI k-mer located in a sequence context with low local coding propensity may adopt the energetically favorable extended conformation. The greater conformational flexibility in flanks that we observe in this case may allow the k-mer to find a complementary β-strand in a protein with high β-sheet content, and form long-range hydrogen bonds. These assumptions are consistent with experimental data of Minor Jr. and Kim (1996) and the results of folding simulations performed by Baldwin and Rose (1999), which showed that a chameleon fragment of length 11 adopts a helical conformation when placed in the helical context within protein G, and the same fragment adopts a β-sheet conformation when placed in the β-sheet context within the same protein.
The results presented in this work were obtained by using a set of relatively short SAPs observed in a data set of soluble globular proteins that represent the majority of proteins available in the PDB. There is a growing body of evidence indicating that many proteins, especially from eukaryotes, contain long flexible disordered fragments (Wright and Dyson 1999; Dunker et al. 2002; Liu et al. 2002), which are not well represented in the PDB (Huntley and Golding 2002). Elucidation of the rules governing the behavior of these long disordered fragments requires additional theoretical and experimental study.
| Materials and methods |
|---|
(φ, ψ) angles associated with the central amino acid in a tripeptide iXj. The central residue, X, is described by a full 20-letter alphabet, whereas the flanking residues i and j are collapsed into three groups, based on their side chain properties: group 1, Gly; group 2, Pro; and group 3, 18 other amino acids. Division into a larger number of groups is not statistically feasible, given the current size of the data set of nonredundant structures. Gly and Pro were separated from the rest of the amino acids because this grouping of the amino acids was shown to be the most informative three-letter alphabet for defining local sequence/structure relationships in proteins (Solis and Rackovsky 2000, 2002). Our approach is an extension of a method based on the comparison of the entropy of structural distributions in the virtual-bond backbone representation (Rackovsky 1990), which was used to identify tetramers that encode for identifiable distributions of local structure (Rackovsky 1993).
The Ramachandran plot was divided by using an equally spaced four-by-four grid (-180° to -90°, -90° to 0°, 0° to 90°, and 90° to 180° on both axes). The width of the distribution of dihedral angles for an amino acid X in a given tripeptide iXj is measured by its entropy (Rackovsky 1990, 1993) by using the following equation:
![]() | (1) |
![]() |
where S(iXj) is the observed Shannon entropy of the tripeptide-specific distribution of dihedral angles; n is the total number of grid cells used to partition the Ramachandran map (n = 16 here); p(k) is the fractional frequency of occurrence of tripeptide iXj in the grid cell k; NiXj is the number of occurrences of tripeptide iXj in the data set; and SR(NiXj) is the average entropy of a distribution of NiXj tripeptides randomly sampled without replacement from the data set.
To compute the average entropy for each tripeptide iXj, we generate 10⁵ random samples of size NiXj (a number sufficient to ensure convergence to three decimal digits). Comparison to a reference random distribution of the same size as the observed number of tripeptides provides an automatic correction for the unequal number of observations and unequal proportion of residues observed in helical and extended conformations (Rackovsky 1993; Swindells et al. 1995). A positive value of glp(iXj) indicates that the average entropy of random distributions is greater than that observed for a given tripeptide iXj. This implies that the tripeptide is preferentially observed in specific areas of the Ramachandran plot. Negative values of glp(iXj) correspond to a lack of preference of the tripeptide iXj for specific values of φ and ψ angles. The higher the value of glp(iXj), the smaller the area on the Ramachandran plot the tripeptide is observed to occupy. It should be emphasized that for each amino acid, this approach measures the context-dependent breadth of the allowed area on the Ramachandran map, not a preference for particular values of the dihedral angles. The latter property is measured by secondary structure propensities. glp(iXj) combines information about preferences for α-helical, β-sheet, and coil regions into one "flexibility" index. We will refer to this index as the GLP. When we refer to conformational variability of a particular amino acid type, we will imply the context-dependent degree of backbone conformational variability given by equation 1. It should be remarked that the resolution of the GLP can be increased, as the size of the data set increases, by partitioning the amino acid alphabet into a larger number of groups and/or dividing the Ramachandran map using a finer grid.
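The entropy comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the index is taken to be the average entropy of random samples minus the observed entropy (consistent with the sign convention described in the text), the logarithm is natural (the base is not recoverable here), any further normalization is omitted, and the function names `grid_cell`, `shannon_entropy`, and `glp` are hypothetical. The default number of random samples is far smaller than the text specifies, purely to keep the example fast.

```python
import math
import random
from collections import Counter


def grid_cell(phi, psi):
    """Map (phi, psi) in degrees to one of the 16 cells of a 4x4 grid
    with 90-degree bins spanning -180 to 180 on both axes."""
    def bin90(angle):
        # An angle of exactly 180.0 is clamped into the last bin.
        return min(int((angle + 180.0) // 90.0), 3)
    return 4 * bin90(phi) + bin90(psi)


def shannon_entropy(cells):
    """Shannon entropy of the cell occupancies (natural log; assumed base)."""
    counts = Counter(cells)
    total = len(cells)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def glp(tripeptide_cells, all_cells, n_samples=1000, seed=0):
    """Assumed form of the index: mean entropy of random samples of the
    same size (drawn without replacement) minus the observed entropy."""
    rng = random.Random(seed)
    size = len(tripeptide_cells)
    observed = shannon_entropy(tripeptide_cells)
    reference = sum(
        shannon_entropy(rng.sample(all_cells, size)) for _ in range(n_samples)
    ) / n_samples
    return reference - observed
```

With this sign convention, a tripeptide whose observations fall in a single grid cell gets a positive value, and one spread more broadly than a typical random sample gets a negative value.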
The GLP can be used to compute the average conformational variability index of a sequence fragment, GLP(S):
![]() | (2) |
where L is the sequence length and Am is amino acid in sequence position m.
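As a rough sketch of the per-fragment average, the code below averages tripeptide values along a sequence. Several details are assumptions, because the equation itself is not reproduced here: the lookup table is keyed by (left group, central residue, right group), terminal residues without two neighbours are skipped, and the names `fragment_glp` and `glp_table` are hypothetical. The two table entries are the values quoted in the Results for Pro-Gly-Pro and Gly-Pro-Gly.

```python
def fragment_glp(sequence, glp_table):
    """Mean glp over all positions that have both neighbours in the fragment.

    Flanking residues are collapsed into the three groups described in the
    text: Gly ("G"), Pro ("P"), and the 18 other amino acids ("x").
    """
    if len(sequence) < 3:
        raise ValueError("need at least one complete tripeptide")

    def group(residue):
        return residue if residue in ("G", "P") else "x"

    values = [
        glp_table[(group(sequence[m - 1]), sequence[m], group(sequence[m + 1]))]
        for m in range(1, len(sequence) - 1)
    ]
    return sum(values) / len(values)


# Two entries quoted in the text; a full table would cover all tripeptides.
table = {("P", "G", "P"): -0.444, ("G", "P", "G"): 0.922}
print(fragment_glp("GPGP", table))
```

How the end positions of a fragment are handled in the original work is not stated in this section; skipping them is simply the most conservative choice for a sketch.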
To derive tripeptide-specific distributions of dihedral angles, we used a nonredundant data set of high-resolution X-ray structures, with resolution ≤ 3.0 Å, R-factor ≤ 0.3, and sequence length ≥ 40 residues, such that all chains have pairwise sequence identity <20%. The data set consists of 1647 structures (389,120 tripeptides) and was obtained from the PISCES Web site (http://www.fccc.edu/research/labs/dunbrack/pisces/), which provides the most up-to-date compilation of the nonredundant PDB database by using the selection method of Hobohm et al. (1993). The (φ, ψ) angles were computed by using the DSSP program (Kabsch and Sander 1983). A complete set of generalized local propensities is available online at http://c3.biomath.mssm.edu/~igor/trimers.txt.
Identification of SAPs
We selected from the PDB (Berman et al. 2000) all pairs of high-resolution X-ray structures (resolution ≤ 3.0 Å, R-factor ≤ 0.3, and length ≥ 40 residues) that have pairwise sequence identity <50%. Transmembrane fragments annotated in the SCOP structural database (Murzin et al. 1995) were excluded from the analysis. Each pair was scanned for sequentially identical peptide pairs (SIPPs), and we retained only those SIPPs that exist in two distinct backbone conformations. SIPPs with two distinct backbone conformations were identified by using the following two-step procedure:
root mean square deviation (RMSD) above a certain threshold. Two conformations of the same sequence fragment are considered to be significantly different if their Cα RMSD is equal to or greater than the average RMSD of an ensemble of random fragments of the same length. We will refer to a sequence fragment that exists in two dissimilar conformations in two different proteins as a SAP.
To obtain the dissimilarity threshold, we compute the average Cα RMSD for randomly selected pairs of structural fragments of length 5 to 15 (using 10⁵ random pairs for each fragment length). Only those pairs are used that come from two different proteins and do not have positions with matching regular secondary structure. Random pairs were selected from the September 2001 release of the PDB_SELECT data set of nonredundant high-resolution X-ray structures clustered at 25% sequence similarity (Hobohm et al. 1993), without chain breaks, with resolution ≤ 2.0 Å, and R-factor ≤ 0.2 (471 proteins in total). Cα RMSDs were computed by using the ProFit program (Martin 2001). Secondary structure was assigned by using the DSSP program (Kabsch and Sander 1983).
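The length-dependent threshold test can be illustrated as below, assuming the RMSD values have already been computed by an external superposition tool. The function names and the numbers in the usage example are hypothetical; the actual thresholds are not given in this section.

```python
from statistics import mean


def dissimilarity_thresholds(random_rmsds_by_length):
    """Average RMSD of random fragment pairs, one threshold per length."""
    return {length: mean(values)
            for length, values in random_rmsds_by_length.items()}


def is_structurally_distinct(rmsd, length, thresholds):
    """True if the RMSD is equal to or greater than the average random RMSD
    for fragments of the same length, as described in the text."""
    return rmsd >= thresholds[length]


# Made-up RMSD values (in angstroms) standing in for the random-pair ensemble.
thresholds = dissimilarity_thresholds({5: [2.0, 3.0, 4.0]})
print(is_structurally_distinct(3.4, 5, thresholds))
```

Keeping one threshold per fragment length matters because the RMSD of unrelated fragments grows with length, so a single cutoff would be too strict for short fragments or too lenient for long ones.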
If a fragment of length k in protein r had a crystallographic B-factor 2 SD above the average B-factor computed for all fragments of length k in the protein r, this fragment was excluded from the analysis. This approach allows one to remove poorly characterized fragments by taking into account a protein-specific dependence of B-factor on the particular method used to refine crystallographic data (Saqi 1995).
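A possible reading of this filter is sketched below. It assumes that a fragment's B-factor is the mean of its per-residue B-factors and that the standard deviation is the population SD over all overlapping fragments of length k in the chain; neither detail is stated in the text, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def well_ordered_fragment_starts(b_factors, k, n_sd=2.0):
    """Start indices (0-based) of length-k fragments whose mean B-factor is
    not more than n_sd standard deviations above the protein-wide average
    over all length-k fragments."""
    fragment_means = [
        mean(b_factors[i:i + k]) for i in range(len(b_factors) - k + 1)
    ]
    cutoff = mean(fragment_means) + n_sd * pstdev(fragment_means)
    return [i for i, value in enumerate(fragment_means) if value <= cutoff]


# Made-up per-residue B-factors with one poorly ordered stretch.
b = [10.0] * 20 + [100.0] * 3 + [10.0] * 20
kept = well_ordered_fragment_starts(b, k=3)
print(len(kept), 20 in kept)
```

Because the cutoff is computed per protein, the filter adapts to structures whose B-factors are uniformly high or low, which is the point the text makes about refinement-method dependence.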
All final SAPs were classified in five nonredundant categories based on secondary structure: (1) HE, the SAP exists as a helix in the DSSP assignment in one protein and a β-sheet in the other; (2) HI, the SAP exists as a helix in one protein and in an irregular conformation (S, T, B, or C in the DSSP assignment) in the other; (3) EI, the SAP exists as a β-sheet in one protein and in an irregular conformation in the other; (4) II, the SAP exists in different irregular conformations in the two proteins; and (5) mixed, all SAPs that cannot be classified into the first four groups.
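The five categories can be expressed as a small classifier over DSSP strings. The sketch assumes that a fragment counts as a helix only if every residue is assigned H, as a β-sheet only if every residue is E, and as irregular only if every residue is S, T, B, or C (with a blank DSSP code treated as C); how partially regular fragments were treated is not stated here, so they fall into the mixed class. The function names are hypothetical.

```python
HELIX = set("H")
SHEET = set("E")
IRREGULAR = set("STBC")  # irregular states as listed in the text


def conformation(dssp):
    """Return 'H', 'E', or 'I' if every residue falls in that class,
    otherwise None (assumed all-residue criterion)."""
    states = set(dssp.replace(" ", "C"))  # blank DSSP code treated as coil
    if not states:
        return None
    if states <= HELIX:
        return "H"
    if states <= SHEET:
        return "E"
    if states <= IRREGULAR:
        return "I"
    return None


def classify_sap(dssp_a, dssp_b):
    """Classify a pair of conformations as HE, HI, EI, II, or mixed."""
    a, b = conformation(dssp_a), conformation(dssp_b)
    if a is None or b is None:
        return "mixed"
    label = "".join(sorted([a, b], key="HEI".index))
    return label if label in {"HE", "HI", "EI", "II"} else "mixed"


print(classify_sap("EEEEE", "HHHHH"))
```

The label is order-independent, so the same category is returned regardless of which protein of the pair is listed first.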
We will refer to these SAPs as k-mer pairs (pentamers, hexamers, etc.), denoting sequence fragments of length k that exist in two significantly different conformations in distinct proteins. It should be emphasized that the SAPs were classified based solely on the structural criteria outlined above, without regard to their amino acid composition. We also generated an extended list of chameleon k-mers by including NMR structures and selecting SAPs that exist as a helix in one protein and a β-sheet in the other. The data set of all SAPs used in this work is available upon request.
For a given amino acid property, P (index P denotes a particular property, such as helical or sheet propensity), the average property of a k-mer of type i found in conformation c in protein r, is computed by using the following equation:
$$P_{k\text{-mer}}(i \mid c, r) = \frac{1}{k} \sum_{l=j}^{j+k-1} P(A_{rl}) \tag{3}$$
The average property in flanking regions of a k-mer of type i, found in conformation c in protein r, is computed by using the following equation:
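$$P_{\text{flank}}(i \mid c, r) = \frac{1}{2w} \left[ \sum_{l=j-s-w}^{j-s-1} P(A_{rl}) + \sum_{l=j+k+s}^{j+k+s+w-1} P(A_{rl}) \right] \tag{4}$$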
where w is the length of flanking regions; j is the sequence position of the first residue in the k-mer; k is the size of the k-mer; s is the sequence separation between the k-mer and flanking regions (s = 0 means that the flanking region is immediately adjacent to the k-mer); $A_{rl}$ is the amino acid at sequence position l of protein r; and $P(A_{rl})$ is the value of property P for this amino acid. In the case of the GLP, $A_{rl}$ is a tripeptide centered at sequence position l, and $P(A_{rl})$ is the GLP of this tripeptide, glp($A_{rl}$). If for some k-mer (j-w-s) ≤ 0 or (j+s+k+w-1) > $L_r$ (where $L_r$ is the length of protein r), that k-mer was excluded from the computation of the properties of flanking regions.
The average property P computed over flanks of all k-mers of type i found in conformation c is given by
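$$\langle P_{\text{flank}}(i \mid c) \rangle = \frac{1}{N(i,c)} \sum_{r=1}^{N(i,c)} P_{\text{flank}}(i \mid c, r) \tag{5}$$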
where N(i,c) is the number of proteins that contain the k-mer of type i in conformation c. The average property computed over all k-mers of type i found in conformation c, $\langle P_{k\text{-mer}}(i \mid c) \rangle$, is computed in a similar fashion. We used the conventional α-helical and β-sheet propensities of amino acids taken from Swindells et al. (1995) to determine the average α-helical and β-sheet propensities of a sequence fragment by means of equations 3 through 5.
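The averaging in equations 3 through 5 can be sketched as follows, assuming each is an unweighted arithmetic mean: over the k residues of the k-mer, over the 2w residues of the two flanks, and over the proteins that contain the k-mer. Positions are 1-based, as in the text. The function names and the `prop` lookup table (a per-residue property scale) are hypothetical, and the tripeptide-based GLP case is not covered.

```python
def kmer_average(prop, seq, j, k):
    """Mean property over the k residues starting at 1-based position j."""
    return sum(prop[seq[l - 1]] for l in range(j, j + k)) / k

def flank_average(prop, seq, j, k, w, s=0):
    """Mean property over both flanks of width w at separation s.
    Returns None when a flank would extend beyond either end of the chain."""
    if j - w - s <= 0 or j + s + k + w - 1 > len(seq):
        return None  # k-mer excluded from the flank statistics
    left = range(j - s - w, j - s)           # positions j-s-w .. j-s-1
    right = range(j + k + s, j + k + s + w)  # positions j+k+s .. j+k+s+w-1
    total = sum(prop[seq[l - 1]] for l in (*left, *right))
    return total / (2 * w)

def average_over_proteins(values):
    """Mean over all proteins for which the per-protein value is defined."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept) if kept else None
```

For example, with `seq = "GGAAAVV"`, `j = 3`, `k = 3`, and `w = 2`, the flanks are `GG` and `VV`, and widening to `w = 3` excludes the k-mer because the left flank would start before the first residue.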
Entropic sequence complexity of a sequence fragment, Q(S), was computed by using the following equation (Wotton 1993):
$$Q(S) = -\sum_{i=1}^{20} f_i \log f_i \tag{6}$$
where $f_i$ is the normalized frequency of amino acid type i in sequence S.
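As an illustration, the complexity of a fragment can be computed as the Shannon entropy of its amino acid frequencies. The sketch below assumes base-2 logarithms, which the text does not specify; a different base only rescales the values. The function name `sequence_complexity` is hypothetical.

```python
from collections import Counter
from math import log2

def sequence_complexity(seq):
    """Shannon entropy (in bits, by assumption) of the residue composition of seq."""
    n = len(seq)
    counts = Counter(seq)
    # Residue types absent from the fragment contribute nothing to the sum.
    return -sum((c / n) * log2(c / n) for c in counts.values())
```

A homopolymer such as `AAAAA` has complexity 0, and a fragment in which every residue is different has the maximum value for its length.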
We use the UPGMA method (average linkage; Sneath and Sokal 1973) to cluster groups of sequence fragments according to their average amino acid frequencies. The output of the method is a clustering tree that shows the degree of similarity between these groups. The distance between a group of sequence fragments of type i and a group of sequence fragments of type j is defined as follows:
![]() | (7) |
where $f_{ik}$ is the average normalized frequency of amino acid type k in group i.
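The clustering step can be sketched in plain Python as below. The distance used here is an assumption: the Euclidean distance between the frequency vectors of two groups stands in for equation 7 and may differ from the measure actually used. The function name `upgma` and the nested-tuple tree format `(left, right, linkage_distance)` are hypothetical.

```python
from math import dist

def upgma(names, profiles):
    """Average-linkage (UPGMA) clustering of frequency profiles.
    Returns a nested tuple (left_subtree, right_subtree, linkage_distance)."""
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (tree, size)
    # Pairwise distances between groups (Euclidean, by assumption).
    d = {(i, j): dist(profiles[i], profiles[j])
         for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        (a, b), height = min(d.items(), key=lambda kv: kv[1])  # closest pair
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # Size-weighted average keeps every original group equally weighted.
            d[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        del d[(a, b)]
        clusters[next_id] = ((tree_a, tree_b, height), n_a + n_b)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

In real use each profile would be a 20-component vector of average amino acid frequencies for one group of fragments, and the linkage distances give the branch points of the clustering tree.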
| Acknowledgments |
|---|
| References |
|---|
Bahar, I., Kaplan, M., and Jernigan, R.L. 1997. Short-range conformational energies, secondary structure propensities, and recognition of correct sequence-structure matches. Proteins 29: 292–308.
Baldwin, R.L. and Rose, G.D. 1999. Is protein folding hierarchic? I: Local structure and peptide folding. TIBS 24: 26–33.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.
Bowie, J.U., Luthy, R., and Eisenberg, D. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253: 164–170.
Bystroff, C., Thorsson, V., and Baker, D. 2000. HMMSTR: A hidden Markov model for local sequence–structure correlations in proteins. J. Mol. Biol. 301: 173–190.
Cohen, B.I., Presnell, S.R., and Cohen, F.E. 1993. Origins of structural diversity within sequentially identical peptides. Protein Sci. 2: 2134–2145.
Dalal, S. and Regan, L. 2000. Understanding the sequence determinants of conformational switching using protein design. Protein Sci. 9: 1651–1659.
Dunker, K.A., Brown, C.J., Lawson, D.J., Iakoucheva, L.M., and Obradovic, Z. 2002. Intrinsic disorder and protein function. Biochemistry 41: 6573–6582.
Fidelis, K., Stern, P.S., Bacon, D., and Moult, J. 1994. Comparison of systematic search and database methods for constructing segments of protein structure. Protein Eng. 7: 953–960.
Garnier, J., Gibrat, J.-F., and Robson, B. 1996. GOR secondary structure prediction method version IV. Methods Enzymol. 266: 540–553.
Gerstein, M. and Krebs, W. 1998. A database of macromolecular motions. Nucleic Acids Res. 26: 4280–4290.
Griffiths-Jones, S.R., Sharman, G.J., Maynard, A.J., and Searle, M. 1998. Modulation of intrinsic phi,psi propensities of amino acids by neighboring residues in the coil regions of protein structures: NMR analysis and dissection of a β-hairpin peptide. J. Mol. Biol. 284: 1597–1609.
Han, K.F. and Baker, D. 1996. Global properties of the mapping between local amino acid sequence and local structure in proteins. Proc. Natl. Acad. Sci. 93: 5814–5818.
Hobohm, U., Scharf, M., and Schneider, R. 1993. Selection of representative protein data sets. Protein Sci. 1: 409–417.
Huntley, M.A. and Golding, G.B. 2002. Simple sequences are rare in Protein Data Bank. Proteins 48: 134–140.
Jonassen, I., Eidhammer, I., Grindhaug, S.H., and Taylor, W.R. 2000. Searching the protein structure databank with weak sequence patterns and structural constraints. J. Mol. Biol. 304: 599–619.
Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
Kabsch, W. and Sander, C. 1984. On the use of sequence homologies to predict protein structure: Identical pentapeptides can have different conformations. Proc. Natl. Acad. Sci. 81: 1075–1078.
Kuznetsov, I.B. and Rackovsky, S. 2002. Discriminative ability with respect to amino acid types: Assessing the performance of knowledge-based potentials without threading. Proteins 49: 266–284.
Kwasigroch, J.-M., Chomilier, J., and Mornon, J.-P. 1996. A global taxonomy of loops in globular proteins. J. Mol. Biol. 259: 855–872.
Liu, J., Tan, H., and Rost, B. 2002. Loopy proteins appear conserved in evolution. J. Mol. Biol. 322: 53–64.
Martin, A.C.R. 2001. ProFit V2.2. http://www.bioinf.org.uk/software/profit/
Mezei, M. 1998. Chameleon sequences in PDB. Protein Eng. 11: 411–414.
Minor Jr., D.L. and Kim, P.S. 1996. Context-dependent secondary structure formation of a designed protein sequence. Nature 380: 730–734.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of protein database for the investigation of sequence and structures. J. Mol. Biol. 247: 536–540.
Prusiner, S. 1998. Prions. Proc. Natl. Acad. Sci. 95: 13363–13383.
Rackovsky, S. 1990. Quantitative organization of known protein X-ray structures, I: Methods and short-length scale results. Proteins 7: 378–402.
Rackovsky, S. 1993. On the nature of the protein folding code. Proc. Natl. Acad. Sci. 90: 644–648.