Department of Biomathematical Sciences, Mount Sinai School of Medicine, New York, NY 10029, USA
Reprint requests to: Igor B. Kuznetsov, Center for Functional Genomics, University at Albany, 1 University Place, Rensselaer, NY 12144-2345, USA; e-mail: IKuznetsov{at}albany.edu; fax: (518) 525-2799.
(RECEIVED April 22, 2004; FINAL REVISION August 9, 2004; ACCEPTED August 11, 2004)
| Abstract |
|---|
Keywords: conformational variability; PrP structural transition; intrinsic propensity; physicochemical property; scan statistics; bioinformatics
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.04833404.
| Introduction |
|---|
α-helix and 3% β-sheet, whereas the infectious PrPSc contains 30% helix and 43% β-sheet (Pan et al. 1993). Thus, the conformational transition PrPC → PrPSc involves unfolding of α-helices and formation of β-sheets. The cellular form, PrPC, is characterized by high thermodynamic stability, and analysis of some of the mutations linked to hereditary forms of human prion disease showed that they do not result in a significant destabilization of PrPC (Swetnicki et al. 1998). These data indicate that some hereditary forms of prion disease caused by familial mutations cannot be explained by a decrease in the thermodynamic stability of PrPC, which would favor the formation of the pathogenic conformation, PrPSc, and that alternative disease-forming mechanisms should be considered.
The cellular form of the prion protein is a GPI-anchored outer-membrane protein that undergoes rapid endocytosis with subsequent recycling, with a half-life on the cell membrane of about 20 min. Soon after synthesis, a signal peptide 22 amino acids long is removed from the N-terminal end of PrP (Harris et al. 1996; Lehmann et al. 1999). Upon addition of the GPI anchor, a 23-residue-long peptide is removed from the C-terminal end. The function of the prion protein is not known, and it was shown that knock-out mice that do not express PrP are resistant to prion infection (Prusiner 1998). A number of NMR and X-ray studies aimed at determining the structure of PrPC have revealed that the C-terminal domain of the protein is structured and contains three α-helices (A, B, and C) and a short β-sheet, whereas the N-terminal domain, which contains Gly and Pro-rich octarepeats, is highly flexible and cannot be assigned a particular conformation (Donne et al. 1997; Wright and Dyson 1997; Riek et al. 1998; Lopez-Garcia et al. 2000). Helices B and C are linked by a disulfide bond and form a two-helix bundle. A recent X-ray study showed that PrPC can form a three-dimensional domain-swapped dimer, in which helices 2 and 3 (B and C) repack with rearrangement of the disulfide bond (Knaus et al. 2001). It has also been shown that certain secondary structure prediction algorithms predict a β-sheet conformation for PrP helix B (Kallberg et al. 2001). However, it is not clear whether helix B has an unusual amino acid composition and, as a result, unusually high β-sheet propensity, or whether this misprediction is merely a consequence of the limited accuracy of secondary structure prediction algorithms. A paralog of the prion protein, PrP-Doppel, was identified (Silverman et al. 2000; Mo et al. 2001). This protein and PrP share about 25% sequence identity and have very similar structures. Despite its structural similarity to PrP, Doppel does not form infectious particles and does not undergo a structural transition into a β-sheet-rich conformation (Nicholson et al. 2002).
Much less is known about the pathogenic conformation of the prion protein, PrPSc, except for its approximate secondary structure content, protease resistance, and the insolubility of some forms (Prusiner 1998; Prusiner et al. 1998). Different theoretical models of the PrP conformational transition have been proposed. Some of these models suggest that helix A in PrPC unfolds, adopts a β-sheet-like structure, and serves as a potential nucleation site that initiates conformational transition, whereas helices B and C, stabilized by the presence of an interhelix disulfide bond that remains intact, retain helical structure in PrPSc (Huang et al. 1994, 1995; Morrisey and Shakhnovich 1999; Wille et al. 2002). It has also been shown experimentally that the conversion from PrPC to PrPSc is influenced by interactions involving aspartic acid residues in helix A (Speare et al. 2003). However, recent crystallographic data argue that PrPC can form a dimer, in which the intramolecular disulfide bond becomes an intermolecular bond linking two monomers (Knaus et al. 2001). Others suggest that helix A has a high intrinsic helical propensity and is preserved during the initial stages of conformational transition (Liu et al. 1999; Ziegler et al. 2003). Another site that is believed to be involved in PrPC to PrPSc conversion is the PrP(108–144) fragment, which is one of the most highly amyloidogenic peptides (Ma and Nussinov 2002) and constitutes a part of PrP(90–145), the shortest fragment sufficient for prion infectivity. It has been shown that single-step amino acid replacements in this PrP segment tend to increase its β-sheet propensity (Kuznetsov et al. 1997).
There is no evidence of any covalent modification that would distinguish PrPSc from PrPC. However, a possibility that a small ligand bound to PrP may be an essential component of the infectious particles has not yet been completely eliminated. It is also believed that a species-specific cofactor, named protein X, is required for conformational conversion of PrPC to PrPSc (Kaneko et al. 1997). A number of mutations that promote the disease have been identified. A significant proportion of these mutations are found in hypermutable CpG dinucleotides within the structured C-terminal domain (for a recent compilation, see Kovacs et al. 2002). Whether the observed clustering of these mutations is determined by the fact that amino acid replacements in certain PrP regions are more likely to cause conformational transition, or mainly by the presence of DNA mutational hot spots, is unknown. How these mutations lead to the disease and whether they share any common features is also unknown.
Progress in the field of prion diseases depends on understanding which parts of PrPC undergo conformational transitions, and how mutations affect these transitions. Answers to both of these questions remain elusive. Extensive experimental studies were carried out to determine putative segments of the prion proteins that can adopt a β-sheet conformation. Most of the fragments corresponding to the elements of regular secondary structure were shown to have a high β-sheet propensity and to form β-sheet-like aggregates (except for helix A) (Nguen et al. 1995; Zhang et al. 1995; Inouye and Kirschner 1998; Viles et al. 2001; Jamin et al. 2002). However, experimental studies have not been put into a reference framework by comparison with the corresponding properties of an average peptide. Most short peptides have a high tendency to aggregate into β-sheets in solution. A comprehensive experimental study of structural propensities of many unrelated peptides is time consuming and expensive. Experimental determination of common features of disease-promoting mutations by means of site-directed mutagenesis is also very costly. An initial sense of direction for such experimental efforts can be provided by computational analysis.
In this work, we present a computational approach designed to detect (1) sequence fragments with unusual structural propensities and high conformational variability, and (2) patterns in mutational data. We use this approach to study PrP sequence and mutation data. We ask the following questions:
Do PrP sequences possess any unique sequence or structural properties, conformational variability in particular, that distinguish them from other proteins?
What parts of the prion protein are likely to undergo refolding?
Are there any common features shared by the majority of prion disease-associated mutations?
Is the observed distribution of these mutations along the sequence mainly determined by the presence of DNA mutational hot spots or by the fact that substitutions in certain parts of PrP are more likely to induce conformational transition?
This work consists of two major parts:
| Results |
|---|
α-helical and β-sheet conformation in two distinct proteins) to identify potential chameleon segments in the prion protein by finding exact matches between the data set and PrP sequences. Overlapping or immediately adjacent chameleon k-mers found in PrP sequence are merged into longer fragments. We show that the most conserved part of PrP sequences in all species contains an unusually long chameleon fragment located in an unusually conformationally flexible sequence context.
All chameleon k-mers from our data set found in PrP and Doppel sequences are listed in Tables 1 and 2. The representative sequences from a PrP multiple-sequence alignment used to map chameleon k-mers on human PrP are shown in Figure 1. The Doppel multiple-sequence alignment is shown in Figure 2. Figure 3 shows the chameleon segments in human PrP reconstructed using multiple-sequence alignment. The first striking observation is that the part of the prion sequence, PrP(114–125), which is conserved across all species studied, is a chameleon fragment of length 12, GAAAAGAVVGGL. This fragment is obtained by matching chameleon pentamers that have two or three overlapping residues on both ends. Because each pentamer significantly overlaps with its neighbors and exists as a part of an α-helix and a β-strand in two distinct proteins, the 12-mer formed by these pentamers is a so-called structurally ambivalent sequence fragment, and can adopt either an α-helical or a β-sheet conformation depending on its environment (Young et al. 1999). Two of the three experimentally determined PrP helices contain chameleon k-mers; helix B contains a chameleon 6-mer (27% of the total helix length) and helix C contains a pentamer (18% of the total helix length). We did not detect any chameleon sequences in helix A. In contrast to PrP sequences, in Doppel, we identified six nonoverlapping chameleon pentamers evenly distributed along the sequence, including one in helix A. All of the chameleon pentamers in Doppel are different from those observed in PrP. The results for PrP cannot directly be compared with those for Doppel for two reasons: First, the Doppel multiple alignment consists of only four sequences, whereas the PrP alignment involves 57 sequences. Second, because we consider exact matches, we can identify only chameleon k-mers observed in the PDB, and a difference in a single position can make a k-mer undetectable. Nevertheless, the pattern of chameleon fragments observed in the Doppel sequences is completely different from that in PrP, which suggests that these two proteins have different conformational propensities.
Our data set of chameleon k-mers contains only one fragment with zero complexity (composed of a single amino acid): the pentamer AAAAA. No other chameleon k-mers with zero complexity were found in the PDB. This fragment will match all long poly-Ala runs, which are abundant in eukaryotes (Liu et al. 2002). We compared the amino acid content in the flanking regions of chameleon and poly-Ala sequences of length >10 with that of the SWISS-PROT database. The results are shown in Figure 4. The flanks of long chameleon fragments have an amino acid composition similar to the average composition of the SWISS-PROT database, with a very small excess of Ala, Gly, and Pro, whereas long poly-Ala fragments have a large excess of Ala, Pro, Gly, His, Gln, and Ser in their flanks. This indicates that long poly-Ala fragments are located in unusual sequence contexts that combine amino acids with very low (Pro) and high (Gly, His, Ser) conformational variability, and should be considered separately. Because long poly-Ala homopolymers also have some unique properties, such as a high tendency to form disordered intermolecular aggregates, we excluded the pentamer AAAAA from the data set of chameleon k-mers.
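The flank comparison described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to produce Figure 4: the flank width, the pseudocount, the uniform background frequencies, and the function names `flank_counts` and `log2_enrichment` are assumptions introduced here. A real analysis would use the SWISS-PROT composition as the background.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def flank_counts(sequence, start, end, flank=10):
    """Count residues flanking the fragment sequence[start:end].

    The flank width is a hypothetical parameter; flanks are truncated
    at the sequence ends.
    """
    left = sequence[max(0, start - flank):start]
    right = sequence[end:end + flank]
    return Counter(left + right)

def log2_enrichment(counts, background, pseudocount=1.0):
    """log2 ratio of the smoothed flank frequency to the background frequency."""
    total = sum(counts.values()) + pseudocount * len(background)
    return {
        aa: math.log2(((counts.get(aa, 0) + pseudocount) / total) / freq)
        for aa, freq in background.items()
    }

# Toy input: an 11-residue poly-Ala run flanked by Pro and Gly (made up).
toy_sequence = "PPPP" + "A" * 11 + "GGGG"
# Placeholder background: uniform frequencies, NOT the SWISS-PROT composition.
uniform_background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

counts = flank_counts(toy_sequence, 4, 15, flank=4)
enrichment = log2_enrichment(counts, uniform_background)
```

Positive values in `enrichment` mark residues that are over-represented in the flanks relative to the chosen background, and negative values mark under-represented residues.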
α-helical conformation, chameleon fragments require a sequence context with strong α-helical propensity (Baldwin and Rose 1999; Kuznetsov and Rackovsky 2003a). To be fixed in a β-sheet conformation, a chameleon fragment must form long-range hydrogen bonds with another β-strand. Work of Minor Jr. and Kim (1996) also demonstrated that when a chameleon 11-mer is placed in an α-helical context, it adopts an α-helical conformation, whereas when placed in a β-sheet context, it adopts a β-sheet conformation. In PrPC, neither of these requirements is fulfilled, and the chameleon 12-mer, located in a sequence context with unusually high conformational variability, remains very flexible, as has been shown by NMR (Lopez-Garcia et al. 2000). Another important factor that affects the preference of chameleon fragments for an α-helical or a β-sheet conformation is the solvent accessibility of the fragments themselves and their flanking regions (Kuznetsov and Rackovsky 2003a). Chameleon fragments in a β-sheet conformation tend to be less accessible and their flanks are more accessible than those in an α-helical conformation. Because the N-terminal part of PrP is highly flexible, it is very unlikely that the additional requirement of being a part of regular secondary structure is fulfilled, either.
β-sheets and exceptionally strong intermolecular aggregates (Wilson et al. 2000). The repeated pattern GAAAAG in the circumsporozoite protein forms an epitope with extremely high affinity to immunoglobulins (McCutchan et al. 1996). In all other sequences, GAAAA occurs in three copies or fewer, and those with three copies are all DNA-binding transcription factors. GAAAAG in all other sequences occurs only in one copy (except for spider silk, where it occurs twice). These findings suggest that the sequence pattern observed in PrP(114–125) possesses two special properties: it is a chameleon fragment that can adopt a β-sheet conformation upon interaction with another β-strand, and it occurs in proteins that have a very high binding capability and form intermolecular aggregates.
Structural propensities of PrP and Doppel
We have shown that PrP(114–125) is a chameleon segment located in a sequence context with high conformational variability. The next logical step is to analyze structural propensities of the three helices from the structured C-terminal domain of PrP. A helix with very low α-helical and high β-sheet propensity would be a potential candidate for unfolding during the PrPC to PrPSc transition. In this section, we study and compare the structural propensities of PrP and Doppel sequences, and show that PrP helix B has an unusually low α-helical and unusually high β-sheet propensity.
First, we compare the generalized local propensity profiles of human and mouse PrP and Doppel (two species for which both PrP and Doppel structures are known). This comparison shows that the conformational variability of the segments that correspond to the loop connecting helices B and C, and to β-strands S1 and S2, differs significantly between these two proteins, although they share the same fold (Figs. 5, 6). The B-C loop and the strand S1 and its local sequence context in PrP are very flexible, whereas strand S2 is very flexible in Doppel. Comparison of the PrP helices to all helices observed in the PDB_SELECT data set shows that only helix B has unusually low α-helical propensity and unusually high β-sheet propensity (Table 3, Fig. 7). It is noteworthy that, in contrast to PrP, Doppel helix B has normal average α-helical and β-sheet propensity (Fig. 8). Another indirect way to analyze the local preferences of sequence fragments determined by their local sequence context is to use secondary structure-prediction algorithms. We applied four different methods of secondary structure prediction to PrP and Doppel sequences. Each of these methods uses a different prediction algorithm as follows:
β-sheets.
All four methods identify helices A and C in PrP and predict an extended conformation for helix B. Moreover, the PHD program predicts extended conformation in residues corresponding to helix B with a high degree of confidence. All methods also predict a helical conformation around PrP(110–120), where the chameleon 12-mer was detected. In contrast to PrP, for Doppel helix A, GOR-IV and PREDATOR predict an extended conformation, whereas DSC and PHD predict a helical conformation for the last turn of this helix only. All methods identify Doppel helices B and C. Because secondary structure-prediction methods compute an intrinsic local propensity of sequence fragments, rather than the actual conformation (Cordier-Ochsenbein et al. 1998), the results obtained using four different prediction methods provide additional evidence for the conclusion that helix B has an unusually low α-helical and an unusually high β-sheet propensity.
Analysis of disease-promoting PrP mutations
In this section, we study known PrP mutations (Table 4) that have been shown to promote the conformational transition of the cellular PrPC to pathogenic PrPSc and lead to the onset of the prion disease. We wish to know whether the observed distribution of these mutations along the sequence is significantly different from that expected from the background distribution of DNA mutational hot spots in the PrP gene and whether these mutations share any underlying common features.
B), alters the properties of the polypeptide in a way that facilitates conformational transition. We wish to find out whether known mutations result in an increase or decrease in one or more amino acid properties. We use a data set of five physicochemical properties: α-helical propensity, β-sheet propensity, hydrophobicity, residue volume, and generalized local propensity (GLP). For each of the 20 known disease-promoting mutations in human PrP, we compute, using each property from this data set separately, the difference between mutant and wild-type amino acids (Table 4).
β-sheet propensity, and especially hydrophobicity is considerably larger than 50%. Application of our method for estimating the significance of the observed number of mutations that increase a particular physicochemical property (see Materials and Methods for details) shows that, given the codon usage in human PrP, it is not unusual to observe 14 or more random single-step amino acid replacements that increase β-sheet propensity (P > 0.05). No bias is observed among disease-promoting mutations with regard to α-helical propensity or GLP. An increase in the number of mutations that change a smaller amino acid to a larger one (a total of 14 mutations that increase volume) is only marginally significant, as the P-value does not pass the significance threshold obtained using the Bonferroni correction for multiple testing (for five independent tests this threshold is 0.05/5 = 0.01). Hydrophobicity is the only property for which a significant over-representation of (+) mutations is observed (17 of 20 mutations increase hydrophobicity; see Fig. 3
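The Bonferroni arithmetic above, together with a one-sided tail probability, can be sketched as follows. Note the assumption: the null model here is a plain binomial with a placeholder probability `p_increase = 0.5` that a random replacement increases the property. The significance estimates in the text are instead derived from codon usage in human PrP (see Materials and Methods), so the values produced by this sketch are illustrative and are not expected to match the reported P-values. The function names are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p):
    """One-sided probability of observing k or more successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for n_tests independent tests."""
    return alpha / n_tests

# Five properties are tested, as in the text: 0.05 / 5 = 0.01.
threshold = bonferroni_threshold(0.05, 5)

# Placeholder null: each replacement increases the property with probability 0.5.
# This is an assumption for illustration, not the codon-usage-based null model.
p_increase = 0.5
p_17_of_20 = binomial_tail(17, 20, p_increase)
p_14_of_20 = binomial_tail(14, 20, p_increase)

for label, p_value in [("17 of 20", p_17_of_20), ("14 of 20", p_14_of_20)]:
    verdict = "passes" if p_value < threshold else "does not pass"
    print(f"{label}: P = {p_value:.4g}, {verdict} the corrected threshold {threshold}")
```

Under this placeholder null, 17 or more increases out of 20 fall below the corrected threshold, whereas 14 or more do not. Swapping in a background probability estimated from codon usage would change both tail probabilities.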
| Discussion |
|---|
α-helical and a β-sheet conformation upon changes in its environment. Analysis of the similarity between PrP(114–125) and database sequences shows that this fragment contains amino acid fragments used in other proteins involved in the formation of intermolecular complexes and, therefore, may possess high binding potential. These results are in excellent agreement with experimental data which showed that peptides corresponding to the most conserved part of PrP, which contains the chameleon fragment 114–125, can adopt both an α-helical and a β-sheet conformation (Nguen et al. 1995; Zhang et al. 1995). Experiments also showed that the 109–122 peptide, which is thought to play a crucial role in PrPC → PrPSc conversion, is highly amyloidogenic (Nguen et al. 1995; Zhang et al. 1995; Jobling et al. 1999). Additionally, PrP(114–125) is very hydrophobic and, therefore, may provide a potential hydrophobic oligomerization site. Because this hydrophobic chameleon fragment constitutes the N-terminal flank of β-strand 128–131, one may speculate that it can be nucleated by this β-strand and fixed in an extended conformation by minimal additional external interactions. The absence of termination signals in the form of Pro residues in the N-terminal flank of β-strand 128–131 can facilitate the nucleation.
The other PrP fragment with unusual properties is helix B. The unusually low α-helical propensity and unusually high β-sheet propensity of this sequence segment are evidenced by application of both our method and secondary structure prediction methods. Although these two lines of evidence suggest a high β-sheet propensity for helix B, it somehow manages to maintain a stable helical conformation in PrPC. This puzzling phenomenon may be partially explained by the stabilizing effect of the interhelix disulfide bond that links helices B and C. It is believed that the C-terminal part of PrPC interacts with a hypothetical protein X that promotes PrPC → PrPSc conversion (Kaneko et al. 1997). One may assume that if the disulfide bond between helices B and C is reduced, either under unusual physiological conditions, such as very low pH in lysosomes, or assisted by protein X, this removes conformational constraints imposed on helix B, which has an unusually high propensity for β-sheet conformation. The relaxation of these structural constraints may promote partial unfolding of helix B. Unfolding may take place at the C terminus of helix B, which, according to our data, has high conformational variability (Fig. 5). Remarkably, in Doppel protein, which does not undergo structural rearrangements, helix B does not have high β-sheet propensity or high conformational variability. Doppel has no unusually long chameleon fragments, either. Thus, of these two topologically identical proteins, only PrP, a protein that can exist in different conformations, has sequence fragments with unusual properties. This provides additional support for a special role for PrP(114–125) and helix B during the conformational transition in prion protein.
However, it is generally believed that it is PrP helix A that undergoes major structural rearrangement upon transition from PrPC to PrPSc (Huang et al. 1995). Recent experimental data obtained using low-resolution electron crystallography suggest that the fragment incorporating helix A in PrPSc refolds into a left-handed β-helix (Wille et al. 2002). Helix A has also been shown to possess certain unique features. It is the most hydrophilic helix observed in the PDB, entirely stabilized by intrahelical interactions (Morrisey and Shakhnovich 1999). These intrahelical interactions have been shown to be involved in conformational transition (Speare et al. 2003). In the three-dimensional structure of PrPC, this helix does not form any interactions with the rest of the C-terminal domain. Helix A is also the most conserved helix in PrP sequences. These results provide a basis for a model of the PrPC → PrPSc transition, in which helix A serves as a starting point for conformational transition and forms a β-like aggregate, whereas helices B and C retain their conformation (Huang et al. 1995; Morrisey and Shakhnovich 1999; Wille et al. 2002). Recently, however, two independent groups have experimentally shown that helix A possesses a remarkably high α-helical propensity and retains helical conformation under a wide range of denaturing conditions (Liu et al. 1999; Ziegler et al. 2003). Our data also show that helix A has low propensity for β-sheet conformation. It was also demonstrated that deleting helix A does not abrogate prion infectivity (Prusiner 1998). To reconcile the remarkably high stability of helix A against environmental changes with experimental evidence of β-like structure observed in the PrPSc segment corresponding to helix A, it has been proposed that this helix unfolds in the late stage of the structural transition under the influence of global conformational rearrangements occurring in other parts of the prion protein (Ziegler et al. 2003).
The unusually high density of disease-promoting mutations in helices B and C also points to the particular importance of these helices for conformational transition. Because helix B has a strong propensity for the extended conformation, it is reasonable to assume that a single amino acid replacement in the vicinity of this helix may significantly affect the conformational preference of the entire B-C segment and further increase the propensity for the extended conformation, facilitating conformational rearrangement in this region. The assumption that the segment comprising the C terminus of helix B and the adjacent loop may partially unfold and represents a potential oligomerization site is further supported by crystallographic data, which show that PrPC can form a dimer in which two helices A are at the dimer interface and retain their conformation. On the other hand, helices B and C in the dimer undergo significant rearrangements; helix C swings out across the dimer interface and packs against helix B in the other monomer, the intramolecular disulfide bond between helices B and C becomes an intermolecular bond, the last turn of helix B unwinds, and the B-C connecting loop in the two monomers forms an intermolecular β-sheet (Knaus et al. 2001).
We have shown that disease-promoting mutations have a statistically significant tendency to cause an increase in local hydrophobicity. Hydrophobicity is the only property that demonstrates a consistent trend. Hydrophobic interactions bring fragments of polypeptide chain in close proximity to each other and play an important role in formation of β-sheets (Barrow et al. 1992). Thus, the increase in hydrophobicity caused by a point mutation may facilitate aggregation of prion monomers and the formation of intra- or intermolecular β-sheet-like structures. This assumption is supported by experimental data that have shown that a decrease in hydrophobicity of the PrP106–126 peptide is associated with a marked reduction in its neurotoxicity and β-sheet structure (Jobling et al. 1999). Similar results were obtained for the amyloid β-peptide of Alzheimer's disease (Hilbich et al. 1992). An increase in local hydrophobicity may have an especially profound effect on the stability of the helical bundle formed by helices B and C, in which most of the disease-promoting mutations are clustered. Coupled with an acidic environment that reduces the disulfide bond connecting helices B and C, such an increase may promote separation of these helices. It may also affect the interactions between prion protein and the hypothetical protein X. This finding suggests a direction for development of antiprion drugs capable of blocking hydrophobic interactions between prion monomers. However, it should be remarked that, despite a general statistically significant trend, three of 20 disease-promoting mutations actually decrease hydrophobicity, which indicates that increased hydrophobicity is not the only driving force for the conformational transition, and that other factors should be considered. Whether the disulfide bond observed in PrPC remains intact during the conformational transition in vivo is also uncertain. One line of experimental evidence suggests that reduction of the intramolecular disulfide bond in PrPC induces an in vitro transition to a highly soluble protein rich in β-sheet (Jackson et al. 1999), whereas the other suggests that in vitro transition from PrPC to a protease-resistant scrapie-like conformation can occur without disulfide exchange (Welker et al. 2002).
The results of this work can be summarized as follows:
| Materials and methods |
|---|
α-helix-like (i,i+3) hydrogen-bonding pattern is four residues. However, the same pattern is observed in type I turns, in which only two central residues have α-helical conformation. Five residues represent a minimal closed helical turn that allows two intrafragment (i,i+3) and (i+1,i+4) hydrogen bonds between the backbone atoms and (i,i±3,4) side-chain–side-chain interactions. Fragments of length five to six are also well represented in the PDB and cover most conformations accessible to a given fragment (Fidelis et al. 1994). We therefore use chameleon k-mers of size five or greater and identify potential long chameleon fragments in a query sequence by finding matches between known chameleon k-mers from our data set and the sequence under study. We consider exact matches only. Overlapping or immediately adjacent chameleon k-mers found in the query sequence are merged into longer fragments.
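The exact-match-and-merge step described above can be sketched as a short Python function. This is a minimal illustration under stated assumptions, not the code used in this work: the function name `find_chameleon_fragments` is hypothetical, the four pentamers in `TOY_KMERS` are invented so that they tile the 12-mer GAAAAGAVVGGL quoted in the Results (the actual data-set entries are listed in Table 1), and the residues flanking the 12-mer in `TOY_QUERY` are included for illustration only.

```python
def find_chameleon_fragments(sequence, kmers):
    """Find exact k-mer matches and merge overlapping or adjacent ones.

    Returns a list of (start, end, fragment) tuples with 0-based,
    half-open coordinates into `sequence`.
    """
    hits = []
    for kmer in kmers:
        start = sequence.find(kmer)
        while start != -1:
            hits.append((start, start + len(kmer)))
            # Step by one residue so that overlapping occurrences are found.
            start = sequence.find(kmer, start + 1)
    hits.sort()

    merged = []
    for start, end in hits:
        # start <= previous end covers both overlap and immediate adjacency.
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(s, e, sequence[s:e]) for s, e in merged]

# Hypothetical pentamers chosen to tile the 12-mer; not the actual data set.
TOY_KMERS = {"GAAAA", "AAAGA", "GAVVG", "VVGGL"}
# Toy query: the 12-mer from the Results with illustrative flanking residues.
TOY_QUERY = "MKHMA" + "GAAAAGAVVGGL" + "GGYML"

fragments = find_chameleon_fragments(TOY_QUERY, TOY_KMERS)
print(fragments)
```

Because only exact matches are counted, a single substitution inside any of the pentamers would split or remove the merged fragment, which is the sensitivity to single positions noted in the Results.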
Theoretically, potential chameleon sequences may also be identified by selecting successive residues that have a high propensity for both α-helical and β-sheet conformations, or by applying a pattern-recognition method trained on the data set of known chameleon k-mers. However, as each residue in regular secondary structure is involved in an intricate network of cooperative interactions with local (α-helix) and long-range (β-sheet) neighbors, a chameleon fragment must be able to participate in such interactions in both helical and sheet conformations. A single wrong residue may destroy the chameleon properties of a fragment by making the side-chain interactions unfavorable in one or both conformations. For this reason, we use only chameleon k-mers previously identified from the PDB data to reconstruct sequence segments with chameleon properties. Because all k-mers in our data set are experimentally known to exist in both types of regular secondary structure, potential adverse steric effects are minimized. Moreover, it has been shown that chameleon fragments have low sequence complexity and are composed of a limited number of residue combinations, and a majority of long fragments can be identified by finding exact matches between query sequence and the data set of short chameleon pentamers (Kuznetsov and Rackovsky 2003a). Chameleon fragments found in a set of closely related homologous sequences can be mapped onto a particular sequence from this set by means of a multiple-sequence alignment. This allows one to maximize the total number of chameleon fragments detected in this sequence. The use of a multiple-sequence alignment for determining sequence/structure correlations is a general approach used to maximize the amount of information extracted from the sequences (Cuff and Barton 2000).
A total of 57 mammalian PrP and 4 Doppel sequences were retrieved from the SWISS-PROT and TrEMBL databases (Bairoch and Apweiler 2000). We searched for all matches between chameleon k-mers from our data set and all PrP and Doppel sequences. A multiple-sequence alignment of mammalian prion protein sequences (Kuznetsov and Rackovsky 2003b) was used to map all matches onto human PrP. Avian PrP sequences were excluded, as the structure of the prion protein in these species is not known, and a low degree of sequence similarity with mammalian sequences (35%) and the presence of gaps do not allow an unambiguous mapping of the elements of secondary structure. Matches in the multiple alignment of Doppel sequences were mapped onto mouse Doppel sequence for which an NMR structure is known. Multiple alignments were obtained using the PILEUP program from the GCG package, version 10.0 (Accelrys, Inc.; http://www.accelrys.com/bio) using the BLOSUM50 similarity matrix (Henikoff and Henikoff 1992), a gap initiation penalty of 16 and a gap extension penalty of 4. A nonredundant SWISS-PROT database (release 40.19) clustered at 90% sequence identity (meaning that all sequences have pairwise sequence similarity below 90%) according to the method of Holm and Sander (1998), was utilized to analyze the frequency of occurrence of potential chameleon segments as a function of fragment length.
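Mapping a match found in one homolog onto the reference sequence through the alignment amounts to a coordinate conversion via alignment columns. The sketch below illustrates that idea only; it assumes gapped rows with `-` as the gap character, the function names are hypothetical, and the two aligned rows are toy strings rather than rows from the PILEUP alignment.

```python
def residue_columns(aligned_row, gap="-"):
    """Alignment column of every residue in a gapped row, in sequence order."""
    return [col for col, char in enumerate(aligned_row) if char != gap]

def map_onto_reference(aligned_query, aligned_reference, start, end):
    """Map the query fragment [start, end) onto ungapped reference coordinates.

    Coordinates are 0-based and half-open. Columns where the reference has
    a gap are skipped; returns None if nothing maps.
    """
    query_cols = residue_columns(aligned_query)
    ref_index_of_col = {
        col: index for index, col in enumerate(residue_columns(aligned_reference))
    }
    mapped = [
        ref_index_of_col[col]
        for col in query_cols[start:end]
        if col in ref_index_of_col
    ]
    if not mapped:
        return None
    return mapped[0], mapped[-1] + 1

# Toy alignment: the query has one extra residue relative to the reference.
aligned_reference = "MKH-MAGAAAAG"
aligned_query = "MKHSMAGAAAAG"

# A match to GAAAA at positions 6..10 of the ungapped query.
mapped = map_onto_reference(aligned_query, aligned_reference, 6, 11)
reference = aligned_reference.replace("-", "")
print(mapped, reference[mapped[0]:mapped[1]])
```

A mapped interval can then be merged with matches found directly in the reference sequence, which is how the use of many homologs increases the number of fragments detected.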
Structural propensities of sequence fragments
The degree of context-dependent local backbone variability of a sequence fragment was determined using a modified generalized local propensity, GLP. A detailed description of the original methods is provided elsewhere (Kuznetsov and Rackovsky 2003a). Briefly, for each amino acid, X, in a tripeptide, iXj, this index measures the overall context-dependent breadth of the distribution of accessible backbone conformations, glp(iXj):
$$\mathrm{glp}(iXj) = \frac{Q(iXj)}{Q_R(N_{iXj})} \tag{1}$$
where Q(iXj) is the observed Shannon entropy of the tripeptide-specific distribution of backbone conformations observed in a nonredundant data set of known protein structures, and QR(NiXj) is the average entropy of a distribution of NiXj tripeptides randomly sampled without replacement from this data set. The version of GLP used in this work differs from the original method in that, here, we normalize GLP by using a ratio of the observed and random entropies, rather than taking a difference between them. The central residue, X, is represented in the full 20-letter alphabet, whereas the flanking residues i and j are collapsed into three groups based on side-chain properties: 1, Gly; 2, Pro; 3, the 18 other amino acids (Solis and Rackovsky 2000). A value of glp(iXj) < 1.0 indicates that the average entropy of the random distribution is greater than that observed for a given tripeptide iXj and implies that the tripeptide is preferably observed in defined areas of the Ramachandran plot. A value > 1.0 indicates that the given tripeptide has conformational variability higher than the average. For any type of amino acid substitution iXj → iYj (residue X changing to Y in a sequence context defined by the neighboring residue types i and j), this method also allows one to compute the context-dependent expected change in the backbone conformational variability, Δglp(iXj → iYj):
$$\Delta\mathrm{glp}(iXj \rightarrow iYj) = \mathrm{glp}(iYj) - \mathrm{glp}(iXj) \tag{2}$$
Positive values of Δglp correspond to an increase in the local backbone variability resulting from the substitution, whereas negative values indicate that the substitution decreases the range of accessible torsion angles.
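As an illustration of the ratio normalization, the following minimal sketch computes an entropy ratio for one tripeptide type. It assumes that backbone conformations have already been discretized into labeled states (for example, regions of the Ramachandran plot). That discretization, the function names, and the number of random draws are assumptions made for the example and are not specified in the text.

```python
import math
import random
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (natural log) of a list of discrete state labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def glp_ratio(tripeptide_states, all_states, n_draws=1000, seed=0):
    """Observed entropy divided by the mean entropy of random samples.

    tripeptide_states: states observed for one tripeptide type.
    all_states: states of all tripeptides in the reference data set.
    Each random sample has the same size as tripeptide_states and is
    drawn without replacement. The mean random entropy must be nonzero.
    """
    rng = random.Random(seed)
    n = len(tripeptide_states)
    observed = shannon_entropy(tripeptide_states)
    random_mean = sum(
        shannon_entropy(rng.sample(all_states, n)) for _ in range(n_draws)
    ) / n_draws
    return observed / random_mean
```

A tripeptide seen in a single state gives a ratio of 0, well below 1.0, whereas a tripeptide spread evenly over the available states gives a ratio of at least 1.0.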
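The average generalized local propensity of a sequence fragment S of length L, GLP(S), was computed by summing the glp values over all residues Am, each taken in the context of its flanking residue groups im and jm, and dividing by the fragment length:
$$\mathrm{GLP}(S) = \frac{1}{L}\sum_{m=1}^{L} \mathrm{glp}(i_m A_m j_m) \tag{3}$$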
For the N-terminal or the C-terminal residue, X, of each fragment we used the values of the GLP in a 3-X-j or i-X-3 tripeptide, respectively; that is, the missing flanking residue was assigned to group 3. The α-helical and β-sheet propensities of a sequence fragment, Prk(S), were computed in a similar fashion:
$$Pr_k(S) = \frac{1}{L}\sum_{m=1}^{L} Pr_k(A_m) \tag{4}$$
where L is the fragment length, Am is the amino acid in sequence position m, and Prk(A) is the intrinsic propensity of type k (α-helical or β-sheet) of amino acid A.
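The fragment average can be sketched in a few lines. The propensity values used in the usage note are placeholders chosen only to make the arithmetic easy to follow. They are not taken from Kallberg et al. (2001) or Swindells et al. (1995), and the function name is hypothetical.

```python
def mean_propensity(fragment, scale):
    """Average per-residue propensity over a fragment.

    fragment: amino acid sequence as a string of one-letter codes.
    scale: dict mapping one-letter code to a propensity value.
    """
    if not fragment:
        raise ValueError("fragment must contain at least one residue")
    return sum(scale[aa] for aa in fragment) / len(fragment)
```

With the placeholder scale `{"A": 1.4, "G": 0.5, "V": 0.9}`, the fragment `"AGV"` has an average of (1.4 + 0.5 + 0.9) / 3, or about 0.93.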
An intrinsic structural propensity of an amino acid for a particular type of secondary structure represents a normalized index that measures the strength of the intrinsic preference of this amino acid for this type of secondary structure. A propensity >1.0 means that the given amino acid has an intrinsic preference for the given secondary structure, whereas a propensity below 1.0 means that the amino acid avoids this particular type of structure. We used conventional α-helical and β-sheet propensities of amino acids to determine the average structural propensity of a sequence fragment. The intrinsic structural propensity of an individual amino acid type provides experimental information about its conformational preferences, averaged over all possible types of sequence context. We used both the most recent amino acid propensities derived from a large nonredundant data set of 1091 protein structures (Kallberg et al. 2001) and an earlier propensity scale derived from a much smaller data set of 85 protein structures (Swindells et al. 1995). We will refer to fragments that have GLP higher than average as the fragments with high conformational variability.
Identification of fragments with unusual structural propensities
When the structural propensity of a particular sequence fragment is computed, the statistical significance of the result must be established. To do so, we compare the fragment-specific propensity with a distribution of propensities computed for all fragments with similar length and the same structural properties. For instance, structural propensities of PrP and Doppel helices are compared with the distributions of propensities of all helices of length 10 or longer observed in the nonredundant representative data set of high-resolution X-ray structures. We used the September 2001 release of the PDB_SELECT_25 data set (Hobohm et al. 1993), with resolution ≤ 2.0 Å and R-factor ≤ 0.2, including a total of 471 proteins without chain breaks. For each fragment, i, and propensity, j, we can compute the z-score, Z(i,j):
$$Z(i,j) = \frac{Y(i,j) - Y(j)}{S(j)} \tag{5}$$
where Y(i,j) is the fragment-specific propensity, and Y(j) and S(j) are the average and the standard deviation of propensity j computed over all fragments in the data set. The z-score for each fragment is easily converted into the two-tailed probability of observing a z-score of the same magnitude or greater by chance, P(z ≥ |Z|), using the standard normal distribution:
$$P(z \ge |Z|) = 1 - \frac{1}{\sqrt{2\pi}} \int_{-|Z|}^{|Z|} e^{-x^{2}/2}\, dx \tag{6}$$
A z-score with absolute value >1.96 indicates that the corresponding fragment has an unusual structural propensity (two-tailed P < 0.05).
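The z-score and its two-tailed probability can be computed as in the following minimal sketch. It uses the complementary error function from the Python standard library to evaluate the standard normal tail. The function names are hypothetical, and the use of the sample standard deviation is an assumption, as the text does not say whether the sample or population form was used.

```python
import math
import statistics


def z_score(value, reference_values):
    """z-score of one fragment propensity against a reference distribution."""
    mean = statistics.mean(reference_values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(reference_values)
    return (value - mean) / sd


def two_tailed_p(z):
    """Probability of a standard normal deviate with magnitude >= |z|."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

Calling `two_tailed_p(1.96)` returns a value just below 0.05, which matches the significance threshold used above.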
Assignment of helices and strands for human PrP was taken from Zahn et al. (2000), for human Doppel from Luhrs et al. (2003), and for mouse PrP and Doppel from Mo et al. (2001). When we analyze the local sequence context of a short sequence fragment, we look at eight adjacent residues on each side of the fragment. Flanking regions of this length were shown to be the longest that retain statistically significant differences in local propensity between two alternative conformations of the same chameleon k-mer (Kuznetsov and Rackovsky 2003a).
Identification of unusual patterns of amino acid substitutions
Clusters of mutations
We used a data set of 20 confirmed missense mutations in mature human PrP linked to hereditary forms of prion diseases (Kovacs et al. 2002). All mutations occur in different codons. We assume that all mutations in our data set were sampled uniformly from the total pool of possible mutations that promote conformational transition. We find statistically significant clusters of mutations by scanning a sequence with a sliding window of fixed size, w. For a window beginning at sequence position k, we define the number of mutations observed in this window, M(k,w). For a sequence of length L, we denote the maximum number of mutations observed in a window of size w as Sw:
$$S_w = \max_{1 \le k \le L-w+1} M(k,w) \tag{7}$$
The quantity Sw is called the scan statistic. If, for some window of size w beginning at the sequence position k, the value of Sw is unusually high, meaning that this or a larger value of Sw is unlikely to be observed by pure chance for a given sequence and given total number of mutations, we can say that the mutations found within this window form a statistically significant cluster. To judge how significant a particular value of Sw is, we need to know the distribution of Sw as a function of the sequence length, L, and the total number of mutations, n. This distribution can be computed for certain cases of a simple probability model in which all types of amino acid mutations are equally likely (Glaz et al. 2001). However, in the real case of single-step amino acid replacements (caused by a single-nucleotide mutation) in protein-coding genes, different codons have unequal probabilities of nonsynonymous substitutions and different types of nucleotides have unequal mutation rates. The difference in mutation rates is especially large in CpG dinucleotides, which serve as mutation hot spots (Bulmer 1986). Many known PrP mutations are caused by mutations in hypermutable CpG-containing codons, thus reflecting nonuniformity in causative mutation distribution. We therefore need to separate the contribution that arises from the intensity of mutation processes at the DNA level, which underlies the observed pattern of amino acid substitutions in all protein-coding genes, from the contribution that may arise because, for structural reasons, prion diseases are caused by amino acid replacements that occur preferentially at certain positions within the PrP sequence. This will allow us to identify regions of the PrP sequence with an unusually high density of mutations associated with a conformational transition.
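The scan statistic itself is straightforward to compute. The sketch below assumes 1-based sequence positions, at most one mutation per position (consistent with all mutations occurring in different codons), and a window size no larger than the sequence length. The function name and the positions in the usage note are illustrative only.

```python
def scan_statistic(mutation_positions, seq_length, w):
    """Maximum number of mutated positions in any window of size w.

    mutation_positions: iterable of 1-based sequence positions.
    Windows start at k = 1 .. seq_length - w + 1 and cover k .. k + w - 1.
    """
    mutated = set(mutation_positions)
    best = 0
    for k in range(1, seq_length - w + 2):
        count = sum(1 for pos in range(k, k + w) if pos in mutated)
        best = max(best, count)
    return best
```

For illustrative positions 3, 5, 6, and 20 in a sequence of length 30, a window of size 4 starting at position 3 contains three mutations, so `scan_statistic([3, 5, 6, 20], 30, 4)` returns 3.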
For a protein-coding DNA sequence, C, the P-value for any given value, m, of the scan statistic, Sw, is computed as the cumulative probability of observing Sw equal to or greater than m, P(Sw ≥ m|w, n, C). We estimate this probability using a computer simulation by taking the cDNA sequence of the protein of interest, C, and making in this sequence n single-step nonsynonymous substitutions caused by a nonuniform random process. By repeating this procedure Nr times, we obtain Nr random sequences, each of which has n amino acid replacements. We use two slightly different stochastic models designed to account for the nonuniformity of substitution rates at the DNA level:
In the first model, the random process has three parameters: the probability of non-CpG transitions (A/T ↔ G/C), Ptrs; the probability of transitions in CpG dinucleotides (CpG → TpG, CpG → CpA), PCG; and the probability of transversions (A/G ↔ T/C, A/G ↔ C/T), Ptrv. Each random sequence is generated by making n nonsynonymous nucleotide substitutions in the source sequence caused by this nonuniform process.
In the second model, the number of non-CpG transitions, Ntrs, CpG transitions, NCG, and transversions, Ntrv are fixed, and all possible nonsynonymous nucleotide substitutions in the source cDNA sequence are enumerated. Each random sequence is generated by randomly choosing the corresponding fixed number of transversions (Ntrv), non-CpG transitions (Ntrs), and CpG transitions (NCG) from the list of all nonsynonymous substitutions for a given source sequence. In the case of mutations in human PrP, Ntrv = 3, Ntrs = 10 and NCG = 7. Because model 2 is essentially a permutation test, it is the most conservative model that does not make any assumptions about the parameters of the mutational process.
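A simplified sketch of the second model is given below. It assumes that the nonsynonymous substitutions have already been enumerated and reduced to the 1-based codon positions they affect, grouped by substitution class. That preprocessing, the function names, the dictionary keys, and the default number of random sequences are assumptions for the example. The P-value is estimated as the fraction of random sequences whose scan statistic is at least m.

```python
import random


def scan_statistic(positions, seq_length, w):
    """Maximum number of distinct mutated positions in any window of size w."""
    mutated = set(positions)
    return max(
        sum(1 for pos in range(k, k + w) if pos in mutated)
        for k in range(1, seq_length - w + 2)
    )


def estimate_p_value(m, w, seq_length, candidates, counts,
                     n_random=10000, seed=0):
    """Monte Carlo estimate of P(S_w >= m) with fixed class counts.

    candidates: dict mapping class name to a list of codon positions,
        one entry per enumerated nonsynonymous substitution.
    counts: dict mapping class name to the fixed number of
        substitutions to draw from that class.
    Substitutions that hit the same codon are counted once.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_random):
        positions = []
        for cls, n_cls in counts.items():
            # Draw without replacement within each substitution class.
            positions.extend(rng.sample(candidates[cls], n_cls))
        if scan_statistic(positions, seq_length, w) >= m:
            hits += 1
    return hits / n_random
```

For the human PrP data described above, `counts` would hold 3 transversions, 10 non-CpG transitions, and 7 CpG transitions. The candidate lists would come from enumerating all nonsynonymous substitutions in the source cDNA, which is not shown here.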
For both models, P(Sw ≥ m|w, n, C) is estimated using the following equation:
$$P(S_w \ge m \mid w, n, C) = \frac{1}{N_r}\sum_{x \ge m} n(x \mid w, n, C) \tag{8}$$
where n(x|w, n, C) is the number of random sequences obtained using the source sequence C that have Sw equal to x. This cumulative probability depends on the window size, w, the total number of mutations, n, and the length and codon usage of sequence C. We generate 10^7 random sequences (Nr = 10^7), a number for