1 Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA
2 Department of Biochemistry, University of Hong Kong, Hong Kong
3 Department of Biochemistry, Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong
4 School of Molecular Biosciences, Washington State University, Pullman, WA 99164-4630, USA
Reprint requests to: A. Keith Dunker, Center for Computational Biology and Bioinformatics, Indiana University, Indianapolis, IN 46202, USA; e-mail: kedunker{at}iupui.edu; fax: (317) 274-4686.
(RECEIVED April 8, 2003; FINAL REVISION September 12, 2003; ACCEPTED September 12, 2003)
Supplemental material: See www.proteinscience.org.
5 Present addresses: IBEST, Department of Biological Sciences, University of Idaho, Moscow, ID 83844, USA;
6 Concurrent Pharmaceuticals, 502 W. Office Center Drive, Fort Washington, PA 19034, USA;
7 Center for Computational Biology and Bioinformatics, Indiana University, Indianapolis, IN 46202, USA.
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.03128904.
| Abstract |
|---|
Keywords: temperature factor; natively unfolded; intrinsically unstructured; flexibility prediction
| Introduction |
|---|
α-carbon and the B-factor averaged over the four backbone atoms have both been used as measures of residue flexibility of folded proteins (Karplus and Schulz 1985; Vihinen et al. 1994; Kundu et al. 2002). In crystal structures of macromolecules, the B-factor reflects the uncertainty in atom positions in the model and often represents the combined effects of thermal vibrations and static disorder (Rhodes 1993).
B-factors have been studied from a variety of viewpoints. Karplus and Schulz (1985) determined normalized α-carbon B-factors for each amino acid from which flexibility indices were calculated and subsequently used in a sliding-window prediction of the B-factor. Vihinen et al. (1994) and Smith et al. (2003) further developed the method of Karplus and Schulz (1985) and improved the correlation between predicted and experimentally determined B-factors. These flexibility indices do not indicate inherent amino acid plasticity, but rather correlate with the tendency of the side chain to be buried or exposed (Sheriff et al. 1985), which can explain, among other behaviors, the midrange index value for glycine and the high value for proline (Vihinen 1987). Indeed, Halle (2002) showed that the B-factor is inversely proportional to the atomic packing density and argued that little information on polypeptide chains is contained in B-factors apart from the atom coordinates. This theory was supported by Kundu et al. (2002), who achieved significant improvement in predicting experimental B-factors when atomic coordinates were known. Other researchers studied statistical properties of the B-factor (Altman et al. 1994; Wampler 1997) or aspects such as reliability of B-factors (Carugo and Argos 1999), use of B-factors for predicting biologically active sites (Ragone et al. 1989; Carugo and Argos 1998), and use of B-factors for characterizing protein regions (Carugo 2001).
| Intrinsically disordered proteins |
|---|
Many other apparently noncrystallizable proteins are composed mostly of similar disordered regions, with some of these proteins lacking persistent 3-D structure along their entire lengths. Following the work of Ptitsyn and Uversky (1994), we proposed that native proteins may exist in ordered (folded, structured) and/or disordered (unfolded, unstructured) form, where the existence of disorder is determined by overall protein dynamics rather than by local secondary structure. Thus, α-helix, β-sheet, and coil, the three types of secondary structure that are characteristic of ordered chains, may also occur in regions of intrinsic disorder.
Given the strong association of disorder with function (Dunker et al. 2002a), disordered proteins are becoming the subject of increased interest (Wright and Dyson 1999; Dunker et al. 2002a; Dyson and Wright 2002; Uversky 2002b). The predictability of disordered regions from amino acid sequence (Obradovic et al. 2003), the observed compositional biases of such regions (Romero et al. 2001), the typically faster rates of evolution (Brown et al. 2002), and the distinctive amino acid substitution patterns during evolution (Radivojac et al. 2002) combine to strongly indicate that intrinsic protein disorder is generally encoded by the amino acid sequence (Dunker et al. 2002b).
| Flexible ordered regions versus intrinsically disordered regions |
|---|
| Results |
|---|
30 residues.
The amino acid compositions of the low-B-factor ordered, the high-B-factor ordered, and the two intrinsically disordered sets were compared to the compositions of a reference ordered set, Globular-3D (Romero et al. 2001), in order to gain insight into the differences among these data sets (Fig. 1). Because the low- and high-B-factor sets contain about 91% and 9% of the ordered amino acids, low-B-factor order has amino acid compositions very similar to those of the reference ordered set. However, the differences from the reference ordered set, although small, are not random: Low-B-factor order is slightly enriched in almost all of the more buried residues (Fig. 1, left) and slightly depleted in three particular surface residues (Fig. 1, right), serine, glutamic acid, and lysine.
The four distributions can also be compared using a more rigorous statistical approach. Because there is little higher-order Markov dependence in proteins (Nevill-Manning and Witten 1999), all segments from each group can be concatenated to form four distinct samples, Sk (k = 1, ..., 4). Each sample Sk can be considered a realization of an independent and identically distributed random process that emits symbols from an alphabet of 20 amino-acid codes. To compare the four amino-acid frequency distributions, we calculated the Kullback-Leibler (KL) distance between each pair of distributions p1 and p2 as

$$D_{KL}(p_1 \,\|\, p_2) = \sum_{i=1}^{20} p_1(i) \log \frac{p_1(i)}{p_2(i)}$$

where p1(i) and p2(i) represent relative frequencies of amino acid i in samples S1 and S2. In all cases, the reference distribution p2 was chosen to be the one with fewer observations. Table 1 presents the six non-zero KL-distances among these four distributions.
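As an illustration of the comparison described above, the following minimal sketch computes relative amino-acid frequencies for two concatenated samples and the standard KL distance between them. The function names, the base-2 logarithm, and the small floor used to avoid division by zero are assumptions made for this example and are not taken from the original analysis.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def relative_frequencies(sample):
    """Relative frequency of each of the 20 standard amino acids in a sample.

    `sample` is one long string formed by concatenating all segments of a group.
    Characters outside the 20-letter alphabet are ignored.
    """
    counts = Counter(ch for ch in sample if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def kl_distance(p1, p2, floor=1e-9):
    """Sum over i of p1(i) * log2(p1(i) / p2(i)), with p2 as the reference.

    Terms with p1(i) == 0 contribute nothing. The floor on p2(i) is an
    assumption of this sketch, used only to keep the ratio finite.
    """
    total = 0.0
    for aa in AMINO_ACIDS:
        if p1[aa] > 0.0:
            total += p1[aa] * math.log2(p1[aa] / max(p2[aa], floor))
    return total
```

Because the measure is not symmetric, `kl_distance(p1, p2)` and `kl_distance(p2, p1)` generally differ, which is why a rule for choosing the reference distribution is needed.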
To further understand the distinctions among the sets, five averages were determined: segment length, flexibility index value, hydropathy, net charge, and total charge (Table 2). Flexibility indices were compared because these are the focus of the present study, whereas average hydropathy and charge were compared because these two properties have been shown to be an indicator of natively unfolded proteins (Williams 1979; Uversky 2002b). The results in Table 2 indicate, surprisingly, that high-B-factor ordered regions have a higher average flexibility index, a higher average hydrophilicity, a higher average absolute net charge, and a higher total charge than do either short or long disordered regions. The low-B-factor ordered regions are significantly enriched in hydrophobic residues and depleted in the total number of charged residues compared to the other three classes. Finally, long disordered regions differ noticeably from both short disordered and high-B-factor ordered regions as their total charge is relatively high, but their (absolute) net charge is low with high variance. This indicates an overall balance of positively and negatively charged residues in the set of long disordered segments. Further analysis, however, indicates that individual segments often have significant net positive or negative charge, which contributes to the large variance in the bootstrapping experiment, with a slightly greater occurrence of negatively charged regions.
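The distinction between net charge and total charge can be made concrete with a short sketch. It assumes the Kyte and Doolittle (1982) hydropathy scale, counts Lys and Arg as +1 and Asp and Glu as -1, treats histidine as neutral, and normalizes by segment length; the function name and these conventions are illustrative assumptions and may differ from the exact definitions behind Table 2.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def segment_averages(seq):
    """Per-residue hydropathy, net charge, and total charge of one segment.

    Assumes a non-empty sequence of standard one-letter codes.
    """
    n = len(seq)
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return {
        "hydropathy": sum(KYTE_DOOLITTLE[aa] for aa in seq) / n,
        "net_charge": (positive - negative) / n,
        "total_charge": (positive + negative) / n,
    }
```

A segment such as `"KDKD"` has the maximum total charge per residue but zero net charge, which mirrors the balance of positive and negative residues described for the long disordered set.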
Predicting B-factor values
Despite the problems that arise from differences in crystal environments, B-factors show correlation with amino acid sequence, which suggests that they should be predictable from amino acid sequence. To test this hypothesis, three logistic regression models based on different attribute sets were trained to discriminate between high and low B-factors. The models were systematically evaluated for various window sizes, win and wout, and the best results were in all cases obtained for win = 1 for structural attributes, win = 5 for nonstructural attributes, and wout = 5. The three models are called the NS predictor, which uses no structural information, the KS predictor, which uses known secondary structure, and the PS predictor, which uses predicted secondary structure.
The NS predictor reached 64.5% accuracy (sn = 62.8 ± 0.9, sp = 66.1 ± 0.3), the PS predictor reached 67.0% accuracy (sn = 66.8 ± 0.9, sp = 67.2 ± 0.4), and the KS predictor reached 67.8% accuracy (sn = 65.3 ± 0.8, sp = 70.3 ± 0.3). The disparity in confidence intervals is due to the difference in sizes between the two classes. Construction of nonlinear models only marginally improved prediction accuracy (64.5% for the NS, 67.2% for the PS, and 68.3% for the KS predictor). Although the models were trained only to discriminate between high- and low-B-factor regions, we found that the approximated probability that the residue has a high B-factor is well correlated with the experimental B-values. The observed correlation coefficients for the experimental data versus the raw outputs of the NS, PS, and KS predictors reached 0.34 ± 0.02, 0.38 ± 0.02, and 0.41 ± 0.02, respectively.
The prediction accuracies and correlation coefficients of our B-factor predictors were compared with a predictor based only on flexibility indices by Vihinen et al. (1994), which was previously found to outperform other similar methods. The method of Vihinen et al. achieved 63.8% accuracy, and the correlation coefficient with the experimental data was 0.32 ± 0.02. Thus, our PS single-sequence predictor attained an improvement of 3.4 percentage points (5.3%) in prediction accuracy and 0.06 (19%) in correlation coefficient compared to the values obtained by Vihinen et al.
B-factor predictor with evolutionary modeling
It is well known that adding evolutionary information in the form of sequence alignments leads to improved secondary structure prediction (Benner et al. 1992; Levin et al. 1993; Rost 2001). In recent examples of this principle, Jones (1999) and Przybylski and Rost (2002) improved single-sequence prediction accuracy by 2–4 percentage points. Using a similar reasoning for B-factor prediction, we constructed protein families using PSI-BLAST and enhanced the performance of our models (Materials and Methods). The average improvement of the prediction results was 2.0 percentage points for the NS predictor and 2.5 percentage points for the PS predictor. Thus, the overall prediction accuracy reached 69.7%. We note that the higher the number of available homologs, the higher the prediction accuracy. For example, in the case in which 30 or more nonredundant homologs can be found, the average prediction accuracy reaches 70.8%. In terms of average correlation coefficients, PSI-BLAST-enhanced NS and PS predictors reached 0.36 ± 0.02 and 0.43 ± 0.02, respectively. Thus, the overall improvement over the predictor based only on flexibility indices by Vihinen et al. reached 5.9 percentage points (9.2%) in prediction accuracy and 0.11 (34.4%) in correlation coefficient. The quality of our predictions can be verified from the figure presented in the Supplemental Material.
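The improvements quoted above are given both as absolute differences and as relative gains over the baseline of Vihinen et al. (63.8% accuracy, correlation 0.32). The small helper below, whose name is hypothetical, shows the arithmetic behind both forms.

```python
def improvement(new_value, baseline):
    """Return (absolute difference, relative gain in percent of the baseline)."""
    difference = new_value - baseline
    return difference, 100.0 * difference / baseline

# Accuracy: 69.7% with PSI-BLAST versus 63.8% for the flexibility-index baseline.
acc_diff, acc_rel = improvement(69.7, 63.8)

# Correlation coefficient: 0.43 versus 0.32.
cc_diff, cc_rel = improvement(0.43, 0.32)
```

Rounded to the precision used in the text, these give 5.9 percentage points (9.2%) for accuracy and 0.11 (34.4%) for the correlation coefficient.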
Predictor-based analysis of the ordered and disordered data
To further explore the relationship between the ordered and disordered data sets that was suggested by the amino-acid frequency data, we used two predictors of intrinsic disorder: (1) a previously constructed predictor of long disordered regions, VL2 (Vucetic et al. 2003) and (2) a logistic-regression-based predictor developed here to discriminate between short disordered regions and ordered regions. The short disorder predictor, named XS1 according to our conventions (Obradovic et al. 2003), was developed from Dataset-SD and used the same set of attributes as our PS high-B-factor predictor. The maximum performance of 80.6% was obtained using win = 9 and wout = 7; the structural attributes were averaged in a window of 5.
The high-B-factor predictor, short disorder predictor, and long disorder predictor were all applied to three data sets (Dataset-O, Dataset-SD, and Dataset-LD) and the prediction results are shown in Table 4. This experiment confirmed that high-B-factors and short disorder are the most similar phenomena among the three data sets. On the other hand, VL2 performance on both B-factor and short disorder data sets was weak, in part caused by longer averaging (win = wout = 41). Correlation coefficients between predictor outputs were: 0.26 ± 0.02 between VL2 and the high-B-factor predictor, 0.31 ± 0.02 between VL2 and the short disorder predictor, and 0.88 ± 0.02 between high-B-factor and the short disorder predictor.
| Discussion |
|---|
Crystal packing effects can be viewed as a special case of nonlocal interactions. Given the dependence of the B-factor on packing density (Halle 2002) and hence on nonlocal interactions, crystal packing would be expected to exert large effects on B-factor values. In agreement with this, previous comparisons indicated that different crystal forms of myohemerythrin (Sheriff et al. 1985) and myoglobin (Phillips Jr. 1990) exhibited rather low correlations in their B-values, with further confirmation on additional protein pairs (Kundu et al. 2002). Our comparisons of many similar and identical proteins in the same and different space groups show that crystal packing effects generally perturb B-factor values, and the effects can be very significant (Table 3). Overall, the B-factor perturbations arising from crystal packing effects are probably the largest source of noise in the B-factor data.
Prediction accuracy
Prediction of B-factors cannot exceed the accuracy with which B-factors can be experimentally reproduced; thus, the noise in the B-factor data sets an upper limit to prediction of flexibility. To estimate this upper limit, we collected pairs of B-factor sets from identical proteins and subjected the data to the same analysis used to compare the predicted and observed B-factor values. The results suggest that the upper limit on prediction accuracy is approximately 81%. In terms of the agreement between raw predictions and experimental values, the upper limit on the correlation coefficient is about 0.8 (Table 3). From this perspective, our achievement of about 70% accuracy and a correlation coefficient of 0.43 seems quite reasonable.
Our predictor of high B-factors joins many other machine learning tools that attempt to predict protein features from amino acid sequence (Lund et al. 1997; Blom et al. 1999; Jones 1999; Pollastri et al. 2002; Obradovic et al. 2003). Its prediction accuracy is comparable to the 64%–77% accuracies for coordination number, two-class interresidue distances, or relative solvent accessibility, and lower than the 75%–80% prediction accuracy of secondary structure or long regions of intrinsic disorder. Because flexible ordered and short disordered protein regions are frequently involved in important biological functions and they were not previously predictable from the sequence using our old predictors, we expect this B-factor predictor to be an advanced practical tool to aid in the automated discovery of short molecular recognition regions and possibly even the active sites. Moreover, the raw outputs of this predictor can be utilized in semi-automated detection of flexible ordered regions (see Supplemental Material). The correlation of the high-B-factor regions with short disordered regions may prove important in high-throughput genomewide characterization of novel proteins with unknown structure and function.
The improvement in B-factor prediction from adding either known (KS predictor) or predicted (PS predictor) secondary structure is small but significant. This improvement is related to the differences in average flexibility observed over the three structural categories (data not shown). Addition of evolutionary information obtained by PSI-BLAST alignments improves prediction of B-factors, for both the NS and PS predictors. The improvement of about three percentage points matches the increase in secondary structure prediction (Przybylski and Rost 2002). The fact that the evolutionary information improved prediction results and that the PSI-BLAST-enhanced PS predictor outperformed the KS predictor is further support for the predictability of B-factor values from amino acid sequence.
In terms of correlation coefficients, results achieved in this study exceed those obtained with other methods from the literature. Predictors by Karplus and Schulz (1985), Vihinen et al. (1994), and Smith et al. (2003) reach correlation coefficients between 0.30 and 0.33, and some earlier methods (Bhaskaran and Ponnuswamy 1988; Ragone et al. 1989) cannot surpass 0.3. On the other hand, our PS predictor reached 0.38 without the presence of evolutionary information, and, on average, homologous sequences boost the correlation coefficient to 0.43. However, the gap of 0.23 between sequence-based methods and the 0.66 found using the methods of Kundu et al. (2002), which include known atom coordinates, is still significant.
The gap between sequence-based approaches and approaches based on atomic coordinates is likely to be further decreased in the future. An immediate route is noise reduction, which can be effectively achieved by determining residues that are involved in crystal contacts and excluding them from model training. We believe that an improvement similar to that seen in methods based on atomic coordinates can result (Kundu et al. 2002). Additionally, due to the imbalance between sizes of low- versus high-B-factor classes, our model was constructed using balanced data, which, in turn, led to a significant overprediction of the high B-factors. In our future research, we will study ways to detect locally flexible regions based on their local and nonlocal neighborhoods and thus reduce the number of false positives produced by our model.
Comparing compositions of high-B-factor ordered and intrinsically disordered proteins
Our original hypothesis was that amino acid composition determines whether a protein folds into specific 3-D structure or not. Although early indications of this idea were developed from structural studies on protein sequences (Williams 1978), we missed this original work and developed our version of this hypothesis from prior studies of lattice models of protein structure by Shakhnovich and Gutin (1993). In those lattice studies, the determination whether a lattice-model protein folds or not depended on the polar/nonpolar ratio, which corresponds to the amino acid composition in real proteins. Given a folding polar/nonpolar ratio (composition), the detailed arrangement of the amino acids indicated which fold was stabilized. Here we suggest that, not only foldability, but also flexibility is determined, to a significant degree, by the amino acid composition.
Comparison of the amino acid compositions of experimentally characterized regions of protein disorder with regions of order (Romero et al. 2001) showed that disordered proteins generally have more of the flexible amino acids as defined by the scale of Vihinen et al. (1994), suggesting that disordered regions and high-B-factor regions might be quite similar to each other. Furthermore, Romero et al. (1997) also indicated that disordered regions of different lengths might have different amino acid compositions, but the original data sets were quite small. Here, comparisons of the amino acid compositions of low- and high-B-factor regions and short and long disordered regions indicate that all four categories are distinct (Fig. 1; Tables 1, 2). Although the compositional distinctions among the high-B-factor order, short disorder, and long disorder sets might change as more data are added, we expect the overall trends indicated in Tables 1 and 2 to be maintained. This expectation is based on the observation that the current data sets are large enough already to show statistically significant distinctions.
Just as amino acid compositions vary for different types of secondary structure (Nakashima et al. 1986; Liu and Chou 1999; Cai et al. 2002), compositional differences might distinguish different types of intrinsic disorder or different types of flexible regions. For example, regions of extended disorder might be expected to be more hydrophilic than either regions of collapsed disorder or regions corresponding to the premolten-globule, if indeed this form is distinctive (Uversky 2002a). Also, there could be compositional biases in subsets of intrinsically disordered proteins that correlate with function such as enrichments in lysine and arginine for nucleic acid binding regions. Indeed, recently published work provides some support for this conjecture (Vucetic et al. 2003).
Previously we found significant amino-acid compositional differences between ordered protein and long regions of intrinsic disorder. If structure–sequence relationships existed on a continuum, then one would expect to observe monotonic increases or decreases in the various amino acid compositions as the set of interest is changed from low-B-factor regions, to high-B-factor regions, to short disordered regions and to long disordered regions. However, almost none of the amino acids exhibit monotonic changes in the order indicated. Even the global averages of Table 2 do not exhibit monotonic changes across the different flexibility/disorder classes in the order indicated. Thus, the amino acid compositions that specify flexibility and intrinsic disorder are evidently distinct and not merely quantitative differences on a continuum.
| Materials and methods |
|---|
2 Å, and an R-factor ≤20%. Sequence identity within the set was limited to 25%, and only chains without nonstandard residues and missing backbone or side chain atoms were chosen, making a database of 67,552 residues in total.
The second set of protein chains, Dataset-EO, contains 1287 sequences from the PDB divided into 195 disjoint clusters of similar sequences. For each chain in a cluster there is at least one other chain with ≥50% sequence identity. Minimum and maximum cluster sizes are 2 and 205, and the total number of residues is 238,133. All proteins in the data set were required to have at least 50 residues and a resolution of ≤2 Å.
The third data set, Dataset-SD, was extracted from the PDB and contains nonredundant chains with stretches of missing coordinates no longer than 10 consecutive residues. The length limitation of 10 residues was chosen in order to make the average segment length and standard deviation comparable to the high-B-factor regions from Dataset-O. All chains from Dataset-SD were required to be at least 80 residues in length, and the maximum sequence identity between any two chains was limited to 25%. Dataset-SD contains 511 sequences with 3216 disordered residues in short stretches out of 174,301 total residues.
All data sets are publicly available at our Web site: http://www.ist.temple.edu.
Data representation and types of predictors
To construct a predictor, a machine-learning example (data point) was constructed for each residue where the corresponding Cα atom B-factor was quantized into classes high and low, according to a threshold, and included as a binary target designation. To compensate for the large variability of averages over proteins, Cα B-factors were normalized using the method of Smith et al. (2003) prior to quantization.
An attribute vector for each position in a protein was constructed considering neighboring amino acids within a symmetric input window of size win. The window was centered at a given position except near the N and C termini, where the window was allowed to expand and collapse, respectively, and where the window was no longer centered as described in more detail previously (Vucetic et al. 2003). The first 21 attributes were the 20 relative frequencies of each amino acid within win and K2 entropy, a measure of sequence complexity (Wootton and Federhen 1996). The last set of attributes used in the present study exploits secondary structure information. Because each residue may belong to structure forms α-helix, β-sheet, and coil, we included three structural attributes, constructed in the same way as compositional attributes. The NMR- or X-ray-determined structures of a query sequence were used for the KS predictor (known structure), the first of the three models built in this work. For proteins whose structure was unknown, the raw PHD secondary structure predictions (Rost et al. 1997) on the query amino acid sequence were used. We refer to the predictor using PHD scores as the PS predictor (predicted structure). Finally, the NS predictor (no structure), which does not exploit secondary structure information, was used for comparison purposes. It is possible to optimize the size of the input window for each attribute; however, due to the high computational requirements, the window size for the structural attributes was optimized separately from the remaining attributes.
After predictions were made for each residue in a protein, the raw outputs were smoothed using a moving average postfiltering. The size of the smoothing (output) window wout was also subject to optimization.
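The windowed attributes and the output smoothing described above can be sketched as follows. This is only an illustration under stated assumptions: the window is simply truncated at the chain termini (the exact terminal handling is described in Vucetic et al. 2003 and is not reproduced here), K2 entropy is taken to be the Shannon entropy in bits of the window composition, and the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_attributes(seq, i, w_in):
    """21 attributes for position i: 20 window frequencies plus an entropy term.

    The window spans w_in residues centered on i and is truncated at the
    termini (an assumption of this sketch).
    """
    half = w_in // 2
    window = seq[max(0, i - half): i + half + 1]
    freqs = [window.count(aa) / len(window) for aa in AMINO_ACIDS]
    # Shannon entropy (bits) of the window composition, assumed here for K2.
    entropy = -sum(f * math.log2(f) for f in freqs if f > 0.0)
    return freqs + [entropy]

def smooth(raw_outputs, w_out):
    """Moving-average postfilter over the raw per-residue predictor outputs."""
    half = w_out // 2
    smoothed = []
    for i in range(len(raw_outputs)):
        chunk = raw_outputs[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

For example, `composition_attributes(seq, i, 5)` followed by `smooth(outputs, 5)` corresponds to the win = 5, wout = 5 setting reported for the nonstructural attributes.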
Model choice, training, and evaluation criteria
We use logistic regression for linear modeling and bagged neural networks (Breiman 1996) for nonlinear modeling. To train a predictor we applied the following procedure: The original set of 290 proteins was first randomly split into training and testing sets in the ratio 75% : 25%. From the set of training proteins we constructed examples for all available residues and then fed them to the model, which learned from a class-balanced data set. After the model was trained, we evaluated its performance on all examples from the test set. The whole process of splitting, training, and testing was repeated 30 times in all experiments.
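A minimal sketch of the data handling in this procedure is given below: proteins (not residues) are split 75% : 25%, and the training examples are then balanced between the two classes. Balancing by random downsampling of the larger class is an assumption of this sketch, as the text does not state how the balanced set was formed; the function names are hypothetical.

```python
import random

def split_proteins(protein_ids, train_fraction=0.75, rng=None):
    """Randomly split whole proteins into training and testing sets."""
    rng = rng or random.Random()
    ids = list(protein_ids)
    rng.shuffle(ids)
    cut = round(train_fraction * len(ids))
    return ids[:cut], ids[cut:]

def balance_classes(examples, rng=None):
    """Return a class-balanced list of (attributes, label) examples.

    Labels are 1 for high B-factor and 0 for low B-factor. The larger class
    is randomly downsampled to the size of the smaller one (an assumption).
    """
    rng = rng or random.Random()
    high = [e for e in examples if e[1] == 1]
    low = [e for e in examples if e[1] == 0]
    n = min(len(high), len(low))
    balanced = rng.sample(high, n) + rng.sample(low, n)
    rng.shuffle(balanced)
    return balanced

# Repeating the split 30 times, as in the text, would look like this:
# for repetition in range(30):
#     train_ids, test_ids = split_proteins(all_ids, rng=random.Random(repetition))
```

Splitting at the protein level keeps all residues of one chain on the same side of the split, so neighboring windows from a test protein never appear in training.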
To evaluate the performance of the predictors, we measured sensitivity (sn) and specificity (sp) for a given set of parameters. Sensitivity is defined as the percentage of high B-factors correctly predicted, and specificity is the percentage of low B-factors correctly predicted (Hastie et al. 2001). This type of model evaluation is commonly used in cases of class imbalance (Kubat et al. 1998). Assuming the class sizes are equal, the accuracy of prediction (acc) is expressed as the arithmetic mean of sensitivity and specificity. Therefore, random predictors or models that always output only one class will have an accuracy of 50%. Together with sensitivity and specificity, we also report their 95% confidence intervals calculated as
, where s is the standard deviation of the estimate (sn or sp) and n is the number of experimental repetitions.
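The evaluation measures defined above can be sketched in a few lines. The confidence-interval helper assumes the common normal approximation, a half-width of 1.96 s/√n, which may differ from the exact expression used in the study; the function names are hypothetical.

```python
import math
import statistics

def sensitivity_specificity(y_true, y_pred):
    """Return (sn, sp, acc) in percent; label 1 = high B-factor, 0 = low.

    acc is the arithmetic mean of sn and sp. Both classes are assumed to be
    present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sn = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    return sn, sp, (sn + sp) / 2.0

def ci95_half_width(estimates, z=1.96):
    """Half-width of an approximate 95% interval over repeated experiments.

    Uses z * s / sqrt(n) with s the sample standard deviation (an assumption).
    """
    s = statistics.stdev(estimates)
    return z * s / math.sqrt(len(estimates))
```

With this definition of accuracy, a model that always outputs "high" obtains sn = 100%, sp = 0%, and acc = 50%, regardless of how imbalanced the classes are.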
Prediction averaging over evolutionary data
Families of homologous proteins were built using PSI-BLAST queries of GenBank (Benson et al. 1999). The conditions for the PSI-BLAST queries included using the BLOSUM62 scoring matrix (Henikoff and Henikoff 1992) with 11/1 gap penalties and E-values of 0.0002 to include a sequence in a profile and of 0.01 to accept it as a family member. The maximum number of iterations was limited to three in order to constrain the influence of potential false positives. Construction of profiles usually incorporates some form of weight assignment in order to limit the influence of both very similar hits and sequences from the "twilight zone." As noted in the study of Altschul et al. (1997), several intuitive weighting schemes usually yield similar results. Based on these previous studies, the following simple scheme was devised: All sequences with sequence identity above 70% or below 30% in the region of the local alignment to the query sequence were discarded from the family. Additionally, no pair of homologs within a family was allowed to exceed the 70% sequence identity threshold. Pairwise sequence alignments were performed using the Smith-Waterman algorithm (Smith and Waterman 1981) with the BLOSUM62 scoring matrix and 11/1 gap penalties. The remaining sequences in each family were all assigned equal weights, and prediction of the B-factor for the query sequence at position i was formed as an average over all proteins in a family that do not have a gap at that position.
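The family filtering and equal-weight averaging can be sketched as below. The data layout is an assumption made for illustration: each hit is a (name, identity) pair with identity to the query expressed as a fraction, and each family member's predictions are already mapped onto query positions, with `None` marking a gap. The pairwise 70% check among homologs is omitted for brevity, and the function names are hypothetical.

```python
def filter_family(hits, low=0.30, high=0.70):
    """Keep hits whose identity to the query lies within [low, high]."""
    return [name for name, identity in hits if low <= identity <= high]

def average_over_family(aligned_predictions):
    """Equal-weight average of predictions at each query position.

    `aligned_predictions` is a list of rows, one per family member, each with
    one value per query position; None marks a gap and is skipped. A position
    with no ungapped member yields None.
    """
    n_positions = len(aligned_predictions[0])
    averaged = []
    for i in range(n_positions):
        values = [row[i] for row in aligned_predictions if row[i] is not None]
        averaged.append(sum(values) / len(values) if values else None)
    return averaged
```

If the query's own predictions are included as one of the rows, every position has at least one value, so the `None` fallback is never reached.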
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 33893402.
Benner, S.A., Cohen, M.A., and Gerloff, D. 1992. Correct structure prediction? Nature 359: 781.
Benson, D.A., Boguski, M.S., Lipman, D.J., Ostell, J., Ouellette, B.F., Rapp, B.A., and Wheeler, D.L. 1999. GenBank. Nucleic Acids Res. 27: 1217.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The protein data bank. Nucleic Acids Res. 28: 235242.
Bhaskaran, R. and Ponnuswamy, K.P. 1988. Positional flexibilities of amino acid residues in globular proteins. Int. J. Pept. Protein Res. 32: 241255.
Blom, N., Gammeltoft, S., and Brunak, S. 1999. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J. Mol. Biol. 294: 13511362.[CrossRef][Medline]
Breiman, L. 1996. Bagging predictors. Machine Learning 24: 123140.
Brown, C.J., Takayama, S., Campen, A.M., Vise, P., Marshall, T., Oldfield, C.J., Williams, C.J., and Dunker, A.K. 2002. Evolutionary rate heterogeneity in proteins with long disordered regions. J. Mol. Evol. 55: 104110.[CrossRef][Medline]
Cai, Y.D., Liu, X.J., Xu, X.B., and Chou, K.C. 2002. Artificial neural network method for predicting protein secondary structure content. Comput. Chem. 26: 347350.[Medline]
Carugo, O. 2001. Detection of breaking points in helices linking separate domains. Proteins 42: 390398.[Medline]
Carugo, O. and Argos, P. 1998. Accessibility to internal cavities and ligand binding sites monitored by protein crystallographic thermal factors. Proteins 31: 201213.[CrossRef][Medline]
. 1999. Reliability of atomic displacement parameters in protein crystal structures. Acta Crystallogr. D Biol. Crystallogr. 55: 473478.[Medline]
Dunker, A.K., Brown, C.J., Lawson, J.D., Iakoucheva, L.M., and Obradovic, Z. 2002a. Intrinsic disorder and protein function. Biochemistry 41: 65736582.[CrossRef][Medline]
Dunker, A.K., Brown, C.J., and Obradovic, Z. 2002b. Identification and functions of usefully disordered proteins. Adv. Protein Chem. 62: 2549.[Medline]
Dyson, H.J. and Wright, P.E. 2002. Coupling of folding and binding for unstructured proteins. Curr. Opin. Struct. Biol. 12: 5460.[CrossRef][Medline]
Garner, E., Cannon, P., Romero, P., Obradovic, Z., and Dunker, A.K. 1998. Predicting disordered regions from amino acid sequence: Common themes despite differing structural characterization. Genome Inform. Ser. Workshop Genome Inform. 9: 201213.[Medline]
Halle, B. 2002. Flexibility and packing in proteins. Proc. Natl. Acad. Sci. 99: 12741279.
Hastie, T., Tibshirani, R., and Friedman, J.H. 2001. The elements of statistical learning: Data mining, inference, and prediction. Springer Verlag, NY.
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915–10919.
Jones, D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195–202.
Karplus, P.A. and Schulz, G.E. 1985. Prediction of chain flexibility in proteins. Naturwissenschaften 72: 212–213.
Kubat, M., Holte, R.C., and Matwin, S. 1998. Detection of oil spills in satellite radar images of sea surface. Machine Learning 30: 195–215.
Kundu, S., Melton, J.S., Sorensen, D.C., and Phillips Jr., G.N. 2002. Dynamics of proteins in crystals: Comparison of experiment with simple models. Biophys. J. 83: 723–732.
Kyte, J. and Doolittle, R.F. 1982. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157: 105–132.
Levin, J.M., Pascarella, S., Argos, P., and Garnier, J. 1993. Quantification of secondary structure prediction improvement using multiple alignments. Protein Eng. 6: 849–854.
Liu, W. and Chou, K.C. 1999. Prediction of protein secondary structure content. Protein Eng. 12: 1041–1050.
Lund, O., Frimand, K., Gorodkin, J., Bohr, H., Bohr, J., Hansen, J., and Brunak, S. 1997. Protein distance constraints predicted by neural networks and probability density functions. Protein Eng. 10: 1241–1248.
Nakashima, H., Nishikawa, K., and Ooi, T. 1986. The folding type of a protein is relevant to the amino acid composition. J. Biochem. (Tokyo) 99: 153–162.
Nevill-Manning, C.G. and Witten, I.H. 1999. Protein is incompressible. In Data compression conference, pp. 257–266. IEEE Computer Society Press, Snowbird, UT.
Obradovic, Z., Peng, K., Vucetic, S., Radivojac, P., Brown, C.J., and Dunker, A.K. 2003. Predicting intrinsic disorder from amino acid sequence. Proteins (in press).
Phillips Jr., G.N. 1990. Comparisons of the dynamics of myoglobin in different crystal forms. Biophys. J. 57: 381–383.
Pollastri, G., Baldi, P., Fariselli, P., and Casadio, R. 2002. Prediction of coordination number and relative solvent accessibility in proteins. Proteins 47: 142–153.
Przybylski, D. and Rost, B. 2002. Alignments grow, secondary structure prediction improves. Proteins 46: 197–205.
Ptitsyn, O.B. and Uversky, V.N. 1994. The molten globule is a third thermodynamical state of protein molecules. FEBS Lett. 341: 15–18.
Radivojac, P., Obradovic, Z., Brown, C.J., and Dunker, A.K. 2002. Improving sequence alignments for intrinsically disordered proteins. Pac. Symp. Biocomput. 7: 589–600.
Ragone, R., Facchiano, F., Facchiano, A., Facchiano, A.M., and Colonna, G. 1989. Flexibility plot of proteins. Protein Eng. 2: 497–504.
Rhodes, G. 1993. Crystallography made crystal clear: A guide for users of macromolecular models. Academic Press, San Diego, CA.
Ringe, D. and Petsko, G.A. 1986. Study of protein dynamics by X-ray diffraction. Methods Enzymol. 131: 389–433.
Romero, P., Obradovic, Z., Kissinger, C.R., Villafranca, J.E., and Dunker, A.K. 1997. Identifying disordered regions in proteins from amino acid sequences. IEEE Int. Conf. Neural Netw. 1: 90–95.
Romero, P., Obradovic, Z., Li, X., Garner, E.C., Brown, C.J., and Dunker, A.K. 2001. Sequence complexity of disordered protein. Proteins 42: 38–48.
Rost, B. 2001. Review: Protein secondary structure prediction continues to rise. J. Struct. Biol. 134: 204–218.
Rost, B., Schneider, R., and Sander, C. 1997. Protein fold recognition by prediction-based threading. J. Mol. Biol. 270: 471–480.
Shakhnovich, E.I. and Gutin, A.M. 1993. Engineering of stable and fast-folding sequences of model proteins. Proc. Natl. Acad. Sci. 90: 7195–7199.
Sheriff, S., Hendrickson, W.A., Stenkamp, R.E., Sieker, L.C., and Jensen, L.H. 1985. Influence of solvent accessibility and intermolecular contacts on atomic mobilities in hemerythrins. Proc. Natl. Acad. Sci. 83: 1104–1107.
Smith, J.L., Hendrickson, W.A., Honzatko, R.B., and Sheriff, S. 1986. Structural heterogeneity in protein crystals. Biochemistry 25: 5018–5027.
Smith, D.K., Radivojac, P., Obradovic, Z., Dunker, A.K., and Zhu, G. 2003. Improved amino acid flexibility parameters. Protein Sci. 12: 1060–1072.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195–197.
Uversky, V.N. 2002a. Natively unfolded proteins: A point where biology waits for physics. Protein Sci. 11: 739–756.
Uversky, V.N. 2002b. What does it mean to be natively unfolded? Eur. J. Biochem. 269: 2–12.
Vihinen, M. 1987. Relationship of protein flexibility to thermostability. Protein Eng. 1: 477–480.
Vihinen, M., Torkkila, E., and Riikonen, P. 1994. Accuracy of protein flexibility predictions. Proteins 19: 141–149.
Vucetic, S., Brown, C.J., Dunker, A.K., and Obradovic, Z. 2003. Flavors of protein disorder. Proteins 52: 573–584.
Wampler, J.E. 1997. Distribution analysis of the variation of B-factors of X-ray crystal structures; temperature and structural variations in lysozyme. J. Chem. Inf. Comput. Sci. 37: 1171–1180.
Williams, R.J. 1978. The conformational mobility of proteins and its functional significance. Biochem. Soc. Trans. 6: 1123–1126.
Williams, R.J. 1979. The conformational properties of proteins in solution. Biol. Rev. Camb. Philos. Soc. 54: 389–437.
Wootton, J.C. and Federhen, S. 1996. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 266: 554–571.
Wright, P.E. and Dyson, H.J. 1999. Intrinsically unstructured proteins: Re-assessing the protein structure-function paradigm. J. Mol. Biol. 293: 321–331.