Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia
Reprint requests to: Oxana V. Galzitskaya, Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia; e-mail: ogalzit{at}vega.protres.ru; fax: 7095-924-0493.
(RECEIVED September 20, 2002; FINAL REVISION December 23, 2002; ACCEPTED January 7, 2003)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0233103.
| Abstract |
|---|
Keywords: Protein domain; latent entropy profile; degrees of freedom; domain database; superfamily
| Introduction |
|---|
Several methods have been developed to identify domains in globular proteins starting from atomic coordinates. All of these methods are based on a simple geometrical model stating that a domain has relatively more contacts within itself than with residues in the remainder of the structure (Busetta and Barrans 1984; Kikuchi et al. 1988; Islam et al. 1995; Siddiqui and Barton 1995; Berezovsky et al. 1999). With the current rapid growth in the number of sequences with unknown structures, it is very important not only to define protein structural domains accurately, but also to predict domain boundaries on the basis of amino acid sequence alone.
Recently, several methods for predicting domain boundaries from amino acid sequence have been proposed on the basis of multiple sequence alignments (Park and Teichmann 1998; Sonnhammer and Kahn 1994; Adams et al. 1996; Gracy and Argos 1998; Guan and Du 1998; Gouzy et al. 1999; George and Heringa 2002) and on statistically derived distributions of domain lengths (Wheelan et al. 2000). However, these methods can identify domains only if the sequence has detectable similarity to other sequence fragments in databases or if the length of the unknown domains does not deviate substantially from the average for known protein structures.
In this work, we describe a new method to predict domain boundaries from protein sequence alone using a simple physical approach. It is based on the fact that the unique three-dimensional structure of a protein results from the balance between the gain of attractive native interactions and the loss of conformational entropy; that is, the topology of the chain determines how much chain entropy is lost as native interactions are formed. Conformational entropy is often subdivided into backbone and side chain entropies, although the size and nature of the side chain are expected to impose steric constraints on the backbone. A large contribution to the loss of conformational entropy upon folding comes from side chains that are restricted in the folded protein. On the basis of this idea, we assume that a high side chain entropy of a region in a protein chain must be compensated by a high interaction energy within the region, which could correlate with a well-structured part of the globule, that is, with a domain unit. This means that the domain boundary is conditioned by amino acid residues with a small value of side chain entropy, which correlates with the side chain size. On the one hand, a relatively high Ala and Gly content at the domain boundary results in a high conformational entropy of the backbone chain between the domains. On the other hand, the presence of Pro residues leads to the formation of hinges for the relative orientation of domains. Considering here the conformational entropy as the number of degrees of freedom on the angles φ, ψ, and χ for each amino acid along the chain, our method for domain boundary prediction relies on finding the minima in a latent entropy profile. This offers a possibility of identifying domain boundaries in proteins without prior knowledge of their tertiary structure, opening the road to useful applications both in sequence analysis and structure prediction.
| Results and discussion |
|---|
As can be seen from Table 1, the accuracy of the predictions differs between superfamilies. The quality of the predictions is very good when the protein length is <450 residues. In these cases, the position of domain boundaries correlates well with the position of the deepest minima in the plots (see Table 1, Nos. 1–3, 7–11, 20, 22, 24–27, and 29). The value at the deepest minimum varies from 3.4 (the average number of degrees of freedom for the angles φ, ψ, and χ) to 3.9 in different groups, suggesting that a global optimum threshold does not exist. If there are several equally deep minima in the profile, one should use additional information from the entropy profile with a smaller window size (9 or 5 residues) to choose among them (Table 1, No. 23).
It can be seen from Table 1 that domain boundaries for some groups including proteins longer than 450 residues are difficult to predict (see Table 1, Nos. 5, 12, 13, 16, and 21). Usually in these cases, the profile contains additional minima. For large proteins, this method will provide valuable information about a potential sequence position of domain limits that could be complemented with additional information for further localization of domain boundaries.
Studies of the distribution of the 20 amino acid residues within the domain boundary region of 81 residues (40 on each side of the boundary) for 366 proteins from the 29 groups, consisting of 44 superfamilies, have indicated a preference for amino acid residues with a small side chain entropy value in comparison with the distribution of the residues over the full lengths of all 366 proteins. This confirms the hypothesis that domain boundaries are conditioned by this type of amino acid residue.
The problem of separating protein structures into their constituent domains becomes more complex as the number of domains increases. Usually, the accuracy of the prediction is quite good if single-domain proteins are included in the test set (Islam et al. 1995; Siddiqui and Barton 1995). The accuracy of correct assignment for two-domain proteins varies from 50% to 70% (Islam et al. 1995; Siddiqui and Barton 1995; Jones et al. 1998; Wheelan et al. 2000; George and Heringa 2002).
In our case, the accuracy of the method depends on the group. Using 280 two-domain proteins not selected into groups (see Materials and Methods) and considering the resolution of our method to be ± 40 residues (the window size), the predictions are accurate in 63% of cases, compared with 47% expected for random prediction, and the Z score is 5. In the case of the 29 groups, the prediction is accurate in 80% of cases, compared with 56% expected for random prediction, and the Z score is 11.
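The ± 40 residue criterion used above can be sketched in Python as follows. This is only an illustration: the function names `is_hit` and `fraction_correct` are hypothetical, and treating a prediction exactly 40 residues away as correct (an inclusive bound) is an assumption, because the text does not say which way the limit is handled.

```python
from typing import Iterable, Tuple


def is_hit(predicted: int, actual: int, tolerance: int = 40) -> bool:
    """True if the predicted boundary lies within +/- tolerance residues.

    The inclusive comparison (<=) is an assumption of this sketch.
    """
    return abs(predicted - actual) <= tolerance


def fraction_correct(pairs: Iterable[Tuple[int, int]], tolerance: int = 40) -> float:
    """Fraction of (predicted, actual) boundary pairs that count as hits."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to score")
    hits = sum(is_hit(p, a, tolerance) for p, a in pairs)
    return hits / len(pairs)
```

Each pair holds the predicted and the reference boundary position for one two-domain protein, both given as residue numbers.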
Table 2 presents the result of the prediction of domain boundaries for 21 CASP5 targets longer than 150 residues and nonhomologous to proteins of known three-dimensional structures. This blind test is the most suitable for assessing the accuracy of the new method and will allow potential users to verify the quality of our results on an unbiased test set.
The entropy parameter (the number of degrees of freedom on the angles φ, ψ, and χ) correlates with the side chain entropy. In fact, the average entropy parameter, calculated by summing the entropy parameter of each residue over the complete sequence and dividing by the protein length, correlates with the average side chain entropy considered by Galzitskaya et al. (2000).
The average entropy parameter for proteins in our database falls in a defined region (from 3.6 to 4) as has been demonstrated by Galzitskaya et al. (2000). Proteins with a high entropy parameter (>4) usually belong to DNA-binding proteins or bind some additional agents. In general, one of the domains has larger conformational entropy than the other, by, on average, 0.1–0.2 units (Table 1, Nos. 1, 7, 8, 19, 20, 22, 24, and 29). The need to balance the conformational entropy with the energy of interactions is one of the general conditions to achieve the functional active form of a protein, and it is possible that the domain organization is necessary for proteins to compensate for the large conformational entropy of one of the domains and to enhance the stability of the whole protein (see profile No. 30 in Table 1).
We would like to underline here that there are many domain prediction algorithms that use information from multiple sequence alignments or statistical analysis. However, there are no prediction algorithms that, like the one described here, rely only on the protein sequence. Although very simple, this method provides valuable and reasonable information about protein domain organization and can be useful for sequence analysis and structure prediction.
| Materials and methods |
|---|
Calculation of the entropy profile
The latent entropy profile is calculated as follows. First, the number of degrees of freedom for the angles φ, ψ, and χ is determined for each residue (Table 3); then, the values for the residues inside the window are averaged and assigned to the central residue of the window. Therefore, the influence of the residues flanking each position along the sequence is included in our calculation. The value of the average entropy parameter (the average number of degrees of freedom on the angles φ, ψ, and χ) for every position of the polypeptide chain provides the latent entropy profile, whose minima are predicted to correlate with domain boundaries (only the deepest minimum should be considered for two-domain proteins). We used a sliding window of 41 residues. This value was selected for two reasons. First, a domain should contain a hydrophobic core and should be larger than 40 residues. Second, this window size has been found to be the best compromise between a good resolution of the plot and a tolerable level of noise.
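The calculation described above can be sketched in Python as a sliding-window average followed by a search for the deepest minimum. This is a minimal sketch under stated assumptions, not the authors' program: the function names are hypothetical, positions closer than half a window to either chain end are left undefined because the text does not say how the ends are treated, and the per-residue table `TOY_DOF` holds illustrative placeholder numbers, not the values of Table 3.

```python
from typing import Dict, List, Optional, Tuple


def latent_entropy_profile(
    sequence: str, dof: Dict[str, float], window: int = 41
) -> List[Optional[float]]:
    """Average per-residue values over a sliding window.

    The mean of each window is assigned to its central residue. Positions
    closer than window // 2 to either end get None (an assumption: the
    source does not describe the treatment of chain ends).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    values = [dof[res] for res in sequence]
    half = window // 2
    profile: List[Optional[float]] = [None] * len(values)
    if len(values) < window:
        return profile
    running = sum(values[:window])  # sum over the first full window
    profile[half] = running / window
    for center in range(half + 1, len(values) - half):
        # Slide by one residue: add the new right end, drop the old left end.
        running += values[center + half] - values[center - half - 1]
        profile[center] = running / window
    return profile


def deepest_minimum(profile: List[Optional[float]]) -> Optional[Tuple[int, float]]:
    """Return (index, value) of the lowest defined point, or None."""
    defined = [(v, i) for i, v in enumerate(profile) if v is not None]
    if not defined:
        return None
    value, index = min(defined)  # ties resolve to the smallest index
    return index, value


# Hypothetical placeholder values for three residue types (NOT Table 3).
TOY_DOF = {"K": 6, "L": 4, "G": 2}
toy_sequence = "K" * 10 + "G" * 5 + "L" * 10
toy_profile = latent_entropy_profile(toy_sequence, TOY_DOF, window=5)
print(deepest_minimum(toy_profile))
```

For a real protein, `dof` would be filled with the values of Table 3 and the default window of 41 residues would be used; a smaller window (9 or 5) can be passed to the same function to inspect a profile at finer resolution.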
Estimation of accuracy of the method
The probability p_i of guessing the domain boundary by chance in our method is the ratio of two lengths: twice the window size (80 residues) and the protein length reduced by 50 residues at each end. When the reduced length is <80 residues, the probability p_i is equal to 1. The Z-score is then (M − <M>)/σ, in which M is the number of domain boundaries correctly predicted by our method, <M> is the average number of successful predictions expected at random, equal to the sum of the probabilities p_i with i running from 1 to the number of proteins considered, and σ is the standard deviation.
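A minimal Python sketch of this estimate follows. The function names are hypothetical. The text does not give a formula for the standard deviation, so the sketch assumes that each protein is an independent trial with success probability p_i, which gives a variance equal to the sum of p_i(1 − p_i); this choice is an assumption and may differ from the one used to obtain the reported Z-scores.

```python
import math
from typing import Iterable


def random_hit_probability(length: int, span: int = 80, end_trim: int = 50) -> float:
    """Chance of guessing the boundary at random for one protein.

    span is twice the window size (80 residues); the protein length is
    reduced by end_trim residues at each end, as described in the text.
    """
    reduced = length - 2 * end_trim
    if reduced < span:
        return 1.0
    return span / reduced


def z_score(correct: int, lengths: Iterable[int]) -> float:
    """Z = (M - <M>) / sigma for M correct predictions.

    ASSUMPTION: sigma is the standard deviation of a sum of independent
    Bernoulli trials, sqrt(sum of p_i * (1 - p_i)); the source only says
    that sigma is "the standard deviation".
    """
    probs = [random_hit_probability(n) for n in lengths]
    expected = sum(probs)  # <M>, the expected number of random successes
    variance = sum(p * (1.0 - p) for p in probs)
    if variance == 0.0:
        raise ValueError("sigma is zero: every p_i equals 1")
    return (correct - expected) / math.sqrt(variance)
```

For example, four proteins of 260 residues each have a reduced length of 160 and p_i = 0.5, so <M> = 2 and, under the assumption above, σ = 1.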
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Berezovsky, I.N., Namiot, V.A., Tumanyan, V.G., and Esipova, N.G. 1999. Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition. J. Biomol. Struct. Dyn. 17: 133–155.
Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, Jr., E.F., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., and Tasumi, M. 1977. The Protein Data Bank. A computer-based archival file for macromolecular structures. J. Mol. Biol. 112: 535–542.
Busetta, B. and Barrans, Y. 1984. The prediction of protein domains. Biochim. Biophys. Acta 790: 117–124.
Galzitskaya, O.V., Surin, A.K., and Nakamura, H. 2000. Optimal region of average side-chain entropy for fast protein folding. Protein Sci. 9: 580–586.
George, R.A. and Heringa, J. 2002. SnapDRAGON: A method to delineate protein structural domains from sequence data. J. Mol. Biol. 316: 839–851.
Gouzy, J., Corpet, F., and Kahn, D. 1999. Whole genome protein domain analysis using a new method for domain clustering. Comput. Chem. 23: 333–340.
Gracy, J. and Argos, P. 1998. Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities. Bioinformatics 14: 174–187.
Guan, X. and Du, L. 1998. Domain identification by clustering sequence alignments. Bioinformatics 14: 783–788.
Islam, S.A., Luo, J., and Sternberg, M.J. 1995. Identification and analysis of domains in proteins. Protein Eng. 8: 513–525.
Jones, S., Stewart, M., Michie, A., Swindells, M.B., Orengo, C., and Thornton, J.M. 1998. Domain assignment for protein structures using a consensus approach: Characterization and analysis. Protein Sci. 7: 233–242.
Kikuchi, T., Némethy, G., and Scheraga, H.A. 1988. Prediction of the location of structural domains in globular proteins. J. Protein Chem. 7: 427–471.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.
Park, J. and Teichmann, S.A. 1998. DIVCLUS: An automatic method in the GEANFAMMER package that finds homologous domains in single- and multi-domain proteins. Bioinformatics 14: 144–150.
Richardson, J.S. 1981. The anatomy and taxonomy of protein structure. Advan. Protein Chem. 34: 167–339.
Siddiqui, Q.S. and Barton, G.J. 1995. Continuous and discontinuous domains: An algorithm for the automatic generation of reliable protein domain definitions. Protein Sci. 4: 872–884.
Sonnhammer, E.L.L. and Kahn, D. 1994. Modular arrangement of proteins as inferred from analysis of homology. Protein Sci. 3: 482–492.
Wetlaufer, D.B. 1973. Nucleation, rapid folding, and globular intrachain regions in proteins. Proc. Natl Acad. Sci. 70: 697–701.
Wheelan, S.J., Marchler-Bauer, A., and Bryant, S.H. 2000. Domain size distributions can predict domain boundaries. Bioinformatics 16: 613–618.