1 Computational Biology Branch, National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), Bethesda, Maryland 20894, USA
2 Section of Evolution and Ecology, University of California, Davis, California 95616, USA
Reprint requests to: Anna R. Panchenko, Computational Biology Branch, NCBI, Bldg. 38A, Rm. 8N805, NIH, Bethesda, MD 20894, USA; e-mail: panch{at}ncbi.nlm.nih.gov; fax (301) 435-7794.
(RECEIVED October 1, 2003; FINAL REVISION December 2, 2003; ACCEPTED December 3, 2003)
| Abstract |
|---|
Keywords: protein domains; prediction of functional residues; evolutionary conservation
Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.03465504.
| Introduction |
|---|
Several investigators have considered the problem of functional site prediction using multiple sequence alignments (Casari et al. 1995; Andrade et al. 1997; Hannenhalli and Russell 2000; Li et al. 2003). Casari et al. (1995), for example, applied principal component analysis to a vector representation of protein sequences in a multidimensional "sequence space," to derive subfamily-specific residues involved in protein function. Andrade et al. (1997) proposed a rigorous clustering algorithm based on a self-organizing map as a means to identify protein subfamilies and retrieve characteristic sequence patterns. As functional similarity can be inferred from clades in phylogenetic trees, some methods of functional site prediction use phylogenetic analysis to identify residues associated with functional divergence (Lichtarge et al. 1996; Sjolander 1998; Aloy et al. 2001; Madabushi et al. 2002; del Sol Mesa et al. 2003). The evolutionary trace (ET) method, for example, delineates invariant residues responsible for subgroup specificity by partitioning the dendrogram into an increasing number of subgroups of similar sequences with subsequent analysis of their three-dimensional (3D) structures (Lichtarge et al. 1996; Aloy et al. 2001; Madabushi et al. 2002).
Despite the efforts in this field, the accuracy of functional site predictions remains low, suggesting that it may be worthwhile to consider other aspects beyond sequence conservation. Use of structure information is one possibility, because knowledge of the protein structure is necessary for predicting many aspects of protein function (Teichmann et al. 2001). Given that functionally important surface regions often contain residues with specific characteristics, some methods attempt to identify functional sites on the basis of physicochemical properties of individual residues, their electrostatic contribution, and their location in the 3D structure (Jones and Thornton 1997; Tsai et al. 1997; Elcock 2001; Bartlett et al. 2002). Landgraf and colleagues (2001), for example, offered an automated method for functional site prediction by identifying 3D clusters of conserved residues using residue-specific (regional) and global similarity scores.
Here we present a method that is based on the assumption that the structural location of functional sites is conserved between homologous proteins and that functionally important residues tend to cluster together in space, forming three-dimensional residue clusters or surface patches. In the method considered here, each residue is assigned a score that depends on its own conservation in homologs and the conservation of residues in its spatial neighborhood, as judged from the analysis of known structures within a given protein family. We hypothesize that high-scoring sites are more likely to be involved in specific binding or catalysis, and that one may identify functionally important residues even in the absence of structural data on protein–ligand or macromolecular complexes.
We tested the method on a benchmark of 86 protein domain families, including families with a wide variety of functions and sequence diversity. To assess the accuracy of functional site predictions, we applied a rigorous receiver operating characteristic (ROC) test (see Materials and Methods). This gave us a means to compare different scoring schemes directly, by calculating the actual number of correctly predicted functional sites at a given level of false assignments. We show that including information about conserved structural features in some cases helps to make more accurate predictions, especially for DNA/RNA binding macromolecular interfaces. When sequence diversity is low, spatial averaging also helps to detect functional sites against the high background of sequence conservation.
| Results |
|---|
To determine whether clustering of conserved residues in space and consideration of their solvent accessibility help to identify functional sites, we compared scoring functions based on sequence conservation alone and sequence conservation with spatial averaging (see Materials and Methods). Figure 1 shows the ROC30 statistic for the contact-based scoring function with an optimized distance cutoff (the distance cutoff yielding the best performance for each domain family) and with a fixed distance cutoff (less than 6 Å), plotted against ROC30 values obtained with a sequence-based scoring function. As can be seen from the figure, the contact-based scoring function with an optimized distance cutoff detects more functional sites than the sequence-based scoring function for 73% of domain families. Because the optimal distance cutoff is difficult to determine a priori for each domain family, in our work we used the 6 Å distance cutoff, which has been shown to yield the best performance.
It should be noted that there is great variety among different catalytic domains. They can vary in terms of the type of enzymatic activity, the sizes of protein clefts, and interacting ligands. These factors apparently make it difficult to predict active sites using a structure-based scoring function with a fixed distance cutoff. As a consequence, the sequence-based scoring function alone gives more reliable predictions for sufficiently diverse domain families, where conserved active sites become more apparent. On the other hand, DNA/RNA binding and protein–protein binding sites are very often nonspecific and form contiguous patches on the surface of the protein. These factors apparently allow the contact-solvent-accessibility scoring function to improve detection of functional sites.
Statistical significance of functional site predictions
To compare the results obtained by our method to the outcome of random assignments, we performed a binomial test for each domain family. The number of trials in the binomial test was equal to the overall number of functional residues in a given domain alignment, and the probability of success was calculated as the number of functional residues in the alignment divided by the overall number of residues in the alignment. Using the contact-solvent-accessibility scoring function, we found that predictions of functional sites for 57% of domain families are significant with P-values <0.05 (P-value here denotes the probability of finding an equal or higher number of correctly predicted functional sites purely from the binomial distribution). Values for domains with annotated catalytic, DNA/RNA-binding, and protein–protein binding sites were 76%, 35%, and 20%, respectively. Sequence conservation scoring yielded significant predictions of catalytic sites for 65% of domains, DNA/RNA-binding sites for 24% of domains, and protein–protein interfaces for 20% of domains (50% overall). In all cases a site was predicted to be functional if it belonged to the top 5% of the most conserved sites in the domain alignment.
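The binomial test described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `binomial_tail` and the family sizes in the example are hypothetical, and the trial count and success probability simply follow the definitions given in the text.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical family (illustrative numbers, not taken from the paper):
total_sites = 200          # aligned positions in the domain family
functional_sites = 10      # annotated functional residues
correctly_predicted = 3    # annotated residues found among the top-scoring sites

# As in the text: trials = number of functional residues,
# success probability = functional residues / all residues.
p_success = functional_sites / total_sites
p_value = binomial_tail(correctly_predicted, functional_sites, p_success)
print(f"P-value = {p_value:.4f}")
```

A prediction for the family would be called significant under the paper's criterion if the resulting P-value falls below 0.05.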
These results are comparable to those of the 3D cluster analysis employed by Landgraf et al. (2001). Those investigators identified 36% of all interface residues at a threshold of less than 1% expected from reshuffled alignments and 67% at the less stringent threshold of 10%. An automated method based on the ET approach found the correct locations of catalytic residue clusters for 62 out of 80 enzymes (78% of clusters compared to 76% of catalytic domains with significant predictions found by our method) for multiple alignments with less than 30% identity (Aloy et al. 2001). Aloy et al. defined a predicted site/cluster as correct if the overlap between the volume of the predicted cluster and the volume of the annotated functional site was more than 50%. Their method was considered to have made a correct prediction for a given protein if at least one of the predicted functional clusters was correct.
Conserved structural features help to predict functional residues for domain alignments with low sequence diversity
Our test set can be considered rather heterogeneous in terms of the sequence diversity of domain families (Table 2). For domain families with low sequence diversity, sequence and structure similarity is extensive and the degree of residue conservation is high for all positions in alignments. Sequence profiles based on low-diversity alignments perform relatively poorly in a database search (Panchenko and Bryant 2002), and we similarly found that functional residue identification is problematic in these cases. As shown in Figure 3, for low-diversity domain alignments (where the number of different amino acid types per column, Nobs, is less than 5 and average sequence identity is about 45%), the average recognition rate (R0.05) is less than 0.2, whereas for more diverse alignments (Nobs is greater than 15 and average sequence identity is about 20%), the average recognition rate is twice as high. In agreement with these results, Aloy et al. (2001) reported that for multiple alignments with sequence identity of more than 30%, their method of functional site prediction has very limited applications.
| Discussion |
|---|
Second, it was shown that sequence divergence of domain alignments is a prerequisite for successful functional prediction, and structurally conserved features help to discriminate functional and nonfunctional sites for families with low sequence diversity. Accordingly, to increase blind prediction accuracy we can formulate several rules based on these observations. The first: To predict functional residues for low-diversity families, whenever possible diversify them with more distantly related family representatives and, if not possible, use a structure-based scoring function. The second rule can be applied if the general function of the domain family is known: Whenever possible use contact-based and solvent accessibility-based scoring for predicting DNA/RNA binding and protein–protein binding sites; for catalytic sites use a contact-based scoring function for low-diversity families and the original sequence-based scoring function for all others. If a blind prediction of functional residues is being attempted, the simple strategy would be to apply these rules for initial family screening and then define functional residues as those having conservation scores among the top 7%, 6%, and 5% of conservation scores for catalytic, DNA/RNA binding, and protein–protein binding sites, respectively. These conservation score cutoffs correspond approximately to an error rate of 5% false positives.
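The screening rules above can be written down as a small decision function. The sketch below is only an illustration of those rules: the labels (`"catalytic"`, `"dna_rna"`, `"protein_protein"`, `"contact_solvent"`, and so on) and the function names are hypothetical, and whether a family counts as low-diversity is left to the caller as a boolean.

```python
# Top-percent cutoffs quoted in the text for each type of functional site.
TOP_PERCENT = {"catalytic": 7, "dna_rna": 6, "protein_protein": 5}


def choose_scoring(site_type, low_diversity):
    """Pick a scoring function name following the two rules in the text."""
    if site_type in ("dna_rna", "protein_protein"):
        return "contact_solvent"
    if site_type == "catalytic":
        return "contact" if low_diversity else "sequence"
    raise ValueError(f"unknown site type: {site_type}")


def predict_sites(scores, site_type):
    """Return indices of positions whose scores fall in the top N percent."""
    n_top = max(1, len(scores) * TOP_PERCENT[site_type] // 100)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_top])


# Toy example: 100 positions whose score increases with the index.
toy_scores = [i / 100 for i in range(100)]
print(choose_scoring("catalytic", low_diversity=True))
print(predict_sites(toy_scores, "catalytic"))
```

Integer percentages are used for the cutoffs so that the number of selected positions does not depend on floating-point rounding.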
As we showed, spatial averaging does not always help function prediction, and prediction accuracy still remains quite low. Madabushi et al. (2002) demonstrated that the number of clusters (or size of the largest cluster) of functional residues determined by the ET method was larger than the number of clusters predicted by random simulations for 98% of their test cases (at the significance level of 5%). It should be noted that this result does not imply that the ET method is able to correctly identify active sites for 98% of test proteins at the 5% significance level. Like Landgraf et al. (2001), we showed that the accuracy of functional site prediction is, in fact, far from reaching 100%. Applying ROC analysis we found that 47% of active sites, 20% of DNA/RNA binding sites, and 14% of protein–protein interfaces can be predicted at a 5% false positive rate. We note that the limited accuracy of functional prediction can be caused by differences in functional specificity among homologous family members as well as by the functional plasticity of protein molecules. Even proteins sharing the same evolutionary origin and functional activity may show variability in the physicochemical properties of functional residues and their location in a 3D structure (Todd et al. 2001, 2002; Lichtarge and Sowa 2002).
| Materials and methods |
|---|
The selected test set covered a broad range of different functional categories including 37 domains with annotated catalytic sites, 17 domains with annotated DNA/RNA binding sites, 20 domains with annotated protein–protein binding sites, and domains from other functional groups (domains containing disulfide bonds and domains with less than two annotated functional sites were excluded). Names of CDD families used in the test set together with their sequence diversity, length, and the number and type of functional sites are listed in Table 2. By definition, CDD alignments have at least one structural family representative, whereas in our test set the number of structures per family ranged from 1 to 15, with three structures per family on average.
Calculation of sequence conservation
We used two different measures to estimate the level of conservation at each position in CDD alignments. The first measure, information content, was based on counting the number of different amino acid types per aligned column and inferring the relationships between amino acid types with the pseudocount method (Altschul et al. 1997), where pseudocount frequencies were calculated using the PAM70 amino acid substitution matrix. The second measure of evolutionary conservation of different sites, the substitution rate per site, was calculated using the PAML3.12 package (Yang 1997) with its implementation of the Jones, Taylor, and Thornton amino acid substitution model (Jones et al. 1992), where the variable substitution rates across sites were described with the Γ-model. Phylogenetic trees required for this analysis were constructed by the neighbor-joining method (Saitou and Nei 1987) with the PHYLIP package (Felsenstein 1989).
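To make the first measure more concrete, the sketch below computes a simplified per-column information content. It is deliberately not the pseudocount scheme used in the paper: it uses raw observed frequencies against a uniform 20-letter background, ignores gaps, and omits the PAM70-derived pseudocounts. The function names are hypothetical. It also counts the number of different amino acid types per column (Nobs).

```python
from collections import Counter
from math import log2

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def n_obs(column):
    """Number of different amino acid types observed in an alignment column."""
    return len({aa for aa in column if aa in AMINO_ACIDS})


def information_content(column):
    """Simplified information content in bits: log2(20) minus Shannon entropy.

    Assumption: raw frequencies, uniform background, gaps and unknown
    characters skipped. No pseudocounts are applied.
    """
    residues = [aa for aa in column if aa in AMINO_ACIDS]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return log2(20) - entropy


# Toy columns: fully conserved, two residue types, and a column with a gap.
for col in ("GGGGGG", "GGGAAA", "KR-KKR"):
    print(col, n_obs(col), round(information_content(col), 3))
```

With this simplified definition a fully conserved column scores log2(20) bits and a column containing all 20 residue types in equal proportion scores zero.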
Scoring the clusters of conserved residues
For each position in the alignment, two regional conservation scores were calculated. The first one represented the average over conservation scores for residues located within a given distance from each position "i" of the alignment, namely,
$$C_{\mathrm{cont}}(i) = \frac{1}{n} \sum_{j=1}^{N} \delta_{ij}\, C_j \qquad (1)$$
where $\delta_{ij}$ is equal to 1 if residues i and j are in contact, and 0 otherwise. $C_j$ is the residue conservation score of residue j, N is the total number of positions in the alignment, and n is the number of residues in contact with residue "i." Contacts were defined between the virtual Cβ atoms (points 2.4 Å away from the Cα atom) of residues separated along the chain by at least five peptide bonds and having a distance less than a given distance cutoff (4, 5, 6, 7, 8, and 9 Å). It should be noted that contacts were calculated for all structural representatives of domain alignments, and only conserved contacts were used in the evaluation of $C_{\mathrm{cont}}$. The contact between positions i and j was defined as conserved if aligned residues in these positions formed the contact in all structural representatives. For those residues which did not make any contacts, the original residue conservation value was assigned. Inter-residue contacts conserved between all structural representatives were shown to increase prediction accuracy for 60% of domain families (for families with more than one structure) compared to the scoring function based on one representative structure (data not shown).
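The contact-based score can be sketched as follows. This is an illustrative reading of the description above, not the authors' code. It assumes that each structure is given as a list of virtual Cβ coordinates indexed by alignment position with no gaps, that a separation of at least five peptide bonds corresponds to an index difference of at least 5, and that the average runs over the contacting residues only. All function names are hypothetical.

```python
from math import dist


def contact_map(coords, cutoff=6.0, min_separation=5):
    """Set of position pairs (i, j), i < j, closer than `cutoff` angstroms."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts


def conserved_contacts(structures, cutoff=6.0):
    """Contacts present in every structural representative of the family."""
    return set.intersection(*(contact_map(c, cutoff) for c in structures))


def contact_score(conservation, contacts):
    """Average the conservation scores of the residues in contact with i.

    Positions without any contact keep their own conservation score.
    """
    neighbours = [[] for _ in conservation]
    for i, j in contacts:
        neighbours[i].append(j)
        neighbours[j].append(i)
    return [
        sum(conservation[j] for j in nb) / len(nb) if nb else conservation[i]
        for i, nb in enumerate(neighbours)
    ]


# Toy example: seven positions, only positions 0 and 5 are close in space.
structure_a = [(0, 0, 0), (10, 0, 0), (20, 0, 0), (30, 0, 0),
               (40, 0, 0), (3, 0, 0), (50, 0, 0)]
structure_b = [(0, 0, 0), (10, 0, 0), (20, 0, 0), (30, 0, 0),
               (40, 0, 0), (4, 0, 0), (50, 0, 0)]
conservation = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

shared = conserved_contacts([structure_a, structure_b])
print(shared)
print(contact_score(conservation, shared))
```

The pairwise loop is quadratic in the number of positions, which is adequate for single domains; a spatial grid or k-d tree would be a natural replacement for larger inputs.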
The second regional conservation score gave emphasis to solvent accessible residues, because these residues are very often involved in the formation of functionally important interfaces:
![]() | (2) |
where $\delta_{\mathrm{solv}}$ is equal to 1 if the solvent accessibility of position "i" is greater than 0.05, and 0 otherwise. Reversing equation 2 and considering only buried residues in contact did not improve the prediction accuracy (data not shown). The cutoff threshold of 0.05 was derived from an analysis of homologous protein structures forming a conserved hydrophobic interior (Miller et al. 1987). Solvent-accessible area was calculated by the DSSP algorithm (Kabsch and Sander 1983), where the solvent accessibility of residue "X" was defined as the ratio of its solvent-accessible area in the protein structure to that for the extended tripeptide Gly-X-Gly. The solvent accessibility of position "i" in a multiple alignment was calculated by averaging solvent accessibility values in a given position for all structural representatives.
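The solvent-accessibility indicator can be illustrated with a short sketch. It assumes that relative accessibilities (area in the structure divided by the Gly-X-Gly reference area) have already been computed for every structure and aligned position; the function names and the toy values are hypothetical. How the indicator enters the regional score is defined by Equation 2 and is not reproduced here.

```python
def mean_accessibility(per_structure):
    """Average relative accessibility per aligned position over all structures.

    `per_structure` holds one list of relative accessibilities per structure,
    all of the same length (one value per aligned position).
    """
    n_structures = len(per_structure)
    return [sum(column) / n_structures for column in zip(*per_structure)]


def delta_solv(per_structure, threshold=0.05):
    """Return 1 for positions whose mean accessibility exceeds the threshold."""
    return [1 if acc > threshold else 0
            for acc in mean_accessibility(per_structure)]


# Toy example: two structures, three aligned positions.
relative_accessibility = [
    [0.00, 0.10, 0.04],
    [0.02, 0.30, 0.08],
]
print(delta_solv(relative_accessibility))
```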
Evaluation of prediction accuracy
To evaluate the accuracy of functional site predictions, we calculated the number of correctly predicted functional sites (true positives) and the number of incorrectly predicted functional sites (false positives) found at different thresholds of conservation score. True positives were identified as those functionally important sites which had scores higher than a given score threshold. False positives, in turn, were identified as sites with scores higher than a given threshold, but unrelated to the functional activity of a given domain family. To measure the performance of retrieval methods, the truncated receiver operating characteristic (ROC) has been widely used (Gribskov and Robinson 1996; Schaffer et al. 2001). A ROCn statistic was calculated as the sum of the number of true positives found at the 1, 2, 3, ..., n false positive levels ($t_i$) divided by the overall number of true positives (T): $\mathrm{ROC}_n = \left(\sum_{i=1}^{n} t_i\right) / (nT)$. Here, the total number of true positives (T) was calculated as the total number of annotated functionally important sites in a given domain family, whereas the total number of false positives was equal to the difference between the total number of sites in the alignment and the number of functional sites annotated for a family. Knowing the number of true positives detected and the overall number of true positives, it is possible to calculate the fraction of true positives detected and, correspondingly, the fraction of false positives detected, and plot them in the order of decreasing score threshold (see Fig. 2). The false positive cutoff "n" was set to 30, which corresponds approximately to the first quarter of false positives detected. In those cases where the prediction performance was compared for different families with different numbers of false positives, the R0.05 was used.
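The ROCn statistic defined above can be computed with a single pass over the positions ranked by score. The sketch below is a minimal illustration with hypothetical names. Two details are assumptions not stated in the text: tied scores are kept in input order, and when a family has fewer than n non-functional sites the missing false positive levels are counted as having found all true positives.

```python
def roc_n(scores, is_functional, n=30):
    """Truncated ROC statistic: sum of t_i over the first n false positives,
    divided by n * T, where T is the number of annotated functional sites."""
    total_true = sum(1 for flag in is_functional if flag)
    if total_true == 0:
        raise ValueError("at least one annotated functional site is required")

    ranked = sorted(zip(scores, is_functional), key=lambda pair: pair[0],
                    reverse=True)
    true_found = 0
    false_found = 0
    area = 0
    for _, functional in ranked:
        if functional:
            true_found += 1
        else:
            false_found += 1
            area += true_found          # t_i for the i-th false positive
            if false_found == n:
                break
    # Assumption: pad missing false positive levels with the final count.
    area += (n - false_found) * true_found
    return area / (n * total_true)


# Toy example with five positions and n = 2.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [True, False, True, False, False]
print(roc_n(toy_scores, toy_labels, n=2))
```

A ranking that places every functional site above every non-functional site gives a value of 1, and a ranking that places them all below the first n non-functional sites gives 0.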
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
Andrade, M.A., Casari, G., Sander, C., and Valencia, A. 1997. Classification of protein families and detection of the determinant residues with an improved self-organizing map. Biol. Cybern. 76: 441–450.
Bartlett, G.J., Porter, C.T., Borkakoti, N., and Thornton, J.M. 2002. Analysis of catalytic residues in enzyme active sites. J. Mol. Biol. 324: 105–121.
Casari, G., Sander, C., and Valencia, A. 1995. A method to predict functional residues in proteins. Nat. Struct. Biol. 2: 171–178.
Chambers, J.M. 1998. Programming with data. A guide to the S language. Springer Verlag, New York.
del Sol Mesa, A., Pazos, F., and Valencia, A. 2003. Automatic methods for predicting functionally important residues. J. Mol. Biol. 326: 1289–1302.
Devos, D. and Valencia, A. 2000. Practical limits of function prediction. Proteins 41: 98–107.
Elcock, A.H. 2001. Prediction of functionally important residues based solely on the computed energetics of protein structure. J. Mol. Biol. 312: 885–896.
Felsenstein, J. 1989. PHYLIP – Phylogeny inference package. Cladistics 5: 164–166.
Gribskov, M. and Robinson, N.L. 1996. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem. 20: 25–33.
Hannenhalli, S.S. and Russell, R.B. 2000. Analysis and prediction of functional sub-types from protein sequence alignments. J. Mol. Biol. 303: 61–76.
Hegyi, H. and Gerstein, M. 1999. The relationship between protein structure and function: A comprehensive survey with application to the yeast genome. J. Mol. Biol. 288: 147–164.
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8: 275–282.
Jones, S. and Thornton, J.M. 1997. Prediction of protein–protein interaction sites using patch analysis. J. Mol. Biol. 272: 133–143.
Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
Landgraf, R., Xenarios, I., and Eisenberg, D. 2001. Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J. Mol. Biol. 307: 1487–1502.
Li, L., Shakhnovich, E.I., and Mirny, L.A. 2003. Amino acids determining enzyme-substrate specificity in prokaryotic and eukaryotic protein kinases. Proc. Natl. Acad. Sci. 100: 4463–4468.
Lichtarge, O. and Sowa, M.E. 2002. Evolutionary predictions of binding surfaces and interactions. Curr. Opin. Struct. Biol. 12: 21–27.
Lichtarge, O., Bourne, H.R., and Cohen, F.E. 1996. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. 257: 342–358.
Luscombe, N.M. and Thornton, J.M. 2002. Protein–DNA interactions: Amino acid conservation and the effects of mutations on binding specificity. J. Mol. Biol. 320: 991–1009.
Madabushi, S., Yao, H., Marsh, M., Kristensen, D.M., Philippi, A., Sowa, M.E., and Lichtarge, O. 2002. Structural clusters of evolutionary trace residues are statistically significant and common in proteins. J. Mol. Biol. 316: 139–154.
Marchler-Bauer, A., Panchenko, A., Shoemaker, B., Thiessen, P., Geer, L., and Bryant, S. 2002. CDD: A database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 30: 281–283.
Miller, S., Janin, J., Lesk, A.M., and Chothia, C. 1987. Interior and surface of monomeric proteins. J. Mol. Biol. 196: 641–656.
Nooren, I.M. and Thornton, J.M. 2003. Structural characterisation and functional significance of transient protein–protein interactions. J. Mol. Biol. 325: 991–1018.
Panchenko, A.R. and Bryant, S.H. 2002. A comparison of position-specific score matrices based on sequence and structure alignments. Protein Sci. 11: 361–370.
Saitou, N. and Nei, M. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406–425.
Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V., and Altschul, S.F. 2001. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29: 2994–3005.
Sjolander, K. 1998. Phylogenetic inference in protein superfamilies: Analysis of SH2 domains. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6: 165–174.
Teichmann, S.A., Murzin, A.G., and Chothia, C. 2001. Determination of protein function, evolution, and interactions by structural genomics. Curr. Opin. Struct. Biol. 11: 354–363.
Todd, A.E., Orengo, C.A., and Thornton, J.M. 2001. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307: 1113–1143.
Todd, A.E., Orengo, C.A., and Thornton, J.M. 2002. Plasticity of enzyme active sites. Trends Biochem. Sci. 27: 419–426.
Tsai, C.J., Lin, S.L., Wolfson, H.J., and Nussinov, R. 1997. Studies of protein–protein interfaces: A statistical analysis of the hydrophobic effect. Protein Sci. 6: 53–64.
Yang, Z. 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13: 555–556.