Łukasz Salwiński
Departments of Chemistry and Biochemistry and Biological Chemistry, UCLA-DOE Laboratory of Structural Biology and Molecular Medicine, UCLA, Los Angeles, California 90095-1570, USA
Reprint requests to: D. Eisenberg, UCLA-DOE Laboratory of Structural Biology and Molecular Medicine, UCLA, Box 951570, Los Angeles, California 90095-1570, USA; e-mail: david{at}mbi.ucla.edu; fax: (310) 206-3914.
(RECEIVED April 16, 2001; FINAL REVISION September 4, 2001; ACCEPTED September 4, 2001)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1101/ps.14401.
| Abstract |
|---|
Keywords: Bioinformatics; sequence motif; functional annotation; fold assignment
Abbreviations: 3D, three-dimensional; HMM, hidden Markov model; SDP, sequence-derived properties; MBA, motif-based fold assignment; TIM, triose phosphate isomerase
| Introduction |
|---|
The rate of experimental determination of protein structures, although continuously increasing, still lags behind the rate of protein sequence determination by roughly two orders of magnitude. To fill this gap, investigators have developed methods for protein structure and function prediction (for reviews, see Smith 1999; Domingues et al. 2000a; Skolnick and Fetrow 2000; Skolnick et al. 2000). These methods rely almost exclusively on identification of sequence similarity to proteins of known fold or on the compatibility of the new sequence with the chemical environments of individual residues when threaded through previously determined experimental structures.
Sequence-based methods of fold assignment attempt to identify pairs of homologous proteins, that is, proteins that share, because of common ancestry, similar structure and function (Fig. 1). Dynamic programming-based sequence alignment methods (Needleman and Wunsch 1970; Smith and Waterman 1981) are able to identify homologs when sequence identity is larger than roughly 20%–30%. The use of multiple sequence alignment-based sequence profiles (Gribskov et al. 1987; Altschul et al. 1997) and HMM methods (Karplus et al. 1998) can, at least in some cases, extend the sensitivity of the fold assignments below 20% of sequence identity. Structure-based predictions take into consideration residue preferences for different environments within the structure (Bowie et al. 1991; Jones et al. 1992; Jones 1999; Bienkowska et al. 2000). When combined with the prediction of secondary structure, they can perform about as well as the sequence-based methods (Fischer and Eisenberg 1996; Russell et al. 1996; Jones et al. 1999; Panchenko et al. 2000). Another set of methods relies on the intergenome distribution of homologous proteins to infer their function directly (Marcotte 2000). Those methods, although bypassing the structure-determination step, are able to assign a function to the protein by identifying groups of nonhomologous proteins that coevolved and thus fulfill similar roles within a cell (Andrade et al. 1997).
Functional information was used recently in the later stages of the mostly manual structure prediction of CASP3 targets by Murzin and Bateman (1997) or to identify a possible function of the new protein after initial structure prediction (Zhang et al. 1999). On the other hand, the SAWTED algorithm (MacCallum et al. 2000) allows automatic screening of the potential predicted structures against the functional information about the unknown protein.
The fully automated approach presented here combines the functional information contained in the SwissProt keyword annotation with the Prosite motif database to improve the performance of any conventional sequence- or structure-based prediction. As opposed to the SAWTED approach (MacCallum et al. 2000), our method does not rely on the annotation used to characterize newly identified proteins of the unknown sequence and thus is well suited to analysis of poorly characterized sequences such as those produced by the full-genome sequencing projects.
| Results |
|---|
Motif–fold compatibility
It has long been known that some regions within a protein sequence are crucial for function and thus better conserved among homologs than are surrounding regions (Bork and Koonin 1996; Kasuya and Thornton 1999). This observation has led to the creation of motif libraries such as Prosite (Hofmann et al. 1999), which catalog patterns repeatedly recurring in protein sequences. The motifs present in the library can be classified as belonging to one of two groups. Some of them correspond to structural elements, such as coiled-coil and zinc-finger motifs, that are shared by all representatives of a given fold or group of folds. Often, conservation of such motifs is required for proper folding of the protein. The other group of sequence motifs reflects the functions of the molecule: cofactor and ligand binding pockets, catalytic sites, or motifs responsible for interaction with other proteins directly or after posttranslational modification. Usually proteins of similar structure perform similar functions within the cell, and thus it can be expected that the occurrence of not only structural but also functional motifs would correlate with protein fold, although there are marked exceptions to this expectation (Hegyi and Gerstein 1999).
The correlation between motif presence and protein structure can be evaluated by calculating the log-odds score, SFM, defined as
[equation not recovered] (1)
Figure 2a (solid circles) shows the distribution of SFM scores for the folds of our library. The presence of fold–motif pairs characterized by SFM >> 1 demonstrates that, indeed, in a number of cases, protein fold is strongly correlated with the presence of particular Prosite motifs. However, within the range -2 ≤ SFM < 2 there is a large number of uncorrelated pairs. Inspection shows that they are mostly due to the presence of short, weakly defined motifs, such as phosphorylation and myristoylation sites.
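To illustrate how a log-odds association score of this kind separates correlated from uncorrelated fold–motif pairs, the sketch below uses one common form: the logarithm of the observed co-occurrence frequency over the frequency expected if fold and motif were independent. The definition in Eq. 1 is not reproduced in this text, so the formula, the natural-log base, the pseudocount, and the name `log_odds_score` are all assumptions rather than the authors' definition.

```python
import math


def log_odds_score(n_fold_and_motif, n_fold, n_motif, n_total, pseudocount=0.5):
    """Log-odds of fold/motif co-occurrence versus independence.

    Assumed form (not taken from the paper): ln( P(F, M) / (P(F) * P(M)) ).
    The pseudocount (also an assumption) keeps the score finite when a
    fold-motif pair is never observed together.
    """
    p_joint = (n_fold_and_motif + pseudocount) / (n_total + pseudocount)
    p_fold = n_fold / n_total
    p_motif = n_motif / n_total
    return math.log(p_joint / (p_fold * p_motif))


# Toy counts for a library of 1000 domains, 20 of which adopt the fold.
# A specific motif found in 25 domains, 18 of them on this fold:
strong = log_odds_score(18, 20, 25, 1000)
# A ubiquitous short motif found in 500 domains, 10 of them on this fold
# (exactly what independence would predict):
weak = log_odds_score(10, 20, 500, 1000)
```

With these toy counts the specific motif scores well above 2, whereas the ubiquitous motif scores close to 0, mirroring the behavior described for short, weakly defined motifs.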
[equation not recovered] (2)
[equation not recovered] (3)
These keyword-filtered SFM scores can be used to evaluate sequence–fold compatibility in the same manner as the unfiltered scores. As shown in Figure 3 (open squares), the modified scoring scheme performs better than the initial SFM cutoff-based approach. Notice that annotation of only the target domains was used. A more advanced version of the method could also use annotation of the probe sequence obtained through automated literature scanning or by function prediction. However, to demonstrate that prior knowledge of the probe structure does not affect the performance of the method in the test set, no annotation of the probe sequence was used here.
In addition to improved overall performance, the keyword-filtered version returns a large number of correct fold assignments that are missed by a sequence-only method, such as PSI-BLAST (Fig. 4). In fact, for both methods of motif selection, the set of correct motif-based assignments, although comparable in number with the PSI-BLAST results, differs from it by more than 40%. This observation suggests that the motif-based method relies on a different set of information embedded in the protein sequence than do the conventional sequence-based methods, and thus a combination of both approaches might be beneficial.
[equation not recovered] (4)
[…] is an empirically adjustable weight. As shown in Figure 5a, […] significantly better than does each component alone. Additional improvement of performance is possible by using both CMK and CFM cutoffs. As shown in Figure 5a […] = 0.075, CMK = 0.25, CFM = 4.0).
[…] = 0. In this case, the presence of motifs affects the fold assignment only through CMK and CFM cutoffs but does not modify the initial, sequence score-based ranking of the targets. Note that such a scoring scheme is analogous to the use of the occurrence of highly conserved sequence motifs during manual analysis of a protein sequence. Thus, the improvement observed for a weight of 0 is consistent with the usefulness of this common, manual approach.
Despite the partial reliance of the combined score on the sequence information, the set of predictions based on Stot is still different from the results returned by PSI-BLAST. Thus, combining the two methods results in an additional increase of the performance shown in Figure 5b. Here the prediction was generated by first running PSI-BLAST and accepting hits at a significance level of p = 1 × 10⁻³. When PSI-BLAST returned no significant hits, a combined sequence–motif assignment was generated. It is apparent that, at the accuracy level of 95%, the coverage of the combined method is more than five times higher than for SDP and about 40% higher than for PSI-BLAST (50% vs. 35% total coverage).
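The two-stage procedure described above (PSI-BLAST first, combined sequence–motif score only as a fallback) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Hit` record, the function name `assign_fold`, and the assumption that a higher combined score is better are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hit:
    target: str                        # target domain identifier (hypothetical)
    score: float = 0.0                 # combined score; higher is assumed better
    significance: Optional[float] = None  # PSI-BLAST significance, if any


def assign_fold(
    psiblast_hits: List[Hit],
    combined_hits: List[Hit],
    cutoff: float = 1e-3,
) -> Tuple[str, Optional[Hit]]:
    """Return (method used, best hit) following the cascade in the text."""
    # Stage 1: accept PSI-BLAST hits that pass the significance cutoff.
    significant = [
        h for h in psiblast_hits
        if h.significance is not None and h.significance < cutoff
    ]
    if significant:
        return "psi-blast", min(significant, key=lambda h: h.significance)
    # Stage 2: no significant PSI-BLAST hit, so fall back to the
    # combined sequence-motif ranking.
    if combined_hits:
        return "combined", max(combined_hits, key=lambda h: h.score)
    return "none", None
```

Because the fallback is consulted only when PSI-BLAST is silent, the cascade cannot lower PSI-BLAST coverage; it can only add assignments for probes that PSI-BLAST leaves unassigned.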
| Discussion |
|---|
The automated motif-based fold assignment method presented here is based on two observations. First, the observation that the functional sites are more often conserved than the rest of the sequence establishes a traceable correlation between folds and sequence motifs, even in cases where automatic detection of the sequence–sequence homology is not reliable. Second, the observation that functional annotation can be used to identify "meaningful" motifs allows one to separate them from the random occurrences inevitable for information-poor, short motifs. The use of the two independent criteria, annotation and motif occurrence, to obtain the motif-based score bypasses the problem frequently encountered when dealing with remote homologs: the decrease of coverage that accompanies eliminating false positives by raising the significance cutoff. The scoring scheme presented here uses the correlation between two partially independent sources of information and thus is less compromised by uncorrelated noise in either of them.
The contribution of functional annotation to fold assignment is helpful for a number of reasons. The most significant is that motifs shared by only a few folds or present in only a subset of folds can be identified by virtue of the annotation–motif correlation. This allows for a less stringent motif–fold cutoff, CFM, leading to increased coverage of the method without sacrificing accuracy. Another advantage is the possibility of using motifs present in the sequences closely related to the probe, such as those identified through BLAST searches. Those motifs, although by definition not completely conserved, can still provide information about the possible functional sites of the fold to be identified. This information can be further validated using the annotation–motif correlation. Initial results indicate that the performance of the modified method is at least comparable to the use of motifs present only in the original probe sequence (data not shown).
In the benchmark adopted here, we attempted to eliminate any effects of prior knowledge of the probe structure on the results. We assumed that no annotation of the probe sequence is available, either directly or through a simple BLAST search of annotated sequence databases or through other forms of function prediction (Marcotte et al. 1999; Pellegrini et al. 1999; Marcotte 2000) or data-mining techniques (Andrade et al. 1999). However, the final version of the algorithm can easily accommodate and benefit from additional annotation of probe sequences obtained experimentally, through a literature search (see MacCallum et al. 2000) or as the result of functional predictions. In the latter case, the predictions are often inferred in a sequence-independent manner (Marcotte 2000). Annotation obtained in this way is often independent from sequence- and experiment-based information that is used by the current version of the algorithm, and would therefore be expected to enhance the signal.
It should be pointed out that our method, although using functional annotation, does not rely directly on the annotation transfer between homologous proteins but, rather, detects correlations between annotation and sequence motifs. Thus, it is not limited by the low level of function conservation that has been reported recently (Devos and Valencia 2000; Wilson et al. 2000) and, at the same time, is relatively insensitive to random annotation errors that are not correlated with the motif presence. Such an approach is in contrast to the recently introduced Fuzzy Functional Forms of Fetrow and Skolnick (1998) and the SiteMatch method of Zhang et al. (1999), both of which are based on recognition of conserved spatial or sequence motifs to identify a protein's function after initial fold assignment. It can be expected that these methods, although efficient at intermediate and high homology levels, might suffer from the alignment errors often encountered in low homology alignments (Domingues et al. 2000b).
In short, the MBA method combines, in a completely automatic way, information provided by occurrences of sequence motifs with functional annotation. The only other method that uses functional information is SAWTED (MacCallum et al. 2000), which relies exclusively on the annotation of the probe sequence. Probe annotations are, obviously, more direct and accurate sources of functional information about the probe sequence than is the annotation of the target domain. However, it is difficult to ensure that the knowledge about the probe's structure does not influence the annotation of the target domains. These factors, together with differences in the benchmarking methodology, make a direct, quantitative comparison of the methods difficult.
Currently, the MBA method is limited by the small number (~1300) of motifs defined in the Prosite database. In addition, at a 95% level of accuracy (i.e., CMK = 0.25, CFM = 4.0), only ~25% of those can contribute to Stot, being correlated at a high enough level (i.e., SFM ≥ CFM) with at least one domain in the fold library. This limitation could be overcome by using large, automatically generated motif libraries, such as those created by EMOTIF (Nevill-Manning et al. 1998) or TEIRESIAS (Rigoutsos et al. 1999), because it should be expected that the larger the library, the more of the structural and functional features of a fold will be captured. However, the specificity of the motif libraries, in general, decreases with their size, and thus identification of false positives becomes a problem. We hope that a combination of the filtering criteria used in this work will maintain the high accuracy of the method as its coverage is increased.
| Materials and methods |
|---|
Fold library
The set of 3076 domains representing 522 distinct folds (as defined by CATH classes with a unique Class:Architecture:Topology identifier; 246 folds were represented by more than one structure) was created as a subset of all CATH domains present in both PDB Select and NR BLAST databases. Representatives of the discontinuous folds and all transmembrane domains were discarded.
Sequence/structure compatibility score
The sequence–secondary structure profile method (SDP) was used for the initial sequence/structure prediction as described earlier (Fischer and Eisenberg 1996). Briefly, a Gonnet substitution matrix (Gonnet et al. 1992) was used as the sequence-dependent component of the scoring function, whereas the secondary structure-dependent component was based on a secondary structure substitution matrix calculated as described by Rice and Eisenberg (1997). The GLOLOC modification of the Smith-Waterman algorithm (Fischer and Eisenberg 1996) was used to generate sequence-profile alignments using 4.5 and 0.5 for gap opening and extension penalties, respectively.
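For orientation, the sketch below shows a plain Smith-Waterman local alignment score with affine gaps whose per-position score sums a sequence term and a secondary-structure term. It is not the GLOLOC modification, and the toy match/mismatch values merely stand in for the Gonnet and secondary-structure substitution matrices; the function name and the gap convention (a gap of length k costs `gap_open + (k - 1) * gap_extend`) are assumptions.

```python
def local_alignment_score(seq_a, ss_a, seq_b, ss_b, gap_open=4.5, gap_extend=0.5):
    """Best local alignment score (affine gaps) for two annotated sequences.

    seq_* are residue strings; ss_* are equal-length secondary-structure
    strings (e.g. H/E/C). The scoring values are toy placeholders.
    """
    def pair_score(i, j):
        seq_term = 2.0 if seq_a[i] == seq_b[j] else -1.0   # stand-in for Gonnet
        ss_term = 1.0 if ss_a[i] == ss_b[j] else -1.0      # stand-in for SS matrix
        return seq_term + ss_term

    neg_inf = float("-inf")
    n, m = len(seq_a), len(seq_b)
    # best[i][j]: best local score ending at (i, j)
    # gap_b[i][j]: best score ending with a gap that consumes seq_b[j-1]
    # gap_a[i][j]: best score ending with a gap that consumes seq_a[i-1]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    gap_b = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    gap_a = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    top = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gap_b[i][j] = max(best[i][j - 1] - gap_open, gap_b[i][j - 1] - gap_extend)
            gap_a[i][j] = max(best[i - 1][j] - gap_open, gap_a[i - 1][j] - gap_extend)
            diagonal = best[i - 1][j - 1] + pair_score(i - 1, j - 1)
            best[i][j] = max(0.0, diagonal, gap_b[i][j], gap_a[i][j])
            top = max(top, best[i][j])
    return top
```

With the toy values, aligning `ACDEFG` against `ACDFG` (all helix) is best served by opening a single gap opposite `E`, which costs 4.5 but recovers the two matches that follow.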
PSI-BLAST score
The NCBI implementation of the PSI-BLAST algorithm was used to assign folds following the methodology of Muller et al. (1999). Briefly, the NR BLAST database was combined with all the contiguous domains from the CATH database. After removal of the low-complexity regions (Wootton and Federhen 1996), up to 20 iterations of PSI-BLAST were performed to obtain a list of domain hits ranked by e-value. Only hits with an e-value < 1 × 10⁻³ were accepted. Drift of the PSI-BLAST searches was avoided by adjusting the value of the h parameter as described by Muller and coworkers (1999).
Motif-based score
The motif-based score was calculated as
[equation not recovered] (5)
[equation not recovered] (6)
[equation not recovered] (7)
In practice, because of the small size of the motif library, the sum (5) is typically reduced to one component.
Performance benchmark
To evaluate the performance of the fold assignment methods, we scored all of the sequences containing the domains in the fold library (probes) against all of the library domains (targets). The ranked prediction list was screened to remove self-hits and domains considered too similar to the structural domains identified in the probe sequence. As a similarity criterion, the relative positions within the CATH hierarchy of the target domain and the domains within the probe sequence were used. Namely, the CATH numerical identifier of the target domain had to differ from that of the probe domain at one or more of the top five levels of the CATH hierarchy (i.e., Class, Architecture, Topology, Homology, and Superfamily) to be taken into consideration. It was also required that, while constructing the SFM table, only domains fulfilling the above criterion were used.
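The remote-homolog criterion above amounts to comparing prefixes of two hierarchical identifiers. A minimal sketch, assuming identifiers are written as dot-separated integers with one number per hierarchy level (the example identifiers and function names are hypothetical):

```python
def cath_levels(identifier):
    """Split a dotted identifier such as '3.20.20.70.1.1.1' into integers."""
    return tuple(int(part) for part in identifier.split("."))


def is_remote(target_id, probe_id, n_levels=5):
    """True if the identifiers differ somewhere within the top n_levels."""
    return cath_levels(target_id)[:n_levels] != cath_levels(probe_id)[:n_levels]


def same_fold(id_a, id_b):
    """True if Class, Architecture and Topology (top three levels) agree."""
    return cath_levels(id_a)[:3] == cath_levels(id_b)[:3]


# Two toy domains sharing the top four levels but differing at the fifth:
probe = "3.20.20.70.1.1.1"
target = "3.20.20.70.2.1.1"
keep_target = is_remote(target, probe)   # passes the prescreen
```

A target kept by `is_remote` can still share the probe's fold in the `same_fold` sense, which is what makes it a valid but nontrivial prediction.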
The prescreened list of the hits was used to generate the best prediction by selecting a set of the top-ranked targets covering the entire length of the probe but overlapping <25% of their length. A prediction was considered to be correct (true positive) when the target domain had at least a two-thirds sequence overlap with the probe domain and identical Class, Architecture, and Topology identifiers. Any difference in the identifiers at such a level of overlap was considered a misprediction (false positive), whereas predictions overlapping over less than two-thirds of the length were considered neutral and not taken into account.
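The three-way classification of predictions described above can be sketched as follows. The sketch assumes that overlap is measured as the fraction of the probe domain's residues covered by the predicted target, with inclusive residue ranges, and that fold identity is passed in as (Class, Architecture, Topology) tuples; these conventions and all names are illustrative assumptions.

```python
def overlap_meets_two_thirds(pred_range, probe_range):
    """True if the prediction covers at least 2/3 of the probe domain.

    Ranges are inclusive (start, end) residue positions. Integer arithmetic
    avoids floating-point trouble exactly at the two-thirds boundary.
    """
    pred_start, pred_end = pred_range
    probe_start, probe_end = probe_range
    shared = max(0, min(pred_end, probe_end) - max(pred_start, probe_start) + 1)
    probe_length = probe_end - probe_start + 1
    return 3 * shared >= 2 * probe_length


def classify_prediction(pred_range, probe_range, target_cat, probe_cat):
    """Label a prediction as true_positive, false_positive, or neutral."""
    if not overlap_meets_two_thirds(pred_range, probe_range):
        return "neutral"            # too little overlap: not counted at all
    if target_cat == probe_cat:
        return "true_positive"      # same Class, Architecture, Topology
    return "false_positive"
```

Neutral predictions are excluded from both the numerator and the denominator of the accuracy, so they neither reward nor penalize a method.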
Performance of prediction methods is presented in the form of accuracy versus coverage curves, in which accuracy is the ratio of the number of true positives to the total number of predictions and coverage is reported relative to the number of domains (2381) in the probe library with at least one remote homolog (see earlier) present in the target library.
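An accuracy versus coverage curve of this kind can be traced by sweeping a score cutoff down a ranked list of predictions. The sketch below assumes that higher scores are better, that neutral predictions have already been removed, and that coverage counts true positives relative to the number of assignable probe domains; the last point is one reading of the definition above, not a confirmed detail.

```python
def accuracy_coverage_curve(predictions, n_assignable):
    """Return a list of (coverage, accuracy) points, one per score cutoff.

    predictions: iterable of (score, is_correct) pairs, neutral ones excluded.
    n_assignable: number of probe domains with at least one remote homolog
                  in the target library (2381 in the benchmark above).
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    curve = []
    true_positives = 0
    for n_predictions, (_, is_correct) in enumerate(ranked, start=1):
        true_positives += 1 if is_correct else 0
        accuracy = true_positives / n_predictions
        coverage = true_positives / n_assignable
        curve.append((coverage, accuracy))
    return curve


# Toy example: four scored predictions for a set of 10 assignable domains.
toy_curve = accuracy_coverage_curve(
    [(0.9, True), (0.8, True), (0.5, False), (0.4, True)], n_assignable=10
)
```

Reading the curve at a fixed accuracy (for example 95%) gives the coverage figures used to compare methods.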
Initial tests at different levels of similarity within CATH showed that enforcing a difference at the top four levels renders the benchmark too difficult for the current prediction methods (2% PSI-BLAST coverage), whereas relaxing the stringency by enforcing a difference at as many as the top six levels resulted in >65% of probe domains being assignable with PSI-BLAST.
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Andrade, M., Casari, G., de Daruvar, A., Sander, C., Schneider, R., Tamames, J., Valencia, A., and Ouzounis, C. 1997. Sequence analysis of the Methanococcus jannaschii genome and the prediction of protein function. Comput. Appl. Biosci. 13: 481–483.
Andrade, M.A., Brown, N.P., Leroy, C., Hoersch, S., de Daruvar, A., Reich, C., Franchini, A., Tamames, J., Valencia, A., Ouzounis, C., and Sander, C. 1999. Automated genome sequence analysis and annotation. Bioinformatics 15: 391–412.
Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 45–48.
Baxevanis, A.D. 2000. The molecular biology database collection: An online compilation of relevant database resources. Nucleic Acids Res. 28: 1–7.
Bienkowska, J.R., Yu, L., Zarakhovich, S., Rogers Jr., R.G., and Smith, T.F. 2000. Protein fold recognition by total alignment probability. Proteins 40: 451–462.
Bork, P. and Koonin, E.V. 1996. Protein sequence motifs. Curr. Opin. Struct. Biol. 6: 366–376.
Bowie, J.U., Luthy, R., and Eisenberg, D. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253: 164–170.
Devos, D. and Valencia, A. 2000. Practical limits of function prediction. Proteins 41: 98–107.
Domingues, F.S., Koppensteiner, W.A., and Sippl, M.J. 2000a. The role of protein structure in genomics. FEBS Lett. 476: 98–102.
Domingues, F.S., Lackner, P., Andreeva, A., and Sippl, M.J. 2000b. Structure-based evaluation of sequence comparison and fold recognition alignment accuracy. J. Mol. Biol. 297: 1003–1013.
Fetrow, J.S. and Skolnick, J. 1998. Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. J. Mol. Biol. 281: 949–968.
Fischer, D. and Eisenberg, D. 1996. Protein fold recognition using sequence-derived predictions. Protein Sci. 5: 947–955.
Gonnet, G.H., Cohen, M.A., and Benner, S.A. 1992. Exhaustive matching of the entire protein sequence database. Science 256: 1443–1445.
Gribskov, M., McLachlan, A.D., and Eisenberg, D. 1987. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. 84: 4355–4358.
Hegyi, H. and Gerstein, M. 1999. The relationship between protein structure and function: A comprehensive survey with application to the yeast genome. J. Mol. Biol. 288: 147–164.
Hobohm, U. and Sander, C. 1994. Enlarged representative set of protein structures. Protein Sci. 3: 522–524.
Hofmann, K., Bucher, P., Falquet, L., and Bairoch, A. 1999. The PROSITE database, its status in 1999. Nucleic Acids Res. 27: 215–219.
Jones, D.T. 1999. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287: 797–815.
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. A new approach to protein fold recognition. Nature 358: 86–89.
Jones, D.T., Tress, M., Bryson, K., and Hadley, C. 1999. Successful recognition of protein folds using threading methods biased by sequence similarity and predicted secondary structure. Proteins Suppl. 3: 104–111.
Karplus, K., Barrett, C., and Hughey, R. 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14: 846–856.
Kasuya, A. and Thornton, J.M. 1999. Three-dimensional structure analysis of PROSITE patterns. J. Mol. Biol. 286: 1673–1691.
MacCallum, R.M., Kelley, L.A., and Sternberg, M.J. 2000. SAWTED: Structure assignment with text description-enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics 16: 125–129.
Marcotte, E.M. 2000. Computational genetics: Finding protein function by nonhomology methods. Curr. Opin. Struct. Biol. 10: 359–365.
Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O., and Eisenberg, D. 1999. Detecting protein function and protein–protein interactions from genome sequences. Science 285: 751–753.
Muller, A., MacCallum, R.M., and Sternberg, M.J.E. 1999. Benchmarking PSI-BLAST in genome annotation. J. Mol. Biol. 293: 1257–1271.
Murzin, A.G. and Bateman, A. 1997. Distant homology recognition using structural classification of proteins. Proteins Suppl. 1: 105–112.
Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443–453.
Nevill-Manning, C.G., Wu, T.D., and Brutlag, D.L. 1998. Highly specific protein sequence motifs for genome analysis. Proc. Natl. Acad. Sci. 95: 5865–5871.
Orengo, C.A., Jones, D.T., and Thornton, J.M. 1994. Protein superfamilies and domain superfolds. Nature 372: 631–634.
Orengo, C.A., Pearl, F.M.G., Bray, J.E., Todd, A.E., Martin, A.C., Lo Conte, L., and Thornton, J.M. 1999. The CATH database provides insights into protein structure/function relationships. Nucleic Acids Res. 27: 275–279.
Panchenko, A.R., Marchler-Bauer, A., and Bryant, S.H. 2000. Combination of threading potentials and sequence profiles improves fold recognition. J. Mol. Biol. 296: 1319–1331.
Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., and Yeates, T.O. 1999. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. 96: 4285–4288.
Rice, D.W. and Eisenberg, D. 1997. A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J. Mol. Biol. 267: 1026–1038.
Rigoutsos, I., Floratos, A., Ouzounis, C., Gao, Y., and Parida, L. 1999. Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins. Proteins 37: 264–277.
Russell, R.B., Copley, R.R., and Barton, G.J. 1996. Protein fold recognition by mapping predicted secondary structures. J. Mol. Biol. 259: 349–365.
Skolnick, J. and Fetrow, J.S. 2000. From genes to protein structure and function: Novel applications of computational approaches in the genomic era. Trends Biotechnol. 18: 34–39.
Skolnick, J., Fetrow, J.S., and Kolinski, A. 2000. Structural genomics and its importance for gene function analysis. Nat. Biotechnol. 18: 283–287.
Smith, T.F. 1999. The art of matchmaking: Sequence alignment methods and their structural implications. Structure with Folding & Design 7: R7–R12.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195–197.
Wilson, C.A., Kreychman, J., and Gerstein, M. 2000. Assessing annotation transfer for genomics: Quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J. Mol. Biol. 297: 233–249.
Wootton, J.C. and Federhen, S. 1996. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 266: 554–571.
Zhang, B., Rychlewski, L., Pawlowski, K., Fetrow, J.S., Skolnick, J., and Godzik, A. 1999. From fold predictions to function predictions: Automation of functional site conservation analysis for functional genome predictions. Protein Sci. 8: 1104–1115.