1 School of Computer Science and Engineering
2 Department of Biological Chemistry, The Hebrew University of Jerusalem, Jerusalem 91904, Israel
(RECEIVED February 23, 2006; FINAL REVISION February 23, 2006; ACCEPTED February 23, 2006)
| Abstract |
|---|
Keywords: protein family; hierarchical classification; InterPro; clustering
| Introduction |
|---|
200 eukaryotes that are in their final stage of assembly or in progress (http://www.ncbi.nlm.nih.gov/Genomes). These new genomes greatly outnumber the 18 complete eukaryotic genomes currently available. Therefore, automation of the painstaking task of functional annotation becomes critically important. In addition to ongoing "whole genome" projects, other types of experimental data are becoming available from numerous high-throughput methodologies. In recent years, standardization in the technologies of SNP arrays, DNA microarrays, and DNA chips has increased the quality and reproducibility of the results. Overall, the volume of data that is collectively referred to as "nonsequence data" is rapidly growing. However, the quality of the data varies: some sources are very reliable, while others are inherently noisy. For example, structural genomics projects produce detailed and accurate three-dimensional information from crystallography and NMR spectroscopy, although the function of many of these structures is still unknown (Skolnick et al. 2000). In contrast, data on protein-protein interactions originating from two-hybrid systems suffer from large numbers of false positives and low reproducibility. With the addition of proteomics data from LC-MS/MS experiments, protein chips, and subcellular localization data, the data that emerge are protein centered rather than genome centered (Bork et al. 2004).
The notion of protein function is elusive. To apply computational methods, we need an unambiguous definition. We propose equating function with annotations. Annotations are simply categorical biological properties describing the protein's functionality. They can describe various biological aspects of the protein, such as its structure, enzymatic classification, taxonomy, cellular localization, and more. Local alignment search tools such as BLAST (Altschul et al. 1997) provide the most straightforward method for automatic function prediction on a new sequence (Jones and Swindells 2002), via function inference. With this method, a protein database is searched for high-scoring local alignments with the query protein. The annotations of the database sequence with the highest-scoring alignment are assigned to the query sequence, provided the alignment score passes a predetermined threshold. The underlying logic is simple: Proteins with similar sequences are conjectured to have evolved from a single ancestor gene and thus to have retained similar functionality. While this approach is simplistic, it performs fairly well in many cases. However, local alignment searches suffer from some important caveats:
We hereby present a scheme for inference of functional annotations of protein sequences. The scheme consists of two parts: (1) ProtoNet, an automatic hierarchical organization of protein sequence databases representing functional and evolutionary relations amongst the proteins, and (2) an automatic method for predicting the function of a new protein based on its localization in the protein tree (Sasson et al. 2003; Kaplan and Linial 2005). We start by describing the ProtoNet classification hierarchy, proceed by discussing its biological validity, and conclude by explaining the annotation inference method and showing how it avoids the common annotation assignment pitfalls mentioned above.
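For reference, the baseline best-hit inference described earlier in this section (transfer the annotations of the top-scoring database sequence if its score passes a threshold) can be sketched in a few lines. This is only an illustrative sketch: the function name `best_hit_annotations`, the `(subject_id, score)` hit format, and the threshold value are assumptions made here, not part of BLAST or ProtoNet.

```python
from typing import Dict, Iterable, List, Set, Tuple


def best_hit_annotations(
    hits: List[Tuple[str, float]],
    db_annotations: Dict[str, Iterable[str]],
    min_score: float,
) -> Set[str]:
    """Transfer annotations from the single highest-scoring hit.

    hits           -- hypothetical (subject_id, alignment_score) pairs
    db_annotations -- annotations known for each database sequence
    min_score      -- assumed predetermined score threshold
    """
    if not hits:
        return set()
    subject, score = max(hits, key=lambda hit: hit[1])
    if score < min_score:
        # No hit is strong enough, so nothing is inferred.
        return set()
    return set(db_annotations.get(subject, ()))


# Toy data, invented for illustration only.
toy_hits = [("Q1", 42.0), ("Q2", 310.5), ("Q3", 120.0)]
toy_db = {"Q2": ["kinase", "ATP-binding"], "Q3": ["membrane"]}
inferred = best_hit_annotations(toy_hits, toy_db, min_score=100.0)
```

Because the decision rests on one hit and one fixed threshold, a single misannotated database entry propagates directly to the query, which is the kind of pitfall the cluster-based method aims to avoid.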
| Results |
|---|
In contrast to a nonhierarchical functional grouping, this hierarchical representation of proteins provides a much more accurate view on protein functional relations, because functionality encompasses several degrees of granularity, from very general effects at the organism level to very specific descriptions of biochemical function. To achieve this organization, we use the following three phases:
In light of the explosive growth of sequence databases, scalability is an important issue. Although the presented method scales well in terms of result quality (tested on databases of 90,000 to 200,000 proteins), the computation itself is more challenging. For large protein databases such as UniProt (containing >1,600,000 sequences), performing the hierarchical clustering requires a very large amount of memory. To avoid this problem, we divide the clustering problem into several clustering steps, each of which considers a subset of the similarity graph. Preliminary results indicate that the biological validity of the hierarchy produced by this method is not reduced significantly (Sasson 2005). ProtoNet is available at http://www.protonet.cs.huji.ac.il.
Biological validity of ProtoNet
The validity of clusters can be determined in comparison to other classifications, e.g., InterPro (Mulder et al. 2002). At present, the InterPro classifier uses a combination of 12 supervised detection methods based on state-of-the-art methods such as hidden Markov models (HMMs), position-specific scoring matrices (PSSMs), and profiles (Mulder et al. 2005). To determine if ProtoNet is able to detect the weak functional relationships that are detected by InterPro, we perform the following test: For each InterPro annotation (each InterPro entry can be thought of as an annotation), we consider the set of all proteins that were assigned that annotation (S). Next, we define the following score between a cluster C and the set S (this score is also known as the Jaccard coefficient):
$$\mathrm{Score}(C, S) = \frac{|C \cap S|}{|C \cup S|}$$

Note that a score of one means $C = S$, and a score of zero means $C \cap S = \emptyset$. Finally, we find the highest scoring cluster for each InterPro annotation. Figure 1 shows an area plot describing the distribution of the scores for the highest scoring cluster of each InterPro annotation. Remarkably, we find that ProtoNet is able to produce clusters that are extremely consistent with the InterPro classification (mean score 0.85), even though ProtoNet uses only BLAST e-values and is completely unsupervised. Furthermore, ProtoNet shows high consistency with manual and semiautomatic classifications as well. For more results, see Kaplan et al. (2004) and Shachar and Linial (2004).
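The test described above (score every cluster against the protein set of each annotation with the Jaccard coefficient, then keep the best-scoring cluster) can be illustrated with a short sketch. The function names, the dictionary-based data layout, and the toy identifiers are assumptions chosen for illustration; they are not ProtoNet's actual data structures.

```python
from typing import Dict, Set, Tuple


def jaccard(c: Set[str], s: Set[str]) -> float:
    """Jaccard coefficient |C ∩ S| / |C ∪ S|; defined as 0.0 if both are empty."""
    union = c | s
    if not union:
        return 0.0
    return len(c & s) / len(union)


def best_cluster_per_annotation(
    clusters: Dict[str, Set[str]],
    annotation_sets: Dict[str, Set[str]],
) -> Dict[str, Tuple[str, float]]:
    """For each annotation, return (cluster_id, score) of its best-matching cluster.

    Assumes `clusters` is non-empty.
    """
    best = {}
    for annotation, members in annotation_sets.items():
        scored = ((jaccard(cluster, members), cid) for cid, cluster in clusters.items())
        score, cid = max(scored, key=lambda pair: pair[0])
        best[annotation] = (cid, score)
    return best


# Toy hierarchy: c2 is nested inside c1, as in a clustering tree.
clusters = {"c1": {"P1", "P2", "P3"}, "c2": {"P1", "P2"}, "c3": {"P4"}}
annotation_sets = {"IPR_A": {"P1", "P2"}, "IPR_B": {"P3", "P4"}}
best = best_cluster_per_annotation(clusters, annotation_sets)
```

In this toy example `IPR_A` is matched perfectly by the nested cluster `c2`, whereas `IPR_B` is split across two branches and scores only 0.5 against its best cluster. Averaging these best scores over all annotations gives a summary figure comparable in spirit to the mean score reported above.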
GAS1 (growth arrest sequence 1) is a tumor suppressor that prevents DNA synthesis by blocking the entry of cells into the S phase (Mullor and Ruiz i Altaba 2002). During embryogenesis, GAS1 is differentially expressed and its expression has been associated with cell death during limb development, while in the cerebellum GAS1 was shown to act as a positive growth regulator (Marques and Fan 2002). The molecular function of GAS1 in vivo remains elusive. A routine BLAST search using the human or mouse GAS1 protein sequence as the query sequence fails to detect any significant hits to any protein groups other than GAS1 proteins from related species. InterPro also fails to detect a connection to any protein families. However, the ProtoNet 4.0 classification tree suggests a relationship between GAS1 and a large family of GFRα, the GPI (glycosyl-phosphatidyl-inositol) coreceptor for glial cell line-derived neurotrophic factor (GDNF) and its related factors. Examining the ProtoNet hierarchy, we find a cluster that combines GAS1 from vertebrates, worms, and insects with GDNF receptors from avians, rodents, and primates (Fig. 2). Interestingly, proteins belonging to the GFRα family consistently emerge by a BLAST search, yet with a score that is below any statistical significance (Fig. 2; www.protonet.cs.huji.ac.il). Submission of GAS1 to the Meta-server for fold recognition (Ginalski and Rychlewski 2003) substantiated the connection between GAS1 and GFRα and identified Protein Data Bank 1q8dA (108 amino acids from rat GFRα1) as a parent model with very high confidence (e-value of $6 \times 10^{-24}$). Additional evidence for the functional connectivity between GAS1 and GFRα substantiated our study (Furman et al. 2006).
The aforementioned procedure was applied to >10,000 unannotated predicted proteins from the honey bee genome. A ProtoNet-like approach including 200,000 sequences was applied (www.protobee.cs.huji.ac.il), and for 75% of the honey bee proteins, some biological annotation was successfully assigned (N. Kaplan and M. Linial, unpubl.).
Looking back at the example of local similarity, if the proteins of a cluster are varied biologically but share a local region of similarity (and therefore some functional features), only the annotations that are shared by the proteins of the cluster will be assigned to the cluster. This can greatly reduce the chance of excessive transfer of annotations and of transfer of incorrect annotations (provided that the incorrect annotations are isolated instances and do not represent the majority of cases in the database). Together with the high sensitivity and specificity in comparison with other methods and the relative thresholds that the clustering method is able to take into account, this suggests that the method succeeds in avoiding many of the common pitfalls of local alignment searches.
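The idea that a cluster receives only the annotations shared by its member proteins can be sketched as follows. The sketch is an assumption-laden illustration rather than ProtoNet's implementation: the function name `cluster_annotations` and the `min_fraction` parameter (1.0 requires every member to carry the annotation, lower values allow a majority-style rule) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def cluster_annotations(
    members: Iterable[str],
    protein_annotations: Dict[str, Iterable[str]],
    min_fraction: float = 1.0,
) -> Set[str]:
    """Return annotations carried by at least `min_fraction` of the cluster members.

    `min_fraction` is an assumed tuning knob, not a documented ProtoNet parameter.
    """
    members = list(members)
    if not members:
        return set()
    counts = Counter()
    for protein in members:
        # set() ensures each protein counts at most once per annotation.
        counts.update(set(protein_annotations.get(protein, ())))
    needed = min_fraction * len(members)
    return {annotation for annotation, n in counts.items() if n >= needed}


# Toy cluster: proteins share an ATP-binding region but differ otherwise.
toy_annotations = {
    "P1": ["kinase", "ATP-binding"],
    "P2": ["kinase", "ATP-binding", "membrane"],
    "P3": ["ATP-binding"],
}
strict = cluster_annotations(["P1", "P2", "P3"], toy_annotations)
relaxed = cluster_annotations(["P1", "P2", "P3"], toy_annotations, min_fraction=0.6)
```

Under the strict rule only `ATP-binding` is assigned to the cluster, so a new protein placed in it would not inherit `kinase` or the isolated `membrane` annotation. Relaxing the fraction admits `kinase` while still filtering out the annotation carried by a single member.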
| Discussion |
|---|
An interesting advantage of ProtoNet over the naïve local similarity search approach is that any kind of annotation can be assigned to the new sequence. This means that any data available for the proteins in the underlying database can be used for annotation. By using UniProt as its underlying database, ProtoNet is able to assign InterPro, UniProt keywords, GO, ENZYME, and SCOP annotations. This not only offers a wider and constantly growing range of available annotations but also overcomes inconsistencies between different sources.
It is worth mentioning that much work has been done on automatic functional annotation. An approach that is related to the one presented in this article is prediction by phylogenomic methods, using the evolutionary context of a sequence for function prediction (Engelhardt et al. 2005). The use of the evolutionary context is analogous to the use of the classification hierarchy in this work.
One problem that remains partially unaddressed by ProtoNet is the problem of multiple domains. Since a protein often consists of several domains, it can be viewed as belonging to several protein families. In ProtoNet, proteins are the basic entities. As a result, every protein appears only once in the tree and can therefore belong to several families only if those families are nested within one another. This issue cannot be resolved in the current scheme. However, it is addressed in a related work called EVEREST (www.everest.cs.huji.ac.il), in which protein domains are the basic entities that are clustered.
While local similarity searches usually give a statistical evaluation of the results, it is often very difficult to deduce from this evaluation what biological similarity exists between the query protein and the matches found. This is especially true for borderline or even clearly insignificant statistical values. As ProtoNet uses a clustering method, it is unable to provide a good statistical evaluation of the results. However, since the statistical evaluation simply acts as a means of evaluating the validity of a prediction quantitatively, ProtoNet provides several alternative measures that are related to the structure of the tree and the localization of the protein in it. These measures can be used to assess the validity of the classification of any query protein.
| Footnotes |
|---|
Reprint requests to: Michal Linial, CCB, The Sudarsky Center for Computational Biology, Department of Biological Chemistry, Life Science Institute, The Hebrew University, Jerusalem 91904, Israel; e-mail: michall{at}cc.huji.ac.il; fax: 972-2-6585448.
Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.062185706.
Abbreviations: GAS1, growth arrest sequence 1; GDNF, glial-cell-line-derived neurotrophic factor; GFR, GDNF family receptor; GPI, glycosyl phosphatidylinositol; HMM, hidden Markov model; PSSM, position-specific scoring matrix.
| Acknowledgments |
|---|
| References |
|---|
Bairoch A. 2000. The ENZYME database in 2000. Nucleic Acids Res. 28: 304–305.
Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., et al. 2005. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33: D154–D159.
Bork P., Jensen L.J., von Mering C., Ramani A.K., Lee I., Marcotte E.M. 2004. Protein interaction networks from yeast to human. Curr. Opin. Struct. Biol. 14: 292–299.
Camon E., Barrell D., Lee V., Dimmer E., Apweiler R. 2004. The Gene Ontology Annotation (GOA) Database: An integrated resource of GO annotations to the UniProt Knowledgebase. In Silico Biol. 4: 5–6.
Edgar R.C. and Sjolander K. 2004. COACH: Profile-profile alignment of protein families using hidden Markov models. Bioinformatics 20: 1309–1318.
Engelhardt B.E., Jordan M.I., Muratore K.E., Brenner S.E. 2005. Protein function prediction by Bayesian phylogenomics. PLoS Comput. Biol. 1: 432–445.
Furman O., Glick E., Segovia J., Linial M. 2006. Is GAS1 a co-receptor of the GDNF family of ligands? Trends Pharmacol. 2: 72–79.
Ginalski K. and Rychlewski L. 2003. Detection of reliable and unexpected protein fold predictions using 3D-Jury. Nucleic Acids Res. 31: 3291–3292.
Godzik A. 2003. Fold recognition methods. Methods Biochem. Anal. 44: 525–546.
Han S., Lee B.C., Yu S.T., Jeong C.S., Lee S., Kim D. 2005. Fold recognition by combining profile-profile alignment and support vector machine. Bioinformatics 21: 2667–2673.
Hubbard T.J., Ailey B., Brenner S.E., Murzin A.G., Chothia C. 1999. SCOP: A structural classification of proteins database. Nucleic Acids Res. 27: 254–256.
Jones D.T. and Swindells M.B. 2002. Getting the most from PSI-BLAST. Trends Biochem. Sci. 27: 161–164.
Kaplan N. and Linial M. 2005. Automatic detection of false annotations via binary property clustering. BMC Bioinformatics 6: 46.
Kaplan N., Friedlich M., Fromer M., Linial M. 2004. A functional hierarchical organization of the protein sequence space. BMC Bioinformatics 5: 196.
Kaufman L. and Rousseeuw P. 1990. Finding groups in data: An introduction to cluster analysis. John Wiley and Sons, New York.
Krause A., Stoye J., Vingron M. 2005. Large scale hierarchical clustering of protein sequences. BMC Bioinformatics 6: 15.
Kriventseva E.V., Fleischmann W., Zdobnov E.M., Apweiler R. 2001. CluSTr: A database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res. 29: 33–36.
Linial M. 2003. How incorrect annotations evolve: The case of short ORFs. Trends Biotechnol. 21: 298–300.
Marques G. and Fan C.M. 2002. Growth arrest specific gene 1: A fuel for driving growth in the cerebellum. Cerebellum 1: 259–263.
McGinnis S. and Madden T.L. 2004. BLAST: At the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 32: W20–W25.
Mulder N.J., Apweiler R., Attwood T.K., Bairoch A., Bateman A., Binns D., Biswas M., Bradley P., Bork P., Bucher P., et al. 2002. InterPro: An integrated documentation resource for protein families, domains and functional sites. Brief. Bioinform. 3: 225–235.
Mulder N.J., Apweiler R., Attwood T.K., Bairoch A., Bateman A., Binns D., Bradley P., Bork P., Bucher P., Cerutti L., et al. 2005. InterPro, progress and status in 2005. Nucleic Acids Res. 33: D201–D205.
Mullor J.L. and Ruiz i Altaba A. 2002. Growth, hedgehog and the price of GAS. Bioessays 24: 22–26.
Sasson O. 2005. "The protein metric space: A study in clustering." Ph.D. thesis, School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel.
Sasson O., Vaaknin A., Fleischer H., Portugaly E., Bilu Y., Linial N., Linial M. 2003. ProtoNet: Hierarchical classification of the protein space. Nucleic Acids Res. 31: 348–352.
Shachar O. and Linial M. 2004. A robust method to detect structural and functional remote homologues. Proteins 57: 531–538.
Skolnick J., Fetrow J.S., Kolinski A. 2000. Structural genomics and its importance for gene function analysis. Nat. Biotechnol. 18: 283–287.
Sonnhammer E.L. and Koonin E.V. 2002. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet. 18: 619–620.
Yang Z.R. 2004. Biological applications of support vector machines. Brief. Bioinform. 5: 328–338.