1 Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University, Hirosaki 036-8561, Japan
2 Department of Developmental Biology and Neuroscience, Graduate School of Life Sciences and
3 Department of Molecular Immunology, Institute of Development, Aging and Cancer, Tohoku University, Sendai 980-8577, Japan
Reprint requests to: Toshio Shimizu, Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University, 3, Bunkyo-cho, Hirosaki 036-8561, Japan; e-mail: slsimi{at}si.hirosaki-u.ac.jp; fax: 81-172-39-3638.
(RECEIVED April 15, 2004; FINAL REVISION May 19, 2004; ACCEPTED May 19, 2004)
| Abstract |
|---|
Keywords: transmembrane protein; transmembrane topology similarity; functional classification and identification; proteome-wide analysis; prokaryotic genome
Abbreviations: ABC, ATP-binding cassette; n-tms, with n transmembrane segment(s); ORF, open reading frame; SP, signal peptide; TM, transmembrane; TMS, transmembrane segment
Supplemental material: see www.proteinscience.org
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.04814404.
| Introduction |
|---|
On the other hand, recent studies have revealed that TM protein functions are closely related to TM topology (the number of TM segments [TMSs], the positions of the TMSs, and the N-tail location), and that these functions can be classified and identified with high accuracy using TM topology information as the primary basis, even without directly using sequence similarity (Sugiyama et al. 2003; Inoue et al. 2004). Individual functional groups have their own specific TM topologies, that is, characteristic combination patterns of loop lengths. The similarity of TM topologies between two TM protein sequences can be evaluated rather easily by comparing the lengths of corresponding loop regions between the two sequences, as described in detail in the Materials and Methods section. A pair of TM protein sequences with a higher sequence identity usually shows a higher TM topology similarity. In some cases, however, the TM topology similarity remains high between two sequences belonging to the same functional group (at the superfamily level) even if the sequence similarity is below the twilight zone. For example, for the pair of TM protein sequences mouse GABA receptor α6 (GAA6_MOUSE) and human neuronal acetylcholine receptor α5 (ACH5_HUMAN), the sequence identity is only 15.8%, whereas the TM topology similarity is as high as 96.9%. Thus, the classification and identification of TM protein functions on a proteome scale are expected to improve considerably by making good use of TM topology information in addition to sequence similarity.
One example of the approaches for obtaining reliable and more accurate TM topology prediction data is the ConPred program (Ikeda et al. 2002; Arai et al. 2004; Xia et al. 2004), which is based on a consensus strategy that combines several proposed prediction methods. For example, in predicting the entire TM topology of prokaryotic TM protein sequences, it raises the accuracy by more than 10 percentage points, from 56.5% (by MEMSAT 1.8 [Jones et al. 1994] and HMMTOP 2.0 [Tusnády and Simon 1998, 2001]) to 68.1% (Arai et al. 2004).
In this study, we propose a new approach for classifying and identifying TM proteome functions by using a clustering method based on TM topology similarity. We focused on predicted TM proteins from 87 completed prokaryotic (72 bacterial and 15 archaean) genome sequences. In this approach, in the case when sequences of unknown function are segregated into a cluster together with sequences of known function, not only the functional classification but also the functional identification are achieved. Prior to carrying out the clustering, we first identified functions of the predicted TM protein sequences and classified them into three categories by using homology search and sequence comparison on the basis of sequence similarity, that is, "known," "putative" (similar to "known" sequences), and "unknown."
| Results and Discussion |
|---|
The fractions of "putative" sequences, the functions of which are inferable from the functionally known sequences in SWISS-PROT, range widely from the minimum, 11.3% for 2-tms, to the maximum, 34.9% for 9-tms TM proteins, with an overall average of 20.5%. The "known" and "putative" sequences added together amount to only 24.3%, that is, about one-quarter of the TM proteomes, indicating the majority (i.e., more than three-quarters) of TM proteomes are still classified as functionally unknown.
The results listed in Table 2 are illustrated in detail separately for each species in Figure 1. As expected, Escherichia coli has the highest percentage of known sequences, with over 40% of its TM proteome sequences classified as "known." The rate of "known" sequences for Shigella flexneri is as high as that for E. coli because of their close phylogenetic relationship: about three-quarters of the TM protein sequences of S. flexneri are almost identical to those of E. coli. The "known" fractions for the archaean genomes are extremely small, averaging just 1.1% over the 15 genomes.
The γ-proteobacteria among the proteobacteria (from E. coli to P. multocida in the list) stand out among the other species. This is again the contribution from the large number of "known" E. coli sequences in SWISS-PROT. Overall, the proteobacteria genomes far exceed the other three species categories in the fractions of "known" and "putative" sequences. The archaean TM proteomes have the smallest fractions of "known" plus "putative" sequences, averaging 8.4% over the 15 species. Interestingly, 65.1% of the "putative" sequences over all the archaean genomes are annotated after the proteobacterial "known" sequences, whereas only 23.3% of them are annotated directly after the archaean "known" sequences.
Threshold TM topology similarities and the minimum cluster size
We assumed the proteome-scale functional classification using the clustering approach was successful when more than 50% of all the sequences were included in the clusters of at least 10 sequences (the minimum cluster size). The threshold TM topology similarities as the criteria for clustering were determined based on this assumption. The conditions (the 50% coverage and the minimum cluster size of 10) adopted in our approach are not based on any scientific data, but rather are purely empirical ones. This assumption is, however, supported by the relationships between the threshold TM topology similarities and the minimum cluster size: with increasing minimum cluster size, the threshold TM topology similarities decrease rapidly at first and then reach saturated levels at a minimum cluster size of around 10 for most numbers of TMSs (see Supplemental Fig. 1). Hereafter, we refer to clusters whose size is larger than nine as "large clusters," and all others including orphan clusters as "small clusters."
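The coverage criterion described above can be sketched in Python as follows. This is only an illustration: the names `coverage` and `pick_threshold` and the toy cluster sizes are hypothetical, and preferring the strictest threshold that still satisfies the criterion is an assumption made for the example, not a procedure stated in the text.

```python
from typing import Dict, List, Optional

MIN_CLUSTER_SIZE = 10  # minimum size of a "large cluster" (from the text)
MIN_COVERAGE = 0.5     # more than 50% of sequences must fall in large clusters


def coverage(cluster_sizes: List[int], min_size: int = MIN_CLUSTER_SIZE) -> float:
    """Fraction of all sequences that belong to clusters of at least min_size."""
    total = sum(cluster_sizes)
    if total == 0:
        return 0.0
    in_large = sum(size for size in cluster_sizes if size >= min_size)
    return in_large / total


def pick_threshold(sizes_by_threshold: Dict[float, List[int]]) -> Optional[float]:
    """Return the highest threshold whose coverage exceeds MIN_COVERAGE.

    Assumption: among qualifying thresholds, the strictest one is preferred.
    Returns None when no threshold qualifies.
    """
    qualifying = [
        threshold
        for threshold, sizes in sizes_by_threshold.items()
        if coverage(sizes) > MIN_COVERAGE
    ]
    return max(qualifying) if qualifying else None


# Hypothetical cluster sizes obtained at three similarity thresholds (in %).
toy = {
    90.0: [12, 5, 4, 3, 3, 2, 1],  # 12 / 30 = 0.40, criterion not met
    85.0: [14, 11, 3, 1, 1],       # 25 / 30 = 0.83, criterion met
    80.0: [28, 1, 1],              # 28 / 30 = 0.93, criterion met
}
print(pick_threshold(toy))  # -> 85.0
```

In practice the cluster sizes for each candidate threshold would come from rerunning the clustering within one data set; the dictionary above only stands in for that step.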
Threshold TM topology similarities thus determined are, for example, 98%, 85%, and 82% for 1-tms, 6-tms, and 12-tms TM protein sequences, respectively, as shown in the third column of Table 3. As expected, stricter threshold similarity values are obtained for the smaller numbers of TMSs.
The percentages of newly classified and identified sequences using this approach are displayed in Figure 1 (gray bars) individually for the respective species. Averaged over the 31 proteobacteria, 22 gram-positive bacteria, 19 other bacteria, and 15 archaea, 32.3%, 40.3%, 38.9%, and 36.7% of all the sequences are newly classified and identified, respectively. These correspond to more than half of all the "unknown" sequences in the four individual species categories. Several γ-proteobacteria, that is, E. coli, Salmonella typhi, Salmonella typhimurium, Yersinia pestis, S. flexneri, Buchnera sp., B. aphidicola, Wigglesworthia brevipalpis, Haemophilus influenzae, and Pasteurella multocida, have smaller numbers of newly classified and identified TM proteins than the other species, although the total levels of functional annotation achieved were remarkably high, around 80%. It is also noted that the fraction of functionally annotated archaean sequences increased significantly, from 8.4% to 45.9%.
The following describes the details of the functional classification and identification attained by this approach, exemplifying 6-tms TM proteins.
Table 4 provides the list of the 27 large clusters generated by single-linkage clustering based on TM topology similarity (threshold similarity 85%) for 6-tms TM proteins, enumerated in order of cluster size. The largest cluster, Cluster 1, includes 1085 sequences, nearly one-fourth of all of the 6-tms TM protein sequences, with the "known" plus "putative" sequences (679 in total) annotated as "transport system permease protein" except for one sequence (annotated as photosystem II chlorophyll-binding protein). This implies that the 406 "unknown" sequences (37.4% of the 1085 sequences) included in the cluster could also be annotated as transport system permease proteins. By further clustering based on sequence similarity (threshold sequence identity 30%) within Cluster 1, we obtained 46 subclusters that correspond to functional subgroups, for example, "dipeptide transport system permease dppB" (in total 228 sequences including "unknown" sequences), "maltose transport system permease malD" (212 sequences), "lactose transport system permeases lacF" (181 sequences), and "sulfate transport system permease cysT" (118 sequences). This suggests that the TM topology-based clustering may correspond to a superfamily- or family-level classification, whereas the sequence similarity-based clustering corresponds to a family- or subfamily-level one in this case.
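The way a cluster's dominant annotation can be proposed for its "unknown" members may be sketched as follows, using the Cluster 1 counts quoted above. The function names and the `min_share` parameter are hypothetical; the text does not state a numeric rule for when a cluster annotation is considered consistent enough to transfer, so the 0.9 cutoff is purely an assumption for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple


def dominant_annotation(annotations: List[Optional[str]]) -> Tuple[Optional[str], float]:
    """Most common annotation among annotated members and its share among them.

    None marks an "unknown" sequence with no annotation.
    """
    annotated = [a for a in annotations if a is not None]
    if not annotated:
        return None, 0.0
    label, count = Counter(annotated).most_common(1)[0]
    return label, count / len(annotated)


def propose_annotations(
    annotations: List[Optional[str]], min_share: float = 0.9
) -> List[Optional[str]]:
    """Fill unannotated members with the dominant label if it is dominant enough.

    min_share is a hypothetical cutoff, not a value taken from the article.
    """
    label, share = dominant_annotation(annotations)
    if label is None or share < min_share:
        return list(annotations)
    return [a if a is not None else label for a in annotations]


# Cluster 1 as described in the text: 679 annotated sequences, all but one
# labeled as permeases, plus 406 "unknown" sequences.
permease = "transport system permease protein"
cluster1 = (
    [permease] * 678
    + ["photosystem II chlorophyll-binding protein"]
    + [None] * 406
)
proposed = propose_annotations(cluster1)
print(proposed.count(permease))  # -> 1084
```

A cluster made up only of "unknown" sequences is returned unchanged, since there is no annotation to transfer.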
In Table 4, we have four clusters composed of only "unknown" sequences: Clusters 9, 16, 23, and 27. Of these, Clusters 23 and 27 comprise sequences from only archaean and only proteobacterial species, respectively. These "unknown" protein sequences are likely to represent functional groups that are not only novel but also biologically important. We expect that further experimental studies will characterize these sequences and elucidate their functions in detail.
Cluster 3 (231 sequences, of which 109 are "known" or "putative" assigned as "ATP-binding cassette [ABC] transporters") clearly illustrates how well the TM topology-based clustering works in the functional classification and identification of TM proteins. Out of 119 6-tms sequences annotated as "ABC transporter," 109 sequences (91.6%) are captured properly in this cluster, and the remaining 10 sequences are spread across nine small clusters: one sequence in a cluster with the size of nine (including nine sequences in total, N-in topology), one in a size-four cluster (N-out), two in a size-two (N-in), one in a size-two (N-out, +SP), and five orphan sequences (all N-in). The other 122 sequences are all "unknown," and no sequences with other functions are included in this cluster at all.
TM topology models of the 231 sequences are illustrated in Figure 2, where TMS, cytoplasmic, and noncytoplasmic loop regions are represented with black, gray, and dark gray bars, respectively. The sequences in this cluster have the following characteristics with regard to TM topology: (1) no signal peptide (SP) is present; (2) the N-tail loop is located on the cytoplasmic side; (3) the cytoplasmic loops (including the N- and C-tails) are long, and in particular the C-tail loop is extremely long; and (4) the noncytoplasmic loops are short and connect adjacent TMSs to form three typical "helical-hairpin" domains in the TM topology architecture (Gafvelin and von Heijne 1994; Gafvelin et al. 1997). The sequence similarity-based clustering within Cluster 3 segregates these 231 sequences into 23 subclusters (including 13 orphan clusters), which are shown in Figure 2 separated by blank lines. In the largest subcluster (182 sequences), 105 sequences are "known" or "putative" as ABC transporters, and the remaining 77 are "unknown." The other 22 subclusters are composed of only "unknown" sequences, with the exception of a few subclusters that contain ABC transporter sequences.
| Materials and methods |
|---|
Prediction of TM protein sequences and their TM topologies from the proteomes
Out of the protein sequences translated from the ORFs, we segregated TM protein sequences and predicted their TM topologies according to the following procedure: (1) prediction of TM protein sequence candidates using SOSUI (98% accuracy; Hirokawa et al. 1998); (2) removal of predicted SP regions using DetecSig (88% accuracy; Lao and Shimizu 2001; Lao et al. 2002); and (3) prediction of TM topology by ConPred (68.1% accuracy; Arai et al. 2004). A more detailed description of this procedure is given in our previous article (Arai et al. 2003).
Functional identification of TM protein sequences based on sequence similarity
We first categorized the 114,965 full-length protein sequences in SWISS-PROT release 41 into "known," "putative," or "unknown" groups according to the level of functional annotation. For this categorization, we adopted the simple but rational criteria given in the GTOP database (http://spock.genes.nig.ac.jp/?genome/func.html; Kawabata et al. 2002). The "known" criterion requires at least one of the following: (1) more than five letters with functional information in the DE line, (2) at least one informative word in the KW line, or (3) both "-!- FUNCTION" and "-!- CATALYTIC ACTIVITY" in the CC line. The "putative" criterion is satisfied if the entry contains one of the following descriptions: (1) "HOMOLOG," "HOMOLOGY," "HYPOTHETICAL," "POTENTIAL," "POSSIBLE," "PROBABLE," or "PUTATIVE" in the DE line; (2) "BY SIMILARITY," "HYPOTHETICAL," "POTENTIAL," "POSSIBLE," "PROBABLE," or "PUTATIVE" in the "CC -!- FUNCTION" or "CC -!- CATALYTIC ACTIVITY" line; or (3) "HYPOTHETICAL PROTEIN" in the KW line. When only the "known" criterion is satisfied, the sequence is regarded as "known." When both the "known" and "putative" criteria are satisfied, the sequence is classified as "putative." Sequences that do not satisfy the "known" criterion are categorized as "unknown," even if the "putative" criterion is satisfied. Through this procedure, we obtained 70,228 "known" (10,796 TM protein sequences), 39,296 "putative" (6643), and 5441 "unknown" (754) sequences from SWISS-PROT release 41.
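The decision logic of this categorization can be sketched as follows. The sketch is deliberately partial: the text does not define what counts as "functional information" or an "informative word," so the "known" criterion is passed in as a boolean rather than computed. The function names are hypothetical, and plain substring matching on upper-cased line content is a simplifying assumption.

```python
from typing import List

# Marker words listed in the text for the "putative" criterion.
# "HOMOLOG" also matches "HOMOLOGY" under substring matching.
PUTATIVE_DE_WORDS = (
    "HOMOLOG", "HYPOTHETICAL", "POTENTIAL", "POSSIBLE", "PROBABLE", "PUTATIVE",
)
PUTATIVE_CC_WORDS = (
    "BY SIMILARITY", "HYPOTHETICAL", "POTENTIAL", "POSSIBLE", "PROBABLE", "PUTATIVE",
)
CC_TOPICS = ("-!- FUNCTION", "-!- CATALYTIC ACTIVITY")


def has_putative_marker(de: str, cc_lines: List[str], kw: str) -> bool:
    """True if any of the three "putative" conditions holds (simplified matching)."""
    de_upper = de.upper()
    if any(word in de_upper for word in PUTATIVE_DE_WORDS):
        return True
    for line in cc_lines:
        upper = line.upper()
        # Only the FUNCTION and CATALYTIC ACTIVITY comment topics are inspected.
        if any(topic in upper for topic in CC_TOPICS) and any(
            word in upper for word in PUTATIVE_CC_WORDS
        ):
            return True
    return "HYPOTHETICAL PROTEIN" in kw.upper()


def categorize(known_ok: bool, putative_ok: bool) -> str:
    """Combine the two criteria as described in the text."""
    if not known_ok:
        return "unknown"   # holds even when the "putative" criterion fits
    return "putative" if putative_ok else "known"


print(categorize(True, False))   # -> known
print(categorize(True, True))    # -> putative
print(categorize(False, True))   # -> unknown
```

Keeping `categorize` separate from the text matching makes the three-way rule easy to check on its own, independent of how the individual criteria are evaluated.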
Next, we classified the 51,044 predicted TM protein sequences from the 87 prokaryotic genomes into three categories in agreement with the functional description levels in SWISS-PROT using a BLAST homology search (Altschul et al. 1990, 1997) and an ALIGN (Myers and Miller 1988) sequence comparison, as illustrated in Figure 4. The BLAST search was carried out with the default settings (first gap penalty, -11; additional gap penalty, -1; substitution matrix, BLOSUM 62; Henikoff and Henikoff 1992) against the grouped full-length sequences from the SWISS-PROT database. If a query sequence matched one of the SWISS-PROT sequences of the "known" group with an E-value less than 10⁻⁵, it was treated as a candidate for the "known" or "putative" category; otherwise, it was classified as "unknown."
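The candidate sequences were then classified, according to their sequence identity with the matched SWISS-PROT sequence in the ALIGN comparison, as "known" (≥95%), "putative" (30%~95%), or "unknown" (<30%). When a query sequence is categorized as "known" or "putative," it is considered to be a functionally identified TM protein, and the function of the matched SWISS-PROT sequence is assigned to the query sequence.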
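The two-step decision for a query sequence can be sketched as below. The sketch assumes that the E-value cutoff is 10⁻⁵ and that the percentages are ALIGN sequence identities to the matched SWISS-PROT sequence; how the boundary values (exactly 95% or 30%) are treated is also an assumption. The function name and arguments are hypothetical, and the BLAST and ALIGN runs themselves are not modeled.

```python
from typing import Optional

E_VALUE_CUTOFF = 1e-5      # assumed reading of the E-value threshold
KNOWN_IDENTITY = 95.0      # percent identity at or above which a hit is "known"
PUTATIVE_IDENTITY = 30.0   # percent identity at or above which a hit is "putative"


def categorize_query(best_e_value: Optional[float], identity_percent: float) -> str:
    """Assign a category to a query sequence.

    best_e_value is the E-value of the best hit against the "known" group,
    or None when there is no hit at all.
    """
    # Step 1: without a significant hit the query stays "unknown".
    if best_e_value is None or best_e_value >= E_VALUE_CUTOFF:
        return "unknown"
    # Step 2: significant hits are graded by pairwise sequence identity.
    if identity_percent >= KNOWN_IDENTITY:
        return "known"
    if identity_percent >= PUTATIVE_IDENTITY:
        return "putative"
    return "unknown"


print(categorize_query(1e-40, 97.2))  # -> known
print(categorize_query(1e-12, 48.0))  # -> putative
print(categorize_query(1e-6, 22.5))   # -> unknown
print(categorize_query(0.01, 99.0))   # -> unknown
```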
Functional classification and identification of TM protein sequences based on TM topology similarity
The procedure for classifying and identifying TM protein functions based on TM topology similarity is illustrated in Figure 5. The 51,044 TM protein sequences annotated using a BLAST homology search and ALIGN sequence comparison were divided into 36 data sets according to the number of TMSs, the presence or absence of a signal peptide, and N-tail location. The sequences are clustered within individual data sets by a single-linkage method based on TM topology similarity. In this single-linkage clustering, the TM topology similarity is used as the determining factor defined as:
|
![]() | (1) |
where $n$ is the number of TMSs, $l_{1,i}$ and $l_{2,i}$ are the lengths of the $i$-th loop in sequences 1 and 2, respectively, and $\min(l_{1,i}, l_{2,i})$ and $\max(l_{1,i}, l_{2,i})$ are the lengths of the shorter and longer of the two loops $l_{1,i}$ and $l_{2,i}$, respectively.
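As a rough illustration of how such a loop-length comparison could be computed, the sketch below takes the similarity to be the mean, over the n + 1 loops of an n-tms protein (N-tail, internal loops, and C-tail), of the ratio of the shorter to the longer loop length, expressed as a percentage. This functional form is an assumption built only from the symbol definitions given here; the equation itself is not reproduced in this text, so the published definition may differ. The function name and the treatment of two zero-length loops as identical are likewise hypothetical.

```python
from typing import Sequence


def tm_topology_similarity(loops1: Sequence[int], loops2: Sequence[int]) -> float:
    """Percentage similarity of two loop-length profiles (assumed form).

    Each profile lists the n + 1 loop lengths of a protein with n TMSs,
    ordered from the N-tail to the C-tail.
    """
    if len(loops1) != len(loops2):
        raise ValueError("both sequences must have the same number of loops")
    if not loops1:
        raise ValueError("at least one loop is required")
    ratios = []
    for a, b in zip(loops1, loops2):
        longer = max(a, b)
        # Assumption: two empty loops count as a perfect match.
        ratios.append(1.0 if longer == 0 else min(a, b) / longer)
    return 100.0 * sum(ratios) / len(ratios)


# Two hypothetical 2-tms proteins, each with three loops.
print(round(tm_topology_similarity([10, 20, 30], [10, 10, 60]), 1))  # -> 66.7
```

Because only loop lengths enter the calculation, two sequences with little residue-level identity can still score highly, which is the situation described for the receptor pair in the Introduction.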
Within the individual TM topology-based clusters, the sequences are further clustered by a single-linkage method based on sequence similarity (threshold sequence identity 30%) using the ALIGN program with the default settings, except for the substitution matrix (BLOSUM 62 was used), to generate subclusters that must correspond to functional subgroups in the TM topology-based clusters, as illustrated in Figure 5.
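Single-linkage clustering with a similarity threshold can be sketched with a union-find structure: any two items whose similarity reaches the threshold are joined, and clusters are the resulting connected groups. The code below is a generic sketch rather than the implementation used in this work; the function name, the all-pairs loop, and the use of "greater than or equal to" at the threshold are assumptions.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def single_linkage_clusters(
    items: Sequence[T],
    similarity: Callable[[T, T], float],
    threshold: float,
) -> List[List[int]]:
    """Group item indices so that linked pairs (similarity >= threshold) share a cluster."""
    parent = list(range(len(items)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similarity(items[i], items[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups: Dict[int, List[int]] = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    # Largest clusters first, as in the cluster tables.
    return sorted(groups.values(), key=len, reverse=True)


# Toy example: items are single loop lengths, similarity is 100 * shorter / longer.
def ratio(a: int, b: int) -> float:
    return 100.0 * min(a, b) / max(a, b)


print(single_linkage_clusters([100, 90, 81, 40], ratio, 85.0))  # -> [[0, 1, 2], [3]]
```

The toy result shows the chaining behavior typical of single linkage: items 0 and 2 are only 81% similar, yet they end up in the same cluster because each is linked to item 1. The same function could be reused for the second stage by passing a sequence-identity function and a threshold of 30.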
Electronic supplementary material
Supplemental materials are (1) lists of the obtained large clusters based on TM topology similarity for 1~12-tms TM proteins (named "Supple_Table1.doc"), (2) Supplemental Figure legends ("Supple_Fig_legends.doc"), (3) Supplemental Figure 1 ("Supple_Fig1.xls"), (4) Supplemental Figure 2 ("Supple_Fig2.ppt"), (5) data sets of 87 prokaryotic TM proteome sequences functionally annotated by homology search plus sequence similarity comparison (e.g., "eco.db" for E. coli), and (6) ID lists of the sequences in the obtained clusters (both the large and small clusters) for individual 1~12-tms TM proteins (e.g., "tms06.db" for 6-tms TM proteins). These files are also available at ftp://bioinfo.si.hirosaki-u.ac.jp/TopClust.
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
Arai, M., Ikeda, M., and Shimizu, T. 2003. Comprehensive analysis of transmembrane topologies in prokaryotic genomes. Gene 304: 77–86.
Arai, M., Mitsuke, H., Ikeda, M., Xia, J.-X., Kikuchi, T., Satake, M., and Shimizu, T. 2004. ConPred II: A consensus prediction method for obtaining transmembrane topology models with high reliability. Nucleic Acids Res. 32: W390–W393.
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. 2004. The Pfam protein families database. Nucleic Acids Res. 32: D138–D141.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L. 2004. GenBank: Update. Nucleic Acids Res. 32: D23–D26.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., et al. 2003. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31: 365–370.
Boyd, D., Schierle, C., and Beckwith, J. 1998. How many membrane proteins are there? Protein Sci. 7: 201–205.
Drury, L.S. and Buxton, R.S. 1988. Identification and sequencing of the Escherichia coli cet gene which codes for an inner membrane protein, mutation of which causes tolerance to colicin E2. Mol. Microbiol. 2: 109–119.
Gafvelin, G. and von Heijne, G. 1994. Topological "frustration" in multispanning E. coli inner membrane proteins. Cell 77: 401–412.
Gafvelin, G., Sakaguchi, M., Andersson, H., and von Heijne, G. 1997. Topological rules for membrane protein assembly in eukaryotic cells. J. Biol. Chem. 272: 6119–6127.
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915–10919.
Hirokawa, T., Boon-Chieng, S., and Mitaku, S. 1998. SOSUI: Classification and secondary structure prediction system for membrane proteins. Bioinformatics 14: 378–379.
Ikeda, M., Arai, M., Lao, D.M., and Shimizu, T. 2002. Transmembrane topology prediction methods: A re-assessment and improvement by a consensus method using a data set of experimentally-characterized transmembrane topologies. In Silico Biol. 2: 19–33.
Inoue, Y., Ikeda, M., and Shimizu, T. 2004. Proteome-wide classification and identification of mammalian-type GPCRs by binary topology pattern. Comput. Biol. Chem. 28: 39–49.
Jones, D.T. 1998. Do transmembrane protein superfolds exist? FEBS Lett. 423: 281–285.
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1994. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry 33: 3038–3049.
Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., and Hattori, M. 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32: D277–D280.
Kawabata, T., Fukuchi, S., Homma, K., Ota, M., Araki, J., Ito, T., Ichiyoshi, N., and Nishikawa, K. 2002. GTOP: A database of protein structures predicted from genome sequences. Nucleic Acids Res. 30: 294–298.
Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L. 2001. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305: 567–580.
Lao, D.M. and Shimizu, T. 2001. A method for discriminating a signal peptide and a putative 1st transmembrane segment. In Proceedings of the 2001 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, METMBS'01 (ed. F. Valafar), pp. 119–125. CSREA Press, Las Vegas.
Lao, D.M., Arai, M., Ikeda, M., and Shimizu, T. 2002. The presence of signal peptide significantly affects transmembrane topology prediction. Bioinformatics 18: 1562–1566.
Liu, J. and Rost, B. 2001. Comparing function and structure between entire proteomes. Protein Sci. 10: 1970–1979.
Mitaku, S., Ono, M., Hirokawa, T., Boon-Chieng, S., and Sonoyama, M. 1999. Proportion of membrane proteins in proteomes of 15 single-cell organisms analyzed by the SOSUI prediction system. Biophys. Chem. 82: 165–171.
Myers, E.W. and Miller, W. 1988. Optimal alignments in linear space. Comput. Appl. Biosci. 4: 11–17.
Pasquier, C. and Hamodrakas, S.J. 1999. An hierarchical artificial neural network system for the classification of transmembrane proteins. Protein Eng. 12: 631–634.
Serres, M.H., Gopal, S., Nahum, L.A., Liang, P., Gaasterland, T., and Riley, M. 2001. A functional update of the Escherichia coli K-12 genome. Genome Biol. 2: RESEARCH0035.1–0035.7.
Stevens, T.J. and Arkin, I.T. 2000. Do more complex organisms have a greater proportion of membrane proteins in their genomes? Proteins 39: 417–420.
Sugiyama, Y., Polulyakh, N., and Shimizu, T. 2003. Identification of transmembrane protein functions by binary topology patterns. Protein Eng. 16: 479–488.
Tusnády, G.E. and Simon, I. 1998. Principles governing amino acid composition of integral membrane proteins: Application to topology prediction. J. Mol. Biol. 283: 489–506.
Tusnády, G.E. and Simon, I. 2001. The HMMTOP transmembrane topology prediction server. Bioinformatics 17: 849–850.
Wallin, E. and von Heijne, G. 1998. Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci. 7: 1029–1038.
Xia, J.-X., Ikeda, M., and Shimizu, T. 2004. ConPred_elite: A highly reliable approach to transmembrane topology prediction. Comput. Biol. Chem. 28: 51–60.