1 Stockholm Bioinformatics Centre, AlbaNova, SE-106 91 Stockholm, Sweden
2 Department of Medical Biochemistry and Biophysics, Karolinska Institutet, SE-171 77 Stockholm, Sweden
3 Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden
Reprint requests to: Gunnar von Heijne, Stockholm Bioinformatics Centre, AlbaNova, SE-106 91 Stockholm, Sweden; e-mail: gunnar{at}dbb.su.se; fax: 46-8-153679.
(RECEIVED August 1, 2002; FINAL REVISION September 13, 2002; ACCEPTED September 13, 2002)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0226702.
| Abstract |
|---|
70% of all membrane proteins in a typical bacterial genome and for 55% of all membrane proteins in a typical eukaryotic genome. The average fraction of sequence length covered by a partial consensus topology is 44% for the prokaryotic proteins and 17% for the eukaryotic proteins in our test set, and similar numbers are found when the algorithm is applied to whole genomes. Reliably predicted partial topologies may simplify experimental determinations of membrane protein topology.
Keywords: Membrane protein; topology; consensus prediction
Abbreviations: PCT, partial consensus topology; TMH, transmembrane helix
| Introduction |
|---|
helices (von Heijne 1999). Recent investigations of complete genomes estimate the fraction of genes encoding helix bundle membrane proteins as 20%–25% in most organisms (Jones 1998; Krogh et al. 2001). Several methods are currently available to predict the topology of helix bundle membrane proteins, and the best methods predict the correct global topology for approximately 65%–70% of all proteins (Ikeda et al. 2001; Möller et al. 2001a). There is thus considerable scope for further improvements in topology prediction.
We previously described how the reliability of a given topology prediction can be estimated by combining the results from five different prediction algorithms (Nilsson et al. 2000), and this approach has been used to reduce the experimental efforts required for topology mapping (Drew et al. 2002). Here, we present an extension of the consensus prediction approach to include cases where only a part of the global topology is covered by the consensus. The new partial consensus topology (PCT) prediction method provides highly reliable topology information for 70% of all membrane proteins encoded in a typical bacterial genome and 55% of all membrane proteins in a typical eukaryotic genome. Given a partial consensus topology prediction, experimental topology mapping efforts can be focused on the less reliably predicted parts of the global topology.
| Results |
|---|
Predictions on entire genomes
We carried out global and partial consensus predictions for all putative multispanning membrane proteins in the genomes of eight prokaryotic and five eukaryotic organisms (putative membrane proteins were identified by the TMHMM method as detailed in Materials and Methods). Table 2 shows the fraction of membrane proteins with different majority levels for the global topology prediction. The fraction of proteins for which all five methods agree on the global topology is about 20% in the prokaryotic and about 10% in the eukaryotic genomes, in agreement with the results for the smaller test sets.
| Discussion |
|---|
A PCT prediction can be valuable in the context of experimental topology mapping, where it can help to identify regions in the protein where the topology is very likely to be correctly predicted, making it possible to focus the experimental efforts on the remaining, less reliably predicted, parts. The average fraction of sequence length covered by a PCT in the prokaryotic proteins of our test set is 44% (Table 1), implying that the experimental efforts may be significantly reduced for a typical prokaryotic membrane protein. For eukaryotic proteins, the corresponding fraction is much lower (17%). Similar tendencies are found when the algorithm is applied to whole genomes (Table 3).
Over our test set, the PCT predictions have a reliability of approximately 90% for both prokaryotic and eukaryotic sequences (Table 1). This is roughly the same reliability that we find for the global topology predictions when all five methods agree (Fig. 1). It is interesting to note that while it is well established that topology prediction is more difficult for eukaryotic proteins than for prokaryotic proteins (von Heijne 1997; Ikeda et al. 2001), the reliabilities of the partial consensus predictions on eukaryotic and prokaryotic proteins seem to be roughly equal (although the coverage is much smaller for the eukaryotic proteins). Because of the small number of eukaryotic proteins in the test set, these reliability and coverage estimates should be regarded as preliminary.
Consensus techniques have previously proven successful in areas such as globular protein fold recognition (Lundström et al. 2001) and secondary structure prediction (Cuff et al. 1998). Algorithms for consensus prediction of membrane protein topology have also been described (Promponas et al. 1999; Ikeda et al. 2001; Möller et al. 2001b). However, our method focuses more specifically on identifying global or partial topologies of high reliability and thus increases the usefulness of topology predictions.
In summary, we have shown that partial consensus topologies can be predicted with high reliability for many membrane proteins for which no global consensus topology can be predicted. Such partial topology predictions may be used to guide experimental topology determination efforts.
| Materials and methods |
|---|
The resulting test set was split into a prokaryotic subset and a eukaryotic subset. Both test sets were then homology-reduced using an implementation of the Hobohm algorithm (Hobohm et al. 1992) with a pairwise global sequence similarity threshold of 30%. ClustalW (Thompson et al. 1994) was used for the pairwise, global sequence alignments. The numbers of sequences in the final prokaryotic and eukaryotic test sets were 73 and 23, respectively (see Supplementary Information).
Genome databases
The genomes analyzed were from Anabaena sp. PCC7120 (Kaneko et al. 2001; ftp://ftp.kazusa.or.jp/pub/cyano/Anabaena/chromo/), Borrelia burgdorferi B31 (Fraser et al. 1997; ftp://ftp.tigr.org/pub/data/b_burgdorferi/), Helicobacter pylori 26695 (Tomb et al. 1997; ftp://ftp.tigr.org/pub/data/h_pylori/), Mycobacterium tuberculosis CDC1551 (Cole et al. 1998; ftp://ftp.tigr.org/pub/data/m_tuberculosis/), Streptococcus pneumoniae (Tettelin et al. 2001; ftp://ftp.tigr.org/pub/data/s_pneumoniae/), Mus musculus (ftp://ftp.ensembl.org/pub/current_mouse/data/fasta/pep/), Drosophila melanogaster (Adams et al. 2000; ftp://ftp.ncbi.nih.gov/genbank/genomes/D_melanogaster/), Arabidopsis thaliana (Theologis et al. 2000; ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/), Escherichia coli (Blattner et al. 1997; http://bmb.med.miami.edu/EcoGene/EcoWeb/), Saccharomyces cerevisiae (ftp://genome-ftp.stanford.edu/pub/yeast/yeast_ORFs/), Caenorhabditis elegans (Stein et al. 2001; ftp://ftp.sanger.ac.uk/pub/wormbase/), Bacillus subtilis 168 (Kunst et al. 1997; ftp://ftp.pasteur.fr/pub/GenomeDB/SubtiList/), and Methanococcus jannaschii DSM2661 (Bult et al. 1996; ftp://ftp.tigr.org/pub/data/m_jannaschii/). For each genome, TMHMM2.0 (Sonnhammer et al. 1998) was used to identify putative membrane proteins with a minimum of two predicted TMHs. The resulting data sets were then analyzed by the PCT prediction procedure described below.
Prediction methods
Five topology prediction methods, TMHMM2.0 (Sonnhammer et al. 1998; Krogh et al. 2001), HMMTOP2.0 (Tusnady and Simon 1998, 2001), MEMSAT1.8 (Jones et al. 1994), PHD2.1 (Rost et al. 1996), and TOPPRED1.0 (von Heijne 1992; Claros and von Heijne 1994), were used in their single-sequence mode (i.e., no information from homologous proteins was included). All methods produce a prediction of both the number and location of the TMHs, and the in/out location of the N-terminus relative to the membrane. All user-adjustable parameters were kept at their default values, with the exception of TOPPRED predictions for eukaryotic proteins, where the organism parameter was set to eukaryote. The output from the different topology prediction programs was converted into a standard format for further analysis.
Partial consensus topology prediction algorithm
The partial consensus topology prediction method is based on our previous observation that the reliability of a predicted topology can be estimated from the number of prediction methods that agree on the global topology (i.e., that give the same number of predicted TMHs and the same predicted orientation for the N-terminus). Specifically, that study (Nilsson et al. 2000) indicated that very high reliability can be assigned to topologies where five different prediction methods give the same prediction.
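The notion of agreement on the global topology can be illustrated with a minimal Python sketch. It assumes that each method's output has already been reduced to a pair (number of TMHs, N-terminal location); the function name `majority_level` and the tuple representation are illustrative choices, not part of the published method.

```python
from collections import Counter

def majority_level(predictions):
    """Return (count, topology) for the most common global topology.

    predictions: one (num_tmh, n_term_location) tuple per method.
    Two methods agree only if both elements of the tuple are equal.
    """
    counts = Counter(predictions)
    topology, count = counts.most_common(1)[0]
    return count, topology

# Hypothetical output of five methods for one protein.
example = [(6, "in"), (6, "in"), (6, "in"), (7, "out"), (6, "in")]
print(majority_level(example))  # (4, (6, 'in'))
```

A count of 5 corresponds to the case where all five methods agree on the global topology.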
Here, we tested the assumption that this relationship holds also for cases where all five prediction methods agree on the topology of only a part of the protein. These cases are referred to as partial consensus topologies (PCTs).
The PCT algorithm is described in Figure 2A. In the first step, if all methods agree on the topology at a certain position in the sequence, a consensus topology prediction is assigned to this position (inside loop, outside loop, in-to-out helix, and out-to-in helix states are designated i, o, m, and w, respectively). If the methods do not all agree at a certain position, no consensus is assigned (designated "."). To aid in the construction of the final partial consensus prediction, we define two additional symbols that represent positions for which the predicted topology states are incompatible with each other. Thus, when loop states with opposite locations (i and o) are predicted at the same position, we define this as a loop clash (X). In the same manner, a TMH clash (#) is defined for positions where two TMHs with opposite directions (m and w) are predicted.
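The per-position consensus step can be sketched in Python as follows. This is one possible reading of the description, not the authors' implementation: it assumes each method's prediction is given as a string over i, o, m, and w with one character per residue, and it assumes (the text does not specify this) that a TMH clash takes precedence when both kinds of clash occur at the same position. The function names are hypothetical.

```python
def consensus_symbol(states):
    """Consensus symbol for one sequence position.

    states: one character per method, each in 'iomw'.
    """
    observed = set(states)
    if len(observed) == 1:
        return states[0]          # all methods agree
    if {"m", "w"} <= observed:
        return "#"                # TMH clash (assumed to take precedence)
    if {"i", "o"} <= observed:
        return "X"                # loop clash
    return "."                    # disagreement without a clash

def consensus_string(predictions):
    """Combine per-residue predictions from several methods."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    return "".join(consensus_symbol(column) for column in zip(*predictions))

# Hypothetical predictions from five methods for a 14-residue sequence.
predictions = [
    "iiiimmmmmmoooo",
    "iiiimmmmmmoooo",
    "iiimmmmmmmoooo",
    "iiiimmmmmmmooo",
    "iiiimmmmmmoooi",
]
print(consensus_string(predictions))  # iii.mmmmmm.ooX
```

In this toy case the helix boundaries differ by one residue between methods, which yields "." positions, whereas the final position carries both i and o predictions and is therefore marked as a loop clash.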
The final step is the construction of the partial consensus topology (Fig. 2A). Starting from the N-terminus of the protein, the N-terminal end of the first PCT is defined by the first TMH (m or w states) of at least n residues in the consensus topology (where n is an adjustable parameter; the default value used here is n = 5). The PCT is then extended towards the C-terminus until either a consensus TMH of less than n residues is encountered, or a loop clash or TMH clash occurs. In either case, the end of the PCT is defined by the most C-terminally located m or w state in the consensus. The process is then repeated until the C-terminal end of the protein is reached. A protein may thus contain more than one PCT.
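The construction step can be sketched as a single left-to-right scan over the consensus string. The Python below is a minimal interpretation under stated assumptions, not the published code: a consensus TMH is taken to be an uninterrupted run of m or w symbols, positions marked "." neither start nor end a PCT, and PCTs are reported as half-open index ranges. The function name is hypothetical.

```python
from itertools import groupby

def partial_consensus_topologies(consensus, n=5):
    """Return PCTs as (start, end) half-open index ranges.

    consensus: string over 'i', 'o', 'm', 'w', '.', 'X', '#'.
    n: minimum length of a consensus TMH (default 5, as in the text).
    """
    pcts = []
    start = None       # first residue of the PCT currently being extended
    last_end = None    # index just past the last accepted TMH residue
    pos = 0
    for symbol, run in groupby(consensus):
        run_start = pos
        pos += len(list(run))
        if symbol in "mw" and pos - run_start >= n:
            if start is None:
                start = run_start      # first qualifying TMH opens a PCT
            last_end = pos             # extend to the end of this TMH
        elif symbol in "mwX#":
            # A short consensus TMH or a clash terminates the open PCT.
            if start is not None:
                pcts.append((start, last_end))
                start = None
    if start is not None:
        pcts.append((start, last_end))
    return pcts

# Hypothetical consensus string containing a loop clash and a short TMH.
consensus = ("iii." + "m" * 6 + "." + "oooo." + "w" * 6 + ".iiXii."
             + "m" * 7 + ".ooo.ww.iii." + "m" * 6 + ".o")
print(partial_consensus_topologies(consensus))       # [(4, 22), (29, 36), (48, 54)]
print(partial_consensus_topologies(consensus, n=1))  # [(4, 22), (29, 54)]
```

With the default n = 5, the two-residue consensus helix splits the C-terminal part into two PCTs; with n = 1 it is accepted and the two parts merge into one.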
The significance of the n-value is illustrated in Figure 2C, where the resulting PCT prediction differs depending on whether the consensus TMH is longer or shorter than the value of n.
To be included in a PCT, a consensus TMH must be at least n residues long. The larger the n-value, the smaller the risk that an incorrectly predicted consensus TMH is included in the PCT. However, a high n-value also decreases the average length of a PCT. To determine the optimal n-value, the evaluation step (see Method evaluation) was performed for different length thresholds. Figure 2D shows the fraction of correctly predicted PCTs and the average fraction of sequence length covered by a PCT for different values of n. For the prokaryotic proteins, both the fraction of sequence length covered and the fraction of correctly predicted PCTs are relatively constant for n = 1–12. For n > 12 residues, the fraction of sequence length covered drops significantly, whereas there is only a minor increase in the fraction of correct PCTs. The trend is basically the same for the eukaryotic proteins, though we consider these results less reliable because of the small test set. In summary, the results do not vary appreciably for n-values < 10, and the default value n = 5 has been used for all results reported here.
Method evaluation
To assess the performance of the PCT prediction algorithm, it was applied to the prokaryotic and eukaryotic test sets of proteins with experimentally determined topologies described above. For a given PCT, the corresponding region in the experimentally determined topology was checked, and if both the number and directions of TMHs in this region agreed with the PCT, the prediction was considered to be correct.
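The correctness criterion can be illustrated with a short Python sketch. It rests on assumptions that go beyond the text: the experimentally determined topology is encoded per residue with the same i, o, m, and w symbols as the predictions, and an experimental TMH is counted as belonging to the PCT region if it overlaps the region by at least one residue. The function names are hypothetical.

```python
from itertools import groupby

def tmh_directions(topology, start, end):
    """Directions ('m' or 'w') of the TMH runs overlapping topology[start:end]."""
    directions = []
    pos = 0
    for symbol, run in groupby(topology):
        run_start = pos
        pos += len(list(run))
        if symbol in "mw" and run_start < end and pos > start:
            directions.append(symbol)
    return directions

def pct_is_correct(consensus, experimental, start, end):
    """True if the number and directions of TMHs agree within the PCT region."""
    return (tmh_directions(consensus, start, end)
            == tmh_directions(experimental, start, end))

# Hypothetical consensus with one PCT spanning residues 4..21 (half-open 4, 22).
consensus = "iii." + "m" * 6 + "." + "oooo." + "w" * 6 + ".ii"
observed = "iiii" + "m" * 7 + "oooo" + "w" * 7 + "iii"
inverted = "oooo" + "w" * 7 + "iiii" + "m" * 7 + "ooo"
print(pct_is_correct(consensus, observed, 4, 22))  # True
print(pct_is_correct(consensus, inverted, 4, 22))  # False
```

Because only the count and order of helix directions are compared, small differences in helix boundaries between the prediction and the experimental annotation do not affect the outcome.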
| Electronic supplemental material |
|---|
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Blattner, F.R., Plunkett III, G., Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J.D., Rode, C.K., Mayhew, G.F., et al. 1997. The complete genome sequence of Escherichia coli K-12. Science 277: 1453–1474.
Bult, C.J., White, O., Olsen, G.J., Zhou, L., Fleischmann, R.D., Sutton, G.G., Blake, J.A., FitzGerald, L.M., Clayton, R.A., Gocayne, J.D., et al. 1996. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273: 1058–1073.
Claros, M.G. and von Heijne, G. 1994. TopPred II: An improved software for membrane protein structure predictions. Comput. Appl. Biosci. 10: 685–686.
Cole, S.T., Brosch, R., Parkhill, J., Garnier, T., Churcher, C., Harris, D., Gordon, S.V., Eiglmeier, K., Gas, S., Barry III, C.E., et al. 1998. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393: 537–544.
Cuff, J.A., Clamp, M.E., Siddiqui, A.S., Finlay, M., and Barton, G.J. 1998. JPred: A consensus secondary structure prediction server. Bioinformatics 14: 892–893.
Drew, D., Sjöstrand, D., Nilsson, J., Urbig, T., Chin, C., de Gier, J.W., and von Heijne, G. 2002. Rapid topology mapping of Escherichia coli inner-membrane proteins by prediction and PhoA/GFP fusion analysis. Proc. Natl. Acad. Sci. 99: 2690–2695.
Fraser, C.M., Casjens, S., Huang, W.M., Sutton, G.G., Clayton, R., Lathigra, R., White, O., Ketchum, K.A., Dodson, R., Hickey, E.K., et al. 1997. Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390: 580–586.
Hobohm, U., Scharf, M., Schneider, R., and Sander, C. 1992. Selection of representative protein data sets. Protein Sci. 1: 409–417.
Ikeda, M., Arai, M., Lao, D., and Shimizu, T. 2001. Transmembrane topology prediction methods: A re-assessment and improvement by a consensus method using a dataset of experimentally-characterised transmembrane topology. In Silico Biol. 2: 1–15.
Jayasinghe, S., Hristova, K., and White, S.H. 2001. MPtopo: A database of membrane protein topology. Protein Sci. 10: 455–458.
Jones, D.T. 1998. Do transmembrane protein superfolds exist? FEBS Lett. 423: 281–285.
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1994. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry 33: 3038–3049.
Kaneko, T., Nakamura, Y., Wolk, C.P., Kuritz, T., Sasamoto, S., Watanabe, A., Iriguchi, M., Ishikawa, A., Kawashima, K., Kimura, T., et al. 2001. Complete genomic sequence of the filamentous nitrogen-fixing cyanobacterium Anabaena sp. strain PCC 7120. DNA Res. 8: 227–253.
Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L. 2001. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305: 567–580.
Kunst, F., Ogasawara, N., Moszer, I., Albertini, A.M., Alloni, G., Azevedo, V., Bertero, M.G., Bessieres, P., Bolotin, A., Borchert, S., et al. 1997. The complete genome sequence of the gram-positive bacterium Bacillus subtilis. Nature 390: 249–256.
Lundström, J., Rychlewski, L., Bujnicki, J., and Elofsson, A. 2001. Pcons: A neural-network-based consensus predictor that improves fold recognition. Protein Sci. 10: 2354–2362.
Möller, S., Kriventseva, E.V., and Apweiler, R. 2000. A collection of well characterised integral membrane proteins. Bioinformatics 16: 1159–1160.
Möller, S., Croning, M., and Apweiler, R. 2001a. Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics 17: 646–653.
Möller, S., Schroeder, M., and Apweiler, R. 2001b. Consistent integration of non-reliable heterogeneous information resources applied to the annotation of transmembrane proteins. Comput. Chem. 26: 41–49.
Nilsson, J., Persson, B., and von Heijne, G. 2000. Consensus predictions of membrane protein topology. FEBS Lett. 486: 267–269.
Promponas, V.J., Palaios, G.A., Pasquier, C.M., Hamodrakas, J.S., and Hamodrakas, S.J. 1999. CoPreTHi: A Web tool which combines transmembrane protein segment prediction methods. In Silico Biol. 1: 159–162.
Rost, B., Fariselli, P., and Casadio, R. 1996. Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci. 5: 1704–1718.
Sonnhammer, E., von Heijne, G., and Krogh, A. 1998. A hidden Markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6: 175–182.
Stein, L., Sternberg, P., Durbin, R., Thierry-Mieg, J., and Spieth, J. 2001. WormBase: Network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 29: 82–86.
Tettelin, H., Nelson, K.E., Paulsen, I.T., Eisen, J.A., Read, T.D., Peterson, S., Heidelberg, J., DeBoy, R.T., Haft, D.H., Dodson, R.J., et al. 2001. Complete genome sequence of a virulent isolate of Streptococcus pneumoniae. Science 293: 498–506.
Theologis, A., Ecker, J.R., Palm, C.J., Federspiel, N.A., Kaul, S., White, O., Alonso, J., Altafi, H., Araujo, R., Bowman, C.L., et al. 2000. Sequence and analysis of chromosome 1 of the plant Arabidopsis thaliana. Nature 408: 816–829.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673–4680.
Tomb, J.F., White, O., Kerlavage, A.R., Clayton, R.A., Sutton, G.G., Fleischmann, R.D., Ketchum, K.A., Klenk, H.P., Gill, S., Dougherty, B.A., et al. 1997. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388: 539–547.
Tusnady, G.E. and Simon, I. 1998. Principles governing amino acid composition of integral membrane proteins: Application to topology prediction. J. Mol. Biol. 283: 489–506.
Tusnady, G.E. and Simon, I. 2001. The HMMTOP transmembrane topology prediction server. Bioinformatics 17: 849–850.
von Heijne, G. 1992. Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J. Mol. Biol. 225: 487–494.
von Heijne, G. 1997. Principles of membrane protein assembly and structure. Progr. Biophys. Mol. Biol. 66: 113–139.
von Heijne, G. 1999. Recent advances in the understanding of membrane protein assembly and structure. Q. Rev. Biophys. 32: 285–307.