Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden
Stockholm Bioinformatics Center, AlbaNova, SE-106 91 Stockholm, Sweden
Reprint requests to: Gunnar von Heijne, Stockholm University, SE-106 91 Stockholm, Sweden; e-mail: gunnar@dbb.su.se; fax: +46-8-15-36-79.
(RECEIVED February 3, 2005; FINAL REVISION April 17, 2005; ACCEPTED April 17, 2005)
| Abstract |
|---|
Keywords: topology prediction; transmembrane protein; domain assignment; prediction constraints
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.051395305.
| Introduction |
|---|
α-Helical transmembrane proteins constitute about 20% of all proteins encoded by most genomes (Krogh et al. 2001), and are responsible for several vital processes in the cell. In addition, the medical importance of membrane-bound receptors, channels, and pumps as targets for drugs is well established. Still, for the large majority of membrane proteins, the structure or even the topology, i.e., the positions and in/out orientations of all transmembrane helices, is not known experimentally. The continuously growing amount of sequence data, in combination with the limited amount of structural data available, highlights the need for better and more accurate theoretical structure prediction methods, particularly for the annotation of membrane proteins. Protein domains are modular, independently evolving, and structurally similar amino acid segments, which may exist alone in single-domain proteins, or may combine to form multidomain proteins. Although covalent combinations between transmembrane domains (i.e., domains with one or more membrane-spanning regions) rarely occur, covalent combinations between soluble domains and transmembrane domains are observed frequently (Liu et al. 2004). Moreover, domains are often compartment-specific, and information about domain occurrence can be used to predict the subcellular localization of soluble proteins (Mott et al. 2002).
Here, we explore the possibility that the presence of compartment-specific extra-membranous protein domains in transmembrane protein sequences might be used as a constraint in a subsequent topology prediction step, in much the same way that experimentally determined "anchor points" have been used to constrain topology predictions (Kim et al. 2003; Rapp et al. 2004; Daley et al. 2005). Unconstrained topology predictions are correct for only ~55%–60% of all membrane proteins (Melén et al. 2003), while, as shown below, compartment-specific domains that are always located on just one side of a membrane (facing, e.g., the extracellular space or the cytosol) can be identified with high reliability. If such a domain is found in a membrane protein, that particular segment in the protein sequence can be fixed to the corresponding side of the membrane before applying a sequence-based topology prediction algorithm to the rest of the sequence. Here, we show that domains of this kind are found in at least 11% of the predicted membrane proteins in many eukaryotic proteomes, and that a significant improvement in topology prediction can be achieved by using these domains as prediction constraints.
| Results |
|---|
Domain selection
SMART (Letunic et al. 2004) is a database of well-annotated protein domains, represented as profile-HMMs, and is divided into four main categories: extracellular, nuclear, signaling, and others. In general, we considered domains annotated in SMART 4.0 as "extracellular" to reside outside of the membrane (i.e., on the noncytoplasmic side), and domains annotated as "signaling" to reside on the inside of the membrane (i.e., on the cytoplasmic side). This assumption is, for the most part, correct, and in agreement with, e.g., Mott et al. (2002).
However, we made one general exception to this rule. All domains were assigned to the 78,371 putative membrane protein sequences (see below), and the domain hits were compared to the topologies predicted by PRO-TMHMM (Viklund and Elofsson 2004), which uses the TMHMM 2.0 architecture (Krogh et al. 2001). If a domain was found to contain one or more predicted transmembrane helices, it was removed from the domain collection. Only four out of 372 domains were discarded this way.
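The filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-residue topology encoding (`'M'` for a predicted membrane residue), the tuple layout of a domain hit, and the function names are hypothetical and are not taken from PRO-TMHMM or SMART. The 10% cutoff is the value given in Materials and Methods.

```python
from collections import defaultdict

TM_FRACTION_CUTOFF = 0.10  # cutoff stated in Materials and Methods


def tm_fraction_per_domain(hits, topologies):
    """Return {domain: pooled fraction of domain-hit residues predicted as TM}.

    hits: iterable of (protein_id, domain, start, end); 0-based, end exclusive.
    topologies: {protein_id: per-residue label string}; 'M' marks a predicted
    membrane residue (hypothetical encoding chosen for this sketch).
    """
    tm_residues = defaultdict(int)
    all_residues = defaultdict(int)
    for protein_id, domain, start, end in hits:
        segment = topologies[protein_id][start:end]
        tm_residues[domain] += segment.count("M")
        all_residues[domain] += len(segment)
    return {d: tm_residues[d] / n for d, n in all_residues.items() if n}


def soluble_domains(hits, topologies, cutoff=TM_FRACTION_CUTOFF):
    """Keep only domains whose pooled TM fraction does not exceed the cutoff."""
    fractions = tm_fraction_per_domain(hits, topologies)
    return {d for d, f in fractions.items() if f <= cutoff}
```

Pooling residues over all hits of a domain, rather than averaging per-protein fractions, is one reading of "the total fraction of residues"; the other reading would need only a small change to the first function.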
Estimation of error frequency of domain assignments
In order to assess the validity of our domain selection method, the domains were assigned to 297 homology reduced sequences of membrane proteins with experimentally known topologies. This resulted in 48 domain hits, contained in 29 (10%) of the sequences. Out of all domain hits, 47 (98%) were in agreement with the topology. One domain (TarH) was in conflict with a known topology, and was thus removed from the domain collection. Although the test set is small, we consider our domain collection as highly reliable.
The final domain list used for placing constraints on the topology predictions consisted of 367 domains, of which 146 were "IN-domains" (i.e., appear only on the cytoplasmic side of the membrane), and 221 were "OUT-domains" (i.e., appear only on the non-cytoplasmic side of the membrane) (see Supplemental Material S1).
Unconstrained topology predictions
A total of 553,974 protein sequences from 38 eukaryotic genomes (Supplemental Material S2) were downloaded from the SUPERFAMILY Web site (Gough et al. 2001). In an initial topology prediction step, 24% of the sequences were predicted by TMHMM to be membrane proteins, which is in agreement with earlier estimates (Krogh et al. 2001). After a second topology prediction step using PRO-TMHMM (Viklund and Elofsson 2004) and homology reduction (see Materials and Methods), 78,371 putative membrane protein sequences remained for further analysis. These sequences, together with their predicted topologies, are available as Supplemental Material S3 for both the full and the homology-reduced data sets.
Constrained topology predictions
The IN/OUT location for the final list of 367 domains was used as a constraint for the topology prediction; in other words, we considered the domain assignments to be entirely correct. Of all 78,371 predicted membrane proteins, 8703 (11%) contained one or more of the 367 domains, which is consistent with the fraction of membrane proteins with known topology that contain at least one of the domains (10%; see above). Of these domain hits, 4126 (34%) were in conflict with the unconstrained topology predictions, which is much higher than the same figure for proteins with known topology (Table 1). This discrepancy is not surprising, since we are now dealing with topology predictions as opposed to known topologies; rather, it suggests that in those cases where the domain assignments and topology prediction are in conflict, the latter is most likely incorrect. In fact, the fraction of conflicting domain hits is consistent with earlier reported error frequencies of TMHMM topology predictions (Krogh et al. 2001), further supporting this idea.
Domains are more frequent in single-spanning membrane proteins
Based on the constrained predictions, the topologies of the 8703 proteins containing at least one domain were analyzed. Sixty-six percent were single-spanning proteins (Fig. 1), compared to just 37% in the complete set of predicted membrane proteins, suggesting that our method will have particular impact on single-spanning proteins. Single-spanning proteins are often mispredicted by the current topology prediction methods, mostly due to an inversion of the predicted topology such that the TM-segment is correctly located but the overall orientation is wrong. Large extra-membranous domains carry little or no orientational information in the current predictors, and our domain-based method thus solves a major weakness in these methods.
To be certain that the trend observed was not just an artifact of the domain composition, such that the proteins with domains on both sides of the membrane were, e.g., closely related, we looked further into which domains were present in those proteins. No such artifacts were found; for instance, 58 different domain types are represented in the IN/OUT combinations in Figure 3, and no domain represents >17% of the total number of domain hits.
| Discussion |
|---|
In a large collection of 78,371 redundancy-reduced proteins from fully sequenced eukaryotic genomes, 11% contain domains that, when found in soluble proteins, have compartment-specific localization. At least two-thirds of these 8703 proteins are single-spanning, and overall, we can correct the unconstrained topology prediction for 34% of the 8703 domain-containing proteins.
Although the coverage of compartment-specific domain hits is limited, this figure will increase as more domains are characterized and included in the SMART database. In fact, domains from the Pfam database (Bateman et al. 2004) were found in >90% of the 297 known membrane proteins analyzed here (data not shown), although the predictive value of those domains remains to be investigated. Although in this paper we have focused only on soluble domains that are devoid of TM-helices, a possible further use of domain information in topology prediction is to attempt to define conserved partial topologies (Nilsson et al. 2002) for protein domains that contain one or more TM-helix and use these as constraints in a subsequent topology prediction step.
In conclusion, domain-based topology constraints provide a solution to a major weakness in current topology prediction schemes, which, in general, gain little information from large extramembranous domains.
| Materials and methods |
|---|
Membrane proteins with experimentally known topology were used to test the accuracy of the domain assignment method. Sequences and topologies from three different sources, Mptopo (Jayasinghe et al. 2001), TMpdb (Ikeda et al. 2003), and the Möller database (Möller et al. 2000), were combined and homology reduced at a 40% threshold using the CD-HIT algorithm (Li et al. 2002) (word-size 2). This produced 297 nonredundant membrane protein sequences with experimentally known topologies.
Domain selection
All predicted membrane protein sequences were searched for SMART 4.0 domains (Letunic et al. 2004) annotated as "extracellular" or "signaling," using an E-value cutoff of 10⁻⁶. In order to avoid artifacts arising from domain repeats, only the most significant domain hit for each sequence was retained. Conflicting domain assignments were resolved so that the assignment with the lowest E-value was regarded first, and then any nonconflicting assignments were added in order of increasing E-values. For each domain, the predicted partial topologies (i.e., the topology within the region of the domain hit) of all proteins assigned with this domain were examined, and the total fraction of residues predicted to lie in a TM-region was calculated. If this fraction was above 10%, the domain was considered to actually contain TM-regions, and was removed from the domain collection.
As a test of the accuracy of our method, the remaining domains were searched for in the 297 membrane proteins with known topologies. Out of 48 domain hits, one hit was in conflict with the experimentally known topology, and this domain (TarH) was removed from the domain collection.
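The conflict-resolution rule described in this section (lowest E-value first, then nonconflicting assignments in order of increasing E-value) is a greedy selection. The sketch below assumes that two assignments "conflict" when their sequence ranges overlap; the text does not define the term, so this is an interpretation, and the dictionary keys and the function name are hypothetical.

```python
def resolve_conflicts(hits):
    """Greedily select domain hits in order of increasing E-value.

    hits: list of dicts with keys 'domain', 'start', 'end', 'evalue';
    coordinates are 0-based with 'end' exclusive.
    Assumption: a hit conflicts with an accepted hit if their ranges overlap.
    Returns the accepted hits sorted by start position.
    """
    accepted = []
    for hit in sorted(hits, key=lambda h: h["evalue"]):
        overlaps = any(
            hit["start"] < other["end"] and other["start"] < hit["end"]
            for other in accepted
        )
        if not overlaps:
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h["start"])
```

For example, given hits A (residues 10 to 50, E = 1e-20), B (40 to 80, E = 1e-8), and C (60 to 100, E = 1e-10), the function accepts A first, then C, and rejects B because it overlaps both.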
Constrained topology predictions
All proteins with at least one domain hit had their topologies repredicted using the PRO-TMHMM prediction algorithm (Viklund and Elofsson 2004), with the domain region fixed to the corresponding side of the membrane. The IN/OUT-fixation of a certain residue is achieved by setting the corresponding state probability in the HMM equal to 1.0, and is straightforward using the modhmm package (Viklund and Elofsson 2004). As a precaution not to interfere with any TM-regions, since the positions of both the predicted domain and any predicted TM-helices might be somewhat imprecise, only the core part (i.e., the middle 50%) of the domain assignment was fixed. Conflicting domain assignments were resolved as described above.
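As an illustration of the "core part" rule, the following sketch computes the middle 50% of a domain hit and writes it into a per-residue constraint string. It does not use or reproduce the modhmm interface: the label alphabet (`'i'`, `'o'`, and `'.'` for unconstrained) and both function names are assumptions made for this example, and a real predictor would translate such labels into fixed state probabilities.

```python
def core_region(start, end, core_fraction=0.5):
    """Return the central part of the half-open interval [start, end).

    With core_fraction=0.5 this is the middle 50% of the domain hit;
    the margin trimmed from each side is rounded down.
    """
    length = end - start
    margin = int(length * (1.0 - core_fraction) / 2.0)
    return start + margin, end - margin


def constraint_labels(seq_len, hits):
    """Build a per-residue constraint string for one protein.

    hits: iterable of (start, end, side), where side is 'i' (IN-domain)
    or 'o' (OUT-domain). Residues outside every domain core stay '.'.
    """
    labels = ["."] * seq_len
    for start, end, side in hits:
        core_start, core_end = core_region(start, end)
        for pos in range(core_start, core_end):
            labels[pos] = side
    return "".join(labels)
```

Fixing only the core leaves the flanks of the hit free, so that a TM-helix predicted close to an imprecise domain boundary is not overridden by the constraint.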
SignalP predictions
Predictions were performed using SignalP-HMM for the 70 most N-terminal residues of the sequence. If the probability for a signal peptide exceeded 0.5, and if there was an overlap of at least 10 residues between the signal peptide and the most N-terminal predicted TM-helix, this was taken as an indication that an actual signal peptide might have been mistaken for a TM-helix by TMHMM.
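The decision rule above combines two thresholds and can be written as a small predicate. The sketch assumes that the signal-peptide probability and the two residue ranges have already been obtained from the respective predictors; the function names and the half-open coordinate convention are hypothetical.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of residues shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def possible_signal_peptide(sp_probability, sp_region, first_tm_helix,
                            min_probability=0.5, min_overlap=10):
    """Flag a first TM-helix that may really be a signal peptide.

    sp_region and first_tm_helix are (start, end) tuples, end exclusive.
    The probability must exceed min_probability (strictly), and the two
    regions must share at least min_overlap residues.
    """
    shared = overlap_length(*sp_region, *first_tm_helix)
    return sp_probability > min_probability and shared >= min_overlap
```

A predicted signal peptide at residues 0 to 22 with probability 0.9 and a first TM-helix at residues 5 to 25 share 17 residues, so the helix would be flagged; with the helix at residues 15 to 35 the overlap is only 7 residues and it would not.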
| Electronic supplemental material |
|---|
| Footnotes |
|---|
| Acknowledgments |
|---|
| References |
|---|
Daley, D.O., Rapp, M., Granseth, E., Melén, K., Drew, D., and von Heijne, G. 2005. Global topology analysis of the Escherichia coli inner membrane proteome. Science, in press.
Dyrløv-Bendtsen, J., Nielsen, H., von Heijne, G., and Brunak, S. 2004. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 340: 783–795.
Gough, J., Karplus, K., Hughey, R., and Chothia, C. 2001. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol. 313: 903–919.
Ikeda, M., Arai, M., Okuno, T., and Shimizu, T. 2003. TMPDB: A database of experimentally-characterized transmembrane topologies. Nucleic Acids Res. 31: 406–409.
Jayasinghe, S., Hristova, K., and White, S.H. 2001. MPtopo: A database of membrane protein topology. Protein Sci. 10: 455–458.
Käll, L., Krogh, A., and Sonnhammer, E.L. 2004. A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol. 338: 1027–1036.
Kim, H., Melén, K., and von Heijne, G. 2003. Topology models for 37 Saccharomyces cerevisiae membrane proteins based on C-terminal reporter fusions and prediction. J. Biol. Chem. 278: 10208–10213.
Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E. 2001. Predicting transmembrane protein topology with a hidden Markov model. Application to complete genomes. J. Mol. Biol. 305: 567–580.
Letunic, I., Copley, R.R., Schmidt, S., Ciccarelli, F.D., Doerks, T., Schultz, J., Ponting, C.P., and Bork, P. 2004. SMART 4.0: Towards genomic data integration. Nucleic Acids Res. 32: D142–D144.
Li, W., Jaroszewski, L., and Godzik, A. 2002. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics 18: 77–82.
Liu, Y., Gerstein, M., and Engelman, D.M. 2004. Transmembrane protein domains rarely use covalent domain recombination as an evolutionary mechanism. Proc. Natl. Acad. Sci. 101: 3495–3497.
Melén, K., Krogh, A., and von Heijne, G. 2003. Reliability measures for membrane protein topology prediction algorithms. J. Mol. Biol. 327: 735–744.
Möller, S., Kriventseva, E., and Apweiler, R. 2000. A collection of well-characterised integral membrane proteins. Bioinformatics 16: 1159–1160.
Mott, R., Schultz, J., Bork, P., and Ponting, C.P. 2002. Predicting protein cellular localization using a domain projection method. Genome Res. 12: 1168–1174.
Nilsson, J., Persson, B., and von Heijne, G. 2002. Prediction of partial membrane protein topologies using a consensus approach. Protein Sci. 11: 2974–2980.
Ponting, C.P., Hofmann, K., and Bork, P. 1999. A latrophilin/CL-1-like GPS domain in polycystin-1. Curr. Biol. 9: R585–R588.
Rapp, M., Drew, D.E., Daley, D.O., Nilsson, J., Carvalho, T., Melén, K., de Gier, J.W., and von Heijne, G. 2004. Experimentally based topology models for E. coli inner membrane proteins. Protein Sci. 13: 937–945.
Ungar, D. and Hughson, F.M. 2003. SNARE protein structure and function. Annu. Rev. Cell Dev. Biol. 19: 493–517.
Viklund, H. and Elofsson, A. 2004. Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information. Protein Sci. 13: 1908–1917.
Wallin, E. and von Heijne, G. 1995. Properties of N-terminal tails in G-protein coupled receptors: A statistical study. Protein Eng. 8: 693–698.