Protein Science Sheba protein
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Supplemental Research Data
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Right arrow Citation Map
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Chen, C. P.
Right arrow Articles by Rost, B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chen, C. P.
Right arrow Articles by Rost, B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us   Add to Digg   Add to Reddit   Add to Technorati  
What's this?
Protein Science (2002), 11:2774-2791.
Copyright © 2002 The Protein Society

Transmembrane helix predictions revisited

Chien Peter Chen1, Andrew Kernytsky1 and Burkhard Rost1,2,3

1 Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), New York, New York 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA

Reprint requests to: Burkhard Rost, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 W. 168 St., BB217, New York, NY 10032, USA; e-mail: rost{at}columbia.edu; fax: (212) 305-7932.

(RECEIVED May 5, 2002; FINAL REVISION July 22, 2002; ACCEPTED September 16, 2002)

Terminology: Advanced prediction methods, all methods that do not exclusively use a hydrophobicity scale; simple prediction methods, membrane prediction methods exclusively based on hydrophobicity scales.

Formula abbreviations: htm, transmembrane helix; T, residue in transmembrane helix; N, nonmembrane residue.

Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0214502.


    Abstract
 TOP
 Abstract
 Introduction
 Results
 Discussion
 Conclusion
 Materials and methods
 References
 
Methods that predict membrane helices have become increasingly useful in the context of analyzing entire proteomes, as well as in everyday sequence analysis. Here, we analyzed 27 advanced and simple methods in detail. To resolve contradictions in previous works and to reevaluate transmembrane helix prediction algorithms, we introduced an analysis that distinguished between performance on redundancy-reduced high- and low-resolution data sets, established thresholds for significant differences in performance, and implemented both per-segment and per-residue analysis of membrane helix predictions. Although some of the advanced methods performed better than others, we showed in a thorough bootstrapping experiment based on various measures of accuracy that no method performed consistently best. In contrast, most simple hydrophobicity scale-based methods were significantly less accurate than any advanced method as they overpredicted membrane helices and confused membrane helices with hydrophobic regions outside of membranes. In contrast, the advanced methods usually distinguished correctly between membrane-helical and other proteins. Nonetheless, few methods reliably distinguished between signal peptides and membrane helices. We could not verify a significant difference in performance between eukaryotic and prokaryotic proteins. Surprisingly, we found that proteins with more than five helices were predicted at a significantly lower accuracy than proteins with five or fewer. The important implication is that structurally unsolved multispanning membrane proteins, which are often important drug targets, will remain problematic for transmembrane helix prediction algorithms. Overall, by establishing a standardized methodology for transmembrane helix prediction evaluation, we have resolved differences among previous works and presented novel trends that may impact the analysis of entire proteomes.

Keywords: Sequence analysis; protein structure prediction; multiple alignments, predicting transmembrane helices; comparing genomes; bioinformatics; computational biology; proteomes

Abbreviations: A-Cid, normalized hydrophobicity scale for {alpha}-proteins (Cid 1992) • Av-Cid, normalized average hydrophobicity scale (Cid 1992) • Ben-Tal, hydrophobicity scale representing the free energy of transferring an amino acid from water into the center of the hydrocarbon region of a lipid bilayer (Kessel and Ben-Tal 2002) • BIG, nonidentical merger of SWISS-PROT (Bairoch and Apweiler 2000) and TrEMBL (Bairoch and Apweiler 2000) and PDB (Berman et al. 2000) • BLAST, fast sequence alignment method (Altschul and Gish 1996) • Bull-Breese, Bull-Breese hydrophobicity scale (Bull 1974) • DSSP, program assigning secondary structure (Kabsch and Sander 1983) • Eisenberg, normalized consensus hydrophobicity scale (Eisenberg et al. 1984) • EM, Solvation free energy (Eisenberg and McLachlan 1986) • EVA, server automatically evaluating structure prediction methods (Eyrich et al. 2001a,b) • Fauchere, hydrophobic parameter {pi} from the partitioning of N-acetyl-amino-acid amides (Fauchere and Pliska 1983) • GES, hydrophobicity property (Engelman et al. 1986; Prabhakaran 1990) • Heijne, transfer free energy to lipophilic phase (von Heijne and Blomberg 1979) • HMM, hidden Markov model • HMMTOP, hidden Markov model predicting transmembrane helices (Tusnady and Simon 1998) • Hopp-Woods, Hopp-Woods hydrophilicity value (Hopp and Woods 1981) • KD, Kyte–Doolittle hydropathy index (Kyte and Doolittle 1982) • Lawson, transfer free energy (Lawson et al. 1984) • Levitt, hydrophobic parameter (Levitt 1976) • MaxHom, dynamic programming algorithm for conservation weight-based multiple sequence alignment (Sander and Schneider 1991) • MEMSAT, dynamic-programming based prediction of transmembrane helices (Jones et al. 1994) • META-PP, internet service allowing access to a variety of bioinformatics tools through one single interface (Eyrich and Rost 2000) • Nakashima, normalized composition of membrane proteins (Nakashima et al. 1990) • PDB, Protein Data Bank of experimentally determined 3D structures of proteins (Bernstein et al. 1977; Berman et al. 2000) • PHDhtm, profile-based neural network prediction of transmembrane helices (Rost 1996; Rost et al. 1996b) • PHDpsihtm, divergent profile (PSI-BLAST)-based neural network prediction 2002) • PSI-BLAST, position-specific iterated database search (Altschul et al. 1997) • Radzicka, transfer free energy from 1-octanol to water (Radzicka and Wolfenden 1988) • Roseman, solvation-corrected side-chain hydropathy (Roseman 1988) • SignalP, signal peptide prediction (Nielsen et al. 1997a) • SOSUI, hydrophobicity- and amphiphilicity-based transmembrane helix prediction (Hirokawa et al. 1998) • SPLIT, transmembrane helix prediction (Juretic et al. 1998) • Sweet, optimal matching hydrophobicity (Sweet and Eisenberg 1983) • SWISS-PROT, database of protein sequences (Bairoch and Apweiler 2000) • TM, transmembrane • TMAP, alignment-based prediction of transmembrane helices (Persson and Argos 1996) • TMH, transmembrane helix • TMHMM, transmembrane prediction using cyclic hidden Markov models (Sonnhammer et al. 1998; Krogh et al. 2001) • TMpred, prediction of transmembrane helices (Hofmann and Stoffel 1993) • TopPred2, hydrophobicity-based membrane helix prediction (von Heijne 1992; Cserzö et al. 1997) • TrEMBL, translation of the EMBL-nucleotide database coding DNA to protein sequences (Bairoch and Apweiler 2000) • Wolfenden, hydration potential (Wolfenden et al. 1981) • WW, Wimley–White hydrophobicity scale-based method (Wimley et al. 1996a,b; White and Wimley 1999; White 2001).


    Introduction
 TOP
 Abstract
 Introduction
 Results
 Discussion
 Conclusion
 Materials and methods
 References
 
Helical membrane proteins challenge bioinformatics. Membrane proteins are crucial for survival. They constitute key components for cell–cell signaling, mediate the transport of ions and solutes across the membrane, and are crucial for recognition of self (Stack et al. 1995; Chapman et al. 1998; Le Borgne and Hoflack 1998; Chen and Schnell 1999; Hettema et al. 1999; Pahl 1999; Truscott and Pfanner 1999; Bauer et al. 2000; Ito 2000; Soltys and Gupta 2000; Thanassi and Hutltgren 2000). Furthermore, the pharmaceutical industry preferably targets membrane-bound receptors (Heusser and Jardieu 1997; Bettler et al. 1998; Moreau and Huber 1999; Saragovi and Gehring 2000; Sedlacek 2000). Despite their great biological and medical importance, we still have very little experimental information about their 3D structures: <1% of the proteins of known structure are membrane proteins. Fortunately, it is relatively easy to identify the location of membrane helices through low-resolution experiments. An expert-curated list of low-resolution experiments maintained by Steffen Möller and colleagues (Möller et al. 2000) considers information from C-terminal fusions with indicator proteins (McGovern et al. 1991; Hennessey and Broome-Smith 1993; Traxler et al. 1993; van Geest and Lolkema 2000) and from antibody-binding studies (Traxler et al. 1993; McGuigan 1994; Jermutus et al. 1998; Morris et al. 1998; Amstutz et al. 2001). Nevertheless, we only have low-resolution experimental information for <500 helical membrane proteins, and PDB (Berman et al. 2000) contains <50 sequence-unique protein chains with high-resolution helical membrane structures (Materials and Methods). These numbers contrast with the >7000 helical membrane proteins expected in humans alone (Wallin and von Heijne 1998; Krogh et al. 2001; Liu and Rost 2001). Thus, bioinformatics is challenged to help bridge the information gap between what we want and what we have.

Published estimates for membrane helix prediction questioned by recent analyses. Recently, a few groups have questioned the estimated levels of performance for membrane helix prediction methods. Möller, Croning, and Apweiler analyzed 14 prediction methods that did not use alignment information on a set of 188 proteins with experimentally known helices (Möller et al. 2000, 2001). They also applied the prediction methods to globular proteins and to signal peptides. The results indicated the following conclusions: (1) The best prediction method (TMHMM, transmembrane prediction using cyclic hidden Markov models) correctly predicts all membrane helices for 52%–69% of all proteins tested. (2) The best distinction between globular and membrane-helical proteins reaches levels of >97% for the globular proteins tested (TMHMM and SOSUI, hydrophobicity- and amphiphilicity-based transmembrane helix prediction). (3) On a set of 34 signal and transit peptide proteins, the best methods reached 98% (PHDhtm, profile-based neural network prediction of transmembrane helices) to 100% (ALOM2) accuracy in distinguishing these from membrane helices. (4) The best simple hydrophobicity index (KD, Kyte–Doolittle hydropathy index; Kyte and Doolittle 1982) correctly predicted all helices for 44% of all the proteins in a set for which HMMTOP (hidden Markov model predicting transmembrane helices; Tusnady and Simon 1998) reached only 43% accuracy. Another recent analysis was based on a set of 145 sequence-unique proteins (Ikeda et al. 2001). The researchers tested 10 prediction methods not using alignment information on their data set. In contrast to Möller et al., the investigators found that HMMTOP was not only much better than the KD hydrophobicity index, but that it was the most accurate prediction method, correctly predicting all membrane helices for ~68% of all proteins. Averaging over all 10 methods, the authors found the resulting consensus prediction ~10 percentage points more accurate than the best single method. The investigators also claimed that prediction accuracy is higher for prokaryotes than for eukaryotes. They speculated that they found different levels of accuracy than Möller et al. because they used different percentages of prokaryotic proteins in their data sets. Jayasinghe, Hristova, and White analyzed four prediction methods on two different sets of proteins with known membrane helix locations: (1) on 150 high-resolution structures from PDB, and (2) on 242 low-resolution proteins (Jayasinghe et al. 2001b). The researchers found that the results between the high- and low-resolution sets differed marginally and reported that the best methods (PHDhtm and HMMTOP) correctly predict >93%–97% of all helices. This group has also proposed a method based on a novel entropy-based hydrophobicity scale, namely, the Wimley–White scale (WW, Wimley–White hydrophobicity-scale-based method), which is claimed to correctly predict 99% of all membrane helices (Jayasinghe et al. 2001a). One major problem of hydrophobicity-based methods appears to be the poor distinction between membrane and globular proteins (Edelman 1993; Jones et al. 1994; Rost et al. 1995 Rost et al. 1996b; Jayasinghe et al. 2001a; Möller et al. 2001).

Problems with previous analyses. Previous analyses were limited in various ways. (1) Performance on high- and low-resolution data sets was distinguished by neither the Möller nor the Ikeda groups, although it seemed that performance differed between the two (Jayasinghe et al. 2001b). (2) The redundancy in data sets resulting from many copies of very similar proteins was not reduced by the Möller or Jayasinghe groups. However, such bias is known to create problems when estimating prediction methods (Rost and Sander 1993; Rost et al. 1995 Rost et al. 1996b; Rost 2002). (3) Neither Möller et al. nor Ikeda et al. tested any method based on alignment information, although such methods are known to be more accurate (Rost and Sander 1993; Persson and Argos 1994; Neuwald et al. 1995; Rost et al. 1995; Rost 1996; Johnson and Church 1999). (4) No group explored per-residue—along with per-segment—based measures for prediction accuracy. Instead, all groups focused on one particular definition of prediction accuracy; no two groups applied the same definition. (5) No group established levels for significant differences between methods. This makes it impossible to conclude whether or not differences between any two methods are relevant. In general, levels of significant differences typically depend on the data sets and the scores used (Eyrich et al. 2001; Rost and Eyrich 2001; Marti-Renom et al. 2002). (6) Only Möller and coworkers tested proteins with signal peptides; however, their analysis was restricted to a small set of 34 proteins with known signal peptides. (7) No group analyzed more than 14 prediction methods. (8) Generally, prediction accuracy differs significantly between proteins used to develop a method and proteins never seen by a method (Moult et al. 1995, 1997, 1999). For membrane proteins, this effect is very difficult to estimate because few high-resolution structures of membrane proteins are added over a course of a year. Although Möller et al. tried to estimate this effect by analyzing only proteins not used for developing a method, they did not rule out that the proteins tested in the category "not known to the method" were similar to proteins used for development. Surprisingly, Möller et al. found most methods to perform better on proteins not used for development. Given how prediction methods are developed, it is very unlikely that this result holds in general. Either the differences are not significant, or the data sets were not representative (or both).

To resolve these limitations and to standardize membrane helix prediction performance comparisons, we have presented an analysis that distinguished between performance on redundancy-reduced high- and low-resolution data sets, established thresholds for significant differences in performance by introducing a bootstrap experiment, and implemented both per-segment and per-residue analysis of membrane helix predictions. Additionally, we analyzed more methods (8 publicly available advanced prediction methods and 19 different hydrophobicity scales). In particular, we included alignment-based prediction methods. Furthermore, we tested membrane helix prediction methods on a large, representative set of 1418 unique signal peptides and 616 unique globular protein folds taken from SCOP (Lo Conte et al. 2002). Although we confirmed many previous findings, overall our results differed greatly in detail from previous publications.


    Results
 TOP
 Abstract
 Introduction
 Results
 Discussion
 Conclusion
 Materials and methods
 References
 
Accuracy in predicting membrane helices
Prediction methods not significantly less accurate than low-resolution experiments! We compared the membrane annotations for 13 proteins for which we had both low-resolution and high-resolution data available. Whereas ~94%–96% of the helices agreed between the two experimental methods, for only 11 of the 13 proteins did all helices overlap between the two experimental methods (Table 1Go). Also, the two methods agreed on only 82% of all residue assignments (Table 1, QGo2, percentage of correctly predicted residues in two states: membrane helix and non-membrane helix). A detailed comparison of the percentage of identically assigned membrane-helical residues confirmed that for most cases, the differences arose from the longer segments observed in the high-resolution data (Q2T%obs < Q2T%prd, where Q2T%obs is the percentage of all observed TMH helix residues that are correctly predicted and Q2T%prd is the percentage of all predicted TMH helix residues that are correctly predicted). Assuming that the high-resolution data were correct, we can interpret the low-resolution data as an experimental prediction of transmembrane helices. Surprisingly, most prediction methods performed as well as the low-resolution experiments (Table 1Go). In fact, in terms of almost all measures for accuracy, we could find one method that numerically agreed more with the high-resolution data than the low-resolution experiment. However, given the small size of the data set, this statement ignored the error margins in the estimate for accuracy.


View this table:
[in this window]
[in a new window]
 
Table 1. Accuracy of low-resolution experiments and predictions
 
Simple hydrophobicity-based predictions were less accurate than advanced methods. Of the methods that only used hydrophobicity scales for prediction, none detected all membrane helices correctly for >70% of the high-resolution proteins (Table 2, QGook, percentage of proteins for which all TM helices are predicted correctly). However, most methods correctly identified >90% of all observed membrane helices (Table 2, QGohtm%obs, percentage of all observed helices that are predicted correctly). In fact, measured by this score alone, most simple hydrophobicity-based methods appeared more accurate than many advanced prediction methods, but this success was achieved by overpredicting membrane helices (Table 2, QGohtm%prd < Qhtm%obs, where Qhtm%prd is the percentage of all predicted helices that are predicted correctly). Encouragingly, >80% of the helices predicted by most methods were correct (Table 2, QGohtm%prd). Unfortunately, the real problem with the simple methods was that they did not correctly predict the nonmembrane regions as apparent in levels of <70% correctly predicted residues (Table 2, QGo2). Note that we implemented all simple hydrophobicity scales by using the algorithm proposed by the White group (Jayasinghe et al. 2001a). To ensure that this optimized or at least did not penalize membrane protein prediction for some hydrophobicity scales, we also tested the thresholds suggested in the original publications for the GES (hydrophobicity property; Engelman et al. 1986; Prabhakaran 1990) and KD scales (Kyte and Doolittle 1982). Interestingly, the originally proposed thresholds decreased prediction accuracy (Supplementary Table 1Go; available online at http://www.proteinscience.org).


View this table:
[in this window]
[in a new window]
 
Table 2. Accuracy of prediction methods for high-resolution set
 
Most advanced predictions were correct. All advanced prediction methods correctly identified all helices for most high-resolution proteins (Table 2, QGook). In contrast, the only two methods we found to also accurately predict the orientation of the helices, that is, the topology, most often were TopPred2 (hydrophobicity-based membrane helix prediction) and HMMTOP2 (Table 2Go, TOPO, percentage of proteins for which the topology is correctly predicted). Note that HMMTOP2 was developed using all the 36 high-resolution chains for which we compiled the results. On the other hand, TopPred2 used only four of the 36 chains when it was developed. All methods tested correctly predicted >70% of the residues in either of the two states, TMH (T) and non-TMH (N, Table 2Go, QGo2). However, all methods significantly underpredicted residues in membrane helices (Table 2, QGo2T%obs < Q2T%prd).

No single advanced method best by all scores. The set of 36 high-resolution proteins was small enough to require extreme caution in ranking methods based on numerical differences. When comparing pairwise ranks of the methods according to various scores, we found that no advanced method performed consistently best, and none consistently worst (Fig. 1Go). Interestingly, TMHMM1 and TopPred2 appeared to be the most representative methods in that the scores for these methods were most often indistinguishable from all other advanced methods in pairwise comparisons. In contrast, DAS appeared to be most unique in that it was often better and often worse than all other methods. Three methods were clearly more often worse than better: WW (5 times better/30 times worse), PRED-TMR (6/23), and SOSUI (7/26). Three methods were clearly more often better than worse: HMMTOP2 (21 times better/1 time worse), PHDpsihtm08 (divergent profile-based neural network prediction of transmembrane helices) (27/2), and PHDhtm08 (20/6).



View larger version (70K):
[in this window]
[in a new window]
 
Fig. 1. Pairwise comparison of methods. For all high-resolution results compiled in Table 2Go, we show the pairwise comparison for eight different scores and nine methods. Differences by more than one (two) standard error(s) are marked by one (two) arrow(s). Empty boxes indicate that the difference between the respective scores of the two methods is not significant. For example, DAS is two standard errors better than WW in terms of the number of correctly predicted proteins (Qok), whereas HMMTOP2 is two standard errors better than DAS in terms of the overall per-residue accuracy (Q2). The lower table summarizes the respective counts of pair-comparisons for which a particular method is better or worse than the others. TopPred2 and TMHMM1 appear to be the most neutral method (44 and 46 times indistinguishable), whereas DAS seems the most unique method in that it is often better than the others and equally often worse. Note: only DAS, PHDhtm08, PHDpsihtm07, and TopPred2 did not use most of the proteins tested to optimize prediction accuracy; thus, the results for all the other methods are likely to be overestimates.

 
Performance on low-resolution data set: distinct differences. The low-resolution set was considerably larger (165 proteins) than the high-resolution set (36 chains). Nevertheless, we could still not find any method that performed consistently better than all the others (Table 3Go). Most methods reached better per-segment scores for the high- than for the low-resolution data. The opposite was the case for per-residue scores as they were consistently higher for the low-resolution proteins. Most surprising may be the significant differences between the two data sets in terms of the percentage of proteins for which all helices were correctly predicted for the old methods DAS and TopPred2 (Qok in Tables 2 and 3GoGo). Even more stunning was the extremely poor performance of most simple methods using only hydrophobicity scales for the prediction. Interestingly, for the hydrophobicity scales, the two newest ones (WW and Ben-Tal; hydrophobicity scale representing the free energy of transferring an amino acid from water into the center of the hydrocarbon region of a lipid bilayer) performed best overall on the data from low-resolution experiments.


View this table:
[in this window]
[in a new window]
 
Table 3. Accuracy of prediction methods for low-resolution set
 
Most errors were under- or overpredictions of one TMH. The good news was that all methods predicted the number of membrane helices correctly for most proteins (Fig. 2Go). However, this number differed significantly between the high- (71%) and the low-resolution data (56%). The majority of deviations were to predict one helix too few or one too many (68% for high; 64% for low-resolution, Fig. 2Go, center). Interestingly, the errors were rather symmetric for the low-resolution set, whereas they were substantially asymmetric for the high-resolution data. We could not find any significant correlation between the number of membrane helices and the errors of a particular method (data not shown). However, this may be largely owing to the few high-resolution structures in our data set.



View larger version (28K):
[in this window]
[in a new window]
 
Fig. 2. Over- and underprediction of membrane helices. All methods (top panel): For all methods and all proteins in the high- and low-resolution sets, the difference between the number of membrane helices predicted and observed is shown. Although the two distributions appear rather similar, the higher symmetry in the low-resolution graph hid that the percentages with no difference were quite different: 71% for the high-resolution data and 56% for the low-resolution data. The inset (center) underlined the observation that the majority of errors were due to under- or overpredicting one helix.

 
Accuracy lower for proteins with more than five TMH’s. For proteins with five or fewer membrane helices, the average over all advanced methods exceeded 80% (Qok, eq. 4Go) for the high-resolution data and 60% for the low-resolution data (Fig. 3Go). However, prediction accuracy dropped significantly for proteins with more than five helices to values from 33%–36% (Fig. 3Go). Why are proteins with less than five TMH’s so different from proteins with more than six TMH’s? Answers to this question remain speculative.



View larger version (67K):
[in this window]
[in a new window]
 
Fig. 3. Proteins with many helices predicted less accurately. We binned the results for all advanced methods according to the number of observed membrane helices such that the three classes contained similar numbers of proteins (X-axis). Accuracy (Y-axis) is measured in terms of the percentage of proteins for which all helices are correctly predicted (Qok). Both, for the high- and the low-resolution data, proteins with more than five membrane helices were predicted at significantly lower levels of accuracy.

 
Most proteins and most helices correctly predicted by one of the methods. None of the high-resolution helices has been consistently mispredicted by all programs. However, this may reflect that the more recent methods used all these proteins for training. In contrast, three transmembrane helices from three proteins of the low-resolution set were not identified by any of the methods: (1) The C4-dicarboxylate transport protein from Rhizobium meliloti (SWISS-PROT ID dcta_rhime; helix from residues 282–300, sequence ALPGLMNKMEKAGCKRSVV) has a relatively hydrophobic sequence, but it has a polar stretch of residues, NKMEK, in the middle of the helix. The gene fusion constructs were not always created with the reporter gene present in the predicted loop regions (Jording and Puhler 1993). In some cases, the reporter gene was present in the predicted membrane regions. This is a problem because it may alter the topological placement of the reporter gene with respect to the membrane. In addition, gene fusion constructs were not made for each loop region because reporter genes were introduced at random. Hence, each loop was not tested, which included loops for helix 282–300, for its topological placement. Hence, the experimental evidence for this membrane helix (282–300) was weak, at best. (2) The Haemolysin Secretion ATP-Binding Protein (HlyB) from Escherichia coli (hlyb_ecoli, residues 38–51, sequence GTGL GLTSWLLAAK) is an integral membrane protein. However, the particular membrane helix missed appears very short. The other seven membrane helices of HlyB are at least 20 residues long. However, some authors have claimed that membrane-spanning helices may be as short as 10 residues long (Lewis et al. 1990). The experimental evidence for hylb_ecoli had similar problems as that for dcta_rhime: The experimentalists found it difficult to identify membrane-spanning regions through predictions (Wang et al. 1991). This was caused by the high proportion of hydrophilic residues in the N-terminal portion of HlyB. Consequently, the authors did not know where to insert their reporter gene, which in this case was ß-lactamase. Thus, they randomly inserted the reporter gene. Additionally, topological models identify the short stretch as loop (Wang et al. 1991; Gentschev and Goebel 1992). (3) Like all other problematic cases, the Mitochondrial brown fat uncoupling protein 1 from Rattus norvegicus (ucp1_rat, residues 178–194, sequence PNLMRNVIINCTELVTY) has transmembrane regions that contain many polar residues. For this protein, the experimentalists stated that their data did not suffice to strongly conclude that residues 178–194 are in a membrane helix (Miroux et al. 1993).

No significant difference in performance for prokaryotic and eukaryotic proteins. We compared the performance of each method for eukaryotic and prokaryotic proteins. Most methods did not consistently perform better for both the high- and low-resolution data (Table 4Go, {Delta}Qok). In fact, the trends differed greatly between both data sets, and for different measures of prediction accuracy. Whereas prokaryotic proteins were predicted more accurately in terms of per-segment measures for the high-resolution data sets, the opposite was the case for most methods when compared on the low-resolution set. Only four methods had a similar trend in Qok: PRED-TMR predicted eukaryotic proteins more accurately; SOSUI, TopPred2, and WW predicted prokaryotic proteins more accurately for both sets. However, none of the values exceeded two times the estimated error, that is, none was statistically very significant. All methods predicted topology ({Delta}TOPO) better for the prokaryotic proteins in the high-resolution set and for the eukaryotic proteins in the low-resolution set. When measuring prediction accuracy in terms of per-residue performance ({Delta}Q2), we could not find any significant difference between prokaryotic and eukaryotic proteins; all methods did slightly better for eukaryotic proteins for both high- and low-resolution data. Nevertheless, because of the lack of consistent direction of the difference and the lack of statistical significance, our data did not support the previously published conclusion that either prokaryotic or eukaryotic proteins were predicted more accurately.


View this table:
[in this window]
[in a new window]
 
Table 4. Difference between eukaryotic and prokaryotic membrane proteins
 
Accuracy of distinguishing between membrane and other proteins
Few false positives: best methods found few membrane helices in globular proteins. Most advanced methods correctly distinguished between membrane and globular proteins (Table 5Go). The best methods confused between the two types of proteins for <4% of all globular proteins tested (Table 5Go). DAS had the highest error rate of the advanced methods (16% false positives), which was surprising given that DAS tended to underpredict residues in membrane helices. In contrast to the advanced methods, the simple methods distinguished only poorly between membrane and globular proteins. The two exceptions were the old scale from Wolfenden (hydration potential; Wolfenden et al. 1981) and the new one from Ben-Tal (Kessel and Ben-Tal 2002). The latter also predicted membrane proteins rather accurately (Tables 2 and 3GoGo). However, most methods found helices in >90% of all the globular proteins.


View this table:
[in this window]
[in a new window]
 
Table 5. Confusion of membrane and globular proteins
 
Few false negatives. Most methods find all membrane proteins. Although most hydrophobicity scales detected membrane helices in >90% of the globular proteins, they detected all membrane proteins as such. The exceptions were the two scales that were best in rejecting globular proteins: Wolfenden and Ben-Tal (Table 5Go). Similarly, PHDhtm08 misclassified only 2% of the globular proteins, but also missed ~20% of the membrane proteins. The only methods that misclassified <10% of the globular proteins and overlooked <10% of the membrane proteins were: SOSUI, TMHMM1, PHDpsihtm, PRED-TMR, and HMMTOP2 (Table 5Go).

Signal peptides falsely predicted to be membrane helices by most methods. Even the advanced methods had high error rates for signal peptides (Table 6Go). In fact, one of the most accurate rejections of signal peptides was achieved by the simple method solely using the Wolfenden (Wolfenden et al. 1979) hydrophobicity scale (26% errors). Many of the false predictions were at the very beginning of the respective secreted proteins. Thus, we tested the following simple expert rule: delete all membrane helices predicted between 5 and 10 residues after an N-terminal methionine. For PHDpsihtm08, this reduced the falsely predicted signal peptides from 322 (23%) to 146 (10%). Encouragingly, when we applied the same rule to the set of membrane proteins, no helix was removed by this rule. For three out of the 1418 signal peptides, PHDpsihtm08 incorrectly predicted two transmembrane helices.


View this table:
[in this window]
[in a new window]
 
Table 6. Incorrectly predicted membrane helices in signal peptides (false positives)
 

    Discussion
 TOP
 Abstract
 Introduction
 Results
 Discussion
 Conclusion
 Materials and methods
 References
 
Confirming previous analyses
Some methods correctly distinguish globular from helical membrane proteins. Previous analyses showed that simple hydrophobicity-based methods have problems distinguishing between helical transmembrane and globular proteins (Edelman 1993; Jones et al. 1994; Rost et al. 1995; Jayasinghe et al. 2001a; Möller et al. 2001). In general, we confirmed this finding (Table 5Go). However, the Wolfenden and the Ben-Tal scales were clearly exceptional in this respect. Both performed on a par with the best advanced methods that predict membrane helices in at most 3% of all globular proteins (Table 5Go). Interestingly, these levels of accuracy are similar to the performance of the same methods six years ago (Rost et al. 1996a,b). This finding confirms that the globular proteins added to PDB over the last decade are not radically different from the structures that we knew before (Rost and Sander 1993; Rost 2001). Möller and colleagues published significantly more pessimistic estimates for the confusion between globular and membrane proteins (Möller et al. 2001). Whereas our estimates were based entirely on proteins of known structure, those from Möller et al. were based on proteins of unknown structure. Thus, we see two possible reasons for the difference between the two estimates. (1) Proteins in PDB differ from proteins in SWISS-PROT in their average length by almost a factor of 2 because structural biologists often have to truncate the proteins to obtain high-resolution structures. We might argue that the truncated regions are more likely to be confused with membrane helices than the regions for which structure is determined. (2) Many of the proteins used by Möller and colleagues may, in fact, contain membrane helices or signal peptides (for which the error is higher, Table 6Go). We suspect that the truth lies somewhere between the two extremes. Hence, our estimates for the confusion between globular and membrane proteins may be slightly optimistic.

Most methods confuse signal peptides and membrane helices. Möller et al. tested prediction methods on 34 signal and target peptides. They found that most methods incorrectly predicted these regions to contain membrane helices. We tested all 27 methods on 1418 sequence-unique signal peptides. Our results confirmed the previously uncovered trends (Table 6Go). However, the larger set that we used revealed that TMHMM1, which is one of the best methods in this respect, confuses >30% of the signal peptides with membrane helices rather than <10% as previously estimated (Möller et al. 2001). Most simple methods based only on hydrophobicity scales confused >90% of all the signal peptides with membrane helices (exception: Wolfenden scale, Table 6Go). The good news was that the error could be reduced by experts who discard all membrane helices predicted closer than 10 residues to an N-terminal methionine. In this best-case scenario, PHDhtm and PHDpsihtm falsely predicted only ~10% of the signal peptides as membrane helices. Possibly, combinations of membrane-optimized and signal-peptide-optimized programs could reduce this error rate.

Most methods identify most membrane helices. We confirmed (Ikeda et al. 2001; Jayasinghe et al. 2001b; Möller et al. 2001) that many methods correctly predict most membrane helices (Fig. 2Go). We also found the most common mistake to be the under- or overprediction of a single transmembrane helix. However, our results differed in detail from previous analyses (see below).

Resolving differences in previous analyses
Some methods are better; none is clearly best. Evaluations of membrane prediction methods are sometimes based on different definitions for performance accuracy. A particular example of the latter is to count a prediction of one long helix as correct although it stretches over two observed helices and thus misses the break in between the two. Another misleading standard procedure is to only report values covering one side of the coin, that is, only the values of correctly predicted as percentage of observed or vice versa. Here, we carefully evaluated all methods on identical data sets and compiled all reasonable scores for prediction accuracy. To simplify the complexity, we focused in our report on a relatively limited number of scores. Another problem with many previous analyses is that investigators have not estimated the error associated with a particular score. For example, from Table 1Go we may conclude that HMMTOP2 is much better than TopPred2 when applying any measure for prediction accuracy. Although the numbers differed greatly, a thorough bootstrap experiment revealed that the performance of the two methods was indeed indistinguishable. We compared the methods in a pairwise manner for each score of the high-resolution data set (Fig. 1Go). Some methods appeared more accurate than others. However, no method(s) performed consistently better than all others by more than one standard error (Fig. 1Go). Our estimates of error margins explained the numerical differences found between three analyses (Ikeda et al. 2001; Jayasinghe et al. 2001b; Möller et al. 2001).

Simple hydrophobicity-based methods less accurate than advanced methods. Möller et al. (2001) suggested that simple hydrophobicity scale-based methods predict membrane helices almost as accurately as the best advanced methods. We could not confirm this proposition. In contrast, we found that the best advanced methods were significantly more accurate than the best hydrophobicity-scale based methods, both in terms of per-segment and per-residue accuracy (Tables 2 and 3GoGo). The only possible exception may be the per-residue performance of the Ben-Tal scale for the low-resolution data (Table 3Go). However, we did confirm that, because of overprediction, a few hydrophobicity-scale-based methods identify the observed membrane helices at a level of accuracy similar to that of advanced methods in Qhtm%obs in Tables 2 and 3GoGo. Jayasinghe et al. found that the WW hydrophobicity scale-based method that they introduced outperformed even the best advanced methods ("We find that [the] WW scale ... identifies TM helices of membrane proteins with an accuracy greater than 99%"; Jayasinghe et al. 2001a). We could also not confirm this finding, no matter which definition of prediction accuracy we compared. Nevertheless, the major problem with simple hydrophobicity-based methods is their failure on globular proteins (Table 5Go) and signal peptides (Table 6Go). In fact, the error of hydrophobicity scales depends on the length of the protein. For example, the high-resolution chains had an average length of ~215 residues, whereas low-resolution proteins were, on average, ~420 residues long. Although hydrophobicity scales correctly predicted all helices in 28%–65% of the short proteins (Table 2Go), they only detected 5%–29% for the long proteins (Table 3Go). In particular, the scale that performed best on the high-resolution set (KD) dropped in accuracy from 65% (high) to 13% (low), whereas the scale that performed most poorly on the short proteins in the high-resolution data (Wolfenden) became best for the long proteins in the low-resolution data. The Wolfenden scale also performed relatively well on globular proteins (Table 5Go) and on signal peptides (Table 6Go). The price for the lack of overprediction is a low accuracy in detecting membrane helices (underprediction). Overall, the most successful hydrophobicity scale appeared to be the Ben-Tal scale, which is based on the free energy of transferring an amino acid from water into the center of the hydrocarbon region of a lipid bilayer (Kessel and Ben-Tal 2002). It out-performed the Wolfenden scale for membrane proteins and for globular proteins, and it bested all other scales for the low-resolution set. Simple hydrophobicity scales obviously have tremendous importance for sequence analysis. However, to use them as the only criterion to predict membrane helices appears to be a bad idea.

Incorrect ranking by per-segment accuracy depends on definition of score. As discussed above, any attempt to rank prediction methods should account for the standard error in the estimated level of accuracy. A particular illustration of this finding is that different definitions of the accuracy in correctly predicting all helices (eq. 4Go) would slightly alter the ranks. For example, DAS scored worst among all advanced methods when an overlap of at least nine residues was required to consider a helix correctly predicted (definition introduced by Möller et al. 2001), but it appeared to be the third-best of all advanced methods when we applied the definition introduced by Ikeda et al. (2001) (see Supplementary Table 1Go; available online at http://www.proteinscience.org). When giving different ranks only for significant differences, this apparent contradiction was resolved. Most averages were relatively insensitive to whether we required an overlap of 3 or 9 residues between predicted and observed helix (Qok3 and Qok9 in Supplementary Table 1Go; available online at http://www.proteinscience.org). However, contrary to what has been claimed previously, some methods had lower averages when requiring nine overlapping residues. Similarly, for most methods the average scores did not change considerably when using the definition of Ikeda et al (Qok11Centre in Supplementary Table 1Go; available online at http://www.proteinscience.org). However, although the score was lower for most methods for which it differed from the other two, for a few it was actually higher. These were methods that tended to underpredict helices. Overall, the dependence of ranking on the definition of the score used underscored the need to standardize evaluations.

Similar prediction accuracy for prokaryotic and eukaryotic membrane proteins. Ikeda et al. (2001) found that prediction methods are consistently worse at predicting membrane proteins from eukaryotes than those from prokaryotes. We could not verify this finding. Both for the high- and for the low-resolution data sets, we found that some methods reached slightly higher levels on one than on the other (Table 4Go). However, the differences were not significant.

Novel findings
Low-resolution experiments not much more accurate than prediction methods. The low-resolution experiments differed substantially in their assignments of membrane helices from high-resolution experiments. In fact, for a small subset of 13 high-resolution chains, many prediction methods appeared to be as correct—or as incorrect—as previously deposited low-resolution experiments (Table 1Go). This problem was also reflected in the substantial differences between the numerical scores for some of the methods. For example, DAS, TopPred2, and the PHDhtm series used partial information about 9 of the 36 high-resolution chains for development. For these methods, the scores on the 27 cross-validated high-resolution chains were similar to those for the 36 high-resolution chains (data not shown). However, the per-segment scores for the low-resolution sets differed from those for the high-resolution sets (Tables 2 and 3GoGo, in particular Qok). There are two possible explanations for this: either the low-resolution set contains new motifs, or the low-resolution experiments over- or underassign many helices. Such errors could result in a particularly poor performance in terms of predicting all TM helices correctly. In fact, for the set of 13 proteins for which we had low- and high-resolution experiments, Qok was low (84%, Table 1Go) for the low-resolution experiments. Furthermore, the observation that DAS, TopPred2, and the PHDhtm series got higher per-residue scores on the low-resolution data than on the high-resolution data indicated that the low-resolution assignments might not reflect completely new membrane motifs. Thus, the estimate for these cross-validated methods may be correctly estimated by the high-resolution data set (Table 2Go).

Problems with topology assignments by low-resolution data. The topologies of two proteins were incorrectly assigned by the low-resolution experiments (Table 1Go). These two proteins were (1) PDB: 1EHK:B/SWISS-PROT: COX2_THETH; and (2) PDB: 1EUL:A/SWISS-PROT: ATA2_RABIT. (1) 1EHK:B has one membrane helix and the N terminus is in the periplasm. Thus, PDB annotates the topology IN. In contrast, SWISS-PROT (release 34) annotates COX2_THETH with topology OUT, despite experimental data indicating otherwise (Keightley et al. 1995). Note that the latest SWISS-PROT release still annotates COX2_THETH as OUT. (2) The second pair is more complicated: The old SWISS-PROT release 20 entry for ATCA_RABIT was annotated with 10 membrane helices with topology IN, whereas the PDB structure 1EUL:A has 10 membrane helices with topology OUT. In contrast, the latest SWISS-PROT release for ATA2_RABIT annotates 10 helices, but still assigns the topology as IN according to antibody studies (Moller et al. 1997). However, this experimentally determined topology may be incorrect because of nonspecific antibodies for the N-terminus epitope. Indeed, the experimentalists noted that the antibody against the N terminus was only immunoreactive to the 1–243 N-terminal fragment rather than specific to the N-terminal 12 residues. At the same time, they argued that this antiserum can correctly locate the epitope for residues 1–12 (Juul et al. 1995). They suggested that the N terminus is cytoplasmic, but for other cytosolic loops, the authors observed enhanced antibody reactivities. Additionally, the N terminus may be OUT because after solubilization with C12E6, proteolysis did not drastically increase reactivity of antiserum 1–12. Furthermore, antisera to epitopes on all loop regions of ATA2_RABIT were not tested. Therefore, it would be useful to acquire information of the location of the other loops in ATA2_RABIT to verify the topological orientation of this protein.

All prediction methods missed only helices with weak experimental evidence. None of the helices in the high-resolution set and only three in the low-resolution set were missed by all advanced methods. As described above (in Results), the experiments done for these three proteins were not fully convincing in terms of the assignments of transmembrane helices and topology. This observation suggests implementing a consensus prediction of membrane helices. The potential success of such an approach has been initially tried out by a couple of authors (Promponas et al. 1999; Ikeda et al. 2001). However, these two initial attempts have focused only on advanced methods. Although advanced methods are more accurate than simple hydrophobicity-based methods, they tend to underpredict transmembrane helices, especially for high-resolution structures (Table 2Go). Advanced methods could thus serve as a specificity filter for a consensus method. Using both advanced and simple methods could help to verify low-resolution experimental results from proteolysis and gene fusion.

Not all membrane proteins identified. The only advanced method that predicted all known helical membrane proteins to contain at least one helix was DAS (Table 5Go, false negatives). However, the flip-side of the same coin was that DAS also performed poorly on globular proteins (Table 5Go, false positives). The other extreme was PHDhtm, based on conventional pairwise alignments that performed well in rejecting globular proteins while also missing almost one-fifth of the membrane proteins with the default parameters. Obviously, there is a tradeoff between predicting too many globular as membrane proteins, and too many membrane as globular proteins. Possibly the best compromise was achieved by SOSUI and TMHMM, which missed 6% of the membrane proteins while incorrectly predicting membrane helices in ~1% of all globular proteins. PHDhtm based on PSI-BLAST profiles (PHDpsihtm) reached a similar compromise: 8% of the membrane proteins were missed, and 2% of all globular proteins were mispredicted. Nevertheless, the problem of missing membrane proteins underlines once again that we need better methods that correctly distinguish between globular and membrane proteins.

Dependence of prediction accuracy on number of helices. We did not find any significant difference in the performance between proteins with one and many membrane proteins. In contrast, proteins with <=5 membrane helices (<=5) were predicted more accurately than proteins with more (>6, Fig. 2BGo). Although we could label the difference as significant, we failed to come up with any reasonable explanation for this finding. Readers may speculate that the numerical differences we observe between 6TM and 7TM proteins could be explained by the overabundance of transporters with buried charged residues. However, the number of proteins in each category was too small to validate such a fine-grained distinction.


    Conclusion
 TOP
 Abstract
 Introduction
 Results
 Discussion
 Conclusion
 Materials and methods
 References
 
We also overestimated the performance. Although we spent considerable effort on comparing prediction methods, our comparisons suffered from one crucial problem: We do not have cross-validation data available for all methods. In fact, the only methods for which we had cross-validated results were DAS, PHDhtm, PHDpsihtm, TopPred2, and most of the simple methods using only hydrophobicity scales. Although the overall scores for the advanced methods did not differ substantially between the sets of 27 cross-validated and 36-non-cross-validated high-resolution chains (data not shown), they did differ markedly between the nine chains used for development and the 27 cross-validated chains. This seemingly contradictory result is explained by the simple fact that most high-resolution proteins were not used in the development of these methods. In contrast, the newer prediction methods PRED-TMR, SOSUI, TMHMM, and WW used most and HMMTOP2 used all of the high-resolution chains for development. In fact, we observed two trends: (1) Newer methods were slightly better than older ones (HMMTOP2 was clearly more accurate than HMMTOP1 when tested on a small subset of the data); and (2) methods based on alignments were superior to those based on single sequences; in fact, when switching from using MaxHom (dynamic programming algorithm for conservation weight-based multiple sequence alignment) alignments against SWISS-PROT as input to PHDhtm to using PSI-BLAST alignments against all known sequences (BIG—nonidentical merger of SWISS-PROT and TrEMBL and PDB—and PHDpsiHtm), prediction accuracy increased considerably.

Most methods get most membrane helices, but the type of membrane protein is often wrong. The most common mistake was the under- or overprediction of one transmembrane helix. This appears encouraging in terms of prediction methods, in general. However, membrane predictions are very important in the context of analyzing entire proteomes because the number and orientation of the helices typically reveal aspects about function. In fact, only the very best methods predict all helices and the topology more often correctly than not. We may rightfully argue that present methods are still not good enough. Because both the number of helices and their orientation can easily be altered by engineering (Nilsson and von Heijne 1998; Ota et al. 1998; Monne et al. 1999a,b), the task at hand is, however, not an easy one. These experiments along with our analysis of the conservation of transmembrane helices strongly argue against the view that the number and orientation of membrane helices constitute a "solid reality written into the sequence." Rather, single residue exchanges can alter these macroscopic features. Thus, correct predictions require a precision typically not achieved. Perhaps present methods have reached the maximum possible level of accuracy and the chapter of simply predicting the location and orientation of membrane proteins is closed. With the recent high-resolution structures challenging common assumptions and our present analysis highlighting the number of urgent problems with prediction methods, we strongly doubt this. Therefore, we challenge that the issues elucidated in this investigation have reopened the field rather than closed it.


    Materials and methods
 TOP
 Abstract
 Introduction
 Results
 Discussion
 Conclusion
 Materials and methods
 References
 
Data sets
High-resolution data sets for membrane proteins. We started with a total set of 105 chains from helical membrane proteins for which a high-resolution structure was deposited in PDB (Berman et al. 2000). We identified these as helical membrane proteins according to the excellent up-to-date collection of membrane proteins at http://blanco.biomol.uci.edu (Jayasinghe et al. 2001b).

Low-resolution data sets for membrane proteins. We used an expert-curated set of 165 helical membrane proteins that was collected by Stefan Möller and colleagues (Möller et al. 2000). For all these proteins, good low-resolution experimental evidence about localization was available. For the comparison between high-resolution and low-resolution data, we used the annotations we found about transmembrane helix location in old SWISS-PROT versions released prior to the publication of the high-resolution structures.

High-resolution data set for globular proteins. The EVA server (Eyrich et al. 2001) continuously maintains a sequence-unique subset of PDB proteins. We used the version from July 2001 with 1852 representative protein chains. From that set we first removed all membrane proteins. Then we removed all proteins that were similar to one representative in a SCOP superfamily (Murzin et al. 1995; Lo Conte et al. 2000). Representatives were taken to be the longest proteins in the respective superfamily. This procedure yielded a final set of 616 globular protein chains.

Data set of proteins with known signal peptides. Henrik Nielsen and colleagues at the CBS in Copenhagen keep an up-to-date list of experimentally known signal peptides at their Web site (http://www.cbs.dtu.dk/ftp/signalp/readme). This group also spent considerable effort at defining thresholds for what constitutes redundancy in sets of signal peptides (Nielsen et al. 1996, 1997a). We downloaded a set of 1418 sequence-unique signal peptides from a total list of 2845.

Sequence-unique subsets reduce bias. Many of the proteins for which we have information about TM regions are similar to one another. If we want to analyze prediction methods or simple features such as TM length, this bias is problematic. To reduce the bias from the set of enzymes of known function, we have to first generate all-against-all alignments that capture the bias existing in that set. Then, we have to choose the maximal subset that fulfils the constraint that no pair in that subset is sequence-similar. Technically, we accomplished this objective in the following way. First, a pairwise BLAST (Altschul and Gish 1996) aligned all membrane proteins against each other. Second, the resulting pairs were filtered applying the HSSP-threshold (value {theta} = 0, below) such that all remaining pairs were likely to have similar structures. Third, the resulting families were sorted by number of members and length. Fourth, all pairs were clustered with a simple greedy algorithm starting with the largest and longest families (Hobohm et al. 1992). Note that the threshold chosen roughly translated to "no pair with more than 33% sequence identity over more than 100 residues aligned." In particular, we used the following formula to compile the distance DIST from the HSSP-curve HSSP_PIDE (Rost 1999):



((1))

where PIDE is the percentage pairwise sequence identity (ignoring gaps and insertions). This procedure yielded 36 proteins in the high-resolution set, and 165 proteins in the low-resolution set.

Programs tested
Building multiple alignments. Two different alignment schemes were explored: (1) the dynamic programming method MaxHom (Sander and Schneider 1991), and (2) a profile-based PSI-BLAST (Altschul et al. 1997). The particular protocol for finding similarities with PSI-BLAST applied the usual precautions to avoid drift and pollution (Jones 1999; Przybylski and Rost 2002). Searches were restricted to three iterations, and the iteration parameter (H-value) to 10-10 was set. The search databases were SWISS-PROT (Bairoch and Apweiler 2000) and BIG (SWISS-PROT [Bairoch and Apweiler 2000] + TrEMBL [Bairoch and Apweiler 2000] + PDB [Berman et al. 2000]). To explore the conservation of membrane helices, we filtered all MaxHom alignments according to various distances {theta} (eq. 1Go).

Advanced prediction methods. We referred to prediction methods as advanced when they implement more than simple hydrophobicity scales. We tested the following programs: DAS, HMMTOP (version 2), PHDhtm, PHDpsihtm, PRED-TMR, SOSUI, TMHMM (version 2), and TopPred2. TopPred2 averages the GES-scale of hydrophobicity (Engelman et al. 1986) using a trapezoid window (von Heijne 1992; Sipos and von Heijne 1993). PHDhtm combines a neural network using evolutionary information with a dynamic programming optimization of the final prediction (Rost et al. 1995Rost et al. 1996b). DAS optimizes the use of hydrophobicity plots (Cserzö et al. 1997). SOSUI (Hirokawa et al. 1998) uses a combination of hydrophobicity and amphiphilicity preferences to predict membrane helices. TMHMM is the most advanced, and seemingly most accurate, present method to predict membrane helices (Sonnhammer et al. 1998). It embeds a number of statistical preferences and rules into a hidden Markov model to optimize the prediction of the localization of membrane helices and their orientation (note: similar concepts are used for HMMTOP; Tusnady and Simon 1998). PRED-TMR uses a standard hydrophobicity analysis with emphasis on detecting the ends and beginnings of membrane helices (Pasquier et al. 1999).

Simple methods exclusively based on hydrophobicity scales. We also implemented our in-house prediction methods that simply used various hydrophobicity scales for prediction. In particular, we tested the following scales: A-Cid, normalized hydrophobicity scale for {alpha}-proteins (Cid et al. 1992); Av-Cid, normalized average hydrophobicity scale (Cid et al. 1992); Ben-Tal, Hydrophobicity scale representing free energy of transfer of an amino acid from water into the center of the hydrocarbon region of a model lipid bilayer (Kessel and Ben-Tal 2002); Bull-Breese, Bull-Breese hydrophobicity scale (Bull 1974); Eisenberg, normalized consensus hydrophobicity scale (Eisenberg et al. 1984); EM, Solvation free energy (Eisenberg and McLachlan 1986); Fauchere, hydrophobic parameter {pi} from the partitioning of N-acetyl-amino-acid amides (Fauchere and Pliska 1983); GES, hydrophobicity property (Engelman et al. 1986; Prabhakaran 1990); Heijne, transfer free energy to lipophilic phase (von Heijne and Blomberg 1979); Hopp-Woods, Hopp-Woods hydrophilicity value (Hopp and Woods 1981); KD, Kyte–Doolittle hydropathy index (Kyte and Doolittle 1982); Lawson, transfer free energy (Lawson et al. 1984); Levitt, hydrophobic parameter (Levitt 1976); Nakashima, normalized composition of membrane proteins (Nakashima et al. 1990); Radzicka, transfer free energy from 1-octanol to water (Radzicka and Wolfenden 1988); Roseman, solvation-corrected side-chain hydropathy (Roseman 1988); Sweet, optimal matching hydrophobicity (Sweet and Eisenberg 1983); Wolfenden, hydration potential (Wolfenden et al. 1981); and WW, Wimley–White scale (Jayasinghe et al. 2001a). Replacing the WW scale with each of the above-mentioned hydrophobicity indices, we used the WW algorithm to evaluate the predictive performance of each index.

Measuring accuracy
Measuring per-segment accuracy. The ultimate goal of prediction methods obviously is to correctly predict all residues. Assume a protein with 10 membrane helices of 20 residues each; method A predicts 10 helices but gets the five residues at each end of each helix wrong, and method B misses four helices but gets the ends for the other six entirely right. Which method is better? Possibly, many readers would favor method A. This problem is captured in using two different scores measuring prediction accuracy in the field of globular secondary structure prediction: per-residue scores and per-segment scores (Rost and Sander 1993; Rost et al. 1994). Although globular secondary-structure segments are, on average, rather short (helices ~10 residues, strands ~5 residues), membrane helices are rather long. Consequently, the problem of evaluating the per-segment accuracy allows a more coarse-grained measure than required for globular secondary-structure prediction (Rost et al. 1994; Zemla et al. 1999). There are two separate issues to address when defining a helix to be predicted correctly. The first concerns counting the same helix twice. We used the simple concept of "correctly predicted segment" shown in Figure 4Go.



View larger version (27K):
[in this window]
[in a new window]
 
Fig. 4. Correctly predicted segments. In this example, there are three observed and three predicted helices. Observed helix O1 is correctly predicted by P1 as they overlap. However, observed helix O2 is not correctly predicted because P1 already overlaps with O1. Hence, P1 cannot be used as a correct prediction for O2. Similarly, P2 is counted as correct only with respect to O3, whereas P3 is not since O3 was already predicted by P2.

 
In particular, the observed helix O2 is not correctly predicted, because P1 overlaps already with O1. Similarly, P2 is counted as correct with respect to O3, whereas P3 is not. The second issue concerns the minimal overlap required between the observed and predicted helix. If not stated otherwise, we required a minimal overlap of 3 residues, following the definitions previously used in many other publications (von Heijne 1992; Jones et al. 1994; Persson and Argos 1994; von Heijne 1994; Rost et al. 1995 Rost et al. 1996b; Persson and Argos 1996; Sonnhammer et al. 1998). Möller et al. (2001) used a similar procedure; however, they required an overlap of at least 9 rather than 3 residues. Other groups required a minimal overlap of 1 residue (e.g., Cserzö et al. 1997; Tusnady and Simon 1998). Jayasinghe required an overlap of 9 (Jayasinghe et al. 2001b) and 3 (Jayasinghe et al. 2001a) residues; however, in both publications, they counted the same predicted helix twice, thus yielding 100% accuracy for the overlap between O1/P1 and O2/P2 in Figure 4Go. Yet another measure was introduced by Ikeda et al. (2001): Helices were considered as correctly predicted if the centers