1 Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), New York, New York 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
Reprint requests to: Burkhard Rost, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 W. 168 St., BB217, New York, NY 10032, USA; e-mail: rost{at}columbia.edu; fax: (212) 305-7932.
(RECEIVED May 5, 2002; FINAL REVISION July 22, 2002; ACCEPTED September 16, 2002)
Terminology: Advanced prediction methods, all methods that do not exclusively use a hydrophobicity scale; simple prediction methods, membrane prediction methods exclusively based on hydrophobicity scales.
Formula abbreviations: htm, transmembrane helix; T, residue in transmembrane helix; N, nonmembrane residue.
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0214502.
| Abstract |
|---|
|
|
|---|
Keywords: Sequence analysis; protein structure prediction; multiple alignments, predicting transmembrane helices; comparing genomes; bioinformatics; computational biology; proteomes
Abbreviations: A-Cid, normalized hydrophobicity scale for -proteins (Cid 1992); Av-Cid, normalized average hydrophobicity scale (Cid 1992); Ben-Tal, hydrophobicity scale representing the free energy of transferring an amino acid from water into the center of the hydrocarbon region of a lipid bilayer (Kessel and Ben-Tal 2002); BIG, nonidentical merger of SWISS-PROT (Bairoch and Apweiler 2000) and TrEMBL (Bairoch and Apweiler 2000) and PDB (Berman et al. 2000); BLAST, fast sequence alignment method (Altschul and Gish 1996); Bull-Breese, Bull-Breese hydrophobicity scale (Bull 1974); DSSP, program assigning secondary structure (Kabsch and Sander 1983); Eisenberg, normalized consensus hydrophobicity scale (Eisenberg et al. 1984); EM, solvation free energy (Eisenberg and McLachlan 1986); EVA, server automatically evaluating structure prediction methods (Eyrich et al. 2001a,b); Fauchere, hydrophobic parameter from the partitioning of N-acetyl-amino-acid amides (Fauchere and Pliska 1983); GES, hydrophobicity property (Engelman et al. 1986; Prabhakaran 1990); Heijne, transfer free energy to lipophilic phase (von Heijne and Blomberg 1979); HMM, hidden Markov model; HMMTOP, hidden Markov model predicting transmembrane helices (Tusnady and Simon 1998); Hopp-Woods, Hopp-Woods hydrophilicity value (Hopp and Woods 1981); KD, Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982); Lawson, transfer free energy (Lawson et al. 1984); Levitt, hydrophobic parameter (Levitt 1976); MaxHom, dynamic programming algorithm for conservation weight-based multiple sequence alignment (Sander and Schneider 1991); MEMSAT, dynamic-programming based prediction of transmembrane helices (Jones et al. 1994); META-PP, internet service allowing access to a variety of bioinformatics tools through one single interface (Eyrich and Rost 2000); Nakashima, normalized composition of membrane proteins (Nakashima et al. 1990); PDB, Protein Data Bank of experimentally determined 3D structures of proteins (Bernstein et al. 1977; Berman et al. 2000); PHDhtm, profile-based neural network prediction of transmembrane helices (Rost 1996; Rost et al. 1996b); PHDpsihtm, divergent profile (PSI-BLAST)-based neural network prediction 2002); PSI-BLAST, position-specific iterated database search (Altschul et al. 1997); Radzicka, transfer free energy from 1-octanol to water (Radzicka and Wolfenden 1988); Roseman, solvation-corrected side-chain hydropathy (Roseman 1988); SignalP, signal peptide prediction (Nielsen et al. 1997a); SOSUI, hydrophobicity- and amphiphilicity-based transmembrane helix prediction (Hirokawa et al. 1998); SPLIT, transmembrane helix prediction (Juretic et al. 1998); Sweet, optimal matching hydrophobicity (Sweet and Eisenberg 1983); SWISS-PROT, database of protein sequences (Bairoch and Apweiler 2000); TM, transmembrane; TMAP, alignment-based prediction of transmembrane helices (Persson and Argos 1996); TMH, transmembrane helix; TMHMM, transmembrane prediction using cyclic hidden Markov models (Sonnhammer et al. 1998; Krogh et al. 2001); TMpred, prediction of transmembrane helices (Hofmann and Stoffel 1993); TopPred2, hydrophobicity-based membrane helix prediction (von Heijne 1992; Cserzö et al. 1997); TrEMBL, translation of the EMBL-nucleotide database coding DNA to protein sequences (Bairoch and Apweiler 2000); Wolfenden, hydration potential (Wolfenden et al. 1981); WW, Wimley-White hydrophobicity scale-based method (Wimley et al. 1996a,b; White and Wimley 1999; White 2001).
| Introduction |
|---|
|
|
|---|
Published estimates for membrane helix prediction questioned by recent analyses. Recently, a few groups have questioned the estimated levels of performance for membrane helix prediction methods. Möller, Croning, and Apweiler analyzed 14 prediction methods that did not use alignment information on a set of 188 proteins with experimentally known helices (Möller et al. 2000, 2001). They also applied the prediction methods to globular proteins and to signal peptides. The results supported the following conclusions: (1) The best prediction method (TMHMM, transmembrane prediction using cyclic hidden Markov models) correctly predicts all membrane helices for 52%-69% of all proteins tested. (2) The best distinction between globular and membrane-helical proteins reaches levels of >97% for the globular proteins tested (TMHMM and SOSUI, hydrophobicity- and amphiphilicity-based transmembrane helix prediction). (3) On a set of 34 signal and transit peptide proteins, the best methods reached 98% (PHDhtm, profile-based neural network prediction of transmembrane helices) to 100% (ALOM2) accuracy in distinguishing these from membrane helices. (4) The best simple hydrophobicity index (KD, Kyte-Doolittle hydropathy index; Kyte and Doolittle 1982) correctly predicted all helices for 44% of all the proteins in a set for which HMMTOP (hidden Markov model predicting transmembrane helices; Tusnady and Simon 1998) reached only 43% accuracy. Another recent analysis was based on a set of 145 sequence-unique proteins (Ikeda et al. 2001). The researchers tested 10 prediction methods not using alignment information on their data set. In contrast to Möller et al., the investigators found that HMMTOP was not only much better than the KD hydrophobicity index, but that it was the most accurate prediction method, correctly predicting all membrane helices for 68% of all proteins. Averaging over all 10 methods, the authors found the resulting consensus prediction 10 percentage points more accurate than the best single method. The investigators also claimed that prediction accuracy is higher for prokaryotes than for eukaryotes. They speculated that they found different levels of accuracy than Möller et al. because they used different percentages of prokaryotic proteins in their data sets. Jayasinghe, Hristova, and White analyzed four prediction methods on two different sets of proteins with known membrane helix locations: (1) on 150 high-resolution structures from PDB, and (2) on 242 low-resolution proteins (Jayasinghe et al. 2001b). The researchers found that the results between the high- and low-resolution sets differed marginally and reported that the best methods (PHDhtm and HMMTOP) correctly predict >93%-97% of all helices. This group has also proposed a method based on a novel entropy-based hydrophobicity scale, namely, the Wimley-White scale (WW, Wimley-White hydrophobicity-scale-based method), which is claimed to correctly predict 99% of all membrane helices (Jayasinghe et al. 2001a). One major problem of hydrophobicity-based methods appears to be the poor distinction between membrane and globular proteins (Edelman 1993; Jones et al. 1994; Rost et al. 1995, 1996b; Jayasinghe et al. 2001a; Möller et al. 2001).
Problems with previous analyses. Previous analyses were limited in various ways. (1) Performance on high- and low-resolution data sets was distinguished by neither the Möller nor the Ikeda groups, although it seemed that performance differed between the two (Jayasinghe et al. 2001b). (2) The redundancy in data sets resulting from many copies of very similar proteins was not reduced by the Möller or Jayasinghe groups. However, such bias is known to create problems when estimating the performance of prediction methods (Rost and Sander 1993; Rost et al. 1995, 1996b; Rost 2002). (3) Neither Möller et al. nor Ikeda et al. tested any method based on alignment information, although such methods are known to be more accurate (Rost and Sander 1993; Persson and Argos 1994; Neuwald et al. 1995; Rost et al. 1995; Rost 1996; Johnson and Church 1999). (4) No group explored per-residue along with per-segment-based measures for prediction accuracy. Instead, all groups focused on one particular definition of prediction accuracy; no two groups applied the same definition. (5) No group established levels for significant differences between methods. This makes it impossible to conclude whether or not differences between any two methods are relevant. In general, levels of significant differences typically depend on the data sets and the scores used (Eyrich et al. 2001; Rost and Eyrich 2001; Marti-Renom et al. 2002). (6) Only Möller and coworkers tested proteins with signal peptides; however, their analysis was restricted to a small set of 34 proteins with known signal peptides. (7) No group analyzed more than 14 prediction methods. (8) Generally, prediction accuracy differs significantly between proteins used to develop a method and proteins never seen by a method (Moult et al. 1995, 1997, 1999). For membrane proteins, this effect is very difficult to estimate because few high-resolution structures of membrane proteins are added over the course of a year. Although Möller et al. tried to estimate this effect by analyzing only proteins not used for developing a method, they did not rule out that the proteins tested in the category "not known to the method" were similar to proteins used for development. Surprisingly, Möller et al. found most methods to perform better on proteins not used for development. Given how prediction methods are developed, it is very unlikely that this result holds in general. Either the differences are not significant, or the data sets were not representative (or both).
To resolve these limitations and to standardize membrane helix prediction performance comparisons, we have presented an analysis that distinguished between performance on redundancy-reduced high- and low-resolution data sets, established thresholds for significant differences in performance by introducing a bootstrap experiment, and implemented both per-segment and per-residue analysis of membrane helix predictions. Additionally, we analyzed more methods (8 publicly available advanced prediction methods and 19 different hydrophobicity scales). In particular, we included alignment-based prediction methods. Furthermore, we tested membrane helix prediction methods on a large, representative set of 1418 unique signal peptides and 616 unique globular protein folds taken from SCOP (Lo Conte et al. 2002). Although we confirmed many previous findings, overall our results differed greatly in detail from previous publications.
| Results |
|---|
|
|
|---|
94%-96% of the helices agreed between the two experimental methods, for only 11 of the 13 proteins did all helices overlap between the two experimental methods (Table 1).
No single advanced method best by all scores. The set of 36 high-resolution proteins was small enough to require extreme caution in ranking methods based on numerical differences. When comparing pairwise ranks of the methods according to various scores, we found that no advanced method performed consistently best, and none consistently worst (Fig. 1). Interestingly, TMHMM1 and TopPred2 appeared to be the most representative methods in that the scores for these methods were most often indistinguishable from all other advanced methods in pairwise comparisons. In contrast, DAS appeared to be the most distinctive in that it was often better and often worse than all other methods. Three methods were clearly more often worse than better: WW (5 times better/30 times worse), PRED-TMR (6/23), and SOSUI (7/26). Three methods were clearly more often better than worse: HMMTOP2 (21 times better/1 time worse), PHDpsihtm08 (divergent profile-based neural network prediction of transmembrane helices) (27/2), and PHDhtm08 (20/6).
No significant difference in performance for prokaryotic and eukaryotic proteins. We compared the performance of each method for eukaryotic and prokaryotic proteins. Most methods did not consistently perform better for one group across both the high- and low-resolution data (Table 4, Qok). In fact, the trends differed greatly between the two data sets, and for different measures of prediction accuracy. Whereas prokaryotic proteins were predicted more accurately in terms of per-segment measures for the high-resolution data sets, the opposite was the case for most methods when compared on the low-resolution set. Only four methods had a similar trend in Qok: PRED-TMR predicted eukaryotic proteins more accurately; SOSUI, TopPred2, and WW predicted prokaryotic proteins more accurately for both sets. However, none of the differences exceeded two times the estimated error, that is, none was statistically very significant. All methods predicted topology (TOPO) better for the prokaryotic proteins in the high-resolution set and for the eukaryotic proteins in the low-resolution set. When measuring prediction accuracy in terms of per-residue performance (Q2), we could not find any significant difference between prokaryotic and eukaryotic proteins; all methods did slightly better for eukaryotic proteins for both high- and low-resolution data. Nevertheless, because of the lack of consistent direction of the difference and the lack of statistical significance, our data did not support the previously published conclusion that either prokaryotic or eukaryotic proteins were predicted more accurately.
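As an illustration of a per-residue measure, the following minimal sketch computes a two-state accuracy in the spirit of Q2, using the state letters from the formula abbreviations (T, residue in transmembrane helix; N, nonmembrane residue). The function name `q2` and the string encoding of the states are assumptions made for this example; the exact definition used in the analysis is given in Materials and methods and may differ in detail.

```python
def q2(observed: str, predicted: str) -> float:
    """Percentage of residues whose two-state label (T or N) is predicted correctly.

    Both arguments are strings of equal length over the alphabet {'T', 'N'}.
    This is a generic two-state accuracy, assumed here for illustration.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("empty sequences have no defined accuracy")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


# A predicted helix shifted by one residue relative to the observed helix.
print(q2("NNTTTTNN", "NNNTTTTN"))
```

A per-residue score of this kind rewards partially correct helices, which is why it can rank methods differently from per-segment scores such as Qok.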
20% of the membrane proteins. The only methods that misclassified <10% of the globular proteins and overlooked <10% of the membrane proteins were: SOSUI, TMHMM1, PHDpsihtm, PRED-TMR, and HMMTOP2 (Table 5).
Signal peptides falsely predicted to be membrane helices by most methods. Even the advanced methods had high error rates for signal peptides (Table 6). In fact, one of the most accurate rejections of signal peptides was achieved by the simple method solely using the Wolfenden (Wolfenden et al. 1979) hydrophobicity scale (26% errors). Many of the false predictions were at the very beginning of the respective secreted proteins. Thus, we tested the following simple expert rule: delete all membrane helices predicted between 5 and 10 residues after an N-terminal methionine. For PHDpsihtm08, this reduced the falsely predicted signal peptides from 322 (23%) to 146 (10%). Encouragingly, when we applied the same rule to the set of membrane proteins, no helix was removed by this rule. For three out of the 1418 signal peptides, PHDpsihtm08 incorrectly predicted two transmembrane helices.
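The expert rule can be sketched as a simple post-processing filter. The sketch below is an illustration under stated assumptions, not the procedure actually used: the function name `drop_n_terminal_helices`, the (start, end) helix representation, and the reading of the rule as "discard any helix that starts within the first 10 residues of a chain beginning with methionine" are all choices made for this example.

```python
from typing import List, Tuple

Helix = Tuple[int, int]  # (start, end), 1-based inclusive residue positions


def drop_n_terminal_helices(
    sequence: str, helices: List[Helix], max_start: int = 10
) -> List[Helix]:
    """Discard predicted helices that start close to an N-terminal methionine.

    Assumed reading of the rule: if the chain starts with 'M', remove every
    predicted helix whose first residue lies at position <= max_start.
    """
    if not sequence.startswith("M"):
        return list(helices)
    return [(start, end) for (start, end) in helices if start > max_start]


# Hypothetical secreted protein with one predicted helix near the N terminus.
sequence = "M" + "A" * 60
print(drop_n_terminal_helices(sequence, [(6, 24), (40, 58)]))

# The percentages quoted in the text follow from the raw counts.
print(round(100 * 322 / 1418), round(100 * 146 / 1418))
```

The cutoff is exposed as a parameter because the rule is a heuristic; any other value would have to be checked against both the signal peptide set and the membrane protein set, as was done above.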
| Discussion |
|---|
|
|
|---|
Most methods confuse signal peptides and membrane helices. Möller et al. tested prediction methods on 34 signal and target peptides. They found that most methods incorrectly predicted these regions to contain membrane helices. We tested all 27 methods on 1418 sequence-unique signal peptides. Our results confirmed the previously uncovered trends (Table 6). However, the larger set that we used revealed that TMHMM1, which is one of the best methods in this respect, confuses >30% of the signal peptides with membrane helices rather than <10% as previously estimated (Möller et al. 2001). Most simple methods based only on hydrophobicity scales confused >90% of all the signal peptides with membrane helices (exception: Wolfenden scale, Table 6). The good news was that the error could be reduced by experts who discard all membrane helices predicted closer than 10 residues to an N-terminal methionine. In this best-case scenario, PHDhtm and PHDpsihtm falsely predicted only 10% of the signal peptides as membrane helices. Possibly, combinations of membrane-optimized and signal-peptide-optimized programs could reduce this error rate.
Most methods identify most membrane helices. We confirmed (Ikeda et al. 2001; Jayasinghe et al. 2001b; Möller et al. 2001) that many methods correctly predict most membrane helices (Fig. 2). We also found the most common mistake to be the under- or overprediction of a single transmembrane helix. However, our results differed in detail from previous analyses (see below).
Resolving differences in previous analyses
Some methods are better; none is clearly best. Evaluations of membrane prediction methods are sometimes based on different definitions of prediction accuracy. A particular example is to count a prediction of one long helix as correct although it stretches over two observed helices and thus misses the break in between the two. Another misleading standard procedure is to only report values covering one side of the coin, that is, only the values of correctly predicted as percentage of observed or vice versa. Here, we carefully evaluated all methods on identical data sets and compiled all reasonable scores for prediction accuracy. To reduce the complexity, we focused in our report on a relatively limited number of scores. Another problem with many previous analyses is that investigators have not estimated the error associated with a particular score. For example, from Table 1 we may conclude that HMMTOP2 is much better than TopPred2 when applying any measure for prediction accuracy. Although the numbers differed greatly, a thorough bootstrap experiment revealed that the performance of the two methods was indeed indistinguishable. We compared the methods in a pairwise manner for each score of the high-resolution data set (Fig. 1). Some methods appeared more accurate than others. However, no method(s) performed consistently better than all others by more than one standard error (Fig. 1). Our estimates of error margins explained the numerical differences found between three analyses (Ikeda et al. 2001; Jayasinghe et al. 2001b; Möller et al. 2001).
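To make the role of the error estimate concrete, the following sketch shows one common way to obtain a standard error by bootstrapping per-protein scores. It is a generic illustration only: the function names, the number of resamples, and the rule of comparing a difference in means against the larger of the two standard errors are assumptions made for this example and are not taken from the bootstrap experiment described in Materials and methods.

```python
import random
from statistics import mean, stdev
from typing import Sequence


def bootstrap_standard_error(
    scores: Sequence[float], n_resamples: int = 1000, seed: int = 0
) -> float:
    """Standard deviation of the mean score over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    resampled_means = []
    for _ in range(n_resamples):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(mean(sample))
    return stdev(resampled_means)


def distinguishable(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_errors: float = 1.0,
    n_resamples: int = 1000,
    seed: int = 0,
) -> bool:
    """True if the mean scores differ by more than n_errors standard errors.

    Assumption for this sketch: the larger of the two bootstrap standard
    errors is used as the yardstick.
    """
    difference = abs(mean(scores_a) - mean(scores_b))
    error = max(
        bootstrap_standard_error(scores_a, n_resamples, seed),
        bootstrap_standard_error(scores_b, n_resamples, seed),
    )
    return difference > n_errors * error


# Hypothetical per-protein outcomes (1 = all helices correct) for 36 proteins.
method_a = [1.0] * 24 + [0.0] * 12
method_b = [1.0] * 22 + [0.0] * 14
print(bootstrap_standard_error(method_a), distinguishable(method_a, method_b))
```

With only 36 proteins, the standard error of a percentage score is several percentage points, which is why numerically different scores can remain indistinguishable.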
Simple hydrophobicity-based methods less accurate than advanced methods. Möller et al. (2001) suggested that simple hydrophobicity scale-based methods predict membrane helices almost as accurately as the best advanced methods. We could not confirm this proposition. In contrast, we found that the best advanced methods were significantly more accurate than the best hydrophobicity-scale-based methods, both in terms of per-segment and per-residue accuracy (Tables 2 and 3). The only possible exception may be the per-residue performance of the Ben-Tal scale for the low-resolution data (Table 3). However, we did confirm that, because of overprediction, a few hydrophobicity-scale-based methods identify the observed membrane helices at a level of accuracy similar to that of advanced methods (Qhtm%obs in Tables 2 and 3). Jayasinghe et al. found that the WW hydrophobicity scale-based method that they introduced outperformed even the best advanced methods ("We find that [the] WW scale ... identifies TM helices of membrane proteins with an accuracy greater than 99%"; Jayasinghe et al. 2001a). We also could not confirm this finding, no matter which definition of prediction accuracy we compared. Nevertheless, the major problem with simple hydrophobicity-based methods is their failure on globular proteins (Table 5) and signal peptides (Table 6). In fact, the error of hydrophobicity scales depends on the length of the protein. For example, the high-resolution chains had an average length of 215 residues, whereas low-resolution proteins were, on average, 420 residues long. Although hydrophobicity scales correctly predicted all helices in 28%-65% of the short proteins (Table 2), they only detected 5%-29% for the long proteins (Table 3). In particular, the scale that performed best on the high-resolution set (KD) dropped in accuracy from 65% (high) to 13% (low), whereas the scale that performed most poorly on the short proteins in the high-resolution data (Wolfenden) became best for the long proteins in the low-resolution data. The Wolfenden scale also performed relatively well on globular proteins (Table 5) and on signal peptides (Table 6). The price for the lack of overprediction is a low accuracy in detecting membrane helices (underprediction). Overall, the most successful hydrophobicity scale appeared to be the Ben-Tal scale, which is based on the free energy of transferring an amino acid from water into the center of the hydrocarbon region of a lipid bilayer (Kessel and Ben-Tal 2002). It outperformed the Wolfenden scale for membrane proteins and for globular proteins, and it bested all other scales for the low-resolution set. Simple hydrophobicity scales obviously have tremendous importance for sequence analysis. However, using them as the only criterion to predict membrane helices appears to be a bad idea.
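For readers unfamiliar with simple prediction methods, the following sketch shows how a hydrophobicity scale is typically turned into a helix prediction with a sliding window. The scale values are the published Kyte-Doolittle hydropathy indices; the window length of 19 residues and the cutoff of 1.6 are the values commonly quoted for that scale and are assumptions here, not necessarily the parameters used in this analysis. The function names are hypothetical.

```python
from typing import List, Tuple

# Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_averages(sequence: str, window: int = 19) -> List[float]:
    """Mean hydropathy of every full-length window along the sequence."""
    values = [KD[residue] for residue in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def predict_helices(
    sequence: str, window: int = 19, threshold: float = 1.6
) -> List[Tuple[int, int]]:
    """Return (start, end) segments, 1-based inclusive, covered by windows
    whose mean hydropathy exceeds the threshold (assumed cutoff)."""
    in_membrane = [False] * len(sequence)
    for i, average in enumerate(window_averages(sequence, window)):
        if average > threshold:
            for j in range(i, i + window):
                in_membrane[j] = True

    # Merge consecutive membrane residues into segments.
    segments = []
    start = None
    for position, flag in enumerate(in_membrane, start=1):
        if flag and start is None:
            start = position
        elif not flag and start is not None:
            segments.append((start, position - 1))
            start = None
    if start is not None:
        segments.append((start, len(sequence)))
    return segments


# Hypothetical sequence: a 20-residue hydrophobic stretch flanked by charged residues.
print(predict_helices("D" * 20 + "L" * 20 + "D" * 20))
```

Because every residue covered by a high-scoring window is labeled as membrane, the predicted segment extends beyond the hydrophobic stretch itself. Nothing in the rule restricts it to membrane proteins, so hydrophobic stretches in globular proteins or signal peptides are labeled the same way.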
Incorrect ranking by per-segment accuracy depends on definition of score. As discussed above, any attempt to rank prediction methods should account for the standard error in the estimated level of accuracy. A particular illustration of this finding is that different definitions of the accuracy in correctly predicting all helices (eq. 4) would slightly alter the ranks. For example, DAS scored worst among all advanced methods when an overlap of at least nine residues was required to consider a helix correctly predicted (definition introduced by Möller et al. 2001), but it appeared to be the third-best of all advanced methods when we applied the definition introduced by Ikeda et al. (2001) (see Supplementary Table 1; available online at http://www.proteinscience.org). When giving different ranks only for significant differences, this apparent contradiction was resolved. Most averages were relatively insensitive to whether we required an overlap of 3 or 9 residues between predicted and observed helix (Qok3 and Qok9 in Supplementary Table 1). However, contrary to what has been claimed previously, some methods had lower averages when requiring nine overlapping residues. Similarly, for most methods the average scores did not change considerably when using the definition of Ikeda et al. (Qok11Centre in Supplementary Table 1). However, although the score was lower for most methods for which it differed from the other two, for a few it was actually higher. These were methods that tended to underpredict helices. Overall, the dependence of ranking on the definition of the score used underscored the need to standardize evaluations.
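The effect of the overlap requirement can be illustrated with a small sketch of a per-segment score in the spirit of Qok3 and Qok9. The sketch assumes that a protein counts as correct only if predicted and observed helices pair up one-to-one, each pair overlapping by at least a minimum number of residues; the function names and the data layout are hypothetical, and the formal definition (eq. 4) may differ in detail.

```python
from typing import List, Sequence, Tuple

Helix = Tuple[int, int]  # (start, end), 1-based inclusive residue positions


def overlap(a: Helix, b: Helix) -> int:
    """Number of residues shared by two segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def all_helices_correct(
    observed: List[Helix], predicted: List[Helix], min_overlap: int = 3
) -> bool:
    """True if helices pair up one-to-one with sufficient overlap.

    Assumes that helices within each list do not overlap one another, so
    pairing them in sequence order is sufficient.
    """
    if len(observed) != len(predicted):
        return False  # a helix was missed, merged, or overpredicted
    pairs = zip(sorted(observed), sorted(predicted))
    return all(overlap(o, p) >= min_overlap for o, p in pairs)


def q_ok(
    proteins: Sequence[Tuple[List[Helix], List[Helix]]], min_overlap: int = 3
) -> float:
    """Percentage of proteins for which all helices are predicted correctly."""
    correct = sum(all_helices_correct(o, p, min_overlap) for o, p in proteins)
    return 100.0 * correct / len(proteins)


# Three hypothetical proteins as (observed, predicted) helix lists.
proteins = [
    ([(10, 30), (50, 70)], [(12, 28), (55, 75)]),  # both helices found
    ([(10, 30), (50, 70)], [(10, 70)]),            # two helices merged into one
    ([(10, 30)], [(25, 45)]),                      # only 6 residues overlap
]
print(q_ok(proteins, min_overlap=3), q_ok(proteins, min_overlap=9))
```

In this toy set, the third protein is counted as correct under the 3-residue requirement but not under the 9-residue requirement, whereas the merged helix in the second protein is rejected under both.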
Similar prediction accuracy for prokaryotic and eukaryotic membrane proteins. Ikeda et al. (2001) found that prediction methods are consistently worse at predicting membrane proteins from eukaryotes than those from prokaryotes. We could not verify this finding. Both for the high- and for the low-resolution data sets, we found that some methods reached slightly higher levels on one than on the other (Table 4). However, the differences were not significant.
Novel findings
Low-resolution experiments not much more accurate than prediction methods. The low-resolution experiments differed substantially in their assignments of membrane helices from high-resolution experiments. In fact, for a small subset of 13 high-resolution chains, many prediction methods appeared to be as correct, or as incorrect, as previously deposited low-resolution experiments (Table 1). This problem was also reflected in the substantial differences between the numerical scores for some of the methods. For example, DAS, TopPred2, and the PHDhtm series used partial information about 9 of the 36 high-resolution chains for development. For these methods, the scores on the 27 cross-validated high-resolution chains were similar to those for the 36 high-resolution chains (data not shown). However, the per-segment scores for the low-resolution sets differed from those for the high-resolution sets (Tables 2 and 3, in particular Qok). There are two possible explanations for this: either the low-resolution set contains new motifs, or the low-resolution experiments over- or underassign many helices. Such errors could result in a particularly poor performance in terms of predicting all TM helices correctly. In fact, for the set of 13 proteins for which we had low- and high-resolution experiments, Qok was low (84%, Table 1) for the low-resolution experiments. Furthermore, the observation that DAS, TopPred2, and the PHDhtm series got higher per-residue scores on the low-resolution data than on the high-resolution data indicated that the low-resolution assignments might not reflect completely new membrane motifs. Thus, the performance of these cross-validated methods may be correctly estimated by the high-resolution data set (Table 2).
Problems with topology assignments by low-resolution data. The topologies of two proteins were incorrectly assigned by the low-resolution experiments (Table 1). These two proteins were (1) PDB: 1EHK:B/SWISS-PROT: COX2_THETH; and (2) PDB: 1EUL:A/SWISS-PROT: ATA2_RABIT. (1) 1EHK:B has one membrane helix and the N terminus is in the periplasm. Thus, PDB annotates the topology IN. In contrast, SWISS-PROT (release 34) annotates COX2_THETH with topology OUT, despite experimental data indicating otherwise (Keightley et al. 1995). Note that the latest SWISS-PROT release still annotates COX2_THETH as OUT. (2) The second pair is more complicated: The old SWISS-PROT release 20 entry for ATCA_RABIT was annotated with 10 membrane helices with topology IN, whereas the PDB structure 1EUL:A has 10 membrane helices with topology OUT. The latest SWISS-PROT release for ATA2_RABIT also annotates 10 helices, but still assigns the topology as IN according to antibody studies (Moller et al. 1997). However, this experimentally determined topology may be incorrect because of nonspecific antibodies for the N-terminus epitope. Indeed, the experimentalists noted that the antibody against the N terminus was only immunoreactive to the 1243 N-terminal fragment rather than specific to the N-terminal 12 residues. At the same time, they argued that this antiserum can correctly locate the epitope for residues 1-12 (Juul et al. 1995). They suggested that the N terminus is cytoplasmic, but for other cytosolic loops, the authors observed enhanced antibody reactivities. Additionally, the N terminus may be OUT because after solubilization with C12E6, proteolysis did not drastically increase reactivity of antiserum 1-12. Furthermore, antisera to epitopes on all loop regions of ATA2_RABIT were not tested. Therefore, it would be useful to acquire information on the location of the other loops in ATA2_RABIT to verify the topological orientation of this protein.
All prediction methods missed only helices with weak experimental evidence. None of the helices in the high-resolution set and only three in the low-resolution set were missed by all advanced methods. As described above (in Results), the experiments done for these three proteins were not fully convincing in terms of the assignments of transmembrane helices and topology. This observation suggests implementing a consensus prediction of membrane helices. Such an approach has already been tried by a couple of groups (Promponas et al. 1999; Ikeda et al. 2001). However, these two initial attempts have focused only on advanced methods. Although advanced methods are more accurate than simple hydrophobicity-based methods, they tend to underpredict transmembrane helices, especially for high-resolution structures (Table 2). Advanced methods could thus serve as a specificity filter for a consensus method. Using both advanced and simple methods could help to verify low-resolution experimental results from proteolysis and gene fusion.
Not all membrane proteins identified. The only advanced method that predicted all known helical membrane proteins to contain at least one helix was DAS (Table 5, false negatives). However, the flip side of the same coin was that DAS also performed poorly on globular proteins (Table 5, false positives). The other extreme was PHDhtm based on conventional pairwise alignments, which performed well in rejecting globular proteins while also missing almost one-fifth of the membrane proteins with the default parameters. Obviously, there is a tradeoff between predicting too many globular proteins as membrane proteins, and too many membrane proteins as globular proteins. Possibly the best compromise was achieved by SOSUI and TMHMM, which missed 6% of the membrane proteins while incorrectly predicting membrane helices in 1% of all globular proteins. PHDhtm based on PSI-BLAST profiles (PHDpsihtm) reached a similar compromise: 8% of the membrane proteins were missed, and 2% of all globular proteins were mispredicted. Nevertheless, the problem of missing membrane proteins underlines once again that we need better methods that correctly distinguish between globular and membrane proteins.
Dependence of prediction accuracy on number of helices. We did not find any significant difference in the performance between proteins with one and many membrane helices. In contrast, proteins with 5 membrane helices (5) were predicted more accurately than proteins with more (>6, Fig. 2B). Although we could label the difference as significant, we failed to come up with any reasonable explanation for this finding. Readers may speculate that the numerical differences we observe between 6TM and 7TM proteins could be explained by the overabundance of transporters with buried charged residues. However, the number of proteins in each category was too small to validate such a fine-grained distinction.
| Conclusion |
|---|
|
|
|---|
Most methods get most membrane helices, but the type of membrane protein is often wrong. The most common mistake was the under- or overprediction of one transmembrane helix. This appears encouraging in terms of prediction methods, in general. However, membrane predictions are very important in the context of analyzing entire proteomes because the number and orientation of the helices typically reveal aspects about function. In fact, only the very best methods predict all helices and the topology more often correctly than not. We may rightfully argue that present methods are still not good enough. Because both the number of helices and their orientation can easily be altered by engineering (Nilsson and von Heijne 1998; Ota et al. 1998; Monne et al. 1999a,b), the task at hand is, however, not an easy one. These experiments, along with our analysis of the conservation of transmembrane helices, strongly argue against the view that the number and orientation of membrane helices constitute a "solid reality written into the sequence." Rather, single residue exchanges can alter these macroscopic features. Thus, correct predictions require a precision typically not achieved. Perhaps present methods have reached the maximum possible level of accuracy and the chapter of simply predicting the location and orientation of membrane helices is closed. With the recent high-resolution structures challenging common assumptions and our present analysis highlighting a number of urgent problems with prediction methods, we strongly doubt this. Therefore, we contend that the issues elucidated in this investigation have reopened the field rather than closed it.
| Materials and methods |
|---|
Low-resolution data sets for membrane proteins. We used an expert-curated set of 165 helical membrane proteins that was collected by Stefan Möller and colleagues (Möller et al. 2000). For all these proteins, good low-resolution experimental evidence about localization was available. For the comparison between high-resolution and low-resolution data, we used the annotations we found about transmembrane helix location in old SWISS-PROT versions released prior to the publication of the high-resolution structures.
High-resolution data set for globular proteins. The EVA server (Eyrich et al. 2001) continuously maintains a sequence-unique subset of PDB proteins. We used the version from July 2001 with 1852 representative protein chains. From that set we first removed all membrane proteins. Then we removed all proteins that were similar to one representative in a SCOP superfamily (Murzin et al. 1995; Lo Conte et al. 2000). Representatives were taken to be the longest proteins in the respective superfamily. This procedure yielded a final set of 616 globular protein chains.
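The selection of one representative per superfamily described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `longest_per_superfamily`, the tuple-based input format, and the chain identifiers are hypothetical, and the sketch ignores how chains without a SCOP assignment were handled.

```python
def longest_per_superfamily(chains):
    """Keep the longest chain of each superfamily.

    chains: iterable of (chain_id, superfamily_id, length) tuples
            (hypothetical input format chosen for this sketch).
    Returns the sorted list of chain ids kept as representatives.
    """
    best = {}  # superfamily_id -> (chain_id, length) of the longest chain seen so far
    for chain_id, superfamily, length in chains:
        current = best.get(superfamily)
        if current is None or length > current[1]:
            best[superfamily] = (chain_id, length)
    return sorted(chain_id for chain_id, _ in best.values())


# Hypothetical toy input: two superfamilies, three chains.
example_chains = [
    ("chain_1", "a.1.1", 141),
    ("chain_2", "a.1.1", 153),
    ("chain_3", "c.37.1", 214),
]
representatives = longest_per_superfamily(example_chains)
print(representatives)
```

Ties in length are resolved in favor of the chain seen first, which is an arbitrary choice of this sketch.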
Data set of proteins with known signal peptides. Henrik Nielsen and colleagues at the CBS in Copenhagen keep an up-to-date list of experimentally known signal peptides at their Web site (http://www.cbs.dtu.dk/ftp/signalp/readme). This group also put considerable effort into defining thresholds for what constitutes redundancy in sets of signal peptides (Nielsen et al. 1996, 1997a). We downloaded a set of 1418 sequence-unique signal peptides from a total list of 2845.
Sequence-unique subsets reduce bias. Many of the proteins for which we have information about TM regions are similar to one another. If we want to analyze prediction methods or simple features such as TM length, this bias is problematic. To reduce the bias in the set of membrane proteins, we first have to generate all-against-all alignments that capture the bias existing in that set. Then, we have to choose the maximal subset that fulfils the constraint that no pair in that subset is sequence-similar. Technically, we accomplished this objective in the following way. First, a pairwise BLAST (Altschul and Gish 1996) aligned all membrane proteins against each other. Second, the resulting pairs were filtered by applying the HSSP-threshold (DIST = 0, see below) such that all remaining pairs were likely to have similar structures. Third, the resulting families were sorted by number of members and length. Fourth, all pairs were clustered with a simple greedy algorithm starting with the largest and longest families (Hobohm et al. 1992). Note that the threshold chosen roughly translated to "no pair with more than 33% sequence identity over more than 100 residues aligned." In particular, we used the following formula to compile the distance DIST from the HSSP-curve HSSP_PIDE (Rost 1999):
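$$
\mathrm{DIST} = \mathrm{PIDE} - \mathrm{HSSP\_PIDE}(L), \qquad
\mathrm{HSSP\_PIDE}(L) =
\begin{cases}
100 & L \le 11 \\
480 \cdot L^{-0.32\,\left(1 + e^{-L/1000}\right)} & 11 < L \le 450 \\
19.5 & L > 450
\end{cases}
\tag{1}
$$
where PIDE is the percentage pairwise sequence identity (ignoring gaps and insertions) and L is the number of residues aligned. This procedure yielded 36 proteins in the high-resolution set, and 165 proteins in the low-resolution set.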
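The four steps above (pairwise alignment, filtering by distance to the HSSP-curve, sorting, greedy selection) can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the curve in `hssp_curve` is the form commonly quoted from Rost (1999) and is assumed here, the pairwise hits are assumed to be precomputed (for example, parsed from BLAST output), and all function names and the toy data are hypothetical.

```python
import math


def hssp_curve(n_aligned):
    """Identity threshold (%) for an alignment of n_aligned residues.

    Assumed form of the HSSP-curve as commonly quoted from Rost (1999).
    """
    if n_aligned <= 11:
        return 100.0
    if n_aligned <= 450:
        return 480.0 * n_aligned ** (-0.32 * (1.0 + math.exp(-n_aligned / 1000.0)))
    return 19.5


def hssp_distance(pide, n_aligned):
    """Distance of a pair from the curve; values >= 0 are treated as 'similar'."""
    return pide - hssp_curve(n_aligned)


def greedy_unique_subset(lengths, hits, threshold=0.0):
    """Greedy selection of a subset without similar pairs.

    lengths: {protein_id: sequence length}
    hits:    iterable of (id_a, id_b, pide, n_aligned) from pairwise alignments
    Proteins with the most neighbours (then the longest) are picked first,
    and all their neighbours are excluded (assumed reading of the text).
    """
    neighbours = {pid: set() for pid in lengths}
    for a, b, pide, n_aligned in hits:
        if a != b and hssp_distance(pide, n_aligned) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    order = sorted(lengths, key=lambda p: (-len(neighbours[p]), -lengths[p], p))
    kept, excluded = [], set()
    for pid in order:
        if pid in excluded:
            continue
        kept.append(pid)
        excluded.update(neighbours[pid])
    return kept


# Hypothetical toy data: P2 and P3 are both similar to P1; P4 is unrelated.
toy_lengths = {"P1": 300, "P2": 280, "P3": 310, "P4": 150}
toy_hits = [("P1", "P2", 45.0, 250), ("P1", "P3", 38.0, 280), ("P2", "P4", 15.0, 120)]
print(greedy_unique_subset(toy_lengths, toy_hits))
```

A greedy pass of this kind guarantees that no two selected proteins are similar, but it does not guarantee that the selected subset is the largest possible one.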
Programs tested
Building multiple alignments. Two different alignment schemes were explored: (1) the dynamic programming method MaxHom (Sander and Schneider 1991), and (2) the profile-based PSI-BLAST (Altschul et al. 1997). The particular protocol for finding similarities with PSI-BLAST applied the usual precautions to avoid drift and pollution (Jones 1999; Przybylski and Rost 2002). Searches were restricted to three iterations, and the iteration parameter (H-value) was set to 10⁻¹⁰. The search databases were SWISS-PROT (Bairoch and Apweiler 2000) and BIG (SWISS-PROT [Bairoch and Apweiler 2000] + TrEMBL [Bairoch and Apweiler 2000] + PDB [Berman et al. 2000]). To explore the conservation of membrane helices, we filtered all MaxHom alignments according to various distances (eq. 1).
Advanced prediction methods. We referred to prediction methods as advanced when they implement more than simple hydrophobicity scales. We tested the following programs: DAS, HMMTOP (version 2), PHDhtm, PHDpsihtm, PRED-TMR, SOSUI, TMHMM (version 2), and TopPred2. TopPred2 averages the GES-scale of hydrophobicity (Engelman et al. 1986) using a trapezoid window (von Heijne 1992; Sipos and von Heijne 1993). PHDhtm combines a neural network using evolutionary information with a dynamic programming optimization of the final prediction (Rost et al. 1995, 1996b). DAS optimizes the use of hydrophobicity plots (Cserzö et al. 1997). SOSUI (Hirokawa et al. 1998) uses a combination of hydrophobicity and amphiphilicity preferences to predict membrane helices. TMHMM is currently the most advanced, and seemingly most accurate, method for predicting membrane helices (Sonnhammer et al. 1998). It embeds a number of statistical preferences and rules into a hidden Markov model to optimize the prediction of the localization of membrane helices and their orientation (note: similar concepts are used for HMMTOP; Tusnady and Simon 1998). PRED-TMR uses a standard hydrophobicity analysis with emphasis on detecting the ends and beginnings of membrane helices (Pasquier et al. 1999).
Simple methods exclusively based on hydrophobicity scales. We also implemented in-house prediction methods that simply use various hydrophobicity scales for prediction. In particular, we tested the following scales: A-Cid, normalized hydrophobicity scale for α-proteins (Cid et al. 1992); Av-Cid, normalized average hydrophobicity scale (Cid et al. 1992); Ben-Tal, hydrophobicity scale representing the free energy of transfer of an amino acid from water into the center of the hydrocarbon region of a model lipid bilayer (Kessel and Ben-Tal 2002); Bull-Breese, Bull-Breese hydrophobicity scale (Bull 1974); Eisenberg, normalized consensus hydrophobicity scale (Eisenberg et al. 1984); EM, solvation free energy (Eisenberg and McLachlan 1986); Fauchere, hydrophobic parameter π from the partitioning of N-acetyl-amino-acid amides (Fauchere and Pliska 1983); GES, hydrophobicity property (Engelman et al. 1986; Prabhakaran 1990); Heijne, transfer free energy to lipophilic phase (von Heijne and Blomberg 1979); Hopp-Woods, Hopp-Woods hydrophilicity value (Hopp and Woods 1981); KD, Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982); Lawson, transfer free energy (Lawson et al. 1984); Levitt, hydrophobic parameter (Levitt 1976); Nakashima, normalized composition of membrane proteins (Nakashima et al. 1990); Radzicka, transfer free energy from 1-octanol to water (Radzicka and Wolfenden 1988); Roseman, solvation-corrected side-chain hydropathy (Roseman 1988); Sweet, optimal matching hydrophobicity (Sweet and Eisenberg 1983); Wolfenden, hydration potential (Wolfenden et al. 1981); and WW, Wimley-White scale (Jayasinghe et al. 2001a). To evaluate the predictive performance of each index, we ran the WW algorithm with the WW scale replaced by the respective index.
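To illustrate what a prediction based exclusively on a hydrophobicity scale looks like, the following sketch averages the Kyte-Doolittle index over a sliding window and reports stretches covered by windows above a cutoff. It is not the WW algorithm used in this work, whose details are not reproduced here; the window length of 19, the cutoff of 1.6, the function name, and the test sequence are assumptions chosen for this example. Any other scale can be swapped in as a dictionary with the same keys.

```python
# Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def predict_helices(seq, scale=KD, window=19, threshold=1.6):
    """Generic sliding-window hydropathy prediction (illustrative only).

    Every full window whose mean hydropathy reaches the threshold marks all
    of its residues as membrane; consecutive marked residues form a segment.
    Returns a list of (start, end) pairs, 0-based with end exclusive.
    Assumes seq contains only the 20 standard one-letter codes.
    """
    marked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        values = [scale[aa] for aa in seq[start:start + window]]
        if sum(values) / window >= threshold:
            for i in range(start, start + window):
                marked[i] = True
    segments, seg_start = [], None
    for i, flag in enumerate(marked + [False]):  # sentinel closes a final segment
        if flag and seg_start is None:
            seg_start = i
        elif not flag and seg_start is not None:
            segments.append((seg_start, i))
            seg_start = None
    return segments


# Hypothetical sequence: a 20-residue hydrophobic stretch between polar flanks.
toy_seq = "MKTDEERSGK" + "LLIVAVLLGALIVFLLAVGL" + "RKDEESGNPQ"
print(predict_helices(toy_seq))
```

Because whole windows are marked, predicted segments are never shorter than the window and tend to extend into the flanking regions; methods that trim the ends would need an additional step.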
Measuring accuracy
Measuring per-segment accuracy. The ultimate goal of prediction methods obviously is to correctly predict all residues. Assume a protein with 10 membrane helices of 20 residues each; method A predicts 10 helices but gets the five residues at each end of each helix wrong, and method B misses four helices but gets the ends for the other six entirely right. Which method is better? Possibly, many readers would favor method A. This problem is captured by two different scores used to measure prediction accuracy in the field of globular secondary structure prediction: per-residue scores and per-segment scores (Rost and Sander 1993; Rost et al. 1994). Although globular secondary-structure segments are, on average, rather short (helices about 10 residues, strands about 5 residues), membrane helices are rather long. Consequently, per-segment accuracy can be evaluated with a more coarse-grained measure than that required for globular secondary-structure prediction (Rost et al. 1994; Zemla et al. 1999). There are two separate issues to address when defining a helix to be predicted correctly. The first concerns counting the same helix twice. We used the simple concept of "correctly predicted segment" shown in Figure 4.
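The example of methods A and B can be worked through numerically. The sketch below is an illustration under stated assumptions rather than the scoring code used in this work: method A is taken to predict only the central 10 residues of every helix, a segment counts as correctly predicted when it overlaps a predicted helix by at least `min_overlap` residues (a hypothetical parameter, not the definition from Figure 4), and each predicted helix may be matched only once so that no helix is counted twice.

```python
def per_residue_recall(observed, predicted):
    """Fraction of observed membrane residues that are also predicted."""
    obs = {i for start, end in observed for i in range(start, end)}
    pred = {i for start, end in predicted for i in range(start, end)}
    return len(obs & pred) / len(obs)


def per_segment_recall(observed, predicted, min_overlap=3):
    """Fraction of observed helices matched by a distinct predicted helix.

    Segments are (start, end) pairs, 0-based with end exclusive.
    min_overlap is an assumed cutoff chosen for this sketch.
    """
    used = set()  # predicted helices already matched (avoids double counting)
    correct = 0
    for obs_start, obs_end in observed:
        for j, (pred_start, pred_end) in enumerate(predicted):
            if j in used:
                continue
            overlap = min(obs_end, pred_end) - max(obs_start, pred_start)
            if overlap >= min_overlap:
                used.add(j)
                correct += 1
                break
    return correct / len(observed)


# Ten observed helices of 20 residues each, separated by 20-residue loops.
observed = [(k * 40, k * 40 + 20) for k in range(10)]
method_a = [(start + 5, end - 5) for start, end in observed]  # ends wrong
method_b = observed[:6]                                       # four helices missed

for name, prediction in (("A", method_a), ("B", method_b)):
    print(name,
          per_residue_recall(observed, prediction),
          per_segment_recall(observed, prediction))
```

Under these assumptions, method A scores lower per residue (100 of 200 membrane residues) than method B (120 of 200), yet it recovers all 10 helices whereas method B recovers only 6, which is why the two types of score can rank methods differently.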
|