1 Stockholm Bioinformatics Center, SCFAB, Stockholm University, SE-10691, Stockholm, Sweden
2 Torhoutse steenweg 238, Brügge 8200, Belgium
3 Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, SE-10691 Stockholm, Sweden
Reprint requests to: Arne Elofsson, Stockholm Bioinformatics Center, Stockholm University, SCFAB, SE-10691 Stockholm, Sweden; e-mail: arne{at}sbc.su.se; fax: 46-8-5537-8214.
(RECEIVED September 28, 2001; FINAL REVISION November 28, 2001; ACCEPTED November 28, 2001)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.39402.
| Abstract |
|---|
Keywords: Membrane proteins; topology prediction; bioinformatics; homology search; threading
| Introduction |
|---|
Many algorithms have been developed to increase the sensitivity and specificity of homology searches for globular proteins. These algorithms often use evolutionary and structural information to improve the detection of related proteins. However, they may not be generally applicable to membrane proteins, as membrane proteins have different structural features (von Heijne 1981) and different amino acid composition and residue exchangeabilities (Tourasse and Li 2000). Helical integral membrane proteins account for 20%-25% of all proteins encoded in a typical genome (Krogh et al. 2001), and their central role in many cellular processes makes it important to improve the detection of related membrane proteins. Here we present modifications of the standard Smith-Waterman and profile search algorithms that increase the specificity and sensitivity of homology searches for membrane proteins.
The detection of globular proteins can be improved by including information from secondary structure predictions (Fischer and Eisenberg 1996; Rice and Eisenberg 1997; Rost et al. 1997; Hargbo and Elofsson 1999). To our knowledge, similar schemes have not been described for integral membrane proteins (for which classical secondary structure prediction methods do not work) (Wallace et al. 1986). Considering that membrane protein topology predictions are much more accurate than secondary structure prediction in globular proteins (Krogh et al. 2001), we have tested whether such predictions can be used to improve homology searches of membrane proteins.
For helical membrane proteins (White and Wimley 1999; Popot and Engelman 2000), topology predictions provide secondary structure information, that is, they pinpoint likely transmembrane α-helical segments. We have thus extended the classical Smith-Waterman (Smith and Waterman 1981) and profile (Gribskov et al. 1987) search algorithms by including helix predictions from the topology prediction program TMHMM (Krogh et al. 2001). This resembles the use of secondary structure predictions in threading methods (Fischer and Eisenberg 1996; Rost et al. 1997). However, there are two differences: first, information from one of the best-performing membrane protein topology prediction methods, TMHMM (Krogh et al. 2001), is used, and second, no information about the true secondary structure is used; instead, we match a prediction against a prediction. Further, we have tested a similar modification of profile searches (Gribskov et al. 1987) to detect related membrane proteins.
One problem during development and fine-tuning of the methods used in database searches is the need to know the true relationship between the different proteins in a test set. During the last few years, several studies have proposed new ways to evaluate methods for detecting relationships between proteins (Abagyan and Batalov 1997; Brenner et al. 1998; Park et al. 1998; Salamov et al. 1999; Lindahl and Elofsson 2000). These studies differ in detail but have a common theme: they use an existing structural classification to create the benchmark used for evaluating the performance of different search methods. The use of structural protein-family databases such as SCOP (Murzin et al. 1995) and CATH (Orengo et al. 1997) has enabled the creation of test sets in which the true relationship can be quite accurately assumed. However, for membrane proteins, no such high-quality test sets based on 3-D structural databases exist so far. To circumvent this problem, we have chosen to use the GPCRDB database (Horn et al. 2001). GPCRDB purportedly includes all known and predicted 7-TM receptors, and we thus assume that all proteins in GPCRDB are 7-TM receptors and that all other proteins found in SWISS-PROT (Bairoch and Apweiler 1996) but not in GPCRDB are not 7-TM receptors. A similar benchmark was used recently by Rehmsmeier and coworkers (Muller et al. 2001). That study showed that a nonsymmetric score matrix performed better than a standard (symmetric) substitution matrix for helical membrane proteins. However, no comparison with multiple sequence-based methods, such as PSI-BLAST (Altschul et al. 1997), was made.
In contrast, we now report that the inclusion of information about predicted transmembrane segments leads to a significant improvement over standard sequence-alignment methods, including the iterative multiple sequence-alignment method PSI-BLAST.
| Results |
|---|
Our test sets (see Table 1) were derived from the GPCRDB database as described in the Materials and Methods section. The classification in GPCRDB is based on classes, that is, proteins with broadly similar function and rather close sequence homology. The whole GPCRDB can be considered as the superfamily of G-protein-coupled receptors. We have used both of these levels for the tests described below.
To study the detection of membrane proteins at different homology levels, we performed two different tests. First, we tested the ability to detect sequences within a GPCRDB class. Here, hits to GPCR sequences outside a class are ignored (see Table 2). Second, we tested the ability to detect GPCRs from different classes. Here, hits to GPCR sequences in other classes are considered correct, whereas hits to sequences in the same class are ignored. This is identical to how the performance of fold recognition methods was investigated at different levels of relationship (Lindahl and Elofsson 2000).
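The two evaluation modes described above can be sketched as a small labelling function. This is only an illustration: the function name, the level names `"class"` and `"superfamily"`, and the use of `None` for a SWISS-PROT protein that is absent from GPCRDB are assumptions made for this sketch, not part of the published benchmark code.

```python
from typing import Optional


def label_hit(query_class: str, hit_class: Optional[str], level: str) -> Optional[bool]:
    """Label one database hit for a GPCR query.

    hit_class is the GPCRDB class of the hit, or None when the hit is a
    SWISS-PROT protein that is not in GPCRDB (assumed not to be a GPCR).
    Returns True (correct hit), False (false hit), or None (hit is ignored).
    """
    if hit_class is None:
        return False  # non-GPCR hit: counted as a false positive in both tests
    same_class = hit_class == query_class
    if level == "class":
        # Test 1: detect members of the same class; other GPCRs are ignored.
        return True if same_class else None
    if level == "superfamily":
        # Test 2: detect GPCRs from other classes; same-class hits are ignored.
        return None if same_class else True
    raise ValueError(f"unknown level: {level}")
```

With hypothetical class labels `"A"` and `"B"`, `label_hit("A", "B", "class")` is ignored, whereas `label_hit("A", "B", "superfamily")` counts as a correct hit.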
Concerning the detection of proteins that belong to the same GPCR class, Table 3 (column Mc-class) shows that TMSW performs significantly better than standard Smith-Waterman alignments: a Mc of 0.93 versus 0.83. For PSI-BLAST and TMPSI, the highest Mc is obtained if a strict cutoff is used. The highest correlation coefficients, 0.98, are thus seen for PSI-15 and TMPSI-15. The inclusion of information about predicted transmembrane segments results in only limited improvements to the Mc in this case. The table also shows that PSI-BLAST and TMPSI are able to detect almost all proteins within a class of GPCRs.
A more detailed understanding of the performance can be obtained from spec-sens plots (see Figure 1). It is clear that TMPSI has a significantly higher specificity than PSI-BLAST, irrespective of the cutoff E-value. Without the information about transmembrane segments, it is necessary to use an E-value cutoff of $10^{-15}$ to increase the specificity beyond 98%. For the single-sequence-based methods, transmembrane segment information significantly increases both the specificity and sensitivity.
| Discussion |
|---|
In Figures 1 and 2, it can be seen that PSI-BLAST performs significantly better than the SW method, both for the detection of closely and distantly related GPCRs. For closely related proteins, the best specificity is obtained for the very restrictive cutoff of $10^{-15}$. For distantly related proteins, however, PSI-BLAST performs better when the E-value is less restrictive (see Fig. 2). The high error rate for PSI-BLAST on our test set stands in contrast to the performance obtained for globular proteins, in which 35% of the family related proteins can be detected before any false positives (Lindahl and Elofsson 2000) and even a significant proportion of superfamily related proteins can be detected without any false positives.
High-scoring false hits seem to be a problem when PSI-BLAST is used without manual checking of the sequences incorporated into the profiles. Once a false hit is incorporated, every subsequent iteration might result in more false hits being included, and the final result will be biased by a high number of high-scoring incorrect matches. The top 100 false hits in the PSI-15 search are mainly caused by five GPCR sequences in the test set. Of the 100 highest-scoring false positives from SWISS-PROT, 57 have no predicted TM-helices, 35 have one, and 2 have 8 regions according to TMHMM. This shows that, at most, two of these proteins may in fact be GPCRs (that have not found their way into GPCRDB) and that PSI-BLAST runs an obvious risk of incorporating high-scoring false hits both to transmembrane and globular proteins. The mediocre performance of PSI-15 for detecting distantly related proteins further suggests that using very restrictive cutoffs is not without problems.
However, it is well known that PSI-BLAST performs very well in detecting both closely and distantly related globular proteins.
The performance increase using predicted secondary structures is quite marginal for globular proteins (Lindahl and Elofsson 2000). This raises the question of when parameters should be optimized for a particular case and when it is appropriate to use general parameters. Here, we use a standard substitution matrix for membrane proteins, whereas earlier studies have shown that the use of a special matrix might improve performance (Muller et al. 2001). In general, it is possible that the best performance would be obtained by use of a special set of parameters for each class of proteins (such as globular, fibrous, porins, etc.), or even for each type of secondary structure. However, taking into account the difficulties in optimizing these parameters, we think that, in general, it is better to use parameters that are general. This study shows an exception to this assumption, as the transmembrane regions differ significantly from globular regions in proteins. However, due to the limitations of this benchmark, we do not think it is possible to obtain the ideal values for gap penalties and substitution matrices. Therefore, we chose to use default values to as large a degree as possible.
Predicted secondary structures improve detection of membrane proteins
As can be seen in Figures 1 and 2, the use of transmembrane predictions significantly helps in the detection of related membrane proteins (at least insofar as the GPCR superfamily is representative of membrane proteins in general). The best results are obtained using TMPSI, that is, by a combination of profiles from PSI-BLAST and TMHMM predictions. Using TMPSI, it is possible to obtain a specificity higher than 99.5% for the detection of GPCRs from the same class. The inclusion of TMHMM predictions in a profile search is seen to increase the specificity compared with PSI-BLAST alone. However, no significant improvements are seen at lower specificity. This indicates that the inclusion of predicted membrane regions into profiles mainly functions as a filter to avoid incorrect matches, whereas it does not significantly increase the detection of distantly related proteins. Comparing Figures 1 and 2, it seems as if the best compromise to detect both closely and distantly related GPCRs might be to use TMPSI-5.
It should be noted that information from TMHMM was only included in the final profiles, that is, it was not used during the creation of the profiles. From the improvements seen for TMSW compared with SW, it seems safe to assume that if this were done, additional improvement would be obtained. We have not tested this possibility, as the inclusion of TMHMM predictions into PSI-BLAST is technically not straightforward. We will explore this possibility in future work.
| Conclusions |
|---|
| Materials and methods |
|---|
Search algorithms
For the standard Smith-Waterman (SW) algorithm (Smith and Waterman 1981), we used the BLOSUM62 matrix (Henikoff and Henikoff 1992), a gap-opening penalty of -10, and a gap-extension penalty of -4. Computational limitations made it impossible to make a systematic search using different matrices and/or gap-penalties.
The approach used to add information about predicted transmembrane segments is similar to the one used in earlier fold recognition/threading techniques (Fischer and Eisenberg 1996). Thus, the score for an alignment is calculated as:
![]() |
![]() |
![]() |
In this study we have used S' = 1 and S' = 0.
The difference from the earlier studies is that we use predicted transmembrane segments both for the query and the target sequence, whereas in the threading methods, the predicted secondary structure of the query is matched against the real secondary structure of the target.
The location and orientation of possible transmembrane helices are predicted using TMHMM (Krogh et al. 2001). If the residues in an aligned pair are both predicted to be located in a transmembrane helix, an additional positive score of one is added to the substitution score, as indicated above. The same substitution matrix and gap penalties as used in the standard Smith-Waterman search are used also in this case.
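The scoring scheme described in this section can be illustrated with a minimal sketch of a local alignment score with affine gap penalties and an extra bonus for aligned pairs in which both residues are predicted to be transmembrane. Several things here are assumptions made for the sketch rather than details from the paper: the function and parameter names, the encoding of predictions as strings with `M` for predicted helix residues, the convention that a gap of length k costs `gap_open + (k - 1) * gap_extend`, and the toy match/mismatch scorer that stands in for BLOSUM62.

```python
def tm_smith_waterman(query, target, query_tm, target_tm, subst,
                      gap_open=10, gap_extend=4, tm_bonus=1):
    """Best local alignment score with a bonus for TM/TM residue pairs.

    query_tm and target_tm have the same length as their sequences and use
    'M' for residues predicted to lie in a transmembrane helix.
    subst(a, b) returns the substitution score for residues a and b.
    Assumption: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg_inf = float("-inf")
    n, m = len(query), len(target)
    # h: best score of an alignment ending at (i, j)
    # e: best score ending with a gap in the query (target residue unpaired)
    # f: best score ending with a gap in the target (query residue unpaired)
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    e = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    f = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e[i][j] = max(h[i][j - 1] - gap_open, e[i][j - 1] - gap_extend)
            f[i][j] = max(h[i - 1][j] - gap_open, f[i - 1][j] - gap_extend)
            pair = subst(query[i - 1], target[j - 1])
            if query_tm[i - 1] == "M" and target_tm[j - 1] == "M":
                pair += tm_bonus  # both residues predicted transmembrane
            h[i][j] = max(0.0, h[i - 1][j - 1] + pair, e[i][j], f[i][j])
            best = max(best, h[i][j])
    return best


def toy_subst(a, b):
    """Hypothetical stand-in for BLOSUM62: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


plain_score = tm_smith_waterman("ACDE", "ACDE", "----", "----", toy_subst)
tm_score = tm_smith_waterman("ACDE", "ACDE", "MMMM", "MMMM", toy_subst)
```

In this toy case the transmembrane bonus raises the score of the four aligned pairs by one each; pairs in which only one residue is predicted to be transmembrane receive no bonus.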
For the PSI-BLAST searches, we have used the default PSI-BLAST parameters, except for the E-value used to include a sequence in the next iteration and for the number of iterations. We have used three E-values ($10^{-3}$, $10^{-5}$, and $10^{-15}$) and a maximum of five iterations. Low-complexity regions were not masked in the PSI-BLAST runs.
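As a rough illustration of these settings, the sketch below builds one command line per inclusion threshold. The text does not give the command lines that were actually run; the flag names are those of the current NCBI BLAST+ `psiblast` program, which is an assumption about tooling, and the file and database names are hypothetical.

```python
def psiblast_command(query_fasta, database, inclusion_evalue, max_iterations=5):
    """Build an argument list for the BLAST+ `psiblast` program.

    Only the inclusion threshold and the iteration count differ from the
    defaults; SEG filtering is switched off so low-complexity regions are
    not masked.
    """
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(max_iterations),
        "-inclusion_ethresh", str(inclusion_evalue),
        "-seg", "no",
    ]


# One run per inclusion threshold (hypothetical file and database names).
commands = [
    psiblast_command("query.fasta", "swissprot", evalue)
    for evalue in (1e-3, 1e-5, 1e-15)
]
```

Each list can be passed to `subprocess.run` on a machine where BLAST+ and a formatted database are installed.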
Finally, the novel TMPSI method includes both the information from TMHMM and the multiple sequence information from PSI-BLAST. In this method, a standard profile search (Gribskov et al. 1987) is performed, using the profile obtained from PSI-BLAST. In addition, we add a score of one for each aligned pair of a query profile position and a SWISS-PROT protein residue when both are predicted to be in transmembrane segments. For the query profile, the prediction of transmembrane segments is the same as that obtained for the initial seed sequence in the PSI-BLAST run. A gap-opening penalty of -10 and a gap-extension penalty of -4 were used. Due to computational limitations, we were not able to examine more parameter values.
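The per-position score used in such a profile search might be sketched as follows. The representation of a profile column as a dictionary of position-specific scores, and all names and example values, are assumptions made for illustration; the alignment recursion itself is unchanged from an ordinary profile search.

```python
def tm_profile_score(profile_column, residue, column_is_tm, residue_is_tm, tm_bonus=1):
    """Score for aligning one profile column with one database residue.

    profile_column maps residue letters to position-specific scores.
    column_is_tm is the TM prediction for the seed sequence at this column;
    residue_is_tm is the TM prediction for the database residue.
    """
    score = profile_column[residue]
    if column_is_tm and residue_is_tm:
        score += tm_bonus  # both positions predicted transmembrane
    return score


# Hypothetical profile column that favours leucine.
column = {"L": 4, "K": -2}
bonus_applied = tm_profile_score(column, "L", True, True)
no_bonus = tm_profile_score(column, "L", True, False)
```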
Comparison and assessment
We have used spec-sens plots (Rice and Eisenberg 1997; Hargbo and Elofsson 1999) as our primary measure of performance. The main advantage of this is that such plots measure the ability of a method to reliably find all pairwise matches in the database. The fraction of possible correct hits found, sensitivity, is defined as:
![]() |
![]() |
![]() |
In addition we have used the Matthews correlation coefficient (Mc) (Matthews 1975) for measuring the performance.
![]() |
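The measures named in this section can be computed from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The sketch below uses the conventional form of the Matthews correlation coefficient. The definition of specificity as TP / (TP + FP) follows common usage in spec-sens plots and is an assumption here, since the original equations are not reproduced in this text; the function names and example counts are likewise illustrative.

```python
import math


def sensitivity(tp, fn):
    """Fraction of all possible correct hits that were found."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of reported hits that are correct (assumed definition)."""
    return tp / (tp + fp)


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts at one score cutoff.
sens = sensitivity(tp=90, fn=10)
spec = specificity(tp=90, fp=10)
mc = matthews_cc(tp=90, tn=890, fp=10, fn=10)
```

Sweeping the score cutoff and recording each (specificity, sensitivity) pair would give the points of a spec-sens plot.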
Score normalizations
For the SW, TMSW, and TMPSI algorithms, the raw score S was normalized by the lengths m and n of the compared sequences, following studies of the expected score for unrelated proteins (Altschul et al. 1997):
![]() | (8) |
For PSI-BLAST, the E-value was used for scoring.
The Pmembr program used in this study is available both as a web server and as source code from http://www.sbc.su.se/~arne/pmembr/.
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Bairoch, A. and Apweiler, R. 1996. The SWISS-PROT protein sequence data bank and its new supplement TREMBL. Nucleic Acids Res. 24: 17-21.
Brenner, S.E., Chothia, C., and Hubbard, T. 1998. Assessing sequence comparison methods with reliable structurally identified evolutionary relationships. Proc. Natl. Acad. Sci. 95: 6073-6078.
Fischer, D. and Eisenberg, D. 1996. Protein fold recognition using sequence-derived predictions. Protein Sci. 5: 947-955.
Gribskov, M., McLachlan, A.D., and Eisenberg, D. 1987. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. 84: 4355-4358.
Hargbo, J. and Elofsson, A. 1999. A study of hidden Markov models that use predicted secondary structures for fold recognition. Proteins: Struct. Funct. Genet. 36: 68-87.
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915-10919.
Horn, F., Vriend, G., and Cohen, F.E. 2001. Collecting and harvesting biological data: The GPCRDB and NucleaRDB information systems. Nucleic Acids Res. 29: 346-349.
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1994. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry 33: 3038-3049.
Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L. 2001. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305: 567-580.
Lindahl, E. and Elofsson, A. 2000. Identification of related proteins on family, superfamily and fold level. J. Mol. Biol. 295: 613-625.
Matthews, B.W. 1975. Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405: 442-451.
McGuffin, L.J., Bryson, K., and Jones, D.T. 2000. The PSIPRED protein structure prediction server. Bioinformatics 16: 404-405.
Muller, T., Rahmann, S., and Rehmsmeier, M. 2001. Non-symmetric score matrices and the detection of homologous transmembrane proteins. Bioinformatics 17: S182-S189.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536-540.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATH - a hierarchical classification of protein domain structures. Structure 5: 1093-1108.
Park, J., Teichmann, S.A., Hubbard, T., and Chothia, C. 1997. Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol. 273: 249-254.
Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., and Chothia, C. 1998. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol. 284: 1201-1210.
Popot, J.L. and Engelman, D.M. 2000. Helical membrane protein folding, stability, and evolution. Annu. Rev. Biochem. 69: 881-922.
Rice, D. and Eisenberg, D. 1997. A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J. Mol. Biol. 267: 1026-1038.
Rost, B., Schneider, R., and Sander, C. 1997. Protein fold recognition by prediction-based threading. J. Mol. Biol. 270: 471-480.
Salamov, A.A., Suwa, M., Orengo, C.A., and Swindells, M.B. 1999. Combining sensitive database searches with multiple intermediates to detect distant homologues. Protein Eng. 12: 95-100.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195-197.
Tourasse, N.J. and Li, W.H. 2000. Selective constraints, amino acid composition, and the rate of protein evolution. Mol. Biol. Evol. 17: 656-664.
von Heijne, G. 1981. Membrane proteins: The amino acid composition of membrane-penetrating segments. Eur. J. Biochem. 120: 275-278.
Wallace, B.A., Cascio, M., and Mielke, D.L. 1986. Evaluation of methods for the prediction of membrane protein secondary structures. Proc. Natl. Acad. Sci. 83: 9423-9427.
White, S.H. and Wimley, W.C. 1999. Membrane protein folding and stability: Physical principles. Annu. Rev. Biophys. Biomol. Struct. 28: 319-365.