Laboratory of Biocomputing, CIRB/Department of Biology, University of Bologna, 40126 Bologna, Italy
Reprint requests to: Rita Casadio, Department of Biology, University of Bologna, via Irnerio 42, 40126 Bologna, Italy; e-mail: casadio{at}alma.unibo.it; fax: 0039-051-242576.
(RECEIVED July 12, 2002; FINAL REVISION November 22, 2002; ACCEPTED February 19, 2003)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0223603.
| Abstract |
|---|
|
|
|---|
, and all-β membrane proteins with the aim of fishing new membrane proteins in the pool of nonannotated proteins (twilight zone). The focus is then mainly on outer membrane proteins. This is performed by using an integrated suite of programs (Hunter) specifically developed for predicting the occurrence of signal peptides in proteins of Gram-negative bacteria and the topography of all-α and all-β membrane proteins. Hunter is tested on the well and partially annotated proteins (2160 and 760, respectively) of Escherichia coli K12, scoring as high as 95.6% in the correct assignment of each chain to the category. Of the remaining 1253 nonannotated sequences, 1099 are predicted globular, 136 are all-α, and 18 are all-β membrane proteins. In Escherichia coli O157:H7 we filtered 1901 nonannotated proteins. Our analysis classifies 1564 globular chains, 327 inner membrane proteins, and 10 outer membrane proteins. With Hunter, new membrane proteins are added to the list of putative membrane proteins of Gram-negative bacteria. The content of outer membrane proteins per genome (nine are analyzed) ranges from 1.5% to 2.4%, and it is one order of magnitude lower than that of inner membrane proteins. The finding is particularly relevant when it is considered that this is the first large-scale analysis based on validated tools that can predict the content of outer membrane proteins in a genome and can allow cross-comparison of the same protein type between different species.
Keywords: All-β membrane proteins; all-α membrane proteins; structural genomics; neural networks; hidden Markov models; topography prediction of membrane proteins
| Introduction |
|---|
|
|
|---|
To understand the genetic blueprint of different organisms, protein sequences are automatically analyzed for function assignment and annotation by means of extensive homology search with PSI-BLAST or hidden Markov models (Eddy 1996; Altschul et al. 1997).
However, there is still a substantial number of uncharacterized proteins, including hypothetical proteins (with homologs of unknown function) or unique proteins (without known homologs) that deserve further characterization (Fischer and Eisenberg 1999; Iliopoulos et al. 2001).
To this aim we integrated a set of independent predictors that have been developed in our laboratory and tested their discriminating capability on the reannotated genome sequence database of Escherichia coli K12, which includes 4173 protein-coding genes (EcoGene; Rudd 2000). Of these, 52% and 18% are fully and partially annotated, respectively. The remaining 30% are nonannotated (without a Swiss-Prot entry or corresponding to proteins that are not functionally annotated).
In this article we test the efficiency of our programs in correctly discriminating globular from membrane proteins, and all-α from all-β membrane proteins, using as a test set the 70% annotated portion of the E. coli genome.
It is presently known that proteins found in the inner membrane of bacteria interact with the lipid bilayer through typical bundles of α-helices (and are termed all-α membrane proteins; von Heijne 1999). Conversely, in the outer membrane of Gram-negative bacteria, proteins spanning the membrane bilayer with β-strands (named all-β membrane proteins; Schulz 2000) are organized in barrel-like structures.
Prompted by the high performance of our method, we label new globular and membrane proteins on the remaining nonannotated portion of the E. coli genome. This procedure characterizes new sequences of globular, outer, and inner membrane proteins. For the sake of comparison, the same procedure is applied to the genome of the pathogenic strain E. coli O157:H7 and highlights new sequences of membrane proteins. New outer membrane proteins without a counterpart in K12, and possibly related to pathogenicity, are predicted. Furthermore, a genome-wide analysis of other pathogens and one thermophile is also presented. From this it emerges that in the Gram-negatives taken into consideration, outer membrane proteins are generally a small fraction of the protein content, being at least one order of magnitude less abundant than inner membrane proteins.
| Hunter at work |
|---|
|
|
|---|
all-α membrane proteins may or may not include signal peptides (23 chains out of 422, 5.5%)
On the other hand, it has been argued that enzyme function is less conserved than anticipated, and that functional annotation may be biased when performed only on a sequence homology basis (Rost 2002). An alternative approach for classification is based on structure prediction (Frishman and Mewes 1999; Jones 2000; Kelley et al. 2000; Thornton et al. 2000; Frishman et al. 2001; Turcotte et al. 2001). In this article, we choose to address the problem on a structural basis, relying on the classification obtained with methods specifically suited for predicting membrane protein topography.
We implemented a signal peptide predictor that compares well with the top-scoring SignalP (Nielsen et al. 1999). Furthermore, we developed two well-performing predictors of the topography of inner all-α and outer all-β membrane proteins, endowed with filters that minimize the rate of false positives (proteins falsely predicted in the category). The predictor for all-β membrane proteins is similar to that already described (Jacoboni et al. 2001; Martelli et al. 2002). That for all-α inner membrane proteins is based on neural networks, like other predictors of this type (Rost et al. 1995). However, it is the first predictor trained and tested only on the inner membrane proteins known with atomic resolution. The predictors and their performance are described in the Materials and Methods section.
One possibility to address the task at hand is to combine all three predictors in an efficient manner. The empirical rules to be taken into consideration for solving our discriminative problem include the following: the predictor of all-α membrane proteins wrongly predicts signal peptides as transmembrane segments in the N-terminal portion of the chain, and it can be affected as well by false positives.
The scheme we propose to integrate our predictors (Fig. 1) is particularly suited to mitigate the number of false positives and negatives and to send the maximal number of chains endowed with a predicted signal peptide towards the outer membrane protein classifier. The discriminative power of the suite of programs resides mainly in two filters: one is based on a hidden Markov model specifically developed for the outer membrane proteins of Gram-negative bacteria (Martelli et al. 2002); the other is a neural network-based filter minimizing the rate of false positives of a neural network predicting the topography of all-α membrane proteins (this work). All predictors use as input the sequence profile derived from multiple alignment of the target chain towards the nonredundant database.
The protein content of the genome is first filtered with the neural network-based predictor trained and tested on 36 membrane proteins known with atomic resolution. Depending on the number of helices predicted (zero, one, or two or more), the chain is then filtered with the signal peptide predictor. This is done with either a low threshold (more false positives) or a stringent threshold (fewer false positives).
A chain with no transmembrane helices predicted (blue path in Fig. 1) is then filtered with the low-threshold signal peptide predictor. If the protein is without a signal peptide it is classified as globular. Otherwise, if the protein is retained by the HMM filter, the chain is classified as transmembrane all-β and eventually predicted with the neural network for computing the topography (Jacoboni et al. 2001).
When the protein is predicted to have a transmembrane helix in the N-terminal region (red path in Fig. 1), it is also filtered with the low-threshold signal peptide predictor, with two alternatives: if the signal peptide is present, then the protein is sent to the β-strand filter and the end steps are those described above; if not, the protein is presented to the filter for all-α membrane proteins and, if retained, it is accordingly classified; if not, it is classified as globular.
When one helix is predicted in the chain (but not in the N-terminal region) or two or more helices are predicted (green path in Fig. 1), the stringent signal peptide predictor is activated; then the protein can be classified as either globular or membrane all-α, depending on the output of the all-α transmembrane filter. When the protein is classified as all-α, the signal peptide is excised and the topography prediction is performed without the segment.
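The three paths of the scheme can be summarized in a minimal Python sketch. The names `ChainPredictions` and `classify_chain` are hypothetical and are not part of Hunter; the outputs of the individual predictors are taken as given, and the two thresholds are those quoted in Materials and Methods (0.84 and 0.99). The fallback to the globular class when the HMM filter rejects a chain is an assumption, since the text does not state it explicitly.

```python
from dataclasses import dataclass
from typing import Tuple

LOW_SP_THRESHOLD = 0.84        # low threshold (more false positives)
STRINGENT_SP_THRESHOLD = 0.99  # stringent threshold (fewer false positives)


@dataclass
class ChainPredictions:
    """Outputs of the individual predictors for one chain (hypothetical container)."""
    n_helices: int                  # helices predicted by the all-alpha network
    helix_in_n_terminal: bool       # True if a predicted helix lies in the N-terminal region
    signal_peptide_score: float     # output of the signal peptide network, in [0, 1]
    retained_by_beta_hmm: bool      # decision of the HMM filter for outer membrane proteins
    retained_by_alpha_filter: bool  # decision of the all-alpha transmembrane filter


def classify_chain(p: ChainPredictions) -> Tuple[str, bool]:
    """Return (category, excise_signal_peptide) following the three paths of Fig. 1."""

    def beta_branch() -> Tuple[str, bool]:
        # Assumption: chains rejected by the HMM filter fall back to globular.
        return ("all-beta" if p.retained_by_beta_hmm else "globular", False)

    single_n_terminal_helix = p.n_helices == 1 and p.helix_in_n_terminal

    if p.n_helices == 0 or single_n_terminal_helix:
        # Blue and red paths: low-threshold signal peptide check first.
        if p.signal_peptide_score >= LOW_SP_THRESHOLD:
            return beta_branch()
        if p.n_helices == 0:
            return ("globular", False)          # blue path, no signal peptide
        # Red path, no signal peptide: ask the all-alpha filter.
        return ("all-alpha" if p.retained_by_alpha_filter else "globular", False)

    # Green path: one helix outside the N-terminal region, or two or more helices.
    if not p.retained_by_alpha_filter:
        return ("globular", False)
    # The stringent signal peptide call only decides whether the segment is
    # excised before the topography prediction.
    return ("all-alpha", p.signal_peptide_score >= STRINGENT_SP_THRESHOLD)
```

For instance, a chain with three predicted helices, a signal peptide score of 0.995, and a positive all-α filter would be labeled all-α with the signal peptide excised before the topography is predicted.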
| Testing the performance of Hunter |
|---|
|
|
|---|
The test is performed using the subsets of proteins from E. coli K12 that are well annotated (2160) or partially annotated (760) in Swiss-Prot. Test experiments were performed at three different values of sequence identity between the proteins of the testing and training sets of the predictors: 20%, 25%, and 30%. With the exception of the number of sequences included in the training and testing sets, the performance of Hunter was rather similar at each level of sequence identity. We show the results obtained with the largest number of sequences, corresponding to sequence identity 30%.
The statistical validation of the predictor on the well and partially annotated sets is shown in Tables 1 and 2, respectively. In both cases the score per protein (Q3p) is higher than 90% (95.6% and 92.6% for the well and partially annotated sets, respectively). Also, both the rate of proteins correctly predicted in the class (Qclass) and the probability of correct prediction in the class (Pclass) are good.
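As an illustration of how the three indices relate to each other, the following sketch computes them from a three-class confusion matrix. It assumes that Q3p is the fraction of chains assigned to the correct class, that Qclass is the fraction of chains observed in a class that are predicted in it, and that Pclass is the fraction of chains predicted in a class that truly belong to it. The function name `class_scores` and the counts in `toy` are invented for the example and are not the values of Tables 1 and 2.

```python
from typing import Dict, Tuple

CLASSES = ("globular", "all-alpha", "all-beta")


def class_scores(
    confusion: Dict[str, Dict[str, int]]
) -> Tuple[float, Dict[str, Tuple[float, float]]]:
    """confusion[observed][predicted] -> (Q3p, {class: (Qclass, Pclass)})."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c][c] for c in CLASSES)
    q_overall = correct / total

    per_class = {}
    for c in CLASSES:
        observed = sum(confusion[c].values())              # chains truly in class c
        predicted = sum(confusion[o][c] for o in CLASSES)  # chains predicted in class c
        q_class = confusion[c][c] / observed if observed else 0.0
        p_class = confusion[c][c] / predicted if predicted else 0.0
        per_class[c] = (q_class, p_class)
    return q_overall, per_class


# Invented counts, for illustration only.
toy = {
    "globular":  {"globular": 90, "all-alpha": 6,  "all-beta": 4},
    "all-alpha": {"globular": 5,  "all-alpha": 45, "all-beta": 0},
    "all-beta":  {"globular": 2,  "all-alpha": 0,  "all-beta": 8},
}
q3p, scores = class_scores(toy)
```

Under these definitions, 1 − Qclass is the rate of false negatives (chains missed in the class) and 1 − Pclass is the rate of false positives (chains wrongly predicted in the class).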
transmembrane proteins ranges from 8.8% to 9.9% and from 11.4% to 4.5% when the estimate is evaluated on the two sets, respectively; similarly, it is 17.6%–22.2% and 9.7%–12.5% for all-β outer membrane proteins, and 3.1%–4.4% and 2.3%–9.5% for globular proteins. Evidently the rates of false negatives and positives are affected by the smaller number of membrane proteins, particularly all-β, compared to that of globular proteins. From these figures we may roughly estimate the rates of false negatives and positives to be associated with the putative numbers of proteins predicted in each of the three categories as values averaged over the two sets. We may conclude that the outer membrane classification may be affected by underprediction (about 20%) more than by overprediction (about 10%). Under- and overprediction for the other two categories are comparable: by averaging, about 9% for inner membrane proteins and 5% for globular ones.
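The averaging behind these rounded figures can be reproduced with a few lines of Python. The sketch assumes that, for each class, the first pair of percentages quoted above refers to false negatives (underprediction) and the second to false positives (overprediction), which is consistent with the conclusion drawn for the outer membrane proteins; the dictionary layout and names are illustrative.

```python
# Percentages quoted in the text, one value per test set
# (well annotated, partially annotated).
rates = {
    "all-alpha": {"false_neg": (8.8, 9.9),   "false_pos": (11.4, 4.5)},
    "all-beta":  {"false_neg": (17.6, 22.2), "false_pos": (9.7, 12.5)},
    "globular":  {"false_neg": (3.1, 4.4),   "false_pos": (2.3, 9.5)},
}


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# Average each rate over the two sets, then pool under- and overprediction.
averaged = {
    cls: {kind: mean(pair) for kind, pair in kinds.items()}
    for cls, kinds in rates.items()
}
pooled = {cls: mean(list(kinds.values())) for cls, kinds in averaged.items()}

for cls in rates:
    print(cls, averaged[cls], round(pooled[cls], 2))
```

For all-β proteins the averages are 19.9% and 11.1%, in line with the "about 20%" and "about 10%" given above; the pooled values for all-α and globular proteins (8.65% and about 4.8%) round to the quoted 9% and 5%.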
In conclusion, from the test it is evident that Hunter quite accurately classifies the proteins of the E. coli genome, although it misses and overpredicts some chains. This was expected, considering also the statistical validation of each predictor (see Materials and Methods).
Hunter is then used to filter the remaining portion of the genome of E. coli K12. The results are shown in Table 3. Out of 1253 proteins, 136 are classified as transmembrane all-α, 18 as transmembrane all-β, and 1099 as globular. We also detail for the outer membrane proteins the name of the file, the Swiss-Prot ID if existing, the length of the chain, the number of predicted transmembrane β-strands, and the number and the annotation of the homologs (E-value < 10⁻⁷). For the sake of clarity we include the annotation of the first homolog as detected by BLAST and the level of local and global identity (%) of the target to the homolog.
| Filtering of E. coli O157:H7 |
|---|
|
|
|---|
, 10 new all-β membrane proteins, and 1564 new globular proteins. Table 4
| What did we learn? |
|---|
|
|
|---|
membrane protein content (including the new proteins that we add with our procedure) of both genomes is about 25% in E. coli O157:H7 and about 22% in E. coli K12. These figures compare well with previous estimates performed with all-α membrane protein predictors based on HMM and neural networks (Krogh et al. 2001; Liu and Rost 2001). However, what is novel is that Hunter classifies and lists, together with globular and inner membrane proteins, the putative content of all-β outer transmembrane proteins, and this may be particularly interesting in pathogenic bacteria. We found that the proteome of E. coli K12 contains about 1.7% of outer membrane proteins; the estimate is similar in E. coli O157:H7 (see also Table 5).
(Fig. 2B
in K12 have lengths from 1–50 to 1301–1350; in O157, from 1–50 to 1351–1660. The percentage of short sequences (100 residues) is 5.4 and 9.9 in K12 and O157, respectively, and the average length of all-α transmembrane proteins is quite similar (364 residues in E. coli K12 and 342 in O157, respectively). The same observation holds also for all-β membrane proteins (Fig. 2C
transmembrane proteins with a given number of transmembrane segments is shown for both strains in Figure 3A
membrane proteins nearly doubles in O157. For all-β membrane proteins it is worth noticing (Fig. 3B
| Fishing new proteins in other genomes |
|---|
|
|
|---|
| Conclusions |
|---|
|
|
|---|
, outer all-β membrane, and globular proteins. We test its availability using 50% of the genome of E. coli K12 as annotated in EcoGene, and estimate the rate of both false negatives (proteins that may be missed in the class) and false positives (proteins that may be wrongly predicted in the class).
Filtering of E. coli K12 and O157 adds new chains to inner and outer membrane proteins and globular ones. We propose Hunter for specifically fishing new inner and outer membrane proteins in genomes of Gram-negatives, and possibly highlighting new virulence factors.
| Materials and methods |
|---|
|
|
|---|
The sequence profile was derived after alignment towards the nonredundant database (July 2001) with PSI-BLAST (Altschul et al. 1997). If necessary, local and global alignments were performed with LALIGN (www.ch.embnet.org/software/LALIGN_form.html) used with default parameters.
The signal peptide predictor
We trained and tested a neural network-based predictor on 598 chains, with 301 positive examples. The network architecture includes an asymmetric input window comprising 14 residues, three neurons in the hidden layer, and one output layer. Our predictor scores similarly to SignalP (Nielsen et al. 1999). The accuracy is 96.5% and the correlation coefficient is 0.93. When E. coli K12 was filtered with a stringent filter, the accuracy per protein was 95.3%. SignalP under the same conditions and over the same set had an accuracy per protein of 95.5%. When using the predictor, a stringent threshold means that only output values larger than or equal to 0.99 are accepted. The low threshold is similarly set at 0.84.
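The two scores and the two thresholds can be made concrete with a short sketch. It assumes that the correlation coefficient is the Matthews correlation coefficient commonly used for binary predictors; the function names are hypothetical. The counts in the example are invented: they only respect the totals quoted above (598 chains, 301 positives), and the split of errors between false positives and false negatives is not taken from the article.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of chains assigned to the correct class."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_correlation(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def has_signal_peptide(score: float, stringent: bool) -> bool:
    """Apply the stringent (0.99) or the low (0.84) threshold to the network output."""
    threshold = 0.99 if stringent else 0.84
    return score >= threshold


# Invented counts: 301 positives (tp + fn) and 297 negatives (tn + fp).
example = dict(tp=290, fn=11, tn=287, fp=10)
acc = accuracy(**example)
mcc = matthews_correlation(**example)
```

A network output of 0.90, for example, counts as a signal peptide at the low threshold but not at the stringent one, which is why the two settings trade false positives against false negatives.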
The predictor of all-α inner membrane proteins
We trained and tested a neural network-based predictor on a nonredundant set of 36 membrane proteins known with atomic resolution. The predictor architecture includes a 17-residue-long input window that uses the sequence profile, 15 neurons in the hidden layer, and two output neurons. Per-residue accuracy is 86.3% and the correlation coefficient is 0.72, with an overlapping score (SOV; Zemla et al. 1999) of 86.8%. To increase the discriminative power of the network, we implemented a filter that takes into consideration the maximal probability values characteristic of the test set. A protein is accepted only if it is predicted with at least one transmembrane region including probability values as high as 0.96. When a nonredundant set of some 800 globular proteins is predicted, the rate of false positives decreases from 26% (the majority with one α-helix) to 0.5%.
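A minimal sketch of the two ingredients described above follows: the construction of the 17-residue profile windows presented to the network, and the probability filter applied to its output. The function names are hypothetical. Centering the window on the residue, zero-padding beyond the chain termini, and reading "as high as 0.96" as "greater than or equal to 0.96" are assumptions, since the article does not specify these details.

```python
from typing import List, Sequence, Tuple

WINDOW = 17                # residues per input window
PROBABILITY_CUTOFF = 0.96  # minimal peak probability required by the filter


def profile_windows(
    profile: Sequence[Sequence[float]], width: int = WINDOW
) -> List[List[float]]:
    """Return one flattened input vector per residue.

    `profile` holds one row of residue frequencies per sequence position.
    Positions that fall outside the chain are filled with zeros (assumption).
    """
    half = width // 2
    n_columns = len(profile[0]) if profile else 0
    blank = [0.0] * n_columns
    windows = []
    for centre in range(len(profile)):
        vector: List[float] = []
        for position in range(centre - half, centre + half + 1):
            inside = 0 <= position < len(profile)
            vector.extend(profile[position] if inside else blank)
        windows.append(vector)
    return windows


def passes_filter(
    probabilities: Sequence[float],
    segments: Sequence[Tuple[int, int]],
    cutoff: float = PROBABILITY_CUTOFF,
) -> bool:
    """Accept a chain if at least one predicted transmembrane segment
    (inclusive start and end indices) reaches the probability cutoff."""
    return any(
        max(probabilities[start:end + 1]) >= cutoff for start, end in segments
    )
```

A chain with no predicted segment, or whose segments all peak below the cutoff, is rejected by the filter and therefore not classified as an all-α membrane protein.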
The predictor of all-β membrane proteins
The neural network predicting the all-β membrane proteins has been described before (Jacoboni et al. 2001). However, in this case the rate of false positives was also decreased by using a hidden Markov model similar to that previously described (Martelli et al. 2002). It has been discussed that the transmembrane strand pattern is not as characteristic as that of transmembrane α-helices. When some 800 nonredundant globular proteins are presented to the neural network, 30% are wrongly predicted with at least one and two transmembrane β-strands. When the HMM is added on top of the network, the rate of false positives decreases to 5%.
Hunter
Hunter is the suite of programs described above. The core tools are written in C; the parsers and the global framework are written in Perl. The suite is implemented on a Beowulf cluster comprising eight CPUs. The running time for a genome of medium size (about 5000 genes) is about 5 h. Most of the time is used for computing the sequence profile after sequence alignment towards the nonredundant database with PSI-BLAST (Altschul et al. 1997).
Statistical indexes to measure the predictor efficiency have been described before (Casadio et al. 1996; Jacoboni et al. 2001; Martelli et al. 2002).
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
|
|
|---|
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A., and Wheeler, D.L. 2002. GenBank. Nucleic Acids Res. 30: 17–20.
Casadio, R., Fariselli, P., Taroni, C., and Compiani, M. 1996. A predictor of transmembrane α-helix domains of proteins based on neural networks. Eur. Biophys. J. 24: 165–178.
Eddy, S.R. 1996. Hidden Markov models. Curr. Opin. Struct. Biol. 6: 361–365.
Emsley, P., Charles, I.G., Fairweather, N.F., and Isaacs, N.W. 1996. Structure of Bordetella pertussis virulence factor P.69 pertactin. Nature 381: 90–92.
Fischer, D. and Eisenberg, D. 1999. Finding families for genomic ORFans. Bioinformatics 15: 759–762.
Frishman, D. and Mewes, H.W. 1999. Genome-based structural biology. Prog. Biophys. Mol. Biol. 72: 1–17.
Frishman, D., Albermann, K., Hani, J., Heumann, K., Metanomski, A., Zollner, A., and Mewes, H.W. 2001. Functional and structural genomics using PEDANT. Bioinformatics 17: 44–57.
Gasteiger, E., Jung, E., and Bairoch, A. 2001. SWISS-PROT: Connecting biomolecular knowledge via a protein database. Curr. Issues Mol. Biol. 3: 47–55.
Hayashi, T., Makino, K., Ohnishi, M., Kurokawa, K., Ishii, K., Yokoyama, K., Han, C.G., Ohtsubo, E., Nakayama, K., Murata, T., et al. 2001. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res. 8: 11–22.
Iliopoulos, I., Tsoka, S., Andrade, M.A., Janssen, P., Audit, B., Tramontano, A., Valencia, A., Leroy, C., Sander, C., and Ouzounis, C.A. 2001. Genome sequences and great expectations. Genome Biol. 2: INTERACTIONS0001.
Jacoboni, I., Martelli, P.L., Fariselli, P., De Pinto, V., and Casadio, R. 2001. Prediction of the transmembrane regions of β-barrel membrane proteins with a neural network-based predictor. Protein Sci. 10: 779–787.
Jones, D.T. 2000. Protein structure prediction in the postgenomic era. Curr. Opin. Struct. Biol. 10: 371–379.
Kelley, L.A., MacCallum, R.M., and Sternberg, M.J. 2000. Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299: 499–520.
Kersey, P., Hermjakob, H., and Apweiler, R. 2000. VARSPLIC: Alternatively-spliced protein sequences derived from SWISS-PROT and TrEMBL. Bioinformatics 16: 1048–1049.
Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L. 2001. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305: 567–580.
Liu, J. and Rost, B. 2001. Comparing function and structure between entire proteomes. Protein Sci. 10: 1970–1979.
Martelli, P.L., Fariselli, P., Krogh, A., and Casadio, R. 2002. A sequence profile based HMM for predicting and discriminating β barrel membrane proteins. Bioinformatics 18:S1 46–53.
Menne, K.M., Hermjakob, H., and Apweiler, R. 2000. A comparison of signal sequence prediction methods using a test set of signal peptides. Bioinformatics 16: 741–742.
Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. 1997. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10: 1–6.
Nielsen, H., Brunak, S., and von Heijne, G. 1999. Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng. 12: 3–9.
Pautsch, A. and Schulz, G.E. 2000. High-resolution structure of the OmpA membrane domain. J. Mol. Biol. 298: 273–282.
Perna, N.T., Plunkett III, G., Burland, V., Mau, B., Glasner, J.D., Rose, D.J., Mayhew, G.F., Evans, P.S., Gregor, J., Kirkpatrick, H.A., et al. 2001. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409: 529–533.
Rost, B. 2002. Enzyme function less conserved than anticipated. J. Mol. Biol. 318: 595–608.
Rost, B., Casadio, R., Fariselli, P., and Sander, C. 1995. Transmembrane helices predicted at 95% accuracy. Protein Sci. 4: 521–533.
Rudd, K.E. 2000. EcoGene: A genome sequence database for Escherichia coli K-12. Nucleic Acids Res. 28: 60–64.
Schulz, G.E. 2000. β-barrel membrane proteins. Curr. Opin. Struct. Biol. 10: 443–447.
Skovgaard, M., Jensen, L.J., Brunak, S., Ussery, D., and Krogh, A. 2001. On the total number of genes and their length distribution in complete microbial genomes. Trends Genet. 17: 425–428.
Snijder, H.J. and Dijkstra, B.W. 2000. Bacterial phospholipase A: Structure and function of an integral membrane phospholipase. Biochim. Biophys. Acta 1488: 91–101.
Thornton, J.M., Todd, A.E., Milburn, D., Borkakoti, N., and Orengo, C.A. 2000. From structure to function: Approaches and limitations. Nat. Struct. Biol. Suppl: 991–994.
Turcotte, M., Muggleton, S.H., and Sternberg, M.J. 2001. Automated discovery of structural signatures of protein fold and function. J. Mol. Biol. 306: 591–605.
Vogt, J. and Schulz, G.E. 1999. The structure of the outer membrane protein OmpX from Escherichia coli reveals possible mechanisms of virulence. Struct. Fold. Des. 7: 1301–1309.
von Heijne, G. 1999. Recent advances in the understanding of membrane protein assembly and structure. Q. Rev. Biophys. 32: 285–307.
von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., and Bork, P. 2002. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417: 399–403.
Wimley, W.C. 2002. Toward genomic identification of β-barrel membrane proteins: Composition and architecture of known structures. Protein Sci. 11: 301–312.
Zemla, A., Venclovas, C., Fidelis, K., and Rost, B. 1999. A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins 34: 220–223.