Stockholm Bioinformatics Center, and Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden
Reprint requests to: Gunnar von Heijne, Stockholm Bioinformatics Center, and Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden; e-mail: gunnar{at}dbb.su.se; fax: 46-8-15 36 79.
(RECEIVED February 13, 2003; FINAL REVISION June 19, 2003; ACCEPTED July 11, 2003)
Supplemental material: See www.proteinscience.org
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0306003.
| Abstract |
|---|
Keywords: Thylakoid lumen; protein sorting; chloroplast; prediction
| Introduction |
|---|
Prediction of the subcellular localization of a protein from its amino acid sequence is an important area in bioinformatics (Emanuelsson and von Heijne 2001). One approach to this problem is to try to emulate the cellular process of sorting signal recognition. One of the more widely used methods of this kind is TargetP (Emanuelsson et al. 2000), a neural network-based predictor that assigns proteins to four different locations: the secretory pathway, mitochondria, chloroplasts, and "all other compartments." An attempt at a fully comprehensive prediction scheme is PSORT1 (Nakai and Kanehisa 1992), which distinguishes between no fewer than 17 compartments when applied to plant proteins.
The recognition of cTPs is already a part of TargetP, but the program does not yet include a routine for predicting lTPs. Because lTPs are quite similar to the signal peptides that target proteins for secretion in bacteria, one way to identify lTPs is to use TargetP to first search for cTPs, followed by a search for signal peptides using the SignalP predictor (Nielsen et al. 1997; Nielsen and Krogh 1998). Such an approach has been used with some success, for example, by Peltier et al. (2002) and Schubert et al. (2002). However, we assumed that performance would be even better with a dedicated lTP predictor trained on a proper lTP data set.
Here, we report such a predictor, LumenP, which has been constructed in a similar way to the existing components of TargetP. When coupled with TargetP, LumenP allows proteins of the thylakoid lumen to be identified with high confidence.
| Results |
|---|
Based on the observation that only 4% of the combined cTP+lTP signals were longer than 130 residues, only the N-terminal 130 residues were analyzed for each protein. The 138 proteins in the TAT group and the 121 proteins in the Sec group were treated as separate positive training sets, and each set was redundancy-reduced as described in Materials and Methods. After this step, 50 nonhomologous sequences were left in the TAT group and 43 in the Sec group. Because the precise extent of the lTP signal is in general not known for these sequences, a stretch of 35 residues upstream of the lTP cleavage site (determined by experiments if available, otherwise by similarity) was annotated as belonging to the lTP for the neural network training procedure. A redundancy-reduced negative training set of 50 nonthylakoid sequences, each 130 residues long (10 stromal proteins, 10 mitochondrial proteins, 10 nuclear proteins, 10 secreted proteins, and 10 cytosolic proteins), was also collected.
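The labeling rule described above can be illustrated with a minimal Python sketch. The function name `ltp_labels` and the convention that `cleavage_site` counts the residues preceding the cleaved bond are assumptions made for illustration; they are not taken from the LumenP implementation.

```python
def ltp_labels(cleavage_site, window=130, ltp_length=35):
    """Return per-residue 0/1 labels for the N-terminal `window` residues.

    `cleavage_site` is the number of residues preceding the lTP cleavage
    point (hypothetical convention). The `ltp_length` residues immediately
    upstream of that point are labeled 1 (lTP); all others are labeled 0.
    """
    labels = [0] * window
    start = max(0, cleavage_site - ltp_length)
    end = min(window, cleavage_site)  # residues beyond the window are dropped
    for pos in range(start, end):
        labels[pos] = 1
    return labels


# A protein whose combined cTP+lTP is 80 residues long:
labels = ltp_labels(cleavage_site=80)
print(sum(labels))      # 35
print(labels.index(1))  # 45 (0-based index of the first lTP residue)
```

For a cleavage site beyond residue 130, only the part of the 35-residue stretch that falls inside the analyzed window receives the lTP label in this sketch.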
For both the TAT and Sec training sets, two networks, one on top of the other, were trained in the same way as has been done previously for the SignalP (Nielsen et al. 1997) and ChloroP (Emanuelsson et al. 1999) predictors. The first network was trained with the amino acid sequence as input and a sliding window of 35 residues. The output of the first network is one score per residue, giving the probability that this residue is part of an lTP. The output values for residues 21 to 110 for each protein were then used as input to a second network with 90 input nodes. The output of the second network is one score per protein, giving the probability that the protein has an lTP. The networks were trained using fivefold cross-validation.
Prediction of lTP cleavage site
An important part of the predictor is a scoring-matrix-based method for predicting the cleavage site location of the lumenal targeting peptide. Focusing exclusively on the region around the cleavage site, we pooled the TAT and Sec datasets because these signals are assumed to be cleaved by the same protease. Thus, the entire set of 93 redundancy-reduced lumenal sequences was used in the construction of the scoring matrix. First, the proteins were aligned (without gaps) around their cleavage sites, and eight alignment positions were extracted: six from the lTP and two from the mature part of the protein, thus covering the c-region of the signal sequence (von Heijne 1983). Then, the cleavage site scoring matrix was constructed by recording, for each position i in the alignment, the frequency $f_{i,j}$ of each amino acid j, and contrasting this frequency with the frequency $p_j$ of that amino acid in a background set (see Materials and Methods). The resulting scoring matrix can be used to scan candidate sequences for the most probable cleavage site. The search is limited to the sequence region comprising residues 50–150 (this includes all known cTP+lTP lengths).
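A scan of this kind might look as follows in Python. This is only a sketch: the function name `best_cleavage_site`, the representation of the matrix as a list of dictionaries, the zero score for unknown residues, and the way the 50–150 limits are mapped onto candidate sites are all assumptions.

```python
def best_cleavage_site(seq, matrix, n_before=6, n_after=2, first=50, last=150):
    """Return (site, score) for the highest-scoring candidate cleavage site.

    `site` is the number of residues preceding the cleaved bond, so the
    window scored for a site is seq[site - n_before : site + n_after].
    `matrix` holds one dict per window position, mapping an amino acid
    letter to its score; letters missing from a dict score 0 (assumption).
    Returns None if the sequence is too short to contain any candidate.
    """
    best = None
    lo = max(first, n_before)
    hi = min(last, len(seq) - n_after)
    for site in range(lo, hi + 1):
        window = seq[site - n_before: site + n_after]
        score = sum(matrix[k].get(aa, 0.0) for k, aa in enumerate(window))
        if best is None or score > best[1]:
            best = (site, score)
    return best


# Toy matrix that rewards alanine at positions -3 and -1 only.
toy_matrix = [{}, {}, {}, {"A": 2.0}, {}, {"A": 2.0}, {}, {}]
toy_seq = "M" * 60 + "AQA" + "G" * 40
print(best_cleavage_site(toy_seq, toy_matrix))  # (63, 4.0)
```

The toy matrix is purely illustrative; real scores would come from the alignment of the 93 lumenal sequences.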
The SignalP predictor can also be used to predict lTP cleavage sites. In this case, we employed a truncation scheme (described below) resulting in 26 suggested cleavage sites per sequence, and the one with the highest cleavage site score was chosen as the final prediction. We also compared our results with those obtained using an older scoring-matrix method that was specifically designed to predict lTP cleavage sites (Howe and Wallace 1990).
Performance tests and comparisons
The results of different combinations of test sets analyzed by one or both of the Sec- and TAT-trained neural networks are shown in Table 1. Not surprisingly, the Sec and TAT networks perform best on their respective sets of sequences. For a cutoff score of 0.67, both reach a Matthews correlation coefficient (MCC) of 0.74, with the corresponding sensitivities and specificities in the range 0.8–0.9.
All Plantae sequences from Swiss-Prot annotated as containing a secretory signal peptide were analyzed (745 sequences) as well as a set of 878 plant sequences annotated as being located in the chloroplast stroma. As shown in Table 3, LumenP (with the TargetP preselection step) predicted only 0.4% of the signal peptides as being lTPs, while the fraction of false positives from the stromal set was 18.8%. The combination of TargetP and SignalP, however, predicted a much higher fraction of the stromal proteins as lumenal, 28.8% (HMM) or 58.7% (NN). Table 3 also shows the results of analyses of all Arabidopsis and Oryza proteins present in Swiss-Prot that were not annotated as being thylakoid lumenal. Again, LumenP performed better than SignalP-HMM, and they both performed significantly better than SignalP-NN. The PSORT1 results support the observation that PSORT1 is very conservative in its prediction of lumenal proteins.
Investigating the TargetP+LumenP performance on the Arabidopsis and Oryza Swiss-Prot subsets with known subcellular localization, we found that almost all (88% and 100% for Arabidopsis and Oryza, respectively, data not shown) of the proteins with a Swiss-Prot subcellular localization annotation that were incorrectly assigned as lumenal were annotated as stromal, which is in accordance with the previous tests.
The results of the LumenP and SignalP approaches for prediction of cleavage sites were rather similar to each other; 72 of the 93 (77.4%) cleavage sites in the redundancy-reduced test set were correctly predicted by LumenP, compared to 70 (75.3%) using the computationally more costly SignalP-NN analysis (Table 4). Testing the performance on the entire set of 259 lumenal proteins again revealed similar performance levels: 54.8% of the cleavage sites were correctly predicted by LumenP and 55.6% by SignalP. It is surprising that both LumenP and SignalP performed much better on the redundancy-reduced set of 93 proteins than on the remaining 166 lumenal proteins; we have no good explanation for this at present. In accordance with previous findings (Nielsen and Krogh 1998), we found that the NN version of SignalP is clearly better than the HMM version in predicting cleavage sites, even though the HMM outperforms the NN on the lumenal/nonlumenal prediction (Tables 2 and 3). The Howe-Wallace scoring matrix, which is based on only 12 sequences, performed worse than all the other methods, but again performed significantly better on the test set (93 sequences) than on the entire lumenal set (259 sequences).
To evaluate the frequency of false-positive predictions on a realistic test set, we applied LumenP and TargetP+LumenP to all plant proteins found in Swiss-Prot that were annotated as either containing a signal peptide (i.e., nuclearly encoded secretory proteins) or as being located in the chloroplast stroma. Although LumenP by itself identifies many signal peptides as being lTPs (data not shown), almost no secretory proteins survive through the combined TargetP+LumenP predictor. In contrast, 19% of the proteins annotated as chloroplast stromal (i.e., having a cTP but not an lTP) are predicted as lumenal by TargetP+LumenP (Table 3). This rather high value suggests that some of these proteins may be misannotated and are, in fact, lumenal, which is further supported by the observation that on the Swiss-Prot A. thaliana and Oryza sativa subsets with annotated subcellular location, almost all of the incorrectly assigned lumen proteins were annotated as stromal. A comparison with the TargetP+SignalP approach (including the sequence truncation scheme described above) for predicting lumenal sequences revealed that the most significant contribution of LumenP is to reduce the number of false positives.
A final application was to scan all the Arabidopsis and Oryza ORFs predicted from the complete genome sequences. Using prescreening by TargetP and then LumenP (cutoff pair 0.67/6.80), 417 out of 25,826 (1.6%) Arabidopsis, and 1200 out of 41,915 (2.9%) Oryza proteins were predicted as being located in the lumen of the thylakoid. A full listing of the predicted lumenal proteins is provided as Supplemental Material.
| Materials and methods |
|---|
Because lTPs have been shown to have much less sequence conservation than the mature part of the protein (Peltier et al. 2002), only the cTP+lTP part of the proteins was used for the training of LumenP. The cTP part was not removed because the exact cTP cleavage sites are generally not known. Also, TargetP prediction of cTP cleavage sites is not very reliable (Emanuelsson et al. 1999). Instead, a stretch of 35 residues upstream of the lTP cleavage site (determined by experiments if available, otherwise by similarity), roughly corresponding to the average length of lTPs, was annotated as belonging to the lTP in the neural network training procedure.
Negative set
A mixed negative set of roughly the same size as the positive TAT and Sec sets was constructed. This set contained proteins destined for the chloroplast (but not the thylakoid lumen), mitochondrion, cytoplasm, nucleus, and secretory pathway, in equal numbers.
The chloroplast sequences were extracted from Swiss-Prot release 40 by searching for "SUBCELLULAR LOCATION: CHLOROPLAST" in the CC field and "CHLOROPLAST" in the FT field. Proteins encoded in the chloroplast genome were excluded, as were those of algal origin, because cTPs from the green alga Chlamydomonas reinhardtii have been shown to be more similar to mTPs than to cTPs from higher plants in terms of length and amino acid composition (Franzén et al. 1990). The chloroplast sequences were truncated to the 130 most N-terminal amino acids before redundancy reduction.
All other sequences in the negative set were picked at random from the redundancy-reduced TargetP training set (available at http://www.cbs.dtu.dk/services/TargetP/datasets/datasets.html); the version of that set that was redundancy-reduced on the 112 N-terminal amino acids was used. The mixed negative set contained 50 sequences, 10 destined for each compartment.
Redundancy reduction
Redundancy reduction, that is, removal of homologous sequences, of the positive and negative sets was done in three steps. First, all sequences in a set were pairwise aligned all against all using the full Smith-Waterman algorithm (Smith and Waterman 1981) with the PAM250 scoring matrix as implemented in the ssearch program of the FASTA package (Pearson 1990). Second, based on the distribution of alignment scores, the threshold score above which sequences were considered too similar for network training was chosen as the value where the actual distribution of scores deviated from the extreme-value distribution expected for a local alignment of random sequences (Pedersen and Nielsen 1997). Two proteins whose similarity score is above the chosen cutoff are called "neighbors." Third, Hobohm algorithm 2 (Hobohm et al. 1992) was applied until no proteins were left that had any neighbors within the cutoff score. This algorithm creates a list of all proteins and their neighbors and then removes the protein that has the largest number of neighbors. Then, the neighbor list is recalculated, and again the protein with the largest number of neighbors is removed, and so on until the list only contains proteins that have no neighbors.
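The greedy removal loop described above can be sketched in a few lines of Python. The function name `hobohm2`, the input format (a list of identifiers plus precomputed pairwise scores), and the tie-breaking rule are assumptions for illustration; the alignment scores themselves would come from a separate ssearch run.

```python
def hobohm2(ids, pair_scores, cutoff):
    """Greedy redundancy reduction in the style of Hobohm algorithm 2.

    `ids` lists all sequence identifiers; `pair_scores` is an iterable of
    (id_a, id_b, score) tuples from an all-against-all alignment.
    Returns the sorted identifiers that remain once no two of them
    score above `cutoff` against each other.
    """
    neighbors = {i: set() for i in ids}
    for a, b, score in pair_scores:
        if score > cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)
    while neighbors:
        # Pick the sequence with the most neighbors (ties broken by id,
        # which is an arbitrary choice made here).
        worst = max(neighbors, key=lambda i: (len(neighbors[i]), str(i)))
        if not neighbors[worst]:
            break  # nobody has neighbors any more
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)
    return sorted(neighbors)


pairs = [("A", "B", 200), ("A", "C", 180), ("B", "C", 50), ("C", "D", 40)]
print(hobohm2(["A", "B", "C", "D"], pairs, cutoff=100))  # ['B', 'C', 'D']
```

In the toy example, A is removed because it has two neighbors, after which B, C, and D have none.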
After redundancy reduction, a total of 93 sequences were left in the positive set (50 in the TAT group and 43 in the Sec group).
Cross-validation
Fivefold cross-validation was used during training of the neural networks. Each of the five subsets contained about equal numbers of positive and negative examples, as well as equal numbers of the different types of negative examples, that is, chloroplast, mitochondrial, cytoplasmic, secretory pathway, and nuclear sequences.
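One simple way to obtain subsets that are balanced in this sense is to deal the members of each category out to the folds in turn. The sketch below assumes this round-robin scheme and a hypothetical `(identifier, category)` input format; the text does not state how the subsets were actually assembled.

```python
from collections import defaultdict


def stratified_folds(examples, n_folds=5):
    """Split (identifier, category) pairs into `n_folds` balanced folds.

    Each category (e.g. "TAT", "stromal", "nuclear") is dealt out
    round-robin, so every fold receives about the same number of
    examples from every category.
    """
    by_category = defaultdict(list)
    for ex_id, category in examples:
        by_category[category].append(ex_id)
    folds = [[] for _ in range(n_folds)]
    for category in sorted(by_category):
        for k, ex_id in enumerate(by_category[category]):
            folds[k % n_folds].append(ex_id)
    return folds


# 50 positives plus 10 negatives of each of five types, as for the TAT set.
examples = [(f"TAT_{i}", "TAT") for i in range(50)]
for cat in ("stromal", "mitochondrial", "cytosolic", "secreted", "nuclear"):
    examples += [(f"{cat}_{i}", cat) for i in range(10)]
folds = stratified_folds(examples)
print([len(f) for f in folds])  # [20, 20, 20, 20, 20]
```

With these counts, each fold holds 10 positive examples and 2 negative examples of each type.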
Neural network architecture and training
The Billnet (Perantonis and Virvilis 2000) neural network simulator platform (issued under GPL at http://www.iit.demokritos.gr/~vasvir/billnet/) was used for the development of LumenP.
For the recognition of both TAT and non-TAT lumenal proteins, two separate neural networks on top of each other were used. In the first layer network, the input data were as described above, and presented using sparse encoding and a sliding window of size 35 residues. The output of the first layer network is one score per residue, and the outputs corresponding to residues 21 to 110 for each protein (counting from the N-terminus) are forwarded to the second layer network, which outputs one score per protein based on the 90 input values it receives from the first layer network. Networks were separately trained on the TAT and Sec datasets. In the final LumenP predictor, the query protein is processed through both the TAT and Sec networks in parallel, giving two final scores of which the highest is chosen for the prediction.
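The data flow between the two layers can be sketched as follows. The helper names, the 21st "padding/unknown" input slot, and the choice of a window centered on the scored residue are assumptions; only the window size (35), the residue range (21 to 110), and the rule of taking the higher of the two final scores come from the description above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encode(residue):
    """One-hot vector of length 21; the last slot marks padding or
    unknown residues (an assumption of this sketch)."""
    vec = [0] * (len(AMINO_ACIDS) + 1)
    idx = AMINO_ACIDS.find(residue)
    vec[idx if idx >= 0 else len(AMINO_ACIDS)] = 1
    return vec


def sliding_windows(seq, size=35):
    """Yield one flattened, sparsely encoded window per residue."""
    half = size // 2
    padded = "-" * half + seq + "-" * half
    for center in range(len(seq)):
        window = padded[center:center + size]
        yield [bit for aa in window for bit in sparse_encode(aa)]


def second_layer_input(per_residue_scores, first=21, last=110):
    """Select first-layer scores for residues first..last (1-based)."""
    return per_residue_scores[first - 1:last]


def final_score(tat_score, sec_score):
    """The higher of the TAT and Sec network outputs is used."""
    return max(tat_score, sec_score)


windows = list(sliding_windows("M" * 130))
print(len(windows), len(windows[0]))            # 130 735
print(len(second_layer_input([0.0] * 130)))     # 90
```

Each first-layer input thus has 35 × 21 = 735 values in this encoding, and the second layer always receives exactly 90 values.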
A standard feed-forward network with logistic neurons (sigmoid transfer function), one hidden layer, and a sigmoid steepness of 4 was chosen for both the first and second layer networks. The number of neurons in the hidden layer was 8 in both the first and second layer networks for both the TAT and Sec versions. Error back-propagation was used as the training algorithm, and the initial weights were chosen at random. The learning rate was set to 0.001 for all networks, and the number of training cycles to 350 for first layer networks, and 100 or 150 for the TAT and Sec second layer networks, respectively. By choosing a constant number of training cycles for all networks in the cross-validation, we avoid optimizing on the individual test sets. Furthermore, the performance fluctuations were very small in a large region around the chosen training cycle numbers, and test set performance was thus not sensitive to the exact choice of stopping point.
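For orientation, the forward pass of a network of this shape (here the second layer: 90 inputs, 8 hidden neurons, 1 output) can be written in plain Python. This is a generic sketch, not Billnet code: the interpretation of "steepness" as a factor multiplying the summed input, the weight range, and all function names are assumptions, and training by back-propagation is omitted.

```python
import math
import random


def sigmoid(x, steepness=4.0):
    """Logistic function; steepness is assumed to scale the summed input."""
    return 1.0 / (1.0 + math.exp(-steepness * x))


def init_layer(n_in, n_out, rng):
    """Random initial weights; each row holds n_in weights plus one bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(inputs, hidden_layer, output_layer):
    """Propagate one input vector through one hidden layer to one output."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1])
              for row in hidden_layer]
    out_row = output_layer[0]
    return sigmoid(sum(w * h for w, h in zip(out_row, hidden)) + out_row[-1])


rng = random.Random(0)
hidden_layer = init_layer(90, 8, rng)   # 90 per-residue scores in, 8 hidden
output_layer = init_layer(8, 1, rng)    # one score per protein out
score = forward([0.5] * 90, hidden_layer, output_layer)
print(0.0 < score < 1.0)  # True
```

With untrained random weights the output carries no meaning; the example only shows the dimensions involved.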
Scoring matrix for cleavage site prediction
A cleavage site scoring matrix was constructed from an alignment of the region around the annotated cleavage sites. The set of 93 redundancy-reduced lumenal sequences was used for constructing the alignment. The elements (scores) $s_{i,j}$ of the scoring matrix, where i is the sequence motif position and j the amino acid, were then calculated from the multiple alignment in a standard fashion:

$$ s_{i,j} = \log \frac{f_{i,j}}{p_j} $$

where $f_{i,j}$ is the frequency of amino acid j at position i, and $p_j$ is the frequency of that amino acid in a background set. A simple form of pseudocounts was used: one was added to each count (Laplace's rule). The total amino acid distribution of the 259 lumenal full-length proteins was used as the background.
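A log-odds matrix with add-one pseudocounts can be built as in the following sketch. It assumes the log-odds form log(f/p) with the natural logarithm, windows containing only the 20 standard amino acids, and a background in which every amino acid occurs; the function name and the toy data are illustrative only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_scoring_matrix(aligned_windows, background_seqs):
    """Return one {amino acid: log-odds score} dict per alignment position.

    Positional counts get add-one pseudocounts (Laplace's rule); the
    background frequencies are taken from `background_seqs` as they are.
    """
    bg = Counter(aa for seq in background_seqs for aa in seq
                 if aa in AMINO_ACIDS)
    if any(bg[aa] == 0 for aa in AMINO_ACIDS):
        raise ValueError("background must contain every amino acid")
    total = sum(bg.values())
    p = {aa: bg[aa] / total for aa in AMINO_ACIDS}

    n_seqs = len(aligned_windows)
    denom = n_seqs + len(AMINO_ACIDS)  # one pseudocount per amino acid
    matrix = []
    for i in range(len(aligned_windows[0])):
        counts = Counter(w[i] for w in aligned_windows)
        matrix.append({aa: math.log((counts[aa] + 1) / denom / p[aa])
                       for aa in AMINO_ACIDS})
    return matrix


# Three made-up 8-residue windows (6 lTP residues + 2 mature residues).
windows = ["LSAVAQAA", "GSALAMAE", "PTAVAHAD"]
background = [AMINO_ACIDS * 5]  # uniform toy background
matrix = build_scoring_matrix(windows, background)
print(len(matrix), matrix[5]["A"] > 0 > matrix[5]["W"])  # 8 True
```

The score of a candidate site is then the sum of the matrix entries over its eight window positions.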
Performance measures
Prediction performance was measured by determining the sensitivity (number of true positives/[number of true positives + number of false negatives]), specificity (number of true positives/[number of true positives + number of false positives]), and the Matthews correlation coefficient (Matthews 1975), which is one for a perfect prediction and zero for a completely random assignment.
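These three measures follow directly from the four cells of a confusion matrix. The sketch below uses the definitions given above (note that specificity is defined here as true positives over all predicted positives); the function name and the counts in the example are made up for illustration.

```python
import math


def performance(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts.

    Assumes at least one actual positive and one predicted positive,
    so that the two ratios are defined.
    """
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)  # definition used in the text
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Illustrative counts only, not results from the paper.
sens, spec, mcc = performance(tp=40, fp=5, tn=45, fn=10)
print(round(sens, 2), round(spec, 2), round(mcc, 2))
```

A perfect prediction (no false positives or false negatives) gives an MCC of 1 with this formula.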
To further assess the number of predicted false positives, LumenP was also tested on all Plantae sequences from Swiss-Prot release 40 annotated as containing a secretory signal peptide (SP) or annotated as chloroplast (but not thylakoid lumen). All entries of plant origin were extracted by searching for "Eukaryota; Viridiplantae" in the OC line, resulting in 5694 entries. From this set, sequences annotated as containing an SP were extracted by searching for the keyword "SIGNAL" in the FT field, resulting in 745 entries. All sequences with "SUBCELLULAR LOCATION: CHLOROPLAST" in the CC field and "CHLOROPLAST" in the FT field were collected and those annotated as thylakoid proteins were excluded, resulting in 878 sequences. Also, all Arabidopsis (1034 sequences) and Oryza (402 sequences) sequences found in Swiss-Prot release 40 and 40.17, respectively, were analyzed (with proteins with clear annotation of thylakoid localization removed).
Genome-wide datasets
The fully sequenced genomes of A. thaliana (The Arabidopsis Genome Initiative 2000; 25,826 ORFs, downloaded from ftp://ftpmips.gsf.de/cress/arabiprot/, version 2002-04-03) and O. sativa (Goff et al. 2002; 41,915 ORFs, downloaded from ftp://ftp.tigr.org/pub/data/o_sativa/irgsp/PUBLICATION_RELEASE/GENOME/, version 2002-04-19) were analyzed.
Availability
LumenP prediction is available from the authors by request via e-mail (gunnar{at}dbb.su.se).
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Berks, B.C., Sargent, F., and Palmer, T. 2000. The Tat protein export pathway. Mol. Microbiol. 35: 260–274.
Emanuelsson, O. and von Heijne, G. 2001. Prediction of organellar targeting signals. Biochim. Biophys. Acta 1541: 114–119.
Emanuelsson, O., Nielsen, H., and von Heijne, G. 1999. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci. 8: 978–984.
Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G. 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300: 1005–1016.
Franzén, L.G., Rochaix, J.D., and von Heijne, G. 1990. Chloroplast transit peptides from the green alga Chlamydomonas reinhardtii share features with both mitochondrial and higher plant chloroplast presequences. FEBS Lett. 260: 165–168.
Goff, S.A., Ricke, D., Lan, T.H., Presting, G., Wang, R., Dunn, M., Glazebrook, J., Sessions, A., Oeller, P., Varma, H., et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92–100.
Halpin, C., Elderfield, P.D., James, H.E., Zimmermann, R., Dunbar, B., and Robinson, C. 1989. The reaction specificities of the thylakoidal processing peptidase and Escherichia coli leader peptidase are identical. EMBO J. 8: 3917–3921.
Hobohm, U., Scharf, M., Schneider, R., and Sander, C. 1992. Selection of representative protein data sets. Protein Sci. 1: 409–417.
Howe, C.J. and Wallace, T.P. 1990. Prediction of leader peptide cleavage sites for polypeptides of the thylakoid lumen. Nucleic Acids Res. 18: 3417.
Matthews, B. 1975. Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405: 442–451.
Nakai, K. and Kanehisa, M. 1992. A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14: 897–911.
Nielsen, H. and Krogh, A. 1998. Prediction of signal peptides and signal anchors by a hidden Markov model. Intell. Syst. Mol. Biol. 6: 122–130.
Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. 1997. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10: 1–6.
O'Donovan, C., Martin, M.J., Gattiker, A., Gasteiger, E., Bairoch, A., and Apweiler, R. 2002. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief. Bioinform. 3: 275–284.
Pearson, W.R. 1990. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183: 63–98.
Pedersen, A.G. and Nielsen, H. 1997. Neural network prediction of translation initiation sites in eukaryotes. Perspectives for EST and genome analysis. Intell. Syst. Mol. Biol. 5: 226–233.
Peltier, J.B., Friso, G., Kalume, D.E., Roepstorff, P., Nilsson, F., Adamska, I., and van Wijk, K.J. 2000. Proteomics of the chloroplast: Systematic identification and targeting analysis of lumenal and peripheral thylakoid proteins. Plant Cell 12: 319–341.
Peltier, J.B., Emanuelsson, O., Kalume, D.E., Ytterberg, J., Friso, G., Rudella, A., Liberles, D.A., Soderberg, L., Roepstorff, P., von Heijne, G., et al. 2002. Central functions of the lumenal and peripheral thylakoid proteome of Arabidopsis determined by experimentation and genome-wide prediction. Plant Cell 14: 211–236.
Perantonis, S. and Virvilis, V. 2000. Efficient perceptron learning using constrained steepest descent. Neural Netw. 13: 351–364.
Robinson, C., Hynds, P.J., Robinson, D., and Mant, A. 1998. Multiple pathways for the targeting of thylakoid proteins in chloroplasts. Plant Mol. Biol. 38: 209–221.
Schubert, M., Petersson, U.A., Haas, B.J., Funk, C., Schröder, W.P., and Kieselbach, T. 2002. Proteome map of the chloroplast lumen of Arabidopsis thaliana. J. Biol. Chem. 277: 8354–8365.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195–197.
von Heijne, G. 1983. Patterns of amino acids near signal sequence cleavage sites. Eur. J. Biochem. 133: 17–21.