Laboratory of Biocomputing, CIRB/Department of Biology, University of Bologna, 40126 Bologna, Italy
Reprint requests to: Rita Casadio, Department of Biology, University of Bologna, Via Irnerio 42; 40126 Bologna, Italy; e-mail: casadio{at}alma.unibo.it; fax: 0039-051-242576.
(RECEIVED June 7, 2002; FINAL REVISION August 6, 2002; ACCEPTED August 9, 2002)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0219602.
| Abstract |
|---|
Keywords: Hidden neural networks; hidden Markov models; neural networks; cysteine-bonding state
| Introduction |
|---|
Several analyses of the characteristics of disulfide bond formation in proteins have been performed, covering structural and sequence features and the classification of connectivity (Harrison and Sternberg 1994; Wedemeyer et al. 2000; Fariselli and Casadio 2001). Many experimental and theoretical studies indicate that disulfide bonds increase protein stability mainly by constraining the unfolded conformation (among others, see Skolnick et al. 1997; Abkevich and Shakhnovich 2000; Clarke et al. 2000; Welker et al. 2001).
The ever-increasing number of fully sequenced genomes requires powerful tools to perform protein sequence analysis at a genomic scale. However, only a few studies have addressed the important problem of predicting the bonding state of cysteine in a protein chain.
The relevance of the flanking residues in predicting a cysteine-bonding state has been shown using statistical methods (Fiser et al. 1992), neural networks (NNs; Muskal et al. 1990; Fariselli et al. 1999), and methods that combine local context and global information about protein sequences (Fiser and Simon 2000; Mucchielli-Giorgi et al. 2002).
Here, we present an approach based on hidden neural networks (HNNs) that combines NNs and hidden Markov models (HMMs) and that outperforms the existing methods, reaching accuracies of 88% and 84% on a per-cysteine and per-protein basis, respectively.
| Results and Discussion |
|---|
It was previously shown that a perceptron without hidden layers could extract general characteristics of the local contexts conducive to disulfide bond formation (Fariselli et al. 1999). More specifically, when an 11-residue segment of the protein chain centered on the cysteine to be predicted was considered, the following features could be captured: (1) The presence of cysteine residues in the environment of the central cysteine strongly favors disulfide bond formation, with the exception of position 3 (this is in agreement with the fact that in proteins, metal-binding cysteines are typically found in positions i and i+3), and (2) hydrophilic and/or charged residues in the environment are highly conducive to disulfide bond formation, whereas hydrophobic residues are poorly conducive (Fariselli et al. 1999).
In the present paper, the network uses a larger input window (27 residues instead of 11), a choice made after a careful search of the parameter space. The data set is also larger: the number of proteins is 1.5-fold higher, and the number of cysteine-containing segments 1.7-fold higher, than in the previous study (Fariselli et al. 1999). A statistical analysis of the segment composition conducive to bond formation nevertheless gives results similar to those described before (Fariselli et al. 1999). This again confirms that the local context of the central cysteine determines its bonding state and that an NN is sufficient to capture the relevant interactions within the input window that favor the bonding or nonbonding state.
However, the network is unable to capture global information. A network predicts one central cysteine at a time, without keeping a record of the other predictions associated with the same sequence. In other words, when a cysteine is predicted in a chain, its prediction does not take into account whether other cysteines are present in the chain and what their predicted bonding state is.
To model this global information, we use a four-state HMM, which constrains the hybrid system to predict an even number of cysteines in the bonding state in each chain, independently of the total number of cysteines in the protein; the NN outputs are then used as emission probabilities of the HMM (Fig. 1).
When only the NN-based predictor is adopted, the average accuracy per cysteine residue is 80% (similar to the accuracy obtained with an NN using a smaller set of proteins; Fariselli et al. 1999), and that per protein is 57% (Fig. 2, blue bars in the bar plot). Remarkably, when the hybrid system (HNN) is tested on the same protein set, accuracy per cysteine residue increases to 88%, and that per protein improves by at least 27 percentage points (Fig. 2, red bars). Concomitantly, the correlation coefficient increases from 56% to 73%. The improvement obtained with the hybrid method compared with the NN is apparently due to the introduction of global "rules" defined by the regular grammar implemented in the HMM. This second step takes into account the number of cysteines in a chain and keeps track of the bonding states of all the cysteines in the same chain.
In conclusion, in the present paper, we show that a hybrid system combining local with global information outperforms previously developed methods to predict the cysteine-bonding states, confirming that for the problem at hand, a crucial step forward can be made only when global features of the protein chains are taken into consideration.
| Materials and methods |
|---|
Nonhomologous proteins (with an identity value <25% and without chain breaks) were selected using the PAPIA system (Noguchi et al. 2001). Segments with cysteines that are interchain disulfide-bonded are included as "free" cysteines in the database (34 segments extracted from 27 monomeric chains and amounting to 0.8% of the database of segments). After this filtering procedure, the total number of proteins is 969 (2.8% contain segments corresponding to interchain disulfide bonds, which are considered free cysteines), with 4136 cysteine-containing segments, 1446 of which are in the disulfide-bonded state and 2690 of which are in the nondisulfide-bonded state. For each protein in our database, a profile based on a multiple sequence alignment was created using the BLAST program on the nonredundant data set of sequences. The profiles obtained are used to create the NN input.
During the training/testing phase, the database was split into 20 subsets (almost equally sized and distributed) to perform a 20-fold cross-validation. The PDB codes of the proteins included in the database, the 20-fold cross-validation lists, and the training profiles are available at http://www.biocomp.unibo.it/piero/cyspred/cysdataset.tgz.
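The splitting step can be illustrated with a minimal Python sketch. It assumes a simple round-robin assignment of whole proteins to folds, so that all cysteines of a protein stay in the same subset; the function names `split_folds` and `cross_validation_rounds` are hypothetical, and the actual lists used in this work are those distributed in the archive above.

```python
def split_folds(protein_ids, n_folds=20):
    """Assign whole proteins to n_folds subsets of almost equal size.

    Round-robin assignment is an assumption made for illustration only.
    """
    folds = [[] for _ in range(n_folds)]
    for i, pid in enumerate(protein_ids):
        folds[i % n_folds].append(pid)
    return folds


def cross_validation_rounds(folds):
    """Yield (training set, test set) pairs, one per fold."""
    for k, test in enumerate(folds):
        train = [pid for j, fold in enumerate(folds) if j != k for pid in fold]
        yield train, test


# Hypothetical identifiers standing in for the 969 protein chains.
ids = ["prot%d" % i for i in range(969)]
folds = split_folds(ids)
rounds = list(cross_validation_rounds(folds))
```

With 969 proteins and 20 folds, each subset holds 48 or 49 proteins, and every protein is used for testing exactly once.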
Measures of performance
The efficiency of the predictors is scored using the statistical indexes defined as follows.
The accuracy is
[Equation (1)]
The correlation coefficient C is defined as
[Equation (2)]
Finally, the accuracy per protein is
[Equation (3)]
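As an illustration of the three indexes, the following Python sketch uses the standard definitions that these names usually denote: the fraction of correctly predicted cysteines, the Matthews-type correlation coefficient, and the fraction of proteins in which every cysteine is correctly predicted. These definitions are assumptions, not a transcription of Equations 1 to 3, and the function names are hypothetical.

```python
import math


def q2(true, pred):
    """Fraction of cysteines whose bonding state is correctly predicted."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def correlation(true, pred, positive="bonded"):
    """Matthews-type correlation coefficient for the 'positive' class."""
    pairs = list(zip(true, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def q2_protein(proteins):
    """Fraction of proteins with all cysteines correctly predicted.

    proteins: list of (true_labels, predicted_labels) pairs, one per chain.
    """
    return sum(list(t) == list(p) for t, p in proteins) / len(proteins)


true = ["bonded", "bonded", "free", "free"]
pred = ["bonded", "free", "free", "free"]
per_cysteine = q2(true, pred)
coefficient = correlation(true, pred)
per_protein = q2_protein([(true, pred), (["free"], ["free"])])
```

The per-protein index is the stricter one: a single wrong cysteine makes the whole chain count as an error, which is why it is lower than the per-cysteine accuracy for the same predictor.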
Neural networks
Standard feed-forward NNs are implemented with a back-propagation algorithm as the learning procedure. The network architecture is similar to that previously used (Fariselli et al. 1999) and consists of a two-layer perceptron with two hidden neurons, one output node (discriminating between the disulfide-bonded and free cysteine propensities), and an input layer of 540 neurons (a 27-residue-long input window with 20 profile values per residue). Because of the limited number of examples presently available, an early stopping procedure is used to train the networks (Fariselli et al. 1999).
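The size of the input layer follows from the window length: 27 positions with 20 profile values each give 540 inputs. A minimal Python sketch of such an encoding is given below. The zero padding beyond the chain termini and the name `encode_window` are assumptions made for illustration; the text does not specify how window positions outside the chain are handled.

```python
WINDOW = 27          # residues in the input window
ALPHABET = 20        # profile values per residue
HALF = WINDOW // 2   # 13 residues on each side of the central cysteine


def encode_window(profile, center):
    """Flatten the profile rows around position 'center' into 540 values.

    profile: list of per-residue lists of 20 profile frequencies.
    Positions outside the chain are zero-padded (an assumption).
    """
    vec = []
    for pos in range(center - HALF, center + HALF + 1):
        if 0 <= pos < len(profile):
            row = profile[pos]
            if len(row) != ALPHABET:
                raise ValueError("each profile row needs 20 values")
            vec.extend(row)
        else:
            vec.extend([0.0] * ALPHABET)
    return vec


# Toy five-residue profile with a uniform distribution at every position.
toy_profile = [[0.05] * ALPHABET for _ in range(5)]
x = encode_window(toy_profile, center=2)
```

In this toy case only 5 of the 27 window slots fall inside the chain, so 100 of the 540 inputs are nonzero.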
Hidden neural network
A vector-based HMM that can handle emission probability vectors is used on top of the NNs described above. The hybrid system is termed an HNN, following the definition of Krogh and Riis (1999). A vector-based HMM similar to that used in this paper was recently developed (Martelli et al. 2002).
Briefly, if L is the number of cysteines in the protein and A is the size of the alphabet over which vectors are built (i.e., A = 2, bonding and nonbonding/free cysteine states), we refer to this sequence vector with the following notation:
[Equation (4)]
The components of each vector $s_t$ are positive and sum to a constant value S (independent of the position t).
The HMM for the specific problem at hand is composed of a Markov model with four states connected by means of the transition probabilities $a_{ij}$ (Fig. 1). Four is the minimum number of states required to constrain the number of paired cysteines in a chain to be even. The probability density function for the emission of a vector from each state is determined by A parameters that are specific to each state k and are denoted $e_k(c)$ (with c = 1, 2, ..., A):
[Equation (5)]
$\pi_t$ is the tth state in the path, and Z is the normalizing factor, with $\sum_c e_k(c) = 1$ (for further details, see Martelli et al. 2002). The vector $s_t$ is obtained directly from the NN outputs as
[Equation (6)]
Training of the HMM parameters is accomplished using a modified expectation-maximization algorithm (Martelli et al. 2002). To enforce the constraints imposed by the selected HMM (Fig. 1), the prediction for each cysteine is made one protein at a time by means of Viterbi decoding (Durbin et al. 1998).
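How a four-state model can enforce an even number of bonded cysteines during Viterbi decoding can be sketched as follows. This is an illustration under stated assumptions, not the trained model: the state layout (bonding label combined with the parity of the bonded cysteines seen so far), the uniform transitions, and the one-hot emission parameters are hypothetical choices, and the topology of Fig. 1 may differ in detail. The function name `viterbi_even` is likewise hypothetical.

```python
import math

# Hypothetical layout: each state is (label, parity of bonded count so far).
STATES = ["F_even", "B_odd", "F_odd", "B_even"]
LABEL = {"F_even": "free", "B_odd": "bonded",
         "F_odd": "free", "B_even": "bonded"}
# Allowed transitions keep the parity bookkeeping consistent.
NEXT = {
    "F_even": ["F_even", "B_odd"],
    "B_even": ["F_even", "B_odd"],
    "B_odd": ["F_odd", "B_even"],
    "F_odd": ["F_odd", "B_even"],
}
START = ("F_even", "B_odd")   # the count is zero (even) before the first cysteine
END = ("F_even", "B_even")    # a valid path ends with an even count


def viterbi_even(p_bonded):
    """Best labelling of one chain with an even number of bonded cysteines.

    p_bonded: one NN output in [0, 1] per cysteine of the chain.
    Transitions are uniform (two successors per state), so they add the
    same constant to every path and are omitted from the score.
    """
    if not p_bonded:
        return []

    def emit(state, p):
        q = p if LABEL[state] == "bonded" else 1.0 - p
        return math.log(max(q, 1e-12))

    score = {s: (emit(s, p_bonded[0]) if s in START else -math.inf)
             for s in STATES}
    back = []
    for p in p_bonded[1:]:
        new = {s: -math.inf for s in STATES}
        ptr = {s: None for s in STATES}
        for prev in STATES:
            if score[prev] == -math.inf:
                continue
            for s in NEXT[prev]:
                cand = score[prev] + emit(s, p)
                if cand > new[s]:
                    new[s], ptr[s] = cand, prev
        back.append(ptr)
        score = new
    # Trace back from the best end state with even parity.
    path = [max(END, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return [LABEL[s] for s in path]


labels = viterbi_even([0.9, 0.8, 0.6])
```

In this toy chain, thresholding each NN output at 0.5 independently would label all three cysteines as bonded, an odd and therefore inconsistent count. The constrained decoding instead frees the least confident cysteine, which is the kind of global correction that decoding one protein at a time makes possible.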
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Clarke, J., Hounslow, A.M., Bond, C.J., Fersht, A.R., and Daggett, V. 2000. The effects of disulfide bonds on the denatured state of barnase. Protein Sci. 9: 2394–2404.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. 1998. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK.
Fariselli, P. and Casadio, R. 2001. Prediction of disulfide connectivity in proteins. Bioinformatics 17: 957–964.
Fariselli, P., Riccobelli, P., and Casadio, R. 1999. The role of evolutionary information in predicting the disulfide bonding state of cysteines in proteins. Proteins 36: 340–346.
Fiser, A. and Simon, I. 2000. Predicting the oxidation state of cysteines by multiple sequence alignment. Bioinformatics 16: 251–256.
Fiser, A., Cserzo, M., Tudos, E., and Simon, I. 1992. Different sequence environments of cysteines and half cystines in proteins: Application to predict disulfide forming residues. FEBS Lett. 302: 117–120.
Harrison, P.M. and Sternberg, M.J.E. 1994. Analysis and classification of disulfide connectivity in proteins. J. Mol. Biol. 244: 448–463.
Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
Krogh, A. and Riis, S.K. 1999. Hidden neural networks. Neural Comput. 11: 541–563.
Martelli, P.L., Fariselli, P., Krogh, A., and Casadio, R. 2002. A sequence-profile-based HMM for predicting and discriminating β-barrel membrane proteins. Bioinformatics 18 (Suppl. 1): S46–S53.
Mucchielli-Giorgi, M.H., Hazout, S., and Tuffery, P. 2002. Predicting the disulfide bonding state of cysteines using protein descriptors. Proteins 46: 243–249.
Muskal, S.M., Holbrook, R.S., and Kim, S.H. 1990. Prediction of the disulfide-bonding state of cysteine in proteins. Protein Eng. 3: 667–672.
Narayan, M., Welker, E., Wedemeyer, W.J., and Scheraga, H.A. 2000. Oxidative folding of proteins. Acc. Chem. Res. 33: 805–812.
Noguchi, T., Matsuda, T.H., and Akiyama, Y. 2001. PDB-REPRDB: A database of representative protein chains from the Protein Data Bank (PDB). Nucleic Acids Res. 29: 219–220.
Skolnick, J., Kolinski, A., and Ortiz, A.R. 1997. MONSSTER: A method for folding globular proteins with a small number of distance restraints. J. Mol. Biol. 265: 217–241.
Wedemeyer, W.J., Welker, E., Narayan, M., and Scheraga, H.A. 2000. Disulfide bonds and protein folding. Biochemistry 39: 4207–4216.
Welker, E., Narayan, M., Wedemeyer, W.J., and Scheraga, H.A. 2001. Structural determinants of oxidative folding in proteins. Proc. Natl. Acad. Sci. 98: 2312–2316.