Computational Genomics, Department of Microbiology, University of Washington, Seattle, Washington 98109, USA
Reprint requests to Ram Samudrala, Computational Genomics, Department of Microbiology, University of Washington, Box 357242, Rosen Building, 960 Republican St., Seattle, WA 98109, USA; e-mail: ram{at}compbio.washington.edu; fax: (206) 732-6055.
(RECEIVED July 2, 2002; FINAL REVISION October 18, 2002; ACCEPTED October 31, 2002)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0222303.
Supplemental material: See www.proteinscience.org.
| Abstract |
|---|
shifts available). Finally, errors made by PsiCSI almost exclusively involve the interchange of helix or strand with coil and not helix with strand (<2.5 occurrences per 10,000 residues). The automation, increased accuracy, absence of gross errors, and robustness with regard to sparse data make PsiCSI ideal for high-throughput applications and should improve the effectiveness of hybrid NMR/de novo structure determination methods. A Web server is available for users to submit data and have the assignment returned.
Keywords: NMR; chemical shifts; secondary structure; neural networks
| Introduction |
|---|
| Secondary structure from chemical shifts (CSI) |
|---|
Hα chemical shifts are higher than average (downfield) in extended structures and lower than average (upfield) in helices. The same is true for 15N and Cβ shifts, whereas the opposite relationship holds for C' and Cα shifts. To exploit this information, CSI (Chemical Shift Index) (Wishart et al. 1992; Wishart and Sykes 1994) assigned one of three indices, -1, 0, or 1, depending on whether the chemical shift was near the average value or at one of the extremes. Consecutive occurrences of like indices were used to identify the presence of secondary structure. To further increase accuracy, a jury system averaged assignments from multiple chemical shifts (C', Cα, Cβ, and Hα) to arrive at a consensus assignment.
| Secondary structure from sequence (Psipred) |
|---|
| Secondary structure from chemical shifts and sequence (PsiCSI) |
|---|
| Results and Discussion |
|---|
The distribution of Q3 accuracies of PsiCSI, CSI, and Psipred is shown in Figure 1. The distribution of PsiCSI accuracies is very tight, reflecting the consistency of the method. Some of the less accurate results come from large regions of coil being assigned as helix or extended (see Electronic Supplemental Material). It is possible that PsiCSI is detecting some residual structure in these regions. PsiCSI does better than CSI or Psipred in the majority of cases, as is expected from the average per-residue increase in accuracy. The existence of cases in which PsiCSI performs 40% better than CSI and 28% better than Psipred indicates that PsiCSI is able to compensate when the other methods do extremely poorly.
, and Psipred data, and 86.8% with just Hα and Psipred data.
Simple use of just the first conformer was attempted with some success (Q3 = 86.1% for 68 proteins; data not shown). However, the level of variation between conformers was sufficiently high (94% average pairwise concordance; 87% concordance between the most divergent pair) that significant errors in secondary structure identification were introduced. The variability was reduced by using a consensus secondary structure (96% concordance). A further difficulty is that variability can be much higher in regions in which there are fewer experimental NMR constraints. The lack of constraints can be the result of true structural heterogeneity or of experimental factors (relaxation, exchange, chemical shift ambiguity) that preclude the observation and identification of NOEs. Thus, the training set was further restricted to residues for which the level of agreement on secondary structure was at least 90%, which accounted for a large majority (85%) of the residues.
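The consensus and filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each conformer has already been reduced to a three-state string (H, E, C), takes the consensus as the most common state per residue, and the function name `consensus_states` is hypothetical.

```python
from collections import Counter


def consensus_states(conformers, min_agreement=0.9):
    """Return (consensus string, per-residue agreement, well-defined mask).

    conformers: list of equal-length three-state strings, one per NMR conformer.
    min_agreement: fraction of conformers that must share the consensus state
    for a residue to count as well defined (0.9 follows the text).
    """
    if not conformers:
        raise ValueError("need at least one conformer")
    length = len(conformers[0])
    if any(len(c) != length for c in conformers):
        raise ValueError("all conformers must have the same length")

    consensus, agreement = [], []
    for column in zip(*conformers):  # one column per residue
        state, count = Counter(column).most_common(1)[0]
        consensus.append(state)
        agreement.append(count / len(conformers))

    mask = [a >= min_agreement for a in agreement]
    return "".join(consensus), agreement, mask
```

For example, with eight conformers reading `HHEC` and two reading `HCEC`, the second residue has 80% agreement and would be excluded, while the other three are kept.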
The accuracies of the different methods over this subset of residues and over the entire set of residues are shown in Table 4. Table 5 lists all of the proteins in the test set and the Q3 accuracies using PsiCSI, CSI, and Psipred. All methods improve by 3% when the subset of well-defined regions is used. Large-scale analysis of secondary structure by EVA (Eyrich et al. 2001; Rost and Eyrich 2001) has also detected a 3% lower accuracy for prediction methods when the first conformer of an NMR structure is used rather than a crystal structure. One possible reason for this decrease is that disordered regions are generally not observed in crystal structures, whereas they are present in NMR structures. Our protocol seems to have restored this 3% difference in prediction accuracy by filtering out these regions and also by eliminating the noise introduced by utilizing first conformer structures rather than the consensus of all conformer structures. This type of strategy may be useful for other surveys of secondary structures that include NMR structures. All statistics have been calculated using this subset of residues (9437 residues) with well-defined secondary structure unless otherwise stated.
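As an illustration of how an accuracy restricted to well-defined residues might be computed, the sketch below uses the standard definition of Q3 (the percentage of residues whose three-state assignment matches the reference). The function name `q3` and the optional boolean `mask` argument are assumptions made for this example.

```python
def q3(predicted, observed, mask=None):
    """Percentage of residues whose predicted state equals the observed state.

    predicted, observed: equal-length three-state strings (H, E, C).
    mask: optional list of booleans; only residues flagged True are scored,
    e.g. the residues with well-defined secondary structure.
    """
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if mask is None:
        mask = [True] * len(observed)
    if len(mask) != len(observed):
        raise ValueError("mask must match the sequence length")

    scored = [(p, o) for p, o, keep in zip(predicted, observed, mask) if keep]
    if not scored:
        raise ValueError("no residues selected by the mask")
    return 100.0 * sum(p == o for p, o in scored) / len(scored)
```

Calling `q3` once without a mask and once with the well-defined mask gives the two kinds of figures compared in the text.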
To further improve PsiCSI, more data points and more data sources will be required.
The present sample size (9437 residues) still places limits on the data processing scheme. For example, given a sufficient number of data points, it might be possible to improve upon the second layer of neural nets by training a separate set of nets for each residue type. More sample points would also reduce the noise in the original translation of chemical shifts to secondary structure. A larger training set would especially benefit the final neural network, which has many more connections than the other networks and, thus, is more difficult to train. As for data sources, additional NMR information could include more chemical shifts (e.g., from amides), J-coupling constants, and NOE data. These data could easily be incorporated as additional inputs to the neural nets. Because the major weakness of PsiCSI is in distinguishing coil from extended, NOE information, which could be used to infer the existence of non-local hydrogen bonding, is likely to be of greatest benefit. Finally, PsiCSI has scrupulously avoided a major source of secondary structure information: homology. The secondary structure of close homologs is highly conserved, and predictions based on close homology are much more accurate than any sequence-based method. Even in its present form, PsiCSI should perform better on proteins with homology to members of the training set. During prototyping, it was observed that overall accuracies, especially that of the first chemical shift to secondary structure potential translation and that of the final neural net layer, were significantly increased by the inclusion of homologs in the training set. PsiCSI could be easily modified to explicitly include homology information either directly as additional inputs, or indirectly through modifications of the secondary structure potential translations to weight data according to the degree of local sequence homology.
PsiCSI should expedite both experimental and theoretical applications
Presently, NMR secondary structure assignments require manual interpretation of several pieces of data, mainly chemical shifts, J-couplings, and NOEs. PsiCSI approaches the accuracy required to completely automate the process and certainly reduces the amount of additional data that needs to be gathered and interpreted before an assignment is made. The effectiveness of the method with sparse data also means that secondary structures can be confidently assigned at an earlier stage. The fact that PsiCSI does not require heteronuclear chemical shifts to be effective also makes it useful for proteins for which cost and/or poor expression preclude isotopic labeling. The very high accuracy and automated nature of PsiCSI also make it potentially quite useful for rapid profiling of proteins in high-throughput structural genomic applications. The ability of PsiCSI to function without complete assignment of all of the backbone chemical shifts makes it particularly well suited for use in conjunction with automated chemical-shift assignment methods, which do not always provide complete assignments.
PsiCSI is one of a new generation of applications, such as TALOS (Cornilescu et al. 1999) and Rosetta-NMR (Rohl and Baker 2002), that utilize the growing database of structural and sequence information to better interpret experimental data. Rosetta-NMR is an example of more ambitious attempts to marry de novo database-based protein structure simulations with NMR data to directly arrive at a tertiary fold. To reduce the search space, de novo programs often fix or bias secondary structures during the simulation (Ortiz et al. 1999; Samudrala et al. 1999; Bonneau et al. 2001). Small errors can affect the convergence of the final structures. However, gross errors, such as those in which large stretches of helix or strand are interchanged, can result in prediction of the wrong fold (Samudrala and Levitt 2002). PsiCSI should be of considerable help for these hybrid applications, not only because of the increase in overall accuracy, but also because of the virtual elimination of gross errors.
| Materials and methods |
|---|
For each of the proteins in the final set of 92, the secondary structure was first determined using DSSP (treating H and G as helix, E and B as extended, and everything else as coil). For NMR ensembles, the secondary structure of every conformer was determined and a consensus structure obtained. Residues for which there was <90% agreement between the conformers were excluded from further consideration. A database was made from the remaining residues, relating chemical shift to secondary structure and amino acid type. To translate a given chemical shift into secondary structure potentials, the database was searched for residues with chemical shifts (of the same residue type) that were within 0.4 ppm for C', 0.2 ppm for Cα and Cβ, 0.3 ppm for N, and 0.04 ppm for Hα. If there were <20 shifts, the next closest shifts were used until the minimum of 20 was obtained. Chemical shifts from the same protein or related proteins (see below) were excluded. The secondary structure of each residue within this set was tabulated. When there was partial disagreement among conformers as to the state of the residue, the proportion of conformers in each of the three states was used in the tabulation (e.g., for a residue in which 9 of 10 conformers are helical and 1 is coil, 0.9 would be added to the helix total and 0.1 to the coil total). The final numbers of residues in the helix, extended, and coil states were divided by the total number of chosen residues to obtain the secondary structure potentials. These potentials were then normalized to take into account the proportion of helix, extended, and coil states in the test set.
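A minimal sketch of this translation step is given below. It is an illustration under stated assumptions rather than the published implementation: the database is represented as a list of `(shift_ppm, {state: fraction})` tuples already restricted to one residue type and nucleus, with related proteins removed; "within" is taken as less than or equal to the tolerance; and the normalization shown (divide by the background proportion, then rescale to sum to 1) is only one plausible reading, because the text does not give the formula. All names are hypothetical.

```python
# Search tolerances in ppm, as listed in the text.
TOLERANCE_PPM = {"C'": 0.4, "CA": 0.2, "CB": 0.2, "N": 0.3, "HA": 0.04}
MIN_NEIGHBORS = 20
STATES = ("H", "E", "C")  # helix, extended, coil


def raw_potentials(query_shift, nucleus, entries):
    """Tabulate the states of database residues with similar chemical shifts.

    entries: list of (shift_ppm, fractions), where fractions maps each state
    to the proportion of conformers in that state for that residue.
    """
    if not entries:
        raise ValueError("database is empty")
    ranked = sorted(entries, key=lambda e: abs(e[0] - query_shift))
    tol = TOLERANCE_PPM[nucleus]
    chosen = [e for e in ranked if abs(e[0] - query_shift) <= tol]
    if len(chosen) < MIN_NEIGHBORS:
        # Too few hits: fall back to the closest shifts up to the minimum.
        chosen = ranked[:MIN_NEIGHBORS]

    totals = {s: 0.0 for s in STATES}
    for _, fractions in chosen:
        for s in STATES:
            totals[s] += fractions.get(s, 0.0)
    return {s: totals[s] / len(chosen) for s in STATES}


def normalize(potentials, background):
    """ASSUMED normalization: correct for the background state proportions."""
    weighted = {s: potentials[s] / background[s] for s in STATES}
    total = sum(weighted.values())
    return {s: weighted[s] / total for s in STATES}
```

A residue observed as helical in 9 of 10 conformers would enter the database as `{"H": 0.9, "C": 0.1}`, which reproduces the fractional tabulation in the example above.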
Psipred secondary structure potentials were obtained using Psipred v2.3. Two sets of secondary structure potentials can be obtained from Psipred. One set uses only the PsiBlast profiles, whereas the second set smooths the potentials by taking into account local interactions between residues. The first set of potentials was used, as the last neural net in PsiCSI also takes into account local interactions. However, the slightly more accurate smoothed set of potentials was used for all comparisons of accuracy.
Finally, for each potential, the number of correct assignments made using that potential was divided by the total number of assignments made. This was done for each of the three secondary structure states to give a set of three reliability estimates. Because of the strong dependence of these reliabilities on the residue type, the indices were calculated for each of the 20 amino acids to give 20 separate sets of reliability indices per potential.
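The reliability calculation can be illustrated with a short sketch. It assumes that an "assignment" is the state with the highest potential for a residue, which the text does not state explicitly, and that the input is a list of `(residue_type, assigned_state, true_state)` records for one potential; the function name is hypothetical.

```python
from collections import defaultdict


def reliability_indices(records):
    """Fraction of correct assignments per (residue type, assigned state).

    records: iterable of (residue_type, assigned_state, true_state) tuples
    collected for a single potential over the training residues.
    """
    made = defaultdict(int)     # assignments made per (residue, state)
    correct = defaultdict(int)  # of those, how many matched the true state
    for residue_type, assigned, true_state in records:
        key = (residue_type, assigned)
        made[key] += 1
        if assigned == true_state:
            correct[key] += 1
    return {key: correct[key] / made[key] for key in made}
```

With all 20 amino acids and three states present in the records, the result holds 60 values per potential, matching the 20 sets of three indices described above.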
First layer of neural nets reduces noise by considering shifts at neighboring residues
Because secondary structures involve more than a single residue, the accuracy of the initial set of potentials can be increased by examining the adjacent residues to see whether similar potentials are found. Thus, the potentials from the original residue and the two adjacent residues, along with the estimates of reliability, provided inputs for the first layer of neural networks. A total of 7 inputs per residue (3 secondary structure potentials, 3 reliability indices, and 1 input to indicate the absence of data due to the window extending beyond the edge of the protein), or 21 in total, led into 3 hidden units that fed into the final 3 output units. These outputs correspond to three new secondary structure potentials. The test set was balanced so that equal numbers of residues in the three states were present and then randomly split into two. One set was used to train the network and the other to evaluate the accuracy. Training was accomplished by resilient back-propagation until the evaluation set showed no improvement. This was done three times using a different set of initial values for the weights, and the best performing net was chosen. Different neural nets were trained for each of the five different chemical shifts. The SNNS (Stuttgart Neural Network Simulator version 4.2) package (http://www-ra.informatik.uni-tuebingen.de/SNNS/) was used to generate and train all of the networks. Reliabilities for the new set of potentials were also estimated.
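The input encoding for these networks can be sketched as below. This only assembles the input vectors and does not train a network; filling positions beyond the chain ends with zeros is an assumption, and the function name `window_features` is hypothetical.

```python
def window_features(potentials, reliabilities, half_width=1):
    """Build one flat input vector per residue from a sliding window.

    potentials, reliabilities: lists with one 3-value sequence per residue.
    half_width: residues taken on each side of the central residue
    (1 gives the three-residue window described in the text).
    Each window position contributes 7 values: 3 potentials, 3 reliability
    indices, and a flag that is 1.0 when the position lies off the chain.
    """
    n = len(potentials)
    if len(reliabilities) != n:
        raise ValueError("potentials and reliabilities must align")

    features = []
    for center in range(n):
        row = []
        for pos in range(center - half_width, center + half_width + 1):
            if 0 <= pos < n:
                row.extend(potentials[pos])
                row.extend(reliabilities[pos])
                row.append(0.0)          # data present
            else:
                row.extend([0.0] * 6)    # ASSUMED zero padding
                row.append(1.0)          # window extends past the chain end
        features.append(row)
    return features
```

With `half_width=1` each vector has 21 entries. The same function with `half_width=8` yields the 119 inputs described for the 17-residue window of the final network.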
Second layer of neural nets combines different chemical shift and Psipred potentials
To combine the chemical-shift and Psipred potentials, a second set of neural networks was used. Separate networks were trained for all possible combinations of chemical shift and Psipred data. Each neural network consisted of an input for each of the chemical shift derived secondary structure potentials (3–15), the reliability indices, and an input for each of the Psipred potentials and reliability indices. These fed into a hidden layer of six units and a final output layer of three units, again corresponding to further refined helical, extended, and coil potentials. The second layer of neural nets was trained on balanced sets in the same manner as for the first layer.
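To make the bookkeeping concrete, the sketch below enumerates the non-empty subsets of the five chemical shifts, each paired with Psipred data, and counts the input units for each network. It rests on assumptions: three reliability indices accompany each set of three potentials, and networks without Psipred inputs are not enumerated here. The names `shift_subsets` and `input_count` are hypothetical.

```python
from itertools import combinations

SHIFTS = ("C'", "CA", "CB", "N", "HA")


def shift_subsets(shifts=SHIFTS):
    """All non-empty subsets of the available chemical shifts."""
    subsets = []
    for k in range(1, len(shifts) + 1):
        subsets.extend(combinations(shifts, k))
    return subsets


def input_count(subset, with_psipred=True):
    """Input units: 3 potentials + 3 reliabilities per data source (ASSUMED)."""
    per_source = 6
    sources = len(subset) + (1 if with_psipred else 0)
    return per_source * sources
```

Under these assumptions, the number of chemical-shift potentials per network (three per shift) ranges from 3 for a single shift to 15 for all five, consistent with the range quoted in the text.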
Third neural net factors in local interactions
By use of the second layer of neural nets, secondary structure predictions were made with potentials obtained from each of the possible data combinations. These were compared with the actual secondary structure and ranked by their reliability for each residue type. Outputs from the most reliable combinations were used to provide inputs for the final neural net. The purpose of this neural net was to account for local interactions between secondary structure elements. The architecture was similar to that used in the first layer of networks, with seven inputs per residue corresponding to the secondary structure potentials and the reliability indices. However, due to the increased accuracy of the inputs at this point, a larger window of 17 residues could be used. The resulting 119 inputs fed into a hidden layer of 17 and an output layer of 3, corresponding to the 3 final secondary structure potentials. Training was done as before, except that sets were not balanced. Because the best available data nearly always includes a Psipred component, the final network optimizes itself to correct Psipred types of errors and underperforms when only chemical-shift information is available. Thus, estimates of accuracy when only chemical-shift information is available were obtained using a separate network that was trained on chemical shift data (resulting in a minor improvement of 0.5%).
Test sets and cross-validation
For stringent cross-validation, all of the calculations, including the chemical-shift translation, the calculation of reliability indices, the ranking of the performance of the different nets, and neural net training itself, were done by use of a subset that not only excluded the protein to be tested, but also any proteins in the same family [up to the T level as determined by CATH (Orengo et al. 1997)]. By use of software made publicly available by the researchers, CSI and Psipred were also used on the same dataset to predict secondary structure for comparison.
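The exclusion rule can be sketched as a simple filter. This is an illustration only: it assumes each protein is annotated with one or more CATH codes written as dot-separated numbers, truncates them to the first three levels (class, architecture, topology), and drops every protein that shares a topology with the test protein. The data layout and function names are hypothetical.

```python
def topology(cath_code):
    """Truncate a CATH code such as '3.40.50.720' to its T level, '3.40.50'."""
    return ".".join(cath_code.split(".")[:3])


def training_subset(proteins, test_id):
    """Proteins usable for training when test_id is held out.

    proteins: dict mapping protein id -> iterable of CATH codes.
    Excludes the test protein and any protein sharing a CATH topology with it.
    """
    held_out = {topology(code) for code in proteins[test_id]}
    kept = []
    for protein_id, codes in proteins.items():
        if protein_id == test_id:
            continue
        if held_out & {topology(code) for code in codes}:
            continue  # same topology as the test protein: exclude
        kept.append(protein_id)
    return kept
```

Every step that learns from data (shift translation, reliability indices, ranking, and network training) would then be rerun on `training_subset(proteins, test_id)` for each test protein, which is what makes the procedure stringent and also expensive.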
| Electronic supplemental material |
|---|
Web server
A server that takes as input a sequence and chemical shift data and returns a secondary structure prediction is accessible via http://protinfo.compbio.washington.edu.
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Bailey-Kellogg, C., Widge, A., Kelley, J.J., Berardi, M.J., Bushweller, J.H., and Donald, B.R. 2000. The NOESY jigsaw: Automated protein secondary structure and main-chain assignment from sparse, unassigned NMR data. J. Comput. Biol. 7: 537–558.
Bonneau, R., Tsai, J., Ruczinski, I., Chivian, D., Rohl, C., Strauss, C.E., and Baker, D. 2001. Rosetta in CASP4: Progress in ab initio protein structure prediction. Proteins 45: 119–126.
Bonvin, A.M., Houben, K., Guenneugues, M., Kaptein, R., and Boelens, R. 2001. Rapid protein fold determination using secondary chemical shifts and cross-hydrogen bond 15N-13C' scalar couplings (3hbJNC'). J. Biomol. NMR 21: 221–233.
Brenner, S.E. 2001. A tour of structural genomics. Nat. Rev. Genet. 2: 801–809.
Burley, S.K. 2000. An overview of structural genomics. Nat. Struct. Biol. 7: 932–934.
Chou, P.Y. and Fasman, G.D. 1974. Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins. Biochemistry 13: 211–222.
Cornilescu, G., Delaglio, F., and Bax, A. 1999. Protein backbone angle restraints from searching a database for chemical shift and sequence homology. J. Biomol. NMR 13: 289–302.
Cuff, J.A. and Barton, G.J. 1999. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins 34: 508–519.
Delaglio, F., Kontaxis, G., and Bax, A. 2000. Protein structure determination using molecular fragment replacement and NMR dipolar couplings. J. Am. Chem. Soc. 122: 2142–2143.
Eyrich, V.A., Marti-Renom, M.A., Przybylski, D., Madhusudhan, M.S., Fiser, A., Pazos, F., Valencia, A., Sali, A., and Rost, B. 2001. EVA: Continuous automatic evaluation of protein structure prediction servers. Bioinformatics 17: 1242–1243.
Frishman, D. and Argos, P. 1995. Knowledge-based protein secondary structure assignment. Proteins 23: 566–579.
Garnier, J., Osguthorpe, D.J., and Robson, B. 1978. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120: 97–120.
Jones, D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195–202.
Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
Krogh, A., Brown, M., Mian, I.S., Sjolander, K., and Haussler, D. 1994. Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. 235: 1501–1531.
Lim, V.I. 1974. Algorithms for prediction of α-helical and β-structural regions in globular proteins. J. Mol. Biol. 88: 873–894.
Montelione, G.T. 2001. Structural genomics: An approach to the protein folding problem. Proc. Natl. Acad. Sci. 98: 13488–13489.
Moseley, H.N., Monleon, D., and Montelione, G.T. 2001. Automatic determination of protein backbone resonance assignments from triple resonance nuclear magnetic resonance data. Methods Enzymol. 339: 91–108.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATH: A hierarchic classification of protein domain structures. Structure 5: 1093–1108.
Ortiz, A.R., Kolinski, A., Rotkiewicz, P., Ilkowski, B., and Skolnick, J. 1999. Ab initio folding of proteins using restraints derived from evolutionary information. Proteins 37: 177–185.
Richards, F.M. and Kundrot, C.E. 1988. Identification of structural motifs from protein coordinate data: Secondary structure and first-level supersecondary structure. Proteins 3: 71–84.
Rohl, C.A. and Baker, D. 2002. De novo determination of protein backbone structure from residual dipolar couplings using Rosetta. J. Am. Chem. Soc. 124: 2723–2729.
Rost, B. 1996. PHD: Predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol. 266: 525–539.
Rost, B. and Eyrich, V.A. 2001. EVA: Large-scale analysis of secondary structure prediction. Proteins (Suppl) 5: 192–199.
Samudrala, R. and Levitt, M. 2002. A comprehensive analysis of 40 blind protein structure predictions. BMC Struct. Biol. 2: 3–18.
Samudrala, R., Xia, Y., Huang, E., and Levitt, M. 1999. Ab initio protein structure prediction using a combined hierarchical approach. Proteins (Suppl) 3: 194–198.
Seavey, B.R., Farr, E.A., Westler, W.M., and Markley, J.L. 1991. A relational database for sequence-specific protein NMR data. J. Biomol. NMR 1: 217–236.
Spera, S. and Bax, A. 1991. Empirical correlation between protein backbone conformation and Cα and Cβ NMR chemical shifts. J. Am. Chem. Soc. 113: 5490–5492.
Wishart, D. and Sykes, B. 1994. The 13C chemical-shift index: A simple method for the identification of protein secondary structure using 13C chemical-shift data. J. Biomol. NMR 4: 171–180.
Wishart, D.S., Sykes, B.D., and Richards, F.M. 1991. Relationship between nuclear magnetic resonance chemical shift and protein secondary structure. J. Mol. Biol. 222: 311–333.
Wishart, D.S., Sykes, B.D., and Richards, F.M. 1992. The chemical shift index: A fast and simple method for the assignment of protein secondary structure through NMR spectroscopy. Biochemistry 31: 1647–1651.