|
|
||||||||
Department of Genetics, Informatics Laboratory, Stanford University, Stanford, California 94305, USA
Reprint requests to: Russ B. Altman, Department of Genetics, Informatics Laboratory, 300 Pasteur Drive, Room L329, Stanford University, Stanford, CA 94305, USA; e-mail: russ.altman{at}stanford.edu; fax: (650) 725-7944.
(RECEIVED August 20, 2003; FINAL REVISION August 20, 2003; ACCEPTED September 29, 2003)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.03385504.
| Abstract |
|---|
|
|
|---|
Keywords: surface; fitness function; shape; molecular modeling; principle components
| Introduction |
|---|
|
|
|---|
One useful type of structural data is information about the shape of the molecule. A variety of experimental and computational techniques provides incremental information about the expected shape of a biological macromolecule, including electron microscopy (EM; Frank 1996), sedimentation experiments (Urbanke and Ziegler 1980), homology modeling (Sali and Blundell 1993; Simons et al. 1999), and small-angle scattering experiments (Kaiushina et al. 1985). The most common of these methods is EM, which can directly visualize molecular structures of significant size, such as protein complexes. However, the data resolution of EM is currently 7 to 9 Å (Chiu et al. 2002), whereas crystallography and NMR are in the 2 to 4 Å range. Thus, assignment of individual atom positions is very challenging using EM alone.
We propose and test a computational method that applies shape information as a discrimination metric in the evaluation of structural models. If a model has more shape agreement with measured shape information, that model is more likely to be correct than are models with less agreement. To our knowledge, shape information has not been routinely and automatically used in assessing model fitness. Another potential use for a general shape scoring system (not demonstrated in this work) would be within the context of building novel molecular models using shape information.
To take advantage of the data sources that contribute shape information, we developed a unified, linear data structure to encode shape information called the surface envelope (SE). An SE is any three-dimensional data structure that assigns a number between zero and one to each point in space corresponding linearly to the amount of electron density observed at that point. These numbers are called density values. In practice, assigning density values to every point in real space is computationally expensive, so our SE implementation assigns a regular cubic grid over three-dimensional space, and associates one density value to each grid point. The region around each grid point is called a box and has the shape of a cube. We assign all points within each box to have the same density value as the value associated with the central grid point. Boxes contiguously span three-dimensional space in each direction. Figure 1A
shows an example of an SE.
|
Using shape information to assess the quality of a structural model is a complex task for a variety of reasons. There are two parts to the problem of creating a function capable of scoring model-envelope matches: (1) aligning the model structure to the SE and (2) generating a score from the alignment. An alignment of two objects in three dimensions consists of translating and rotating one object with respect to another. Alignment is required because of the orientation ambiguity intrinsic to shape information, in which individual atoms have not been located within the shape, and so, standard RMS fitting cannot be performed. For example, a model with an accurate molecular structure that is rotated or translated out of alignment with respect to the SE looks like an incorrect structure. Only after the model is translated and rotated so the model and the SE are aligned can a fair assessment be obtained of the model shape. Alignment is confounded by the fact that in most cases, the putative structural model typically does not have a shape that exactly matches the available shape in the SE, so there is no single correct alignment. No simple method exists for exhaustively searching possible translations and rotations for the best alignment. Lanzavecchia et al. (2001) have shown previously the value of using principle components for alignment.
Generating the score is simpler. Once a model is aligned to an SE, there are many ways of assigning a score. When atoms fall in regions of the SE with high-density values, the model should receive a higher score. One may also give models lower scores if they fail to account for all the density values present in the SE. We evaluated three different score metrics called S1, S2, and S3, detailed in the Materials and Methods section. Briefly, the first score, S1, simply counts the overlap between the model and the SE. The second score, S2, applies a weighted average between S1 and a global penalty factor that measures the volume of the SE not filled by atoms in the model. The third score metric, S3, counts atoms from the model within the SE but penalizes each term in the summation if there are local regions in the SE not filled with atoms. S3 does penalize an alignment for regions of the SE not filled, but only in regions close to atoms within the model.
The difficulty of alignment is increased when the structural models are imperfect and do not exactly match the shape encoded in the SE. When molecular models exactly match SE shapes, there is less need to provide continuous scores to differentiate model quality based on shape information. However, our goal is to rank order a large number of models, all of which are wrong by various degrees. Poor matches between the SE shape and the model mean a variety of different alignments may all look equally correct, depending on the scoring method applied. Ideally, we would expect clear correlation between how well a given model conforms to a given shape and how close the model is to being the correct model (RMSD). Furthermore, as the RMSD increases, one would expect this correlation to slowly break down, because as models become increasingly incorrect (higher RMSD) some may match well to the available shape, whereas others may not. We would expect that at higher RMSD, a variety of model-shape matches would be possible, and so, the correlation would break down. Every modelSE pair produces a single point on such a graph, as shown in Figures 5
through 8
, 9B
, and 12
. These figures plot the RMSD versus the match score (which is high for poor matches and zero for perfect matches).
|
|
|
|
| Results |
|---|
|
|
|---|
PC alignment
Figure 2
shows a histogram of the maximum angle between corresponding principal components for each structural model in the Hodor set after alignment by PC alignment (PCA) only. Even for correct (0 RMSD) structures, a few models incorrectly align near 90° and 180° for their maximal angle mismatch. This error occurs for models in which the PC lengths are approximately equal and the corresponding PC calculation in the SE does not have proper correspondence between the components.
|
Iterated closest point alignment
We evaluated a second method, as discussed in the Materials and Methods section, to align all models in the Hodor set with their associated SE. Figure 3
shows a histogram with the resulting angular error distribution. In this case each structure was aligned from its random starting position.
|
Hybrid alignment
The final alignment method uses a combination of both ICP alignment and PCA. Figure 4
shows the histogram of maximal component angles for the Hodor set structures after hybrid alignment. Here we observe the elimination of structures at 180° and a significant reduction of structures at 90°. Note that this result depends on using three initial random starting orientations for each alignment result. At times we observed structures at 180°, but they were sufficiently rare that the data presented in Figure 4
represent our best expectation of alignment performance after many observations.
|
Model scoring
The tests of matching models against SEs consist of aligning and scoring a series of different SE-model pairs and comparing those scores to the RMSD of the model to the correct structure. The correct structure in these cases are either the crystallographic or NMR solved structure from the PDB. The SE in each case is derived directly from the correct structure in a way that simulates experimental measurement of the SE using EM.
Three different sets of results are presented in the following three subsections, using structures derived from (1) decoy sets, (2) random perturbations, and (3) modeling trajectories. In all these cases, we apply the score functions S1, S2, or S3, described later in Materials and Methods. Unless otherwise noted, the widths of the boxes in the SE are 3.7 Å. Following these three results, we present a direct comparison of the different score functions.
Decoy set scoring
We tested 24 different decoy sets generated for the purpose of structure prediction. Table 1
presents the PDB identification of the different structures tested. A decoy set is a large number of different molecular models, each representing a different guess about the correct structure. An online resource at Stanford University called "Decoys R Us" provided the decoy sets (http://dd.stanford.edu). The first decoy set includes 602 different models of 1ctf
[PDB]
as described above. Figure 6
presents a graph showing the match score using S1 on the Y-axis and the RMSD of the model in Ångstroms (Å) on the X-axis. Figure 7
presents the same decoy set aligned and scored using S2.
|
|
|
The second immunoglobin is also a monoclonal antibody Fab fragment with 1730 atoms in 223 residues. The PDB identification is 1dvf
[PDB]
. These results are also in Figure 8
.
The final decoy set contains structures of yellow tuna myoglobin with PDB identification 1myt
[PDB]
. This is an
-helical structure with 1092 atoms and 146 residues. These results are also in Figure 8
.
Random perturbation
Another way to generate a set of models is to take the correct structure and randomly perturb atoms. Each atom is randomly moved 1.0 Å in each direction from its current position to determine its position in the next model in the set. Van der Waals interactions are ignored. Clearly, these models are not indicative of the structural properties observed in nature, nor do they represent reasonable guesses about correct structures. Models generated by small random perturbations provide a smooth and continuous series of models at RMSD ranges very close to the correct structure. The decoys tested in the first two experiments described above do not come closer than a few Ångstroms RMSD to the correct structure.
The results of two randomly generated model sets are presented in this section. The correct structure in both cases is the ribosomal protein 1ctf
[PDB]
. Results of the both random model sets are presented in Figure 9
.
Trajectory analysis
A final source of molecular models for alignment tests includes models created during structural optimization computations (Williams et al. 2001). The set of models used for this test were created from each cycle of a distance-based calculation of the sevarin domain 1svr
[PDB]
, a 94-atom structure. Figure 10
presents these results in three graphs; the top shows the progression of the modeling run as model moves toward the correct structure, the middle graph presents the score for each model from the set against the SE, and the bottom presents the RMSD of each model versus the match score.
|
|
| Discussion |
|---|
|
|
|---|
We observed the expected behavior of match scores in almost all of the envelope matching results. As RMSD increases, the value of the SE would decrease in its ability to discriminate model shapes. Above some RMSD value (we call this the "fall-over point"), the minimum match scores became basically flat. However, depending on the scale on the Y-axis, the actual RMSD value above which shapes become irrelevant varies depending on the structure in question. It was surprising to note that the shape of the graphs differed somewhat between the decoy set matching and the envelope matching done on random perturbation models. Most notably, the random perturbation (Fig. 5
) exhibits a fall-over point at ~10 Å. Models with RMSD above this value show lower correlation with envelope score. Said another way, we observed that models in the random set that were >10 Å RMSD showed high correlation with envelope score. The surprise was that for decoy set scoring, this value was much lower, ~2.5 to 3.5 Å. This can be seen in Figure 8
. One possible explanation for this difference is the clear difference in methods used to generate models in the two different sets. The decoy set models are more tightly packed, resulting in better matches to the correct shape at lower RMSD values.
Comparing the alignment methods highlights the problems with both PCA and ICP used alone. PCA alone occasionally fails even in the alignment of crystal structure models with SE shapes derived from the same models due to incorrect correspondences between PCs in the model and the SE. The frequency of these incorrect correspondences increases for models further from the correct structure. Examples of this effect appear at the top of Figure 9
, showing scores for structure 1dvf
[PDB]
. The PCA algorithm fails because incorrect PC matching occurs. This happens because in some models, the rank ordering of PC vectors may change causing a 90 or 180 flip of the model with respect to the SE. This problem gets worse when the ratio of longest to shortest principle component approaches one, and we found ratios generally <1.2 to 1.3 problematic. An abrupt orientation shift of this magnitude would cause critical failures during modeling. The main failure of ICP comes from its inability to move models all the way to an accurate alignment: an optimization local minimum. The degree of failure in ICP is highly dependent on the starting position of the model, and taking multiple ICP trials mitigates much of the issue. The clear winner in alignment is the hybrid model, applying strengths from both techniques. With the hybrid technique, the several successive ICP trials identify a close starting points for assigning accurate PC correspondences.
Our results demonstrate the power of shape information to differentiate correct structures from incorrect ones. Although the results from the 1ctf
[PDB]
decoy set in Figure 6
are mediocre, slightly better results are seen by using a more complex score function in Figure 7
. With the exception of a few models in the lowest RMSD, the models are effectively not differentiable using our scoring methods. This is partly because PC with similar magnitudes in the 1ctf
[PDB]
structure confounds the PCA. It is also notable that the 1ctf
[PDB]
structure is highly globular, and regardless of the accuracy of the alignment, the shape information from the SE does not add significantly to our understanding of the 1ctf
[PDB]
structure. Other structures with more elongated shapes have more information about their structure encoded in their shape.
The alignment and scoring of other decoy sets shows clear ability of our methods to distinguish low RMSD models from higher ones. Results in Figure 8
for structures 1baf
[PDB]
, 1dvf
[PDB]
, and 1myt
[PDB]
show expected behavior in the RMSD versus match score curves. These results are best with the more accurate score function S3 and hybrid alignment and indicate that using shape information for discrimination may be very valuable for prospective modeling. These promising results lead directly to the application of shape information in creating novel biomolecular models.
| Materials and methods |
|---|
|
|
|---|
Scoring
Matching a model to an SE involves two steps. First, the model and SE must be registered in three-dimensional space. We present two different methods below and a hybrid method that uses parts of both alignment methods. The second step occurs once the registration is complete: A scoring function is applied to determine how well the shape of the model and the SE agrees. We call the resulting score the "match score." The section Scoring Envelope Matches below presents three different scoring functions and the relative merits of each for different purposes.
PC alignment
PCs of a set of points are a set of three orthogonal vectors. The direction of the longest and shortest PC vectors corresponds to the direction with the greatest and smallest variance of point positions from the center of mass (Mortenson 1995). Given the first two components, the direction of the middle component is determined by the right hand rule. The magnitude (length) of each of the three vectors corresponds to the amount of deviation from the center of mass by the complete collection of points in the vector direction. The calculation of PC is performed by eigenvector decomposition on the square symmetric matrix of deviations, M,
![]() | (1) |
where dx is the deviation of a given point from the center of mass in the x direction, with similar definitions for dy and dz. The three eigenvectors of this matrix define the direction of the PCs, and the eigenvalues are the PC lengths. For a set of atoms, the PCs are calculated by assuming each atom is a single point. For SE, each box in the three-dimensional grid is treated as a point, and dx, dy, and dz in equation 1
are weighted by the density number in each box. Figure 1
shows an example of PC vectors superimposed on a molecular model. The model is a C
representation of sevarin protein domain two (PDB ID 1svr
[PDB]
). The alignment uses PC vectors from both the model and the SE. The two structures are superimposed at their center of mass. The atoms in the model are all rotated to bring the first and third (longest and shortest) PC vectors into collinear alignment.
For PCA to work, the lengths of the three vectors must be distinct. A situation in which the components are all the same, such as a sphere, or two of the components are equal, such as a cylinder, introduces rotational symmetry. More precisely, the greater the ratio of magnitude between the longest and shortest component in the correct structure and the SE, the easier it is to ensure that the correct components are aligned in any particular model.
ICP alignment
This alignment method applies an adaptation of the ICP algorithm (Besl and McKay 1992). ICP is a general-purpose method for accurate registration of three-dimensional shapes. The algorithm has three basic steps: (1) compute closest points in the SE for every atom in the model, (2) compute the closed-form translation and rotation that aligns the corresponding modelSE point pairs as well as possible to minimize a distance-based score metric, and (3) apply the translation and rotation to the model.
The algorithm iterates the previous three steps until either the error from the score metric decreases below a threshold, or the same correspondence points are regenerated, leading to a null transformation of the model. The choice of scoring function for determining the transform affects both the speed and the accuracy of convergence. We apply least-squares minimization of distances to minimize the RMSD between all current correspondences.
Another variation to the ICP algorithm includes the addition of a maximum distance cutoff for which no correspondence is assigned. This helps the algorithm to determine the best fit for models that do not fit the SE shape exactly, by leaving out points that fit poorly. We restrict the point correspondences between the model and the SE to unique points in each data set. Thus, each point in the model and the SE is assigned a maximum of one correspondence to a point in the other data structure. The number of points in the SE is much greater than that of the molecular model, so this results in picking unique SE points for each atom point in the model. The algorithm has an observed dependence on the relative starting position of the model with respect to the SE. The method typically does not make significant rotations and translations after the first or second alignment iteration. Each step after the first few is an incremental step to align the two structures.
Hybrid alignment
The early results of both PC and ICP prompted the investigation of a hybrid method of modelSE alignment that took some of the positive features of both methods. The PCICP hybrid alignment method applies the following steps:
Scoring envelope matches
We present three different methods for assigning a score to a model aligned against an SE. The first method defines the total score, S1, with
![]() | (2) |
where Di is the density value from SE at the location of atom i. This score method is fast and atom-centric because all parts of the score function can be attributed to specific atoms. We report a final match score as the difference between the maximum possible score and the calculated value S1, providing a zero score for a perfect match and increasing values for imperfect matches.
The second scoring method includes a penalty to the total score for regions of the SE not filed with atoms. We define S2 with
![]() | (3) |
where penalty factor Pj equals zero for each box j in the SE that contains an atom, and equals the density value in the box (a value from zero to one) if no nearby atom can account for the atom density expected in box j. The variables
1 and
2 are adjustable scaling factors. Two factors make this match scoring function significantly more computationally intensive to calculate than S1. First, the magnitude of m, the number of boxes in the SE, is significantly greater than n, typically by an order of magnitude. The second difficulty is that the implementation of atoms within the model tracks their positions as points. The SE data structure uses boxes on the three-dimensional grid to store the sum of atom densities. To accurately calculate the value of Pj, all adjacent boxes in the SE (the 26 neighbors of Pj) must be checked for atoms close to box j, because atoms with positions near a box edge attribute density to the SE in both boxes adjacent to that edge. S2 is not an atom-centric scoring method. The elements included in the second summation can no longer be attributed to any particular atom thus making it potentially less useful as part of an atom-based modeling method.
The third scoring function includes a penalty factor similar to S2 but also maintains all parts of the score function as attributable to particular atoms. We define S3 with
![]() | (4) |
where Pi is now a penalty term calculated for each atom. Similar to the above penalty, S3 reduces the score for regions of the SE not filled with atoms. In this case, Pi is the sum of penalty term Pj defined above, but calculated over SE boxes adjacent to the box containing atom i. This does not completely account for large sections of the SE not covered by atoms, but it does more accurately represent the shape matching compared with S1.
| Acknowledgments |
|---|
| References |
|---|
|
|
|---|
Branden, C. and Tooze, J. 1999. Introduction to protein structure. Garland Publishing, New York.
Chiu, W., Baker, M.L., Jiang, W., and Zhou, H. 2002. Deriving folds of macromolecular complexes through electron cryomicroscopy and bioinformatics approaches. Curr. Opin. Struct. Biol. 12: 263269.[CrossRef][Medline]
Frank, J. 1996. Three-dimensional electron microscopy of macromolecular assemblies. Academic Press, New York.
Hodor, P.G., Caruana, R., Sassaman, E., Buchanan, B.G., and Rosenberg, J.M. 1999. Examination of amino acid sequence rules for
helix pairing in proteins. Poster session, International Conference on Intelligent Systems for Molecular Biology, Heidelberg, Germany.
Kaiushina, R.L., Izotova, T.D., Mogilevskii, L.I., Shmakova, F.V., and Khurgin, I.I. 1985. Study of the structure of immunoglobins by small-angle x-ray diffraction, I: The structure of IgMCep in solution. Bioorg. Khim 11: 753761.[Medline]
Lanzavecchia, S., Cantele, F., and Bellon, P.L. 2001. Alignment of 3D structures of macromolecular assemblies. Bioinformatics 17: 5862.
Mortenson, M.E. 1995. Geometric transforms. Industrial Press, New York.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATH: A hierarchic classification of protein domain structures. Structure 5: 10931108.[Medline]
Sali, A. and Blundell, T.L. 1993. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234: 779815.[CrossRef][Medline]
Schmidt, J.P., Chen, C.C., Cooper, J.L., and Altman, R.B. 1998. A surface measure for probabilistic structural computations. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6: 148156.[Medline]
Simons, K.T., Bonneau, R., Ruczinski, I., and Baker, D. 1999. Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins 3: 171176.
Urbanke, C. and Ziegler, B. 1980. Complete evaluation of sedimentation velocity experiments in the analytical ultracentrifuge. Fres. Z. Anal. Chem. 301: 139140.[CrossRef]
Williams, G.A., Dugan, J.M., and Altman, R.B. 2001. Constrained global optimization for estimating molecular structure from atomic distances. J. Comp. Biol. 8: 523547.[CrossRef]
Zhou, Z.H., Baker, M., Jiang, W., Dougherty, M., Jakana, J., Dong, G., Lu, G., and Chu, W. 2001. Electron cryomicroscopy and bioinformatics suggest protein fold models for rice dwarf virus. Nat. Struct. Biol. 8: 868873.[CrossRef][Medline]
![]()
CiteULike
Connotea
Del.icio.us
Digg
Reddit
Technorati What's this?
This article has been cited by other articles:
![]() |
J. M. Dugan and R. B. Altman Using surface envelopes to constrain molecular modeling Protein Sci., July 1, 2007; 16(7): 1266 - 1273. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. J. Morris, R. J. Najmanovich, A. Kahraman, and J. M. Thornton Real spherical harmonic expansion coefficients as 3D shape descriptors for protein binding pocket and ligand comparisons Bioinformatics, May 15, 2005; 21(10): 2347 - 2355. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |