1 Protein Structure and Function Group, International Centre for Genetic Engineering and Biotechnology, 34012 Trieste, Italy
2 Department of General Chemistry, University of Pavia, 27100 Pavia, Italy
Reprint requests to: Dr. Oliviero Carugo, International Centre for Genetic Engineering and Biotechnology, Area Science Park, Padriciano 99, 34012 Trieste, Italy; e-mail: carugo{at}icgeb.trieste.it; fax: 39 040 22 65 55.
(RECEIVED February 15, 2001; FINAL REVISION April 3, 2001; ACCEPTED April 10, 2001)
Article and publication are at www.proteinscience.org/cgi/doi/10.1110/
| Abstract |
|---|
Keywords: Root-mean-square distance; structure classification; structure comparison; three-dimensional similarity
| Introduction |
|---|
$$\mathrm{rmsd} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d_i^{2}} \tag{1}$$
where d_i is the distance between the i-th of the n pairs of equivalent atoms in two optimally superposed structures. The rmsd is 0 for identical structures, and its value increases as the two structures become more different. Rmsd values are considered reliable indicators of variability when applied to very similar proteins, such as alternative conformations of the same protein. On the other hand, rmsd data calculated for structure pairs of different sizes cannot be directly compared, because the rmsd value depends on the number of atoms included in the structural alignment. An rmsd value of, say, 3 Å has a different significance for proteins of 500 residues than for those of 50 residues; accordingly, the structural variability of fold types cannot be easily compared in quantitative terms (Irving et al. 2001). In other words, rmsd is a good indicator of structural identity, but less so of structural divergence.
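As a minimal sketch of this definition, assuming Equation 1 is the usual root-mean-square of the per-pair distances, the rmsd can be computed as follows. The function name and the example distances are hypothetical and serve only as illustration.

```python
import math


def rmsd_from_distances(distances):
    """Root-mean-square distance from per-pair distances d_i (in Å).

    `distances` holds one distance per pair of equivalent atoms,
    measured after the two structures have been optimally superposed.
    """
    n = len(distances)
    if n == 0:
        raise ValueError("at least one pair of equivalent atoms is required")
    return math.sqrt(sum(d * d for d in distances) / n)


# Hypothetical distances (Å) for four pairs of equivalent atoms.
example = [0.5, 1.0, 1.5, 2.0]
print(f"{rmsd_from_distances(example):.3f}")  # prints 1.369
```

Identical structures give distances of zero for every pair, so the function returns 0, in line with the text.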
The present communication aims to define a normalized, size-independent rmsd formula that could help to overcome this problem. To derive a formula relating rmsd to protein size, one would need a database of structural alignments in which all other parameters, such as secondary structure content and amino acid composition of the protein, are either constant (which is not possible) or evenly distributed with respect to protein chain length. Such experimental data are presently not available. For example, the FSSP database (Holm and Sander 1996) contains a reasonably high number of structural alignments (about 23,000), but 80% of these have small rmsd values (0–2 Å), which reflects the fact that the percentage of sequence identity is very high (more than 90% residue identity in 60% of the alignments).
We therefore decided to create a large artificial set of rmsd values via extensive self-comparison of 180 nonhomologous (maximal identity 25%) protein structures, selected from the Protein Data Bank (Berman et al. 2000) using the PDB_SELECT algorithm (Hobohm and Sander 1994). These proteins were selected so as to represent the largest possible variability of amino acid content, sequence length, and secondary structure content (Table 1). Each structure was compared, using the algorithm of Kabsch (1976, 1978), with 400,000 of its randomized variants created through random shuffling of the Cα equivalencies. All Cα atoms were included in superposing each structure with all its variants. Overall, we obtained 400,000 rmsd observations in each of the 180 randomization experiments, which corresponds to a database of 72 million structural alignments. As expected, the distribution of rmsd values thus obtained depends on the size of the protein. The rmsd values are not evenly distributed; rather, the histograms are biased toward the high rmsd values (Fig. 1a). Moreover, there are characteristic differences between proteins of different length, illustrated, for example, by the different rmsd limits of the 2000 smallest rmsd values in the two experiments, as shown by the shaded areas in Figure 1a.
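The randomization experiment can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' program: it assumes NumPy is available, uses a synthetic random-walk chain as a stand-in for real Cα coordinates, and interprets "shuffling of the Cα equivalencies" as superposing a structure onto a copy of itself whose atom order has been randomly permuted. The names `kabsch_rmsd`, `random_chain`, and `shuffled_rmsds` are hypothetical, and the number of variants is kept far below 400,000 so the example runs quickly.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (n, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove the translation by centering both sets on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # SVD of the covariance matrix gives the best rotation (Kabsch 1976).
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Sign correction keeps a proper rotation, excluding reflections (Kabsch 1978).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = P @ R.T - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))


def random_chain(n_residues, rng, step=3.8):
    """Synthetic chain with fixed 3.8 Å steps (a stand-in for a Cα trace)."""
    steps = rng.normal(size=(n_residues, 3))
    steps *= step / np.linalg.norm(steps, axis=1, keepdims=True)
    return np.cumsum(steps, axis=0)


def shuffled_rmsds(coords, n_variants, seed=0):
    """RMSD of a structure against variants with permuted atom equivalences."""
    rng = np.random.default_rng(seed)
    n = len(coords)
    return np.array(
        [kabsch_rmsd(coords, coords[rng.permutation(n)]) for _ in range(n_variants)]
    )


rng = np.random.default_rng(42)
for n in (50, 200):
    chain = random_chain(n, rng)
    values = shuffled_rmsds(chain, n_variants=200)
    # Report chain length, mean rmsd, and smallest rmsd of the shuffled set.
    print(n, round(float(values.mean()), 2), round(float(values.min()), 2))
```

With real structures, `coords` would hold the Cα coordinates read from a PDB entry; the synthetic chain is used here only to keep the example self-contained.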
$$\frac{\mathrm{rmsd}}{\mathrm{rmsd}_{100}} = -1.3 + 0.5\,\ln(N) \tag{2}$$
where N is the number of amino acid residues. This curve is accordingly independent of both the number n of observations included in the calculation and the magnitude of rmsd values; a statistical bias is therefore not likely. Given that −1.3 ≈ 1 − ln(10), the equation can be rearranged to give
$$\frac{\mathrm{rmsd}}{\mathrm{rmsd}_{100}} = 1 + \ln\sqrt{\frac{N}{100}} \tag{3}$$
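The rearrangement can be written out step by step. The sketch below assumes that Equation 2 has the form rmsd/rmsd100 = −1.3 + 0.5 ln(N), a form inferred from the stated approximation −1.3 ≈ 1 − ln(10) and from the numerical example given later in the text; only the second line is an approximation, since 1 − ln(10) ≈ −1.3026.

```latex
\begin{aligned}
\frac{\mathrm{rmsd}}{\mathrm{rmsd}_{100}}
  &= -1.3 + 0.5\,\ln N \\
  &\approx 1 - \ln 10 + 0.5\,\ln N \\
  &= 1 - 0.5\,\ln 100 + 0.5\,\ln N \\
  &= 1 + 0.5\,\ln\frac{N}{100}
   = 1 + \ln\sqrt{\frac{N}{100}}
\end{aligned}
```

The step from the second to the third line uses ln 10 = 0.5 ln 100, which is how the reference value 100 enters the expression.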
It is interesting to note that the value 100, the residue number corresponding to the chosen reference value, rmsd100, appears in the equation. We repeated the normalization procedure on the entire data set with residue numbers of 50, 75, 150, and 200, respectively, and in fact found that a generalized equation is valid with correlation coefficients of 0.96–0.99:
$$\frac{\mathrm{rmsd}}{\mathrm{rmsd}_{L}} = 1 + \ln\sqrt{\frac{N}{L}} \tag{4}$$
where L is the number of residues chosen as a reference. In other words, the relative root-mean-square distance rmsd/rmsdL is a simple function of the relative dimension N/L. Equation 3 can be simply rearranged to give a formula for a normalized rmsd value:
$$\mathrm{rmsd}_{100} = \frac{\mathrm{rmsd}}{1 + \ln\sqrt{N/100}} \tag{5}$$
The chain length of 100 residues was primarily chosen because this is the mean number of amino acids per domain (Xu and Nussinov 1998). rmsd100 is therefore the rmsd value that would be observed for a pair of structures of 100 residues exhibiting the same degree of similarity as the structures actually compared. In other words, the rmsd100 value can be considered a normalized, size-independent indicator of structural variability. For example, suppose that the Cα atoms of two pairs of protein structures, 50 and 200 residues long, respectively, can be superposed to give a final rmsd value of 1.0 Å. For the first pair of structures, sharing N = 50 equivalent residues, the corresponding rmsd100 value will be 1.524 Å. The second pair of structures (N = 200) is considerably more similar to each other (rmsd100 = 0.741 Å) despite the fact that the crude rmsd values are the same. In other words, the normalized rmsd100 qualitatively reflects the intuitive view that larger structures are more likely to differ from one another. Because the data were derived from proteins with more than 40 residues, we suggest that the rmsd100 formula should be applied to alignments that include more than 40 residues. On the other hand, it follows from the mathematical form of the equation that the formula can be applied only to structural alignments with at least 14 residues; for smaller N values the ratio in equation 2 would be negative.
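The worked example can be checked with a short script. This is a sketch that assumes Equation 5 reads rmsd100 = rmsd / (1 + ln √(N/100)) and Equation 2 reads rmsd/rmsd100 = −1.3 + 0.5 ln(N); the function name `normalized_rmsd` and its `reference` parameter are hypothetical. The values quoted in the text (1.524 Å and 0.741 Å) correspond to the rounded coefficient −1.3 of Equation 2, whereas the rearranged form gives marginally different numbers.

```python
import math


def normalized_rmsd(rmsd, n_residues, reference=100):
    """Size-normalized rmsd for a reference length (100 residues by default)."""
    ratio = 1.0 + 0.5 * math.log(n_residues / reference)
    if ratio <= 0:
        raise ValueError("too few aligned residues for this normalization")
    return rmsd / ratio


for n in (50, 200):
    exact = normalized_rmsd(1.0, n)               # rearranged form (Equation 5)
    rounded = 1.0 / (-1.3 + 0.5 * math.log(n))    # rounded coefficient (Equation 2)
    print(f"N={n}: {exact:.3f} {rounded:.3f}")
# prints:
# N=50: 1.530 1.524
# N=200: 0.743 0.741

# Smallest alignment length for which the ratio stays positive.
print(math.floor(100 * math.exp(-2)) + 1)  # prints 14
```

The last line reproduces the lower bound mentioned in the text: the ratio becomes zero at N = 100·e⁻² ≈ 13.5, so 14 residues is the smallest usable alignment length in purely mathematical terms.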
We think that the normalized rmsd can be useful in estimating the quality of an NMR ensemble of models, in applying multivariate statistical techniques to structural bioinformatic problems, as well as in comparing limited sets of protein three-dimensional structures.
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Carugo, O. and Eisenhaber, F. 1997. Probabilistic evaluation of similarity between pairs of three-dimensional protein structures utilizing temperature factors. J. Appl. Cryst. 30: 547–549.
Domingues, F.S., Koppensteiner, W.A., and Sippl, M.J. 2000. The role of protein structure in genomics. FEBS Lett. 476: 98–102.
Hobohm, U. and Sander, C. 1994. Enlarged representative set of protein structures. Protein Sci. 3: 522–531.
Holm, L. and Sander, C. 1996. Mapping the protein universe. Science 273: 595–602.
Irving, J.A., Whisstock, J.C., and Lesk, A.M. 2001. Protein structural alignments and functional genomics. Proteins 42: 378–382.
Kabsch, W. 1976. A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A 32: 922–923.
Kabsch, W. 1978. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A 34: 827–828.
Peters-Libeu, C. and Adman, E.T. 1997. Displacement-parameter weighted coordinate comparison: I. Detection of significant structural differences between oxidation states. Acta Crystallogr. D 53: 56–76.
Xu, D. and Nussinov, R. 1998. Favorable domain size in proteins. Fold. Des. 3: 11–17.
Yang, A.-S. and Honig, B. 2000. An integrated approach to the analysis and modeling of protein sequences and structures. I. Protein structural alignment and a quantitative measure for protein structural distance. J. Mol. Biol. 301: 665–678.