|
|
||||||||
Biology Department, Northeastern University, Boston, Massachusetts 02115, USA
Reprint requests to: Valentin A. Ilyin, Biology Department, Northeastern University, 360 Huntington Avenue, Boston, MA 02115, USA; e-mail: ilyin{at}neu.edu; fax (617) 373-3724.
(RECEIVED January 30, 2004; FINAL REVISION April 13, 2004; ACCEPTED April 13, 2004)
| Abstract |
|---|
|
|
|---|
Keywords: protein structure; structure alignment; structural similarity; topological invariant; common structural core
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.04672604.
| Introduction |
|---|
|
|
|---|
Significant progress has been made in both the methodology and algorithms of comparative structure analysis. Superimpositions of the structures and consequent structure-based sequence alignments have been characterized from different points, and several methods and databases on protein structural alignments have been developed. Systematic comparison of all available protein structures has led to the development of several different protein structure classification schemes. Each classification scheme is derived from different algorithms, with slightly different goals in mind. FSSP (Holm and Sander 1996) is based on the DALI (Holm and Sander 1993) algorithm, whereas CATH (Orengo et al. 1997) has its roots in the SAP algorithm (Taylor et al. 1994), SCOP (Murzin et al. 1995) is derived primarily by visual inspection, MMDB (Ohkawa et al. 1995) uses the VAST algorithm (Gibrat et al. 1996), and CE-ALL (Shindyalov and Bourne 2001) uses the CE algorithm (Shindyalov and Bourne 1998). Many other methodologies and classifications have been reported (for reviews, see Madej et al. 1995; Godzik 1996; Holm and Sander 1999; Shindyalov and Bourne 2000; Koehl 2001). Those highlighted here provide Web-accessible resources that are kept current as new structures become available from the Protein Data Bank (PDB; Berman et al. 2000).
With the numerous methods available, one might think the structure superimposition problem has been solved, but upon close examination, there are many unresolved questions. The structural alignment problem has proven to be NP-hard (Lathrop 1994; Godzik 1996). Despite the significant progress, current methods produce different results. A one-target function cannot be defined to evaluate an alignment; therefore, two criteria for the similarity between three-dimensional structures have been developed, namely root mean square deviation (RMSD) and number of equivalent (aligned) positions (Ne). The "right" alignment is some-where on the two-dimensional space representing relations between those two values, RMSD and Ne. All the methods of comparative structure analysis are facing this problem and are developing different "heuristics" to find a proper balance between lower RMSD and a larger number of aligned positions (Shindyalov and Bourne 1998). This leads to different alignments for the same proteins; consequently, the methods do not produce the same results. For example, the overlap between the alignments from the popular methods FSSP/DALI and CE is just 40% (Shindyalov and Bourne 2000); therefore, there are still many unknowns in the structural alignment problem. Taking into account the above, we thought it would be useful to approach this problem from a different point of view.
We present here an application of a different mathematical model to produce structural alignments. Similarity between protein structures has been analyzed using three-dimensional Delaunay triangulation patterns (Delaunay 1934) derived from the backbone representation of protein structure. The Delaunay triangulation (DT) has been known for a long time in mathematics, computer science, protein studies, and many other scientific fields. DT is topologically linked to Voronoi tessellation, originally developed by Voronoi (1908) and first applied to proteins by Richards (1974). Since then, both tessellations have been used successfully in the calculation of standard volumes of protein residues, characterizing proteinprotein interactions, understanding protein motions, analyzing cavities in protein structure (Chothia 1975; Finney 1975 Finney 1977; Gerstein et al. 1994; Harpaz et al. 1994), and in the analysis of volume of atoms on the protein surface (Gerstein et al. 1995). Delaunay simplexes were also used to develop four-body potential for statistical analysis of protein structures (Singh et al. 1996) and for studies on protein-specific correlations to protein stability changes by hydrophobic core mutations (Carter et al. 2001).
Presented here is a study on the application of the three-dimensional DT model for the structural alignment of proteins, based on our novel TOPOFIT method, which superimposes protein structures by identification of the common volume subgraph of Delaunay triangulation.
| Results and Discussion |
|---|
|
|
|---|
|
The conformity of the backbone topologies in the tessellation patterns has been analyzed for the growing seed of the structural alignment. An example of the seed growth is shown on Figure 2A
. One can see that the number of aligned positions (Ne) is growing along with an increase of the joint distance (RMSD; see Materials and Methods), and more and more aligned residues are associated with the seed. The backbone contacts in both proteins match each other from the beginning, but their topological correspondence remains the same only until some point, after which the topologies of the backbones start to diverge. The divergence begins at RMSD 11.5 Å (pointed by arrow in Fig. 2A
) at the joint distance of 3 Å. The topology of the growing seed is equivalent in both compared proteins until this point, and then there is a small area when the topology is almost the same with some small number of mismatches. After this area, the topology starts to deviate dramatically: The growing seed includes more and more mismatches; backbone contacts in one protein correspond to nonbackbone contacts or do not have correspondence with any contacts in another protein at all. The number of mismatches increases rapidly up to 50%, which is shown as the darker region in Figure 2A
. We will refer to the place on the RMSD/Ne curve where topology starts to deviate as a topomax point, a point on the curve where the growing seed of topologically equivalent spatial volumes reaches its maximum.
|
, all-
, and
/
classes of protein structures and their structural neighbors. Each protein pair has been aligned several times by the TOPOFIT method at joint distances ranging from 1 Å to 7 Å by 0.5 Å steps, and all the statistically significant seeds for the pair have been collected according to Z-score and the size. The distribution of the matching backbones versus the resulting RMSD of the alignment for each seed is shown in Figure 2B
It should be noted that the topomax point on the RMSD/Ne curve defined by the TOPOFIT method is actually not a point in a geometrical sense, it is rather a small region where the topology starts to deviate. In this area there are a small number of mismatches (see Fig. 2A
) and the graph volumes in both proteins are almost the same. The existence of this small region is logical because protein structures are defined by experiment and have some experimental errors, which produce small deviations in sensitive DT patterns. The DT patterns can also be affected by small conformational movements and mutations present in the structural neighbors. Taking into account the small region of mismatches, based on the results from Figure 2
, we consider it reasonable to identify the topomax point at a position with up to 80%85% matches, allowing up to 15%20% mismatches, which corresponds to the joint distance parameter of 3 Å.
It is particularly worth emphasizing that the topomax point is not an input threshold in the algorithm; it is not defined a priori. This feature point has been obtained as an output result of the comparison of the topological patterns between two structures and the determination of maximal common topologically invariant volumes between the structures. Therefore, the TOPFIT method presents an objective way to identify common parts of proteins that are identical in the topology of the DT patterns.
Evaluation of the TOPOFIT method to identify structural neighbors
The statistical significance of the TOPOFIT alignments has been evaluated on a data set of 10,731 structurally nonrelated protein pairs where 1,393,272 seeds have been extended as described in Materials and Methods, and a Z-score to identify structural neighbors has been derived (see Fig. 8
below).
|
The results of the TOPOFIT alignments are shown in Figure 3
on the RMSD/Ne plot. The majority of the alignments are distributed from 1 to 2 Å of RMSD while few alignments are found at RMSD equals 2 Å or more. The distribution also shows three distinct areas of the structural neighbors. An initial analysis of the contents of these three clearly separated areas suggests the following: In area A, an exact match, the pairs include matches between mutant proteins, different crystal forms of the same protein or its closest homologs; in area B, structural subfamily, a strong correlation of two or more domains between proteins is present; in area C, structural superfamily, proteins with at least one similar domain structure up to 100 residues are present, and this area contains the majority of all the alignments. A more detailed analysis of the contents in the observed areas is one of the future directions in the research by the TOPOFIT method.
|
|
|
|
|
The TOPOFIT structural alignments will be used for a general classification of protein folds based on the topological structural invariants across the structural families, detection of the variations in the topology between proteins, which will be provided by automated large-scale calculations, and for studying sequencestructure features and mapping variations in gene sequences onto structural families of the encoded proteins. A strong geometrical correlation between the topological invariants extends the usability of the method, for example, to more detailed comparative analysis of the functional sites between proteins, and for active and binding sites mapping, characterization, and classification, which will be useful for biomedical studies. The common subgraph volumes can also be used to help better understand the problem of protein folding and protein stability, which will result in applications of the TOPOFIT methods for comparative modeling of proteins. This is based on identification of the common graph volume by TOPOFIT methods derived from the alignment of target sequence with one or more structural templates, producing structural models. Therefore, the TOPOFIT method will be insightful for a wide range of protein studies.
Structure comparison using the TOPOFIT method is implemented in our integrated multiple structure-sequence viewer, Friend (Abyzov et al. 2003), and also publicly available from our TOPOFIT Web server at http://mozart.bio.neu.edu.
| Materials and methods |
|---|
|
|
|---|
Rn, the Voronoi cell of x
A is |
| (1) |
where d denotes the Euclidean distance in Rn. The nearest neighbor set N(x) of x
A is the set of points that are closest to x in Euclidean distance. For each point u
Rn define nb(A, u) as the set of points x' A\{u}. A point v
Rn is a Voronoi vertex (corner of the Voronoi cell) if |nb(A, v)| is maximal over all nearest sets. The Delaunay cell of v is the convex hull conv(nb(A, v)). The complex (or triangulation) of A is therefore a partition of the convex hull conv(A) into the Delaunay cells of its Voronoi vertices. The Delaunay complex is derived from the Voronoi diagram in the sense that there is a natural bijection between the two complexes that reverses the face inclusions. Apart from degenerate cases, in three-dimensional space each Delaunay cell is a tetrahedron with four points of A at its corners. This procedure therefore defines 4-edges (sets of 4 "mutually adjacent" vertices) in a (protein) structure in a parameter-free way. The 2-edges of a contact graph and 3-edges can, of course, be derived directly from the tessellation by considering subsets. We use C
atoms of the backbone chain of a protein for computation. The tessellation has been calculated using QHULL (Barber et al. 1996).
Therefore, the Delaunay tessellation of protein structure uniquely identifies close spatial neighbors to each particular point. This is the most important feature of DT used for the TOPOFIT method.
Contact graphs
The three-dimensional structure of a linear biopolymer, such as a protein, can be approximated by their contact structure, that is, by the list of all pairs of monomers that are spatial neighbors. Contact structures of polypeptides were introduced by Ken Dill and coworkers in the context of lattice models of protein folding (Chan and Dill 1990Chan and Dill 1991). The structures of proteins form a special class of contact structures. We assume that the monomers, amino acids alike, are numbered from 1 to n along the backbone. For simplicity we shall write [n] = {1, . . . , n}. The adjacency matrix of the backbone B has the entries Bi,j+1 = Bi+1,i = 1, i
[n 1]. In a more general context, polymers with cyclic or branched backbones could be considered. A contact structure is represented by the contact matrix C with the entries cij = 1 if the monomers i and j are spatial neighbors without being adjacent along the backbone, and cij = 0 otherwise. Hence, Cij = 0 if |i j|
1. Note that both B and C are symmetric matrices. We define the (contact) diagram ([n],
) to consist of n vertices labeled 1 to n and a set of arcs that connect nonconsecutive vertices. The diagram is simply a graphical representation of the contact matrix. As an example, we show the conventional backbone diagram of the protein crambin together with its discretized structure represented by the contact matrix and contact graph (Fig. 7
). The contact graph has the adjacency matrix A = B + C. The contact graph is used as an internal two-dimensional representation of protein contacts and is useful in many intermediate calculations.
|
The comparison algorithm
Consider two proteins labeled A and B. The superimposition algorithm has three steps. The first step is a Delaunay tessellation of points representing the proteins. The set of tetrahedrons for proteins A and B we call TA and TB, respectively. The second step includes initial classification of the tetrahedrons by shape, volume, and backbone topology and then systematic pairwise superimposition of all the tetrahedrons inside the different types in both proteins. For each tetrahedron ta of a particular type from the protein A, the best matched tetrahedron tb of the same type from the protein B is found. The pair of tetrahedrons (ta,tb) is called a "seed" and is used in the next step of the algorithm. The goal of the third step is to iteratively add as a many points to the seed as possible. The seed extension is a volume extension algorithm, not a backbone extension; each step adds one or more new tetrahedrons to the growing seed. Each new pair of points is chosen from the neighboring DT pattern by the topology and a restriction parameter called "joint distance," D. Because points are always added to both growing subgraphs, they can always be superimposed and the lowest RMSD is recalculated. If there are pairs of points (xa,xb), xa
ta, xb
tb with superimposition distance d'xa,xb) > D, then they are excluded from ta and tb. The number of matching tessellation contacts is also evaluated for all points. The match is defined when the contacts are present in both proteins and they are the same in type; either both are backbones or both are interresidue contacts, if not it is a mismatch.
The algorithm stops when there are no more new points or when the number of mismatches exceeds the threshold (15%); in other words we grow the seed a little further from the beginning of the topological mismatch.
Collecting a test set for different structural classes
The structural neighbors of seven proteins representing different structural classes have been collected up to the lowest scores from current popular methods DALI and CE, resulting in a total of 2905 proteins. PDB codes for the query proteins are 1bt3
[PDB]
chain A (mainly
helices), 1at0
[PDB]
(
sheets), 1hxn
[PDB]
(mainly
sheets), 1huw
[PDB]
(
helices), 1qj4
[PDB]
chain A (
sheets and
helices), 1zin
[PDB]
(
helices and
sheets), and 451c (
helices). The results from DALI were obtained from their Web server (Holm and Sander 1996), and the results of the CE calculations were downloaded from the CE Web site (Shindyalov and Bourne 2001).
Statistical significance of TOPOFIT alignments
To evaluate the statistical significance of the TOPOFIT structural alignments and derive a scoring function for identification of structurally related proteins, the frequency distribution for Ne and RMSD for a set of structurally nonrelated proteins has been analyzed (a second test set). All-to-all alignments for a known nonredundant set of nonhomologous proteins (Hobohm et al. 1992) have been calculated. To minimize bias, only nonoverlapping seeds were extended. The frequency distribution of 10,731 alignments is shown in Figure 8
on the RMSD/Ne plot. The observed frequency distribution has been analytically approximated. The distribution of Ne for each value of RMSD was approximated by Gaussian distribution with mean µ(RMSD) and
(RMSD) depending on RMSD. The parameters µ and
were obtained from the least-squares fit of the experimental distributions for each RMSD. The Gaussian approximation describes the data very well: Typical values of
2 for the approximation are ranging from the 0.95 to 1.2 per degree of freedom. The dependences µ(RMSD) and µ(RMSD) +
(RMSD) were approximated by exponents:
![]() |
![]() |
For a given RMSD and Ne the Z-score was calculated as deviation of Ne from the Gaussian average µ normalized to the Gaussian
![]() |
The obtained relation between Ne and RMSD for a Z-score equal to 3 is shown in Figure 8
. Alignments above Z-score 3 are considered to have a statistically significant structural correlation.
The complexity of TOPOFIT algorithms is n x m; the calculation varies significantly with an average of 15 sec per alignment based on the data set in Figure 3
on a 2.4-GHz Linux box with 512 Mb of memory, which is comparable to other methods.
| Acknowledgments |
|---|
| References |
|---|
|
|
|---|
Barber, C.B., Dobkin, D.P., and Huhdanpaa, H.T. 1996. The Quickhull algorithm for convex hulls. ACM Trans. Math. Software 22: 469483.[CrossRef]
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235242.
Carter Jr., C.W., LeFebvre, B.C., Cammer, S.A., Tropsha, A., and Edgell, M.H. 2001. Four-body potentials reveal protein-specific correlations to stability changes caused by hydrophobic core mutations. J. Mol. Biol. 311: 625638.[CrossRef][Medline]
Chan, H.S. and Dill, K.A. 1990. Origins of structure in globular proteins. Proc. Natl. Acad. Sci. 87: 63886392.
. 1991. Polymer principles in protein structure and stability. Annu. Rev. Biophys. Biophys. Chem. 20: 447490.[CrossRef][Medline]
Chothia, C. 1975. Structural invariants in protein folding. Nature 254: 304308.[CrossRef][Medline]
Delaunay, B. 1934. Sur la sphere vide. Bull. Acad. Sci. USSR VII: Class Sci. Mat. Nat. 7: 793800.
Finney, J.L. 1975. Volume occupation, environment and accessibility in proteins. The problem of the protein surface. J. Mol. Biol. 96: 721732.[CrossRef][Medline]
. 1977. The organization and function of water in protein crystals. Philos. Trans. R. Soc. Lond. B Biol. Sci. 278: 332.[Medline]
Fischer, D., Elofsson, A., Rice, D., and Eisenberg, D. 1996. Assessing the performance of fold recognition methods by means of a comprehensive benchmark. Pac. Symp. Biocomput. 300318.
Gerstein, M., Sonnhammer, E.L., and Chothia, C. 1994. Volume changes in protein evolution. J. Mol. Biol. 236: 10671078.[CrossRef][Medline]
Gerstein, M., Tsai, J., and Levitt, M. 1995. The volume of atoms on the protein surface: Calculated from simulation, using Voronoi polyhedra. J. Mol. Biol. 249: 955966.[CrossRef][Medline]
Gibrat, J.F., Madej, T., and Bryant, S.H. 1996. Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 6: 377385.[CrossRef][Medline]
Godzik, A. 1996. The structural alignment between two proteins: Is there a unique answer? Protein Sci. 5: 13251338.[Abstract]
Harpaz, Y., Gerstein, M., and Chothia, C. 1994. Volume changes on protein folding. Structure 2: 641649.[Medline]
Hobohm, U., Scharf, M., Schneider, R., and Sander, C. 1992. Selection of representative protein data sets. Protein Sci. 1: 409417.[Abstract]
Holm, L. and Sander, C. 1993. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233: 123138.[CrossRef][Medline]
. 1996. The FSSP database: Fold classification based on structurestructure alignment of proteins. Nucleic Acids Res. 24: 206209.
. 1999. Protein folds and families: Sequence and structure alignments. Nucleic Acids Res. 27: 244247.
Kabsch, W. 1978. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A 34: 827828.[CrossRef]
Koehl, P. 2001. Protein structure similarities. Curr. Opin. Struct. Biol. 11: 348353.[CrossRef][Medline]
Lathrop, R.H. 1994. The protein threading problem with sequence amino acid interaction preferences is NP-complete. Protein Eng. 7: 10591068.
Madej, T., Gibrat, J.F., and Bryant, S.H. 1995. Threading a database of protein cores. Proteins 23: 356369.[CrossRef][Medline]
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536540.[CrossRef][Medline]
OHearn, S.D., Kusalik, A.J., and Angel, J.F. 2003. MolCom: A method to compare protein molecules based on 3-D structural and chemical similarity. Protein Eng. 16: 169178.
Ohkawa, H., Ostell, J., and Bryant, S. 1995. MMDB: An ASN.1 specification for macromolecular structure. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3: 259267.[Medline]
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATHA hierarchic classification of protein domain structures. Structure 5: 10931108.[Medline]
Richards, F.M. 1974. The interpretation of protein structures: Total volume, group volume distributions and packing density. J. Mol. Biol. 82: 114.[CrossRef][Medline]
Schuster, P. and Stadler, P.F. 1999. Discrete models of biopolymers. In Handbook of computational chemistry and biology (ed. A. Konopka), pp. 8788. Marcel Dekker Inc., New York.
Shindyalov, I.N. and Bourne, P.E. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11: 739747.
. 2000. An alternative view of protein fold space. Proteins 38: 247260.[CrossRef][Medline]
. 2001. A database and tools for 3-D protein structure comparison and alignment using the Combinatorial Extension (CE) algorithm. Nucleic Acids Res. 29: 228229.
Singh, R.K., Tropsha, A., and Vaisman, I.I. 1996. Delaunay tessellation of proteins: Four body nearest-neighbor propensities of amino acid residues. J. Comput. Biol. 3: 213221.[Medline]
Taylor, W.R., Flores, T.P., and Orengo, C.A. 1994. Multiple protein structure alignment. Protein Sci. 3: 18581870.[Abstract]
Voronoi, G.F. 1908. Nouveles applications des parametres continus a la thorie des formes quadratiques. J. Reine Angew Math. 134: 198287.
![]()
CiteULike
Connotea
Del.icio.us
Digg
Reddit
Technorati What's this?
This article has been cited by other articles:
![]() |
F. S. Domingues, J. Rahnenfuhrer, and T. Lengauer Conformational analysis of alternative protein structures Bioinformatics, December 1, 2007; 23(23): 3131 - 3138. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Uzun, C. M. Leslin, A. Abyzov, and V. Ilyin Structure SNP (StSNP): a web server for mapping and modeling nsSNPs on protein structures with linkage to metabolic pathways Nucleic Acids Res., July 13, 2007; 35(suppl_2): W384 - W392. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Birzele, J. E. Gewehr, G. Csaba, and R. Zimmer Vorolign--fast structural alignment using Voronoi contacts Bioinformatics, January 15, 2007; 23(2): e205 - e211. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. M. Leslin, A. Abyzov, and V. A. Ilyin TOPOFIT-DB, a database of protein structural alignments based on the TOPOFIT method Nucleic Acids Res., January 12, 2007; 35(suppl_1): D317 - D321. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Abyzov, M. Errami, C. M. Leslin, and V. A. Ilyin Friend, an integrated analytical front-end application for bioinformatics Bioinformatics, September 15, 2005; 21(18): 3677 - 3678. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE |