1 Bioinformatics Program and 2 College of Pharmacy, University of Michigan, Ann Arbor, Michigan 48109, USA
Reprint requests to: Gordon M. Crippen, College of Pharmacy, University of Michigan, 428 Church Street, Ann Arbor, MI 48109-1065, USA; e-mail: gcrippen{at}umich.edu; fax: (734) 763-2022.
(RECEIVED February 23, 2005; FINAL REVISION June 10, 2005; ACCEPTED September 20, 2005)
| Abstract |
|---|
|
|
|---|
Keywords: structural alignment; database search; flexible alignment; fold recognition; structural genomics
Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.051428205.
| Introduction |
|---|
|
|
|---|
Using sequence alignment techniques, evolutionary linkages between proteins can be found, and functions for novel proteins can be inferred from known proteins. Well-defined statistical theories have been established to measure the statistical significance of alignment scores (ASs) (Karlin and Altschul 1990). Based on the extreme value distribution (EVD) model, the probability of finding an AS s larger than x between unrelated sequences is equal to
$$P(s > x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right) \qquad (1)$$
where λ and µ are scale and location parameters, respectively.
P(s > x) is also known as the P-value. Some use the E-value instead of the P-value to measure the statistical significance of an alignment. The E-value is defined as
![]() | (2) |
The lower the P-value or E-value, the more probable it is that two proteins are homologous to each other.
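As a small numerical illustration of the P-value, the sketch below evaluates the upper-tail probability of an EVD in its usual Gumbel form, P(s > x) = 1 − exp(−e^(−λ(x − µ))). The function name `evd_p_value` and the parameter values are illustrative assumptions, not values from this work.

```python
import math

def evd_p_value(x, lam, mu):
    """Upper-tail probability P(s > x) for a Gumbel-type EVD.

    lam is the scale parameter and mu the location parameter.
    -expm1(-t) equals 1 - exp(-t) but stays accurate for tiny t.
    """
    t = math.exp(-lam * (x - mu))
    return -math.expm1(-t)

# Hypothetical parameters, chosen only for illustration.
for score in (5.0, 20.0, 40.0):
    print(score, evd_p_value(score, lam=0.3, mu=5.0))
```

At x = µ the tail probability is 1 − e⁻¹, and it falls off roughly exponentially as the score rises above µ.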
If sequence identities go below the twilight zone of 20%–30% (Jaroszewski et al. 2002), sequence alignment can no longer reliably find functional linkages. On the other hand, similarities can be detected for proteins having <10% sequence identity with the aid of three-dimensional (3D) structures. The ongoing Protein Structure Initiatives (PSI) will ultimately make 3D structural annotations available for almost every protein sequence (Norvell and Machalek 2000), which will afford greater opportunity for applying structural information to function annotation. Many structure-based alignment methods have been developed (CE [Shindyalov and Bourne 1998]; Dali [Holm and Sander 1993]; LOCK 2 [Shapiro and Brutlag 2004]). Generally, there are two types of approaches to structural alignment: coordinate-based and environment-based.
In the coordinate-based approach, an alignment is just like aligning two sets of points (or fragments), and the similarity is evaluated based on how well the two sets can be superimposed in 3D space. The problem is that each protein chain is isolated from its native environment, thus ignoring interchain interactions. This situation becomes worse when protein structures are broken into domains before structural comparison (some intrachain interactions are also missing), worse when a domain is only represented as a trace of Cα atoms (some side-chain interactions are missing too), and even worse when the similarity measurement is only based on Cartesian coordinates of those Cα atoms (amino acid types are disregarded). Unfortunately, this is the situation we find in most structural alignment algorithms. The consequence of this approach is obvious: These methods are unable to distinguish protein chains in different oligomeric states.
In the environment-based approach, structure-derived descriptors rather than explicit Cartesian coordinate-based distances are used to generate the structure-based alignment. Most proteins are polymers of the 20 standard amino acids, and different physicochemical environments will favor different amino acid types. Thus a structure alignment can be transformed into an alignment of those environments, while the physicochemical properties of each residue's environment can be described as a combination of solvent accessibility, hydrogen bond strengths, and other structure-derived environmental descriptors (SEDs). Since these SEDs may be precalculated under native conditions, intermolecular and intramolecular interactions can be retained very well, and an alignment can be derived in the end that is compatible with the native state.
A pure environmental-based structural alignment method was developed by Suyama et al. (1997). Four SEDs (called "pseudoenergy functions" in their article), i.e., side-chain packing, solvation, hydrogen bonding, and local structure, are used to convert a 3D structure into a position-dependent matrix that contains the fitness scores of the 20 residue types to each position. Given two structures, an environmental compatibility distance matrix is built based on covariances between positions in the two 3D position-dependent matrices. Then the Needleman-Wunsch dynamic programming algorithm is used to generate the final alignment (Needleman and Wunsch 1970). This method can produce globally environment-compatible structural alignments, but it is hard to justify the accuracy of the alignment results. We can imagine that positions on two distinct β-sheet strands buried in a protein core may have exactly the same local environment (Taylor 1999), which cannot be differentiated based on the SEDs. Therefore, alignment errors are expected among those regions. Furthermore, there is no available statistical theory yet to measure the significance of global ASs produced by the Needleman-Wunsch algorithm.
COMPARER is another method to search for environmentally compatible structural alignments (Šali and Blundell 1990). Twenty structure- and sequence-derived descriptors are used to build "element-by-element dissimilarity distance matrices," over which alignments are determined by the Needleman-Wunsch algorithm. In order to get accurate alignment, an additional simulated annealing step is used to optimize the original alignment and correct incompatible interresidue hydrogen bond patterns for each aligned residue pair. COMPARER can produce relatively accurate environmentally compatible alignments. However, it does not provide a measurement to score the goodness of an alignment result. Instead, users have to check the corresponding 3D superimposed structure coordinates, "as automation of this alignment process cannot be guaranteed" (http://www-cryst.bioc.cam.ac.uk/). Thus this method is also not able to evaluate environmental incompatibilities in structural alignments.
Basically, coding 3D structures into SEDs alone will result in loss of information, which makes it hard to obtain accurate structural alignment. In many structural alignment methods, an environment-based alignment is only the first step, followed by additional coordinate-based alignment procedures to refine the initial alignment (Taylor 1999). However, in most of these programs, there are no constraints imposed to ensure environmental compatibility during the latter step, and the previously discussed drawbacks for coordinate-based approaches still apply to these methods.
In order to measure environmental compatibilities, the most important thing is to derive SEDs under a realistic environment. Recently RCSB has released a new format of protein structure file: the biological unit file. Each biological unit file contains a biological unit, which is defined as "the macromolecule that has been shown to be or is believed to be functional" (http://www.rcsb.org/pdb/). Biologically functional interactions (especially intermolecular interactions) in these files are more complete and realistic than those in the original Protein Data Bank (PDB) files (Berman et al. 2000). Based on realistic environments provided by these files, we have developed a novel structural alignment method as follows.
First, we apply some (environmental) standards to eliminate all of the environmentally incompatible chain fragment pairs from the candidate pool. Then we search for (nonoverlapping) maximal cliques in the candidate pool: Each maximal clique is the largest set of residues that can be aligned in Cartesian space. After iteratively determining all the cliques, the gaps (loop regions) between aligned residue pairs are bridged by dynamic programming over environmentally compatible distance matrices.
There are several unique features of this alignment method: First of all, the resulting alignments are both structurally compatible (alignments in core regions can be superimposed in 3D space) and environmentally compatible (environmentally incompatible pairings have been eliminated). Second, loop regions can be aligned without having similar 3D structures. Third, each clique corresponds to a core region of a domain/motif; proteins with multiple domains may be aligned by introducing multiple cliques. Therefore, it is not necessary to initially split the chain into domains.
Structural ASs have been found to follow the EVD in some coordinate-based alignment methods (FATCAT [Ye and Godzik 2004]; Structal [Levitt and Gerstein 1998]). Using the methods reported in FATCAT, we did a similar simulation of random structural alignments, based on which we have successfully developed two statistics to measure the alignment quality of a clique: One is the number of the aligned residue pairs (clique size [CS]), and the other is the AS, to measure overall environmental compatibility. The former corresponds to a coordinate-based similarity measurement, while the latter corresponds to an environment-based similarity measurement. We find that both statistics roughly follow the EVD. With the environment-based statistic (AS), we are finally able to distinguish protein structures with different oligomeric states.
| Results |
|---|
|
|
|---|
|
Statistics of structural alignments
We find that both the CS (i.e., the number of structurally aligned residues) and the environment-based AS approximately follow the EVD (Fig. 2), although the quantile-to-quantile plots indicate that the fitting is not perfect in regions of very small CS/AS and extremely large CS/AS. A similar result has been found and discussed in FATCAT, when Ye and Godzik (2004) fit structural ASs to an EVD model. Basically, our EVD model will give higher than expected P-values or E-values for extremely high CS/AS. Therefore, our EVD model is slightly more stringent in the high-scoring regions. Interestingly, the opposite trend is found in FATCAT.
|
![]() | (3) |
![]() | (4) |
and the correlation coefficients for µ and λ are 0.996 and 0.896, respectively. The EVD parameters for ASs are
![]() | (5) |
![]() | (6) |
and the correlation coefficients for µ and λ are 0.997 and 0.966, respectively. Based on these linear functions, the P-value and E-value of a CS or of ASs can be estimated by equations 1 and 2.
Overall performance of statistics
Recently, receiver-operating characteristic (ROC) curves have been used by Kolodny et al. (2005) and Gribskov and Robinson (1996) to evaluate the performance of different geometric measurements for scoring structural alignments. Using an automatic protein structure classification database, CATH (Orengo et al. 1997), as a reference, we plot the ROC curves for our structural alignment algorithm. In general, all three statistics (CSs, ASs, and the combination of the CS and the AS) largely agree with CATH. Among them, using CSs as the measurement of the statistical significance gives the best agreement with CATH, while using environment-based ASs gives the least agreement (Fig. 3).
|
|
Statistical vs. nonstatistical measurement
A coordinate-based measurement, ρ (Maiorov and Crippen 1995), was also derived after 3D superposition of the extended global structural alignment. The advantage of ρ is that it is size independent and can be viewed as a scaled root mean squared deviation (RMSD). A pair of structures with clear visual similarity will have ρ < 0.5. We randomly sampled 200 pairs of structures with different levels of similarities and calculated ρ values. The results are shown in Figure 5. As we can see, with a cutoff of 0.01, the combination of E-values can identify more homologs than the purely coordinate-based ρ.
|
ρ = 0.27 for 62 aligned residue pairs). CATH defines them as homologs. The size of the first clique is statistically significantly larger than unrelated structure pairs (E-value = 3.3 × 10⁻⁵). However, the environment-based score (E-value = 0.12) is not statistically significant, suggesting there are some differences in the environments.
|
ρ of the "flexible" superposition is 0.10. It can be shown that subsequent cliques can also be evaluated by similar statistics as the first clique (data not shown). Therefore, statistics were also applied to measure the statistical significance of the second clique, giving E-values of 1.7 × 10⁻⁹ and 1.7 × 10⁻¹⁵ for the CS and environment score, respectively.
|
|
| Discussion |
|---|
|
|
|---|
Many investigators have recognized that RMSD is not a good measurement of the quality of structural alignment. Many alternative measurements have been developed (Maiorov and Crippen 1995; Yang and Honig 2000; Kolodny et al. 2005). However, all of these measurements are coordinate-based. Therefore, given a structural alignment carried out on the domain level, there is no way to distinguish proteins with different oligomeric states because the intermolecular/interdomain interactions are irrelevant to these measurements. Another concern is that this coordinate-based approach has resulted in a misconception in the structural alignment area: The best structural alignment method should find the largest number of equivalent residue pairs as long as each pair of residues is close enough after being superimposed in 3D space. For alignments in conserved structural regions (i.e., the core regions), this approach is reasonable enough. But such an approach may produce matching errors between residues in loop regions.
With an environmental-based approach, we can guarantee the environmental compatibility between matched residues. This not only eliminates possible wrong alignments but also makes it possible to find some possible alignments in loop regions. We used a very stringent cutoff (1.5 Å) to define core regions (compared with a cutoff of 3.0 Å that CE used). The length of region alignments based on core regions may seem to be relatively small compared with other methods because we want to make sure that the alignment is correct with no misleading alignments. A much longer alignment is easily obtainable by using the extension procedures described in Materials and Methods.
Domain splitting is a hard problem, but rigid body superposition of unsplit domains can result in erroneous conclusions. For example, Gan et al. (2002) used 8fab:A versus 1dcl:B as a pair of proteins with pretty high sequence identity but low structural similarity. In fact, it is just an example of structural flexibility. Although domain splitting is an alternative to solve this problem, we have shown that the overall structural similarity can also be found by using our method. Instead of an example of discrepancy between sequence and structure similarities, it is an example of how structures and sequences are related. The high environmental-based AS (low E-values) suggests that the environments of the two structures are quite similar on the biological unit level.
In the fold recognition area, each fold is converted to a structural template, and a sequence is aligned to a library of structural profiles in order to find the most suitable structure for the sequence. In order to accommodate structural flexibilities, a structural template can be generated by a family of proteins (Shi et al. 2001; Tang et al. 2003). This calls for "accurate" structural alignment methods. Note the "accurate" here does not mean "superimposable" structural alignment, but a sequence-compatible structural alignment. Our environmental-based approach is what is needed because positions that tend to favor similar types of amino acids will tend to be aligned together.
| Materials and methods |
|---|
|
|
|---|
Preprocessing of biological unit files
Three SEDs were used in our work: bond strength of secondary structures (BSSS), relative solvent accessibility (RSA), and fraction buried by polar atoms (FBP). These three SEDs were calculated for each residue in a structure. We used the method described in DSSP (Kabsch and Sander 1983) to calculate the electrostatic energy of a hydrogen bond. In DSSP, helices and sheets are defined in terms of having a certain pair of hydrogen bonds, and a hard cutoff (1.0 kcal/mol) was used to define the existence of a hydrogen bond. If a residue has both of the interturn hydrogen bonds required in a helix, it is declared to be a helical residue; if it has both the required interstrand hydrogen bonds, it is declared to be a β-sheet residue. Here we use the strengths of the two pairs of hydrogen bonds to decide whether a residue is more helical or β-strand, but instead of using a fixed cutoff, we define BSSS as a continuous variable representing secondary structures by ranging over both positive and negative values. The absolute value of BSSS is equal to the magnitude of the weaker of the two hydrogen bond energies; residues in helices are given a negative sign, and residues in sheets are given a positive sign. Thus the more negative, the more likely a residue is in a helix; the more positive, the more likely the residue is in a sheet. Values close to zero correspond to the coil state.
Solvent-accessible surface area (SASA) (Lee and Richards 1971) of each atom was determined by placing 512 equally spaced sample points on the surface of its imaginary "solvent sphere," with a radius equal to the sum of the atom's van der Waals radius and the radius of a water molecule. If a point was in the solvent sphere of any other atom, it was defined as buried; otherwise, it was defined as solvent-accessible. SASA of an atom was then determined by (Nacc/512) × Areasp, where Nacc is the number of solvent-accessible points and Areasp is the surface area of the solvent sphere. The calculations of SASA were always performed in a biological unit context. SASA for a residue was the sum of the SASAs for each atom in the residue. RSA for a residue was defined as the ratio between SASA of the residue in a biological unit versus SASA for that residue X in the pentapeptide GGXGG.
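The point-counting procedure for SASA can be sketched as follows. This is a minimal illustration, not the authors' program: the golden-spiral placement of the 512 points, the 1.4 Å water radius, and the names `sphere_points` and `atom_sasa` are all assumptions made for the example.

```python
import math

WATER_RADIUS = 1.4  # Å; assumed probe radius, not stated in the text

def sphere_points(n=512):
    """Roughly evenly spaced unit vectors (golden-spiral layout, an assumption)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        r = math.sqrt(1.0 - z * z)
        phi = golden_angle * k
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points

def atom_sasa(index, atoms, n=512):
    """SASA (Å^2) of atoms[index]; atoms is a list of (x, y, z, vdw_radius)."""
    x, y, z, vdw = atoms[index]
    radius = vdw + WATER_RADIUS  # radius of this atom's solvent sphere
    n_acc = 0
    for ux, uy, uz in sphere_points(n):
        px, py, pz = x + radius * ux, y + radius * uy, z + radius * uz
        buried = False
        for j, (ox, oy, oz, ovdw) in enumerate(atoms):
            if j == index:
                continue
            other = ovdw + WATER_RADIUS
            # A point inside any other atom's solvent sphere counts as buried.
            if (px - ox) ** 2 + (py - oy) ** 2 + (pz - oz) ** 2 < other * other:
                buried = True
                break
        if not buried:
            n_acc += 1
    area_sp = 4.0 * math.pi * radius * radius
    return n_acc / n * area_sp
```

Residue SASA is then the sum of `atom_sasa` over the residue's atoms, evaluated with all atoms of the biological unit present in `atoms`, and RSA is that sum divided by the reference value for the same residue type in GGXGG.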
FBP for a residue was given by (Np/Ntotalb), where Np is the total number of sample points in the residue that were buried by polar atoms and Ntotalb is the total number of buried sample points for the residue.
3D1D table
First, we classified environment states based on each of the above SEDs. Five BSSS states, five RSA states, and three FBP states were defined, giving altogether 5 × 5 × 3 combinations of states. The five BSSS states were defined by (1) strong helix with BSSS < −2.067, (2) medium helix with −2.067 < BSSS < −0.181, (3) weak helix with −0.181 < BSSS < −0.031, (4) coil state with −0.031 < BSSS < 0, and (5) β strand with BSSS > 0. The five RSA states were defined by (1) thoroughly buried with RSA < 1.4%, (2) 1.4% < RSA < 12.5%, (3) 12.5% < RSA < 31.8%, (4) 31.8% < RSA < 54.5%, and (5) very exposed with RSA > 54.4%. Three FBP states were defined by (1) FBP < 26.7%, (2) 26.7% < FBP < 33.5%, and (3) FBP > 33.5%. For each SED the boundaries of the states were chosen so that equal numbers of residues from the 1632 proteins fell into each state.
The 3D1D table has 20 rows and 75 columns, where entry i, j represents the log-likelihood of residue type i being in environment state j, based on a survey of the 1632 proteins. Some bins contained very few hits for certain amino acids, so pseudo-counts were added to observations based on the background frequency of all residue types. The weight of pseudo-counts was controlled to be 10% of the total observations in each state bin. Based on the corrected counts, the log-likelihood-based 3D1D score for amino acid i in bin j was obtained as follows: 100 log {P(i, j)/[P(i) P(j)]}. (For the 75 × 20 3D1D table used in this work, see Supplemental Material.) Each 3D structure was translated into a structural profile by the 3D1D table. A structural profile is an N by 20 matrix, where N is the number of residues in a structure and each row contains 3D1D scores for the 20 residue types in the state corresponding to that position.
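The construction of the 3D1D table and of a structural profile can be sketched as below. Two points are assumptions of this sketch rather than statements of the text: the pseudo-counts in each bin total 10% of that bin's observations and are split according to background residue frequencies, and the logarithm is taken as the natural log. The function names are illustrative.

```python
import math

def build_3d1d_table(counts, pseudo_weight=0.10):
    """counts[i][j] = observed count of residue type i in environment bin j.

    Returns scores[i][j] = 100 * ln(P(i, j) / (P(i) * P(j))).
    Assumes every bin and every residue type has at least one observation.
    """
    n_res, n_bins = len(counts), len(counts[0])
    total = sum(sum(row) for row in counts)
    background = [sum(row) / total for row in counts]  # residue frequencies

    # Add pseudo-counts worth pseudo_weight of each bin's observations.
    corrected = [[0.0] * n_bins for _ in range(n_res)]
    for j in range(n_bins):
        bin_total = sum(counts[i][j] for i in range(n_res))
        for i in range(n_res):
            corrected[i][j] = counts[i][j] + pseudo_weight * bin_total * background[i]

    grand = sum(sum(row) for row in corrected)
    p_res = [sum(row) / grand for row in corrected]
    p_bin = [sum(corrected[i][j] for i in range(n_res)) / grand
             for j in range(n_bins)]
    return [[100.0 * math.log((corrected[i][j] / grand) / (p_res[i] * p_bin[j]))
             for j in range(n_bins)]
            for i in range(n_res)]

def structural_profile(states, table):
    """One row per residue position: the scores of all residue types
    in that position's environment bin (N x 20 for a full table)."""
    return [[table[i][j] for i in range(len(table))] for j in states]
```

With the real data, `counts` would be a 20 × 75 matrix and `states` the list of bin indices (0 to 74) assigned to the residues of one structure.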
Self-recognition test
For each structural profile, the native sequence fitness score s was defined as the sum of 3D1D scores for the native residues at each position. This fitness score was transformed into a Z-score: $Z = (s - \bar{s})/\sigma$. The mean $\bar{s}$ and standard deviation $\sigma$ of fitness scores for each protein were estimated by calculating the fitness scores for 200 random permutations of the native sequence. Thus the overall amino acid composition was held constant, but the permuted sequences had very low sequence identity to the native.
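The permutation-based Z-score can be illustrated with a short sketch. Sequences are represented here as lists of residue-type indices (0 to 19) and the profile as an N × 20 list of lists; these representations, the function names, the fixed seed, and the use of the sample standard deviation are assumptions of the example.

```python
import random
import statistics

def fitness(sequence, profile):
    """Sum of 3D1D scores of the given residues at each position."""
    return sum(profile[pos][res] for pos, res in enumerate(sequence))

def permutation_z_score(sequence, profile, n_perm=200, seed=0):
    """Z-score of the native fitness against shuffled versions of the
    same sequence, so amino acid composition is held constant."""
    rng = random.Random(seed)
    native = fitness(sequence, profile)
    shuffled = list(sequence)
    scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        scores.append(fitness(shuffled, profile))
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (assumption)
    return (native - mean) / sd
```

A native sequence that fits its own profile well should score far above its permutations, giving a large positive Z.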
Alternatively, in order to test the recognition of the correct fold by a single sequence, we noted that the 1632 proteins fell into 677 homologous families based on CATH version 2.5.1. In order to ensure that the set of alternative proteins was structurally diverse, we selected at random one representative from each family. The sequence of each representative was threaded onto all 677 structural templates by a Smith-Waterman dynamic programming algorithm with affine gap penalties (300 for opening a gap and 30 for extending a gap). To compensate for varying chain lengths, the resulting threading score was divided by the length of the 3D structural template that it was threaded to. The Z-score over adjusted threading scores was calculated as above, where s is the adjusted score for threading the native sequence onto the native structure, and the mean and standard deviations were estimated by the adjusted scores from threading that same sequence onto all 677 structural templates.
Core structural alignment
The alignment of core structures was carried out in the following three steps:
1. Determination of candidate aligned fragment pairs (AFPs). If two octapeptides from two structures were similar in terms of 3D local structures as well as compatible in environments, they were treated as a candidate pairing for structural alignment. The 3D local structure similarity was calculated by
![]() | (7) |
where Dij is the difference of pairing two peptides with length m = 8 in protein A (starting at piA and pjA) and two peptides in protein B (starting at piB and pjB). Dii is the difference of pairing a peptide in A and a peptide in B starting at piA and piB, respectively. See the CE algorithm (Shindyalov and Bourne 1998) for details. A pair of structurally similar octapeptides (i.e., AFP with m = 8) was defined to have Dii < 1.5 Å. The environmental compatibility for each residue pair was defined as the Pearson correlation coefficient, cor, of the two corresponding bins of states on the basis of 3D1D scores. The environment compatibility of two octapeptides (Eij) is equal to the sum of the compatibility scores for the eight pairs of residues:
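$$E_{ij} = \sum_{k=0}^{m-1} \mathrm{cor}\left(\mathbf{v}^{A}_{i+k}, \mathbf{v}^{B}_{j+k}\right) \qquad (8)$$
where $\mathbf{v}^{A}_{i}$ is a vector of 20 3D1D scores of residue types for the corresponding state bin of residue i in structure A. If Eij ≤ 0, we defined the two octapeptides to be environmentally incompatible.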
2. Extension of AFP-based alignment by maximal clique finding algorithm. Consider a mathematical graph having N nodes corresponding to the AFP candidates found in the previous step. Two nodes are joined by an undirected edge if the two AFPs i and j satisfy (1) Dij < 1.5 Å, (2) (piA − pjA) * (piB − pjB) > 0, and (3) |piA − pjA| ≥ m and |piB − pjB| ≥ m. In other words, the two AFPs are geometrically compatible, sequence order is maintained, and the octapeptides do not overlap in sequence in either protein. A maximal clique of such a graph corresponds to the largest subset of completely connected nodes, i.e., the most aligned residues involved in mutually geometrically compatible AFPs. Bron and Kerbosch's method (1973) was used to find the maximal clique.
3. Iterative clique finding. After a run of the maximal clique finding routine, we eliminated AFPs in the resulting clique from the connection matrix and reran the maximal clique finding procedure to find another clique. Iterations ended when the returned CS was <20, or the maximal number of allowed cliques (e.g., three) had been achieved.
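Steps 2 and 3 can be sketched as follows. The sketch reads condition 2 as an order-preservation test and condition 3 as non-overlap in both chains, uses a Bron-Kerbosch recursion with pivoting, and ranks cliques by number of AFP nodes rather than by number of aligned residues. The pairwise distance function `dist`, the `min_size` threshold in nodes, and all function names are hypothetical stand-ins.

```python
def build_graph(afps, dist, m=8, cutoff=1.5):
    """afps: list of (start_in_A, start_in_B); dist(i, j) returns D_ij in Å."""
    adj = {k: set() for k in range(len(afps))}
    for i in range(len(afps)):
        for j in range(i + 1, len(afps)):
            (ia, ib), (ja, jb) = afps[i], afps[j]
            if (dist(i, j) < cutoff
                    and (ia - ja) * (ib - jb) > 0              # same order
                    and abs(ia - ja) >= m and abs(ib - jb) >= m):  # no overlap
                adj[i].add(j)
                adj[j].add(i)
    return adj

def bron_kerbosch(adj, r, p, x, best):
    """Enumerate maximal cliques; keep the largest one seen in best[0]."""
    if not p and not x:
        if len(r) > len(best[0]):
            best[0] = set(r)
        return
    pivot = max(p | x, key=lambda v: len(adj[v] & p))
    for v in list(p - adj[pivot]):
        bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v], best)
        p = p - {v}
        x = x | {v}

def iterative_cliques(adj, min_size=3, max_cliques=3):
    """Find a largest clique, remove its nodes, and repeat."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    cliques = []
    while adj and len(cliques) < max_cliques:
        best = [set()]
        bron_kerbosch(adj, set(), set(adj), set(), best)
        clique = best[0]
        if len(clique) < min_size:
            break
        cliques.append(clique)
        for v in clique:
            del adj[v]
        for v in adj:
            adj[v] -= clique
    return cliques
```

In the method described above, iteration stops when the clique covers fewer than 20 residues; the node-count threshold used here is a simplification.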
The quality of a clique (i.e., a core region) was measured by two statistics: the CS (CS = number of structurally aligned residues) and the environment-based AS (AS = sum over all residue pairs in the clique of the environmental compatibility scores, as in equation 8).
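The environment-based scores can be sketched in a few lines, assuming each structure is available as a structural profile (one row of 20 3D1D scores per residue). The function names and the convention of returning 0 for a constant row are assumptions of this example.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length score vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sv = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if su == 0.0 or sv == 0.0:
        return 0.0  # assumption: a constant row carries no signal
    return cov / (su * sv)

def fragment_compatibility(profile_a, i, profile_b, j, m=8):
    """Sum of per-residue correlations for two fragments of length m."""
    return sum(pearson(profile_a[i + k], profile_b[j + k]) for k in range(m))

def alignment_score(profile_a, profile_b, pairs):
    """AS of a clique: sum over aligned residue pairs (a, b)."""
    return sum(pearson(profile_a[a], profile_b[b]) for a, b in pairs)
```

Two octapeptides with identical environments score 8, the maximum for m = 8, while a non-positive sum marks the pair as environmentally incompatible.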
EVD model
We adopted the method used by FATCAT (Ye and Godzik 2004) to generate unrelated structures: Given a chain length l, all protein chains longer than l were selected from the above 1632 chains. Then we randomly chose two proteins from different topologies and chopped them into two fragments containing l contiguous residues each at random starting points. Since it is unlikely that structures in different topologies are actually homologous (Sierk and Pearson 2004), this randomization gives us a pair of unrelated structures. For each chain length of 43, 55, 70, 90, 106, 148, 191, 245, and 314, up to 5000 pairs of unrelated structures were generated. The structural profile of each fragment was calculated in the context of the original biological units (see "Preprocessing of biological unit files"). Three hundred five structures were used in this training set. EVD parameters were estimated by the method of moments (Altschul and Erickson 1986).
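For a Gumbel-type EVD with scale λ and location µ, the mean is µ + γ/λ (γ being the Euler-Mascheroni constant) and the variance is π²/(6λ²), so a method-of-moments fit reduces to two lines. The sketch below assumes that parameterization and uses the sample standard deviation; the function name is illustrative.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_evd_moments(scores):
    """Method-of-moments estimates (lam, mu) for a Gumbel-type EVD.

    Uses mean = mu + gamma / lam and variance = pi^2 / (6 lam^2).
    """
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6.0))
    mu = mean - EULER_GAMMA / lam
    return lam, mu
```

Applied to the CS or AS values from the random pairs at one chain length, this gives one (λ, µ) pair per length, which can then be related to chain length.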
Large-scale test
The unbiased test set was generated by a method similar to the one described by Sierk and Pearson (2004). Among our 1632 protein chains selected above, we removed proteins with chain length >300 residues, resulting in 1456 structures. After removing structures used in the training set, as well as homologous superfamilies containing only one structure, we got 880 structures representing 101 topologies and 174 homologous families. We then selected out 44 "large" homologous families having more than five members per family and used the longest chain from each of these families as a query. The 44 queries covered 31 CATH topologies. The remaining 836 structures were used as our target library. The test was carried out by pairwise alignment of each of the 44 query structures to the 836 targets in the library. Altogether, 36,784 pairwise alignments were performed.
ROC curves were generated by calculating specificity and sensitivity at different cutoffs. E-values of CS and AS of a clique were calculated. The geometric mean of these two E-values was also used as the combination of size and AS (CS AS). The three E-values were used to evaluate the statistical significance of a structural alignment.
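In the ROC plot, specificity and sensitivity are defined as follows:
$$\text{specificity} = \frac{TN}{TN + FP} \qquad (9)$$
$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad (10)$$
where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives at a given cutoff, and positives and negatives in equations 9 and 10 were defined at the CATH homologous family level. This is more challenging than the topology level tests used in purely structural comparisons, and it is more appropriate because we also include environmental effects in AS.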
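A ROC curve of this kind can be tabulated with the sketch below, which assumes the usual definitions sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP) and calls a pair "positive" when its E-value is at or below the cutoff. The combined statistic is the geometric mean of the two E-values, as described above; the function names are illustrative.

```python
import math

def combined_e_value(e_cs, e_as):
    """Geometric mean of the CS and AS E-values."""
    return math.sqrt(e_cs * e_as)

def roc_points(e_values, is_homolog, cutoffs):
    """Return (cutoff, sensitivity, specificity) for each cutoff.

    e_values[k] is the E-value of alignment k; is_homolog[k] is True
    when the reference classification calls the pair homologous.
    """
    points = []
    for cutoff in cutoffs:
        tp = fp = tn = fn = 0
        for e, positive in zip(e_values, is_homolog):
            predicted = e <= cutoff
            if predicted and positive:
                tp += 1
            elif predicted and not positive:
                fp += 1
            elif positive:
                fn += 1
            else:
                tn += 1
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        points.append((cutoff, sensitivity, specificity))
    return points
```

Plotting sensitivity against 1 − specificity over a range of cutoffs gives the ROC curve for each of the three E-values.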
EPQ tests were performed as described in Sierk and Pearson (2004). Coverage was defined as the percentage of correct hits (i.e., CATH homologs) found at a given cutoff. EPQ was defined as the ratio of the total number of errors occurring at a given cutoff over the total number of queries (i.e., 44 in this work). Because CATH is fairly stringent in defining homologs, some structurally well-aligned structures may be classified into different homologous families. Sierk and Pearson (2004) suggested using topology classification to count errors. Therefore, we used two criteria to define errors: CATH non-homologs and CATH nontopologs (in Fig. 4, A and B, respectively).
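Coverage and EPQ at a single cutoff can be computed as in the following sketch. It takes coverage as the fraction of all true homolog pairs recovered at the cutoff, which is one reading of the definition above; the input layout and the function name are assumptions.

```python
def coverage_and_epq(hits, n_true_pairs, n_queries, cutoff):
    """hits: list of (e_value, is_correct) over all query-target pairs.

    Coverage = correct hits at the cutoff / all true homolog pairs.
    EPQ      = errors at the cutoff / number of queries.
    """
    correct = sum(1 for e, ok in hits if e <= cutoff and ok)
    errors = sum(1 for e, ok in hits if e <= cutoff and not ok)
    return correct / n_true_pairs, errors / n_queries
```

Switching the error criterion from non-homologs to nontopologs only changes how `is_correct` is assigned for each pair.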
Extension to global structural alignment
The global alignment was made by linking the fragments found in the maximal clique finding procedure. In gaps between fragments, an alignment path was generated by dynamic programming over the environmental compatibility scoring matrix. Rigid structure superposition was generated by superimposing residue pairs found in the first clique. When more than one clique was found, residue pairs were superimposed for each clique, and nonclique residues were rotated and translated corresponding to the operation of the nearest clique. The aligned residue pairs were used in the calculation of ρ (Maiorov and Crippen 1995). ρ is defined as
![]() | (11) |
where ai and bi are the coordinates (x,y,z) centered at the origin for the ith aligned residues from structure A and B, respectively. Among the 36,784 pairwise structural alignments, we randomly sampled 200 alignments with at least 60% of the residues covered by cliques. We eliminated structures with small CS because those structures cannot be similar, whether measured by statistical or nonstatistical methods. Among the 200 pairs, 52 pairs were from different CATH classes, 51 pairs were from the same class but different architectures, 38 were from the same architecture but different topologies, 18 were CATH topologs, and 41 were CATH homologs. After extensions to global alignments, ρ was calculated based on the above equation.
| Electronic supplemental material |
|---|
|
|
|---|
| Footnotes |
|---|
| Acknowledgments |
|---|
| References |
|---|
|
|
|---|
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.
Bowie, J.U., Luthy, R., and Eisenberg, D. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253: 164–170.
Bron, C. and Kerbosch, J. 1973. Finding all cliques of an undirected graph [H]. Commun. ACM. 16: 575–577.
Fischer, D., Elofsson, A., Rice, D., and Eisenberg, D. 1996. Assessing the performance of fold recognition methods by means of a comprehensive benchmark. Pac. Symp. Biocomput. 1996: 300–318.
Gan, H.H., Perlow, R.A., Roy, S., Ko, J., Wu, M., Huang, J., Yan, S., Nicoletta, A., Vafai, J., Sun, D., et al. 2002. Analysis of protein sequence/structure similarity relationships. Biophys. J. 83: 2781–2791.
Gribskov, M. and Robinson, N.L. 1996. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem. 20: 25–33.
Holm, L. and Sander, C. 1993. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233: 123–138.
Jaroszewski, L., Li, W., and Godzik, A. 2002. In search for more accurate alignments in the twilight zone. Protein Sci. 11: 1702–1713.
Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
Karlin, S. and Altschul, S.F. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. 87: 2264–2268.
Kolodny, R., Koehl, P., and Levitt, M. 2005. Comprehensive evaluation of protein structure alignment methods: Scoring by geometric measures. J. Mol. Biol. 346: 1173–1188.
Koradi, R., Billeter, M., and Wüthrich, K. 1996. MOLMOL: A program for display and analysis of macromolecular structures. J. Mol. Graph. 14: 51–55.
Kuhl, F.S., Crippen, G.M., and Friesen, D.K. 1984. A combinatorial algorithm for calculating ligand-binding. J. Comp. Chem. 5: 24–34.
Lee, B. and Richards, F.M. 1971. The interpretation of protein structures: Estimation of static accessibility. J. Mol. Biol. 55: 379–400.
Levitt, M. and Gerstein, M. 1998. A unified statistical framework for sequence comparison and structure comparison. Proc. Natl. Acad. Sci. 95: 5913–5920.
Maiorov, V.N. and Crippen, G.M. 1995. Size-independent comparison of protein three-dimensional structures. Proteins 22: 273–283.
Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443–453.
Norvell, J.C. and Machalek, A.Z. 2000. Structural genomics programs at the US National Institute of General Medical Sciences. Nat. Struct. Biol. S7: 931.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATH: A hierarchic classification of protein domain structures. Structure 5: 1093–1108.
Rice, D.W. and Eisenberg, D. 1997. A 3D1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J. Mol. Biol. 267: 1026–1038.
Šali, A. and Blundell, T.L. 1990. Definition of general topological equivalence in protein structures: A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol. 212: 403–428.
Shapiro, J. and Brutlag, D. 2004. FoldMiner: Structural motif discovery using an improved superposition algorithm. Protein Sci. 13: 278–294.
Shi, J., Blundell, T.L., and Mizuguchi, K. 2001. FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310: 243–257.
Shindyalov, I.N. and Bourne, P.E. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11: 739–747.
Sierk, M.L. and Pearson, W.R. 2004. Sensitivity and selectivity in protein structure comparison. Protein Sci. 13: 773–785.
Suyama, M., Matsuo, Y., and Nishikawa, K. 1997. Comparison of protein structures using 3D profile alignment. J. Mol. Evol. 44S: 163–173.
Tang, C.L., Xie, L., Koh, I.Y., Posy, S., Alexov, E., and Honig, B. 2003. On the role of structural information in remote homology detection and sequence alignment: New methods using hybrid sequence profiles. J. Mol. Biol. 334: 1043–1062.
Taylor, W.R. 1999. Protein structure comparison using iterated double dynamic programming. Protein Sci. 8: 654–665.
Yang, A.S. and Honig, B. 2000. An integrated approach to the analysis and modeling of protein sequences and structures. III. A comparative study of sequence conservation in protein structural families using multiple structural alignments. J. Mol. Biol. 301: 691–711.
Ye, Y. and Godzik, A. 2004. Database searching by flexible protein structure alignment. Protein Sci. 13: 1841–1850.