|
|
||||||||
Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California 94143, USA
Reprint requests to: Marc A. Marti-Renom, Mission Bay Genentech Hall, University of California, San Francisco, San Francisco, CA 94143, USA; e-mail: marcius{at}salilab.org; fax: (415) 514-4231.
(RECEIVED August 18, 2003; FINAL REVISION December 19, 2003; ACCEPTED January 9, 2004)
| Abstract |
|---|
|
|
|---|
Keywords: protein sequence alignment; sequence profiles; comparative protein structure modeling
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.03379804
| Introduction |
|---|
|
|
|---|
One of the most significant improvements in alignment accuracy was achieved through the use of multiple sequence alignments and the corresponding sequence profiles (Gribskov et al. 1987, 1990; Gribskov 1994). For proteins, a sequence profile lists a preference for the 20 standard amino acid residue types at each position in a given multiple sequence alignment. The PSI-BLAST program relies on the BLAST algorithm (Altschul et al. 1990) to collect homologs of a query sequence and construct its profile by iteratively scanning a sequence database (Altschul et al. 1997; Park et al. 1998); a basic step in this calculation is a comparison of the query sequence profile with each sequence in the database. A multiple sequence alignment can also be transformed into a Hidden Markov Model (HMM), a class of probabilistic models that are generally applicable to a time series or linear sequences (Eddy 1998). A particularly successful method in this class is implemented in the SAM package (Hughey and Krogh 1996) that outperforms other sequence-based methods for fold recognition (Karplus et al. 1998; Park et al. 1998; Madera and Gough 2002). The SATCHMO algorithm in the LOBSTER package simultaneously constructs a similarity tree and compares multiple sequence alignments of each internal node of the tree using HMMs (Edgar and Sjolander 2003a, b). The CLUSTALW program compares two multiple sequence alignments by scoring an alignment of two positions, one from each profile, as the average of all pairwise substitution scores for the amino acid residues in the two profiles (Higgins and Sharp 1988; Thompson et al. 1994). The LAMA program aligns two multiple sequence alignments by first transforming them into profiles and then comparing the two to each other by the Pearson correlation coefficient (Pietrokovski 1996). Similarly, the FFAS program was developed to align two sequence profiles with each other (Jaroszewski et al. 2000; Rychlewski et al. 2000). A related approach was also used by Yona and Levitt (2000) Yona and Levitt (2002) to construct the ProtoMap database of protein sequence families (Yona et al. 1999, 2000; Yona and Levitt 2002). Al-Lazikani and co-workers (2001) combined multiple structure and sequence comparisons to improve the accuracy of alignments of SH2 domains. Most recently, the COMPASS program was developed to locally align two multiple sequence alignments with assessment of statistical significance (Sadreyev and Grishin 2003). These methods compare two profiles by constructing a matrix of scores for matching every position in one profile to each position in the other profile, followed by either local or global dynamic programming to calculate the optimal alignment. It was noted previously that profileprofile alignment methods are capable of detecting more remote relationships compared to the sequence-profile methods, such as PSI-BLAST (Rychlewski et al. 2000; Panchenko 2003; Sadreyev and Grishin 2003).
Another significant improvement of the alignment accuracy in the low similarity range was achieved by considering protein structure information for one of the sequences in a pairwise comparison. The methods in this class include threading and 3D template matching (Bowie et al. 1991; Godzik and Skolnick 1992; Jones et al. 1992; Kelley et al. 2000; Shi et al. 2001; Fischer 2003. For review, see Jones 1997; Levitt 1997; Smith et al. 1997; Torda 1997; David et al. 2000).
Yet another approach is implemented in the SEA program, which aligns a pair of remotely related sequences by optimizing a match between the predicted conformations of their short segments (Ye et al. 2003). The resulting alignments were more accurate than the pairwise sequence alignments by BLAST (Altschul et al. 1990) and ALIGN (Myers and Miller 1988), as well as the profileprofile alignments by FFAS (Jaroszewski et al. 2000; Rychlewski et al. 2000).
For closely related protein sequence pairs, with sequence identity over 40%, an accurate alignment is almost always trivial to obtain. In contrast, despite the methodological advances listed above, alignments in the so-called "twilight zone" of less than 30% sequence identity still contain many errors (Rost 1999). Unfortunately, most sequences share less than 30% sequence identity to a known structure (Sanchez and Sali 1998; Pieper et al. 2002). On average, pairwise alignments between sequences at 30% sequence identity have ~20% of the residues aligned incorrectly (Johnson et al. 1993). Some pairs of related proteins have almost no correctly aligned positions when aligned by sequence-based alignments methods (Venclovas et al. 2001).
Alignment accuracy in the twilight zone is crucial for several applications, including comparative protein structure prediction (Blundell et al. 1987; Marti-Renom et al. 2000; Baker and Sali 2001). To calculate an accurate comparative model, it is necessary to identify and correctly align at least one template structure to the target sequence. An incorrect alignment invariably leads to an inaccurate model, because none of the existing comparative model building methods can generally recover from an incorrect alignment (Marti-Renom et al. 2000). A number of studies assessed both the sensitivity of alignment methods in the detection of remote homologs (Park et al. 1997; Muller et al. 1999; Kelley et al. 2000; Rychlewski et al. 2000; Yona and Levitt 2000; Panchenko 2003; Sadreyev and Grishin 2003) and alignment accuracy (Jaroszewski et al. 2000; Sauder et al. 2000; Blake and Cohen 2001; Panchenko 2003; Sadreyev and Grishin 2003), as well as optimized the alignment accuracy for protein structure prediction (Jaroszewski et al. 2000; John and Sali 2003).
In this study, we optimized alignments specifically for comparative protein structure prediction. We begin by describing 13 profileprofile alignment protocols, the training and testing alignment sets, and measures of alignment accuracy (Materials and Methods). Next, we benchmark the alignment accuracy of our profileprofile alignment protocols relative to representative sequence-sequence, profile-sequence, and other profileprofile alignment methods (Results). Finally, we discuss our improvements in sequence alignment from the point of view of comparative protein structure modeling (Discussion).
| Materials and methods |
|---|
|
|
|---|
Multiple sequence alignment
For each sequence in a pair of sequences to be aligned, a multiple sequence alignment with its homologs was prepared by scanning the nonredundant protein sequence database at NCBI (June 2002) with the program PSI-BLAST, version 2.11 (Altschul et al. 1997). The scanning was performed without filtering out compositionally biased segments, was run for up to 20 iterations, and included all matches with an e-value smaller than 0.0005. Up to 1000 sequences with the most significant e-values were retained in the multiple sequence alignment. The default values were used for all other parameters. The multiple sequence alignment and the profile were saved after each iteration. The PSI-BLAST multiple sequence alignment of a sequence was defined to be the sequence-profile alignment with the most significant e-value from any of the iterations.
Sequence weighting
Sequence weighting is part of the calculation of a sequence profile from a multiple sequence alignment, and is used to compensate for nonuniform distribution of the homologs in the alignment. We applied two different weighting schemes.
First, we tested the often used position-based sequence weighting (Henikoff and Henikoff 1994) that assigns low weights to overrepresented sequences and high weights to unique sequences:
![]() | (1) |
where ri is the number of different residue types at position i and ni,j is the frequency of the residue type in sequence j at position i.
Second, we also tested our variation of the position-based sequence weighting that increases the weights of those sequences that are more similar to the query sequence:
![]() | (2) |
where Oa(i,1),b(i,j) are the Blosum62 odds ratios for matching the residue type a in the query sequence with the residue type b in sequence j, defined as "qij/eij" in the original paper (Henikoff and Henikoff 1992).
Sequence profile
A sequence profile of a given set of similar sequences specifies a preference for each of the 20 standard amino acid residue types at each of the residue positions in the set. A number of different estimation schemes have been suggested, because a multiple alignment may not contain a sufficiently large number of homologs to calculate a statistically robust profile solely from the occurrence of each residue type in the multiple alignment. They generally depend on prior or expected probabilities of residue occurrences and/or residue-residue substitutions (Henikoff and Henikoff 1996). We tested three different profile-building methods.
First, profiles generated by pseudo-counting (Henikoff and Henikoff 1996) as implemented in the PSI-BLAST program (Altschul et al. 1997): the use of pseudo-counting for profile generation was chosen for its simplicity of implementation and comparable performance to other tested approaches (Henikoff and Henikoff 1996).
Second, profiles generated by pseudo-counting (Henikoff and Henikoff 1996) as implemented by us in the MODELLER-7 program: the probability of a residue type a to occur at position i in a multiple alignment is estimated by:
![]() | (3) |
![]() | (4) |
![]() | (5) |
Ni is the sum of the weights Wj(1) (eq. 1) for the sequences that do not have a gap at position i. ni,a is the sum of the weights Wj(1) for the sequences with residue type a at position i. Bi is the total number of pseudo-counts at position i and depends on the parameter m that is set to the optimal value of 5 (Henikoff and Henikoff 1996). bi,a is the number of pseudo-counts for residue type a at position i. Ma is the probability of residue type a in the background distribution that is obtained from the Blosum62 matrix. Ma,b are the Blosum62 probabilities (Henikoff and Henikoff 1992) for matching the residue type a in the query sequence with the residue type b in sequence j. Both ni,a/Ni and bi,a/Bi are estimates of Pi,a, based on the observed and pseudo-counts, respectively. Correspondingly, Pi,a is a weighted sum of the two estimates, with the contributions determined by Ni and Bi. If Ni is larger than Bi, Pi,a is dominated by the observed counts, whereas if Bi is larger than Ni, Pi,a is dominated by pseudo-counts.
Third, our variation of the Henikoff and Henikoff schema with sequences weighted proportionally to their similarity to the query sequence, using Wj(2) (eq. 2) instead of Wj(1) (eq. 1).
Profileprofile substitution scores
Ideally, an optimal alignment of two profiles P and Q would be obtained by relying on a matrix of probabilities Si,j that any pair of profile positions Pi and Qj are "equivalent." It is not clear what the best definition of "equivalent" is and how to calculate such a probability of equivalence, given two profile distributions Pi and Qj. As a result, we are forced into a "parametric" approach, whereby we calculate a "substitution score" that approximates the probability of equivalence. Such substitution scores, together with a gap penalty function, can then be used to obtain an optimal alignment of two profiles by dynamic programming. Six recipes for calculating profileprofile substitution scores Si,j for each pair of profile positions i and j were tested.
First, the dot product between two distributions Pi and Qi at profile positions i and j, respectively:
![]() | (6) |
Second, the correlation coefficient between two distributions Pi and Qj:
![]() | (7) |
Third, the Euclidean distance between two distributions Pi and Qi:
![]() | (8) |
Fourth, a substitution score based on the Jensen-Shannon divergence measure DJS for two distributions (Lin 1991; Yona and Levitt 2000):
![]() | (9) |
![]() | (10) |
![]() | (11) |
The R vector can be seen as the most likely parent distribution of Pi and Qj. DKL is the Kullback-Leibler distance, also called the "cross-entropy measure" in information theory.
is a parameter between 0 and 1, set to 0.5 in this study.
and its complement (1-
) are the weights given to the Pi and Qj distributions, respectively. The Jensen-Shannon divergence, though not being a true metric, is bound by 0 and 1. It is 0 when the two compared distributions are identical and 1 when they are not related at all.
Fifth, for each position in a multiple sequence alignment, a pairwise residue substitution probability matrix was calculated as a weighted sum of the Blosum62 substitution probability matrix and the matrix of relative residue substitution frequencies observed at the given position in the multiple sequence alignment. Next, the substitution score for two multiple alignment positions i and j was calculated by averaging over these residue substitution probabilities for all pairs of residues containing a residue from each of the two compared positions:
![]() | (12) |
![]() | (13) |
![]() | (14) |
![]() | (15) |
where fa(i) is the observed frequency of residue type a at position i in the first multiple alignment corrected for sequence weights as defined above (using equation 1), Ma,b(i) is the substitution probability matrix for residue types a and b at position i in the first multiple alignment, Ma,b is the Blosum62 substitution probability matrix for residue types a and b, and
1 and
2 are scalar weights. Variable n is the number of the pairwise residue-residue substitutions within the multiple alignment at position i, and
is a smoothing parameter (set to 0.1 by optimization of the alignment accuracy on a learning set of alignments).
Sixth, the score Si,j(6) was defined as the correlation coefficient between the corresponding values in two posterior substitution matrices Ma,b (i) and Mb,a (j) for positions i and j in the first and second multiple alignments, respectively.
After the substitution scores were computed according to one of the six recipes above, they were scaled to fit the range from 0 to 1000.
Alignment methods
The testing pairs of sequences were aligned by (1) heuristic pairwise sequence alignment as implemented in BLAST 2.1.2 (Altschul et al. 1990), (2) pairwise sequence alignment by global dynamic programming with an affine gap penalty function as implemented in the ALIGN command of MODELLER-7 (Sali et al. 2001), (3) sequence-profile alignment as implemented by PSI-BLAST 2.1.2 (Altschul et al. 1997), (4) Hidden Markov Model (HMM) as implemented in SAM 3.3.1 (Hughey and Krogh 1996) and LOBSTER (Edgar and Sjolander 2003a), (5) pairwise sequence alignment based on matching predicted local structure as implemented in the SEA Web server (Ye et al. 2003), (6) multiple sequence alignment by CLUSTALW 1.81 (Thompson et al. 1994), (7) profileprofile alignments as implemented by COMPASS 1.24 (Sadreyev and Grishin 2003), and (8) the 13 schemes of profileprofile alignment by global dynamic programming with an affine gap penalty function as implemented by the SALIGN command of MODELLER-7.
For BLAST, a high e-value threshold of 100 was used for accepting an alignment between two sequences. Otherwise, the pair of sequences was ignored. Here, we focus on the alignment accuracy rather the accuracy of the methods to detect relationships. Therefore, we increased the e-value threshold relative to the commonly used value of ~10-4 to produce the maximum number of pairwise alignments obtained from the BLAST program. All other parameters were kept at their default values. Only four pairs of sequences did not have any fragments that could be aligned by this method.
For ALIGN, the default parameters were used. They include the AS1 residue type similarity matrix calculated from the reference structure alignments (file "as1.mat" in the MODELLER distribution; Overington et al. 1992), the initiation gap penalty u of -450, and the extension gap penalty v of -50; the penalty for a gap of n residue positions is u + v n.
For PSI-BLAST, multiple sequence alignments of each one of the two sequences were calculated as described above. The sequence-profile alignment with the most significant e-value from any of the iterations with either of the two sequences as queries was used as the PSI-BLAST alignment. Only two pairs of sequences did not have any fragments that could be aligned by this method.
For SAM, the following protocol was used (R. Karchin, pers. comm.). The w0.5 script with default parameters was applied to build HMMs for the target and template sequences, using their PSI-BLAST multiple sequence alignments. Next, the program hmmscore in the SAM package (sw = 0; select_align = 8; adpstyle = 5) was employed to align the HMM of the target and the template with the template and the target sequences, respectively, resulting in two generally different template-target alignments. The alignment with the most significant e-value as reported by the hmmscore program was selected.
For LOBSTER, the COACH algorithm was used through the -coach option to align a multiple sequence alignment against a Hidden Markov Model. First, the program was used to build HMMs for the target and the template sequences, using their PSI-BLAST multiple sequence alignments. Next, we aligned the HMMs of the target and the template to the template and target sequences, respectively, resulting in two generally different template-target alignments. The alignment with the higher bit score as reported by LOBSTER was selected.
For SEA, the Web server at http://ffas.ljcrf.edu/sea/ was used with the default parameters: the FRAGlib library (http://ffas.ljcrf.edu/frag/) for extracting structural fragments with a cutoff of -1.5, local alignment with the initiation gap penalty u of -5 and the extension gap penalty v of -1, and the weight for local similarity of 0.5.
For CLUSTALW, the profile alignment option (i.e., number 3 in the main CLUSTALW menu) with the default parameters was used (Thompson et al. 1994). We used this option over the multiple sequence alignment option (i.e., number 2 in the main CLUSTALW menu) to benchmark CLUSTALW using the same profiles as for the other tested programs.
For COMPASS, the default parameters were used to align the target and template multiple sequence alignments (Sadreyev and Grishin 2003).
For SALIGN, the 13 different protocols were tested, combining three different ways to construct a profile with four different ways to score a match between two profile positions, as well as two protocols based on posterior substitution probability matrices, as described above (Table 1
). The PSI-BLAST profiles cannot be used with the Jensen-Shannon scheme for calculating the profileprofile substitution scores because this scheme relies on probabilities Pi and Qj that are not reported in the PSI-BLAST output.
|
Training and testing alignment sets
Because our aim is to improve the accuracy of comparative protein structure modeling, the reference alignments were pairwise, structure-based alignments. They were extracted from our comprehensive database of pairwise structure-based alignments, DBAli (Marti-Renom et al. 2001). The alignments in DBAli were calculated by superposing all pairs of proteins of known structure in the Protein Data Bank (PDB, Feb. 1999; Berman et al. 2002) that are classified into the same H class in the CATH database (Orengo et al. 1999), using the program CE (Shindyalov and Bourne 1998). There are 33,920 such alignments with a Z-score higher than 3.8 (Shindyalov and Bourne 1998). They cover the entire spectrum of sequence and structure similarities.
First, 387 alignments were extracted from DBAli by requiring up to 40% sequence identity, at least 100 aligned residues, at least 50% of the residues aligned, and that at least 90% of the residues of one chain are covered in the alignment. Second, structure pairs that did not have at least 50% of the residues in the shorter chain aligned by MAMMOTH (Ortiz et al. 2002) were also eliminated, resulting in the final set of 300 reference alignments. These 300 alignments were randomly divided into the training and testing sets of 100 and 200 alignments, respectively. The training set of alignments was used to optimize the gap initiation and gap extension penalties for all of our alignment protocols and the parameter
for the two posterior substitution probability matrix protocols, and the testing set was used to assess the performance of all examined alignment methods. The PDB chain identifiers, chain lengths, percentage sequence identities, root-mean-square deviations (RMSDs) for the aligned C
atoms, average percentages of the aligned C
atoms, and percentages of structurally equivalent residues (below) are listed separately for the training and testing alignments in Supplementary Table 1
(http://salilab.org/suppmat/suppmat.shtml). Distributions of these features for all 300 alignments are shown in Figure 1
, indicating that the selected structure pairs indeed represent difficult alignment cases, with an average pair sharing only 20% sequence identity and 65% of structurally equivalent C
atoms superposed with an RMSD of 3.5 Å.
|
atoms in the two structures was calculated upon rigid-body least-squares superposition of all the C
atoms, as implemented in the SUPERPOSE command of MODELLER (Sali et al. 2001).
Second, the percentage of structurally equivalent positions was defined as the percentage of the C
atoms in the shorter of the sequences that are within a certain cutoff (e.g., 1, 2, 3, 4, and 5 Å, and their average) of the corresponding atoms in the superposed structure ("structure overlap"). Unless indicated otherwise, the structure overlap quoted is the average over all cutoffs.
Additionally, the alignment methods were assessed by the percentage of alignments with the structure overlap higher than 30% ("success rate"); structure pairs with at least as much overlap have the same fold (Abagyan and Batalov 1997).
In addition to the assessment of structure similarity implied by an alignment, we evaluated the accuracy of the alignment through a comparison with the CE structure-based alignment. First, the fraction of correctly aligned positions was defined as the percentage of positions in the tested alignment that were identical to those in the CE structure-based alignment ("CE overlap"); the residue-gap matches are ignored in this calculation.
Secondly, the shift score, which ranges from -e for two completely different alignments to 1 for identical alignments, was also calculated (Cline and Karplus 1998). We used e = 0.2, as suggested (Cline and Karplus 1998). The shift score incorporates both coverage and error.
Optimization of gap penalties and 
The optimal gap initiation and extension penalties for the 11 profileprofile alignment protocols were identified by maximizing the average percentage of correctly aligned positions for the training set of sequence pairs. The maximization scanned all combinations of the initiation penalties from -1000 to 0 in steps of 50 and the extension penalties from -200 to 0 in steps of 10. The gap initiation, gap extension, and the
parameters for the two posterior substitution probability matrix protocols were optimized on a 3D grid, with
ranging from 0.001 to 10 (0.001, 0.01, 0.1, 1, and 10). The optimal parameters for each alignment protocol are listed in Table 1
.
Significance of an observed difference in the alignment accuracy
A statistical analysis of the differences between alignment accuracies of various methods was performed. For this analysis, the alignment accuracy of a method was measured independently by the average shift score and CE overlap, both calculated for the 200 testing pairs of sequences. The significance of the differences was computed using Students t-test statistics (Marti-Renom et al. 2002).
| Results |
|---|
|
|
|---|
parameter were optimized separately for each one of the 13 protocols, by relying on the 100 training alignments. To assess SALIGN and a variety of other alignment methods, we used the 200 reference structure-based alignments. First, we assessed the differences in accuracy between the 13 different SALIGN protocols. Next, we compared two of the SALIGN protocols for profileprofile alignment by global dynamic programming to a heuristic pairwise sequence alignment (BLAST), a pairwise sequence alignment by global dynamic programming (ALIGN), a heuristic sequence-profile alignment (PSI-BLAST), two HMM methods (as implemented in SAM and LOBSTER), a pairwise sequence alignment by matching predicted local structures (SEA), and two profileprofile alignment methods (CLUSTALW and COMPASS). Finally, to illustrate the utility of our method, we describe two examples of comparative protein structure modeling that benefit from profileprofile alignment.
SALIGN protocols
The SALIGN protocols that on average aligned most positions correctly were DPHH, CCHH, and CCHS, (c.f. Table 1
) with the average CE overlaps of 56.4%, 56.3%, and 55.5%, respectively (Table 2
). These three protocols are marginally superior to the other 10 SALIGN protocols, but the differences are significant only for CCHH, and DPHH, (Fig. 2A
). When the shift score is used to assess the alignment accuracy, we cannot differentiate in accuracy between CCHS, CCHH, CCPBP, DPHS, and DPPBP at the confidence level of 95% (Fig. 2B
). For assessment of SALIGN relative to other alignment methods, we have chosen the protocols CCHH and CCHS based on their marginal superiority over the other protocols according to both accuracy measures as well as previous studies that reported good performance of the correlation coefficient (Pietrokovski 1996; Rychlewski et al. 2000; Edgar and Sjolander 2003a; Panchenko 2003; Sadreyev and Grishin 2003).
|
|
The accuracy of a profileprofile alignment method must depend on the accuracy of the input multiple sequence alignments. In general, a multiple sequence alignment prepared by PSI-BLAST depends significantly on the number of iterations and the e-value cutoff for inclusion of a sequence in the alignment. As described in Materials and Methods, our protocol for constructing a multiple sequence alignment already selects automatically the iteration that results into the highest statistical significance of the alignment between the profile and the other sequence. This protocol is based on the empirical observation that accuracy of the alignments correlates with their statistical significance, albeit weakly in the low-similarity range (Altschul et al. 1997; Brenner et al. 1998; Sauder et al. 2000). In addition, we tested the impact of the e-value cutoff used for the construction of PSI-BLAST multiple sequence alignments on the accuracy of the resulting SALIGN alignments. There is no dependence of the SALIGN alignment accuracy on this e-value cutoff, in the range from 10-20 to 10-4 (e.g., the average structure overlap varies from 35% to 36%).
Assessment of SALIGN relative to other alignment programs
For SAM, LOBSTER, ALIGN, CLUSTALW, and SALIGN, the alignment covers 100% of the residues in the input sequences. The average coverage by the SEA server is 97.2%; only five alignments (2.5% of all alignments in the test set) have less than 90% coverage. Therefore, the coverage of SEA is comparable to that of SAM, LOBSTER, ALIGN, CLUSTALW, and SALIGN. However, the coverage of BLAST, PSI-BLAST, and COMPASS alignments is smaller than any of the other methods (Fig. 3
). For BLAST, only 24 alignments (12% of the testing set) cover over 75% of both chains, 70 for PSI-BLAST (35%), and 108 for COMPASS (54%). BLAST could not find any significant hits for four alignments, whereas PSI-BLAST could not find a hit for two of the 200 test alignments.
|
|
|
RMSD of 7.8 Å (Table 3
RMSD upon superimposition by PSI-BLAST, BLAST, and COMPASS alignments of 6.5, 5.6, and 4.8 Å, respectively, is lower than that of SALIGN. However, this difference is a consequence of a much smaller number of aligned residues (Fig. 3
RMSD of SALIGN, PSI-BLAST, SEA, and SAM are comparable, SALIGN has much higher coverage. Therefore, it is more useful for comparative protein structure modeling because it allows modeling of a larger fraction of a target sequence without sacrificing the RMSD accuracy of a model relative to the other tested alignment methods.
For the sequence-sequence alignment methods (ALIGN and BLAST), the alignment success rate (i.e., the fraction of the test alignments with at least 30% structure overlap) averaged over all superposition cutoffs is ~30%; for sequence-profile methods (PSI-BLAST, SAM, and LOBSTER), it is ~40%; for the sequence-structure method (SEA), it is ~45%; for the COMPASS profileprofile method, it is 49%; for SALIGN, it is ~53% (Table 4
). CLUSTALW, a profileprofile method that aligns two consensus sequences based on multiple sequence alignment, has the alignment success rate of ~36%. For the 5 Å cutoff alone, the SALIGN CCHH and CCHS protocols also have a higher alignment success rate than any other tested method, aligning more than 80% of the test alignments with at least 30% structure overlap (Table 4
). This performance is 13% higher than that of SEA and ~20% higher than that of SAM, LOBSTER, COMPASS, and PSI-BLAST. Only COMPASSs success rate for the higher-resolution cutoffs (i.e., 1, 2, and 3 Å) was similar to that of SALIGN, indicating that the local alignments by COMPASS are more accurate than those by PSI-BLAST and BLAST. The two SALIGN protocols have a higher alignment success rate than any of the other tested methods over the whole range of structural overlap cutoffs (Fig. 5
). Even pairs of sequences with as little as 4% sequence identity can sometimes be aligned reasonably well by the CCHH and CCHS protocols.
|
|
|
Examples
To illustrate SALIGN, we aligned and modeled two target sequences from the fourth Critical Assessment of Techniques for Protein Structure Prediction (CASP) meeting (Moult et al. 2001): an enolase enzyme from E. coli (target T0111) and a hypothetical protein from H. influenzae (target T0092). T0111 shares 45% sequence identity to the template structure with the PDB code of 5enl
[PDB]
, and T0092 res only 8% sequence identity with the most similar template 1d2c
[PDB]
:A.
The model for the easy target T0111, based on the SALIGN alignment and the default MODELLER model building routine model (Sali and Blundell 1993), is similar in accuracy to the best model presented at the CASP4 meeting (Fig. 6A
). For this example, PSI-BLAST generated an alignment that led to a slightly better model in terms of RMSD, but with a smaller number of correctly modeled residue positions (i.e., C
atoms within 3 Å of their correct positions).
|
| Discussion |
|---|
|
|
|---|
We focused on the utility of sequence profiles to enhance the coverage and accuracy of sequence alignment for comparative protein structure modeling (Sanchez and Sali 1997; Marti-Renom et al. 2000; Pieper et al. 2002). We expected that the conservation and variation of residue types at a given position in a family alignment would allow us to score more accurately whether or not two given positions should be aligned than by comparing only two residue types (i.e., sequence-sequence alignment), or even a residue type and a distribution of residue types (i.e., sequence-profile alignment). We combined several schemes to generate profiles with several recipes to compare the profile positions with each other, resulting in 13 different protocols. The corresponding substitution scoring matrices were used in a global dynamic programming procedure with an affine gap penalty function to create optimal alignments. The protocols were evaluated with the aid of a testing set of alignments and subsequently compared to other existing and widely used alignment programs. This comparison was performed with a view of using the alignments for comparative protein structure modeling.
Comparative modeling is limited by the accuracy and extent of the alignment between the modeled sequence and the template structure(s) (Marti-Renom et al. 2000). Two fundamentally distinct features that cannot trivially be combined describe the quality of a model: (1) the fraction of the protein sequence that is modeled (i.e., coverage) and (2) the accuracy of the modeled region. Generally, the smaller the fraction of the target modeled, the more accurate the model. For example, the accuracy of a model can be increased at the expense of coverage by retaining only the core of the fold and eliminating loops and termini from the model.
A case in point are the local alignment methods, such as BLAST (Altschul et al. 1990), PSI-BLAST (Altschul et al. 1997), and COMPASS (Sadreyev and Grishin 2003). These algorithms generally do not align whole sequences, but only regions that are quite similar to each other. In contrast, global dynamic programming implemented in SALIGN ensures an optimal alignment that is forced to cover whole sequences. In the testing set of 200 pairwise alignments, PSI-BLAST covered less than 75% of the chain residues for 70 alignments and had 60 alignments under 50% coverage (Fig. 3
). In contrast, CLUSTALW (Thompson et al. 1994), ALIGN (Sali and Blundell 1993), SAM (Hughey and Krogh 1996), LOBSTER (Edgar and Sjolander 2003a), and SALIGN always covered 100% of the sequences, whereas SEA (Ye et al. 2003) and COMPASS covered on the average 97% and 64% of the sequences, respectively. Despite larger coverage, SALIGN still outperformed COMPASS, PSI-BLAST, and BLAST in the accuracy of what was covered. The Students t-test statistics show that the differences observed between SALIGN and the other tested alignment methods are significant at the confidence level of 95%.
We assessed the 13 different protocols of SALIGN. The two marginally best protocols used the correlation coefficient to compare two profile positions described by (1) the Henikoff-Henikoff scheme (Henikoff and Henikoff 1992, 1994) and (2) its variation that weighs sequences proportionally to their similarity with the target sequence. Most of the differences between the 13 SALIGN protocols were not significant at the 95% confidence level (Fig. 2
). However, there were significant differences between the two protocols of SALIGN that have the best average accuracy and the protocol that always picks the best of the 13 alignments. This fact encourages further development of the profileprofile alignment method.
In summary, the alignment success rate (i.e., the fraction of alignments with more than 30% structure overlap within 5 Å) of the SALIGN method is ~13% higher than that for SEA, ~20% higher than that for COMPASS, SAM, LOBSTER, and PSI-BLAST, and 25%45% higher than those of CLUSTALW, ALIGN, and BLAST (Table 4
). Moreover, the best SALIGN protocols increased the structure overlap (5 Å cutoff) by 6%28% relative to the other benchmarked methods without sacrificing the coverage of the aligned sequences. The fraction of the correctly aligned residues relative to the structure-based alignment by our top protocol is 56%, which can be compared with the accuracies of 26%, 42%, 43%, 43%, 43%, 48%, 49%, and 50% for BLAST, ALIGN, CLUSTALW, PSI-BLAST, COMPASS, SAM, SEA, and LOBSTER, respectively.
The present results quantify the significant improvement in the accuracy of sequence alignment that is achieved by the use of multiple sequences, in agreement with previous studies (Pietrokovski 1996; Rychlewski et al. 2000; Edgar and Sjolander 2003a; Panchenko 2003; Sadreyev and Grishin 2003). Here, we emphasize an implementation in our publicly available program MODELLER (http://salilab.org/modeller) as well as an increase in the coverage and accuracy of the method, not its novelty. The tests described in the Results section indicate that the SALIGN protocol aligns pairs of sequences with significantly higher coverage and accuracy than the other benchmarked methods used with recommended settings. However, the gap penalties for SALIGN were optimized for a training set of alignments of similar difficulty as the testing set of alignments. In contrast, although we did use the recommended settings for the other benchmarked programs, these settings may not be optimal for this particular benchmark. It is difficult to estimate how much better the other benchmarked methods would perform if their options and parameters were optimized based on the current training set of alignments.
Other existing methods for profileprofile alignment were not compared in this analysis, for several reasons. The LAMA program was developed to detect sequence relationships between local conserved blocks (Pietrokovski 1996) and not to align two sequences using global dynamic programming that allows for gap insertions. The FFAS program (Rychlewski et al. 2000) was developed for fold assignment and not optimized for sequence alignment. Moreover, the SEA program, which we did test here, was shown by the authors to be superior in alignment accuracy to the FFAS program (Ye et al. 2003). Finally, we could not find programs by Yona and Levitt (2002) and Panchenko (2003). In addition, the COMPASS program, which we did benchmark, compares favorably against the program developed by Yona and Levitt (Sadreyev and Grishin 2003).
None of the methods benchmarked in this paper, including SALIGN, rely on structural information to align two multiple sequence alignments. However, there are methods that do use structure information, including threading and consensus methods, and they can produce accurate sequence-structure alignments (e.g., Kelley et al. 2000; Shi et al. 2001; Fischer 2003; John et al. 2003). We did not benchmark SALIGN against any structure-dependent methods, for three reasons. First, there already is a published benchmark on the EVA Web site (http://cubic.bioc.columbia.edu/eva; Eyrich et al. 2001; Koh et al. 2003) that compares several sequence-structure, SAM, and PSI-BLAST methods. Therefore, the EVA Web site provides an indirect estimate of SALIGN relative to the sequence-structure methods. For example, the FUGUE program (Shi et al. 2001) aligns correctly ~6% more C
atoms within 3.5 Å than the PSI-BLAST program, and is approximately comparable to SAM (http://cubic.bioc.columbia.edu/eva/fr/Pairwise_bestof5.html, December 15, 2003). Similar results are reported by EVA for the 3D-PSSM server (Kelley et al. 2000). Second, many of the threading programs are implemented as Web servers or are not generally available from the authors. Their exact input and output are difficult to control, and consequently informative comparisons are difficult. And finally, although we were motivated by comparative modeling, even application of profileprofile alignment to pairs of sequences, none of which has a known structure, is an important problem.
SALIGN is a starting point for incorporating additional information into the alignment process, to further increase the accuracy of the resulting alignments. For example, information derived from the 3D structure of one of the aligned sequences, such as the environment-dependent substitution matrices (Overington et al. 1992) and variable structure-dependent gap penalties (Zhu et al. 1992; Koretke et al. 1996; Yang 2002), is likely to further improve the utility of sequence-structure alignment in comparative modeling applications.
Currently, SALIGN is used as a module in ModPipe, our software pipeline for large-scale modeling of all available protein sequences (Bairoch and Apweiler 2000) that are detectably related to at least one known protein structure (Eswar et al. 2003). An extrapolation of the present results indicates that SALIGN applied to large-scale modeling will result in an additional ~100,000 sequences that have more than 30% of residues aligned correctly to the closest structure, in comparison to the current models that were calculated based on the PSI-BLAST alignments.
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
|
|
|---|
Al Lazikani, B., Sheinerman, F.B., and Honig, B. 2001. Combining multiple structure and sequence alignments to improve sequence detection and alignment: Application to the SH2 domains of Janus kinases. Proc. Natl. Acad. Sci. 98: 1479614801.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403410.[CrossRef][Medline]
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 33893402.
Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 4548.
Baker, D. and Sali, A. 2001. Protein structure prediction and structural genomics. Science 294: 9396.
Barton, G.J. 1996. Protein sequence alignment and database scanning. In Protein structure prediction: A practical approach. (ed. M.J.E. Sternberg). IRL Press at Oxford University Press, Oxford.
. 1998. Protein sequence alignment techniques. Acta Crystallogr. D Biol. Crystallogr. 54: 11391146.[CrossRef][Medline]
Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., et al. 2002. The Protein Data Bank. Acta Crystallogr. D. Biol. Crystallogr. 58: 899907.[CrossRef][Medline]
Blake, J.D. and Cohen, F.E. 2001. Pairwise sequence alignment below the twilight zone. J. Mol. Biol. 307: 721735.[CrossRef][Medline]
Blundell, T.L., Sibanda, B.L., Sternberg, M.J., and Thornton, J.M. 1987. Knowledge-based prediction of protein structures and the design of novel molecules. Nature 326: 347352.[CrossRef][Medline]
Bonneau, R., Strauss, C.E., and Baker, D. 2001. Improving the performance of rosetta using multiple sequence alignment information and global measures of hydrophobic core formation. Proteins 43: 111.[CrossRef][Medline]
Bowie, J.U., Luthy, R., and Eisenberg, D. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253: 164170.
Brenner, S.E., Chothia, C., and Hubbard, T.J. 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. 95: 60736078.
Cline, M. and Karplus, K. 1998. On the alignment shift and its measures. UCSC-CRL-97-27 27.
Cuff, J.A. and Barton, G.J. 2000. Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 40: 502511.[CrossRef][Medline]
David, R., Korenberg, M.J., and Hunter, I.W. 2000. 3D-1D threading methods for protein fold recognition. Pharmacogenomics 1: 445455.[CrossRef][Medline]
Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14: 755763.
Edgar, R.C. and Sjolander, K. 2003a. SATCHMO: Sequence alignment and tree construction using hidden Markov models. Bioinformatics 19: 14041411.
. 2003b. Simultaneous sequence alignment and tree construction using hidden Markov models. Pac. Symp. Biocomput.: 180191.
Eswar, N., John, B., Mirkovic, N., Fiser, A., Ilyin, V.A., Pieper, U., Stuart, A.C., Marti-Renom, M.A., Madhusudhan, M.S., Yerkovich, B., et al. 2003. Tools for comparative protein structure modeling and analysis. Nucleic Acids Res. 31: 33753380.