|
|
||||||||
Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, Pennsylvania 19111, USA
Reprint requests to: Roland L. Dunbrack Jr., Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111, USA; e-mail: RL_Dunbrack{at}fccc.edu; fax: (215) 728-2412.
(RECEIVED December 26, 2003; FINAL REVISION March 12, 2004; ACCEPTED March 16, 2004)
| Abstract |
|---|
|
|
|---|
Keywords: sequence profiles; profileprofile alignment; PSI-BLAST
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.03601504.
| Introduction |
|---|
|
|
|---|
Accurate sequence alignment between a target sequence and a parent structure to be used as template for modeling is now usually determined with the help of other protein sequences related to both target and parent sequence. Such multiple sequence information is often represented in some form of profile, a matrix of dimensions 20 x L (or 21 x L if information on gaps is contained in the profile), where L is usually the length of the target or query sequence. The profile contains information on the proportion of each amino acid in each column of a multiple sequence alignment of proteins related to the query sequence. Profiles were originally proposed by Gribskov et al. (1987) as a means for database search. They proposed that profiles could be constructed both from related sequences and using structural information to determine which amino acids were likely at each position of a known structure and to determine position-specific gap penalties. Somewhat later they also proposed using secondary-structure-specific substitution matrices and including surface accessibility in determining the profile (Luthy et al. 1991).
The use of profiles has greatly accelerated with the development of the PSI-BLAST program by Altschul et al. (1997). PSI-BLAST uses heuristics to search protein databases rapidly for sequences with good alignment scores to a query sequence. From these sequences and a pairwise amino acid substitution matrix (Henikoff and Henikoff 1993), a profile is constructed that contains the log-odds scores (relative to database frequency) of each amino acid at each position of the query sequence. This profile matrix is then used to search the same database again, but now the position-specific scores are used to evaluate alignments instead of pairwise amino acid comparisons. High-scoring sequences in the second round are used to build a new profile, and the iteration continues.
A generalization of profile-to-sequence database searches and alignments was proposed by Pietrokovski (1996). Instead of searching a database of sequences, one can create a database of profiles from the sequence database, each of which contains information on a protein family. This profile database can now be searched with a profile constructed from a query protein family of interest using profile-to-profile alignments. Several groups have published profile-to-profile alignment methods (Pietrokovski 1996; Rychlewski et al. 1998, 2000; Yona and Levitt 2002; Panchenko 2003; Sadreyev and Grishin 2003; von Öhsen and Zimmer 2003; von Öhsen et al. 2003). Most of these use the standard Smith-Waterman local alignment method (Smith and Waterman 1981), but they vary significantly in a number of important respects. A schematic of the profileprofile alignment procedure is shown in Figure 1
.
|
The second major variation is the method for scoring the alignment of one column of the query profile against a column from a database or template profile. It is, in fact, rather surprising that there is no consensus on how this should be done, from either a theoretical or practical point of view. The methods include correlation coefficients, Euclidean distances (Pietrokovski 1996), sums over substitution matrices weighted by the query and template frequencies (von Öhsen and Zimmer 2003), scores based on information theory (Yona and Levitt 2002), and dot products of log-odds scores with log-odds scores (Pietrokovski 1996), frequencies with frequencies (Rychlewski et al. 2000), or frequencies or amino acid counts with log-odds scores (Sadreyev and Grishin 2003). With the Smith-Waterman algorithm (Smith and Waterman 1981), pairwise scores must on average be negative with the maximum score positive. Some scoring functions require a shift so that the Smith-Waterman algorithm can identify common regions as completely as possible without aligning unrelated regions, and this is discussed in some publications on profileprofile scoring methods.
A third variation occurs for gap penalties. Some methods, such as the BLOCKS database alignments of Pietrokovski (1996) and the core alignment method (Panchenko et al. 1999) used by Panchenko (2003), do not require gap penalties. A recent paper by Mittelman et al. (2003) investigated the different columncolumn scoring functions in generating short ungapped alignments. But local and global dynamic programming methods do require a gap penalty that must be carefully set to produce reasonable alignments. These parameters are usually optimized on a training set to go with the chosen columncolumn scoring function. Because the predominant use of profileprofile alignments is to generate a sequencestructure alignment, it is usually the case that a structure is known for the sequence used to build one of the profiles. From this structure, one can use the secondary structure as well as surface accessibility information.
Finally, profileprofile alignment is used to search databases of profiles, usually derived from sequences of known structure in the PDB, and to produce accurate sequencestructure alignments of the query sequence and template structure. Methods should therefore be tested both for their search sensitivity and specificity as well as their alignment accuracy and completeness. The size of benchmarks used and whether both search and alignment accuracy have been examined also vary among published methods.
In this paper, we explore some alternatives in each category described above with a large benchmark of structurally aligned protein pairs. We examine five issues: (1) seven different scoring functions for comparing two profile columns; (2) how to optimize gap penalties; (3) weighting schemes; (4) whether including fewer or more divergent sequences in each profile is helpful; and (5) whether adding a secondary-structure substitution matrix is beneficial. The benchmark we have derived is larger than most used in previously published profileprofile methods. This is necessary given the large variance in search efficacy and alignment accuracy each method exhibits over a test set.
| Materials and methods |
|---|
|
|
|---|
= (IDCE + IDDALI)/2.
> 40% were discarded.
0.9 or (b) consensus structural alignment length (Nboth) is >100, and the alignment length difference between CE and DALI is <20% of the average length of the CE and DALI alignments. The resulting data set contains 3441 alignment pairs, involving 1627 sequences that belong to 374 SCOP families and 128 superfamilies. The sequence identity range is 0%35%. We call this group of alignments Set A (for "Accuracy"). About one-third of the pairs in Set A were selected randomly to be used as a training set (1136 pairs), with the rest serving as a testing set (2305 pairs).
To evaluate the database searching selectivity and sensitivity, we constructed Set S (for "search"), which was derived from Set A by using the following criteria: No more than two sequences in this data set belong to the same SCOP family, and no more than 10 sequences are from the same superfamily. Set S contains 665 sequences comprising 441,560 alignment pairs (in both directions). Of these, 3320 pairs are true positives, and all others are false positives. True positives were defined in such a way that either both sequences in the alignment belong to the same SCOP fold designation or they are both classified in SCOP as Rossmann-like folds. The Rossmann-like fold consists of a parallel
-sheet in the order 32145, often decorated by varying numbers of
-helices and additional sheet strands. In SCOP 1.48, there are 25 different folds annotated as Rossmann-like fold, and many of these are putatively homologous (Sadreyev and Grishin 2003).
Profile generation
PSI-BLAST (Altschul et al. 1997) was used to build multiple alignments through database searching. We used PSI-BLAST in two ways to build multiple sequence alignments. The first method was accomplished by searching a version of the nonredundant protein sequence database (Wheeler et al. 2004), nr, with low-complexity segments masked with the seg program (Wootton 1994; length parameter = 20) for five rounds with an E-value cutoff for both printing and inclusion in the position-specific scoring matrix of 0.002. The multiple sequence alignment was taken from the final round. We call these Last Round profiles.
Because the initial multiple alignment from PSI-BLAST usually contains many very closely related sequences, we culled the multiple alignments using a mutual sequence identity of 98% (PSI-BLAST culls sequences that are 98% identical or more to the query only). On the other hand, a PSI-BLAST multiple alignment may also contain very distantly related sequences. These sequences may create "noise" in building the profile, either because they are false positives, or because they are poorly aligned to the other sequences in the multiple alignment. We therefore created another set of multiple alignments in which not only redundant sequences were removed from the alignment, but also very distantly related sequences (sequence identity to query <15%) were also removed. We call these Cut Low Ident profiles.
Weighting schemes
An important step in building a profile from a multiple alignment is weighting each sequence in the alignment. We tested three well-established weighting schemes: (1) the Henikoff weighting scheme (Henikoff and Henikoff 1994), which is used in PSI-BLAST and many other profile-related applications; (2) PSIC weighting (Sunyaev et al. 1999), which is used in the COMPASS profileprofile alignment algorithm (Sadreyev and Grishin 2003); (3) an FFAS-like weighting scheme (Rychlewski et al. 2000) that we label Seq Divergence weighting.
In Henikoff weighting, for an amino acid at position m of sequence i, we first determine the subset of sequences that also have an amino acid in the same column of the multiple alignment. Each sequence in this subset may begin and end in different columns of the multiple alignment. We find the first column in which all of these sequences in the subset are represented either with an amino acid or an internal gap (i.e., excluding N- and C-terminal gaps). We call this Cleft. Similarly, we identify the last column of this subset in which all of these sequences are represented either with an amino acid or an internal gap. We call this Cright. The weight of the amino acid at position m of sequence i is then
![]() | (1) |
where N jdiff is the number of different amino acids at alignment position j but considering only those sequences in the subset, and nij is the total number of the same amino acid type as the residue of sequence i in alignment column j in the subset.
In PSIC weighting, the weight of sequence i at alignment position m with amino acid type a(i) is
![]() | (2) |
where Nm is the number of sequences that have amino acid a(i) at position m, n aeff is the effective count of amino acid a(i) at that position,
![]() | (3) |
where Fa is the average number of different amino acid types per position in the sequences that have residue type a(i) at position m. To calculate Fa, we use a similar procedure as with the Henikoff weighting. For PSIC, the subset is defined differently. It now includes only those sequences that have the same amino acid as sequence i, not all sequences that have an amino acid in the column. Once we have the subset of sequences, Cleft and Cright are defined in the same way as in Henikoff weighting above. Fa is then the number of different amino acids averaged over the columns from Cleft and Cright, inclusive.
In the SeqDivergence weighting scheme, the weight of sequence i is
![]() | (4) |
where si,j is the similarity score of sequences i and j,
![]() | (5) |
and Ai,j is the alignment score of sequence i and j calculated with the BLOSUM62 mutation matrix.
It should be noted that the Henikoff weighting method assigns the same weights to amino acids in a sequence that form the same subset, regardless of the different amounts of variation in each column. PSIC, on the other hand, assigns lower weights if many sequences have the same amino acid at a particular position. Whereas Henikoff and PSIC assign weights that vary along the sequence, SeqDivergence assigns a constant weight for each whole sequence.
We calculate "target" frequencies of the 20 amino acids at each position from the observed counts weighted with each weighting scheme listed above using the pseudocount method used in PSI-BLAST (Altschul et al. 1997). However, the Seq Divergence weighting scheme calculates these target frequencies by adding the "balanced family profile frequencies" (Rychlewski et al. 2000):
![]() | (6) |
where fa is the weighted fraction of amino acid type a (or gap) in the sequences aligned at a given position, Wi is the weight of sequence i, and a(i) is the amino acid type in sequence i at this position. From CE structural alignments and following Rychlewski et al., we calculate the probability of a mutation from amino acid b to a (or deletion) pb,a, and then we use a pseudocount method to obtain amino acid target frequencies in a given position as
![]() | (7) |
where N is a normalization coefficient that ensures that
![]() |
For the other methods we use the pseudocount method used in PSI-BLAST (Altschul et al. 1997) to calculate the amino acid target frequency at the given position.
Scoring functions
One of our goals is to test whether the choice of scoring function for aligning positions in two profiles affects sequence alignment accuracy and searching sensitivity and specificity. We implemented several scoring functions already discussed in the literature (Pietrokovski 1996; Rychlewski et al. 2000; Yona and Levitt 2002; Sadreyev and Grishin 2003; von Öhsen and Zimmer 2003; von Öhsen et al. 2003). The functions we have tested are very similar to those tested by Mittleman et al. (2003) in gapless profileprofile alignment tests.
Sum of pairs
These scoring functions use the summation of the products of frequencies for both columns for every combination of amino acid a and b. There are two variants of this function: one (Cross Product) multiplies the products by the corresponding log-odds elements of the substitution matrix BLOSUM62, sab:
![]() | (8) |
The other function (Log Average) multiplies the products by the corresponding BLOSUM62 matrix amino acid substitution frequencies, qab, and takes the logarithm to get the final profileprofile alignment scores (von Öhsen and Zimmer 2003; von Öhsen et al. 2003):
![]() | (9) |
qab is the BLOSUM62 matrix frequency (calculated from the standard log-odds form) of amino acid a being aligned to amino acid b.
Dot product
This is the summation of the products of the frequencies or log-odds values for both columns. Two functions in this family have been tested, one (Dot P Freq) computes the product using frequencies:
![]() | (10) |
and the second function (Dot P Odds) calculates the dot product using log-odds values:
![]() | (11) |
where
![]() |
are log-odds values with pa as the background frequency of amino acid a.
Pearsons correlation coefficient (CORREL)
This function was used in the LAMA method for profileprofile comparison (Pietrokovski 1996):
![]() | (12) |
Jensen-Shannon function
This score function, Jensen Shannon, was introduced by Yona and Levitt (2002). It involves the calculation of a divergence score and a significance score. The divergence score D is computed using the equation
![]() | (13) |
where
![]() |
and the significance score S is calculated by
![]() | (14) |
where
![]() |
The final substitution score is the combination of the divergence score D and significance score S:
![]() | (15) |
Symmetric log-odds multinomial score
The score function used in COMPASS, Log Odds Multin, is a natural extension of PSI-BLAST for alignments of profiles with profiles (Sadreyev and Grishin 2003):
![]() | (16) |
where na 1 and na 2 are the effective counts for each amino acid in columns 1 and 2, which are calculated with the following formula (Pei et al. 2003; Sadreyev and Grishin 2003):
![]() | (17) |
And c1 and c2 are weighting parameters to balance the contribution of the two terms in the formula:
![]() | (18) |
![]() | (19) |
A very similar function was suggested to us by Stephen Altschul (pers. comm.) in 2001.
Gap penalty and zero-shift parameters optimization
We use the Smith-Waterman local alignment algorithm to align profiles (Smith and Waterman 1981). This algorithm requires the average score between random pairs of positions from the two profiles to have a negative score, so that alignments do not extend into unrelated regions of the two profiles. Some of the scores described above yield only positive scores and therefore require a "zero-shift." We used Set A as a training set of pairwise structure alignments to optimize the gap penalty parameters and zero-shift value in terms of alignment accuracy with respect to CE structural alignments.
Three parameters were used to monitor the alignment quality: QModeler, QDeveloper, and QCombined. QModeler and QDeveloper were first introduced by us in Sauder et al. (2000) as fM and fD, standing for the quality of the alignment from a modelers point of view and the quality from a developers point of view. These values were renamed and used by Yona and Levitt (2002) and by Sadreyev and Grishin (2003), who used the Yona-Levitt names. We use the newer names here for sake of consistency with the later papers. QModeler is the fraction of correctly aligned positions in the profileprofile alignment, and QDeveloper is the fraction of correctly aligned positions in the structural alignment. That is,
![]() | (20) |
![]() | (21) |
where nC is the number of aligned pairs in common between the sequence and structure alignments, nA is the number of aligned pairs in the sequence (or profileprofile) alignment, and nS is the number of aligned pairs in the structure alignment. QModeler penalizes sequence alignments that are too long; that is, when nA >> nS. QDeveloper penalizes sequence alignments that are too short; that is, when nA << nS. Taken together, they give a picture of the good and bad features of a sequence alignment method. Yona and Levitt introduced another parameter, QCombined, that penalizes sequence alignments that are either too long or too short. It is defined as
![]() | (22) |
where nT is the total number of unique alignment pairs in either the sequence or structure alignment or both. If a pair of residues is aligned in both the sequence and structure alignments, it is only counted once in nT. We have found that QCombined is more useful as a parameter for optimizing gap parameters than either QModeler or QDeveloper. Optimization based on QModeler results in very short alignments of the most highly conserved regions and hence low QDeveloper scores, whereas optimization on QDeveloper results in very long alignments often with low QModeler scores.
We optimized the gap penalties using QCombined as the target function for several of the scoring schemes described above. For each scheme, we need to determine the optimal combination of zero-shift parameter and gap-open and gap-extension parameters. We optimized parameters on a combination of five gap-open values, five zero-shift values, and three gap-extension values, for a total of 75 sets. Gap-open candidates were determined based on the variation scale of columncolumn alignment scores; zero-shift candidates were assigned according to the average columncolumn score; and three gap-extension candidates were selected as 0.1, 0.075, and 0.05 x gap-open. If the best parameters (in terms of QCombined) included one at the extremes of its range, then additional values beyond that extreme were tested. The resulting gap parameters are subsequently referred to as Full Opt Gaps.
This process of optimizing the parameters for each individual pair function by testing the whole training set on 75 parameter sets is very time-consuming. Rychlewski et al. (2000) have introduced a protocol for determining gap penalties and zero-shift parameters in their FFAS profileprofile alignment method by transforming columncolumn scores into the standard normal (µ = 0;
= 1) distribution, and determining the gap penalties for scores that follow a standard normal distribution. We compared the results of this protocol to the optimization over 75 parameter sets (see Results).
In the Results section, we call the gap parameters determined by this procedure Fitted Gaps. In this procedure, every aligned pair has its own transformation as follows: First, the L1 columns of profile 1 are compared with the L2 columns of profile 2 using a particular scoring scheme (Cross Product, Jensen-Shannon, etc.), and the average µ and standard deviation
of the L1 x L2 scores are calculated. The score for any pair of columns S1,2 is then transformed to the standard normal distribution:
![]() | (23) |
The scoring function used to align the two profiles includes a constant zero-shift term to produce an average score that is negative:
![]() | (24) |
The gap penalties as well as the zero-shift value used in this scheme are those provided by Rychlewski et al. (2000), optimized on the UCLA-DOE fold recognition benchmark (Fischer et al. 1996).
Gaps in profiles
PSI-BLAST produces profiles that are the length of the query sequence. That is, all columns in the multiple sequence alignment with a gap character in the query sequence are excluded from the profile. We used this method to create profiles we refer to as No Gaps In Query. We also created profiles that included these columns to test whether this procedure improves alignments of search selectivity. In the Results section, these profiles are referred to as Gaps In Query profiles.
Adding secondary structure similarity measures
A secondary structure substitution matrix was determined for use in profileprofile alignments as follows: Because the primary purpose for profileprofile alignments is usually to determine distant relationships between proteins of unknown structure and those of known structure, we have determined an asymmetric substitution matrix between predicted secondary structures and known structures. A similar substitution matrix for predicted versus experimental secondary structures has been used by Ginalski et al. (2003) in their ORFEUS server. Secondary structure for the proteins in Set A was predicted from output profiles from the last round of PSI-BLAST using the PSI-PRED program (Jones 1999), and the experimental secondary structures were determined from the corresponding PDB files with the program Stride (Frishman and Argos 1995). The substitution matrix was determined using a standard procedure (Henikoff and Henikoff 1993) from the structure alignments of proteins in Set A as determined by the program CE.
To use the secondary structure substitution matrix, we need to balance this matrix with the columncolumn scores for each scoring scheme. First, both the columncolumn scores and the secondary structure scores were adjusted to the standard normal distribution, as in the equation for S1,2 above. Second, we set the fraction of columncolumn score and secondary structure score from x of 0.50 to 0.90 in steps of 0.05:
![]() | (25) |
where T is the total score and R is the secondary-structure matrix substitution score. This score is then shifted to the standard normal, and the zero-shift of 0.12 is used to produce the final score.
| Results |
|---|
|
|
|---|
To compare these scoring functions fairly, we optimized the gap-open, gap-extension, and zero-shift parameters for each scheme independently of the others. We optimized the gap parameters based on the value of QCombined, as described in Materials and Methods. QCombined is the fraction of correct pairs aligned out of the total number of unique aligned pairs in either the predicted sequence alignment or structure alignment to which it is compared (Yona and Levitt 2002). The optimized parameters are listed in Table 1
. A comparison of the alignment accuracy and search sensitivity-specificity for these functions is shown in Figure 2
. At the top of the figure, other choices in the profileprofile alignment protocol that were used to generate the data in the figure are also listed. In the case of Figure 2
, the Last Round multiple sequence alignments were used, the Henikoff weighting scheme was used, and no secondary structure information (No Sec Str) was included. In this figure and the subsequent figures, No Gaps In Query was used (see below). In Figure 2
, we show the values of QModeler, QDeveloper, and QCombined as a function of sequence identity. QModeler is the fraction of correctly aligned pairs out of the number of aligned pairs in the predicted sequence alignment. QDeveloper is the fraction of pairs in the structure alignment that are also aligned in the predicted sequence alignment.
|
|
The results for search specificity and sensitivity in Figure 2
(lower right panel) show that the searching abilities of the different functions differ significantly. The Log Odds Multin, Correl, and Log Average scoring functions perform significantly better than the other functions. For instance, Correl finds 850 true positives before the 100th false positive, compared with 560 for Cross Product.
We chose three functions for further analysisthe Log Odds Multin, Jensen Shannon, and Cross Product functions, because they perform well in at least one of the measures in Figure 2
. Other functions could have been chosen.
Optimizing gap penalties and zero-shift parameter with the FittedGaps scheme
Because the parameter optimization is time-consuming and we wished to explore several other features of profileprofile alignment, we decided to test a method proposed by Rychlewski et al. (2000) for optimizing these parameters. They proposed establishing optimized gap and zero-shift parameters for columncolumn scores that follow a standard normal distribution with mean of 0.0 (without the zero-shift) and variance of 1.0, and normalizing any scoring system of interest to the standard normal. We tested this for the three functions we have chosen for further analysis, as shown in Figure 3
. The standard-normal scheme works quite well, producing alignment quality in terms of QCombined that is approximately equal to the fully optimized parameters over the full range of sequence identity. This was true without further optimization of the parameters proposed by Rychlewski et al. Starting from the parameters from Rychlewski et al., we optimized the parameters for each of the three scoring schemes. The best parameters were only slightly different from the Rychlewski parameters (data not shown). Whereas optimizing on QCombined resulted in similar results between Fitted Gap Param and Full Opt Gap Param, the Log Odds Multin scoring function behaves differently under the two optimization schemes. For Full Opt Gap Param, as noted above, it exhibits higher QModeler and lower QDeveloper than the others, whereas with Fitted Gap Param, it behaves more like the other functions. Fitted Gap Param improves the search capability of all three scoring functions shown in Figure 3
. The reason is presumably that the columncolumn scores are adjusted individually for every pair of profiles to be aligned after calculation of their average columncolumn score and its standard deviation. Thus, for instance, two profiles with similar amino acid compositions will have relatively stricter gap parameters when compared with the adjusted columncolumn scores. This may reduce false positives, for instance, for all-
versus all-
structures.
|
We decided to test the issues of weighting with sequence choice together using the Log Odds Multin scoring function. The results are shown in Figure 4
. We tested three weighting schemes, that of Henikoff and Henikoff used by PSI-BLAST (Henikoff and Henikoff 1994), that of Rychlewski et al. (2000) used in the FFAS programs, and the PSIC scheme (Sunyaev et al. 1999) used in COMPASS (Sadreyev and Grishin 2003). We refer to these three schemes as Henikoff weighting, Seq Divergence weighting, and PSIC weighting, respectively. The PSIC weighting scheme appears to be better than the Henikoff and Seq Divergence weighting schemes in both sequence alignment accuracy using all three quality scores and search sensitivity and specificity, for the Log Odds Multin columncolumn scoring function, regardless of what sequence choice scheme is used. The PSIC weighting also improves alignments with the other scoring functions (data not shown).
|
Adding secondary structure information
Because profileprofile alignments are often used to align sequences without a known structure to those with a known structure for the purpose of structure prediction, we decided to test whether comparing the predicted secondary structure of a query sequence (based on its profile) and the known secondary structure of another sequence would improve alignments. We used a set of structure alignments, predicted secondary structures on profiles based on the sequences in these alignments and the PSIPRED program (Jones 1999), and the known secondary structures of these proteins based on their experimental structures to derive a secondary structure substitution matrix. The Secondary Structure Substitution Matrix has the form:
![]() | (26) |
where the rows represent predicted secondary structures, Helix, Coil, Sheet, respectively, and the columns represent experimental secondary structures in the same order. The values in the matrix are sensible, because Helix-for-Sheet substitutions are given the most negative values, and Sheet-for-Sheet the most positive value.
To use the secondary structure substitution matrix, we had to optimize the balance between the columncolumn scores and the secondary structure matrix scores. This was done as described in Materials and Methods. For the seven functions we studied, the weights of the secondary structure matrix (out of 100% total) were: Cross Product, 20%; Log Average, 30%; Dot P Freq, 35%; Dot P Odds, 20%; CORREL, 30%; Jensen Shannon, 25%; and Log Odds Multin, 30%. For all these functions, the use of the secondary structure matrix improved sequence alignment accuracy slightly. The results are shown in Figure 5
. Secondary structure information improved the search capability of the Cross Product function significantly. It improved the results for the Log Odds Multin function only at low numbers of false positives (Fig. 5
, inset).
|
|
The Log Odds Multin and Log Average functions perform better than others in terms of specificity and sensitivity, whereas the Dot P Odds behaves significantly worse. Because the scoring functions used to generate these data were all fitted to the standard normal, they are all approximately on the same scale. We therefore tried combining them in one score simply by adding the scores for all seven functions for each alignment pair. This results in better search sensitivity and specificity than any of the individual scores (Fig. 6
, lower right panel). This result does not depend on any prior knowledge of the correct answer, and therefore could be used to improve search capability.
| Discussion |
|---|
|
|
|---|
Mittleman et al. (2003) recently published a similar study in which they compared columncolumn scoring methods in gapless alignments. They chose several functions either the same as or quite similar to those tested here, including Correl, Dot P Odds, Log Odds Multin, Cross Product, Jensen Shannon, and Dot P Freq. They used a test set of 1800 pairwise structure alignments with the DALI program, whereas we have used a set of protein pairs with consistent DALI and CE structure alignments to reduce noise due to choice of structure alignment method. When we used the Full Opt Gap Param parameters, the test set was 2305 pairs, and with the Fitted Gap Param parameters, it was 3441 pairs. They judged the alignment quality by the number of short fragments (of length 5 or 7) correctly aligned in the top 20 hits for each pair of profiles. This parameter is not directly comparable to QModeler, QDeveloper, or QCombined used here, but it is probably closest to QModeler, because there is no adjustment for the length of the structure alignment. The results in Mittleman et al. (2003) are quite similar to those here, with the log-odds-based methods (Dot P Odds, Log Odds Multin) and Jensen Shannon method behaving better than the others. This is encouraging considering the different protocols used for deriving the alignments and assessing their accuracy.
A key component in these results has been the use of a large benchmark of structural alignments of homologous proteins used in judging sequence alignment accuracy and search sensitivity and specificity. We have been able to give results with reasonably small uncertainties at all levels of sequence identity, to identify significant variations among methods chosen in each step of profileprofile alignment. This is in contrast to some smaller sets used when new methods are developed and in the CASP series of experiments, where the test sets are generally too small to reach firm conclusions, although some important trends have been identified (Bourne 2003; Venclovas et al. 2003). We have used the same principle in developing side-chain prediction (Canutescu et al. 2003) and loop modeling methods (Canutescu and Dunbrack Jr. 2003), and it appears to be becoming more the rule than the exception as recent efforts indicate (Mittelman et al. 2003).
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
|
|
|---|
Bourne, P.E. 2003. CASP and CAFASP experiments and their findings. Methods Biochem. Anal. 44: 501507.[Medline]
Canutescu, A.A. and Dunbrack Jr., R.L. 2003. Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Sci. 12: 963972.
Canutescu, A.A., Shelenkov, A.A., and Dunbrack Jr., R.L. 2003. A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci. 12: 20012014.
Fischer, D., Elofsson, A., Rice, D., and Eisenberg, D. 1996. Assessing the performance of fold recognition methods by means of a comprehensive benchmark. Pac. Symp. Biocomput. pp. 300318.
Frishman, D. and Argos, P. 1995. Knowledge-based protein secondary structure assignment. Proteins 23: 566579.[CrossRef][Medline]
Ginalski, K., Pas, J., Wyrwicz, L.S., von Grotthuss, M., Bujnicki, J.M., and Rychlewski, L. 2003. ORFeus: Detection of distant homology using sequence profiles and predicted secondary structure. Nucleic Acids Res. 31: 38043807.
Gribskov, M., McLachlan, A.D., and Eisenberg, D. 1987. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. 84: 43554358.
Henikoff, S. and Henikoff, J.G. 1993. Performance evaluation of amino acid substitution matrices. Proteins 17: 4961.[CrossRef][Medline]
. 1994. Position-based sequence weights. J. Mol. Biol. 243: 574578.[CrossRef][Medline]
Holm, L. and Sander, C. 1993. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233: 123138.[CrossRef][Medline]
Jones, D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195202.[CrossRef][Medline]
Luthy, R., McLachlan, A.D., and Eisenberg, D. 1991. Secondary structure-based profiles: Use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins 10: 229239.[CrossRef][Medline]
Mittelman, D., Sadreyev, R., and Grishin, N. 2003. Probabilistic scoring measures for profileprofile comparison yield more accurate short seed alignments. Bioinformatics 19: 15311539.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536540.[CrossRef][Medline]
Panchenko, A.R. 2003. Finding weak similarities between proteins by sequence profile comparison. Nucleic Acids Res. 31: 683689.
Panchenko, A., Marchler-Bauer, A., and Bryant, S.H. 1999. Threading with explicit models for evolutionary conservation of structure and sequence. Proteins 37: 133140.
Pei, J., Sadreyev, R., and Grishin, N.V. 2003. PCMA: Fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics 19: 427428.
Pietrokovski, S. 1996. Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res. 24: 38363845.
Rychlewski, L., Zhang, B., and Godzik, A. 1998. Fold and function predictions for Mycoplasma genitalium proteins. Fold. Des. 3: 229238.[CrossRef][Medline]
Rychlewski, L., Jaroszewski, L., Li, W., and Godzik, A. 2000. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci. 9: 232241.[Abstract]
Sadreyev, R. and Grishin, N. 2003. COMPASS: A tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol. 326: 317336.[CrossRef][Medline]
Sauder, J.M., Arthur, J.W., and Dunbrack Jr., R.L. 2000. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins 40: 622.[CrossRef][Medline]
Shindyalov, I.N. and Bourne, P.E. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Prot. Eng. 11: 739747.
Sjolander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I.S., and Haussler, D. 1996. Dirichlet mixtures: A method for improved detection of weak but significant protein sequence homology. Comput. Appl. Biosci. 12: 327345.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195197.[CrossRef][Medline]
Sunyaev, S.R., Eisenhaber, F., Rodchenkov, I.V., Eisenhaber, B., Tumanyan, V.G., and Kuznetsov, E.N. 1999. PSIC: Profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng. 12: 387394.
Venclovas, C., Zemla, A., Fidelis, K., and Moult, J. 2003. Assessment of progress over the CASP experiments. Proteins 53Suppl. 6: 585595.
von Öhsen, N. and Zimmer, R. 2001. Improving profileprofile alignments via log average scoring. In Algorithms in bioinformatics, First International Workshop, WABI 2001 (eds. O. Gascuel and B.M.E. Moret), pp. 1126. Springer-Verlag, Berlin.
von Öhsen, N., Sommer, I., and Zimmer, R. 2003. Profileprofile alignment: A powerful tool for protein structure prediction. Pac. Symp. Biocomput. 252263.
Wheeler, D.L., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Sequeira, E., et al. 2004.Database resources of the National Center for Biotechnology Information: Update. Nucleic Acids Res. 32 Database issue: D35D40.
Wootton, J.C. 1994. Non-globular domains in protein sequences: Automated segmentation using complexity measures. Comput. Chem. 18: 269285.[CrossRef][Medline]
Yona, G. and Levitt, M. 2002. Within the twilight zone: A sensitive profileprofile comparison tool based on information theory. J. Mol. Biol. 315: 12571275.[CrossRef][Medline]
![]()
CiteULike
Connotea
Del.icio.us
Digg
Reddit
Technorati What's this?
This article has been cited by other articles:
![]() |