Protein Science, Vol 4, Issue 6 1145-1160, Copyright © 1995 by Cold Spring Harbor Laboratory Press
ARTICLE
W. R. PEARSON
Department of Biochemistry, University of Virginia, Charlottesville, Virginia 22908
We have compared commonly used sequence comparison algorithms, scoring matrices, and gap penalties using a method that identifies statistically significant differences in performance. Search sensitivity with either the Smith-Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45-55, and optimized gap penalties instead of the conventional PAM250 matrix. More dramatic improvement can be obtained by scaling similarity scores by the logarithm of the length of the library sequence (ln()-scaling). With the best modern scoring matrix (BLOSUM55 or JO93) and optimal gap penalties (-12 for the first residue in the gap and -2 for additional residues), Smith-Waterman and FASTA performed significantly better than BLASTP. With ln()-scaling and optimal scoring matrices (BLOSUM45 or Gonnet92) and gap penalties (-12, -1), the rigorous Smith-Waterman algorithm performed better than either BLASTP or FASTA, although with the Gonnet92 matrix the difference from FASTA was not significant. Ln()-scaling performed better than normalization based on other simple functions of library sequence length. Ln()-scaling also performed better than scores based on normalized variance, but the differences were not statistically significant for the BLOSUM50 and Gonnet92 matrices. Optimal scoring matrices and gap penalties are reported for Smith-Waterman and FASTA, using conventional or ln()-scaled similarity scores. Searches with no penalty for gap extension, no penalty for gap opening, or an infinite penalty for gaps performed significantly worse than the best methods. Differences in performance between FASTA and Smith-Waterman were not significant when partial query sequences were used. However, the best performance with complete query sequences was obtained with the Smith-Waterman algorithm and ln()-scaling.
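To make the gap-penalty convention and the idea of length scaling concrete, the following Python sketch computes a Smith-Waterman local alignment score with the (-12, -2) penalties quoted above and then divides it by the natural logarithm of the library sequence length. Several parts are assumptions for illustration only: the abstract does not give the exact ln()-scaling formula, so plain division by ln(length) is a stand-in; `toy_score` (+5 match, -4 mismatch) is a hypothetical substitute for a real matrix such as BLOSUM55; and the function names are invented here.

```python
import math


def toy_score(a, b):
    """Hypothetical stand-in for a real scoring matrix such as BLOSUM55."""
    return 5 if a == b else -4


def smith_waterman_score(query, subject, score=toy_score,
                         gap_first=-12, gap_extend=-2):
    """Best local alignment score with affine gap penalties.

    A gap of k residues costs gap_first + (k - 1) * gap_extend, matching
    the "-12 for the first residue, -2 for additional residues" convention.
    Only the score is returned; no traceback is kept.
    """
    neg_inf = float("-inf")
    n = len(subject)
    h_prev = [0] * (n + 1)        # best scores ending in the previous row
    f_prev = [neg_inf] * (n + 1)  # previous row, alignment ending in a vertical gap
    best = 0
    for a in query:
        h_cur = [0] * (n + 1)
        f_cur = [neg_inf] * (n + 1)
        e = neg_inf               # current row, alignment ending in a horizontal gap
        for j in range(1, n + 1):
            e = max(h_cur[j - 1] + gap_first, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_first, f_prev[j] + gap_extend)
            diag = h_prev[j - 1] + score(a, subject[j - 1])
            h_cur[j] = max(0, diag, e, f_cur[j])  # 0 restarts the local alignment
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def ln_scaled(raw_score, library_length):
    """Assumed form of ln()-scaling: raw score divided by ln(library length)."""
    if library_length < 2:
        raise ValueError("library sequence length must be at least 2")
    return raw_score / math.log(library_length)


if __name__ == "__main__":
    query = "AAAAAAWWWWWW"
    subject = "AAAAAACCCWWWWWW"   # same as query with a 3-residue insertion
    raw = smith_waterman_score(query, subject)
    print(raw, ln_scaled(raw, len(subject)))
```

Under this assumed form, the same raw score counts for less against a longer library sequence, which reflects the motivation stated in the abstract: longer sequences tend to produce higher similarity scores by chance, so some correction for library sequence length improves search performance.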