Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, London WC1E 6BT, United Kingdom
(RECEIVED August 10, 2004; FINAL REVISION February 24, 2005; ACCEPTED February 25, 2005)
There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although the structural data are sparser (~13,000 nonredundant structures solved to date), several powerful sequence-based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of publicly available strategies for providing structural annotations for genome sequences, and we describe the protocol adopted to provide CATH structural annotations for completed genomes. In particular, we assess the performance of several sequence-based protocols employing hidden Markov model (HMM) technologies for superfamily recognition, including a new approach, SAMOSA (sequence augmented models of structure alignments), that exploits multiple structural alignments from the CATH domain structure database when building the models. On a data set of remote homologs detected by structure comparison and manually validated in CATH, a single-seed HMM library recognized 76% of the data set. Including the SAMOSA models in the HMM library gave little gain in homolog recognition, although a slight improvement in alignment quality was observed for very remote homologs. However, an expanded 1D-HMM library, CATH-ISL, increased the coverage to 86%. The single-seed HMM library has been used to annotate the protein sequences of 120 genomes from all three major kingdoms, allowing up to 70% of the genes or partial genes to be assigned to CATH superfamilies. It has also been used to recruit sequences from Swiss-Prot and TrEMBL into CATH domain superfamilies, expanding the CATH database eightfold.
Keywords: CATH; HMM; sequence profile benchmarking; intermediate sequence library; sequence alignments
Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.041056105.
Reprint requests to: Mark Dibley, Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London, WC1E 6BT, United Kingdom; e-mail: dibley{at}biochem.ucl.ac.uk; fax: +020-7679-7193.