1 Department of Biological Sciences and
2 Department of Information Science, Kanagawa University, Hiratsuka, Kanagawa 259-1293, Japan
Reprint requests to: Joji M. Otaki, Department of Biological Sciences, Kanagawa University, 2946 Tsuchiya, Hiratsuka, Kanagawa 259-1293, Japan; e-mail: otaki{at}bio.kanagawa-u.ac.jp; fax: 81-463-58-9684.
(RECEIVED August 31, 2004; FINAL REVISION November 9, 2004; ACCEPTED November 10, 2004)
| Abstract |
|---|
Keywords: protein sequence; database search; sequence availability; constituent sequence; rare short sequence
Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.041092605.
| Introduction |
|---|
α-helices is made between the ridge of one helix and the groove of the other helix, which are mostly made of three or four residues (Chothia et al. 1981). Since then, protein databases have been used almost exclusively for similarity searches (Altschul et al. 1990, 1997; Jonassen 2000; Yona and Brenner 2000; Baldi and Brunak 2001; Mount 2001; Schuler 2001). Although highly successful, this approach may have difficulty identifying short but functional amino acid sequences in proteins. On the other hand, few systematic studies of the characteristics of the protein databases themselves have been performed. Since proteomics is expected to play an important role in the biological sciences (Boguski and McIntosh 2003; Tyers and Mann 2003), it would be highly valuable to statistically characterize the present-day protein database, which holds >1.5 million records.
Accordingly, we pursued an approach alternative to similarity searches. Since the present-day protein database has accumulated hypothetical protein sequences derived from genome projects, characterizing the database itself should reveal fundamental knowledge of how protein chains are constructed. Our data suggest that information on the primary structure of proteins exists in the context of connections of short sequences. We also show that some amino acid sequences occur in proteins much less frequently than expected, and we discuss possible implications of the existence of these rare short sequences.
| Results and Discussion |
|---|
The discussion above, however, presumes that all 20 amino acids are used randomly and with equal frequency. Obviously, this is not the case in real proteins. Accordingly, to examine which sets of amino acids are used frequently and which rarely, one of the fundamental questions in protein science, it is necessary to study how frequently a given short amino acid sequence appears in the database relative to its random probability of occurrence, i.e., to clarify differences in the availability of short sequences. In this study, we considered proteins to be composed of unit sequences of n amino acids (n = 1, 2, 3, 4, 5). We call these unit sequences constituent sequences, and we investigated how frequently they appeared in the nonredundant (nr) protein database. In other words, we first generated all possible combinatorial sets of three, four, and five amino acids (20³ or 8000 triplet species, 20⁴ or 160,000 quartet species, and 20⁵ or 3,200,000 pentat species, respectively) (Table 1), treated these sets as units of information, and asked how they occur in the database.
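The enumeration and counting described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used in the study: the function names `species_count`, `count_constituents`, and `zero_count_species`, and the two toy records, are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def species_count(n, alphabet=AMINO_ACIDS):
    """Number of distinct constituent sequences of length n."""
    return len(alphabet) ** n


def count_constituents(records, n):
    """Count every length-n window, sliding one residue at a time."""
    counts = Counter()
    for seq in records:
        # Records shorter than n give an empty range and contribute nothing.
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def zero_count_species(counts, n, alphabet=AMINO_ACIDS):
    """Species of length n that never occur in the counted records."""
    all_species = ("".join(p) for p in product(alphabet, repeat=n))
    return [s for s in all_species if s not in counts]


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, species_count(n))  # 8000, 160000, 3200000
    toy_records = ["MKTAYIAKQR", "MKTAY"]  # hypothetical sequences
    triplets = count_constituents(toy_records, 3)
    print(triplets["MKT"], len(zero_count_species(triplets, 3)))
```

A dictionary-based counter is used here because only the species that actually occur need to be stored; the full list of 20ⁿ species is generated lazily when the zero-count species are requested.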
It is obvious that, as the number of database records increases, the frequency distribution of the relative triplet-count becomes less biased by nonrandom sampling of sequence records, because the sample population approaches the parent population, the collection of all protein species on the earth. Since there are already 1.5 million records in the nr protein database, it is unlikely that the characteristic triplet-count distribution shown above is simply an artifact of a database bias. However, the possibility remains that the characteristic distribution resulted from over-representation or under-representation of particular proteins in the database. To examine this possibility, we performed similar triplet analyses in five phylogenetically distinct biological species whose genome sequences are already known: human (Homo sapiens), mouse (Mus musculus), fruit fly (Drosophila melanogaster), soil nematode (Caenorhabditis elegans), and a colon bacterium (Escherichia coli).
We first obtained the amino acid composition of each species. Although the compositions differed slightly from one another, the overall trend of the ranked counts was almost identical to that of the whole nr database and almost invariant across species (data not shown). According to equations 4 and 5, we further obtained histograms of the relative triplet-count distributions in each species (Fig. 1D). They were all essentially similar to that of the whole nr protein database, as compared with the randomly generated one, in terms of fundamental statistical values (data not shown). A two-tailed Mann-Whitney U-test with Bonferroni correction indicated that they were all similar to one another with P > 0.01 (data not shown), except for the combination of human and nematode (P = 8.27 × 10⁻⁶); this significant difference between the two species seemed to arise mainly from the difference in peak values. This result excluded the possibility that the nonrandom nature of the triplet composition was a mere reflection of an artificial database bias. We concluded that this nonrandom nature seemed to originate not from an inherent bias of the database itself but from biological reasons. This tendency toward low availability of particular constituent sequences was also observed in the frequency distributions of quartet and pentat counts (data not shown).
Some pentat species never appeared in the database
Although all 8000 triplet species and all 160,000 quartet species existed in the database in spite of this low availability, some particular pentat species never appeared in the database, which we called "zero-count pentats." The zero-count pentats can be considered as a representative case where low availability of particular constituent sequences was driven to extremes. There were 12,080 zero-count pentat species among all 3,200,000 pentat species, i.e., 0.4% of the entire population. A collection of 12,080 zero-count pentat species, which we called "zero-count pentat pool" or simply "zero-pentat pool," was thus to be characterized.
For comparison, we calculated "theoretical" pentat-counts for each of the 3,200,000 pentat species and generated a "theoretical" zero-pentat pool from an imaginary pool of amino acids whose composition was identical to that of the nr database, thus taking into account the biased amino acid composition. The amino acid compositions of the real and theoretical zero-count pentat pools (Fig. 1A, solid and broken lines, respectively) were markedly different; most amino acids were used more frequently in the real pool than in the theoretical one, except for histidine, cysteine, and tryptophan. This result again indicated that the real zero-count pentat species cannot be predicted readily from the amino acid composition of the whole nr database. That is, the availability difference was well reflected in the real zero-count pentat pool.
Theoretically, 832 pentat species had counts <1.00 and were hence expected not to appear even once in the database; we called this collection the "theoretical zero-pentat pool." There were only 82 theoretical zero-pentat species under the more stringent criterion of counts <0.50. In reality, 12,080 pentat species never appeared in the nr database, 14.5 times (with counts <1.00) and 147.3 times (with counts <0.50) more than expected. Thus, some pentat species never occurred in the database even though they were expected to appear.
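The ratios quoted above follow directly from the reported pool sizes, and a theoretical zero pool can be generated from any composition table. The sketch below is a hedged illustration only: `expected_count`, `theoretical_zero_pool`, and the three-letter toy composition are hypothetical, and the expected count is assumed to be the product of the residue frequencies multiplied by the total number of windows Q.

```python
from itertools import product
from math import prod


def expected_count(species, freq, q):
    """Assumed form: Q times the product of the residue frequencies."""
    return q * prod(freq[a] for a in species)


def theoretical_zero_pool(freq, q, n, threshold=1.0):
    """Species of length n whose expected count falls below the threshold."""
    return [
        "".join(s)
        for s in product(sorted(freq), repeat=n)
        if expected_count(s, freq, q) < threshold
    ]


# Arithmetic on the pool sizes reported in the text.
real_zero = 12080
print(round(real_zero / 832, 1))          # 14.5
print(round(real_zero / 82, 1))           # 147.3
print(round(100 * real_zero / 20**5, 1))  # 0.4 (percent of all pentat species)

# Toy example with a hypothetical three-letter alphabet.
toy_freq = {"A": 0.7, "B": 0.29, "C": 0.01}
print(theoretical_zero_pool(toy_freq, q=1000, n=3))
```

With the toy composition, only species rich in the rare letter C fall below the threshold, which mirrors why the theoretical pool in the text is dominated by the rarest amino acids.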
On the other hand, other pentat species occurred in the database even though their counts were expected to be zero (Table 2; Fig. 2A). These deviations from the theoretical calculations, based on the nr database composition, showed the existence of not only low availability but also high availability of some constituent sequences, even when the nonuniform amino acid composition was taken into account. However, the theoretical zero-count pentat species (832 species) with high real counts showed repeated usage of a few amino acids such as cysteine, tryptophan, and histidine. Since simple repetition of cysteine, tryptophan, and histidine residues would easily give particular pentat species high counts in the nr database, the biological significance of these pentats is less clear.
Although unlikely considering the physicochemical nature of peptides, the first and second possibilities were tested simply by chemically synthesizing the pentats as peptides and biologically expressing them as fusion proteins in Escherichia coli. For this purpose, we chose the four zero-count pentats with the highest theoretical counts: KHAMY (theoretical pentat count, or tpc, 36), PGIMW (tpc 32), WPCLE (tpc 32), and CMPAN (tpc 31). We also chose CMWCM (tpc 1) as a representative zero-count pentat because it is composed of CMW, MWC, and WCM, the three most frequently used triplets in the zero-pentat pool (data not shown). For comparison, CMWRL (tpc 13) was also chosen because it had the highest count among CMWXX (where X is any amino acid). CMPAN, already chosen, also served as a good comparison for CMWCM and CMWRL because it had the highest count among CMXXX (where X is any amino acid).
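The way a pentat breaks down into overlapping triplets, and the way candidates sharing a fixed prefix (CMWXX or CMXXX) are grouped, can be shown with a short Python sketch. The helper names `overlapping_units` and `with_prefix` are hypothetical and serve only as an illustration.

```python
def overlapping_units(seq, n):
    """Return the overlapping length-n units of seq, left to right."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def with_prefix(species, prefix):
    """Select the species that start with a fixed prefix, e.g. CMW for CMWXX."""
    return [s for s in species if s.startswith(prefix)]


chosen = ["KHAMY", "PGIMW", "WPCLE", "CMPAN", "CMWCM", "CMWRL"]
print(overlapping_units("CMWCM", 3))  # ['CMW', 'MWC', 'WCM']
print(with_prefix(chosen, "CMW"))     # ['CMWCM', 'CMWRL']
print(with_prefix(chosen, "CM"))      # ['CMPAN', 'CMWCM', 'CMWRL']
```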
These six zero-count pentat species showed a variety of chemical characters, including hydrophobicity, but all were successfully synthesized using a conventional Fmoc method with a reasonably high yield (Table 4; Fig. 3A,B). This result immediately excluded the first possibility. In addition, all were successfully synthesized as part of soluble proteins using a conventional E. coli BL21Star(DE3) system, although yields varied somewhat (Fig. 3C,D), excluding the second possibility. Thus, the third possibility is most likely. Although we could not exclude the possibility that other zero-count pentats may not be synthesized so easily, it is tempting to speculate that zero-count pentats in general are not highly toxic. It is conceivable that a highly toxic sequence of amino acids may be utilized, for example in toxins, exactly because of its toxicity.
The varied availability detected in this study could be considered as a remnant of a fixed evolutionary accident, but this putative "fixation" is fundamentally different from that of the genetic code, for example. The latter cannot be changed once the system has started to be used, but the former can at least theoretically be revised or ignored easily. Our results imply that proteins on the earth highly evolved in the direction where functionally useful constituent sequences were repeatedly used in any parts of proteins. Thus, instead of an evolutionary remnant, it could be a consequence of functional protein evolution.
Within the zero-pentat pool, there was no significant positional bias in amino acid usage among the five positions of a pentat (data not shown). However, this does not exclude the possibility that some distribution patterns of certain amino acids exist in the zero-pentat pool. Indeed, it is known that certain patterns of hydrophobic and nonhydrophobic amino acids are favored in α-helices (Vazquez et al. 1993). Similarly, binary patterns of polar and nonpolar amino acids are favored in α-helices and solvent-exposed β-strands (West and Hecht 1995), whose biological significance is at least partially known (Broome and Hecht 2000; Mandel-Gutfreund and Gregoret 2002).
In this context, it would be valuable to examine the relationships between secondary structures and triplets, quartets, and pentats. As some residues are preferred in a given secondary structure (Chou and Fasman 1974; Lim 1974; Garnier et al. 1978), some pentats, for example, may be favored in a given secondary structure. The availability difference may reflect the number of these secondary structures in proteins at the population level. Another possibility is that proteins containing a particular triplet, quartet, or pentat preferentially belong to a particular protein family, and the family composition of the database may cause the nonrandom nature of the frequency distributions of triplet, quartet, and pentat counts. Protein records for a given triplet, quartet, or pentat should therefore be examined with respect to structural and functional protein classifications using specialized databases such as PDB (Protein Data Bank) (Berman et al. 2000) and SCOP (Structural Classification of Proteins) (Murzin et al. 1995). Alternatively, more sophisticated algorithms (Rigoutsos and Floratos 1998; Huynh and Rigoutsos 2004) may help to elucidate such relations. These studies could reveal a hidden logic of building protein chains.
It is well known that structural and functional similarities among proteins are frequently conserved in a stretch of a few amino acids. A notable example can be drawn from the G-protein-coupled receptor (GPCR) superfamily, in which little sequence similarity can be found unless two receptors are very closely related (Schwartz 1996; Wess 1998). This makes the conventional similarity search much less useful than one might expect. The combinatorial analyses of triplets, quartets, and pentats performed here can be considered complementary to similarity searches. There are several sites with just a single or a few residues that are relatively conserved and functionally important among a group of GPCRs, one of which is called the "DRY sequence" (a triplet of aspartic acid, arginine, and tyrosine) located at the boundary between the third transmembrane domain and the second intracellular loop (Schwartz 1996; Wess 1998; Ohyama et al. 2002; Wilbanks et al. 2002). Since GPCRs cannot readily be classified based on similarity alignments of sequences, several lines of research have already proposed new methods for characterizing GPCR sequences with or without alignments (Graul and Sadee 2001; Otaki and Firestein 2001; Karchin et al. 2002; Lapinsh et al. 2002), one of which is an alignment-independent method using principal component analysis (Lapinsh et al. 2002). This method takes into account the functional contributions of all amino acids in GPCRs. Its effectiveness indicated varied but significant contributions of all amino acids to the final classification, in contrast to more traditional alignment-based methods that consider only the most important ones, or conventional motifs. Consistent with this, the triplet, quartet, and pentat analyses performed here also indicated that any part of a protein molecule is better considered to be composed of evolutionarily selected constituent sequences. In other words, almost any amino acid at a given site influences its neighbors.
Although most, if not all, zero-count pentats detected in the database as of September 2003 may exist in some proteins on the earth (indeed, some of them already exist in the database as of July 2004), they might be useful in designing proteins with a new folding pattern and a unique biological character. Physiologically important protease-sensitive proteins may be transformed into protease-resistant ones with the use of zero-count pentats.
| Materials and methods |
|---|
Definitions and operations
We first analyzed 8000 combinatorial sets of three amino acids (triplets). Detailed procedures were described elsewhere (Otaki et al. 2003). A three-amino-acid window that defines a triplet in a large linear sequence is conceptually slid one by one along the protein chain, so that a given amino acid residue is an overlapping part of three different triplets unless it is located at the ends of the chain. Thus, the total number of existing triplets in all sample records (defined as Q below) can be written as:
$$Q = \sum_{j=1}^{N} (n_j - 2) = A - 2N \tag{1}$$
where $n_j$ is the number of amino acid residues in a given protein $j$, $N$ is the number of protein records in the database, and $A$ is the total number of amino acid residues in the database. Alternatively, based on the triplet count $T_{klm}$ for each triplet $a_k a_l a_m$ in the database, the total number of existing triplets ($Q$) can be expressed as follows, considering there are 8000 different triplets:
$$Q = \sum_{k=1}^{20} \sum_{l=1}^{20} \sum_{m=1}^{20} T_{klm} \tag{2}$$
Conversely, from the probabilistic expression of the amino acid count for each amino acid ($p$, $q$, or $r$) in the database, $P_p$, $P_q$, or $P_r$, the expected triplet count $E_{pqr}$ for each triplet $a_p a_q a_r$ is given as follows:
$$E_{pqr} = Q \, P_p P_q P_r \tag{3}$$
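As a consistency check on the expected counts, they should add up to the total number of windows. The short derivation below assumes that the expected count has the product form $E_{pqr} = Q\,P_p P_q P_r$ and that the residue frequencies are normalized ($\sum_p P_p = 1$); both are assumptions stated here for the illustration.

```latex
\begin{aligned}
\sum_{p=1}^{20}\sum_{q=1}^{20}\sum_{r=1}^{20} E_{pqr}
  &= Q \Bigl(\sum_{p=1}^{20} P_p\Bigr)
       \Bigl(\sum_{q=1}^{20} P_q\Bigr)
       \Bigl(\sum_{r=1}^{20} P_r\Bigr) \\
  &= Q \cdot 1 \cdot 1 \cdot 1 = Q .
\end{aligned}
```

Under these assumptions the expected counts and the real triplet counts share the same total $Q$, so any difference between them reflects only how that total is distributed among the 8000 triplet species.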
The difference between the theoretically estimated triplet count $E_{pqr}$ and the real triplet count $T_{pqr}$ for each triplet in the database is expressed as follows:
$$D_{\mathrm{Triplet}} = \frac{T_{pqr} - E_{pqr}}{E_{pqr}} \tag{4}$$
Likewise, the difference between the theoretically estimated triplet count $E_{pqr}$ and the randomly generated triplet count $R_{pqr}$ from the population with the identical amino acid composition is expressed as follows:
$$D_{\mathrm{Random}} = \frac{R_{pqr} - E_{pqr}}{E_{pqr}} \tag{5}$$
We call DTriplet and DRandom relative triplet-counts. The frequency distribution of DRandom is supposed to show random fluctuations of the sampling procedure itself around a central value, resulting in the normal error curve. Distribution histograms for DTriplet and DRandom were compared with each other.
Similar operations were performed in analyzing quartets and pentats. The general expression of equation 1 for the total number of combinations of $n$ amino acids in the nr protein database can be written as:
$$Q = \sum_{j=1}^{N} \bigl(n_j - (n - 1)\bigr) = A - (n - 1)N \tag{6}$$
This equation was used to calculate the number of combinations $Q$ in Table 1.
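The total number of sliding windows can be checked numerically. The sketch below assumes the closed form Q = A − (n − 1)N, where A is the total number of residues and N the number of records, and compares it with a record-by-record count; the function names and the toy record lengths are hypothetical.

```python
def total_units_closed_form(lengths, n):
    """Assumed closed form: Q = A - (n - 1) * N."""
    total_residues = sum(lengths)   # A
    record_count = len(lengths)     # N
    return total_residues - (n - 1) * record_count


def total_units_by_counting(lengths, n):
    """Count windows record by record; too-short records contribute zero."""
    return sum(max(length - n + 1, 0) for length in lengths)


lengths = [10, 5, 7]  # hypothetical record lengths
for n in range(1, 6):
    print(n, total_units_closed_form(lengths, n), total_units_by_counting(lengths, n))
```

The two functions agree whenever every record has at least n − 1 residues. A record shorter than that makes the closed form undercount, because its term becomes negative instead of zero, so a real pipeline would either count directly or filter out very short records first.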
Computer programs
We developed a program in Java to count the number of each amino acid and each triplet in the database and to subsequently execute several operations. The output data were exported to Microsoft Excel 2000 and processed numerically and graphically. Statistical analyses were performed using ystat 2002 together with Excel.
To generate a theoretical random distribution from a population with the identical amino acid composition, the sampling procedure was repeated as many times as the number of amino acid residues in the database, which is equivalent to randomly reconstituting theoretical proteins from all the real database records after conceptually breaking them into amino acid monomers.
To demonstrate the operational accuracy of the Java program we developed and of the program for the random sampling procedure, we used the "ABC model," in which three imaginary amino acids represented by the letters A, B, and C were processed by these programs with a given composition and an imaginary population of 100 million letters. In this case, only 27 triplets exist, making the system simpler and amenable to calculation by hand. The output data generated by the programs were compared with the hand-calculated ones. We found that these outputs were virtually identical, except for unavoidable fluctuations from the random sampling procedure itself, confirming the operational accuracy (data not shown).
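A scaled-down version of such an "ABC model" check can be sketched in Python. Everything specific here is an assumption made for illustration: the composition (0.5, 0.3, 0.2), the population of 200,000 letters rather than 100 million, the fixed seed, and the use of (R − E)/E as the relative deviation between randomly generated and expected counts.

```python
import random
from collections import Counter
from itertools import product


def simulate_abc(composition, length, seed=0):
    """Draw a random letter string with the given composition."""
    rng = random.Random(seed)
    letters = list(composition)
    weights = [composition[a] for a in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))


def triplet_counts(seq):
    """Count overlapping triplets with a window slid one letter at a time."""
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))


def expected_counts(composition, q):
    """Expected count per triplet: Q times the product of letter frequencies."""
    return {
        "".join(t): q * composition[t[0]] * composition[t[1]] * composition[t[2]]
        for t in product(composition, repeat=3)
    }


composition = {"A": 0.5, "B": 0.3, "C": 0.2}  # hypothetical composition
seq = simulate_abc(composition, 200_000)
observed = triplet_counts(seq)
q = len(seq) - 2                               # one record, so Q = A - 2
expected = expected_counts(composition, q)
deviation = {t: (observed[t] - expected[t]) / expected[t] for t in expected}
worst = max(abs(d) for d in deviation.values())
print(len(expected), f"largest relative deviation: {worst:.3f}")
```

With only 27 triplet species, every expected value can also be worked out by hand (for example, AAA is expected at 0.125 × Q), so the printed deviations show the size of the sampling fluctuations alone.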
Peptide synthesis and protein expression
Six representative nonbiological pentats were chemically synthesized by the Fmoc method using a Symphony peptide synthesizer (Rainin Instrument) and analyzed by reverse-phase high-performance liquid chromatography (RP-HPLC) and mass spectrometry with the help of the QIAGEN laboratory, Japan. RP-HPLC was performed using a Gold Nouveau/338 HPLC system (Beckman) equipped with a Hydrosphere ODS Vydac C-18 column (L 250 mm × ø 4.6 mm, particle size 5 µm). Samples in 0.1% trifluoroacetic acid were separated in a water-acetonitrile solvent gradient system at a flow rate of 1.0 mL/min. Absorbance was monitored at 215 nm. Mass spectrometry (MALDI-TOF-MS) was performed with MAT-LC/Q (Finnigan) to confirm the identities of the synthetic products.
For bacterial expression, a double-stranded DNA molecule made of two strands of synthetic oligonucleotides corresponding to a zero-pentat amino acid sequence was directionally inserted into the expression vector pET102/D-TOPO (Invitrogen), in which the insert was flanked by thioredoxin on the N-terminal side and the V5 epitope on the C-terminal side (Fig. 4C). The recombinant plasmid was isolated from single colonies of Escherichia coli TOP10 and transformed into E. coli BL21Star(DE3). Expression from the T7 promoter was induced with 0.8 mM IPTG in 10 mL LB-ampicillin. The bacterial culture (800 µL aliquot) was sampled at 0, 2, 4, and 6 h after the IPTG addition. Each aliquot was prepared in 60 µL SDS-sample buffer, boiled, and sonicated. This lysate (30 µL each) was applied to an 18% Tris-glycine SDS-polyacrylamide gel of 1.5 mm thickness. After electrophoresis, protein was either detected on the gel with Colloidal Blue Coomassie G-250 (Invitrogen) or transferred onto a PVDF membrane and probed with anti-V5 antibody conjugated with HRP (Invitrogen). Signals on the membranes were detected using 3,3'-diaminobenzidine (Vector Lab).
| Acknowledgments |
|---|
| References |
|---|
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Baldi, P. and Brunak, S. 2001. Bioinformatics: The machine learning approach. MIT Press, Cambridge, MA.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The protein data bank. Nucleic Acids Res. 28: 235-242.
Boguski, M.S. and McIntosh, M.W. 2003. Biomedical informatics for proteomics. Nature 422: 233-237.
Broome, B.M. and Hecht, M.H. 2000. Nature disfavors sequences of alternating polar and non-polar amino acids: Implications for amyloidogenesis. J. Mol. Biol. 296: 961-968.
Chothia, C. 1984. Principles that determine the structure of proteins. Annu. Rev. Biochem. 53: 537-572.
Chothia, C., Levitt, M., and Richardson, D. 1981. Helix-to-helix packing in proteins. J. Mol. Biol. 145: 215-250.
Chou, P.Y. and Fasman, G.D. 1974. Prediction of protein conformation. Biochemistry 13: 222-245.
Garnier, J., Osguthorpe, D.J., and Robson, B. 1978. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120: 97-120.
Graul, R.C. and Sadee, W. 2001. Evolutionary relationships among G protein-coupled receptors using a clustered database approach. AAPS PharmSci. 3: E12.
Huynh, T. and Rigoutsos, I. 2004. The Web server of IBM's bioinformatics and pattern discovery group: 2004 update. Nucleic Acids Res. 32: W10-W15.
Jonassen, I. 2000. Methods for discovering conserved patterns in protein sequences and structures. In Bioinformatics: Sequence, structure, and databanks. A practical approach (eds. D. Higgins and W. Taylor), pp. 143-166. Oxford University Press, New York.
Jukes, T.H. 1973. Arginine as an evolutionary intruder into protein synthesis. Biochem. Biophys. Res. Commun. 53: 709-714.
Jukes, T.H., Holmquist, R., and Moise, H. 1975. Amino acid composition of proteins: Selection against the genetic code. Science 189: 50-51.
Karchin, R., Karplus, K., and Haussler, D. 2002. Classifying G-protein coupled receptors with support vector machines. Bioinformatics 18: 147-159.
King, J.L. and Jukes, T.H. 1969. Non-Darwinian evolution. Science 164: 788-798.
Lapinsh, M., Gutcaits, A., Prusis, P., Post, C., Lundstedt, T., and Wikberg, J.E. 2002. Classification of G-protein coupled receptors by alignment-independent extraction of principal chemical properties of primary amino acid sequences. Protein Sci. 11: 795-805.
Lim, V.I. 1974. Algorithms for prediction of α-helical and β-structural regions in globular proteins. J. Mol. Biol. 88: 873-894.
Mandel-Gutfreund, Y. and Gregoret, L.M. 2002. On the significance of alternating patterns of polar and non-polar residues in β-strands. J. Mol. Biol. 323: 453-461.
Mount, D.W. 2001. Bioinformatics: Sequence and genome analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536-540.
Ohyama, K., Yamano, Y., Sano, T., Nakagomi, Y., Wada, M., and Inagami, T. 2002. Role of the conserved DRY motif on G protein activation of rat angiotensin II receptor type 1A. Biochem. Biophys. Res. Commun. 292: 362-367.
Otaki, J.M. and Firestein, S. 2001. Length analyses of G-protein-coupled receptors. J. Theor. Biol. 211: 77-100.
Otaki, J.M., Gotoh, T., and Yamamoto, H. 2003. Frequency distribution of the number of amino acid triplets in the non-redundant protein database. J. Jpn. Soc. Information Knowledge 13: 25-38.
Ramachandran, G.N. and Sasisekharan, V. 1968. Conformation of polypeptides and proteins. Adv. Protein. Chem. 28: 283-437.
Rigoutsos, I. and Floratos, A. 1998. Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14: 55-67.
Schuler, G.D. 2001. Sequence alignment and database searching. In Bioinformatics: A practical guide to the analysis of genes and proteins (eds. A.D. Baxevanis and B.F.F. Ouellette), pp. 187-214. Wiley-Liss, New York.
Schwartz, T.W. 1996. Molecular structure of G-protein-coupled receptors. In Textbook of receptor pharmacology (eds. J.C. Foreman and T. Johansen), pp. 65-84. CRC Press, Boca Raton, FL.
Tyers, M. and Mann, M. 2003. From genomics to proteomics. Nature 422: 193-197.
Vazquez, S., Thomas, C., Lew, R.A., and Humphreys, R.E. 1993. Favored and suppressed patterns of hydrophobic and nonhydrophobic amino acids in protein sequences. Proc. Natl. Acad. Sci. 90: 9100-9104.
Wess, J. 1998. Molecular basis of receptor/G-protein-coupling selectivity. Pharmacol. Therapeutics 80: 231-246.
West, M.W. and Hecht, M.H. 1995. Binary patterning of polar and nonpolar amino acids in the sequences and structures of native proteins. Protein Sci. 4: 2032-2039.
Wheeler, D.L., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Sequeira, E., et al. 2002. Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res. 30: 13-16.
Wilbanks, A.M., Laporte, S.A., Bohn, L.M., Barak, L.S., and Caron, M.G. 2002. Apparent loss-of-function mutant GPCRs revealed as constitutively desensitized receptors. Biochemistry 41: 11981-11989.
Yona, G. and Brenner, S.E. 2000. Comparison of protein sequences and practical database searching. In Bioinformatics: Sequence, structure, and databanks. A practical approach (eds. D. Higgins and W. Taylor), pp. 167-190. Oxford University Press, New York.