1 Institute for Molecular Bioscience, University of Queensland, St. Lucia 4072, Brisbane, Australia
2 Protagonist Propriety Limited, St. Lucia 4067, Queensland, Australia
Reprint requests to: Mark L. Smythe, Institute for Molecular Bioscience, University of Queensland, St. Lucia 4072, Brisbane, Australia; e-mail: m.smythe{at}imb.uq.edu.au; fax: +61-7-3346-2101.
(RECEIVED July 8, 2004; FINAL REVISION September 22, 2004; ACCEPTED September 23, 2004)
| Abstract |
|---|
Keywords: disulphide; disulfide; nonhomologous; PDB; PDBSELECT; arrangement; pattern
Abbreviations: IDSB, intramolecular disulphide bond; PDB, Protein Data Bank; SCOP, Structural Classification of Proteins database; Con. A, concanavalin A.
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.04923305.
| Introduction |
|---|
When disulphide bonds are between cysteines in the same protein chain, they are referred to as intramolecular disulphide bonds (IDSB). Knowledge of the location of IDSBs is useful for structure prediction and taxonomy. For sequence-based native conformation prediction, it is useful to be able to predict where in a protein sequence the IDSBs will form. This knowledge dramatically reduces the number of possible conformations of the protein, making conformation prediction more feasible (Casadio et al. 2000; Fariselli and Casadio 2001). If the three-dimensional (3D) structure of a protein is known, the arrangement of the IDSBs can be used to taxonomically classify proteins which have little or no secondary structure, by aligning their IDSBs in space (Mas et al. 2001) or, if some secondary structure is present, by analyzing patterns of IDSB placement relative to the secondary structure (Harrison and Sternberg 1996).
It has been predicted (Richardson 1981; Miller et al. 1987; White 1992) that proteins with a high concentration of IDSBs form a distinct subset of proteins. Harrison and Sternberg (1994) confirmed this by showing that the ratio between the number of IDSBs and the number of residues in proteins (or IDSB concentration) is bimodally distributed on a logarithmic scale.
Although previous work (Harrison and Sternberg 1994) has identified the existence of two populations of IDSB-containing proteins, and examined proteins' IDSB bonding patterns (Table 1) as a whole, no work to our knowledge has examined the differences in IDSB bonding between the two populations. To do this, a data set of 1280 sequence-nonhomologous, IDSB-containing protein chains was derived from the Protein Data Bank (PDB) (Berman et al. 2000), using a modified version of the PDBSELECT algorithm (Hobohm and Sander 1994). The arrangement of IDSBs within the protein chains (referred to as connectivities) was then examined. This included the concentration of disulphide bonds, the pattern of disulphide bonding, the "distance" (i.e., number of sequential amino acids) between each half-IDSB, and the "distances" between the first half-IDSB and the N terminus and the last half-IDSB and the C terminus (referred to as "tail lengths"). By analyzing trends in IDSB connectivities as a function of IDSB concentration, different patterns of PDB headers, SCOP folds (Murzin et al. 1995), and connectivities were observed.
| Results |
|---|
In order to analyze bonding patterns, canonical representations of disulphide bonding patterns were created by numbering half-IDSBs sequentially from the N terminus. Although these "1-2 3-4"-style bonding patterns were canonical, they were also difficult to interpret for chains with more than two IDSBs. The relationship between any two IDSBs can be described as "independent," "overlapping," or "enclosed" (Table 1). To help clarify the independent (I), overlapping (O), or enclosed (E) content of bonding patterns, each of the IDSBs is systematically removed (from the N to the C terminus) and the relationship between the remaining pairs is described (Fig. 6). For example, 1-2 3-4 5-6 is annotated (III), 1-4 2-5 3-6 is annotated (OOO), and 1-5 2-3 4-6 is annotated (IOE).
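The canonical numbering and the I/O/E annotation described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the code used in this work: the function names are hypothetical, bonds are assumed to be given as pairs of residue positions, and only two- and three-IDSB patterns are handled because those are the only cases whose annotation format is shown in the text.

```python
def canonical_pattern(bonds):
    """Renumber half-IDSBs 1..2n from the N terminus.

    `bonds` is a list of (position_a, position_b) residue positions
    (an assumed input format). Returns sorted (low, high) index pairs.
    """
    positions = sorted(p for bond in bonds for p in bond)
    rank = {pos: i + 1 for i, pos in enumerate(positions)}
    return sorted(tuple(sorted((rank[a], rank[b]))) for a, b in bonds)


def format_pattern(pairs):
    """Render canonical pairs in the "1-2 3-4" style used in the text."""
    return " ".join(f"{a}-{b}" for a, b in pairs)


def relation(bond1, bond2):
    """Classify two IDSBs as independent (I), overlapping (O), or enclosed (E)."""
    (a, b), (c, d) = sorted([tuple(sorted(bond1)), tuple(sorted(bond2))])
    if b < c:      # first bond closes before the second opens
        return "I"
    if d < b:      # second bond lies entirely inside the first
        return "E"
    return "O"     # the two bonds interleave


def annotate(pairs):
    """Annotate a two- or three-IDSB pattern with I/O/E letters.

    For three IDSBs, each bond is removed in turn (N to C terminus) and
    the relationship between the remaining two is recorded.
    """
    pairs = sorted(tuple(sorted(p)) for p in pairs)
    if len(pairs) == 2:
        return relation(*pairs)
    if len(pairs) != 3:
        raise ValueError("sketch only covers two- and three-IDSB patterns")
    letters = []
    for removed in range(3):
        rest = pairs[:removed] + pairs[removed + 1:]
        letters.append(relation(*rest))
    return "".join(letters)


if __name__ == "__main__":
    # Hypothetical cysteine positions, chosen only to produce a 1-5 2-3 4-6 pattern.
    pattern = canonical_pattern([(3, 40), (10, 15), (30, 55)])
    print(format_pattern(pattern), annotate(pattern))
```

Applied to the patterns quoted in the text, this scheme reproduces the stated annotations, for example (IOE) for 1-5 2-3 4-6 and (EII) for 1-2 3-6 4-5.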
Loop size correlations and patterns
Patterns in the sizes of the loops within a chain can be examined by treating chains with n IDSB loops as n-dimensional points. The loop sizes of each chain are then represented by a single point.
Loop sizes in chains with overlapping bonding patterns were correlated. For example, Figure 7B illustrates a weak correlation (r = 0.759) between loop 1 and loop 2 length, in chains with a 1-3 2-4 (overlapping) disulphide bond arrangement. In three-IDSB chains with a 1-4 2-5 3-6 (OOO) arrangement, a stronger correlation was noted (r ranged from 0.837 to 0.960). Although both IDSB-rich and -poor chains were considered together when calculating correlation coefficients, protein chains with overlapping bonding patterns (1-3 2-4 or 1-4 2-5 3-6 [OOO]) were predominantly IDSB-rich (Fig. 5).
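As an illustration of the point-based view of loop sizes, the following sketch computes a Pearson correlation coefficient between loop 1 and loop 2 lengths. The loop-size values are invented for the example and are not data from this work; the function name is likewise hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Each chain with two IDSBs is one 2-dimensional point (loop 1, loop 2).
    # These values are made up purely for illustration.
    chains = [(12, 15), (20, 22), (25, 31), (33, 30), (41, 47)]
    loop1, loop2 = zip(*chains)
    print(f"r = {pearson_r(loop1, loop2):.3f}")
```

For three-IDSB chains the same function would be applied to each pair of loop dimensions, which is consistent with the range of r values reported for the (OOO) arrangement.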
In enclosed (1-4 2-3) IDSB bonding patterns, where the outer loop was less than 40 residues, the sizes of loop 1 and loop 2 were strongly correlated (r = 0.946, Fig. 7C). Enclosed IDSBs were found equally in IDSB-rich and -poor chains; however, loops larger than 40 residues were exclusively found in IDSB-poor chains.
Two clusters in the loop size distribution of chains with a 1-2 3-4 IDSB bonding pattern (mainly IDSB-poor) accounted for 62% of 1-2 3-4 chains (Fig. 7A, boxed areas). The larger cluster contained 68 chains: the first loop having 0–21 residues and the second, 0–36 residues. This cluster contained most of the 119 chains with a 1-2 3-4 IDSB bonding pattern that had at least one loop <36 residues. The second cluster contained 25 chains: the first loop having 50–73 residues and the second, 46–67 residues. In total, 44 chains with a 1-2 3-4 IDSB bonding pattern contained at least one loop between 46 and 73 residues.
This mapping technique efficiently illustrates differences in loop size distributions within different disulphide bonding patterns.
In chains with a 1-2 3-4 5-6 (III) IDSB bonding pattern, no loops of size 14 residues were seen (Supplemental Material: 12_34_.csv). This was unusual for three reasons: First, 13 and 15 residue loops were the most common loop size (eight loops each) for these chains. Second, all loop sizes up to 24 residues were seen (by which point the frequencies had decreased to one or two observations). Finally, 14 residues was not an uncommon loop size in other bonding patterns; for example, it was the second most common loop size in chains with a 1-4 2-5 3-6 (OOO) bonding pattern.
Tail lengths
The numbers of sequence-contiguous amino acids between the first half-IDSB and the N terminus and between the last half-IDSB and the C terminus were referred to as "tail lengths." IDSB-poor tail lengths were more widely distributed than IDSB-rich tail lengths. This was illustrated by their percentiles: for IDSB-rich chains, 10%, 0 residues; 50%, 2 residues; 90%, 9 residues; in comparison to IDSB-poor percentiles of 10%, 2 residues; 50%, 24 residues; 90%, 173 residues. The most common IDSB-rich N- and C-terminal tails were two, one, zero and zero, one, two residues, respectively. The most common IDSB-poor N- and C-terminal tails were two, three, five and one, two, zero residues, respectively. These results differ from those reported in the literature (Harrison and Sternberg 1996), where two and three residue tails were, and were predicted to be, the most common. A second mode near 21 residues was also seen in both the IDSB-poor N- and C-terminal tails. No correlation between N- and C-terminal tail lengths within the same protein chain was found (r = 0.2 for both IDSB-rich and -poor).
Fold and bonding pattern
The relationship between SCOP folds and bonding patterns was initially investigated by analyzing the bonding patterns of each SCOP fold. The 1280 sequence-nonhomologous protein chains in the data set featured 246 SCOP folds. Of these folds, a third contained proteins having more than one IDSB bonding pattern. One of the folds, the "knottins," featured 28 different bonding patterns.
For some IDSB bonding patterns, a relationship between bonding pattern and the SCOP fold of the protein chain was found. SCOP folds can provide valuable insights on both the structure and function of a protein. In this work, the bonding pattern data were too sparse to draw conclusions for chains with more than three IDSBs, as not every possible bonding pattern was represented in the data.
Bonding patterns which were strongly associated with one particular SCOP fold included 1-2 3-6 4-5 (EII) with "C-type lectin-like"; 1-3 2-4 5-6 (IIO) and 1-4 2-5 3-6 (OOO) with "Knottin"; and 1-5 2-4 3-6 (OOE) with "Defensin-like." Bonding patterns were also found to be strongly associated with PDB headers. Examples include 1-2 3-4 (I) with "Hydrolase" and "Immune system," 1-3 2-4 (O) with "Cytokine" and "Toxin," 1-2 3-4 5-6 (III) with "Hydrolase," and 1-4 2-5 3-6 (OOO) with "Toxin." These relationships between bonding patterns and SCOP folds/PDB headers were a further illustration of the analogous relationship between bonding pattern and IDSB concentration (Fig. 5), and IDSB concentration and SCOP folds/PDB headers (Fig. 1).
Some SCOP folds and PDB headers predominantly contained particular loop sizes (Supplemental Material: scoploop.csv). PDB headers were not clearly delineated over the 7–16 residue range; however, beyond 16 residues associations were visible (e.g., "Cytokine" with 23–27 and 36–42 residues and "Hydrolase inhibitor" with 33–36 residues).
A three-way relationship between loop sizes, SCOP fold, and PDB header was also found. When the loop sizes of chains with an overlapping (1-3 2-4) 2-IDSB bonding pattern were plotted against each other, a cluster of 12 IDSB-poor chains was found (Fig. 7B, loop 1: 23–26 residues, loop 2: 37–41 residues). The cluster accounted for most of the overlapping 2-IDSB chains which had loops in the size range (20 chains had a loop size between 23 and 26 residues, and 19 chains had a loop size between 37 and 41 residues). All of the chains in the cluster had similar PDB headers and SCOP folds, despite having nonhomologous sequences. According to their PDB headers, nine were "cytokines," and all of the known SCOP folds were classified as "IL8-like" folds.
| Discussion |
|---|
This work required an unbiased set of protein chains to draw statistical inferences. The nonhomologous list used in this work had three advantages over the commonly used PDBSELECT list: the number of chains in the set was larger (three times larger than the PDBSELECT list, seven times larger than in Harrison and Sternberg's 1994 work), bonding patterns of the chains were considered when determining whether proteins were similar, and chains with fewer than 30 residues (which were almost by definition IDSB-rich) were included. In our list, 91 chains (7%) had fewer than 30 residues.
The main prior work in this area is by Harrison and Sternberg (1994), who partitioned the chains by length, at 72 and 193 residues, to produce three groups with equal numbers of chains in each. Consequently, Harrison and Sternberg's >193 residue partition is composed entirely of IDSB-poor chains. Our partitioning, near 25 residues per IDSB, is based on differences in frequency (data density), chain lengths, headers, and SCOP folds, and represents a more thorough examination of the data.
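The partitioning can be expressed as a simple threshold on the number of residues per IDSB (the reciprocal of the IDSB-to-residue ratio). The sketch below is only illustrative: the text places the boundary "near" 25 residues per IDSB, so the exact cutoff of 25.0 and the treatment of chains falling exactly on the boundary are assumptions, and the function names are hypothetical.

```python
def residues_per_idsb(chain_length, n_idsb):
    """Number of residues per intramolecular disulphide bond in a chain."""
    if n_idsb < 1:
        raise ValueError("chain must contain at least one IDSB")
    return chain_length / n_idsb


def classify_chain(chain_length, n_idsb, threshold=25.0):
    """Label a chain IDSB-rich or IDSB-poor.

    `threshold` (residues per IDSB) is an assumed cutoff; the text only
    states that the partition lies near 25 residues per IDSB.
    """
    if residues_per_idsb(chain_length, n_idsb) < threshold:
        return "IDSB-rich"
    return "IDSB-poor"


if __name__ == "__main__":
    print(classify_chain(30, 3))    # 10 residues per IDSB
    print(classify_chain(300, 2))   # 150 residues per IDSB
```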
In the IDSBs-per-chain frequency distribution (Fig. 4), the number of IDSB-poor chains decreased approximately by half with each additional IDSB, whereas IDSB-rich chains had a bell-shaped distribution centered around three IDSBs per chain. In other words, IDSBs were distributed by overall IDSB frequency in IDSB-poor chains, whereas IDSB-rich chains followed different rules favoring three IDSBs per chain.
A chain with two, four, or an odd number of cysteines was more often an IDSB-poor chain, with no guarantee that all the cysteines would bond to form IDSBs. A chain with six, eight, 10, 12, or 14 cysteines was more likely to be an IDSB-rich chain, with all of its cysteines forming IDSBs. This information may be of use for cysteine bonding prediction. The chain length and the number of cysteines give an indication of the IDSB concentration, which indicates whether the chain will be IDSB-rich or -poor, which in turn favors particular IDSB bonding patterns.
Unbonded cysteines contain highly reactive thiol groups, which can form intermolecular disulphide bonds, leading to protein precipitation. To prevent this, in molecules large enough to be able to do so, unbonded cysteines are usually buried in the core of the protein (Petersen et al. 1999). Chains were found to be biased (IDSB-rich chains more so) toward having few or no unbonded cysteines. Overall, IDSB-rich chains had fewer unbonded cysteines than IDSB-poor chains (Fig. 3). As IDSB-rich chains are, on average, smaller than IDSB-poor chains, they have less solvent-inaccessible volume in which to bury unbonded cysteines.
Our median loop sizes for IDSB-rich and -poor are similar to Thornton (1981), even though the data set has increased from 50 to over 1000 protein chains. Loop sizes in chains were found not to be independently drawn from the overall frequency distribution for the bonding pattern. Instead, common loop size combinations (clusters) shaped the frequency distributions for loop sizes. Clusters in loop sizes were associated with SCOP folds. Unexpectedly, the most common loop lengths in IDSB-rich chains were longer than those of IDSB-poor chains. This may be explained by IDSB-poor chains favoring independent bonding patterns, which minimize loop length, and IDSB-rich favoring overlapping bonding patterns, which maximize loop length.
Harrison and Sternberg (1994) theorized that IDSBs near the termini of a protein chain are more entropically favorable if they occur a few residues from the end of the chain. In chains longer than 50 residues, they found a peak at two to three residues, which had a significant deviation from a uniform distribution. Results from our work showed that over all protein chains, zero, one, and three residue tails were equally common, with two residue tails being approximately 15% more frequent (Fig. 8). The results from this work therefore did not support Harrison and Sternberg's theory that zero residue tails are less favorable than two or three. Harrison and Sternberg analyzed only chains >50 residues; even when only the longer IDSB-poor chains were considered, our results still showed that one residue C-terminal tails are more common than two residue tails, and zero residue tails are more common than three.
IDSB-connectivity prediction enjoys mixed success (Fariselli et al. 1999; Ceroni et al. 2003; Rost and Liu 2003), which may be due to the existence of two (or more) distinct groups of proteins in the available data, which have differing connectivities, PDB classifications, and 3D SCOP folds. In related work, separating these groups has improved the accuracy of our own IDSB-connectivity prediction efforts (J. Trygg, unpubl.).
This work has attempted to show that analysis of the connectivities of IDSB-containing protein chains must take into account the existence of (at least) two distinct classes of proteins. The connectivities of IDSB-rich and IDSB-poor chains follow different patterns, and perhaps different rules for their formation. As far as we are aware, no one has analyzed and compared the connectivities of proteins by partitioning the data into IDSB-rich and IDSB-poor chains. IDSB connectivities are also useful descriptors, halfway between a 1D and a 3D description of a molecule, with potential applications in taxonomy and native structure prediction. The data set itself (and the algorithm used to construct it) may be of use to researchers interested in a large, nonhomologous set of IDSB-containing proteins.
| Materials and methods |
|---|
Creation of the data set
The October 7, 2003 edition of the PDB was mirrored locally into a relational database. The PDB entries were parsed, and their attributes were calculated and stored in the database. Chains which did not contain at least five standard amino acids and one IDSB were not considered further.
Modifications to Hobohm's PDBSELECT algorithm
The PDB's protein chains are biased, due to the presence of mutants and serial refinements of structures. This work required unbiased (nonhomologous) data in order to be able to draw inferences. Duplicate chains were identified through sequence alignments and removed. This resulted in a diverse set of sequences, which is considered to be equivalent to a set of proteins with diverse functions and roles (Sander and Schneider 1991; Abagyan and Batalov 1997).
The algorithm used to generate the nonhomologous list was modified from the PDBSELECT algorithm (Hobohm et al. 1992; Abagyan and Batalov 1997). The changes to the original algorithm were aimed at increasing the number and diversity of the IDSB-containing proteins. Only chains that contained an IDSB were used as input to the algorithm. Limiting the input to IDSB-containing chains prevented a chain without an IDSB causing a sequence homologous IDSB-containing chain to be removed from the list. Chains with different IDSB bonding patterns were treated as dissimilar, regardless of the similarity between their amino acid sequences, as proteins with different IDSB bonding patterns have different 3D topologies. The unmodified PDBSELECT algorithm does not consider IDSB bonding in its similarity calculation.
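The effect of treating chains with different bonding patterns as dissimilar can be illustrated with a greedy selection loop. This is a minimal sketch under stated assumptions, not the PDBSELECT algorithm or the modified version used here: the first-come greedy strategy, the function names, and the tuple layout are all hypothetical, and `toy_similarity` (built on Python's `difflib.SequenceMatcher` with an arbitrary 0.8 cutoff) is only a stand-in for an alignment-based homology criterion.

```python
from difflib import SequenceMatcher


def toy_similarity(seq_a, seq_b, cutoff=0.8):
    """Stand-in similarity test; NOT an alignment-based homology measure."""
    ratio = SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()
    return ratio >= cutoff


def select_representatives(chains, sequences_similar=toy_similarity):
    """Greedily keep chains that are not redundant with any kept chain.

    `chains` is an iterable of (chain_id, sequence, bonding_pattern).
    Two chains count as redundant only if their bonding patterns are
    identical AND their sequences are similar, so chains with different
    IDSB bonding patterns are always treated as dissimilar.
    """
    kept = []
    for chain_id, sequence, pattern in chains:
        redundant = any(
            pattern == kept_pattern and sequences_similar(sequence, kept_seq)
            for _, kept_seq, kept_pattern in kept
        )
        if not redundant:
            kept.append((chain_id, sequence, pattern))
    return kept


if __name__ == "__main__":
    # Invented sequences and identifiers, for illustration only.
    chains = [
        ("c1", "GCAKCLMCPQC", "1-2 3-4"),
        ("c2", "GCAKCLMCPQC", "1-3 2-4"),   # same sequence, different pattern
        ("c3", "GCAKCLMCPQC", "1-2 3-4"),   # duplicate of c1
        ("c4", "MTTCHHHHCEEEECRRRRC", "1-2 3-4"),
    ]
    print([chain_id for chain_id, _, _ in select_representatives(chains)])
```

Because only IDSB-containing chains are passed in, a chain without an IDSB can never displace a homologous IDSB-containing chain, which mirrors the input restriction described above. The result of a greedy pass depends on input order, which is one reason this should be read as a sketch only.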
The unmodified PDBSELECT algorithm excludes chains shorter than 30 residues and therefore discards some nonhomologous IDSB-rich chains. The algorithm was modified to impose no lower size constraints. Due to the common use of the PDBSELECT list, these very short, IDSB-rich chains have not been previously analyzed. All chains, regardless of the structure's quality, were included as input. This has not affected the validity of our results, as this work was concerned only with sequences and IDSB locations, whereas the PDBSELECT list was intended to also be a list of high-quality 3D structures.
The July 2003, 25% homology PDBSELECT list was also stored in a separate table in the database to compare the results from our nonhomologous list to a generally accepted list.
Calculation of protein attributes
The numbers of intra- and intermolecular disulphide bonds were calculated from the SSBOND record of the PDB file. The amino acid sequence was parsed out of the SEQRES record of the PDB file and translated into single-letter sequences. IDSB concentration was calculated as the ratio between the number of IDSBs and the number of residues in the chain. Proteins' SCOP folds (Murzin et al. 1995) were determined by submitting a query to a mirror of the SCOP database (http://scop.wehi.edu.au/scop/search.cgi).
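Counting intra- and intermolecular disulphide bonds from SSBOND records can be sketched as follows. The column positions follow the fixed-width PDB file format for SSBOND (chain identifiers in columns 16 and 30, residue sequence numbers in columns 18–21 and 32–35). Treating a bond as intramolecular whenever both chain identifiers match is a simplifying assumption: symmetry operators are ignored, and the function names are hypothetical rather than those of the software used in this work.

```python
def parse_ssbond(line):
    """Extract (chain1, seqnum1, chain2, seqnum2) from a PDB SSBOND record.

    Returns None for lines that are not usable SSBOND records.
    """
    if not line.startswith("SSBOND") or len(line) < 35:
        return None
    chain1 = line[15]            # column 16
    seq1 = int(line[17:21])      # columns 18-21
    chain2 = line[29]            # column 30
    seq2 = int(line[31:35])      # columns 32-35
    return chain1, seq1, chain2, seq2


def count_disulphides(pdb_lines):
    """Return (intramolecular, intermolecular) disulphide bond counts.

    Assumption: a bond is intramolecular when both half-cystines carry
    the same chain identifier; symmetry-related copies are not checked.
    """
    intra = inter = 0
    for line in pdb_lines:
        record = parse_ssbond(line)
        if record is None:
            continue
        chain1, _, chain2, _ = record
        if chain1 == chain2:
            intra += 1
        else:
            inter += 1
    return intra, inter


if __name__ == "__main__":
    # Illustrative records, not taken from any particular PDB entry.
    lines = [
        "SSBOND   1 CYS A    6    CYS A  127",
        "SSBOND   2 CYS A   30    CYS B   45",
        "SEQRES   1 A  129  LYS VAL PHE GLY ARG CYS GLU LEU ALA ALA ALA MET LYS",
    ]
    print(count_disulphides(lines))
```

Note that SSBOND residue numbers follow the numbering of the coordinate records, which need not coincide with positions in the SEQRES sequence; any mapping between the two is outside the scope of this sketch.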
Loops were defined as the residues between half-IDSBs in a protein chain. Tail lengths were defined as the number of residues from each terminus to the nearest half-IDSB.
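These definitions translate directly into arithmetic on residue positions. In the sketch below, positions are assumed to be 1-based indices into the chain sequence, a loop is taken to be the residues strictly between the two half-cystines of one IDSB (so that a chain with n IDSBs has n loops), and the function names are hypothetical.

```python
def loop_sizes(bonds):
    """Loop size for each IDSB, ordered by the position of its first half-IDSB.

    `bonds` is a list of (position_a, position_b) pairs of 1-based
    residue positions. The loop size counts the residues strictly
    between the two bonded cysteines (an assumed reading of "between").
    """
    ordered = sorted(tuple(sorted(bond)) for bond in bonds)
    return [high - low - 1 for low, high in ordered]


def tail_lengths(chain_length, bonds):
    """Return (N-terminal tail, C-terminal tail) in residues.

    A tail of zero means the terminal residue is itself a half-IDSB.
    """
    positions = [p for bond in bonds for p in bond]
    if not positions:
        raise ValueError("chain must contain at least one IDSB")
    n_tail = min(positions) - 1
    c_tail = chain_length - max(positions)
    return n_tail, c_tail


if __name__ == "__main__":
    # Hypothetical 45-residue chain with a 1-2 3-4 bonding pattern.
    bonds = [(3, 20), (25, 40)]
    print(loop_sizes(bonds))          # loop 1 and loop 2 sizes
    print(tail_lengths(45, bonds))    # N- and C-terminal tail lengths
```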
| References |
|---|
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.
Betz, S.F. 1993. Disulfide bonds and the stability of globular proteins. Protein Sci. 2: 1551–1558.
Casadio, R., Compiani, M., Fariselli, P., Jacoboni, I., and Martelli, P.L. 2000. Neural networks predict protein folding and structure: Artificial intelligence faces biomolecular complexity. SAR QSAR Environ. Res. 11: 149–182.
Ceroni, A., Frasconi, P., Passerini, A., and Vullo, A. 2003. Predicting the disulfide bonding state of cysteines with combinations of kernel machines. J. VLSI Signal Process. 35: 287–295.
Darby, N. and Creighton, T.E. 1995. Disulfide bonds in protein folding and stability. Methods Mol. Biol. 40: 219–252.
Fariselli, P. and Casadio, R. 2001. Prediction of disulfide connectivity in proteins. Bioinformatics 17: 957–964.
Fariselli, P., Riccobelli, P., and Casadio, R. 1999. Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins. Proteins 36: 340–346.
Harrison, P.M. and Sternberg, M.J.E. 1994. Analysis and classification of disulphide connectivity in proteins. The entropic effect of cross-linkage. J. Mol. Biol. 244: 448–463.
Harrison, P.M. and Sternberg, M.J.E. 1996. The disulphide β-cross: From cysteine geometry and clustering to classification of small disulphide-rich protein folds. J. Mol. Biol. 264: 603–623.
Hobohm, U. and Sander, C. 1994. Enlarged representative set of protein structures. Protein Sci. 3: 522–524.
Hobohm, U., Scharf, M., Schneider, R., and Sander, C. 1992. Selection of representative protein data sets. Protein Sci. 1: 409–417.
Huang, X. and Miller, W. 1991. A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math. 12: 337–357.
Mas, J.M., Aloy, P., Martí-Renom, M.A., Oliva, B., de Llorens, R., Aviles, F.X., and Querol, E. 2001. Classification of protein disulphide-bridge topologies. J. Comput. Aided Mol. Des. 15: 477–487.
Matsumura, M. and Matthews, B.W. 1991. Stabilization of functional proteins by introduction of multiple disulfide bonds. Methods Enzymol. 202: 336–356.
Miller, S., Janin, J.A., Lesk, A.M., and Chothia, C. 1987. Interior and surface of monomeric proteins. J. Mol. Biol. 196: 641–656.
Murzin, A., Brenner, S.E., Hubbard, T.J.P., and Chothia, C. 1995. SCOP: A Structural Classification of Proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.
Petersen, M.T.N., Jonson, P.H., and Petersen, S.B. 1999. Amino acid neighbours and detailed conformational analysis of cysteines in proteins. Protein Eng. 12: 535–548.
Richardson, J.S. 1981. The anatomy and taxonomy of protein structure. Adv. Protein Chem. 34: 167–339.
Rost, B. and Liu, J. 2003. The PredictProtein server. Nucleic Acids Res. 31: 3300–3304.
Sander, C. and Schneider, R. 1991. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9: 56–68.
Thornton, J.M. 1981. Disulphide bridges in globular proteins. J. Mol. Biol. 151: 261–287.
White, S. 1992. Amino acid preferences in small proteins. J. Mol. Biol. 227: 991–995.