Structural Biology and Bioinformatics Center, Structure and Function of Biological Membranes Laboratory, Free University of Brussels (ULB), B-1050 Brussels, Belgium
Reprint requests to: Erik Goormaghtigh, Structural Biology and Bioinformatics Center, Structure and Function of Biological Membranes Laboratory, CP 206/2, Free University of Brussels (ULB), Bld du triomphe, Acces 2, B-1050 Brussels, Belgium; e-mail: egoor{at}ulb.ac.be; fax: 32-2-650-5382.
(RECEIVED March 11, 2003; FINAL REVISION June 12, 2003; ACCEPTED June 12, 2003)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0354703.
Subscripts are used to designate specific secondary structure types using the notation of the Dictionary of Secondary Structures of Proteins (DSSP), and are as follows: H, α-helix; E, β-sheet; T, turn (all types); G, 3₁₀ helix; I, π-helix; S, a sharp bend in the protein backbone which cannot be assigned as T; B, a residue with extended φ, ψ angles that cannot be assigned as E. The notation "other" will be used to designate all residues with non-H, E, or T assignments, whereas C will be used to specifically designate residues that are not given any secondary structure assignment by DSSP.
Abstract
[...] α-helix and β-sheet contents, and they represent a more comprehensive variety of fold types than any previous reference set. This report includes a detailed presentation of the reasoning behind the rational protein selection process, a description of the properties of the RaSP50 set, and a discussion of the types of structural and spectral variations that are represented in the set.
Keywords: Proteins; analysis; methods; circular dichroism; infrared spectroscopy; crystallography; statistical analysis; secondary structure; chemometrics; basis sets, protein; databases, protein; data interpretation
Abbreviations: AU, absorbance unit; CATH, Class Architecture Topology and Homology; CD, circular dichroism spectroscopy; CSA, camphor sulfonic acid; EMBL, European Molecular Biology Laboratory; FC, fractional composition (percentage) of a secondary structure type in a protein; FTIR, Fourier transform infrared spectroscopy; HSSP, homology-derived secondary structure of proteins; IR, infrared spectroscopy; NMR, nuclear magnetic resonance; PDB, Brookhaven Protein Data Bank; RaSP, rationally selected proteins; RMS, root-mean squared; SCOP, Structural Classification of Proteins.
Introduction
The earliest efforts, published in the 1970s, utilized small sets of three (Saxena and Wetlaufer 1971) or 5–8 (Chen and Yang 1971; Chen et al. 1972, 1974) proteins (basis sets) with some of the earliest crystal structures. Several years later, larger basis sets were introduced by Chang et al. (1978) and Hennessey and Johnson (1981). Between them, the sets contained 25 proteins representing from 0% to 79% α-helix and 0% to 51% β-sheet (as measured by the assignment algorithms used in their studies), although a few proteins have more recently been added to this core of 25. There are some features of protein spectra analysis, discussed below, that make it different from other common applications of chemometric analysis methods. However, an essential property of any basis set is that the full range of possible analyte concentrations (or protein structures) must be represented if analysis accuracy is to be obtained. Thus, for protein basis sets, the need for the broadest possible range of representative structures cannot be overemphasized.
The standard basis set proteins cover a wide range of α-helix and β-sheet contents (concentrations), which suggests that they constitute a good basis set. However, whether or not this is true depends on the relevance of the criteria used by assignment algorithms to evaluate protein secondary structure concentrations, such as those of Levitt and Greer (1977) or Kabsch and Sander (1983). It is undeniably true that there is a relationship between assigned protein secondary structures and spectral band shape. However, these correlations are actually quite general in nature. For example, α-helix produces a signal around 1654 cm⁻¹ in infrared spectra, but many bands at other frequencies assigned to α-helix can be found in the literature (1648 to 1662 cm⁻¹ in ¹H₂O, 1642 to 1660 cm⁻¹ in ²H₂O; reviewed by Goormaghtigh et al. 1994). This clearly shows that protein IR spectra also reflect variations in the nature of α-helices in addition to the number of residues in helical conformations. Although such dependencies are recognized, they are not yet well understood (Nevskaya and Chirgadze 1976; Dousseau and Pézolet 1990). In CD spectra, the dependence of band shape on α-helix content is more consistent, but all-β-sheet proteins exhibit many different CD band shapes (Perczel et al. 1992). The existence of two types of CD spectra for β-rich proteins forms the basis for their classification as βI- and βII-proteins as proposed by Sreerama and Woody (2003). The source of βII-protein CD, which resembles that of unordered polypeptides, is not yet clearly understood. However, lacking alternative or complementary criteria for protein selection, other than intuition, protein secondary structure contents have been used by default.
To simplify the discussion, we will focus primarily on infrared studies of proteins in ¹H₂O, as these are most comparable to the present work. Comprehensive reviews of CD analyses have been written by Woody (1995) and Venyaminov and Yang (1996). The mathematical methods used in the development of CD spectra-based analysis of protein conformation have typically been multivariate regressions of one form or another. In 1990–91, statistical methods were introduced into the IR analysis field with the nearly simultaneous publication of several studies (Dousseau and Pézolet 1990; Kalnin et al. 1990; Lee et al. 1990; Sarver and Krueger 1991a,b). These focused primarily on mathematical methods, but also addressed other issues relevant to IR such as the data regions used, the effects of deuteration, different normalizations, and the combination of IR and CD spectra. More recent infrared studies in this genre have dealt with data types (Pancoska and Keiderling 1991; Pancoska et al. 1991; Pribic et al. 1993; Baumruk et al. 1996; Wi et al. 1998) and alternative analysis methods (Venyaminov and Vassilenko 1994; Pancoska et al. 1996), and one study briefly mentions the effects of removing proteins from a 28-protein basis set (22 folds, as defined below) in a vibrational CD study in D₂O (Pancoska et al. 1995).
Previously, fractional compositions (FCs) were the only readily available information that could be used in basis protein selection, other than intuition, and so FCs have necessarily been the primary criterion for selection. Now that the number of available crystal structures has reached the thousands, coherent protein structure classification systems have been developed. CATH (Orengo et al. 1997) and SCOP (Barton 1994; Murzin et al. 1995; Hubbard et al. 1997) are two such schemes that classify proteins based on their three-dimensional structures. The recent appearance of SCOP and CATH has made it possible to include protein fold as a criterion in basis protein selection and to evaluate how well a basis set reflects the full range of known tertiary structures.
In the following text we discuss the use of information contained in these and other crystal structure databases to construct a protein basis set that may help to answer many of the questions posed here. We identified 115 commercially available (and 5 other) proteins with solved crystal structures as potential basis set candidates. From these, a 50-protein basis set was assembled that covers not only the broadest possible range of secondary structure contents but also the largest possible variety of tertiary structures. We demonstrate the relevance of this approach by comparing spectra from proteins with similar secondary structure contents (FCs) but different folds.
Results and Discussion
In this study, the CATH database was used as the primary tool in a search for suitable proteins. To supplement the ranges of α-helix (FCH) and β-sheet (FCE) represented, the PDB_SELECT database was also utilized. From the list of potential basis set candidates obtained in this manner, proteins were selected and their suitability for use in an experimental basis set evaluated using supplemental information obtained from the SCOP, DSSP, and SWISS-PROT databases. If a protein was found to be unusable (for example, impure when acquired, or denatured), potential replacements were identified by returning to the CATH or SCOP databases. The resulting basis set of rationally selected proteins (RaSP) represents a wide range of different protein structures, and the spectra of these proteins exhibit a wide range of variation, some of which does not depend entirely on secondary structure FCs.
Construction of the RaSP set
Protein selection based on fold
The first step in the protein selection process was a search of the CATH database for proteins that have been crystallized, classified, and are also commercially available. Commercial availability was chosen as an important criterion for protein selection in order to make the RaSP basis set accessible to every scientist. The goal of the selection strategy was to identify representative proteins for as many different folds as possible. By using protein fold as the primary protein selection criterion, the largest possible range of structural variation has been incorporated into the set. Because of the dependence of protein spectra on subtler differences in structure, it follows that this selection strategy has produced a wide range of spectral variation as well (see below). To maximize the variety of structures represented in the RaSP set, a few proteins provided by some of our collaborators were included because they represented unique folds.
The CATH system was chosen as the primary protein selection tool for this study because its organization provided a convenient and effective way to choose proteins with differing folds. The CATH numbers for all proteins identified as potential basis set members are listed in Table 1. A CATH number is assigned to each domain in a protein, so it is possible for multidomain or multichain proteins to have several different CATH numbers or more than one domain with the same CATH number. To simplify the table, the CATH number is listed only once (not repeated) for proteins with multiple domains that have the same CATH assignments.
[...] α-helix and β-sheet in the classified proteins. Since the band shapes of IR and CD spectra depend strongly on α-helix and β-sheet content (FCH and FCE), it is immediately apparent that there must be some correlation between spectra and classification. A protein domain's position in the second level of CATH, Architecture (A), is based on its overall fold (Fig. 1 [...] α/β and α+β domains. The third level of CATH, Topology (T; Fig. 1 [...] αβα sandwich), whereas other Architectures have only one known Topology (β-clam). Both the Architecture and Topology levels of the CATH hierarchy represent large differences in structure, so their combined use for protein selection ensures a broad range of structural, and thus spectral, variation. In contrast, the smaller structural differences between proteins whose classification is identical except in their CATH Homology number would not be expected to produce large spectral differences. Because of this, it was possible to move on once a commercially available protein was identified in a given Topology. Roughly 90 proteins were identified in the fold-based search as potential reference set candidates. Ideally, if a basis set is to represent all possible variations of protein structure, it should completely fill CATH (or SCOP) space. Not surprisingly, we found that the majority of the proteins listed in these databases are not available commercially. Despite this limitation, the proteins identified as RaSP set candidates are well distributed in CATH space, and thus we can conclude that they provide the most comprehensive sampling possible of known protein structures. Since the structural differences that cause variations in protein spectra are not completely understood, we suggest that the range of folds represented by a basis set may be the best measure of how likely it is to be applicable, in a general sense, as a reference for protein structure determination.
Protein selection based on α-helix and β-sheet contents
The secondary structure percentages, FCH and FCE, are two of the best-characterized structural properties of proteins and are the strongest, if somewhat inconsistent, determinants of spectral band shape. Therefore, protein classification schemes and secondary structure percentages are complementary, and both were used here to maximize the extent of structural variation included in the RaSP set. Figure 2 presents the proteins considered for the RaSP set as a function of their α-helix (FCH) and β-sheet (FCE) contents in HE space. Again the full range of available structures is indicated on the graph for comparison (small points). These points show the HE values of proteins in the PDB_SELECT database, which groups proteins based on sequence homologies.
[...] ~110. Many of the added proteins had low FCs of non-periodic structures (T, C, etc.), and fall near or above the line defined by (0.8)%H + %E = 52, or have FCH values between 40% and 65%. Special attention was given to identifying potential helical proteins at this stage because analysis of all-helix protein IR spectra, especially curve fitting, generally tends to underestimate the actual FCH. Furthermore, few examples of proteins with more than 45% α-helix have appeared in previous basis sets. In this step, it was necessary to select some proteins with redundant folds to obtain the widest possible HE space distribution (TRO and PAB; PEP, PGN, and REN; SBC and SBN; IGG and SOD; see Tables 1 and 2 [...]
[...] α-helix and β-sheet contents of less than 35% (indicated by the line %H + %E = 35 in the figure; CAS, MTH, TIK). There were three other regions where acceptable proteins were rare. These are proteins with β-sheet contents >52%, proteins with >10% α-helix and >34% β-sheet, and proteins with 50%–70% α-helix. We were able to partially fill two of these gaps with ATX and APE, which are not available commercially.
Elimination of unsuitable proteins
Up to this point, the focus of the selection process was on maximizing potential spectral variability by finding as many commercially available proteins as possible. The next step was to optimize the crystal structure information for the proteins. This was achieved by identifying the best crystal structure for each protein and also by rejecting those proteins whose crystal structures did not meet the requirements discussed below. Where possible, rejected proteins were replaced by more suitable proteins with the same CATH number.
This reference information refinement process began with a comparison of the sources (species) of the commercial proteins and crystallized proteins (Table 2). Wherever possible, crystal structure–commercial protein pairs with the best possible sequence homologies were chosen. To accomplish this, the SCOP database, at the Protein level of the hierarchy, was consulted to find all species from which each protein had been crystallized. Comparison with the commercial protein sources often resulted in a direct species match (i.e., an identical sequence).
Next, the sequence homology between the crystal structures and commercial proteins was evaluated, where possible, using the HSSP database (Sander and Schneider 1993; Dodge et al. 1998). Normally, proteins with 25%–30% sequence homology will have similar folds (Flores et al. 1993; Hilbert et al. 1993; Orengo et al. 1994). However, relatively small differences in primary sequence can cause variation in the spectra of proteins (Prestrelski et al. 1991). For example, bovine and human ALA have 75% sequence homology, and the same general fold; however, both their CD and IR spectra are significantly different (Keiderling et al. 1994). Therefore, those proteins with less than 85% sequence identity were rejected as potential basis set candidates. Exceptions, retained because of unique structures, were CAH (79% identity) and CNA (82%).
To further verify the correspondence of each chosen crystal structure to the commercially available protein, the SWISS-PROT database was consulted. For proteins that could be located in SWISS-PROT, the number of amino acids in the sequence was counted after removal of any propeptides, signal sequences, etc., and then compared with the number of residues in the crystal structure. Proteins with crystal structures that lacked a substantial number of residues were rejected; it was possible to set the cutoff to 5%–6% missing residues without seriously reducing the number of potential proteins.
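The two rejection criteria described above (sequence identity and crystal structure completeness) can be expressed as a simple filter. The following Python fragment is only a minimal sketch: the `Candidate` record, its field names, the `unique_fold` exception flag, and the example entries are hypothetical and are not taken from the study, and the 6% figure is an assumed reading of the upper end of the 5%–6% cutoff.

```python
from dataclasses import dataclass

# Thresholds taken from the text; how they are encoded here is an assumption.
MIN_IDENTITY = 85.0   # minimum sequence identity (%)
MAX_MISSING = 6.0     # maximum residues missing from the crystal structure (%)


@dataclass
class Candidate:
    """Hypothetical record for one crystal structure / commercial protein pair."""
    code: str                  # short protein code
    percent_identity: float    # identity between crystal and commercial sequences (%)
    n_sequence: int            # residues in the mature sequence
    n_crystal: int             # residues resolved in the crystal structure
    unique_fold: bool = False  # exception flag for structurally unique proteins


def missing_percent(c: Candidate) -> float:
    """Percentage of mature-sequence residues absent from the crystal structure."""
    return 100.0 * max(c.n_sequence - c.n_crystal, 0) / c.n_sequence


def is_acceptable(c: Candidate) -> bool:
    """Apply the identity and completeness cutoffs to one candidate."""
    identity_ok = c.percent_identity >= MIN_IDENTITY or c.unique_fold
    return identity_ok and missing_percent(c) <= MAX_MISSING


if __name__ == "__main__":
    # Placeholder entries, not data from the study.
    candidates = [
        Candidate("AAA", 100.0, 200, 198),
        Candidate("BBB", 75.0, 200, 200),
        Candidate("CCC", 79.0, 200, 200, unique_fold=True),
        Candidate("DDD", 100.0, 200, 180),
    ]
    print([c.code for c in candidates if is_acceptable(c)])
```

The exception flag mirrors the way proteins with unique structures were retained despite falling below the identity threshold; in practice such decisions were made case by case rather than by a fixed rule.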
Optimizing crystal structure quality
It was possible to go one step further in the optimization of reference information by analyzing the quality of protein X-ray crystal structures. There are several factors that can cause discrepancies between crystal structures of the same protein. Such variations are often artifacts from the process of solving and refining the structure, but can sometimes reflect real differences in the structure of the protein in crystals obtained from different conditions. To illustrate, the DSSP assignments of 37 complete, wild-type crystal structures of lysozyme (LSZ) were examined with the STRUCTAB program. The results listed in Table 3 show that the FC values for these crystal structures cover a considerable range. Thus, if one of these structures can be identified as being of higher quality than the others, it should be used as the reference in structure analysis.
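To make the notion of an FC range concrete, the sketch below computes fractional compositions from per-residue DSSP assignment strings and reports the spread of one FC across several structures of the same protein. It is an illustration only and does not reproduce the STRUCTAB program; the function names and the short example strings are hypothetical, and the grouping of every non-H, E, or T symbol into "other" follows the notation defined for this paper.

```python
from collections import Counter


def fractional_composition(dssp: str) -> dict:
    """Return the percentages of H, E, T and 'other' in a DSSP string.

    Each character is the DSSP assignment of one residue. Anything that is
    not H, E or T (G, I, S, B, or no assignment) is pooled into 'other'.
    """
    if not dssp:
        raise ValueError("empty DSSP assignment string")
    counts = Counter(dssp)
    n = len(dssp)
    fc = {k: 100.0 * counts.get(k, 0) / n for k in ("H", "E", "T")}
    fc["other"] = 100.0 - sum(fc.values())
    return fc


def fc_range(structures, key):
    """Minimum and maximum of one FC over several structures of a protein."""
    values = [fractional_composition(s)[key] for s in structures]
    return min(values), max(values)


if __name__ == "__main__":
    # Toy strings standing in for two crystal structures of one protein.
    toy = ["HHHH------", "HHHHHH----"]
    print(fc_range(toy, "H"))
```

A wide range returned by `fc_range` for a single protein would indicate that the choice of reference crystal structure matters for that protein.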
Protein acquisition and purity testing
The rejection of unsuitable proteins from the set left 106 potential basis set candidates. Finally, proteins were selected from these for acquisition in groups of 10–20, based on their cost, advertised purity, and position in the CATH and HE space distributions. SDS-PAGE was used to screen acquired proteins for purity. Estimates of protein purity are included in Table 2. Many proteins proved to be less than ~85% pure, and so were discarded. Overall, 22 out of a total of 72 acquired proteins were discarded because of impurity or other technical problems. Where possible, discarded proteins were replaced by acceptable substitutes that were found by returning to CATH (~10 new potential proteins were found at this stage). The effects of purity on analysis results will be discussed elsewhere.
After the selection, acquisition, and elimination process, there remained a set of 50 proteins that had been acquired and met all selection criteria or were considered to be reasonable compromises.
Description of the RaSP50 basis set
We will refer to the final set of basis proteins as the RaSP50 set from this point onward. The proteins in the RaSP50 set range from 6 (INS) to 700 (IGG) kD in size, and contain from 1 to 7 (ATX) chains and 1 to 6 (PAH) domain folds.
The relevance of the RaSP50 set as a general tool for experimental studies is demonstrated by the range of structures represented as compared to known protein structures. The RaSP50 set contains 42 unique protein types which represent all four CATH classes, 18 out of 31 possible Architectures, and at least 60 of the 482 known Topologies. These 60 CATH Topologies contain roughly 6200 of the 10,344 domain entries included in the CATH database at the time the basis set was constructed. To look at the comprehensive nature of the RaSP50 set from a different perspective, consider the final column of Table 1, which lists the superfold family membership of the proteins (Orengo et al. 1994). Currently nine protein superfold families have been identified, which account for 46% of all known nonhomologous protein folds (note that many proteins do not have any superfold domains). The RaSP50 set contains representatives of eight of these superfolds. It was not possible to include a representative of the ninth, split αβ sandwich.
To more clearly illustrate some of the structural variation in the RaSP50 set, fragments of several protein crystal structures are diagrammed in Figure 3 (Kraulis 1991). In the top row of the figure, the β-strands of four of the RaSP50 proteins are drawn. The structure of avidin (AVI) is a simple β-barrel with eight antiparallel β-strands connected primarily by short loops (omitted for clarity). Several of the β-class proteins included in the RaSP set, such as AVI, have predominantly regular and undistorted strands. The β-sheets in CNA are an example of sharply curved β-strands and also illustrate interactions that can give rise to distortion of common structures. BTE is an example of a much simpler strand system, and UOX illustrates that many different variations and distortions can coexist in a single protein. The middle row of the figure gives three examples of αβ proteins with different tertiary structures, including sheet around helix (UBQ), helix on sheet faces (RNA), and helix around sheet (αβα sandwich, PGK). The one protein with no α-helix or β-sheet included in the RaSP50 set, MTH, also appears here. The bottom row includes helical portions of four proteins. Many different variations of α-helix types are included in the RaSP50 set, including long regular helices (FTN), medium-length (MBA), and short helices (INS). The four selected helices from LOX, which contains 45 helices overall, illustrate some of the helix distortions present in the RaSP50 set, including kinks (bottom two), gradual bends (third from bottom), and sharp bends (top helix) that may all contribute to variations in spectra.
[...] α-helices; twisted, flat, and curved β-sheets; β-bulges; many different modes of packing (topologies); and many different combinations of helix and sheet FCs.
Several suitable proteins were found in all of the CATH Architectures with more than 10 Topologies, except 1.30. Of the three Architectures with more than 50 Topologies, the αβ sandwich (CATH 3.30) was most challenging. The three RaSP50 3.30 proteins are all complex and have multiple domains (OVA, PAH, UOX). The distribution of the RaSP50 set members in the αβα sandwich Topology (CATH 3.40; 24 identified; 14 acquired; five discarded) is more even than that of the αβ sandwich, and ranges from 3.40.50 to 3.40.630. The 3.40 Architecture contains the αβ doubly wound superfold family, whereas 3.30 includes proteins of the split αβ sandwich superfold, which is the only superfold that could not be represented in the RaSP50 set.
The third large CATH Architecture, all-α/non-bundle (CATH 1.10), deserves special attention here. In all, 29 1.10 proteins representing ~29 different Topologies were identified (some of these proteins have not yet been assigned CATH numbers), with 16 proteins (15 Topologies) appearing in the final RaSP50 set. Under the SCOP system, 14 RaSP domains are of the all α-helix class, and 11 of these are in the RaSP50 set. The difference in the number of proteins with CATH and SCOP all α-helix classifications arises from the different ways that protein domains are identified in these two systems: several domains with α or α+β assignments in the SCOP database are treated as multiple domains in CATH, with one or more classified as all α-helix (BLM, CAB, CAT, DNA, PAH, PAP, POX, THR). Regardless of the classification system used, a wide variety of different helical topologies with classical and exceptional helix geometries are represented in the RaSP50 set, including examples of both the α up-down and globin superfold families.
HE space distributions
We now turn our focus to α-helix and β-sheet fractional compositions (FCs), which were the secondary protein selection criterion used in the construction of the RaSP set. Figure 2 compares the fractional compositions, FCH and FCE, of the RaSP proteins with ~900 PDB_SELECT members (35% homology cutoff). The points in the figure representing the PDB_SELECT proteins occupy a relatively well defined region of HE space, with the majority between the lines %H + %E = 35% and %H + %E = 65%, though several all-α proteins have more than 65% α-helix. Nearly all RaSP set members also fall within these limits, but some portions of HE space are populated more densely than others. This point about the relationship between α and β structures was already reported by Pancoska et al. (1995). Considering again the PDB_SELECT proteins, a region of high protein density is bounded by the %H + %E = 35%–65% limits and the lines %E − %H = 5% and %E = 10% (region 1). The highest density of RaSP proteins also lies in region 1 and includes 43 proteins. If the PDB_SELECT database with a 55% homology cutoff is used (data not shown), smaller clusters of proteins are apparent within the limits %H = 10%–20% and %E = 30%–40% (region 2) and at %H = 0%–12% and %E = 43%–50% (region 3). These regions contain four and seven RaSP50 proteins, respectively. Overall, the distribution of the RaSP50 proteins in HE space parallels the natural dispersion. This was achieved even though α-helix and β-sheet composition was not the primary criterion for protein selection.
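The regions of HE space described above can be written as simple inequalities on the pair (%H, %E). The sketch below is a hedged illustration: regions 2 and 3 follow the stated limits directly, but the text gives only the bounding lines of region 1, so the sides chosen here (%E ≥ 10 and %E − %H ≤ 5) are an assumption, as is the function name `he_regions`.

```python
def he_regions(h: float, e: float) -> list:
    """Return the HE-space regions (1, 2, 3) that contain the point (%H, %E).

    h and e are the alpha-helix and beta-sheet fractional compositions in
    percent. An empty list means the point lies outside all three regions.
    """
    regions = []
    total = h + e
    # Region 1: band 35 <= %H + %E <= 65, bounded by %E = 10 and %E - %H = 5.
    # Which side of each bounding line belongs to the region is an assumption.
    if 35 <= total <= 65 and e >= 10 and (e - h) <= 5:
        regions.append(1)
    # Region 2: %H = 10-20 and %E = 30-40.
    if 10 <= h <= 20 and 30 <= e <= 40:
        regions.append(2)
    # Region 3: %H = 0-12 and %E = 43-50.
    if 0 <= h <= 12 and 43 <= e <= 50:
        regions.append(3)
    return regions


if __name__ == "__main__":
    # Arbitrary test points, not proteins from the set.
    for point in [(35, 20), (15, 35), (5, 45), (70, 0)]:
        print(point, he_regions(*point))
```

Counting how many members of a candidate set fall in each region gives a quick check of whether the set follows the density of known structures in HE space.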
Structure frequency distributions
We have shown that the distribution of protein folds and fractional secondary structure compositions in the RaSP50 set covers much of the same range as crystallized proteins. There are, in addition to the CATH and HE spaces, other relevant comparisons that will further establish the general nature of the RaSP50 set. Figure 4 presents the frequencies of different structures that occur in the RaSP50 and PDB_SELECT proteins as modified histograms. It is immediately clear from the figure that the selection process used in constructing the RaSP50 set has produced distributions that are similar to the patterns in the population of known protein structures. The FCH histogram (Fig. 4H) shows that the largest number of proteins have <10% α-helix in both the RaSP50 and PDB_SELECT (35%) sets. They also have peaks at 30%–40% helix, which is consistent with the high density of proteins in region 1 of HE space (Fig. 2). The β-sheet distribution for the PDB proteins also reflects the HE region 1 population density, but the curve for the RaSP50 set is slightly different because of our focus on proteins with high helix content (and therefore low FCE). This curve also emphasizes the relatively few acceptable proteins found with 20%–30% β-sheet. The RaSP50 and PDB_SELECT curves are highly similar in the FCT and FCC plots, as they are in the structure size distributions on the right side of the figure. These distributions provide further evidence that the RaSP50 basis set has many of the same essential properties as the population of proteins of known structure.
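A frequency distribution of this kind can be approximated by binning FC values and normalizing by the number of proteins, so that sets of very different sizes can be compared on one scale. The following sketch assumes 10% bins and a plain normalized histogram; the exact modification applied to the histograms in the figure is not specified here, and the function name and example values are hypothetical.

```python
def fc_histogram(values, bin_width=10):
    """Fraction of proteins whose FC (in percent) falls in each bin.

    Bins are [0, w), [w, 2w), ...; a value of exactly 100 is placed in the
    last bin. Normalizing by the number of proteins lets two sets of
    different sizes be compared directly.
    """
    if not values:
        raise ValueError("no FC values supplied")
    n_bins = 100 // bin_width
    counts = [0] * n_bins
    for v in values:
        idx = min(int(v // bin_width), n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


if __name__ == "__main__":
    # Arbitrary helix contents (percent), for illustration only.
    print(fc_histogram([5, 8, 35, 100]))
```

Comparing the output for a basis set with the output for a large reference population shows at a glance which composition ranges are under-represented.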
The variability of spectra from proteins with similar FCs but different folds
To conclude our examination of the set, we briefly present several examples of RaSP50 protein spectra to illustrate the extent of this variation, and to reveal the consequences of the customary HE-based protein selection strategy. However, before continuing, several points must be made. The first is that all of the spectra of the RaSP50 set proteins have the general character that is expected based on their crystal structures. The next important point is that variations in protein spectra may also reflect the contributions of side chains to the spectra. For CD, the exact nature of aromatic side chain contributions can depend strongly on their local environment (Grishina and Woody 1994; Woody 1994; Woody and Dunker 1996), but most IR-active side chains are carboxylates and amines, which are found mainly at protein surfaces. Their signals should not normally be strongly perturbed by neighboring residues, and can therefore be modeled with some accuracy. There has been an active but unpublished discussion among protein IR spectroscopists about the effect of side chain bands on analysis results, so to ensure that the differences shown here result from protein secondary structure, side chain bands have been subtracted from all of the IR spectra shown. Such an approach was suggested by Venyaminov in previous work (Venyaminov and Kalnin 1990).
Our brief survey begins with proteins of highly α-helical character. In order to merge the information contained in the CD and IR spectra for future investigations, hybrid CD-IR spectra were built. At the top of Figure 5, the pair of hybrid CD+IR spectra labeled A were collected from HBN and FTN, which have 68.9% and 71.3% H (α-helix), respectively, as assigned by DSSP. The helices of both of these proteins are illustrated in Figure 3; neither has any β-sheet (E), and they have nearly identical turn (T) and "other" structure FCs. Their IR spectra are similar in shape, but there is an offset apparent between their infrared amide I bands (1656 and 1653 cm⁻¹, respectively) as well as a reproducible intensity difference in their amide II bands. It should be kept in mind that this is just one example; FTN has predominantly long helices (22–28 residues) that pack against each other at roughly 20° angles. In contrast, the average length of the helices in HBN is 14 residues, and several pack at an ~50° angle. The shape of the CD spectra of these two proteins is largely the same, but the FTN spectrum has a markedly smaller intensity. The effect of the number of residues involved in the α-helical structures was suggested by Nevskaya and Chirgadze (1976): the frequency of the amide I main component was shown to rise as helix length decreases. The CD intensities of CSA and TRO (Fig. 5B) are much more similar, but their IR spectra have distinctly different shapes. These two proteins again have very similar secondary structure FCs (59.6% and 62.3% H, 10.8% and 8.6% E, 2.3% and 3.7% T, respectively). CSA is a relatively large protein with 16 helices of ~15 residues and four 20-residue helices; TRO has a single long helix (33 residues) around which seven shorter helices (7–12 residues) are packed. Despite the similar ratio of long and short helices in each protein, their amide I maxima are offset by 5 cm⁻¹ (1656 versus 1651 cm⁻¹, respectively), and have substantially different widths.
[...] ~34% H, ~12% E) but completely different folds (Table 1 [...] α-helix content. The spectral differences that result from fold are even more pronounced in the smaller proteins RNA and UBQ (~33% H, ~17% E; see also Fig. 3 [...]
It is commonly accepted that CD is most effective when used for determination of the α-helix content in proteins, whereas IR is usually considered to be more accurate, that is, more consistent, for the determination of FCE. If we now examine proteins with high β-sheet content, we see that both IR and CD spectra can be substantially different for β-class proteins. The first such example, shown in Figure 5E, is a comparison of the spectra of BTE and CNA. BTE is a small protein with five β-strands (Fig. 3), and CNA contains a large, bent β-sandwich, but they both have 44% β-sheet. Their CD spectra have the expected low intensity compared to α-helical proteins, but their shapes are completely different in the 210–240 nm region. Although the contribution of aromatic residues to the CD spectra may be invoked to explain this difference, especially the positive BTE signal at 228 nm, their IR spectra are also dissimilar. The amide I maxima of these proteins are offset by 12 cm⁻¹ (1647 versus 1635 cm⁻¹), although the higher turn content of BTE (16% versus 11% T for CNA) could produce a small portion of the spectral difference above 1660 cm⁻¹. The β-barrel protein AVI and the well known β-sandwich IGG provide our final examples. These both have β-sheet contents close to 52%, which is exceeded by few proteins in either the RaSP set or the PDB. The difference in their CD spectra parallels the previous example, and their IR spectra match quite well above 1660 cm⁻¹, but at this point there is an inflection in the IR spectrum of AVI which is not present in the IGG spectrum. The AVI peak is significantly broader than that of IGG, and the shapes of their amide II bands are not at all similar.
If secondary structure content had been the sole criterion used for the selection of RaSP set proteins, only one of each pair of spectra listed in the preceding paragraphs would have been included. The discussion and spectra above clearly reinforce our assertion that although the correlation between FC and spectral band shape does follow a certain trend, many variations occur as well.
Conclusions
The search for an accurate and rapid method for determining protein structure from optical spectra has continued since protein secondary structure was discovered. There have been many attempts to capture the correlation between protein structure and IR or CD band shape using a wide variety of approaches. The appearance of multivariate statistical methods in the protein structure field originally looked promising because these methods are highly effective for the characterization of simple chemical systems. Unfortunately their performance has not been consistently reliable with proteins. Many groups have attempted to improve the general accuracy of secondary structure determinations by developing new mathematical methods, but the results of the present study suggest that the choice of basis proteins should also play a major role in the overall reliability of analysis.
Despite the fold-dependent spectral variations illustrated above, the structure-spectrum relationship in the basis sets of previous studies has been strong enough to allow α-helix and β-sheet contents to be determined to within 4%-12% (RMS error). However, it should be kept in mind that these numbers may be predominantly a measure of the internal consistency of the spectra included in each basis set, and may strongly reflect the extent, or lack, of variation of the type shown above that is present in the spectra of each basis set. Because the relationship between secondary structure "concentrations" (FCs) and spectral characteristics is complex, it is important that a basis set contain examples of spectral variations, in the hope that sufficient information will be incorporated into a calibration. This information will then be available to be called upon in the analysis of a protein of unknown structure.
Since the 50 members of this rationally selected protein basis set (RaSP50) represent the widest possible range of protein structure FCs and folds, its use should facilitate new progress in the development of spectroscopic protein structure analysis or other methods which require an experimentally accessible set of proteins that are representative of known protein structures.
| Materials and methods |
|---|
Protein selection began with a search for proteins with crystal structures available in the CATH crystal structure database (version 1.0) that were also commercially available from the Sigma or Fluka biochemical companies. The search was not restricted to single-domain proteins. CATH (http://www.biochem.ucl.ac.uk/bsm/cath_new) and SCOP (http://scop.mrc-lmb.cam.ac.uk/scop) are crystal structure databases that are organized by three-dimensional structure. Identification data for all proteins located in this search are listed in Table 2, and their structural parameters are given in Table 1. If one or two proteins with the same Class (C), Architecture (A), and Topology (T) numbers (i.e., the same fold, see Discussion) were found, the search in that C.A.T. level was terminated. Some proteins with similar folds that have been widely used in previous studies were, however, selected as well. If a protein was purchased and found to be unsuitable for inclusion in the basis set (impure or denatured), the CATH database was again consulted to find a substitute with a similar fold. The PDB_SELECT database (Hobohm et al. 1992; Hobohm and Sander 1994) was used as an additional source for identifying potential basis proteins. PDB_SELECT is a database of protein crystal structures selected to have a low level of sequence homology, and is organized in several levels with increasing homology cutoffs. The Brookhaven Protein Data Bank identification codes (PDB ID) for proteins with unique α-helix and β-sheet contents were taken from the PDB_SELECT listing at the 35% sequence homology level, and then used as the starting point in searches of the CATH or SCOP databases for a commercially available protein with the same fold.
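The fold-based stopping rule described above (at most one or two proteins per C.A.T. level) can be illustrated with a minimal Python sketch. The function name, the tuple layout, and the candidate entries are hypothetical placeholders introduced only for illustration; they are not taken from Table 2, and the sketch ignores the exceptions made for widely used proteins.

```python
from collections import defaultdict

def select_by_fold(candidates, max_per_fold=2):
    """Keep at most `max_per_fold` candidates for each C.A.T. fold.

    `candidates` is an iterable of (protein_code, cat_label) tuples listed
    in order of preference. This layout is an assumption of the sketch.
    """
    kept = []
    count = defaultdict(int)
    for code, cat_label in candidates:
        if count[cat_label] < max_per_fold:
            kept.append(code)
            count[cat_label] += 1
    return kept

# Hypothetical candidates: codes and fold labels are placeholders only.
candidates = [
    ("P1", "fold_A"),
    ("P2", "fold_A"),
    ("P3", "fold_A"),  # third protein with the same fold is skipped
    ("P4", "fold_B"),
]
selected = select_by_fold(candidates)
print(selected)
```

Ordering the candidates by preference (for example by purity or cost) before the call is one simple way to decide which proteins of a given fold are retained.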
Crystal structure quality and sequence homology
To maximize the accuracy of analyses performed using spectra from the basis set, care was taken to ensure that the crystal structures used here corresponded as closely as possible to the actual structures of the proteins purchased (see Discussion). After the initial selection of potential basis proteins, the SCOP database was used to find the closest possible species match between the commercially available and crystallized form of each protein. The HSSP database was then used to compare the sequence homology of the crystallized and commercially available proteins. The SWISS-PROT database (Bairoch and Boeckmann 1991; Bairoch and Apweiler 1997; Bairoch and Apweiler 1998) was used to determine the actual number of amino acids in each protein. This number was compared with the number of residues in the crystal structure. Only the sequence of the mature protein was considered: propeptides and signal sequences listed in SWISS-PROT were eliminated before the comparison.
The CATH database includes only crystal structures with a resolution of 3 Å or better, and thus no lower-resolution structures were considered. A database of output from the WHAT_CHECK crystal structure validation program (Hooft et al. 1996; accessible through the SCOP/PDB3D access pages) was used to evaluate the quality of the potential protein crystal structures and to choose a best crystal structure for each protein. The main criteria were the number of unsatisfied buried hydrogen bonds and the RMS Z scores for the backbone conformation and Ramachandran plots. These quantities are described in the WHAT_CHECK on-line documentation. RMS Z scores are generated by WHAT_CHECK for different properties of a crystal structure by comparing the structure under question with a set of known high-quality crystal structures. If the crystal structure being examined is of higher quality than the reference structures, its RMS Z score will be positive. An RMS Z score is poor if it is less than -3 (three standard deviations worse than the distribution of the reference structures), and bad if less than -4. The RMS Z scores for the selected crystal structure of each protein are listed in Table 2.
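The Z-score thresholds quoted above can be written as a small Python helper. This is only a sketch of the stated cutoffs: the function name and the label "acceptable" for scores of -3 or higher are assumptions made here, not WHAT_CHECK terminology.

```python
def classify_rms_z(z_score):
    """Label an RMS Z score using the cutoffs quoted in the text.

    Scores below -4 are "bad" and scores below -3 are "poor". The label
    "acceptable" for everything else is an assumption of this sketch.
    """
    if z_score < -4:
        return "bad"
    if z_score < -3:
        return "poor"
    return "acceptable"

# Illustrative values only, not scores from Table 2.
for z in (0.5, -3.5, -4.5):
    print(z, classify_rms_z(z))
```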
Protein purity
Proteins were selected for acquisition from the list of potential basis set candidates based on their fold, secondary structure, cost, and the advertised purity of the commercial products. Commercial preparations known to be impure (e.g., stabilized with BSA) were rejected. The purity of acquired proteins was examined with SDS-PAGE and Coomassie blue staining. Some contaminated proteins were successfully purified using size-exclusion chromatography with a 30 x 1 cm Sephadex G-100 column (DPR, TMT, and UOX; see Table 2 for protein identification codes).
The processing of spectral and crystal structure information
Several computer programs were written in Array Basic (Galactic Industries, http://www.galactic.com) to process both collected spectra and reference information. Software developed for this study includes programs designed to read and tabulate various data from a series of structure assignment program output files. The assignment program DSSP (Kabsch and Sander 1983) was used for all secondary structure data presented here. Residues that were missing from the crystal structures were given an irregular structure (C) assignment. Several output files were generated, including a summary of the percentage of residues assigned to each structure type, the amino acid content, and the calculated extinction coefficients (Gill and von Hippel 1989) for each DSSP file read. The second program generates synthetic infrared side chain spectra based on the known amino acid composition of a protein and subtracts them from the protein spectrum (Venyaminov and Kalnin 1990; Goormaghtigh et al. 1996; Barth 2000). The final program was designed to combine IR and CD spectra into a single array. This program can normalize and baseline-correct IR spectra and adjust the relative scaling of CD spectra so that they fall in a similar intensity range as the IR spectra. The IR spectra presented here have been normalized to the same intensity, and CD spectra have all been converted to mean residue ellipticity (deg cm2 dmole-1) and then scaled by a constant.
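The tabulation step can be sketched in Python as follows. This is not the Array Basic software used in the study; it is a minimal illustration that assumes the per-residue DSSP codes have already been extracted into a string, with a blank or "-" meaning no assignment. Residues missing from the crystal structure are counted as C, as stated above, and the "other" class collects everything that is not H, E, or T.

```python
from collections import Counter

def fractional_composition(dssp_codes, n_missing=0):
    """Return the percentage of residues assigned to H, E, T, and other.

    `dssp_codes` holds one DSSP code per observed residue. A blank or "-"
    is treated as no assignment (C). `n_missing` residues that are absent
    from the crystal structure are also counted as C.
    """
    codes = ["C" if ch in " -" else ch for ch in dssp_codes]
    codes.extend("C" * n_missing)
    total = len(codes)
    if total == 0:
        raise ValueError("no residues to tabulate")
    counts = Counter(codes)
    percent = {k: 100.0 * counts.get(k, 0) / total for k in ("H", "E", "T")}
    percent["other"] = 100.0 - sum(percent.values())
    return percent

# Made-up assignment string used only to exercise the function.
fc = fractional_composition("HHHHEE-TTS", n_missing=10)
print(fc)
```

Passing `n_missing` explicitly keeps the denominator equal to the full length of the mature protein rather than the number of residues resolved in the crystal.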
Protein solutions
Proteins that were purchased as lyophilized powders were dissolved directly in 2 mM HEPES (1H2O) pH 7.2, 0.1% NaN3. For ammonium sulfate suspensions, the protein was first pelleted by centrifugation, and the excess (NH4)2SO4 solution was removed before dissolving the pellet in buffer. The initial protein solutions were made at a concentration of 4% (w/w) taking into consideration the mass of buffer salts in lyophilized powders, if present. Small molecules from the commercial preparations were removed by extensive dialysis against HEPES buffer (2 mM, pH 7.2, 0.02% NaN3) at 4°C, or by passing the sample through a 0.7 x 4 cm Sephadex G-25 (Pharmacia) size-exclusion centrifuge column equilibrated with this buffer. The desalting was repeated for proteins with (NH4)2SO4 or high concentrations of other salts. The high NaN3 concentration in the initial solution allowed the effectiveness of desalting to be verified for each sample because the N3- ion has a characteristic IR band at 2048 cm-1. For the proteins IGG, LCL, and ADH, the solutions were clarified by centrifugation before use. Stock solutions were made by adjusting the concentration of the desalted proteins to 3% by the addition of buffer or by concentration with a Microcon 3 or 10 (Amicon) where necessary. Stock solutions were used directly for transmission IR measurements. For CD measurements, the stock solution was diluted to 0.01% with 2 mM HEPES (1H2O) pH 7.2, without NaN3 because NaN3 absorbs strongly at low wavelengths. Exceptions to the general procedure are as follows: CNA was maintained at pH 5.2 during all manipulations. INS was dissolved at pH > 10, and the KOH solution was immediately exchanged for the standard HEPES buffer pH 7.2 with a Sephadex G-25 centrifuge column (2x); the final pH was 7.2.
CD spectroscopy
CD spectra were recorded in a 1-mm cell on a JASCO J-710 spectrometer (calibrated with CSA) that was constantly purged with N2 at 5 l/min. Each spectrum was the accumulation of eight scans with a 1-nm slit width, a time constant of 0.5 sec, and a scan rate of 50 nm/min, for a nominal resolution of 1.7 nm; total collection time was 13 min. The protein concentrations in filtered solutions were adjusted to give a detector voltage of 495±15 V at 185 nm, which provided a good noise level and avoided flattening of the spectra. The absorbance spectra of the samples from 185-260 nm were obtained simultaneously and examined using the JASCO software. After background subtraction, the CD samples typically had an absorbance of 0.7 AU at 192 nm, and gave raw CD intensities larger than -5 mdeg in the 200-230 nm region. For analysis, background-corrected spectra were converted to mean residue ellipticity (deg dmole-1 cm2) using a concentration determined from the absorbance at 205 nm. They were then scaled by a constant (0.0015) to provide intensities similar to those of the processed infrared spectra. The protein extinction coefficient at 205 nm was calculated based on the number of peptide bonds in the protein and an assumed peptide bond ε205 of 5167 l mole-1 cm-1, which is based on a combination of previously published values (Hennessey Jr. and Johnson Jr. 1981; Scopes 1987). With the instrument used here, mean residue ellipticities calculated using A205 were more consistent than those based on A192, which has been used in some other studies. The accuracy of this value was confirmed by comparison with concentrations determined from A280 for proteins with known ε280 values (Gill and von Hippel 1989), and also by comparison with published spectra when possible.
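The conversion described above can be sketched in Python. The per-bond ε205 of 5167 l mole-1 cm-1 and the scaling constant 0.0015 are taken from the text; everything else is an assumption of this sketch: the function names, the use of the same 0.1-cm path length for absorbance and CD, the count of peptide bonds as residues minus chains, and the standard relation [θ] = θ(mdeg) / (10 · l · c_residue).

```python
EPS_205_PER_BOND = 5167.0  # l mol^-1 cm^-1 per peptide bond (value quoted in the text)
IR_MATCH_SCALE = 0.0015    # scaling constant quoted in the text

def protein_molarity(a205, n_residues, path_cm=0.1, n_chains=1):
    """Protein concentration (mol/l) from A205 via the Beer-Lambert law.

    Assumes one peptide bond between each pair of adjacent residues,
    i.e. n_residues - n_chains bonds in total.
    """
    n_bonds = n_residues - n_chains
    if n_bonds <= 0:
        raise ValueError("protein must contain at least one peptide bond")
    return a205 / (EPS_205_PER_BOND * n_bonds * path_cm)

def mean_residue_ellipticity(theta_mdeg, a205, n_residues, path_cm=0.1, n_chains=1):
    """Mean residue ellipticity in deg cm^2 dmol^-1 from a raw signal in mdeg."""
    c_residue = protein_molarity(a205, n_residues, path_cm, n_chains) * n_residues
    return theta_mdeg / (10.0 * path_cm * c_residue)

# Hypothetical inputs chosen only to exercise the functions.
mre = mean_residue_ellipticity(theta_mdeg=-10.0, a205=0.5167, n_residues=101)
scaled = mre * IR_MATCH_SCALE
print(mre, scaled)
```

Because both the absorbance and the CD signal scale with path length and concentration, an error in the assumed path length largely cancels in this calculation when the two are measured in the same cell.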
Infrared spectroscopy
Infrared spectra were collected on a dry-air-purged Bruker IFS-55 spectrometer with a liquid N2-cooled MCT detector; 512 scans were accumulated for each spectrum at a resolution of 2 cm-1. Spectra were collected using the 3% protein stock solutions (1H2O) placed between CaF2 windows separated with a 5-µm Teflon spacer. The buffer spectrum was subtracted with the help of a software package written in Array Basic by K. Oberg. This program was designed for the iterative optimization of various subtractions that are commonly encountered in protein infrared spectroscopy. Here, the subtraction scaling factor was adjusted so that the slopes of the baselines from 1990-1900 and 1850-1740 cm-1 were the same (Powell et al. 1986). The buffer spectrum subtraction scaling factors determined typically ranged from 0.98 to 1.01. Water vapor signal was removed using the area of the vapor peaks at 1717 or 1772 cm-1 to determine the subtraction scaling factor. The intensities of the protein spectra collected in this study were in the range of 10 to 55 mAU, with a typical spectrum having an intensity of 35-40 mAU. The RMS noise level from 2200-2100 cm-1 was 9.8 × 10^-6 AU, giving signal-to-noise ratios of 1000-5600, with an average value around 3700. A more conservative estimate, based on the RMS noise level in the amide I region in the presence of buffer (1.8 × 10^-5 AU), gives signal-to-noise ratios of 550-3000, with an average value around 2000.
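The signal-to-noise figures quoted above can be checked with a short Python calculation. The sketch assumes that the ratio is simply the spectrum intensity divided by the RMS noise level; the function names are introduced here for illustration.

```python
def signal_to_noise(intensity_au, rms_noise_au):
    """Signal-to-noise ratio as intensity divided by RMS noise (both in AU)."""
    if rms_noise_au <= 0:
        raise ValueError("RMS noise must be positive")
    return intensity_au / rms_noise_au

NOISE_BASELINE_AU = 9.8e-6  # RMS noise in the baseline region, from the text
NOISE_AMIDE_I_AU = 1.8e-5   # RMS noise in the amide I region with buffer, from the text
INTENSITIES_MAU = (10, 55)  # lowest and highest spectrum intensities, from the text

def snr_range(rms_noise_au, intensities_mau=INTENSITIES_MAU):
    """Return the signal-to-noise ratio for each intensity given in mAU."""
    return tuple(signal_to_noise(m * 1e-3, rms_noise_au) for m in intensities_mau)

for label, noise in (("baseline", NOISE_BASELINE_AU), ("amide I", NOISE_AMIDE_I_AU)):
    low, high = snr_range(noise)
    print(f"{label}: {low:.0f} to {high:.0f}")
```

Under this assumption the 10 to 55 mAU intensity range reproduces both quoted ranges after rounding, which supports reading the run-together figures in the original as 1000-5600 and 550-3000.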
| Acknowledgments |
|---|
α-hemolysin, ATX); and Valentina Bychkova, Russian Academy of Sciences, Moscow (human α-lactalbumin, ALA). The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Bairoch, A. and Apweiler, R. 1998. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998. Nucleic Acids Res. 26: 38-42.
Bairoch, A. and Boeckmann, B. 1991. The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 19 (Suppl.): 2247-2249.
Barth, A. 2000. The infrared absorption of amino acid side chains. Prog. Biophys. Mol. Biol. 74: 141-173.
Barton, G.J. 1994. SCOP: Structural classification of proteins. Trends Biochem. Sci. 19: 554-555.
Baumruk, V., Pancoska, P., and Keiderling, T.A. 1996. Predictions of secondary structure using statistical analyses of electronic and vibrational circular dichroism and Fourier transform infrared spectra of proteins in H2O. J. Mol. Biol. 259: 774-791.
Brahms, S. and Brahms, J. 1980. Determination of protein secondary structure in solution by vacuum ultraviolet circular dichroism. J. Mol. Biol. 138: 149-178.
Chang, C.T., Wu, C.S., and Yang, J.T. 1978. Circular dichroic analysis of protein conformation: Inclusion of the β-turns. Anal. Biochem. 91: 13-31.
Chen, Y.H. and Yang, J.T. 1971. A new approach to the calculation of secondary structures of globular proteins by optical rotatory dispersion and circular dichroism. Biochem. Biophys. Res. Commun. 44: 1285-1291.
Chen, Y.H., Yang, J.T., and Martinez, H.M. 1972. Determination of the secondary structures of proteins by circular dichroism and optical rotatory dispersion. Biochemistry 11: 4120-4131.
Chen, Y.H., Yang, J.T., and Chau, K.H. 1974. Determination of the helix and β form of proteins in aqueous solution by circular dichroism. Biochemistry 13: 3350-3359.
Dodge, C., Schneider, R., and Sander, C. 1998. The HSSP database of protein structure-sequence alignments and family profiles. Nucleic Acids Res. 26: 313-315.
Dousseau, F. and Pezolet, M. 1990. Determination of the secondary structure content of proteins in aqueous solutions from their amide I and amide II infrared bands. Comparison between classical and partial least-squares methods. Biochemistry 29: 8771-8779.
Flores, T.P., Orengo, C.A., Moss, D.S., and Thornton, J.M. 1993. Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci. 2: 1811-1826.
Gill, S.C. and von Hippel, P.H. 1989. Calculation of protein extinction coefficients from amino acid sequence data. Anal. Biochem. 182: 319-326.
Goormaghtigh, E., Cabiaux, V., and Ruysschaert, J.M. 1994. Determination of soluble and membrane protein structure by Fourier transform infrared spectroscopy. III. Secondary structures. Subcell. Biochem. 23: 405-450.
Goormaghtigh, E., de Jongh, H.H., and Ruysschaert, J.M. 1996. Relevance of thin films prepared for attenuated total reflection Fourier transform infrared spectroscopy: Significance of the pH. Appl. Spectrosc. 50: 1519-1527.
Greenfield, N. and Fasman, G.D. 1969. Computed circular dichroism spectra for the evaluation of protein conformation. Biochemistry 8: 4108-4116.
Grishina, I.B. and Woody, R.W. 1994. Contributions of tryptophan side chains to the circular dichroism of globular proteins: Exciton couplets and coupled oscillators. Faraday Discuss. 99: 245-262.
Hennessey Jr., J.P. and Johnson Jr., W.C. 1981. Information content in the circular dichroism of proteins. Biochemistry 20: 1085-1094.
Hilbert, M., Bohm, G., and Jaenicke, R. 1993. Structural relationships of homologous proteins as a fundamental principle in homology modeling. Proteins 17: 138-151.
Hobohm, U. and Sander, C. 1994. Enlarged representative set of protein structures. Protein Sci. 3: 522-524.
Hobohm, U., Scharf, M., Schneider, R., and Sander, C. 1992. Selection of representative protein data sets. Protein Sci. 1: 409-417.
Hooft, R.W., Vriend, G., Sander, C., and Abola, E.E. 1996. Errors in protein structures. Nature 381: 272.
Hubbard, T.J.P., Murzin, A.G., Brenner, S.E., and Chothia, C. 1997. SCOP: A structural classification of proteins database. Nucleic Acids Res. 25: 236-239.