|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan2 School of Life Science, Tokyo University of Pharmacy and Life Science, Hachioji, Tokyo 192-0392, Japan3 Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Corporation, Chiyoda-Ku, Tokyo 102-8666, Japan
Reprint requests to: Junichi Higo, School of Life Science, Tokyo University of Pharmacy and Life Science, 1432-1 Horinouchi, Hachioji, Tokyo 192-0392, Japan; e-mail: higo{at}ls.toyaku.ac.jp; fax: +81-426-76-5863.
(RECEIVED July 1, 2004; FINAL REVISION November 26, 2004; ACCEPTED December 3, 2004)
| Abstract |
|---|
|
|
|---|
distances. Using principal component analysis (PCA) on the intrasegment C
C
atomic distances, the conformational space of protein segments, which we call the protein segment universe, has been visualized, and three essential coordinate axes, suitable for describing the universe, have been identified. Three essential axes specified radius of gyration, structural symmetry, and separation of hairpin structures from other structures. Among the segments of arbitrary length, 622 residues long, the conservation of those axes was uncovered. Further application of PCA to the two largest clusters in the universe revealed local structural motifs. Although some of motifs have already been reported, we identified a possibly novel strand motif. We also showed that a capping box, which is one of the helix capping motifs, was separated into independent subclusters based on the C
geometry. Implications of the strand motif, which may play a role for proteinprotein interaction, are discussed. The currently proposed method is useful for not only mapping the immense universe of protein structures but also identification of structural motifs. Keywords: protein segment universe; structure classification; principal component analysis; local structural motif; helix capping
Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.04956305.
| Introduction |
|---|
|
|
|---|
Exploring the protein structure universe at a fold (or structural domain) level could answer a fundamental question: How are proteins distributed in a conformational space? Holm and Sander (1996) have shown that the conformational space has a biased distribution depending on arrangements and combinations of secondary-structure elements. By visualizing a global representation of the fold universe, Hou et al. (2003) have clearly shown the distribution controlled by essential axes segregating into four major structural classes (
,
,
+
, and
/
). On the other hand, exploring the structure universe of short to medium size segments is also important, because protein folds share a few types of common structural units of short to medium size. Salem et al. (1999) have shown that most of 10 super-folds highly contain supersecondary structural units, such as
-hairpin,
-hairpin, and 

. Taylor (2002) has suggested that a variety of protein structures can be simplified by using a periodic table of all possible combinations of
-helices and
-sheets, and that the protein-structure comparison can be automatically done by using the periodic table. Therefore, investigating physicochemical and/or evolutionary principles governing segment structures of short to medium size may be useful for understanding the structural diversity of longer segments and proteins.
For the segment or protein structure classification, the choices of similarity measure and clustering algorithm are the crucial issues. Classification of short (typically four to nine residues long) polypeptide segments embedded in proteins has been attempted as reviewed (Tomii and Kanehisa 1999). Many similarity measures and clustering algorithms have been proposed for different goals of the surveys. The RMSD after structural superposition (Unger et al. 1989; Matsuo and Kanehisa 1993; Unger and Sussman 1993; Micheletti et al. 2000) and the C
C
distances coupled with the pseudotorsion angles along the backbone trace (Rackovsky 1990; Rooman et al. 1990; Prestrelski et al. 1992; Fetrow et al. 1997) are the typical measures for identification of structural motifs or building blocks. The structural pattern of short segments in proteins most dominantly found in these surveys is
-helix. It is possible to detect
-strands and more detailed motifs, such as capping motifs, in high-resolution clustering. Fetrow et al. (1997) employed the C
C
distances and the pseudotorsion angles, developed an autoassociative neural network for identifying structural motifs of seven residues long, and succeeded in distinguishing four capping motifs of helix and strand. Hunter and Subramaniam (2003) developed a hypercosine clustering method based on a geometrical similarity of segments for classifying canonical local shapes with an abundant segment data set. The identified local shapes included some typical motifs with various combinations of
-turns,
-strands, and
-helices, and a particular twisted motif with high frequencies of glycine and proline. A structure comparison method, which used geometric invariants of a tetrahedron generated by four C
atoms in a segment, detected structural motifs in loop structure, where a large number (1.2 million) of fragments were taken for the structural library (Tendulkar et al. 2004). Recently, the classification results of short segments were applied for protein structure prediction. The backbone dihedral angles were employed for a neural network, which was based on self-organized maps, and local structural motifs of five residues long were identified with significant amino acid preference (de Brevern et al. 2000). The characterized motifs were applied for prediction of protein backbone structures based on a Bayesian probabilistic approach.
Since protein structure should be expressed by a number of degrees of freedom, a technique to reduce the high dimensional space to a lower dimensional one is indispensable. In this study, we used a well-known mathematical method, principal component analysis (PCA). Takahashi and Go (1993) applied PCA on the atomic coordinates of short peptides (i.e., two or three residues long) after the structural superposition, and showed that the short-peptide backbone can be classified into some conformational clusters. Hou et al. (2003) calculated the overall pairwise similarity scores among representatives from SCOP (Murzin et al. 1995) folds using the DALI program (Holm and Sander 1993), and projected the folds on a three-dimensional (3D) PCA space. Moreover, many groups used the PCA method to analyze the trajectories of molecular dynamics simulation (for example, Amadei et al. 1993; Kitao et al. 1998; Kamiya et al. 2002). With applying PCA on a conformational ensemble of a short peptide of about 10 residues long, where the conformations were sampled with a generalized-ensemble method (i.e., multicanonical molecular dynamics simulation), the peptide folding pathways were visualized in the 3D PCA space (Ikeda and Higo 2003; Ikeda et al. 2003).
The purposes of the present study were (1) to uncover the essential structural axes governing the structural variations of protein segments by visualizing a structure universe of short to medium size (622 residues long) segments, which were taken from all fold types of globular proteins, and (2) to identify local structural motifs distinguishable in the visualized conformational distribution. For these purposes, we applied PCA for the intrasegment C
C
atomic distances, and attempted to obtain a global representation of the segment structure universe in the PCA space. Although the measure of structural similarity we used was simple, it was efficient and powerful for mapping the segment universe with an extreme density gradient. To address the second purpose, we focused on the structure universe of segments 10 residues long because of its tractability and the profitable results of motif identification.
| Results |
|---|
|
|
|---|
C
atomic distances, then PCA was applied on the variancecovariance matrix, and the PCA axes constructed the conformational space (see Materials and Methods). Figure 1
-helix and
-strand), were obtained. Helix segments were assigned to the largest cluster in U10 (Pall = 32.07%). The definition of Pall is given in the subsection "Analysis of the protein segment universe" in Materials and Methods. The lowest PMF region (red line; discriminated by PMF = 2.36 kcal/mol) in this cluster consisted of complete helices, and different types of helical segments surrounded the complete-helix region. The strand cluster (Pall = 8.16%) was found on the opposite side of the helix cluster in U10, which mainly originated from
-sheets in the proteins and rarely from extended loops. Our method did not discriminate strands that participate in parallel and anti-parallel
-sheets, since those segments have similar conformations at the level of the C
-carbon geometry. The populated central region of the strand cluster (PMF < 2.36 kcal/mol) consisted of fully extended strands (Pall = 3.29%), and the surrounding of the central region consisted of partly deformed strands (Pall = 4.87%). We further analyzed both the helix and strand clusters, classifying them into subclusters, as shown later.
|
-hairpin clusters discriminated by PMF = 3.33 kcal/mol were symmetrically arranged on a semicircle arch around an edge of U10. Those
-hairpin clusters were quantitatively distinguishable from one another by the position of
-turn and the loop conformation characterized by the intrasegment hydrogen-bonding patterns (Fig. 2
-turn at the ith and the (i + 1)th residues, was called HPNi,i+1. The position of
-turn moved from the N/C- to the C/N-terminal side of the segment, when shifting the
-hairpin cluster along the semicircle arch. The HPN3,4, HPN4,5, HPN5,6, HPN6,7, and HPN7,8 had double hydrogen bonds between the (i 1)th and (i + 2)th residues (Fig. 1A
|
|
-helical with the helix termination at the fifth residue. Amino acid Gly was favored at the sixth residue: FUm6(Gly) = 1.98. Hydrophobic residues were favored at the second and the ninth residues: FUm2(Cys) = 0.83, FUm2(Leu) = 0.83, and FUm9(Val) = 1.01. The hydrophobic contacts between the two residues were frequently observed in Um. Cluster Un also consisted of helix C-capping structures. The N-terminal helical conformations in Un were similar to those in Um. Amino acid Gly was favored at the sixth residue: FUn6(Gly) = 1.72. The favorable positions for the hydrophobic residues were the second and the seventh residues: FUn2(Leu) = 1.03, FUn2(Trp) = 0.95, FUn2(Met) = 0.81, FUn7(Val) = 0.74, FUn7(Ile) = 0.58, and FUn7(Phe) = 0.56. The side-chain contacts between the two residues were frequently observed, suggesting that Un correlates with the Schellman motif (Schellman 1980; Aurora et al. 1994). Cluster Uo consisted of helixloophelix structures. Amino acid Gly was favored at the fifth and the sixth residues: FUo5(Gly) = 1.47 and FUo6(Gly) = 1.10. The segments mostly originated from joint loops connecting two helices. In asym2, there were two conformational clusters (Up and Uq). Cluster Up consisted of helix N-capping structures, where the last five residues were helical with the helix termination at the fifth residue where amino acids Ser and Thr were strongly favored: FUp5(Ser) = 1.37 and FUp6(Thr) = 0.97. The hydrophobic residues were favored at the fourth and the ninth residues: FUp4(Met) = 0.95, FUp4(Ile) = 0.84, and FUp9(Trp) = 0.86. The side-chain contacts between the two residues were frequently observed, suggesting that Up correlates with the box motif (Harper and Rose 1993). Cluster Uq consisted of segments characterized as strand + 310-helix. The N-terminal half formed a strand conformation. The C-terminal half contained a significant amount of 310-helix. The percentage of segments with three or more 310-helical residues in Uq was 37.5%.
General features of the protein segment universe
We show, here, the meanings of PCA axes describing the protein segment universe, and the conservation of those axes over segment universes from U6 to U22. Figure 3
shows a cumulative contribution of the first three axes, S13, and individual contributions of the first five axes. Interestingly, up to U16, S13 exceeded 80%, and even for U22 it was 76.3%. Thus, only three axes can account for the original structural variations of short to medium size segments. The individual contribution Q1 was especially large compared to those of the other axes, although Q1 remarkably decreased at extension of segment length. Contrarily, Q2 significantly increased up to 21 residues long (maximum contribution rate = 21.06%) and decreased for segments longer than 22 residues long. Other contributions (i.e., Q3Q5) slightly increased with the segment length, which may be a natural outcome resulting from the decrease in Q1. The image of conformational space was conserved for longer segments. Figure 4
demonstrates U20. The distribution of conformational clusters (especially helix and
-hairpin clusters) in U20 was similar to that in U10. Although the strand cluster found in U10 diminished in U20, fully extended strands were located on the opposite side of the
-hairpin clusters on v1.
|
|
qk = wk
k1/2vk (see Equation 1, below) gives the conformational deviation along the kth eigenvector vk from the average
q
. The deviation
di,j (i.e., the deviation of the distance, di,j, between the ith and the jth C
atoms from the average distance
di,j
) can be obtained by picking up the corresponding element to di,j from
qk. We expressed
qk with triangle maps (Fig. 5AF
C
distance between the N-terminal and the C-terminal residues) or the radius of gyration of segments. The correlation coefficient between
q1 and the end-to-end distance was 0.82, and that between
q1 and the radius of gyration was 0.94 for U10. In Figure 5, B and E
-hairpin clusters was controlled by v2(Fig. 1A
|
-helices. Subclusters Hg (Phelix = 0.95%) and Hh (Phelix = 1.71%), located near Hf, could be characterized as conformations where helices were broken at the N- or the C-terminal residue, respectively. The definition of Phelix is given in the subsection "Analysis of the protein segment universe" in Materials and Methods.
|
-helical region was from the fifth to the 10th residues. The amino acid preferences (Equation 4, below) FXi(A), where X = Ha, ..., Hd, are shown in Figure 7B
|
2,
3] angles were different between the subclusters: [111.4°, 112.8°] for Hb, and [133.0°, 135.2°] for Hc, where the values are the average over segments in each subcluster.
Subcluster Hd (Phelix = 0.52%) consisted of helix N-capping segments, which agreed well with the big-box motif (Seale et al. 1994). The
-helical region was from the 4th to the 10th residues. At the N-terminal helix-breaking point (the third residue), Asp, Ser, and Asn were favored: FHd3(Asp) = 1.14, FHd3(Ser) = 1.06, and FHd3(Asn) = 1.08 (Fig. 7B
), and large hydrophobic residues, such as Thr, Phe, and Cys, were favored at the first residue (Fig. 7B
): FHd1(Tyr) = 0.78, FHd1(Phe) = 0.64 and FHd1(Cys) = 0.66. The structural conversion of segments in subcluster Hd was relatively wrong compared with that in the other subclusters (see the caption for Fig. 6
). However, the segments frequently formed either the 16 or 17 side-chain contact, which is a common feature in the big-box motif.
Subcluster He (Phelix = 0.71%) consisted of helix C-capping segments, which agreed well with the Schellman motif (Schellman 1980; Aurora et al. 1994). The
-helical region was from the first to the sixth residues. At the seventh residue, which is one residue after the C-terminal helix-breaking point, Gly was strongly favored: FHe8(Gly) = 2.12. The majority of the He segments contained a turn at the seventh or the eighth residue, and did not form a side-chain-to-backbone hydrogen bond. The 49 side-chain contact was frequently formed in He. In the fourth and the ninth residues, nonpolar and hydrophobic residues were favored: FHe4(Cys) = 1.12, FHe4(Leu) = 0.89, FHe4(Met) = 0.85, and FHe9(Trp) = 0.84. On the other hand, polar and hydrophilic residues were disfavored: FHe4(Asn) = 1.76, FHe4(Ser) = 0.80, FHe9(Asp) = 1.63, and FHe9(Gln) = 1.61.
Figure 6
indicates that the axis v2 separates the helix N-capping segments in subclusters (i.e., HaHd) from the other clusters (i.e., HeHh), and that the subclusters HaHd are well separated from one another along the axis v3, when the other subclusters are removed. Triangle maps (Fig. S1AC in Supplemental Material) indicate the changes of C
C
distances along the three essential eigenvectors (i.e., each PCA axis) derived from the helical segment subuniverse.
Strand segment subuniverse
To further investigate the strand cluster (PMF < 2.84 kcal/mol) in U10, we applied PCA again on the cluster, and generated a strand segment subuniverse (S13 = 79%. Then the fully extended strands populated again in the central region of the obtained subuniverse, overlapping on the surrounding regions consisting of the partly deformed strands (Fig. 8A
). Thus, the strand cluster was not separated into independent subclusters.
|
PMF < 2.84 kcal/mol, from the strand cluster. Thus the collected segments are those remaining after removing the central strand core (a red region at the bottom of Fig. 1A
Although the majority of the fully extended strands were removed by the procedure explained above, 37 fully extended
-strands still remained in the subuniverse and they formed subcluster Sb (Pstrand = 0.65%). The definition of Pstrand is given in the subsection "Analysis of the protein segment universe" in Materials and Methods. Eighty-nine percent of the segments in Sb originated from anti-parallel
-sheets in proteins. Subcluster Sa (Pstrand = 1.50%) consisted of
-strands broken at the 10th residue. Amino acids Pro, Glu, and Asp were disfavored in the Sa and Sb segments, whereas Val, Ile, Trp, and Phe, which are generally favored in strands, were favored.
Subcluster Sc (Pstrand = 1.48%) consisted of segments whose strand region was from the third to the 10th residues bending around the third residue. Amino acids Gly and Pro were favored at the N-terminal region: FSc1(Gly) = 1.06 and FSc2(Pro) = 0.85. The third residue relatively favored Arg, which is known as a
-strand breaker (Colloch and Cohen 1991): FSc3(Arg) = 0.74. Thus, there is a possibility that Sc is a N-capping strand. Subcluster Sd (Pstrand = 2.07%) consisted of segments whose strand region was from the fourth to the 10th residues bending around the fourth residue. We could not find any particular amino acid preference in Sd.
Subcluster Se (Pstrand = 1.40%) consisted of 79 curved strands, which mostly originated from extended loops or strands in
-sheets. The percentage of segments with five or more residues serving strandstrand hydrogen bonds was 58%. Generally, Pro is rarely found in strands. However, the preferences of Pro assigned to the strand region (the fourth to the eighth residues) were relatively high (Fig. 8D
). Note that the preference for Pro was not specifically high on a site, but nonspecifically high in the strand region. Twelve Se segments contained two or three Pro residues in the strand region. We named Se the "Pro-rich curved strand motif." The Se segments were frequently located on the protein (or domain) surface: The average solvent accessible surface area (ASA) assigned to Se was the largest in the strand subclusters (Sa, 403; Sb, 354; Sc, 369; Sd, 415; Se, 554; Sf, 429; Sg, 427; and Sh, 394 Å2), where ASA was calculated on the condition that the segment was embedded in the protein (or domain). In He, segments with large ASA (ASA > 800 Å2) were exposed loops or strands in
-hairpin, whereas all of the segments buried in the interior of the protein (ASA < 200 Å2) were
-sheet strands. We found that 14 Se segments participated in proteinprotein interactions (see Discussion).
Subcluster Sf (Pstrand = 6.39%) consisted of strand C-capping segments whose strand region was from the first to the seventh residues. Amino acids Pro, Asp, and Ser were relatively favored in the C-terminal region. Twenty-five percent of the Sf segments had a turn at the ninth residue and 30% at the 10th residue.
Subcluster Sg (Pstrand = 0.53%) consisted of strand C-capping segments whose strand region was from the second to the seventh residues. Amino acids Pro, Asp, and Gly were favored in the C-terminal region: FSg8(Pro) = 1.09, FSg9(Asp) = 1.39, and FSg10(Gly) = 1.15. There were seven segments where Asp existed at the ninth residue, and the side-chain orientations of Asp converged well as displayed in Figure 8D
. Six of the seven side chains of Asp formed intraprotein hydrogen bonds. Thus, we named Sg "Asp C-cap strands."
Subcluster Sh (Pstrand = 0.51%) consisted of strand C-capping segments whose strand region was from the first to the seventh residues. Amino acids Pro, Asp, and Gly were favored in the C-terminal region: FSh8(Pro) = 1.12, FSh8(Asp) = 1.11, and FSh9(Gly) = 1.03. Thus, both Sg and Sh had a high preference for Asp in the C-terminal region. However, the side-chain orientations of Asp did not converge well in Sh compared with those in Sg. Besides, the majority of side chains of Asp in Sh were exposed to solvent without participating in the intraprotein hydrogen bond.
Triangle maps (Fig. S1D-I in Supplemental Material) indicate the changes of C
C
distances along the three essential eigenvectors (i.e., each PCA axis) derived from the strand segment subuniverse.
| Discussion |
|---|
|
|
|---|
C
atomic distances for the measure. The benefits of using this measure are that the computation of the variancecovariance matrix does not require the structural superposition and that the segment backbone structure can be reconstructed from the C
C
atomic distances. Note that our method is sufficiently rapid for building the structural universe including segment structures with a vast data set. The disadvantages of this measure are that the side-chain conformational differences cannot be directly detected and that the structural chirality is not considered. These disadvantages are due to the fact that the relative position between two atoms was not expressed by a vector but by a scalar (i.e., distance). For instance, right- and left-handed helices are exactly the same in the C
C
atomic distances. The significance of applying the PCA method to a complicated system, such as protein structure, is to obtain a small number of essential variables that account for a large proportion of the original structural variation, and then to represent the variation in a low dimensional PCA space. In the present study, by using three axes with large contributions, which were able to considerably cover the structural variation of segments (S13 of U10 was 87.8%), we clearly showed the overall distribution of segment conformations in the 3D PCA space. On the other hand, even if such essential variables are obtained, it is often difficult with the PCA method to understand what the variables really mean for the given original information. We succeeded in defining the meaning of the three essential elements (i.e., radius of gyration, structural symmetry, and separation of hairpin structures from other structures), which are suitable for describing the structural diversity of protein segments. It is reasonable that the first principal component remarkably correlated with radius of gyration, because the most sensitive quantity to the variety of segment conformations, collected from various proteins, is probably radius of gyration.
By comparing U10 and U20, we showed that the symmetrical distribution is a common feature for the short to medium size segments in the PCA space. This means that two segments, which have conformations similar to each other when the residue numbering of one segment is inversed, appear with an approximately equal frequency in proteins. Thus, we presume that in the short to medium size segments, there is almost no bias acting on the chain direction from the N to C terminus or from the C to N terminus, and that this symmetry is a "property of a string" of short to medium size segments in natural proteins (polypeptide chains). We also observed the asymmetrical areas (asym1 and asym2) in U10. The causal reason for this asymmetry is the structure differences between the N- and C-terminal caps of helices, and this structural difference may be caused by the difference of the hydrogen-bond patterns. In fact, we observed that only the N-terminal cap helix favored the side-chain-to-backbone hydrogen bond, which agrees well with the observation by Aurora and Rose (1998).
We applied PCA two times on the segment ensemble to analyze the helical or strand segment subuniverse. This means that the segment universe is viewed with two different scales. The first PCA made the large-scale structures of the universe visible, where helical, strand, and hairpin clusters distributed. The second PCA made the details of the universe visible, where subclusters distributed. The triangle maps of the first three axes for the helix subuniverse (Fig. S1AC in Supplemental Material), where some locally restricted conformational deviations were seen, were different from those for the whole segments (Fig. 5AC
). As shown in Results, the axes v2 and v3 from the helix subuniverse described the variety of the helix capping structures. For the strand segments taken from the strand cluster (PMF < 2.84 kcal/mol) in U10, the strand cluster was not separated into independent subclusters unless the central strand core in U10 was eliminated (Fig. 8A
). The reason should be addressed that the shape of the strand segment subuniverse was analogous with that of the strand cluster in U10, and that the strand cluster did not show a fine structure in U10. The triangle maps of the first two axes of the strand subuniverse were similar to those of U10 (see Fig. S1D,E in Supplemental Material). The strand segment universe has a continuous distribution whereas the helix segment universe has a discontinuous one.
The helix-capping motifs have been summarized in a previous review (Aurora and Rose 1998). In the helical segment subuniverse, some subclusters corresponded to the helix-capping motifs ever reported (Schellman 1980; Harper and Rose 1993; Aurora et al. 1994; Seale et al. 1994). The two subclusters Hb and Hc should belong to the box motif, according to the classification method of Aurora and Rose (1998). We showed that the separation into the two subclusters resulted from the difference in the side-chain-to-backbone hydrogen-bond patterns between Hb and Hc. Since our classification method is based on the C
C
distances, it cannot directly detect the side-chain conformational differences. However, our method is also useful to detect the side-chain conformational differences, when the side-chain conformation correlates with the main-chain conformation.
One may consider that the structure classification of strands is less important than that of helices, because a strand is not stabilized by the intrastrand interactions but is usually stabilized by interacting with the surrounding strands in the same
-sheet. The majority of the strand cluster consisted of fully extended strands, and the trends of amino acid preferences favorable for
-strand have just been confirmed. This may mean the lesser importance of the strand-cluster classification. However, the deformed strands were classified well into subclusters with the specific amino acid preferences (Fig. 8D
). Typically, we could identify a strand subcluster Se, named the Pro-rich curved motif. Based on the status of Se segments participating in proteinprotein interactions, we have categorized them into the following three cases:
Three segments with one or two Pro residues (5csc A415424, 1dqa A529538, and 1djn A702711) were identified as case 1. Especially, the two Pro residues (Pro418 and Pro422) in 5csc have been suggested by Bergdoll et al. (1997) as the hinge prolines, which are important for the oligomerization with the arm exchange. They have proposed that the existence of Pro at the root of the exchanged arm induces the oligomerization by imposing the arm to a favorable position. The two remaining segments in case 1 may play the same role for the dimerization, since the Pro residues are located at the root of the arms in the dimer. Five segments (trimer: 1el6 A194203, 1pya A1019, 1i9r A199208; pentamer: 1i9b A4857, A8089) were identified as case 2. The arrangements of Se segments in the oligomer vary, such as a triangular or a spherical shape. Six segments (1i4j A7079, 1g61 A20922101, 1d8i A402411, 1qpa A320329, A330339, and 1a9x A985994) were identified as case 3. Thus, we suggested that Se motifs, which are localized on the protein surface, play an important role for oligomerization or maintenance of protein complexes in various ways. Figure S2 in Supplemental Material displays three cases of Se segments at the protein interface. We found that some segments in cases 2 and 3 did not contain Pro residue, although the common curved conformations were conserved. In the analysis, we used a segment ensemble selected from the current structural database. More analysis should be done, when the structural database becomes more abundant in the feature. Then, we may find Pro residue in a complex that involves a protein homologous with the current one.
A role of edge strands in proteinprotein interaction has been studied (Richardson and Richardson 2002). The edge strand was defined as that bordered on only one side by another
-strand in a
-sheet (Minor and Kim 1994), and then the other side of the edge strand may be used for avoiding edge-to-edge aggregation when the strand is located on the protein surface. Richardson and Richardson (2002) have studied a strategy adopted by natural
-sheet proteins to avoid the protein aggregation. A method to distinguish the edge strand from the other strands based on the sequence information was proposed (Siepen et al. 2003). We investigated the relation between the edge strand and the Se subcluster, and found that 19 of 79 Se segments were exposed edge strands. Then, the current study showed that the edge strand is a member of the Se subcluster in the segment universe.
We used the fold representatives to generate the segment ensembles. One may consider that structural representatives should be collected from each superfamily, because the variety of structures in a fold group is larger compared to that in a superfamily. We consider that our result does not change much even though the representatives are collected from each superfamily, because one superfamily often forms one fold group. In fact, in 625/731 of the SCOP folds, which we used in this analysis, one fold group has only one superfamily member. To evaluate the sensitivity of data size and the selection of structural representatives on the structural distribution, we constructed a 3D conformational space using a segment ensemble generated from 100 proteins, which were randomly picked up from the full SCOP folds (i.e., 731 folds). The obtained conformational distribution was highly similar to the original one: Secondary structure clusters (helix, strand, and hairpins) existed as dense cores in the conformational space, and the shape of the whole distribution was again like that of a shoe (data not shown).
The current study showed that the protein segment universe had a symmetrical shape (i.e., shoe-shaped) in the PCA space. It should be noted that the universe 20 residues long, U20, exhibited a shape similar to that of U10. This similarity may reveal a general feature of short to medium size segments. For longer segments (i.e., 30-residue segments), the shape was not similar to those of U10 and U20: The shape changed from the shoe shape to a different one at around 25 residues long, which may indicate that the boundary between the peptide-like structure and the protein-like one in natural proteins is at around 25 residues long (will be reported elsewhere). This discontinuity of the universe shape may reveal the existence of a boundary between segment-like and protein-like structures.
| Materials and methods |
|---|
|
|
|---|
, all-
,
+
,
/
, multidomain, and small protein classes, respectively. Membrane proteins were excluded. The 731 proteins were selected for preparing a segment ensemble (see http://www.cbrc.jp/~ikeda/psu/list.html). Tertiary structures of the proteins were taken from the Protein Data Bank (Berman et al. 2000) to build the segment ensemble. Then, we cut the protein structures into short to medium size segments (622 residues long) with a window sliding by one residue along the sequence. Segments with incomplete coordinate data were excluded. The number of the 10-residue and 20-residue segments in each segment ensemble was 116,182 and 106,324, respectively.
Constructing a protein segment universe using PCA
We calculated all intrasegment C
C
atomic distances for each segment in the ensemble. Denoting di,j as the distance between the ith and jth C
atoms in a segment, a distance set was expressed for the segment as q = [d1,2, d1,3, d1,4,. . .., dn1,n] = [q1, q2, q3, ... , qn(n1)/2], where n is the number of residues in the segment. Then, a variancecovariance matrix, C, was calculated as Cij =
qiqj
qi
qj
, where Cij is the (i,j)th element of the matrix. The average
...
was taken over all segments in the ensemble.
A set of eigenvectors {v1, v2, v3, ... ,vn(n 1)/2} and eigenvalues {
1,
2,
3, ... ,
n(n 1)/2} was obtained with diagonalyzing C, where two equations, Cvi =
ivi and vivj =
ij, are satisfied. If the distribution of segments in the conformational space is expressed by a gaussian,
i corresponds to the standard deviation along vi. Although the real distribution cannot be simply expressed by the gaussian, the situation that
i corresponds to the standard deviation is maintained. Thus, eigenvectors with larger eigenvalues are more important to study the conformational variety of the segments. Here, eigenvalues are arranged in the descending order:
i >
j if i < j.
A conformational space constructed by the eigenvectors is called "PCA space," where an eigenvector corresponds to a PCA axis. The origin of the PCA space is set on the average C
C
atomic distances:
q
= [
q1
,
q2
,
q3
, ... ,
qn(n 1)/2
]. Then, any position (i.e., any segment structure) in the PCA space can be expressed by a linear combination of eigenvectors as
![]() | (1) |
where
qk = wk
k1/2vk is the conformational deviation from
q
along vk, and wk is the amount of deviation along vk (note that vk is normalized). Thus, q can be represented as [w1, w2, w3, ... , wn(n 1)/2], and wi be regarded as the ith coordinate assigned to vi in the PCA space.
To study the segment universe, we used three eigenvectors with the three largest eigenvalues: v1 (the first PCA axis), v2 (the second), and v3 (the third). Thus, the cumulative contribution of the three PCA elements to the whole conformational distribution is assessed by
![]() | (2) |
where Qi =
i/
k all
k. The larger the Q13, the larger the contribution of the three PCA axes to the whole distribution. The three eigenvectors construct a 3D PCA space, and the introduction of the 3D space makes it possible to view the segment universe.
Analysis of the protein segment universe
As described above, the distribution of segments in the 3D PCA space gives an image of the segment universe. We designated the segment universe n residues long as Un. We defined a vector, r, to express the segment position in the 3D PCA space: r = [w1, w2, w3]. After obtaining the density
(r) of the distribution at each position r in the space,
(r) was converted to the form of the potential of mean force (PMF):
![]() | (3) |
where R is the gas constant, T = 300 K, and
max is the maximum density to set PMF at the maximum position to zero. We expressed the 3D PCA space by the PMF contour map because the segment structure space had an extreme density gradient.
The number of segments involved in each conformational cluster was represented by a ratio (i.e., percentage) to that in the segment universe (or subuniverse): Pall, Phelix, and Pstrand are ratios to U10, the helix subuniverse, and the strand subuniverse, respectively.
Amino acid preference
We calculated a preference, Fxi(Aj), of amino acid Aj at position i in the segments of a cluster (or subcluster) X:
![]() | (4) |
where PXi(Aj) is the frequency of amino acid Aj at position i in cluster X and Pref(Aj) is the frequency of amino acid Aj in all of the segments of a segment universe (Bystroff and Baker 1998).
| Footnotes |
|---|
| References |
|---|
|
|
|---|
Aurora, R. and Rose, G.D. 1998. Helix capping. Protein Sci. 7: 2138.[Abstract]
Aurora, R., Srinivasan, R., and Rose, G.D. 1994. Rules for
-helix termination by glycine. Science 264: 11261130.
Bergdoll, M., Remy, M.H., Cagnon, C., Masson, J.M., and Dumas, P. 1997. Proline-dependent oligomerization with arm exchange. Structure 5: 391401.[Medline]
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235242.
Bystroff, C. and Baker, D. 1998. Prediction of local structure in proteins using a library of sequence-structure mo