1 Sackler Institute of Molecular Medicine, Department of Human Genetics and Molecular Medicine, Sackler School of Medicine and
2 School of Computer Science, Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
3 Laboratory of Experimental and Computational Biology, Intramural Research Support Program, SAIC, Inc., NCI-Frederick, Frederick, Maryland 21702, USA
Reprint requests to: Ruth Nussinov, NCI-Frederick, Building 469, Room 151, Frederick, MD 21702, USA; e-mail: ruthn{at}ncifcrf.gov; fax: (301) 846-5598.
(RECEIVED October 18, 2002; FINAL REVISION April 3, 2003; ACCEPTED April 17, 2003)
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.0237103.
| Abstract |
|---|
Keywords: Binding-site; phage display library; artificial evolution; active site prediction; graph algorithms
| Introduction |
|---|
Several computational methods exist for predicting protein interaction sites. These attempt to predict binding sites at different resolutions: entire domains, a sequence window, or at the single amino acid level. The majority of these methods are sequence-based. The "proline-brackets" method takes advantage of the high frequency of prolines near interaction sites (Kini and Evans 1995). Other approaches are based on correlated mutations (Pazos et al. 1997) or coevolution of proteins with their interaction partners (Lichtarge et al. 1996a, b, 1997; Goh et al. 2000; Sowa et al. 2001). According to the correlated mutations approach, residues close to an interaction site are expected to mutate simultaneously during evolution. On the other hand, the coevolution approach looks for simultaneous mutations in two interacting partners rather than in a single protein.
Some of the sequence-based methods also rely on the genomic context (Dandekar et al. 1998). Gene order conservation is considered a fingerprint for interacting proteins. A different type of conservation, primary structure conservation, has also been utilized in binding site prediction methods. Several sorts of conservation were taken into consideration in different methods: residues that are conserved within a subfamily of proteins but that differ between subfamilies, residues that are conserved in a few subfamilies (Fariselli et al. 2002), and domain homologies (Marcotte et al. 1999). Domain homologies were used for site prediction on genome sequences (Marcotte et al. 1999). Interaction between two proteins is suggested if in another organism homologous domains are assembled as a single protein. Interaction sites involving hydrophobic residues are proposed to be predicted using the mean α-helical hydrophobic moment (Xavier et al. 2000). In contrast to the above-mentioned methods for interaction site prediction based on primary structure, which use additional information, a trained support vector machine uses solely the physicochemical properties of the sequence (Bock and Gough 2001).
Some methods use both sequence and structural data for the prediction. Sequence profiles, together with solvent accessible surface area and neighboring residues, were used for binding site prediction using neural networks (Huan-Xiang and Yibing 2001). Finally, structure-based methods are also used for binding site prediction. Superposition requires a homologous protein with a known binding site, whereas docking (reviewed in Halperin et al. 2002) requires structures (or models) of the two binding molecules. Some of the existing computational methods for binding site prediction cannot be applied to all proteins. Domain shuffling and hydrophobic residue involvement in the interaction are examples of assumptions that do not hold for all proteins, yet are a fundamental requirement of some computational methods for binding site prediction. Nevertheless, protein binding sites are generally hydrophobic (Tsai et al. 1997), with a large, though variable, extent of nonpolar surface area. Further, information about active site residues is sometimes available from site-directed mutagenesis, chemical cross-linking, and phylogenetic data (Gabb et al. 1997; Bliznyuk and Gready 1999). Here, even in the absence of experimental data, it is sometimes possible to predict the correct binding site (Aloy et al. 2001). Functional regions in proteins have also been identified by surface mapping of phylogenetic information (Armon et al. 2001). Potential hydrogen bonding groups, enzyme clefts, and charged sites on a protein surface have all been used for binding site prediction (Gilson and Honig 1987; Laskowski 1995; Laskowski et al. 1996; Frommel et al. 1996; Bliznyuk and Gready 1999; Pettit and Bowie 1999). Because binding sites are at least partially flexible, searches for part-flexible part-rigid sites have also produced encouraging results (Todd et al. 1998; Freire 1999; Todd and Freire 1999; Luque and Freire 2000).
Algorithms that predict the location of hinges and modes of motions (e.g., Hayward et al. 1997), or those that carry out structural comparisons of a protein family, in particular, if they allow hinge-bending movements (Shatsky et al. 2000, 2002) should be useful as well. Binding sites may, in principle, be predicted through residue hot spots (Bogan and Thorn 1998; DeLano 2000, 2002; Hu et al. 2000).
Several experimental strategies can be used for the analysis of the spatial organization of protein complexes. These include chemical cross-linking, two-hybrid systems, hydrogen-deuterium exchange, protein microarrays, random mutagenesis, inhibition assays, alanine scanning, protection from chemical alteration or proteolytic digestion, and phage display, to name a few. These methods can provide four types of data: (1) constraints (i.e., proximity of specific residues from opposing partners in a complex); (2) binding site location (i.e., assignment of the binding site to a specific fragment); (3) hot spot determination (i.e., identification of those residues that contribute dominantly to the binding energy); and (4) binding site characterization (i.e., characterization of a set of sequences that bind a target molecule, or of the consensus properties required for binding). Some of these methods can be applied not only in a case-specific manner, but also as a generic tool. One fruitful method for mapping interactions of protein complexes is screening phage display libraries for peptide ligands (Geysen et al. 1986; Ferrer and Harrison 1999; Li et al. 2001; Wu et al. 1999, 2000; reviewed by Sidhu et al. 2003).
A critical aspect of phage display is the construction of combinatorial peptide libraries. Synthetic oligonucleotides, fixed in length but with unspecified codons, can be cloned as fusions to capsid genes of a filamentous phage (Enshell-Seijffers et al. 2001). These libraries, often referred to as random peptide libraries, can then be tested for binding to target molecules of interest. This is often done using a form of affinity selection known as Biopanning (Kay et al. 1996). Once a combinatorial library is built, it can be applied to a wide array of macromolecules, proteinaceous and nonproteinaceous, those that are known to interact with small peptides and those that had previously undefined specificity for peptides. Only a modest amount of time, effort, and resources are needed for biopanning a library displaying up to 10^13 different peptides.
The potential of phage display for computational binding site prediction has been shown recently (Tong et al. 2002). Phage display and a large-scale two-hybrid system were combined for computational prediction of interaction sites. Consensus sequences based on phage display peptides were used to search genomic sequences for potential ligands. The intersection between the phage display prediction and the two-hybrid system results is expected to yield biologically relevant sites. This strategy was applied successfully to SH3 binding proteins in yeast. The SH3 binding motifs are sequential rather than "truly" 3D, that is, they are not discontinuous order-independent residues on the binding site surface. Nevertheless, although this strategy was applied only to chain-contiguous epitopes and was not yet examined on a diverse data set, the correlation between the phage display binding site mapping, the two-hybrid system, and previous biochemical data in the examined proteins is encouraging.
Here we present SiteLight, a novel computational tool for prediction of a binding site on a 3D structure using phage display libraries. SiteLight is applicable to complexes made up of a protein termed Template, and any type of molecule, termed Target. Given the 3D structure of a Template and a collection of sequences derived from biopanning against the Target, SiteLight predicts the interaction site of the Template with the Target. The algorithm can be divided into three main stages: (1) a combinatorial division of the Template surface to overlapping patches; (2) a one-dimensional (1D) to 3D alignment of peptide sequences with surface patches; (3) scoring the derived matches and assessing the results. SiteLight was implemented in C++, and runs on the order of a minute (on Red-Hat Linux 7.1, 1 processor, Pentium 4 1.80 GHz, 256 KB cache machine).
To assess the ability of SiteLight to correctly predict binding sites, we have created a data set that includes experimental results from 25 complexes and 39 phage display libraries. A variety of complex types are represented in the data set. SiteLight was tested from three different aspects. First, algorithm validity: SiteLight was run on peptides derived computationally from a known binding site. SiteLight's ability to select the binding site out of the entire protein surface was examined. This simple experiment confirms the correctness of the method. Second, phage display libraries verification: From each phage display library, the one peptide that yields the best results was selected. The purpose of this test is to show that in each library there is at least one peptide that can be mapped to the binding site. This supports the applicability of phage display libraries to 3D binding site mapping. Third, assessing SiteLight's performance: SiteLight was run with all the library peptides (in contrast to using only the one peptide that yields the best results) and without prior knowledge of the binding site location. Because the correct binding site is known from the crystal structure of the complex, we should be able to confirm or refute the binding site predicted by SiteLight. To the best of our knowledge, this is the first study that attempts to validate the applicability of phage display libraries for automated binding site prediction on 3D structures.
| Results |
|---|
Type III
The purpose of these libraries is also to display the mutated variants of the binding site region. However, they differ from Type II with respect to the mutated parts. In Type II libraries, the hot spot regions are mutated, whereas in Type III libraries these regions are kept constant, while the flanks are mutated. Both of these approaches try to change the binding affinity to the target protein. Type II libraries fulfill this task by mutating positions that are hypothesized to contribute significantly to the binding, whereas Type III libraries test positions that are presumed not to contribute significantly to the binding.
Type IV
The purpose of this library type is to create novel binding proteins. A protein that binds the Target weakly is used as a Library Template. By mutating this protein it may acquire binding capability. In this molecular evolution process, a nonbinding protein turns into a binding one through mutation and selection. Thus, in this library mutated fragments of the nonbinding protein are displayed on the phage surface.
Type V
This type is used to bypass the limitations of the Types II and III libraries. These libraries are not generic, because the nonmutated regions may play a role in the binding process. First, they may match the Target sterically and be energetically favored. Second, they help the mutated region to acquire a "correct" conformation. Correct conformation may not be established in the absence of the nonmutated regions. Therefore, a local structural environment formed by the nonmutated regions is essential for binding. Although this conclusion was neither confirmed nor refuted by experimental methods, its implications were not overlooked. Further, libraries of Types II and III are not generic because they might produce an epitope composed of both mutated and nonmutated regions. The nonmutated regions are composed of native residues located in the binding site, and therefore might contribute to the interactions. Thus, libraries of Types II and III are less informative than epitopes derived only from the mutated regions. In the Type V libraries our goal is to separate the contribution of the mutated regions to the binding process. To achieve this purpose, we use two different local structural environments. These environments are taken from two proteins that are known to bind the same Target. One of these is used as a Library Template, while the binding site is searched on the second. This is in contrast to the other library types, where the same protein serves both as a Library Template and for binding site searching (see Fig. 1).
Overall, 14 Type I and 24 Semicombinatorial libraries were used. Among these are 17 Type II, three Type III, three Type IV, and one Type V libraries.
The algorithm
SiteLight seeks to match a phage display derived peptide to a 3D epitope on a protein surface. The peptides are expected to imitate the binding site on the Template protein with respect to amino acid chemical properties and spatial organization. The surface of the protein is divided into overlapping patches. The division is based on geodesic distances between two residues rather than on Euclidean distances. SiteLight examines the potential match of each peptide in the library with each patch. For each potential match a bipartite graph, Graph 2, is created. Its vertices represent patch and peptide residues. Its edges represent similarity between two residues. Residue similarity is determined by a similarity matrix (Table 3 in Supplemental Material). The best alignment of a peptide and a patch represented in a bipartite graph is found by the maximal bipartite matching algorithm. The score of each match is determined by the best alignment. Potential matches are sorted by their scores. High scoring matches are iteratively selected until 25% of the Template protein is covered. The minimized surface is expected to include the binding site between the Template and the Target. To assess SiteLight's success, the "correct" binding site is determined. This site is defined by the Template-Target complex interface (i.e., Template residues that are spatially proximal to the Target). The correctness of each match is represented by the patch overlap with the interface. Figure 2 gives a flow chart of the algorithm, and Materials and Methods describes it in further detail.
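The peptide-to-patch alignment step can be illustrated with a minimal Python sketch (SiteLight itself was implemented in C++). It computes the size of a maximum bipartite matching with the standard augmenting-path method. The residue classes in `RESIDUE_CLASS` are a hypothetical stand-in for the similarity matrix in the Supplemental Material, and the function names are illustrative assumptions, not part of SiteLight.

```python
# Hypothetical residue classes; the actual similarity matrix used by
# SiteLight (Supplemental Material) is not reproduced here.
RESIDUE_CLASS = {
    **dict.fromkeys("AVLIMC", "aliphatic"),
    **dict.fromkeys("FWY", "aromatic"),
    **dict.fromkeys("KRH", "positive"),
    **dict.fromkeys("DE", "negative"),
    **dict.fromkeys("STNQ", "polar"),
    **dict.fromkeys("GP", "special"),
}


def similar(a, b):
    """An edge exists between two residues if they share a class (assumption)."""
    return RESIDUE_CLASS[a] == RESIDUE_CLASS[b]


def match_score(peptide, patch):
    """Size of a maximum bipartite matching between peptide and patch residues.

    Sequence order is ignored: each peptide residue may pair with any
    similar patch residue, and each patch residue is used at most once.
    """
    matched_to = [None] * len(patch)  # patch index -> peptide index

    def try_assign(i, visited):
        # Try to place peptide residue i, re-routing earlier assignments
        # along an augmenting path when a patch residue is already taken.
        for j, residue in enumerate(patch):
            if similar(peptide[i], residue) and j not in visited:
                visited.add(j)
                if matched_to[j] is None or try_assign(matched_to[j], visited):
                    matched_to[j] = i
                    return True
        return False

    return sum(try_assign(i, set()) for i in range(len(peptide)))
```

For example, `match_score("KDF", "EWRA")` pairs K with R, D with E, and F with W regardless of their order in either string. A weighted similarity matrix would call for a maximum-weight assignment instead of this unweighted matching.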
Phage display libraries verification
We have tested the validity of the input for SiteLight, that is, the phage display library biopanning results. The purpose is to show that in every library there is at least one peptide, referred to as Peptide A, which can be mapped to the correct binding site. If it can, it should score higher there than when it is aligned by SiteLight to other parts of the protein. According to its definition, Peptide A is sufficient for binding site prediction. The method for selecting Peptide A is detailed in Materials and Methods. In 76% of the complexes presented in Table 2, a correct solution (over 25% overlap with the interface) was ranked first using Peptide A. The best way to evaluate this result is to compare it to the one obtained with the Artificial Interface Peptides, which can be regarded as the best possible solution that can be obtained. Because the data set is the same, these results are comparable. The percentage of complexes in which a correct solution was ranked first based on Peptide A is only 6% less than the one obtained with the Artificial Interface Peptides. In only 2 out of the 43 cases could no peptide be aligned with the binding site for a given complex and library. The rank of the best match, that is, the solution that overlaps with the interface to the highest extent, is also high in almost all cases (see column 9 in Table 2). The exception is the hapten-antibody category. Overall, the categories with the worst results are hapten-antibody and antigen-Fc.
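The two ranks discussed above (the first "correct" solution and the best-overlapping solution) can be sketched as follows. This is an illustrative Python sketch, not SiteLight code. It assumes that overlap is measured as the fraction of patch residues that belong to the interface, which the text does not state explicitly, and that patches are non-empty collections of residue identifiers already sorted by score.

```python
def interface_overlap(patch, interface):
    """Fraction of patch residues found in the interface (assumed definition)."""
    patch = set(patch)
    return len(patch & set(interface)) / len(patch)


def first_correct_rank(ranked_patches, interface, threshold=0.25):
    """1-based rank of the first patch with overlap above the threshold, or None."""
    for rank, patch in enumerate(ranked_patches, start=1):
        if interface_overlap(patch, interface) > threshold:
            return rank
    return None


def best_overlap_rank(ranked_patches, interface):
    """1-based rank of the patch that overlaps the interface the most."""
    overlaps = [interface_overlap(p, interface) for p in ranked_patches]
    return overlaps.index(max(overlaps)) + 1
```

With a hypothetical interface `{10, 11, 12, 13}` and ranked patches `[[1, 2, 3, 4], [10, 11, 3, 4], [10, 11, 12, 4]]`, the first correct solution is at rank 2 and the best-overlapping solution at rank 3.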
Hsc70: detailed result analysis
Hsc70 is a constitutively expressed protein. It prevents misfolding and aggregation of newly synthesized or misfolded proteins. Hsc70 consists of three domains: ATPase, SBD (substrate binding domain), and a C-terminal domain. ATP binding by the ATPase domain regulates substrate binding by an unknown mechanism. Substrate binding promotes ATP hydrolysis. The Hsc70/ADP complex with the substrate is more stable. ATP hydrolysis is also stimulated by Hsp40 proteins. Substrate release is dependent on the exchange of bound ADP for ATP. This reaction is promoted by a nucleotide exchange factor: GrpE in prokaryotes and Bag-1 in eukaryotes. Bag-1 was shown to stimulate the ATPase rate of Hsc70 in an Hsp40-dependent manner and to promote substrate release from the chaperone (Sondermann et al. 2001). Two structures were used for the analysis: the complex of the Hsc70 ATPase domain and Bag-1, and unbound Bag-1 (PDB codes 1HX1 and 1I6Z, respectively).
Hsc70 phage display libraries
Bovine Hsc70 was used as a Target for screening 15-mer and 6-mer phage display random peptide libraries. Each library contained about 10^8 clones. Three clones were sequenced from the 6-mer library. Ninety-seven clones were sequenced from the 15-mer library after three rounds of selection. These sequences were enriched with lysines, histidines, and aspartic acid. Binding specificity to Hsc70 was confirmed by negative and positive tests. The negative test examined peptide binding to other proteins (BSA, actin, and streptavidin). The positive test examined the peptides' ability to stimulate ATPase activity in two ways: (1) inorganic phosphate release measurement, and (2) competition with the pigeon cytochrome c peptide, which stimulates ATPase activity (Takenaka et al. 1995).
Previous results have suggested that heptamers may have an improved affinity compared to hexamers. Therefore, 7-mer peptides were designed based on sequences obtained by biopanning the 15-mer and the 6-mer libraries. Two groups of control sequences of 6-mer and 7-mer lengths were also constructed. These peptides failed to pass the binding specificity tests (Sondermann et al. 2001).
SiteLight results for the Hsc70/Bag-1 complex
Table 3 shows a correlation between the specificity of peptide binding and its ability to predict the binding site. There is a significant difference between control peptides and specific-binding peptides. Control peptides of both 7 and 6 amino acids (lines 7 and 8 in Table 3) were not mapped to the binding site. Either none or 3 residues out of 21 interface residues were found using heptamer and hexamer control peptides, respectively. In comparison, 14 and 12 residues out of 21 interface residues were found using heptamer and hexamer specific-binding peptides, respectively. For the 15-mer peptides, no control was provided. In each of the three libraries (15-mer, heptamer, and hexamer) there is at least one peptide (i.e., Peptide A) that can be mapped to the binding site better than to other parts of the molecule (lines 1-3 in Table 3). The Bag-1 surface was minimized by 75%. The number of surface residues was reduced from 104 to 26. Using each of the 15-mer, heptamer, and hexamer Peptide As, a different set of residues was located in the reduced surface. This surface contains 17, 16, and 12 interface residues for the 15-mer, heptamer, and hexamer Peptide As, respectively. The highest ranking solutions (solution no. 1) obtained by the 15-mer, heptamer, and hexamer Peptide As overlap with the interface by 68%, 100%, and 57%, respectively.
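The surface reduction quoted above (104 surface residues reduced to 26, i.e., 25% of the surface) follows from iteratively selecting high-scoring matches. A minimal Python sketch of such a selection is given below. The stopping rule (stop as soon as the next patch would push coverage past the limit) and the function name `reduce_surface` are assumptions made for illustration; the source does not specify how the final patch is handled.

```python
def reduce_surface(scored_patches, n_surface_residues, max_fraction=0.25):
    """Union of top-scoring patches covering at most max_fraction of the surface.

    scored_patches: iterable of (score, residues) pairs, one per match.
    n_surface_residues: total number of surface residues of the Template.
    """
    limit = max_fraction * n_surface_residues
    covered = set()
    # Visit matches from the highest score to the lowest.
    ranked = sorted(scored_patches, key=lambda sp: sp[0], reverse=True)
    for _score, patch in ranked:
        candidate = covered | set(patch)
        if len(candidate) > limit:
            break  # assumed rule: never exceed the coverage limit
        covered = candidate
    return covered
```

For a Template with 104 surface residues, the limit is 26 residues, which matches the reduction reported for Bag-1.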
When the entire libraries (lines 4-6 in Table 3) were used, a modest decline in prediction quality was observed compared to the Peptide A results. The reduced surface contains 13, 14, and 12 interface residues for the 15-mer, heptamer, and hexamer entire libraries, respectively. This yields an average decline of 1.6 residues. The rank of the highest solution and the percentage of interface overlap did not change when the 15-mer and heptamer Peptide As were replaced by the entire 15-mer and heptamer libraries. As in other examined cases, no positive correlation between peptide length and binding site prediction quality was observed.
Bound and unbound Bag-1
Here our goal is to predict the Hsc70/Bag-1 binding site if no such complex is available. In the absence of the Hsc70/Bag-1 complex, the Template is the unbound rather than the bound Bag-1. Therefore, we compared the performance of SiteLight for the bound and unbound Bag-1 structures. The bound structure of Bag-1 (1HX1) is taken from Homo sapiens. The unbound structure of Bag-1 (1I6Z) is from Mus musculus. The sequences of the bound and unbound Bag-1 were aligned using CLUSTAL X (1.81) multiple sequence alignment. They share 85% residue identity and 93.5% similarity. The unbound Bag-1 structure (1I6Z) was structurally aligned to bound Bag-1 (1HX1 chain B) using FlexProt (Ma et al. 2002; Shatsky et al. 2002). FlexProt detects the optimal flexible structural alignment of a pair of protein structures. The first structure is assumed to be rigid, while in the second structure potential flexible regions are automatically detected. The root-mean-squared deviation (RMSD) is 2.14 Å for an alignment of 112 residues (the entire length of bound Bag-1) without hinges. The RMSD could not be lowered by insertion of one or two hinges. The structure and sequence alignment are presented in Figure 4. The results obtained with Peptide A for the bound and unbound Bag-1 (Table 3) show that the number of predicted residues was smaller for the unbound than for the bound Bag-1. The ranks of the first (highest) solution that overlaps the interface by 25% or more and of the solution that overlaps the interface to the largest extent draw a different picture. The solution that overlaps the interface by at least 25% ranks number one for all six cases: bound 15-mer, heptamer, and hexamer and unbound 15-mer, heptamer, and hexamer. The interface overlap percentage of the highest solution indicates a nonuniform trend. It is higher for the unbound 15-mer and hexamer (80% for both) compared with the bound 15-mer and hexamer (68% and 57%, respectively). It is lower for the unbound heptamer compared with the bound heptamer (87% and 100%, respectively).
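For reference, the RMSD reported above for the bound and unbound Bag-1 alignment is the root of the mean squared distance between paired atoms. The Python sketch below shows that calculation only. It assumes the two coordinate lists are already superimposed and paired residue by residue (for example, one Cα atom per aligned residue); the superposition and hinge detection performed by FlexProt are not reproduced, and the function name is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    The structures are assumed to be superimposed already; no fitting is done.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```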
The example of Hsc70/Bag-1 demonstrates SiteLight's ability to predict a conformational epitope based on phage display peptide sequences. The Hsc70/Bag-1 interface consists of two helices. Library peptides that were mapped to the binding site, including Peptide A, could be mapped to both. This demonstrates a correlation between the specificity of peptide binding and its ability to predict the binding site. Control peptides were poorly mapped to the binding site compared to specific-binding peptides. The length of the specific-binding peptides does not seem to affect the prediction quality. Reasonable results could also be obtained using the unbound structure of Bag-1. This demonstrates the applicability of SiteLight to both bound and unbound structures.
| Discussion |
|---|
One of the initial research goals was to establish a data set on which SiteLight can be tested. Our data set includes 25 complexes and 39 phage display libraries. Although this data set is large and diverse, it is far from being ideal. It needs to be considerably enlarged. Further, ideally, it should primarily consist of the generic Type I libraries. Because the interest in phage display application for computational prediction of binding sites is only beginning, we expect that this limitation will be resolved in the near future. Currently available data have not yet been fully exploited. Valuev et al. (2002), the creators of ASPD, predicted that its size would double within a year. The enlargement of the available data set will enable application of statistical tools in a meaningful way. In addition to the expected growth in combinatorial phage display data, the data set available to SiteLight may be enlarged by using nonphage-based methods of peptide display and selection. These include both artificial evolution methods that are not phage-based (like ribosome display; Hanes and Pluckthun 1997) and bacterial display (Wikstrom 2000), as well as large scale peptide display methods (like peptide microarrays). From a computational point of view, there is no difference between peptides derived from any of these methods. However, it remains to be determined if SiteLight will perform equally well on such inputs as it does on phage display inputs.
Two types of validation tests, algorithm validation and phage display libraries verification, were carried out. The first confirmed the ability to remap a binding site using peptides derived from this site. The second revealed at least one peptide in 95% of the tested libraries that can be aligned to the binding site better than to other parts of the protein. The existence of such peptides in each of these libraries reinforces the idea that random phage display libraries can be mapped to a 3D binding site. The expected result in the first type of validation is clear: to be able to remap the site from which peptides were derived. Such an experiment can be carried out with various sites regardless of their binding properties. On the other hand, in the second validation test the expected result is not entirely clear because a few types of epitopes can be defined as the "correct" answer.
A binding site can be divided into two types of epitopes: structural and energetic. An energetic epitope consists of amino acids that can be shown to individually contribute significantly to the binding energy. These residues are also known as hot spots. On the other hand, a structural epitope is expected to be larger than the energetic epitope because not all interface residues are biologically relevant (Valdar and Thornton 2001). Some of the structural nonenergetic residues may be critical for dictating the 3D configuration of the epitope. Thus, a consensus derived from combinatorial phage display peptides may include only hot spots, only critical residues dictating the 3D structure, other structural epitope residues, or their combination. The relative prevalence of these groups in the peptide sequences is unknown. Experimental methods of hot spot determination vary from one study to another and are often incomplete in the sense that only a subset of the positions is examined. Therefore, we have chosen crystallographically defined interfaces. Such a definition is uniform, and appears most applicable to computational studies. Nevertheless, it is unavoidable that sometimes the epitope represented by phage display peptides deviates from this definition.
SiteLight's performance can be assessed within the framework of the available data set and a "correct" epitope definition. SiteLight can reduce the protein surface by 75% without excluding the binding site. The reduced surface includes at least one solution that overlaps the interface by at least 50% in 63% of the cases. The failure to achieve a "correct" solution in the remaining cases can have a number of causes. (1) Partial exploration of the solution space: It is possible that the patches that are not explored include the ones that would yield the best results. (2) Site mimicry: Sites that have an amino acid composition similar to that of the binding site cannot be distinguished from it. This mimicry can shift the results away from the interaction site. The existence of such mimicry is indicated by false positives found using interface-derived peptides. (3) Irrelevant peptides: The library is expected to include both relevant peptides that bind the Target and irrelevant peptides that do not bind. Despite the negative selection steps, there are derived peptide sequences that are intrinsic to components of the biopanning process itself, including the plastic, the immobilization system, or the blocking agents (Adey et al. 1995). These irrelevant peptides might mask the "correct" peptides that can be mapped to the binding site. (4) Multiple binding sites: The relevant peptides might not all bind the Target at the same site as the Template. Let us consider a protein with multiple binding partners. The binding sites on this protein can overlap, but can also be distinct. Distal interaction sites can each bind a different set of peptides. (5) Dominance of a single site: Related to the previous reason is the fact that during affinity selection a single high-affinity binding site might dominate the library. Other binding sites may not be represented at all.
Thus, the presence of peptides that mimic the Template binding site depends on the biopanning components, the number of binding sites, and the affinity of the peptides for each site. In this regard, it would be interesting to experimentally examine proteins with a few binding partners, particularly those with different site affinity to their natural ligands. There may be a correlation between the affinity of a site to its natural ligand and its ability to select affinity peptides. Nevertheless, such a correlation does not necessarily exist, because the peptides' affinity can exceed the affinity of the natural ligand. Fewer rounds of selection might allow selection of peptides for multiple sites. In such a case, the peptides will be mapped to more than one Template, and may be divided into groups according to their similarity. Each group should be aligned to one of the Templates.
Sequence alignment of the peptides may, in principle, help in discriminating between relevant and irrelevant peptides for a single binding protein partner. Relevant phagotope discrimination based on multiple alignment of peptides derived from random phage display libraries was carried out for primary biliary cirrhosis and type I diabetes (Davies et al. 1999). Although not yet tested on a large data set, the encouraging preliminary results suggest that this procedure may be adopted in the future to improve SiteLight's performance. Peptides can be divided into groups according to their alignment. This alignment can then be utilized in two ways: First, each group will yield a consensus that will be searched on the Template. Second, weights can be assigned to the peptides and/or positions according to their deviation from the consensus. All of the peptides will be searched on the Template. This may yield improved prediction.
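The two proposed uses of a peptide alignment (a group consensus, and weights reflecting deviation from it) can be sketched in Python as follows. This is only an illustration of the idea, not a tested extension of SiteLight. It assumes the peptides in a group are already aligned, gap-free, and of equal length, and it uses agreement with the consensus as a hypothetical weighting scheme.

```python
from collections import Counter


def consensus(aligned_peptides):
    """Most frequent residue at each column of an ungapped alignment."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_peptides)
    )


def peptide_weights(aligned_peptides):
    """Weight each peptide by the fraction of positions matching the consensus."""
    cons = consensus(aligned_peptides)
    return {
        peptide: sum(a == c for a, c in zip(peptide, cons)) / len(cons)
        for peptide in aligned_peptides
    }
```

Peptides that deviate strongly from their group consensus receive low weights, so they would contribute less when all peptides are searched on the Template.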
One of the measurements used for phage display peptide characterization is affinity to the Target molecule. Affinity data were not used in this study for two reasons. First, affinity data are either incomplete or completely missing for most of the libraries in the data set. This problem can be partially resolved by indirect affinity data. A potential substitute for direct affinity data is the number of appearances of the peptide. This number is assumed to reflect the prevalence of the peptide in the postpanning library. Because Biopanning is based on the principle that high-affinity binders are enriched with selection, if the representative sample of sequenced peptides is big enough, the frequency of the peptide in the library is likely to reflect the affinity. Second, no correlation was found between the affinity and the "goodness" of the binding site mapping. Assuming that such a correlation should exist is not straightforward. In some libraries peptides with improved affinity (i.e., higher affinity than the native substrate) were found (Lang et al. 2000). Such peptides are expected to differ from the native binding site with respect to binding site location and/or residue composition. A good peptide for binding site mapping using SiteLight is one that (1) binds in the same or an overlapping location, (2) binds in a similar conformation, and (3) consists of residues similar to those of the native Template. Therefore, there is no simple correlation between peptide affinity and similarity to the native binding site. It might be interesting to examine this correlation when the binding location is confirmed, for example, by competitive elution, catalytic panning, or structure determination. In such cases, peptides that bind the Target with an affinity similar to that of the Template might better mimic the Template binding site.
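The indirect affinity measure mentioned above, the number of appearances of a peptide among the sequenced clones, reduces to a frequency count. A minimal Python sketch is given below; the function name and the example sequences are hypothetical, and the resulting frequencies are only a proxy for affinity under the enrichment assumption stated in the text.

```python
from collections import Counter


def frequency_weights(sequenced_clones):
    """Relative frequency of each peptide among the sequenced clones."""
    counts = Counter(sequenced_clones)
    total = sum(counts.values())
    return {peptide: n / total for peptide, n in counts.items()}
```

For example, `frequency_weights(["KWVHLF", "KWVHLF", "NRLLLT", "KWVHLF"])` assigns 0.75 to the peptide sequenced three times and 0.25 to the peptide sequenced once.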
The SiteLight algorithm may also be applicable to other biological problems in addition to phage-based binding site mapping. SiteLight attempts to answer a broader question than mapping a 1D peptide sequence to a 3D protein structure. Because the sequence order of the peptide is disregarded, SiteLight can also be applied to 3D–3D amino acid similarity detection. The main reason for not using the peptide sequential order is the inclusion of semicombinatorial libraries in the data set. Some of the peptides were parsed (see Materials and Methods) and their order was lost. This unique feature of SiteLight, performing sequence alignment without sequence order, might be applicable to searching for promiscuous activities. These are typically found by a search of a random library of ligands (James and Tawfik 2001). Mimicry of known interaction sites can be searched for by creating a series of peptides that represent the amino acid composition of a known binding site. These Virtual Peptide libraries can be used to search a database of structures to discover similar sites in unrelated structures.
Conclusions
Here we illustrate that random phage-display peptide libraries can be applied to binding site mapping on a 3D structure. SiteLight provides a vehicle for such an application. SiteLight maps binding sites consisting of contiguous residues as well as those that constitute "truly" 3D conformational epitopes. The algorithm is highly efficient and effective. It successfully remaps short peptides (3–20 amino acids long) to the sites they were derived from, even on large 3D protein structures.
SiteLight was able to reduce the surface by 75% without excluding the binding site. The reduced surface included at least one solution that overlaps with the interface by at least 50% in 63% of the cases. Although some trends appear to occur, no firm conclusions can be drawn regarding the applicability of this method to different molecular groups (antigen–antibody, protein–protein, etc.) because of the limited size of the current data set. This limitation also holds with respect to the comparison between different phage display peptide library types.
In particular, this study appears to validate the applicability of phage-display libraries for automated binding site prediction on 3D structures, and as such, suggests the feasibility of their further broadened utility.
| Materials and methods |
|---|
Peptides parsing
In combinatorial libraries the entire mutated fragment of each peptide is used without further parsing. However, semicombinatorial libraries contain mutated and nonmutated parts. If the entire peptide were used, recognition of the binding site could be trivial, because the nonmutated parts would obviously match the interface. Thus, the peptides were parsed to imitate combinatorial library-derived peptides as much as possible. For all libraries, only the mutated regions were used for the binding site search. Positions that were mutated by fewer than eight amino acids were removed. Two mutated positions were linked if the sequential distance between them did not exceed four positions. Thus, for example, if position 1 in a peptide was mutated by 8 residues, position 2 by 2, position 3 by 4, position 4 by 12, and position 5 by 10 residues, the new peptide would consist only of positions 1, 4, and 5. Because our matching is 1D to 3D, the order on the chain can be disregarded. By omitting positions 2 and 3, we do not use information that would otherwise straightforwardly lead to matching to the binding site. Further, if the distance between positions mutated by at least eight residues is larger than four residues, the peptide is cut into short fragments, where each fragment contains strictly the highly mutated residue positions. The minimum fragment length is three residues. If a crystal structure of the Library Template (defined in Fig. 1) is available, the peptides are joined if their residues are next to each other in three-dimensional space.
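The parsing rules above can be illustrated with a minimal Python sketch. The function name `parse_peptide`, its parameters, and the decision to discard fragments shorter than three residues are assumptions made for illustration; the sketch also omits the optional joining of fragments that are adjacent in a crystal structure.

```python
def parse_peptide(variability, min_variants=8, max_gap=4, min_len=3):
    """Split a semicombinatorial peptide into fragments of highly mutated positions.

    variability[i] is the number of distinct amino acids observed at
    position i + 1 of the library. Returns a list of fragments, each a
    list of 1-based positions.
    """
    # Keep only positions mutated by at least `min_variants` amino acids.
    kept = [i + 1 for i, v in enumerate(variability) if v >= min_variants]

    # Link consecutive kept positions whose sequential distance is at most
    # `max_gap`; a larger distance starts a new fragment.
    fragments, current = [], []
    for pos in kept:
        if current and pos - current[-1] > max_gap:
            fragments.append(current)
            current = []
        current.append(pos)
    if current:
        fragments.append(current)

    # Assumption: fragments below the minimum length are dropped.
    return [f for f in fragments if len(f) >= min_len]


# The worked example from the text: positions mutated by 8, 2, 4, 12, 10 residues.
print(parse_peptide([8, 2, 4, 12, 10]))
```

For the example in the text, the only surviving fragment consists of positions 1, 4, and 5.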
Algorithm description
The SiteLight algorithm can be divided into three main stages: creation of surface patches, matching surface patches with phage library-derived peptides, and scoring the solutions to assess their correctness.
Stage 1: Creation of surface patches
There are three steps in the creation of surface patches:
1. Molecular shape representation.
This step computes the molecular surface of the molecule. First, a high-density Connolly surface is generated by the Molecular Surface program (Connolly 1983a,b). The Connolly surface is generated by rolling a probe ball over the van der Waals surfaces of the atoms of the molecule. Three types of shapes are created: convex, saddle, and concave. A sparse surface representation is then computed (Lin et al. 1994), consisting of a limited number of critical points disposed at key locations over the surface. The sparse surface representation is composed of three types of points nicknamed caps, belts, and pits. These correspond to the face centers of the convex, saddle, and concave areas, respectively. A cap point belongs to one atom, a belt to two atoms, and a pit to three atoms.
2. Surface-distance calculation.
Based on the set of sparse critical points, we construct a graph (Graph 1 below). The graph represents surface distances between two residues. Each vertex V is an atom center that belongs to a surface residue. A residue is defined as a surface residue if at least one of its atoms is assigned to a critical point. This definition is very loose, and reflects our goal of exploiting crystallographic data in an imprecise manner. It enables application of this algorithm to low-resolution unbound and modeled structures. An edge connects two atoms u and v if they share a "critical" point. Therefore, a cap does not create edges, whereas a belt and a pit create two and three edges, respectively. To create a walk between every two Cα-atoms in the graph, Cα-atoms with a zero degree, that is, unconnected vertices, are connected to the closest atom of the same residue that is either a belt or a pit. The geodesic distance between two connected atoms is calculated and assigned to the connecting edge. The surface distance between two residues is calculated as the shortest path between the corresponding Cα-atoms. Graph 1 is then
[Graph 1 definition: equation images not available]
An example for Graph 1 is presented in Figure II in the supplemental material.
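As a rough illustration of the surface-distance step, the sketch below computes the shortest path between two atoms in a weighted, undirected graph. Dijkstra's algorithm is used here as an assumption; the text does not name the shortest-path method. The function name `surface_distance`, the atom labels, and the edge lengths are hypothetical.

```python
import heapq


def surface_distance(edges, source, target):
    """Length of the shortest path between two atoms of Graph 1.

    edges: iterable of (atom_u, atom_v, geodesic_length) for atoms that
    share a critical point. Returns float('inf') if no path exists.
    """
    # Build an undirected adjacency list.
    adjacency = {}
    for u, v, w in edges:
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))

    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in adjacency.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")


# Hypothetical toy graph: two C-alpha atoms joined directly (6.0) and via side-chain atoms.
toy_edges = [
    ("res1:CA", "res1:CB", 1.5),
    ("res1:CB", "res2:CB", 2.0),
    ("res2:CB", "res2:CA", 1.5),
    ("res1:CA", "res2:CA", 6.0),
]
print(surface_distance(toy_edges, "res1:CA", "res2:CA"))
```

In this toy graph the path through the side-chain atoms (1.5 + 2.0 + 1.5) is shorter than the direct edge, so it defines the surface distance between the two residues.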
3. Selecting patch members.
The goal of this stage is to divide the protein surface into overlapping patches. The number of possible patches equals the number of walks of K steps in graph G1, where K is the number of residues in a phage library peptide. If n is the number of residues in graph G1, then the number of walks is proportional to 2^n. Therefore, only some of the possible patches are created. Cα-atoms are iteratively used as patch centers. The patch radius is determined with respect to the average peptide length, X, according to Equation 1. All residues with a surface distance from the patch center lower than the patch radius are regarded as members of that patch. Because nonidentical centers can produce identical patches, the patches are processed to remove multiple appearances. Therefore, the number of patches is equal to or smaller than the number of surface residues. This method explores the patch solution space only partially, and creates nearly ball-shaped patches. A correction to this nonuniform space sampling is achieved by Equation 1. The average number of residues in patches cut by this radius is higher than X, the average peptide length. If X = 5.0, the patch radius is 8.775 Å, and can include seven residues. The number of ordered arrangements of five residues out of a group of seven residues is 7!/(7−5)! = 2520. In other words, the alignment of a five-residue peptide with a seven-residue patch can be compared with an alignment of a five-residue peptide with 2520 five-residue patches. The patch radius is
[Equation 1: patch radius as a function of the average peptide length X; equation image not available]
with Va, Vb being vertices belonging to the target and mapped peptide.
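The patch-selection step can be sketched as follows. Because Equation 1 is not reproduced here, the radius is passed in as a parameter (8.775 Å is the value quoted for X = 5.0). The function name `build_patches` and the toy distance table are assumptions for illustration only.

```python
import math


def build_patches(distances, radius):
    """Create one patch per center residue and remove duplicate patches.

    distances[c][r] is the surface distance between residues c and r
    (with distances[c][c] == 0). Returns a set of frozensets.
    """
    patches = set()
    for center, row in distances.items():
        # Members are residues closer to the center than the patch radius.
        members = frozenset(r for r, d in row.items() if d < radius)
        patches.add(members)  # identical patches collapse into one entry
    return patches


# Hypothetical surface distances (in Å) between three residues.
toy_distances = {
    "A": {"A": 0.0, "B": 3.0, "C": 10.0},
    "B": {"A": 3.0, "B": 0.0, "C": 9.0},
    "C": {"A": 10.0, "B": 9.0, "C": 0.0},
}
patches = build_patches(toy_distances, radius=8.775)
print(len(patches), "patches from", len(toy_distances), "centers")

# Ordered arrangements of five residues out of seven: 7!/(7-5)!
print(math.perm(7, 5))
```

Centers A and B yield the same patch, so the number of patches is smaller than the number of surface residues, as noted in the text.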
Stage 2: Matching surface patches with phage library-derived peptides
There are two steps in this stage:
1. 3D sequence matching.
The goal of this step is to match peptide residues to patch residues. Because the patch lacks sequential order, sequence alignment methods cannot be used. Because the peptide structures are usually unknown, structural alignment methods cannot be used either. Although phage display-derived peptides are often structured, which is uncommon for peptides in general, their structures are rarely solved.
To align 1D data from a peptide to 3D data from a surface patch, we use a maximal bipartite matching algorithm. Each surface patch is matched with each peptide. For each match a bipartite graph, Graph 2, is created. The graph is composed of two parts: vertices representing patch residues, and vertices representing peptide residues. All possible pairs of patch and peptide residues are connected by edges. Here an edge represents similarity between two residues. The edge score is determined by a similarity matrix detailed below (Table 3 in Supplemental Material). The maximal bipartite matching algorithm is used to determine the best alignment between the patch and the peptide. A set of edges, M, of a graph G(V,E) with no self-loops is called a matching if every vertex is incident to at most one edge of M (Horwitz 1989). Here V is the set of vertices and E is the set of edges. The bipartite matching complexity is given by Equation 2, where n is the number of patch residues and m is the number of peptide residues.
[Equation 2: bipartite matching complexity in terms of n and m; equation image not available]
Graph 2:
[Graph 2 definition: equation images not available]
with Va, Vb being vertices belonging to the target and mapped peptide.
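To make the order-free alignment concrete, the sketch below finds the highest-scoring one-to-one assignment of peptide residues to patch residues. It is not the maximal bipartite matching algorithm used by SiteLight: for brevity it enumerates all assignments by brute force, which is only practical for short peptides and small patches. The scoring function `toy_similarity` and its residue classes are hypothetical stand-ins for the McLachlan-based matrix.

```python
from itertools import permutations

# Hypothetical chemical classes; a stand-in for the real similarity matrix.
_CLASS_OF = {
    aa: name
    for name, members in {
        "hydrophobic": "AVLIMFW",
        "polar": "STNQ",
        "positive": "KRH",
        "negative": "DE",
    }.items()
    for aa in members
}


def toy_similarity(a, b):
    """2 for identical residues, 1 for the same class, 0 otherwise."""
    if a == b:
        return 2
    class_a = _CLASS_OF.get(a)
    return 1 if class_a is not None and class_a == _CLASS_OF.get(b) else 0


def best_alignment(peptide, patch, similarity):
    """Return (score, pairs) of the best one-to-one peptide-to-patch assignment."""
    if len(peptide) > len(patch):
        raise ValueError("this sketch assumes the patch is at least as long as the peptide")
    best_score, best_pairs = float("-inf"), []
    # Every ordered choice of len(peptide) patch residues is one candidate alignment.
    for chosen in permutations(range(len(patch)), len(peptide)):
        score = sum(similarity(p, patch[j]) for p, j in zip(peptide, chosen))
        if score > best_score:
            best_score = score
            best_pairs = [(p, patch[j]) for p, j in zip(peptide, chosen)]
    return best_score, best_pairs


score, pairs = best_alignment("KDL", ["E", "R", "I", "S"], toy_similarity)
print(score, pairs)
```

The score is the sum of the edge scores in the chosen matching, and the peptide order plays no role in which patch residues may be paired. For a five-residue peptide and a seven-residue patch, the loop visits 2520 candidate assignments, which is why a polynomial-time matching algorithm is preferable in practice.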
2. Similarity matrix.
Amino acid similarity can be quantified according to geometric criteria (size, shape), chemical properties, and frequency of replacement in sequences, surfaces, or binding sites. Because we are looking for binding mimicry, we have used chemical similarity to score the amino acid pairs. The matrix we used is based on the one proposed by McLachlan (1972) and is presented in Table 3 in the Supplemental Material.
Stage 3: Scoring and correctness assessment
The score of each match was determined according to the best alignment found by the maximal bipartite matching algorithm. The scores of the edges that participate in the alignment are summed. This score is expected to reflect the degree of similarity between the peptide and patch matches. The matches were sorted according to this score. A high score is equivalent to a high rank of a solution.
High-scoring matches are iteratively selected until 25% of the Template protein is covered. The number of selected matches can therefore vary even between proteins of the same size (i.e., the same number of surface residues). If patches corresponding to high-scoring matches overlap to a large extent, then the number of selected matches needed to cover 25% of the Template is larger than when the patches overlap only modestly. This method was chosen over a fixed number of selected matches because it guarantees reduction of the effective surface. In contrast, even a small fixed number of selected matches can include large portions of the Template's surface. Such a prediction does not contribute to the identification of the binding site location. The now reduced surface is expected to include the binding site between the Template and the Target.
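The coverage-based selection can be sketched as a simple greedy loop. Two details are assumptions rather than statements from the text: coverage is measured as a fraction of surface residues, and selection stops before the 25% cap would be exceeded. The function name `select_until_coverage` and the example patches are hypothetical.

```python
def select_until_coverage(ranked_patches, n_surface_residues, max_fraction=0.25):
    """Pick top-ranked patches until the coverage cap would be exceeded.

    ranked_patches: patches (sets of residue ids) sorted best score first.
    Returns (selected_patches, covered_residues).
    """
    limit = max_fraction * n_surface_residues
    covered, selected = set(), []
    for patch in ranked_patches:
        if len(covered | patch) > limit:
            break  # adding this patch would exceed the allowed coverage
        covered |= patch
        selected.append(patch)
    return selected, covered


# Hypothetical protein with 20 surface residues (cap: 5 residues).
overlapping = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {6, 7, 8}]
disjoint = [{1, 2, 3}, {6, 7, 8}]

chosen_a, covered_a = select_until_coverage(overlapping, 20)
chosen_b, covered_b = select_until_coverage(disjoint, 20)
print(len(chosen_a), sorted(covered_a))
print(len(chosen_b), sorted(covered_b))
```

With heavily overlapping patches, three matches are selected before the cap is reached; with disjoint patches, only one is. This mirrors the observation that the number of selected matches varies between proteins of the same size.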
Because all of the Template molecules we use as a data set (presented in Table 1) derive from complexes, the "correct" answer can be calculated. Interface residues (i.e., Template residues that are spatially proximal to the Target) are defined using geometric hashing. The atoms of the Template are inserted into a geometric hash with a 0.5 Å³ bin. The hash is queried by Target atoms with a 4 Å threshold. The distance between the Target atom used for the query and each of the query results is calculated. If it is lower than 4 Å, the Template atom is defined as an interface atom. A residue is defined as an interface residue if any of its atoms is an interface atom. The correctness of each match is represented by the patch overlap with the interface, which is calculated according to Equation 3. If A is the number of patch residues that are interface residues, and B the number of patch residues, the match correctness is:
[Equation 3: match correctness in terms of A and B; equation image not available]
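A minimal sketch of the interface definition and the correctness measure follows. Several points are assumptions: the hash is a uniform grid with 0.5 Å bin edges, the correctness is taken as the plain ratio A/B (Equation 3 is not reproduced here and may be expressed as a percentage), and the names `interface_residues` and `match_correctness` and the toy coordinates are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def _bin(xyz, bin_size):
    """Integer grid cell containing the point xyz."""
    return tuple(int(math.floor(c / bin_size)) for c in xyz)


def interface_residues(template_atoms, target_atoms, cutoff=4.0, bin_size=0.5):
    """Template residues with at least one atom closer than `cutoff` to a Target atom.

    template_atoms: list of (residue_id, (x, y, z)); target_atoms: list of (x, y, z).
    """
    # Insert Template atoms into the spatial hash.
    grid = defaultdict(list)
    for residue, xyz in template_atoms:
        grid[_bin(xyz, bin_size)].append((residue, xyz))

    reach = int(math.ceil(cutoff / bin_size))  # cells to scan in each direction
    found = set()
    for xyz in target_atoms:
        base = _bin(xyz, bin_size)
        for offset in product(range(-reach, reach + 1), repeat=3):
            cell = tuple(b + o for b, o in zip(base, offset))
            for residue, other in grid.get(cell, ()):
                # Confirm each hash hit with an explicit distance check.
                if math.dist(xyz, other) < cutoff:
                    found.add(residue)
    return found


def match_correctness(patch, interface):
    """Assumed form of Equation 3: A / B, with A = |patch ∩ interface| and B = |patch|."""
    return len(patch & interface) / len(patch) if patch else 0.0


# Hypothetical coordinates in Å.
template = [("R1", (0.0, 0.0, 0.0)), ("R1", (1.0, 0.0, 0.0)),
            ("R2", (10.0, 0.0, 0.0)), ("R3", (3.0, 0.0, 0.0))]
target = [(0.0, 3.0, 0.0)]

interface = interface_residues(template, target)
print(interface, match_correctness({"R1", "R3"}, interface))
```

Scanning every neighbouring cell is wasteful at this bin size but keeps the sketch short; the explicit distance check guarantees that only atoms within the threshold are accepted.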
Peptide A
Peptide A is defined as a peptide that can be mapped to the correct binding site: when aligned there, it should score higher than when it is aligned by SiteLight to other parts of the protein. According to its definition, Peptide A is sufficient for binding site prediction. To choose such a peptide for a specific library, a few runs of SiteLight were performed. In each of these the input library consisted of a single peptide. Thus, the number of runs needed for the selection of Peptide A equals the size of the library (i.e., the number of peptides). The selection of Peptide A was based on the following criteria: (1) the number of predicted residues; (2) the highest rank of the solution that overlaps at least 25% of the interface; (3) the highest overlap (percentage) of the solution with the interface; (4) the rank of the solution with the largest overlap with the interface; and (5) the rank of the solution with the largest interface overlap percentage. The definitions of these criteria are given in the legend to Table 2.
| Acknowledgments |
|---|
| References |
|---|
Aloy, P., Querol, E., Aviles, F.X., and Sternberg, M.J.E. 2001. Automated structure-based prediction of functional sites in proteins: Applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J. Mol. Biol. 311: 395–408.
Armon, A., Grauer, D., and Ben-Tal, N. 2001. ConSurf: An algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J. Mol. Biol. 307: 447–463.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The protein data bank. Nucleic Acids Res. 28: 235–242.
Bliznyuk, A.A. and Gready, J.E. 1999. Simple method for locating possible ligand binding sites on protein surfaces. J. Comp. Chem. 20: 983–988.
Bock, J.R. and Gough, D.A. 2001. Predicting protein–protein interactions from primary structure. Bioinformatics 17: 455–460.
Bogan, A.A. and Thorn, K.S. 1998. Anatomy of hot spots in protein interfaces. J. Mol. Biol. 280: 1–9.
Connolly, M. 1983a. Solvent-accessible surfaces of proteins and nucleic acids. Science 221: 709–713.
Connolly, M. 1983b. Analytical molecular surface calculation. J. Appl. Crystallogr. 16: 548–558.
Dandekar, T., Snel, B., Huynen, M., and Bork, P. 1998. Conservation of gene order: A fingerprint of proteins that physically interact. Trends Biochem. Sci. 23: 324–328.
Davies, J.M., Scealy, M., Cai, Y., Whisstock, J., Mackay, I.R., and Rowley, M.J. 1999. Multiple alignment and sorting of peptides derived from phage-displayed random peptide libraries with polyclonal sera allows discrimination of relevant phagotopes. Mol. Immunol. 36: 659–667.
DeLano, W.L. 2002. Unraveling hot spots in binding interfaces: Progress and challenges. Curr. Opin. Struct. Biol. 12: 14–20.
DeLano, W.L., Ultsch, M.H., de Vos, A.M., and Wells, J.A. 2000. Convergent solutions to binding at a protein–protein interface. Science 287: 1279–1283.
DesJarlais, R.L., Sheridan, R.P., Dixon, J.S., Kuntz, I.D., and Venkataraghavan, R. 1986. Docking flexible ligands to macromolecular receptors by molecular shape. J. Med. Chem. 29: 2149–2153.
Enshell-Seijffers, D., Smelyanski, L., and Gershoni, J.M. 2001. The rational design of a type 88 genetically stable peptide display vector in the filamentous bacteriophage fd. Nucleic Acids Res. 29: E50, 1–13.
Fariselli, P., Pazos, F., Valencia, A., and Casadio, R. 2002. Prediction of protein–protein sites in heterocomplexes with neural networks. Eur. J. Biochem. 269: 1356–1361.
Ferrer, M. and Harrison, S.C. 1999. Peptide ligands to human immunodeficiency virus type 1 gp120 identified from phage display libraries. J. Virol. 73: 5795–5802.