|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139 USA
2 Department of Biomedical Engineering, Boston University, Boston, Massachusetts 02215, USA
3 Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
(RECEIVED August 16, 2007; FINAL REVISION October 23, 2007; ACCEPTED October 23, 2007)
| Abstract |
|---|
|
|
|---|
and all-β proteins) of cytokines. In comparison with the state-of-the-art threader RAPTOR, LTHREADER generates on average 20% more accurate alignments of interacting residues. Furthermore, in cross-validation tests, LTHREADER correctly predicts experimentally confirmed interactions for a common binding mode within the 4-helical long-chain cytokine family with 75% sensitivity and 86% specificity with 40% gain in sensitivity compared to RAPTOR. For the TNF-like family our method achieves 70% sensitivity with 55% specificity with 70% gain in sensitivity. LTHREADER combines information from multiple complex templates when such data are available. When only one solved structure is available, a localized PSI-BLAST approach also outperforms standard threading methods with 25%–50% improvements in sensitivity. Keywords: threading; protein interactions; statistical scores; cytokines; 4-helical bundles; TNF-like
| Introduction |
|---|
|
|
|---|
Currently, methods exist for predicting general protein–protein interactions (PPIs) that have achieved some degree of success, relying mostly on data obtained from high-throughput (HTP) experiments such as yeast-two-hybrid (Y2H) screens. However, extracellular ligand–receptor interactions are much more difficult to predict than general PPIs, and methods that work well for PPIs often fail when applied to ligand–receptor binding pairs. In particular, the lack of HTP experimental data for these interactions makes it difficult to apply existing prediction methods that depend on this information (see Related work section below). The HTP experiments such as Y2H that can monitor interactions in the intracellular environment are not appropriate for investigation of extracellular proteins.
We consider the problem of predicting whether an extracellular ligand and receptor interact, given only their sequence information and several confirmed ligand–receptor PPIs among members of the same structural SCOP family (Murzin et al. 1995). Even when one or more complex structures are available within a ligand–receptor family, it is often a challenge to effectively use this information to predict interactions among other members of the family. One reason is the difficulty in identifying the interacting residues that are common among distant family members. The conformational differences that often occur at the interface of bound proteins make such identification nonobvious. In Figure 1 we compare the structural alignments for two families of cytokines. The global structural alignment methods do not generate accurate alignments at the interfaces (RMSD errors of 4.09 Å and 2.75 Å for the 4-helical and TNF-like families, respectively). The alignment of only one interacting domain (e.g., ligand or receptor) from the complex also leads to poor alignment at the interface. In comparison, when only the interaction region was considered, the alignment is much improved (RMSD errors of 1.96 Å and 1.73 Å, respectively). Clearly, the "localized alignment" based on just the interface residues is able to capture the structural variation that exists on the interface surface. By generating templates based on just the interface surface, we are able to better capture this variation.
|
Related work
Many computational approaches have been applied to prediction of PPIs, such as threading of structural complexes (Bowie et al. 1991; Jones et al. 1992; Lathrop and Smith 1996; Lathrop et al. 1998; Jones 1999; Panchenko et al. 2000; Xu et al. 2003a; Bienkowska and Lathrop 2005; Zhang et al. 2005); homology modeling and statistical potentials (Aloy and Russell 2002; Lu et al. 2002; Aloy and Russell 2003; Lu et al. 2003; Aytuna et al. 2005; Davis and Sali 2005; Davis et al. 2006; Pieper et al. 2006); correlated mutations (Lichtarge et al. 1996; Pazos et al. 1997; Olmea et al. 1999; Goh et al. 2000; Tan et al. 2004); and docking methods using physical force fields (Smith and Sternberg 2002; Mendez et al. 2003; Janin 2005; Summa et al. 2005). However, the performance of all of these methods is highly dependent on the accuracy of the alignment to the structural template and, for distantly related proteins, is more prone to errors. For example, the PPI predictor InterPrets (Aloy and Russell 2003) cannot find a confident match for any of the sequences from the cytokine families that we consider. Integrative machine learning methods also have been applied to prediction of PPIs and networks (Qi et al. 2005; Singh et al. 2006). Many of these approaches rely on HTP experimental PPI data itself as a predictor, and this information is scarce for ligand–receptor pairs.
Contributions
This paper proposes a novel threading algorithm, LTHREADER, which incorporates secondary structure (SS) and relative solvent accessibility (RSA) predictions with residue contact maps to guide and constrain alignments. While existing threading algorithms (e.g., RAPTOR) (Xu et al. 2003a) are not so successful at aligning interacting residues in sequences with low homology (Xu et al. 2003b), LTHREADER achieves much higher accuracy. The improvements were achieved by introducing a concept of localized threading that, which focuses on generating accurate alignment for the putative binding interface. When multiple structural complexes are available for a ligand–receptor family, our algorithm uses alignment of contact maps to generate accurate local templates for the interaction regions. Given interaction data from gold-standard low-throughput experiments, LTHREADER predicts ligand–receptor interactions using statistical scores: statistical potential, correlated mutations, and residue conservation. We demonstrate that just with the localized threading and a single complex structure the accuracy of prediction is improved. The addition of multiple complex data further increases the accuracy.
We apply our algorithm to the cytokines, performing significantly better than existing in silico methods. We investigate two structurally distinct cytokine families: 4-helical bundle cytokines and the TNF-like family belonging to the all-β structural class. Cytokine interactions with receptors are particularly difficult to predict because they display a high level of structural similarity but almost no sequence similarity, preventing the effective use of simple homology-based methods or general threading techniques. Those methods perform very well when there is significant sequence similarity among sequences that can be determined by PSI-BLAST (Altschul et al. 1997). Furthermore, experimental interaction data for cytokines is available only from low-throughput methods, and the structures for only a few cytokine–receptor complexes have been determined. Therefore, given the variability in sequence and structure, accurate prediction of cytokine interactions is a good indicator of the success we can achieve with our algorithm. Finally, our method predicts previously undocumented cytokine interactions which may have implications for disease. We evaluate the significance of our predictions by comparing them to those of randomized interaction surfaces.
| Results |
|---|
|
|
|---|
|
Alignment of interacting residues
LTHREADER employs contact maps between ligand and receptor to align interface regions in complexes of proteins belonging to distantly related families. The contact maps generated from the set of template complexes representing the 4-helical bundle and TNF-like families in our data set showed local similarity in the interface region (Fig. 3). Figure 3 illustrates that, in the interaction regions defined on ligand and receptor sequences, interacting residues have similar patterns of contacts in similar complexes. Despite the low similarity of the cytokine sequences, the similarity of the contact maps is apparent. The contact maps from multiple complexes were aligned using the algorithm described in Materials and Methods.
|
|
|
When threading the low-similarity cytokine sequences onto the templates, we achieved better results with LTHREADER than either RAPTOR or PSI-BLAST profiles despite the fact that all methods used the same structural templates and RAPTOR used the same secondary structure and relative solvent accessibility information. Table 2 shows how successful each algorithm was at identifying the locations of interacting residues. We see that even with low sequence similarity (between 15% and 25%), LTHREADER performed well at identifying interacting residues while RAPTOR struggles. This is not surprising as RAPTOR's accuracy, like most standard threaders, decreased as the sequence similarity to the template decreased (Xu et al. 2003a). We could not compare our threading results with the MULTIPROSPECTOR (Lu et al. 2002) threader since the program was not publicly available. The individual improvements in the accuracy of alignment by LTHREADER were substantial, ranging from 6% to 32% (Table 2). For 21 out of 24 cross-threaded complexes LTHREADER significantly improved the accuracy of the alignment at the interface. In the three cases when LTHREADER did not perform as well as RAPTOR, the accuracy is lower by 1% for the EPO–EPOR complex threaded onto the GH–GHR template and lower by 2% and 4% for threading TNFSF10–TNFRSF10B onto the TNFSF13–TNFRSF13B and TNFSF13B–TNFRSF13C templates. In the few cases where LTHREADER performed worse, the decrease in accuracy is minimal and is caused by wrongly identified core boundaries. Notably localized PSI-BLAST profiles also improve the alignments over RAPTOR by an average of 10%. It is important to note that in the case of PSI-BLAST only one complex template is required for alignment. Thus, localized threading with PSI-BLAST provides an adequate approach to address cases when only one complex template is available within a ligand–receptor family.
|
Prediction of ligand–receptor interactions
From the alignment of multiple complexes we have identified the core interaction regions in the sequences of both ligands and receptors. For each core region in a template complex we constructed a generalized sequence profile as described in Materials and Methods. We then aligned the query sequences to the template profiles; the query residues aligned to the interacting template residues define the putative interaction surface. This stage of LTHREADER uses the putative interaction surface to calculate the surface complementarity, scores for a pair of ligand–receptor sequences and learns using the available experimental data, what distinguishes interacting from noninteracting ligand–receptor pairs. Below we compare the performance of different scoring methods and effects of the score normalizations.
LTHREADER integrating multiple statistical scores outperforms any single-scoring method in predicting ligand–receptor interactions. First, we investigated contributions of single scores and combinations of four different surface complementarity scores: statistical potentials (SP), correlated mutations (CM), conserved residues (CR), and physical force fields (FF). Each of those scores is described in more detail in Materials and Methods. In Figure 5 we show the distributions of the calculated scores for interacting and noninteracting pairs in the principal component space for both families. From the distributions one can infer that there is some, but not an exceptional, degree of clustering within true positive (interacting) or true negative (noninteracting) pairs. Among different machine learning approaches (SVMs, decision trees, regression), we chose to use a decision tree classifier to combine our stand-alone scoring methods due to the small size of our data set. The inclusion of all scores resulted in higher prediction accuracy than the individual scoring methods, even when the latter are given the same alignments of the interaction surface. In order to measure the improvement of the integrated solution over the individual scoring methods, we compared the sensitivity and specificity of each one to that of the integrated solution for both families (Table 3). The performance was determined using leave-one-out cross validation using the data sets and structural complexes described in Materials and Methods. The initial examination of the raw scores of the interaction surface revealed that for some receptors the scores were consistently high across all putative ligands (e.g., the CM score highly depends on the variability of sequences in the multiple sequence alignment [MSA]). Normalizing scores for the interaction surface, using the method described in Materials and Methods Equation 8, greatly improved the performance of the method for both the individual and the combined scores (Table 4). In summary, using both integrated scores and normalization leads to the best performance of the classifier. While the integrated solution had comparable specificity to the single-score-based methods, it had higher sensitivity for the 4-helical bundle and TNF-like cytokines (75% and 70%, respectively).
|
|
|
|
|
Finally, we measured the impact each scoring method had on the final prediction by recomputing all of our predictions multiple times, but removing a scoring function during each iteration. The results are shown in Table 6. Removing FF scores from our classifier has the least effect on the overall prediction accuracy while removing any of the other scores significantly reduces specificity and/or sensitivity in at least one of the cytokine families. The lack of improvement in prediction accuracy with FF included is likely a consequence of the high level of sensitivity of the AMBER force-field function to the accuracy of the alignments and of side-chain packing following threading. This result demonstrates that FF contributions should be omitted from the combined method since they are unlikely to reflect the favorable complementarity of the ligand–receptor binding interfaces.
|
|
|
|
| Discussion |
|---|
|
|
|---|
We hope to extend LTHREADER to further enhance structure-based PPI predictors (Singh et al. 2006) by refining models of the interaction regions. We plan to apply localized threading to predict other extracellular ligand–receptor families as well as general PPIs on a genome scale of the interactome. It would seem that the success of our approach depends on the availability of structural templates and orthologous sequences. As with cytokines, other therapeutically interesting extracellular ligand–receptor families often have several complex structures available. Thus, our method helps fill a void in predicting traditionally challenging, important for human diseases, ligand–receptor interactions. In the case when multiple complex structures are not available we have shown that a localized PSI-BLAST approach can improve interaction prediction. We plan to investigate combinations of these approaches to scale to whole genomes. As an initial starting point we plan to use the SCOPPI (Winter et al. 2006) database and classification of the protein–protein interfaces observed in structural complexes.
We hope to further improve the prediction accuracy of our methods by enhancing existing and developing new scoring functions that utilize randomized surfaces to better separate signal from noise. With the current accuracy of alignments generated by LTHREADER (or localized PSI-BLAST) the important contributions to the predictions come from the statistical type scores (SP, CM, CR) while the FF contributions are clearly too noisy. It is possible that for perfectly correct alignments FFs could prove beneficial. Also, due to the high computational intensity of FF calculations, it is clear that there is no justification to apply FFs on a scale of an entire interactome. Alternatively, we will investigate smoother energy functions derived from side-chain rotamer distributions that are more tolerant to small alignment errors.
We will make the program available to the broader community.
| Materials and Methods |
|---|
|
|
|---|
In the TNF-like family we focused on the 90's loop binding site on the receptor common to known structural complexes (Hymowitz et al. 1999). The TNF-like cytokine data set included 13 ligands and 21 receptors (available at the supplemental material Web site). Our template complexes consisted of five PDB structures listed in Table 7. The gold-standard positive and negative interaction set was taken from the results of the flow-cytometry assays reported in Bossen et al. (2006). The training set consisted of 20 positive and 20 negative interactions determined experimentally, and is also available at the supplemental material Web site.
For both families, the set of noninteracting pairs were chosen from the same ligands and receptors as those in the set of known interacting pairs to ensure that the classifier distinguishes complementarity of the interfaces rather than their composition. For each sequence we identified a set of orthologs from the available genomic databases. Since cytokines belong to families that were greatly expanded and diversified in mammalian evolution, we included the sequences from the following genomes: Mus Musculus, Canis Familiaris, Bos Taurus, Rattus Norvegicus, Pan Troglodytes, and Sus Scrofa. We initially addressed the challenge of calculating correlated mutation scores by insisting that ligands and receptors from the same family have the same set of orthologs. We thus had to omit S. Scrofa and P. Trogodytes orthologs for the 4-helical and TNF-like families, respectively. For each protein an MSA of orthologs was created using CLUSTALW (Higgins and Sharp 1988).
Algorithm
Figure 2 shows an overview of the LTHREADER algorithm and each stage is described in detail below.
Stage 1: Generation of localized profiles for interaction cores
In this stage, we assume that, if a set of ligands and receptors have similar structures and binding orientation, then their corresponding interface surfaces will have good alignment. We first examine the ligand–receptor pairs that have solved structures for their bound complex and align the ligand and receptor structures separately using POSA (Ye and Godzik 2005). Then, clusters of interacting residues are identified within these complexes and mapped to their corresponding ligand and receptor sequences, thus delimiting core regions of interaction within each sequence. Given a set (minimum two) of complexes, the positions of the cores are then optimized to ensure that the locations of the interactions contained in the clusters overlap as much as possible between complexes. Finally, generalized profiles are computed for each residue in the core regions of all pairs of ligand–receptor sequences.
Clustering of residue interactions
For two interacting domains in a complex structure we define the interface residues as those in contact with residues from the other domain. We define two residues to be in contact if the distance between any two of their heavy atoms is <4.5 Å. This cutoff is the same as that used by Lu et al. (2003) to determine statistical potentials for contacting residues.
We define a contact map as a matrix C such that ci,j = 1 if the ith residue of the ligand and the jth residue of the receptor interact, and ci,j = 0 if they do not. Given a contact map C, we group together clusters of interacting pairs (non-zero entries of C) by using a simple index-based distance function to determine inclusion. The distance between two interacting pairs {i1,j1} and {i2,j2} in C, where i1 and j1 are the ligand and receptor indices, respectively, for the first interacting pair, and i2 and j2, for the second pair, is defined as follows:
|
|
which indicates infinite distance when any two residues do not interact. Using k-neighbor joining clustering we identify contact clusters in a contact map. We chose k = 3 where k is defined by the distance measure dist in Equation 1. This choice of k clusters residues that are spatial nearest neighbors on the same side of a β-strand or
-helix as these secondary structures are defined by periodicities of i, i + 2 and i, i + 3. Interacting residue pairs that are separated by a distance, dist, less than four are considered members of the same cluster. A cluster in contact map C implies a corresponding sub-matrix whose non-zero entries are members of that cluster. Note that cluster edges delimit a contiguous sequence stretch on both the ligand and receptor sequences, referred to as a "core" (Fig. 8). Thus we can define a notation for indexing a cluster by the index of its corresponding cores in the ligand and receptor.
|
Alignment of clusters for a pair of ligand–receptor complexes
The next step of our algorithm optimizes the length and location of cores within a pair of ligand–receptor complexes so that the similarity score of corresponding clusters is maximized. Let C be the contact map for the first complex, and D be the contact map for the second complex. Let m be the number of cores in the ligands for both complexes, and n the number of cores in the receptor for both complexes. Let C k,l refer to the k,l-th cluster in C, and D k,l to the corresponding k,l-th cluster in D. We set the height (ligand axis in Fig. 8) and width (receptor axis in Fig. 8) of both sub-matrices to the maximum of the height and width of each sub-matrix. (Note that this accounts for the rare case when two clusters in one complex map to a single larger cluster in another.)
The precise alignment of the interaction cores is the goal of the following optimization procedure. For the k,l-th cluster we fix the starting position of C k,l , but allow the starting position of D k,l to vary. Let D k.l p,q be equal to D k,l offset by p along the first dimension of D and offset by q along the second dimension. Our goal then is to maximize the objective function,
|
|
for 1
k
m and 1
l
n subject to the following constraints: –4
p 1,..., pm
4 and –4
q 1,..., qn
4.
sim(A, B) is the measure of similarity between matrices A and B (both of dimension m x n) and is defined by the sum of all entries in the Hadamard product of the two matrices: sim(A, B) =
ai,jbi,j . Since there are only a few cores in the ligand and receptor (less than five in most cases), we use a brute-force iteration over all possible values of the offset variables p,q in order to maximize f.
Multiple alignment of interaction cores
The above method allows us to find the location of cores in the ligand and receptor sequences that maximizes the overlap of interacting residues between a pair of complexes. For more than two complexes in the training set, we extend the pairwise-alignment method in a way that optimizes their multiple alignment using a variant of the neighbor-joining method of Saitou and Nei (1987). At each step of the neighbor-joining procedure, we create a new contact matrix from the union of the Hadamard products of the contact matrices from each complex. The final result is a contact matrix representing the interaction surface common to all complexes (referred to as the average map; Fig. 3). From the multiple alignment of core regions, we construct a generalized profile from relative solvent accessibility (RSA), secondary structure (SS), and sequence at each interaction core position. RSA and SS values are calculated using DSSP (Kabsch and Sander 1983).
IRACC (interacting residues accuracy)
Given a multiple alignment of N complexes IRACC is defined by:
|
|
where iraccij is the alignment accuracy for a pair of template complexes i,j and is defined as:
. nalign (i, j) is the number of aligned interacting residue positions between two complexes and n min(i, j) is the minimum number of interacting residue positions in complexes i and j, the maximal number of contacts that can be aligned.
Stage 2: Threading of query sequences onto the template
In this stage we determine which residues in the query sequence pair would be part of the putative interaction surface by threading their sequences onto a template complex. To do this, we devise a localized threading algorithm that aligns sequences to the generalized profile of the interaction cores. The interaction cores can be localized in the sequence relative to secondary structures such as β-strands,
-helices, or coil regions.
In order to reduce errors, we first limit the search space to the region in the query sequence most likely to contain the interaction cores by using predicted SS from SABLE (Adamczak et al. 2005). In the template structure the interaction cores are localized to specific regions on the sequence delimited by the secondary structure elements:
-helices (H), β-strands (B), and loops (L). Aligning the predicted SS elements to the template structure elements identifies the likely positions of interaction cores. Specifically the alignment of secondary structure tags, where tag = (HLHLBLB...) and a score for a match is 1 and a mismatch –1 with 0 gap penalty, between the template and the predicted SS determines the position of the interaction cores in the query sequence.
Second, we predict RSAs for residues in the query sequence pair, again using SABLE. Finally, the generalized profile of the core calculated in the previous stage is used to search the query sequences using the predicted RSAs and SSs (Przybylski and Rost 2004). The search is performed by sliding a window of length equal to that of the core along the query sequence. The position, p, at which the window best matches the profile defines the location of the putative core. We search for interaction cores (ICs) within five residues before and after a predicted SS element that contains the core to account for SS prediction errors. We define ps and pe to be the start and end position, respectively, of a predicted SS element within the query sequence. We compute p, the position of the predicted IC within the query sequence restricted to positions between ps – 5 and pe + 5 as follows:
|
|
where aai+p is the amino acid, ssi+p is the predicted SS, and sai+p is the RSA of the residue at position i +p in the query sequence µi and
i are the mean and standard deviation, respectively, of the RSA at position i within the IC multiple alignment, and ssci is the SS of the core position and aact i is the amino acid from the template complex structure t.
is an indicator function for equality. N is the length of the IC multiple alignment profile and T is the total number of complex structures used as templates. For the sequence similarity matrix, SEQ, we will use BLOSUM62 (Henikoff and Henikoff 1992). We have adopted the relative weights of different score contributions, sequence (SEQ) versus structure (SS and RSA), as previously determined by others (Fischer 2000; Przybylski and Rost 2004).
Profile–profile alignments
An alternative to the above method of threading the query sequence onto the template is to use PSI-BLAST to compute sequence profiles of the query sequences and the template sequences and then perform a profile–profile alignment. In our tests, we use the log-average scoring method of von Öhsen and Zimmer (2001) to score profile alignments:
|
|
where
and β are amino acid frequency vectors at two different profile positions, prel is the probability distribution of related amino acid pairs, and pi is the background amino acid probability distribution. The value of prel (i,j)/pipj can be derived from the BLOSUM matrix and is equivalent to 2^(BLOSUM(i,j)/2). Only the sequence profiles corresponding to core regions are aligned and the search space within the query sequence is limited by using predicted SS values from SABLE as described above.
Stage 3: Scoring the interaction surface
After the interaction surface is determined for the ligand–receptor complex, it is scored and normalized as follows. Each contact from the aligned contact map calculated in Stage 1 is characterized by wij , the residue–residue distance averaged over the set of T complexes
, where dt ij is the Euclidean distance between pair of residues (Davis et al. 2006) in a complex t. The contact pairs in each complex map are used to calculate the total surface complementarity score as defined by:
|
|
where C(t) is the contact map of complex t defined by the interaction cores, and Sij is the score of the pair. In our studies of the cytokine families we included the following measures of different properties of the putative binding interface between proteins: statistical potentials, correlated mutations, residue conservation, and force-field energies. Each is described in detail below. The putative binding interface is defined by the alignment of query sequences to complex templates generated by LTHREADER.
Statistical potentials (SP)
For each residue pair located in the interaction surface, we assign a pairwise potential energy. This energy is not calculated from the physical force fields but, instead, is statistically derived from a set of known pairwise interactions between residues in solved structures. In our case, we use the pairwise potentials determined by Lu et al. (2003). To compute the SP score, we calculated the weighted sum of the potentials corresponding to all interacting residue pairs as defined by Equation 6.
Correlated mutations (CM)
In order to calculate this score, we first obtain a multiple sequence alignment (MSA) for each ligand–receptor sequence SL , SR from a set of orthologous species common to both the ligand and receptor. Let X1 ,...,XN be the sequences in the MSA for SL , and Y1 ,...,YN be the sequences in the MSA for SR . We then compute the Pearson correlation between positions i and j in SL and SR , respectively, as in Pazos et al. (1997).
|
|
Here, Dikl is the similarity between the residues at position i in sequences Xk and Xl , and Djkl is the similarity between the residues at position j in sequences Yk and Yl .
is the average sequence similarity at position i.
and
i is its standard deviation. We use the BLOSUM62 similarity matrix to compute D. Since we are interested in evaluating the likelihood of interaction, we only sum the correlation scores CMij over all pairs (i,j) within SL and SR that are within the putative interaction surface.
Conserved residues (CR)
This is a sequence-based scoring method for determining whether the conservation across species of the interacting residues in the threaded complex plays a predictive role. It is based on the assumption that residues that are contained within an interaction region are less likely to mutate than those outside of the region (Caffrey et al. 2004). We compute the conservation score at each residue position within the ligand and receptor from an MSA. The conservation score at the position i in the alignment is defined by average sequence similarity
as above.
.
AMBER force fields (FF)
This score is equal to the calculated energy of the putative interface surface within the threaded complex. We use the SCWRL 3.0 side-chain packing program (Canutescu et al. 2003) to first determine the coordinates for all the side-chain atoms in the ligand and receptor. Second, we fix atom positions for all residues that do not belong to the interface. Third, allowing the flexibility of interacting residues we perform 20 steps of conjugated gradient minimization using the molecular dynamics package BALL (Kohlbacher and Lenhof 2000) and the AMBER force fields (Ponder and Case 2003). The energy values typically reach a stable minimum after few steps of minimization. As the last step we compute the energy, FF, of the interface surface by applying the AMBER force-field function using BALL. The force fields are calculated only among the residues within the putative interaction surface and are not weighted by the averaged distance as are other scores. These calculations produce detailed all-atom interface models.
Normalization of scores
In order to put scores across all receptors and ligands on the same scale, we introduced the following formula to determine new normalized values for the scores. For each pair of ligand L and receptor R from the family we have the raw score S(L,R) calculated by one of the above methods S = {CM,SP,FF,CR}. The normalized scores are then given by:
|
|
Classification
For classification purposes we associate with the pair L and R, a vector of scores S LR = (s1,...,s4 ) that are generated from each of the scoring methods described above (when applied to L and R). We then use experimentally determined positive and negative interactions, to train a decision tree DT. We have used the publicly available DT software OC1 (Murthy et al. 1994), which uses information gain as a cost function and the oblique mode, as opposed to axis-parallel, of partitioning the attribute space (the score space). DT is then used to classify each pair based on S LR . We used decision trees because they provide a very intuitive understanding of the contributions and relative strengths of the different scoring variables used for prediction.
Randomized interaction surfaces
In order to estimate the significance of the predicted interaction for any ligand–receptor pair we have implemented the following probabilistic procedure. From all ligands and receptors within a family we create pools of ligand,
, and receptor,
, residues where r l and r r belong to the putative binding interface. For each ligand–receptor pair we generate 100 randomized interaction surfaces by grafting onto the interaction cores amino acids randomly drawn from pools P L and P R . We then score and classify them to determine f, the frequency at which the randomized surfaces are predicted to interact. 1 – f is the significance of predicted interactions within the ligand–receptor family for the nonrandomized surfaces.
| Footnotes |
|---|
Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.073178108.
| Acknowledgments |
|---|
|
|
|---|
| References |
|---|
|
|
|---|
Aloy, P. and Russell, R.B. 2002. Interrogating protein interaction networks through structural biology. Proc. Natl. Acad. Sci. 99: 5896–5901.
Aloy, P. and Russell, R.B. 2003. InterPreTS: Protein interaction prediction through tertiary structure. Bioinformatics 19: 161–162.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402. doi: 10.1093/nar/25.17.3389.
Aytuna, A.S., Gursoy, A., and Keskin, O. 2005. Prediction of protein–protein interactions by combining structure and sequence conservation in protein interfaces. Bioinformatics 21: 2850–2855.
Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., et al. 2002. The Protein Data Bank. Acta Crystallogr. D Biol. Crystallogr. 58: 899–907.[CrossRef][Medline]
Bienkowska, J. and Lathrop, R. 2005. Threading algorithms. In Encyclopedia of genetics, genomics, proteomics, and bioinformatics (eds. M. Dunn et al.), pp. 3555–3568. J. Wiley, Hoboken, NJ.
Bossen, C., Ingold, K., Tardivel, A., Bodmer, J.L., Gaide, O., Hertig, S., Ambrose, C., Tschopp, J., and Schneider, P. 2006. Interactions of tumor necrosis factor (TNF) and TNF receptor family members in the mouse and human. J. Biol. Chem. 281: 13964–13971.
Bowie, J.U., Luthy, R., and Eisenberg, D. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253: 164–170.
Caffrey, D.R., Somaroo, S., Hughes, J.D., Mintseris, J., and Huang, E.S. 2004. Are protein–protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci. 13: 190–202.
Canutescu, A.A., Shelenkov, A.A., and Dunbrack Jr, R.L. 2003. A graph–theory algorithm for rapid protein side-chain prediction. Protein Sci. 12: 2001–2014.
Davis, F.P. and Sali, A. 2005. PIBASE: A comprehensive database of structurally defined protein interfaces. Bioinformatics 21: 1901–1907.
Davis, F.P., Braberg, H., Shen, M.Y., Pieper, U., Sali, A., and Madhusudhan, M.S. 2006. Protein complex compositions predicted by structural similarity. Nucleic Acids Res. 34: 2943–2952. doi: 10.1093/nar/gkl353.
Fischer, D. 2000. Hybrid fold recognition: Combining sequence derived properties with evolutionary information. Pac. Symp. Biocomput. 5: 119–130.
Goh, C.S., Bogan, A.A., Joachimiak, M., Walther, D., and Cohen, F.E. 2000. Co-evolution of proteins with their interaction partners. J. Mol. Biol. 299: 283–293.[CrossRef][Medline]
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915–10919.
Higgins, D.G. and Sharp, P.M. 1988. CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. Gene 73: 237–244.[CrossRef][Medline]
Hymowitz, S.G., Christinger, H.W., Fuh, G., Ultsch, M., O'Connell, M., Kelley, R.F., Ashkenazi, A., and de Vos, A.M. 1999. Triggering cell death: The crystal structure of Apo2L/TRAIL in a complex with death receptor 5. Mol. Cell 4: 563–571.[CrossRef][Medline]
Janin, J. 2005. Assessing predictions of protein–protein interaction: The CAPRI experiment. Protein Sci. 14: 278–283.
Jones, D.T. 1999. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287: 797–815.[CrossRef][Medline]
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. A new approach to protein fold recognition. Nature 358: 86–89.[CrossRef][Medline]
Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.[CrossRef][Medline]
Kohlbacher, O. and Lenhof, H.P. 2000. BALL—rapid software prototyping in computational molecular biology. Biochemicals algorithms library. Bioinformatics 16: 815–824.
Lathrop, R.H. and Smith, T.F. 1996. Global optimum protein threading with gapped alignment and empirical pair score functions. J. Mol. Biol. 255: 641–665.[CrossRef][Medline]
Lathrop, R.H., Rogers Jr, R.G., Smith, T.F., and White, J.V. 1998. A Bayes-optimal sequence-structure theory that unifies protein sequence-structure recognition and alignment. Bull. Math. Biol. 60: 1039–1071.[CrossRef][Medline]
Lichtarge, O., Bourne, H.R., and Cohen, F.E. 1996. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. 257: 342–358.[CrossRef][