1 Department of Computer Science and 2 Department of Biological Sciences, Purdue University, West Lafayette, Indiana 47907, USA
3 International Institute of Molecular and Cell Biology, 02-109 Warsaw, Poland
Reprint requests to: Alan M. Friedman, Department of Biological Sciences, Lilly Hall, Purdue University, West Lafayette, IN 47907, USA; e-mail: afried{at}purdue.edu; fax: (765) 496-1189; or Chris Bailey-Kellogg, Department of Computer Science, 6211 Sudikoff Laboratory, Dartmouth College, Hanover, NH 03755, USA; e-mail: cbk{at}cs.dartmouth.edu; fax: (603) 646-1672.
(RECEIVED April 30, 2004; FINAL REVISION August 20, 2004; ACCEPTED August 20, 2004)
| Abstract |
|---|
Tfa chaperone protein in which we plan dicysteine mutants for discriminating threading models by disulfide formation. Preliminary results from a subset of the planned experiments are consistent and demonstrate the practicality of planning. Our methods provide the experimenter with a valuable tool (available from the authors) for understanding and optimizing cross-linking experiments.
Keywords: protein structure prediction; protein-protein complexes; experiment design; cross-linking mass spectrometry; disulfide trapping; structural genomics
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.04846604.
| Introduction |
|---|
Several independent experiments have demonstrated successful application of the method, providing models that correlate with prior or subsequent crystal or NMR structures. Using Edman sequencing and mass spectroscopy of the cross-links, Haniu et al. (1993) developed a model of human erythropoietin via lysine-specific cross-linking, while Young et al. (2000; Kruppa et al. 2003) pioneered the use of high-resolution mass spectroscopy alone to discriminate threading models correctly. Cross-linking has also been used to determine quaternary arrangements of proteins (Hughes et al. 1993; Scaloni et al. 1998; Tellinghuisen and Kuhn 2000; Back et al. 2002; Trester-Zedlitz et al. 2003). These methods are particularly valuable for proteins, such as membrane proteins (Bass and Falke 1998; Kwaw et al. 2000), that are inherently resistant to traditional structure determination methods. Large sets of cross-links have also been treated as distance restraints in an alternative distance geometry structure determination protocol to determine the arrangement of transmembrane helices in lac permease (Sorgen et al. 2002), a case in which no models were available beforehand.
Experimentation with several proteins has thus demonstrated the effectiveness of cross-linking, whereas associated computational work has proposed techniques for both data interpretation (Cohen and Sternberg 1980; Bailey-Kellogg et al. 2001; Chen et al. 2001; Albrecht et al. 2002; Schilling et al. 2003) and analysis of model geometry (Young et al. 2000; Potluri et al. 2004). However, these efforts have not addressed the essential question of the information content available from a cross-linking experiment, a question required to determine and optimize the utility of conducting any particular experiment. Any realistic analysis must also include consideration of multiple sources of experimental error. Our contribution here addresses these requirements with a probabilistic analysis mechanism that explicitly accounts for the expected experimental limitations. We also develop associated algorithms for planning optimal experiments, subject to trade-offs in experimental design. Our mechanism enables the selection of the most suitable set of probes (e.g., different cross-linkers, possible mutations) to maximize experimental discrimination.
| Results and Discussion |
|---|
Once experimental data are collected, characterization of the set of observed (and potentially the unobserved) cross-links provides evidence regarding the consistency of the models with the data. An observed high-feasibility cross-link supports a model. A low-feasibility cross-link that is not observed can also support a model, once the likelihood of cross-link detection is explicitly considered. Conversely, unobserved high-feasibility and observed low-feasibility cross-links provide evidence against a model. To account for limitations in the experimental detection of cross-links and potential experimental errors, we include two parameters: the capture rate, indicating the rate of detection of feasible cross-links (i.e., 1 minus the capture rate equals the rate of false negatives), and the noise rate, the detection rate of spurious infeasible ones (i.e., false positives). Support for models provides probabilities (equation 1), which are used in a ratio to compare two models (equation 3). When one model is sufficiently better than every other, model selection results.
It is advantageous to consider the possible outcomes of probabilistic cross-link analysis before an experiment is conducted, to optimize experimental parameters and obtain the most information from an experiment. Similarly, if interpretation of the results of an experiment proves to be ambiguous, a subsequent experiment can be optimized to reduce the ambiguity. Variable experimental parameters include the cross-linker (particularly specificity and length) and the sequence itself, altered by planned mutations that are unlikely to affect the parent structure. For example, we could make a conservative change to Lys to introduce additional possible cross-links for BS3, or make nondrastic substitutions in two residues to the widely accepted Cys to test disulfide bond formation. Selecting cross-linker and mutation can be repeated, generating a family of experiments, each potentially providing additional information for model selection.
The central idea of planning is for experiments to probe features for which the models most disagree. We evaluate an experiment plan in terms of key properties of practical importance: cost, capturing the number and types of experiments and their relative difficulty; discriminability, a minimum difference by which we would like the score of the selected model to exceed that of any other model after data are collected; coverage, the number of model-pairs that are expected to achieve the desired discriminability; and balance, the desire to equalize the positive evidence for one model over another, so that models are not discriminated solely on negative data. The cross-link map provides a natural metric for testing discriminability, a directed cross-link map difference, counting the number of cross-links significantly favoring one model over another. Although many methods of making comparisons are possible, we have chosen a conservative pairwise one whose goal is to ensure that no matter which model wins, it will have been selected over every other by sufficient positive evidence. Our implementation then maximizes coverage and balance, for an experimenter-selected level of desired discriminability, while minimizing the number of experiments. Although full coverage of all directed pairs is a planning goal, it is not always possible (e.g., when two models are too similar) and often requires a large number of experiments. In practice, full coverage is not required for selecting a model, because, given experimental data, we must only find that one model is sufficiently better than the rest. Ranking all possible pairs is not required. The simulations below address the relationship between pairwise coverage and the likelihood of successful discrimination.
Our algorithm, XlinkPlan (see Materials and Methods), optimizes experiments by considering for each experiment which directed pairs of models the experiment could potentially discriminate by at least the desired discriminability. It selects a subset of experiments that will adequately cover the various pairs. This problem is NP-hard, implying that no algorithm is expected to solve all problem instances both optimally and efficiently. To solve this problem in practice, our algorithm adopts a heuristic, greedy approach, selecting the experiment that looks best in the current context. In particular, it keeps track of a weight on each model-pair, indicating how much coverage is still required. To select an additional experiment, it then identifies the one that adds the most coverage, according to the weights. This weighted approach optimizes for coverage and balance, and minimizes the number of experiments. We demonstrate here that this approach is extremely efficient and produces high-quality designs.
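The weighted greedy selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the XlinkPlan implementation: the function name `greedy_plan`, the input layout (a mapping from each experiment to the number of discriminating cross-links it is expected to contribute per directed model-pair), and the uniform initial weight `delta` are all hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # directed model-pair: (favored model, other model)


def greedy_plan(experiments: Dict[str, Dict[Pair, int]],
                pairs: List[Pair],
                delta: int,
                max_experiments: int) -> List[str]:
    """Greedily pick experiments by the coverage they still add.

    Assumption: experiments[name][pair] is the number of cross-links in
    that experiment expected to favor pair[0] over pair[1].
    """
    # Weight on each directed pair: how much coverage is still required.
    remaining = {pair: delta for pair in pairs}
    candidates = set(experiments)
    plan: List[str] = []

    def gain(name: str) -> int:
        # Only coverage that is still needed counts toward the gain.
        return sum(min(count, remaining.get(pair, 0))
                   for pair, count in experiments[name].items())

    while candidates and len(plan) < max_experiments and any(remaining.values()):
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=gain)
        if gain(best) == 0:
            break  # no remaining experiment adds required coverage
        plan.append(best)
        candidates.remove(best)
        for pair, count in experiments[best].items():
            if pair in remaining:
                remaining[pair] = max(0, remaining[pair] - count)
    return plan
```

Because both directions of each pair carry their own weight, an experiment that favors only one direction stops gaining once that direction is covered, which is one simple way to encourage balance.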
Residue-specific cross-linking
We first studied probabilistic discriminability analysis and experiment planning using lysine-specific cross-linking and three different proteins. The primary test case is basic fibroblast growth factor (FGF-2, PDB IDs 4FGF, crystal, and 1BLA, NMR) because of its earlier use in model discrimination by cross-linking (Young et al. 2000). Alternative threading models for FGF-2, using 12 of the published template structures, were obtained via the protein-fold-recognition meta-server (Kurowski and Bujnicki 2003); several of the published templates could not be suitably matched to the FGF-2 sequence given current threading programs queried by the server. Two of the models are of the same fold (β-trefoil) as the current structure, and the correct NMR structure (PDB ID 1BLA) is also included in the model set. The other test cases were chosen from CASP4 (Moult et al. 2001) targets with many high-quality models: deoxyribonucleoside kinase (PDB ID 1J90) and α-catenin (PDB ID 1L7C). Predicted models that are less complete than the correct one are ignored. In total we used 13 models for FGF-2, 85 models for deoxyribonucleoside kinase, and 50 models for α-catenin.
We consider five commercially available, water-soluble, and primary amine-reactive N-hydroxysuccinimide, sulfo-N-hydroxysuccinimide, or imido ester cross-linkers with different lengths between the reactive groups: sulfo-DST cross-links Lys Nζ to Lys Nζ at a distance of 6.4 Å, DSG at 7.7 Å, DMP at 9.2 Å, BS3 at 11.4 Å, and sulfo-EGS at 16.1 Å. Previously, only information on geometric feasibility has been used for making structural inferences from cross-linking. We follow that procedure here, while recognizing that accessibility and reactivity can be measured separately by reaction with monofunctional reagents (Novak et al. 2004). Because our probabilistic Bayesian approach allows ready incorporation of other information, future versions of our method will use such measurements to improve model discrimination.
Following earlier work (Young et al. 2000), for each model, we computed the Lys Cα to Lys Cα straight-line distance (the position of the reactive Nζ atom is generally both uncertain and mobile); this requires adding 12.4 (2 × 6.2) Å to the maximal cross-linker length to allow for the maximal Cα-Nζ side-chain length. Because distributions of cross-linker distances less than maximal are most highly populated in solution (Green et al. 2001), it is reasonable that potential cross-links with distances that are some value less than the maximum should be considered most feasible. At the same time, cross-links with distances exceeding the maximum are considered infeasible, whereas those in between are considered ambiguous. Our strategy of ignoring the ambiguous cross-links for model discrimination leads to a smaller number of utilized cross-links and thus a smaller probability of making a decision. However, it simultaneously reduces the possibility of using a spurious cross-link and thus increases the probability that, when a decision is made, it is a correct one.
Chemically, there are two components to the reduction in effective cross-linker length. One arises from the relative rarity of the maximally extended conformation of the cross-linker, and the other from lack of maximum extent and deviation from in-line orientation of the lysine side chains. For cross-linker BS3, the cross-linker conformation component is 2.5 Å (Green et al. 2001), and we estimate the same value for the side-chain component. This creates an ambiguous region 5 Å wide where cross-links are feasible but less probable. For BS3, this region extends from 19 Å to the maximum Cα-Cα distance of 24 Å. We have checked this ambiguity region against FGF-2 cross-linking data (Young et al. 2000), and have found that, as expected, the capture rate for geometrically feasible cross-links (<19 Å) is greater than that for the ambiguous ones (19-24 Å), 31% versus 24%. For the rest of the paper, then, we use a capture rate of about 1/3. In addition to the expected effect on the capture rate, the application of an ambiguity region is expected also to improve our ability to accurately classify potential cross-links as feasible or infeasible (increase the difference between the probabilities H and L). In all subsequent analyses, we define each cross-linker's ambiguity region as ranging from its maximum extent to 5 Å less.
The discriminability reflects the extent of confidence that we plan for in the selection of one model over another in a pair. It is the anticipated discriminability value for positive data if all possible cross-links in a planned set of experiments were detected without errors. Owing to the inevitability of errors, the expected level achievable on average is lower: the planned discriminability multiplied by the capture rate, minus the planned discriminability multiplied by the noise rate (see Materials and Methods). Thus, the experimenter must plan for discriminability greater than the level that is satisfactory for discrimination after collecting experimental data. An extreme example arises if we expect a low capture rate, as from residue-specific cross-linking; then we must require a high planned discriminability so that the expected contribution to discrimination is sufficient. Thus, when we plan for a discriminability of 6 but have a capture rate of 1/3 and a noise rate of 0.05, we can expect actually to observe 1.7 discriminatory cross-links on average in favor of the winning model.
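The arithmetic behind the 1.7 figure can be checked with a few lines. The relation used here, planned discriminability times (capture rate minus noise rate), is an inference from the numbers quoted in the text (a plan of 6, a capture rate of about 1/3, and a noise rate of 0.05) rather than a formula copied from the paper, and the function name is hypothetical.

```python
from fractions import Fraction


def expected_discriminating_links(planned, capture_rate, noise_rate):
    """Expected net number of discriminating cross-links actually observed.

    Assumed relation: planned * capture_rate - planned * noise_rate.
    """
    return planned * capture_rate - planned * noise_rate


# Exact rational arithmetic avoids floating-point rounding in the check.
value = expected_discriminating_links(6, Fraction(1, 3), Fraction(5, 100))
print(float(value))  # 1.7
```

Under the same assumed relation, planning for 3, 6, or 12 at a capture rate of 1/3 and no noise corresponds to 1, 2, or 4 expected observed cross-links.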
In Figure 2, we plot the discriminable model-pair percent coverage at discriminability levels of 3, 6, and 12, while varying the potential cross-linker length for the three test proteins. The optimal cross-linker length, summarized in Table 1 for our examples, depends on the models and the relative positions of the reactive sites. Theoretically, with the same number of reactive sites and a random distribution of them on the protein surface, the optimal cross-linker length would be a function of protein size: the larger the protein, the longer the optimal cross-linker length. The three proteins have a similar number of lysines; hence, the larger proteins deoxyribonucleoside kinase and α-catenin are better discriminated with longer cross-linkers than the smaller FGF-2. Our planning method can be used for choosing suitable cross-linkers for a particular protein or as a guide for designing novel cross-linkers (Trester-Zedlitz et al. 2003). The unusual right tail of the FGF-2 curve is due to the elongated model based on the D-UTPase (β-Clip) template, which requires longer cross-linkers for discrimination.
possible experiment plans for N experiments. For FGF-2 and five potential cross-linkers, there were 4.1 × 10¹¹ possibilities for this eight-experiment plan; our algorithm provides a valuable tool for selecting the best ones.
The discriminability level has a significant influence on the planning result. Different levels of discriminability affect the choice of experiments, as well as the coverage attainable (Figure 4A). Although good coverage can be achieved at low discriminability values with a smaller number of experiments, the chance for error is higher.
becomes very close to 1 (some false-positive errors may still occur). Finally, the error (noise) rate can be reduced because the approach eliminates the possibility of assignment error in MS, and both the capture rate and the noise rate can potentially be improved by the detailed examination of cross-linking kinetics that this method allows (see below).
In planning for disulfide trapping, XlinkPlan considers pairs of residues for cysteine mutation (excluding drastic mutations from Phe, Trp, Tyr, Pro, and Gly). As before, planning parameters include the desired discriminability level and the ambiguity region A. In this case, we construct A around a model Cβ-Cβ distance of 13 Å, the midpoint of a sigmoidal transition of a 3 log difference in rates of disulfide formation (Careaga and Falke 1992), and expand A in increments of −1 and +2 to account for the asymmetry in the distribution of Cβ-Cβ distances relative to the transition midpoint value. Beyond estimating Cβ-Cβ distances, we do not construct a full geometric analysis of disulfide geometry (Sowdhamini et al. 1989), because protein dynamics override these considerations for many proteins (Careaga and Falke 1992) and our method does not require picking those disulfides that impart the greatest stability.
Figure 4B shows disulfide trapping experiment plans for FGF-2, produced by XlinkPlan. Although, as with residue-specific cross-linking, there are diminishing returns from doing more experiments, the enormous variety of possible disulfide experiments allows nearly full coverage to be achieved even at high discriminability levels if enough experiments are conducted. Assuming, as above, that the capture rate in lysine-specific cross-linking is ~1/3, the curves for discriminability levels of 3, 6, and 12 in Figure 4A are analogous to the curves for discriminability levels of 1, 2, and 4 in Figure 4B.
Because Phe, Trp, Tyr, Pro, and Gly make up ~21% of the residues in an average protein, the number of possible disulfide trapping experiments for a protein of n residues is about 0.31n²; for N planned experiments, the number of possible combinations is about the number of ways of choosing N experiments from these 0.31n² candidates. In the FGF-2 case, there are in total 5565 possible dicysteine mutations. The number of all possible combinations of choosing five experiments from these is >10¹⁶, whereas choosing 50 is >10¹²⁰. These numbers clearly make an exhaustive search for the optimal plan intractable.
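The orders of magnitude quoted above are easy to confirm with exact integer arithmetic. In this sketch the count of 106 mutable residues is not stated in the text; it is an inference from the fact that 106 choose 2 equals the quoted 5565 dicysteine mutants.

```python
import math

# Assumption: 106 eligible residues, inferred from comb(106, 2) == 5565.
eligible_residues = 106
dicysteine_mutants = math.comb(eligible_residues, 2)

plans_of_5 = math.comb(dicysteine_mutants, 5)
plans_of_50 = math.comb(dicysteine_mutants, 50)

print(dicysteine_mutants)            # number of residue pairs
print(len(str(plans_of_5)) - 1)      # order of magnitude for 5 experiments
print(len(str(plans_of_50)) - 1)     # order of magnitude for 50 experiments

# The 0.31 coefficient: about 79% of residues are eligible, and unordered
# pairs of them number roughly (0.79 n)^2 / 2.
print(round(0.79 ** 2 / 2, 2))
```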
In disulfide trapping, different numbers of experiments generate a wide range of coverage. Depending on the planned discriminability, 100% coverage is achieved only with a large number of experiments. However, coverage can be viewed as a conservative estimate of ability to discriminate, and practical experiment plans need not attain 100% coverage. To illuminate the relationship between coverage and experimental success, a simulation of a disulfide experiment plan at a discriminability of 3 was conducted, using different numbers of experiments and corresponding coverage levels (Fig. 5). The result of each disulfide cross-linking experiment was simulated according to the geometric feasibility in the correct structure. Simulated errors were introduced according to the specified capture and noise rates. By planning for a discriminability of 3, confident discrimination by a ratio corresponding to a discriminability of 2 can be expected even in the presence of this noise. In each simulation, we determined, with respect to this threshold of 2, which models were eliminated by losing a pairwise comparison. The remaining "top group" of uneliminated models typically contains the correct structure and as few as one or two others, typically the other β-trefoil models. With 86% coverage, the top group contains only these models in >80% of the cases. With sufficiently many experiments (N = 42), even the two most similar models can be distinguished >75% of the time. Because of false positives and negatives (because the capture rate is less than 1 and the noise rate is greater than 0), the correct structure might be eliminated. However, in this simulation, elimination of the correct model happens infrequently (<0.01%) because we require a sufficiently high ratio to make a decision.
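The error model used in such a simulation can be sketched in a few lines: each planned cross-link that is feasible in the correct structure is reported with probability equal to the capture rate, and each infeasible one is reported spuriously with probability equal to the noise rate. The function name and the rate values in the usage lines are hypothetical; the text does not list the rates used for Figure 5.

```python
import random
from typing import List


def simulate_observations(feasible: List[bool],
                          capture_rate: float,
                          noise_rate: float,
                          rng: random.Random) -> List[bool]:
    """Simulate which planned cross-links are observed.

    feasible[i] is True when cross-link i is geometrically feasible in
    the correct structure.
    """
    observed = []
    for is_feasible in feasible:
        rate = capture_rate if is_feasible else noise_rate
        observed.append(rng.random() < rate)
    return observed


# Usage with illustrative (assumed) rates and a fixed seed.
rng = random.Random(0)
truth = [True, True, False, True, False, False]
print(simulate_observations(truth, capture_rate=0.9, noise_rate=0.05, rng=rng))
```

Repeating the draw many times and scoring every model against each simulated outcome gives the kind of success frequencies reported above.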
Practical example: Disulfide trapping for Tfa model discrimination
We put our planning mechanism into practice on the Tfa protein of bacteriophage λ. The Tfa protein and its homologs are chaperones required for the assembly of trimeric tail fibers in those phage λ strains ("Ur-lambda") resembling the original wild-type isolate (Hendrix and Duda 1992), and in related phages such as T4 (Montag and Henning 1987; Hashemolhosseini et al. 1996). Genetic data suggest that the activity of Tfa and its homologs is an extreme example of chaperone activity, in which the structure of the final tail fibers (their ability to bind host membrane components) is partially determined by the structure of the chaperone (Hashemolhosseini et al. 1994).
Tfa is a small 194-amino-acid protein, but no structural information is available for it or any homolog. Crystallization trials of Tfa readily yield crystals, but they fail to diffract (Hashemolhosseini et al. 1996; M.J. van der Woerd and A.M. Friedman, unpubl.). We submitted the Tfa sequence to our fold-recognition meta-server (Kurowski and Bujnicki 2003). Three potential templates were identified by different fold-recognition programs (Table 3). The functional relationship between Tfa and the hsp70 chaperone DnaK was suggestive, and we used the domain structure of DnaK to suggest sites where the intact molecule might be divided for easier experimentation. A Tfa fragment of residues 1-108 was constructed and found to express a soluble protein that folds cooperatively. To investigate the relationship to DnaK and consider alternatives, complete models were built of the 1-108 Tfa fragment using the three templates. Many decoy models were also developed with the ab initio folding program Rosetta (Simons et al. 1997); one representative model was selected from each of the 100 largest clusters obtained from 15,456 decoys.
Such an analysis will be reported elsewhere.
, and coverage C (%). The 3D plot of these three variables is shown in Figure 7A
= 2) as the first step, conduct ~20 experiments (25% coverage with a discriminability of 2) and then plan additional experiments only if the result proves to be ambiguous. Because we seek one model that overrides others, the result of these experiments could be sufficient to select such a model unambiguously. If additional experiments are required, losing models need not be planned for, thereby pruning the planning problem. As an example, if the results anticipated for threading model 1 were found in the six experiments of the three-model plan (Table 4
= 2. This process could be repeated, ending in a final experiment that is explicitly balanced as in the three-model plan, to discriminate the last few, most similar models.
Algorithmic considerations
Before this experiment planning method was proposed, investigators might have conducted cross-linking experiments less systematically. We have compared the effects of non-systematic experimentation with our planning method (Figs. 9, 10). One method would be simply to select experiments without any planning. The expected results and variation of this "planning-free" approach are illustrated by the mean and standard deviation of 1000 random plans. A better alternative, once the problem has been formulated as here, would be to generate sets of plans randomly and select the best. This approach is illustrated by the best of 1000 random plans. Our planning algorithm outperforms both of these methods, especially with the enormous degrees of freedom and complex restraints of disulfide trapping planned at high discriminability. At the same time, our algorithm also achieves balance.
(r, s) and (s, r), and using only positive evidence for planning. Although we use uniform initial weights and weight decrements for all pairs in our algorithm, differential weights and reductions are possible and would provide greater flexibility in trading off among desired criteria for experimental design, either to focus on models of interest or to avoid spending resources on barely distinguishable pairs. Although there is additional cost in explicitly considering each pair of models (rather than using a linear-cost metric such as entropy), we have found that the coverage is sparse, with each experiment covering many fewer than the possible quadratic number of model-pairs. Thus, in practice, thousands of models can be handled quite efficiently.
experiments). Therefore, the total number of structure pair discriminations must be at least C P to reach C percentage coverage at the planned discriminability for P pairs. Also under the best possible scenario, each experiment will discriminate a disjoint set of model-pairs. This disjointness can be approximated by not considering which model-pairs an experiment covers, but only taking the number of expected discriminable pairs. The smallest number of experiments whose expected discriminable pairs sum to the C P threshold (again, without considering which pairs are covered) defines a lower bound on the optimal experiment number. The plans in Figure 7
Our algorithm balances speed and quality. It takes only seconds on a Pentium 4 computer to generate any of the plans in this paper, even with reasonably large sets of models and sizes of experiment plans. As previously discussed, the problem is NP-hard, and as we further illustrate, the combinatorics do not permit an exhaustive exploration even for the problem sizes studied here. Yet XlinkPlan results are well within a factor of 2 of optimal, and significantly better than a randomized algorithm as the number of degrees of freedom increases.
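The lower-bound argument above (take the experiments with the largest expected numbers of discriminable pairs until their sum reaches the required threshold, ignoring which pairs are covered) can be sketched as follows. The function name and arguments are hypothetical, and the threshold is passed in directly because its exact expression is not fully recoverable from the text.

```python
from typing import Iterable, Optional


def lower_bound_experiments(expected_pairs: Iterable[float],
                            required_discriminations: float) -> Optional[int]:
    """Smallest number of experiments whose expected discriminable-pair
    counts sum to at least the required threshold.

    Returns None when even all experiments together fall short.
    """
    total = 0.0
    # Taking the largest counts first gives the most optimistic
    # (smallest) experiment count.
    for k, count in enumerate(sorted(expected_pairs, reverse=True), start=1):
        total += count
        if total >= required_discriminations:
            return k
    return None


# Usage with made-up counts: 8 + 5 = 13 >= 12, so two experiments suffice.
print(lower_bound_experiments([5, 3, 8, 1], required_discriminations=12))
```

Because overlaps between experiments are ignored, any real plan needs at least this many experiments, which is what makes the value useful as a yardstick for the greedy plans.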
Our algorithm has been implemented in platform-independent Python scripts. The software can be freely obtained for academic use by request from the authors.
Summary
We have developed a probabilistic mechanism for analyzing cross-linking information with respect to a set of protein structure models, estimating the ability of experiments to discriminate among those models, and optimizing experiments accordingly. A probabilistic framework allows explicit characterization of errors that are present in all experimental data, enabling careful quantification of the extent of support for a particular model. The probabilistic approach allows explicit consideration of the experiment in classical statistical terms of sensitivity and power (type I and type II errors). Under our mechanism, an experimenter can establish and plan for a sufficient level of evidence required to support model selection, and thereby avoid false confidence in committing to an ambiguous decision. Similarly, the ability to select a posterior ratio as well as plan further discriminatory experiments provides control over type II errors.
We use a small set of readily interpretable parameters to characterize key factors underlying errors in data (capture rate, noise rate) and interpretation (H, L). Such parameters remain unstated in other approaches; for example, a violation-counting approach (Young et al. 2000) implicitly assumes a noise rate of 0 (no false positives), and H = 1 and L = 0 (no errors in interpretation of models). Although we adopt the simplest possible forms for these parameters (fixed constant values), we show that they can constitute a rational basis for interpretation and planning. Furthermore, as we found with Figure 5, the results are fairly insensitive to the exact parameter values. Future work will focus on a more complete accounting of these error parameters either with a classical sensitivity analysis, or within a Bayesian formulation that incorporates distributions over the values of the parameters.
Our formulation of experiment planning makes explicit the key factors of discriminability, coverage, balance, ambiguity, and cost. Although the experiments we plan here contain from 10¹¹ to 10¹²⁰ combinatorial possibilities, our greedy algorithm is efficient and effective in identifying plans expected to achieve these specified criteria. When confident selection requires planning larger experiments, a proposed semisequential approach, conducting batches of experiments that focus on remaining ambiguities, allows researchers to balance the desire for conservative plans with the need for experimental efficiency. This approach can also potentially integrate residue-specific and disulfide cross-linking, once put on common probabilistic ground, using an initial residue-specific experiment to eliminate many models and subsequent disulfide experiments to discriminate remaining ones. Disulfide cross-linking could also readily be supplemented by the use of cysteine-specific cross-linkers operating on the same dicysteine mutants to obtain more distance information (Kwaw et al. 2000). These semisequential and hybrid mechanisms are very general, and we also plan to study incorporating different types of experimental data, for example, the combination of cross-linking and mutagenesis. Finally, our planner can be applied to additional discrimination problems, for example, selecting among models of protein-protein complexes provided by docking procedures.
Our analysis raises some questions about the value of residue-specific cross-linking, especially when compared with disulfide trapping. If many residue-specific cross-links can be identified in a single experiment, then residue-specific cross-linking can be very powerful. However, whenever residue-specific cross-links are difficult to identify, our analysis indicates that disulfide cross-linking is a more powerful alternative. We believe a major practical problem, then, is the low and variable capture rate of residue-specific experiments. Even extensive experimentation (Young et al. 2000) yielded a capture rate of only ~1/3, whereas less extensive experiments (Haniu et al. 1993) gave far less (<10%). New cross-linkers and new detection methods would improve these results, but at present the capture rate is far less than can be achieved with disulfide cross-linking. As a result of the low capture rate, residue-specific experiments effectively provide less information. We estimate that under the current capture and noise rates, one disulfide cross-link is approximately equivalent to several expected residue-specific ones. In addition, the coverage of residue-specific experiments "saturates" early and dramatically, whereas disulfide trapping experiments provide enormous degrees of freedom for further, fine-grained model discrimination.
The optimal experiment plan (Table 4) for discrimination of the threading models of the λ Tfa protein is currently being carried out. Thus far the data consistently support one model. As a final note, the mutagenesis and disulfide oxidation approach is simple and amenable to robotic automation. The combination of robotics with experiment planning should prove very powerful in the rapid elucidation of protein structure.
| Materials and methods |
|---|
S has a prior probability, p(s), which can be uniform or can incorporate scoring information from the modeling process. The task is to identify the model in S that is best, in terms of the prior and agreement with experimental data regarding a set of possible cross-links. We bridge the gap between model and data in two steps: (1) consistency of cross-links with models, and (2) evidence for cross-links from data.
Consistency of a cross-link li with a model s is modeled with a conditional probability p(li|s). By analogy to contact maps, which show pairs of residues that are "close," we call each set of conditional probabilities for a particular model and experiment a cross-link map. Figure 1, Step 1, has examples for three models. In general, cross-link map values would be determined by the reactivity of the protein groups being linked, their accessibility to the cross-linking reagent, and the geometric feasibility of the cross-linking reaction given the finite length of the cross-linking molecule. The reactivity of the protein groups cannot be easily extracted from the model, but can be corrected for by measurements of reactivity with monofunctional reagents (Novak et al. 2004). For the studies here, we assume constant reactivity. Similar considerations hold for accessibility (although some portion of the relative accessibility of sites may be extracted from the predicted model). Finally, geometric feasibility depends on whether or not the cross-linker can bridge the distance between cross-linked atoms in the model, potentially with consideration for protein dynamics. For example, the cross-linker bis-sulfo-succinimidyl suberate (BS3) reacts with amino groups, including the N terminus and the Nζ of Lys residues, and forms a bridge of up to 11 Å between such pairs. Similarly, in disulfide trapping (Careaga and Falke 1992), disulfide bonds are formed upon oxidation of cysteines whose Cβ atoms approach within 4.6 Å, with proper geometry, during the experiment.
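As a rough illustration of the geometric term only, a cross-link map could be sketched by thresholding inter-atom distances in a model. The function name, the coordinate dictionary, and the high and low feasibility values (0.9 and 0.1) are assumptions made for this example; reactivity, accessibility, and protein dynamics are ignored.

```python
import math
from itertools import combinations


def cross_link_map(coords, max_span, high=0.9, low=0.1):
    """Assign a feasibility p(l | s) to every residue pair in one model.

    coords:   dict mapping residue number -> (x, y, z) of the reactive
              atom in Angstroms (hypothetical input format).
    max_span: longest distance the cross-linker can bridge, e.g. 11.0
              for BS3 or 4.6 for disulfide trapping.
    high/low: illustrative feasibility values for pairs within or
              beyond the span.
    """
    result = {}
    for (i, a), (j, b) in combinations(sorted(coords.items()), 2):
        result[(i, j)] = high if math.dist(a, b) <= max_span else low
    return result


# Toy model with three reactive atoms.
model = {10: (0.0, 0.0, 0.0), 25: (3.0, 4.0, 0.0), 40: (20.0, 0.0, 0.0)}
print(cross_link_map(model, max_span=11.0))
```

A hard distance cutoff is the simplest choice; a smoother function of distance would be one way to account for flexibility.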
Support for a cross-link li from experimental data d is modeled with likelihood p(d | li). Because this paper concentrates on the information content available via cross-linking, we take as given an interpretation of the data. For example, in the case of cross-link identification by mass spectrometry, likelihoods could be computed by predicting expected mass peaks for a given cross-link and comparing with observed spectra, using a distribution to model measurement error, and a mixture model to handle experimental complexities (e.g., missed proteolytic cleavage). A key part of these likelihoods that we explicitly model is the sparsity (false negatives) and noise (false positives) of the data. Feasible cross-links are detected at some capture rate $\alpha$, whereas infeasible cross-links show up, spuriously, at some noise rate $\beta$. These rates depend on the cross-linker and peptides involved, the detection methods, and the experimental effort, but we consider the simplest case of fixed rates.
Combining these terms then yields the support for each model from the data, by marginalizing over cross-link existence. In this paper, we treat cross-links as independent, although it is certainly possible to model dependence due to such effects as common reactivity arising from cross-links sharing an amino acid side chain. Similarly, a model is conditionally independent of the data given the cross-links (models are not, e.g., optimized with respect to the data). Thus we have
$$p(d \mid s) = \prod_{i} \sum_{l_i \in \{0, 1\}} p(d \mid l_i)\, p(l_i \mid s)$$ | (1) |
In this approach, a model is supported by high-feasibility cross-links that are observed and low-feasibility ones that aren't. It is penalized by low-feasibility cross-links that are observed and high-feasibility ones that aren't. Figure 1, Step 4, has two simple examples for one observed and one unobserved cross-link. The p(d | li) terms are the noise and capture rates, and the p(li | s) terms arise from the cross-link map.
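The marginalization over cross-link existence can be sketched in a few lines. This is a minimal example assuming independent cross-links and fixed capture and noise rates; the function name and the rate values are illustrative, not taken from the paper.

```python
import math


def log_likelihood(feasibility, observed, capture=0.9, noise=0.05):
    """Log p(d | s) for one model under independent cross-links.

    feasibility: list of p(l_i | s) values (the model's cross-link map).
    observed:    list of booleans, True if cross-link i was detected.
    capture, noise: assumed fixed detection rates (illustrative values).
    """
    total = 0.0
    for p_link, seen in zip(feasibility, observed):
        # Probability of detecting link i, summed over the link
        # existing (captured) and not existing (spurious).
        p_seen = capture * p_link + noise * (1.0 - p_link)
        total += math.log(p_seen if seen else 1.0 - p_seen)
    return total


# One high-feasibility link observed, one low-feasibility link unobserved.
print(math.exp(log_likelihood([0.9, 0.1], [True, False])))
```

Working in log space avoids underflow when many cross-links are multiplied together.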
An interesting consequence of this realistic model is that, depending on the number of potential cross-links and their cross-link feasibility (H, L), capture ($\alpha$), and noise ($\beta$) values, we should expect to observe some cross-links that are considered low feasibility in the correct structure. The expected number of identified cross-links among B low-feasibility ones is $[\alpha L + \beta(1 - L)]B$. If
=
= 0.05, L = 0.1, and B = 25, we expect to see about two infeasible cross-links wrongly identified. The potential identification of incorrect cross-links points out the need for multiple possible cross-links supporting a model selection (see discussion of selection threshold below). Further analysis of this effect and its implications for model discrimination will be reported elsewhere.
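The expected count of low-feasibility cross-links that are nonetheless identified is simple to compute. The sketch below uses hypothetical function names, and the rates in the usage lines are illustrative only; the probability of seeing at least one such cross-link additionally assumes independence between cross-links.

```python
def expected_spurious(capture, noise, low_feasibility, n_links):
    """Expected number of identified cross-links among n_links
    low-feasibility ones."""
    per_link = capture * low_feasibility + noise * (1.0 - low_feasibility)
    return per_link * n_links


def prob_at_least_one(capture, noise, low_feasibility, n_links):
    """Chance that one or more low-feasibility cross-links is identified,
    assuming the cross-links are independent."""
    per_link = capture * low_feasibility + noise * (1.0 - low_feasibility)
    return 1.0 - (1.0 - per_link) ** n_links


# Illustrative rates, not values from the paper.
print(expected_spurious(0.5, 0.05, 0.1, 25))
print(prob_at_least_one(0.5, 0.05, 0.1, 25))
```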
Using equation 1, we can reweight the prior distribution p(s) by the information provided by the data:
$$p(s \mid d) = \frac{p(d \mid s)\, p(s)}{\sum_{s' \in S} p(d \mid s')\, p(s')}$$ | (2) |
and identify the maximum a posteriori model, or maximum likelihood model in the absence of informative priors.
A posterior ratio allows comparison of the consistency of two models (r, s ∈ S) with the data.
$$\rho_{rs} = \frac{p(r \mid d)}{p(s \mid d)} = \frac{p(d \mid r)\, p(r)}{p(d \mid s)\, p(s)}$$ | (3) |
In the present context, we allow for the possibility of priors, although we treat them as uniform. When priors are ignored, this ratio becomes a so-called Bayes factor. A model can be confidently selected when the ratio with respect to every other model is sufficiently large.
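Reweighting and model selection could be sketched as follows. This is a minimal example: the function names and the selection threshold are assumptions, and priors default to uniform as in the text.

```python
import math


def posterior(log_likelihoods, priors=None):
    """Posterior p(s | d) for each model from log p(d | s) and priors."""
    n = len(log_likelihoods)
    if priors is None:
        priors = [1.0 / n] * n  # uniform priors
    log_joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    top = max(log_joint)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_joint]
    norm = sum(weights)
    return [w / norm for w in weights]


def confidently_selected(post, threshold):
    """Index of the best model if its posterior ratio against every other
    model is at least `threshold` (a hypothetical cutoff), else None."""
    best = max(range(len(post)), key=post.__getitem__)
    for i, p in enumerate(post):
        if i != best and post[best] < threshold * p:
            return None
    return best


post = posterior([math.log(0.6), math.log(0.3), math.log(0.1)])
print(post, confidently_selected(post, threshold=10.0))
```

With uniform priors the posterior ratio reduces to the likelihood ratio, so the same check applies to Bayes factors.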
Experiment planning
Planning metrics
The problem of characterizing the utility of an experiment has been well-studied in the statistical literature; for example, model entropy or relative entropy (Kullback-Leibler distance) between posterior and prior distributions is one natural approach that would capture the expected effects of reweighting the models given data. We use a complementary approach that uses pairwise differences in cross-link maps so that we can make explicit trade-offs among key properties of practical importance for our application: discriminability, coverage, balance, ambiguity, and cost. In the present paper, we treat experimental cost as uniform within an experiment type, so that cost becomes simply the number of experiments.
Intuitively, a pair of models with very different cross-link maps (i.e., disagreeing about feasibility of many cross-links) has a higher probability of being discriminated than a pair with very similar cross-link maps. We separately consider the two directed discriminations in favor of one or the other model, which we characterize as cross-link map differences, d(r, s) and d(s, r). Using H and L feasibilities as in Figure 1, d(r, s) would simply be the size of the set $\Delta_r$ of cross-links that have H in r and L in s, and similarly for d(s, r) = $|\Delta_s|$ (note that cross-links for which they agree cancel out in the discriminability ratio, equation 3). Now we consider whether the discriminability ratio $\rho_{rs}$ is sufficient to select r if r is indeed correct. Because this analysis is done before data are collected, we must take the expectation over all possible data sets:
$$E\{\rho_{rs} \mid r\} = \int \frac{p(r \mid d)}{p(s \mid d)}\, p(d \mid r)\, \mathrm{d}d$$ | (4) |
In general, this integral cannot be evaluated analytically. However, it can be simplified under the assumptions we have been discussing: independent feasibility of cross-links using fixed H and L, detected under fixed rates for capture $\alpha$ and noise $\beta$. The probability of capturing a high-feasibility cross-link is then $\phi_H = H\alpha + (1 - H)\beta$, the sum of capturing it correctly and of it showing up incorrectly. The probability of capturing a low-feasibility cross-link is $\phi_L = L\alpha + (1 - L)\beta$. If r is the correct model, then each cross-link from $\Delta_r$ (those with H in r and L in s) contributes $\phi_H/\phi_L$ to the ratio if observed or $(1 - \phi_H)/(1 - \phi_L)$ if not (both contribution ratios are reciprocated for the $\Delta_s$ cross-links). Assuming independence of cross-links, the expected value is multiplicative, and we can separately analyze the expected contribution of each cross-link to the ratio. Each cross-link in $\Delta_r$ contributes:
$$\gamma_r = \phi_H \frac{\phi_H}{\phi_L} + (1 - \phi_H) \frac{1 - \phi_H}{1 - \phi_L} = \frac{\phi_H^2}{\phi_L} + \frac{(1 - \phi_H)^2}{1 - \phi_L}$$ | (5) |
Cross-links in $\Delta_s$ have a similar formula with $\phi_H$ and $\phi_L$ switched, giving $\gamma_s = \phi_L^2/\phi_H + (1 - \phi_L)^2/(1 - \phi_H)$. Examination of $\phi_H$ and $\phi_L$ demonstrates that $\alpha$ must be greater than $\beta$ for effective discrimination, and the greater the difference, the greater the effectiveness of the experimental system.
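The per-cross-link quantities in this derivation can be checked numerically. In the sketch below, `phi_h` and `phi_l` stand for the probabilities of detecting a high- and a low-feasibility cross-link, and `gamma_r` and `gamma_s` for the expected factor that each differing cross-link contributes to the ratio when r is correct. These names, and the feasibility and rate values, are labels and assumptions chosen for this example.

```python
def capture_probabilities(H, L, capture, noise):
    """Detection probability for high- and low-feasibility cross-links."""
    phi_h = H * capture + (1.0 - H) * noise
    phi_l = L * capture + (1.0 - L) * noise
    return phi_h, phi_l


def expected_contributions(phi_h, phi_l):
    """Expected per-cross-link factor in the ratio, given r is correct.

    gamma_r: for links that are H in r and L in s (detected w.p. phi_h).
    gamma_s: for links that are L in r and H in s (detected w.p. phi_l).
    """
    gamma_r = phi_h ** 2 / phi_l + (1.0 - phi_h) ** 2 / (1.0 - phi_l)
    gamma_s = phi_l ** 2 / phi_h + (1.0 - phi_l) ** 2 / (1.0 - phi_h)
    return gamma_r, gamma_s


# Illustrative values: H = 0.9, L = 0.1, capture = 0.9, noise = 0.05.
phi_h, phi_l = capture_probabilities(0.9, 0.1, 0.9, 0.05)
print(expected_contributions(phi_h, phi_l))

# With equal capture and noise rates, both factors collapse to 1,
# so the experiment carries no discriminating information.
print(expected_contributions(*capture_probabilities(0.9, 0.1, 0.3, 0.3)))
```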
The expected ratio in equation 4 then becomes
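$$E\{\rho_{rs} \mid r\} = \frac{p(r)}{p(s)}\, \gamma_r^{\lvert \Delta_r \rvert}\, \gamma_s^{\lvert \Delta_s \rvert}$$ | (6) |
where $\gamma_r$ and $\gamma_s$ are the expected per-cross-link contributions (equation 5) for the cross-links in $\Delta_r$ and $\Delta_s$. We can rewrite $\gamma_r$ in terms of the capture probabilities $\phi_H$ and $\phi_L$ as
$$\gamma_r = 1 + \frac{(\phi_H - \phi_L)^2}{\phi_L (1 - \phi_L)}$$ | (7) |
to see that it is >1 (assuming $\alpha > \beta$, so that $\phi_H > \phi_L$). Similarly, $\gamma_s$ > 1, thus the expectation of the ratio $E\{\rho_{rs} \mid r\}$ increases monotonically with $|\Delta_r|$ and $|\Delta_s|$. Thus, we can use the cross-link map differences as an easily interpretable measurement of the potential for correctly making a selection.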
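The link between cross-link map differences and the expected ratio can be illustrated with a short sketch. It assumes uniform priors (so the prior ratio is 1), two-level H/L maps over the same set of cross-links, and illustrative feasibility and rate values; all function and variable names are hypothetical.

```python
def map_differences(map_r, map_s):
    """Return (d(r, s), d(s, r)) for two H/L cross-link maps.

    Each map is a dict from a cross-link (e.g. a residue pair) to "H"
    or "L"; both maps must cover the same cross-links.
    """
    d_rs = sum(1 for k in map_r if map_r[k] == "H" and map_s[k] == "L")
    d_sr = sum(1 for k in map_r if map_r[k] == "L" and map_s[k] == "H")
    return d_rs, d_sr


def expected_ratio(d_rs, d_sr, H=0.9, L=0.1, capture=0.9, noise=0.05):
    """Expected posterior ratio favoring r when r is correct,
    under uniform priors and independent cross-links."""
    phi_h = H * capture + (1.0 - H) * noise
    phi_l = L * capture + (1.0 - L) * noise
    gap = (phi_h - phi_l) ** 2
    gamma_r = 1.0 + gap / (phi_l * (1.0 - phi_l))
    gamma_s = 1.0 + gap / (phi_h * (1.0 - phi_h))
    return gamma_r ** d_rs * gamma_s ** d_sr


map_r = {(1, 5): "H", (2, 9): "H", (3, 7): "L", (4, 8): "L"}
map_s = {(1, 5): "L", (2, 9): "H", (3, 7): "H", (4, 8): "L"}
d_rs, d_sr = map_differences(map_r, map_s)
print(d_rs, d_sr, expected_ratio(d_rs, d_sr))
```

Because both per-link factors exceed 1 whenever the capture rate differs from the noise rate, every additional differing cross-link raises the expected ratio.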
Averaging (or simply summing) the expectation of ratios over all model-pairs yields a measure of the overall expected information provided by an experiment. In our cross-link map difference approach, we simply sum up the number of model-pairs with a cross-link map difference of at least some threshold
:
|
|