|
|
||||||||
1 Department of Molecular Genetics and the Crown Human Genome Center, The Weizmann Institute of Science, Rehovot 76100, Israel
Reprint requests to: Doron Lancet, Department of Molecular Genetics and the Crown Human Genome Center, The Weizmann Institute of Science, Rehovot 76100, Israel; e-mail: doron.lancet{at}weizmann.ac.il; fax: 972-8-9344487.
(RECEIVED July 6, 2003; FINAL REVISION September 29, 2003; ACCEPTED October 1, 2003)
Supplemental material: see www.proteinscience.org
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.03296404.
| Abstract |
|---|
|
|
|---|
Keywords: orthologs; paralogs; G-protein coupled receptors; homology modeling
| Introduction |
|---|
|
|
|---|
A large majority of ORs are semiorphan receptors, meaning that although they are known to bind odorants, the specificity of each receptor for target ligands is not available in most cases. This is largely due to the relative difficulty in functional expression of these proteins in heterologous expression systems (Gimelbrant et al. 1999). Also, to date, no experimentally determined structure of an OR protein exists in the literature. Consequently, relatively little is known about protein structural attributes of ligand recognition in ORs.
The sequencing of the first OR proteins revealed that TM helices 3 to 6 were more variable between paralogs, relative to the rest of the protein (Buck et al. 1991). Based on the notion that in a large protein repertoire, geared to recognize thousands of ligands, contact positions would show pronounced variability between paralogs (Wu and Kabat 1970), these segments were hypothesized to participate in odorant binding (Buck et al. 1991). Later studies have attempted to predict odorant binding residues in olfactory receptors based upon sequence analysis, docking simulations using structural models, and predictions combining sequence analysis with structure information. Some of the earlier attempts included correlated mutation analysis used to identify eight contact positions (Singer et al. 1995a) and positive selection moments, which predicted three specificity-determining residues within TM6 (Singer et al. 1996).
Additional studies predicted ligand-contact residues by computer-based docking of odorants to structural models of the receptors (Afshar et al. 1998; Floriano et al. 2000; Singer 2000; Vaidehi et al. 2002). Together, these studies predicted 22 putative contact residues, located on TMs 3 to 7 in their models. In an elaboration of the original variability detection concept, analysis of the TM regions of ~200 OR paralog sequences combined with a low-resolution structural homology model allowed the prediction of 17 olfactory complementarity determining residues (CDRs; Pilpel and Lancet 1999). The predicted 17 positions were suggested to constitute a hypervariable odorant binding site, similar to that of immunoglobulins. This analysis was subsequently enhanced by introducing comparisons of ortholog pairs. The hypothesis in this case was that functional residues would tend to be conserved in orthologs, assuming that such pairs may recognize the same or similar odorant ligands. In a limited analysis (Lapidot et al. 2001), which included six humanmouse OR orthologous pairs, 16 of the 17 originally predicted CDRs (Pilpel and Lancet 1999) displayed low interortholog variability and high interparalog variability. A more recent study by Kondo et al. (2002) similarly predicted binding site residues by identifying positions variable between two different OR paralogs but fully conserved among five fish orthologs of each. They identified 14 potential contact residues dispersed on TMs 3, 5, 6, and 7.
The resolution of both the human and mouse complete OR subgenomes (Fuchs et al. 2001; Glusman et al. 2001; Young et al. 2002; Zhang and Firestein 2002) provided large sets of paralog and putative ortholog OR pairs. In this study we predict the binding site of ORs in an analysis that is unbiased by a priori assumptions as to the location of the binding site, using a large number of sequences from both humans and the mouse. This is done by identifying sequence positions with high conservation within ortholog pairs but with significantly lower sequence preservation in paralog pairs. A similar approach has recently been successful in the prediction of the binding sites of bacterial transcription factors and eukaryotic and prokaryotic protein kinases (Mirny and Gelfand 2002; Li et al. 2003). However, the exact methodology used in these studies could not be transferred to the case of ORs due to the availability of the complete set of OR sequences for only two species, and the paucity of functional data. We therefore developed an alternative methodology, which uses sequence pairs.
| Results |
|---|
|
|
|---|
As a first step towards the prediction of the odorant binding site we wanted to identify positions that are highly conserved within OR ortholog pairs. To this end we selected a set of 218 predicted OR ortholog pairs, using conservative cutoff criteria of bearing mutual best-hit relationship and having higher than 77% sequence identity. Figure 1
illustrates the phylogenetic relationships captured by the ortholog selection criteria. We then calculated the positional conservation, C, in the predicted OR ortholog set (Fig. 1A
), and compared it to the conservation expected solely due to the overall sequence identity among the ortholog pairs (0.838 ± 0.003). We found 146 positions to be significantly conserved within orthologous OR pairs with a false discovery rate (FDR) of 0.05, as assessed by a modified chi-square test (Fig. 1B
).
|
For the comparison between the positional conservation profiles of the ortholog and paralogs sets to be valid, the expected conservation for both groups has to be similar. We therefore chose only paralog pairs, which had mutual sequence identity between 77% and 95%, corresponding to the range of values found among the ortholog pairs. The expected positional conservation for paralog pairs using all OR paralog pairs with a mutual sequence identity within the specified range was lower than the expected value for the ortholog pairs set (0.834 ± 0.003 versus 0.838 ± 0.003, P = 0.018, assessed by a binomial proportions test). Using all 1374 pairs of paralogs specified by the range of sequence identities within the ortholog set would have resulted in spurious predictions. As an example, if we were to examine a position in which both sets had a C-value equal exactly to their respective mean expected positional conservation, we would conclude that at this position orthologs are more conserved than paralogs (P = 0.018, as assessed by a binomial proportions test). Therefore, we chose to work with a set of paralogs where each pair constituted an OR and its closest paralog with a mutual sequence identity within the desired range. The resultant set, which contained 518 pairs, had an expected positional conservation of 0.868 ± 0.002, and thus qualified as a conservative control set for our analysis. The phylogenetic relationships captured by this paralog set are illustrated in Figure 1E
.
We define D, as the difference in positional conservation between the set of orthologs and the control set of paralogs (Fig. 1C
). Twenty-three positions were found to display a significantly greater conservation among ortholog pairs than among paralog pairs with an FDR of 0.05, as assessed by a binomial proportions test (Fig. 1D
).
We singled out those positions that were found both to be significantly conserved among ortholog pairs (C criterion) and to be significantly more conserved amongst ortholog pairs than amongst paralog pairs (D criterion). Only one residue identified by the D criterion was below the C criterion threshold. In other words, high D-values tend to predict high ortholog C-values. Thus, a set of 22 positions was identified (Table 1
; Fig. 2
). These positions are disposed on the predicted TMs 2 to 7, and on the second extracellular loop. We propose that this set of positions may play a major role in constructing the odorant binding site of the OR protein superfamily.
|
|
|
|
Two of the 22 functional OR residues (alignment positions 193 and 196, Table 1
) were not in the TM barrel, but in the second extracellular loop. These residues were in close sequence proximity (relative positions -1 and +2) to a highly conserved cysteine within this loop, which in rhodopsin forms a disulfide bond with another highly conserved cysteine at the N terminus of the third helix (Fig. 4
). The high conservation of these two cysteines in ORs (both are 99.77% conserved in intact mouse ORs) leads us to believe that this disulfide bond is found also in ORs. In rhodopsin, the disulfide bond pulls the second extracellular loop towards the binding pocket, bringing the counterparts of the predicted OR contact residues near the putative binding site. They are the first and last residues of a ß-strand, which secludes the retinal from bulk solution on the extracellular surface (Menon et al. 2001). Ile189 in rhodopsin (alignment position 196) interacts with the methyl group bonded to C9 of the retinal ployene chain, while the other, Ser186 (alignment position 193), was shown to be within 4.5 Å of retinal. Thus, these loop residues are disposed favorably to interact with OR ligands.
Comparison of the predicted odorant binding site to experimental data
For other rhodopsin-like GPCRs, a wealth of data is available concerning ligand-contact residues. Using this information and the alignment of ORs against other rhodopsin-like GPCRs, we found that 21 out of 22 predicted binding site residues align against a ligand-contact residue in at least one other GPCR (Table 1
). This overlap set includes the two residues in the second extracellular loop. For comparison, Shi and Javitch (2002) listed 33 residue positions within the TM segments that have been implicated in ligand binding in aminergic receptors based on experiments. Eleven of these residue positions are within our set of predicted binding site residues (P
1.33 x 10-4)
A functional expression study of rat and mouse OR I7 (Krautwurst et al. 1998), whose human ortholog is OR6A1, indicated a ligand-contact residue at position 206 (position 216 in our global alignment). It was discovered, as it accounts for a difference in affinity towards n-heptanal between the rat I7 OR (valine at this position) and the mouse I7 OR (isoleucine at this position). The residue at this position in the amino acid sequence is not included in our predicted binding site set. This discrepancy is, however, alleviated by a more recent report, which did not find this difference in affinity (Bozza et al. 2002).
Conservation of the entire binding site among ortholog and paralog pairs
Although the method used ensures that each individual binding site position would be conserved within most of the ortholog pairs, it does not guarantee that in a given pair of orthologs all or most of the binding site residues would be conserved. We observed that 147 out 218 ortholog pairs (67%) conserve at least 21 of 22 of the binding site residues (P = 0.0087, as assessed by simulation). Thus, it appears that overall conservation of the entire proposed binding site amino acid set could be used as a criterion for OR functionality as well as for the functional significance of orthologous pair assignment.
As an example, in two cases (human OR8A1 and mouse MOR171-2 and MOR171-3; and human OR8D1 and mouse MOR171-9 and MOR171-22) we found that an OR had an identical putative binding site with its second best hit, instead of its predicted ortholog. In both cases the difference between the overall sequence percent identity with the first and second best hits was less than 2%. Thus, it is in the realm of possibility that the true functional ortholog does not coincide with the counterpart with highest overall sequence identity.
A study attempting to identify the dog OR subgenome (Olender et al. 2003), found 137 triplets, each containing a dog, human, and mouse OR, which were reciprocal best hits for all three interspecies sequence comparisons. No cutoff was imposed on the percent identities within the individual pairs. We calculated the number of differences within the putative binding site for every pair within every triplet. The binding sites were remarkably conserved with 26 triplets (19%) displaying an identical binding site, and 54 triplets (39%) displaying a conservation pattern where two of the ORs had an identical binding site and the binding site of the third differed from them by at most a single amino acid. The highest conservation was observed for the two macrosomatic species, dog and mouse, where 87 pairs (64%) had at most one difference within the binding site. Thus, although the analysis was performed only on ORs from human and mouse, the prediction holds for other species as well.
| Discussion |
|---|
|
|
|---|
Several theoretical studies have attempted to predict specific odorant-binding residues in the past. One of these studies (Kondo et al. 2002) based its prediction on the identification of positions that are fully conserved within groups of orthologs, but differ between paralogs. They examined the representative sequences of two OR paralogs in five strains of Japanese medaka fish, and predicted 14 specificity determining residues, five of which overlap with our prediction. However, an overwhelming majority (87%) of sequence positions, including 16 of the positions in our prediction, are fully conserved within the 10 homologous sequences examined. On the other hand, nine positions, which separate the paralogs in that study, do not display a significantly higher conservation within ortholog pairs, compared to paralog pairs in our analysis. Thus, both the small number of sequences examined and the relatively high similarity between them restricted the power of this previous study.
Another study (Pilpel and Lancet 1999) predicted the odorant binding site by detecting hypervariable positions in an alignment of ~200 paralogous ORs. To filter out nonspecific variability these authors imposed additional restriction, considering only residues located in the extracellular two-thirds of TMs 3 to 5, and facing the interior of the TM barrel in a low-resolution rhodopsin-based homology model. The resultant predicted binding site contained 17 residues, 10 of which appear also in our prediction. This previous study required strict hypervariability from the sequence positions of the odorant binding site, and thus overlooked residues that may be responsible for the fine tuning of specificity. Such residues may exhibit only slight variability, and would thus only be detected when contrasting their conservation among orthologs against that among paralogs. In addition, the a priori assumption of Pilpel and Lancet as to the location of the odorant binding site excluded the analysis of the loop regions of the receptor, and filtered out several hypervariable sequence positions found within the TM segments. Two such hypervariable positions, namely position 15 of TM6 and position 6 of TM 7, were indicated by the present analysis to be involved in receptor specificity. As for the seven residues missing from our prediction, two of them clearly face the exterior of the helix bundle in the present homology model, whereas the remaining five are all located in the cleft between TMs four and five, in a region not corresponding to the ligand-binding pocket of any other GPCR. The authors of the previous study hypothesized that this region might act as a binding site unique to ORs. None of the residues proposed by the present study is located in this region, indicating that the variability observed in this region may be nonspecific.
Several other studies used computer-based docking of odorants to structural models of ORs to predict residues that participate in the binding of odorants (Afshar et al. 1998; Floriano et al. 2000; Singer 2000; Vaidehi et al. 2002). The unified set of predicted residues from these studies constitutes 22 residues (Table 2
), 10 of which were predicted by the present study. Although all the contact residues predicted by these studies were located in TMs 3 to 7 in their respective models, four of the predicted residues lie in regions that are not membrane-embedded according to the homology model generated in the present study. This discrepancy may be due to the fact that most of these studies (Floriano et al. 2000; Singer 2000; Vaidehi et al. 2002) predicted the location of TMs, whereas we inferred the location of these segments by aligning ORs to rhodopsin, for which the bounds of the TMs have been determined experimentally. All odorant-binding residues predicted by the docking simulation studies, but excluded from the set of residues identified by our analysis, face the exterior of the TM bundle in our model. All these studies made use of the rhodopsin low-resolution (7.5 Å) two-dimensional map (Schertler et al. 1993) in which the kinks now known to be a prominent feature of the rhodopsin structure were not apparent. Interestingly, the greatest overlap between our prediction and those made by the docking simulation studies is in the third TM helix, the only helix that is not kinked in the rhodopsin structure. It is thus possible that the use of low resolution structural data in these studies compromised their ability to correctly predict residues that bind odorants in ORs.
|
Limitations of the prediction
One of the assumptions on which the present study is based is that the same residues determine odorant specificity in all ORs. If this assumption were not true, only the set of residues determining specificity shared by all (or most) ORs would be detected. Two additional assumptions are that paralogs have different odorant specificities and orthologs share identical odorant specificities. Noise in the form of false orthologs or orthologs with diverged specificity may thus cause the analysis to overlook some specificity-determining residues. Two recent studies, which predicted the binding sites of bacterial transcription factors (Mirny and Gelfand 2002) and the specificity-determining residues of protein kinases (Li et al. 2003), utilizing the same concept as our prediction, indeed used only unambiguous orthologs based on known function. However, for the OR superfamily this is not possible; thus, we resorted to predicting orthologs based solely on sequence similarity criteria. Despite this drawback we were able to predict a binding site that is corroborated both by its location on a structural model and by its high correspondence to ligand-contact residues in other GPCRs. This was aided by the large size of the sample, compared to previous studies, resulting in enhanced statistical power.
A high-resolution homology model for ORs
The binding site prediction process presented in this paper did not rely on any structural information. The homology model was generated solely for locating the implicated residues in the framework of the OR structure. For this, we used the crystal structure of bovine rhodopsin (Palczewski et al. 2000), the only structural template available for GPCRs. This is the first report of an OR homology model based on such high-resolution structural data. The rhodopsin structure contains many kinks and distortions in the TM domains, some of which were not seen in the low-resolution data of rhodopsin (Menon et al. 2001). Thus, our model is an improvement on previously published models (Floriano et al. 2000; Singer 2000; Vaidehi et al. 2002). However, it should be noted that this model still suffers from some of the disadvantages of its predecessors.
One such weakness is the loop regions. These are the most inaccurate feature of the model. On the one hand, these regions could not be modeled according to rhodopsin: some of their residues are absent from the crystal structure (Palczewski et al. 2000), they may be affected by packing forces within the crystal (Vaidehi et al. 2002), and they are the most divergent feature of GPCRs (Fig. 2A
). On the other hand, the second extracellular loop is of functional importance in ligand binding, as demonstrated experimentally for non-OR GPCRs (Silvente-Poirot and Wank 1996; Shi and Javitch 2002) and by our analysis for ORs. We, therefore, modeled the loop regions by using an ab initio method. The method used (Sali and Blundell 1993; Fiser et al. 2000) has been shown to perform well in the simultaneous prediction of short loops (up to 14 residues), with the accuracy of prediction dropping with length. The second extracellular loop of ORs is exceptionally long, and even when the disulfide bond-forming cysteine at its middle is used to divide it into two loops, each contains more than 14 residues. For our purposes the resultant limited quality model of this region was sufficient, because the location of the predicted binding residues in this region is quite well determined due to their proximity to the cysteine that participates in the disulfide bond. However, it is possible that this loop may have additional functional roles, as has been previously suggested (Singer et al. 1995b). To investigate this region, as well as other regions that are divergent between ORs and other rhodopsin-like GPCRs, it may be necessary to employ additional, complementary approaches to modeling the structure and function of ORs.
| Materials and methods |
|---|
|
|
|---|
A basic assumption made in the analysis is that the pseudogenes used are recent, and therefore, may be informative for the analysis. To test this assumption, we compared the conservation of the pseudogenes to that of the intact genes. We quantified the conservation of a gene by computing the percentage of a consensus it conserves. The consensus used was a group of 31 positions (Fig. 2A
) that are 90% conserved in both class I and class II intact ORs, which have been shown to display distinct conservation patterns (Zhang and Firestein 2002). We used only mouse intact ORs for the generation of the consensus, because these were previously shown to have higher conservation than human intact ORs (Young et al. 2002). Ninety percent of the OR pseudogenes were found to conserve more than 90% of the consensus, indicating that a similar proportion of the binding site residues may be conserved in these ORs. Thus, these sequences could provide substantial information for our analysis.
Non-OR GPCR sequences
To compare the predicted binding site to ligand contact residues in other GPCRs, we selected vertebrate sequences from the following rhodopsin-like GPCR families: opsins, acetylcholine (muscarinic), adrenergic, dopamine, serotonin, histamine, and angiotensin receptors. Sequences were obtained from the SWISS-PROT database (Bairoch and Apweiler 2000), and divided into sets according to the highest resolution division in the GPCRDB (Horn et al. 2001) classification.
Multiple sequence alignments
To date, experimentally determined structures have been published only for one GPCR, bovine rhodopsin (Palczewski et al. 2000). This precludes any possibility of a structure-based sequence alignment for GPCRs. We therefore created multiple sequence alignments based on sequence information and the knowledge of the location of the TM helices in rhodopsin. We employed a hierarchical approach in creating the alignments. In this approach, small sets of very close sequences were first aligned automatically. Alignments of increasing distance were then merged. In cases of gaps in the TM regions the alignments were edited manually, assuming that all aligned receptors share a similar seven-transmembrane bundle. Manual intervention was also necessary in cases where conserved residues or motifs (such as a N-glycosylation site common to most ORs) were misaligned. Automatic alignments of sequences were done using the Clustal X (Thompson et al. 1997) software with default parameters. The same software, in its profile alignment mode, was used for automatic merging of alignments. Manual editing and merging of alignments was done using the Seaview (Galtier et al. 1996) software.
To create the alignment of OR sequences we performed the following steps:
A subset of 112 ORs was selected for alignment against non-OR GPCRs. This set contained at least two representatives from each OR family. Where possible, we selected sequences that conserved all 31 positions of the OR consensus (Fig. 2
). From families 3 and 56, in which no sequence conserved all the consensus positions, we chose sequences conserving 30 and 29 of these positions, respectively.
The alignment of the non-OR GPCR sets and the merging of the resultant alignments were done automatically. Manual editing of the alignments was performed in the same cases detailed for the OR-only alignment. The OR subset was added to the alignment in the same way. We removed from the final alignment all non-OR GPCR sequences displaying more than 60% identity with another sequence in the set. The final alignment contained 205 sequences112 OR sequences and 93 non-OR GPCR sequences.
Both the alignment of ORs alone and of ORs with other rhodopsin like GPCRs are available online as Supplemental Material, together with a table of the positions of the predicted binding site residues within these alignments.
Location of TM segments and residue numbering
We used the annotation for bovine rhodopsin found in the SWISS-PROT database (Bairoch and Apweiler 2000). The location of the TM segments for the ORs and the other GPCRs was inferred from their alignment against this protein. Residues within the TM segments are numbered relative to the beginning of the TM segments. Residues within the second extracellular loop are numbered relative to the disulfide bond-forming cysteine, which is numbered zero.
Construction of the set of OR ortholog pairs
The construction process constituted the following steps:
(hOR,mOR), the overall sequence identity, using the alignment of all ORs in the data set.
values computed in step 1, select those pairs where the members are reciprocal best hits (Mushegian et al. 1998), that is, pairs (hOR, mOR) such that the overall sequence identity
(hOR,mOR) fulfills
(hOR,mOR) = maxmOR'
M
(hOR,mOR') and
(hOR,mOR) = maxhOR'
H
(hOR',mOR) where H and M are the sets of human and mouse receptors within the data set, respectively. This step identified 257 pairs.
Construction of the set of OR paralog pairs
The construction process constituted the following steps:
(ORA, ORB), the overall sequence identity, using the alignment of all ORs in the data set.
values computed in step 1, select all nonredundant paralogous pairs (ORA,ORB), such that
(ORA,ORB) fulfills,
(ORA,ORB) = maxOR'
S
(ORA,OR'); ORA,ORB
S, where S is either human or mouse.
(ORA,ORB) > 77%.
(ORA, ORB) = maxOR'
S
(ORA, OR') > 95%. Where possible, try to replace the pair (ORA, ORB) with a pair (ORA, ORB'), such that, 77%
(ORA, ORB)
95%, and for any receptor ORC
ORB' within species S
(ORA, ORC)
(ORA, ORB'). These steps resulted in a set of 518 paralogous OR pairs.
Phylogenetic analysis
The Clustal X (Thompson et al. 1997) software was used to generate a neighbor-joining tree (Saitou and Nei 1987) from an existing manually curated alignment, using default parameters. The program NJPLOT (Perriere and Gouy 1996) was used to visualize the resultant tree.
Calculation and assessment of positional conservation
In calculating the conservation of a position in an alignment one considers whether the substitutions seen at a specified position are conservative or not. One possibility for assessing whether a certain substitution may be classified as conservative or not is the examination of the score corresponding to the substitution in a scoring matrix, such as BLOSUM62 (Henikoff and Henikoff 1996). However, such substitution matrices were designed for database searching and pairwise alignment, and have not been tested for their ability to predict whether a substitution would alter a protein or not (Ng and Henikoff 2001). Therefore, we conservatively chose to use the strict measure of identity in the calculation of conservation. For each alignment position i we consider the subset of pairs in which at least one of the sequences has an amino acid at that alignment position, that is, excluding pairs in which both sequences have a gap. The number of pairs in this subset will be denoted by n(i) in the following calculations.
The conservation at position i was calculated as
![]() | (1) |
where nI(i) is the number of pairs in which both members have the same amino acid at position i.
In the equations that follow, all quantities refer to a specific alignment position i, which will be omitted for clarity.
For each position we expect to find a certain amount of conservation that is due only to the fact that each pair contains related sequences, exhibiting some degree of sequence identity. It is therefore necessary to assess the significance of the observed value of C, given the overall sequence identity in the pairs of the set examined. To determine the statistical significance of C we employed a modified one-sided chi-square test with one degree of freedom. The expected number of pairs in which both members have the same amino acid at position i was calculated as
![]() | (2) |
where
(j) is the overall sequence identity within the jth pair in the subset.
The expected number of pairs differing at position i was calculated as
![]() | (3) |
The
2 value for the statistical significance of C is then calculated by
![]() | (4) |
The statistical significance of Y was then extracted from the
2 distribution with one degree of freedom.
Comparison of the positional conservation between orthologs and paralogs
We wished to distinguish positions that are equally conserved among ortholog and paralog pairs from those that show differential conservation between these two sets. For this purpose we tested, for each position i, the null hypothesis that the probability of this position to be conserved within a pair of orthologs is equal to that within a pair of paralogs, using a two-sample binomial proportions test (Collet 1991). We denote by no and np the number of ortholog and paralog pairs, respectively, in which at least one sequence has an amino acid at position i; by nIo and nIo the number of ortholog and paralog pairs, respectively, in which both sequences have the same amino acid at alignment position i; and Co and Cp are the respective positional conservations of alignment position i in the ortholog and paralog sets, as calculated by equation 1
. Under the null hypothesis there is a common conservation probability, p, for the ortholog and paralog sets, which can be estimated by:
![]() | (5) |
(Collet 1991).
We may consider Co and Cp as the estimated conservation probabilities for the ortholog and paralog sets, respectively. If we assume the two sets represent independent samples then for large enough sample sizes the difference
![]() | (6) |
will have an approximate normal distribution and variance given by:
![]() | (7) |
and so
![]() | (8) |
(where s.e.(D) denotes the standard error of D) is approximately normally distributed with zero mean and unit variance. Positions for which the null hypothesis is rejected, and for which D > 0, were considered as having higher conservation within ortholog pairs than within paralog pairs.
Correction for multiple testing
Both the test for positional conservation and the test for comparison of positional conservations were performed for each alignment position. We used the FDR method (Benjamini and Hochberg 1995) to eliminate possible false positives due to multiple tests.
Statistical significance of the overlap between the predicted binding site set and results obtained by an alternative method
The following section deals with the calculation of the statistical significance of the overlap between the predicted binding site set and the results of SASA calculation for the rhodopsin structure, SCAM analysis of the human D2 dopamine receptor, and ligand contact residues obtained experimentally for aminergic receptors. Let T be the set of sequence positions analyzed in the particular method, R the set of sequence positions identified by the method, A the subset of the prediction contained within T, and O = A
R the overlap between the prediction and the results of the particular method. Then,
![]() | (9) |
In the case of the ligand contact residues obtained experimentally for aminergic receptors (Shi and Javitch 2002), we had no information as to the identity of the test set T. We therefore conservatively assumed that only residues within the transmembrane helices of receptors were tested.
Statistical significance of the conservation of the predicted binding site in ortholog pairs
A simulation was designed to test the hypothesis that the conservation of the binding site within ortholog pairs is purely due to the fact that its positions were selected for their high conservation within ortholog pairs. A binary matrix M of size no x b (no is the number of ortholog pairs; b is the number of residues within the predicted binding site) was generated, where each row corresponds to an ortholog pair, and each column corresponds to a binding site position. Mij = 1 if both members of the ith pair had an identical amino acid at the jth binding site position; otherwise, Mij = 0. In each of 10,000 iterations, we permuted each column independently, thus preserving the positional conservation values of the binding site positions. We then examined the rows of the modified matrix to find the number of pairs that had at most one difference within the binding site. We assessed the significance of the observed result by calculating the fraction of iterations where the simulated result was at least as good as the one observed.
Homology modeling
A homology model of OR5U1 (HORDE id 512) was constructed, using the high-resolution bovine rhodopsin crystal structure (PDB id 1F88
[PDB]
; Palczewski et al. 2000) as a template. The modeling process was made up of the following steps:
| Electronic supplemental material |
|---|
|
|
|---|
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
|
|
|---|
Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 4548.
Baldwin, J.M. 1994. Structure and function of receptors coupled to G proteins. Curr. Opin. Cell Biol. 6: 180190.[CrossRef][Medline]
Ballesteros, J.A., Shi, L., and Javitch, J.A. 2001. Structural mimicry in G protein-coupled receptors: Implications of the high-resolution structure of rhodopsin for structurefunction analysis of rhodopsin-like receptors. Mol. Pharmacol. 60: 119.
Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rateA practical and powerful approach to multiple testing. J. R. Stat. Soc. B Met. 57: 289300.
Bozza, T., Feinstein, P., Zheng, C., Mombaerts, P., Horn, F., Vriend, G., and Cohen, F.E. 2002. Odorant receptor expression defines functional units in the mouse olfactory system. J. Neurosci. 22: 30333043.
Buck, L., Axel, R., Fiser, A., Do, R.K., and Sali, A. 1991. A novel multigene family may encode odorant receptors: A molecular basis for odor recognition. Cell 65: 175187.[CrossRef][Medline]
Campagne, F., Jestin, R., Reversat, J.L., Bernassau, J.M., and Maigret, B. 1999. Visualisation and integration of G protein-coupled receptor related information help the modelling: Description and applications of the Viseur program. J. Comput. Aided Mol. Des. 13: 625643.[CrossRef][Medline]
Collet, D. 1991. Modelling binary data, 1st ed., p. 369. Chapman & Hall, London.
Delano, W.L. 2002. The pymol molecular graphics system, 0.91 ed. Delano Scientific, San Carlos, CA.
Edvardsen, O., Reiersen, A.L., Beukers, M.W., and Kristiansen, K. 2002. tGRAP, the G-protein coupled receptors mutant database. Nucleic Acids Res. 30: 361363.
Fiser, A., Do, R.K., and Sali, A. 2000. Modeling of loops in protein structures. Protein Sci. 9: 17531773.[Abstract]
Floriano, W.B., Vaidehi, N., Goddard 3rd, W.A., Singer, M.S., and Shepherd, G.M. 2000. Molecular mechanisms underlying differential odor responses of a mouse olfactory receptor. Proc. Natl. Acad. Sci. 97: 1071210716.
Fuchs, T., Glusman, G., Horn-Saban, S., Lancet, D., and Pilpel, Y. 2001. The human olfactory subgenome: From sequence to structure and evolution. Hum. Genet. 108: 113.[CrossRef][Medline]
Galtier, N., Gouy, M., and Gautier, C. 1996. SEAVIEW and PHYLO_WIN: Two graphic tools for sequence alignment and molecular phylogeny. Comput. Appl. Biosci. 12: 543548.
Gilad, Y., Man, O., Paabo, S., and Lancet, D. 2003. Human specific loss of olfactory receptor genes. Proc. Natl. Acad. Sci. 100: 33243327.
Gimelbrant, A.A., Stoss, T.D., Landers, T.M., and McClintock, T.S. 1999. Truncation releases olfactory receptors from the endoplasmic reticulum of heterologous cells. J. Neurochem. 72: 23012311.[CrossRef][Medline]
Glusman, G., Yanai, I., Rubin, I., and Lancet, D. 2001. The complete human olfactory subgenome. Genome Res. 11: 685702.
Henikoff, J.G. and Henikoff, S. 1996. Blocks database and its applications. Methods Enzymol. 266: 88105.[Medline]
Horn, F., Vriend, G., and Cohen, F.E. 2001. Collecting and harvesting biological data: The GPCRDB and NucleaRDB information systems. Nucleic Acids Res. 29: 346349.
Ji, H., Zheng, W., Zhang, Y., Catt, K.J., and Sandberg, K. 1995. Genetic transfer of a nonpeptide antagonist binding site to a previously unresponsive angiotensin receptor. Proc. Natl. Acad. Sci. 92: 92409244.
Kondo, R., Kaneko, S., Sun, H., Sakaizumi, M., and Chigusa, S.I. 2002. Diversification of olfactory receptor genes in the Japanese medaka fish, Oryzias latipes. Gene 282: 113120.[CrossRef][Medline]
Krautwurst, D., Yau, K.W., Reed, R.R., Sali, A., and Blundell, T.L. 1998. Identification of ligands for olfactory receptors by functional expression of a receptor library. Cell 95: 917926.[CrossRef][Medline]
Lapidot, M., Pilpel, Y., Gilad, Y., Falcovitz, A., Sharon, D., Haaf, T., and Lancet, D. 2001. Mousehuman orthology relationships in an olfactory receptor gene cluster. Genomics 71: 296306.[CrossRef][Medline]
Li, L., Shakhnovich, E.I., and Mirny, L.A. 2003. Amino acids determining enzyme-substrate specificity in prokaryotic and eukaryotic protein kinases. Proc. Natl. Acad. Sci. 100: 44634468.<