Protein Science

Protein Science (2004), 13:2170-2183. Published by Cold Spring Harbor Laboratory Press. Copyright © 2004 The Protein Society

Proteome-wide functional classification and identification of prokaryotic transmembrane proteins by transmembrane topology similarity comparison

Masafumi Arai1,2, Kosuke Okumura1, Masanobu Satake2,3 and Toshio Shimizu1

1 Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University, Hirosaki 036-8561, Japan
2 Department of Developmental Biology and Neuroscience, Graduate School of Life Sciences and
3 Department of Molecular Immunology, Institute of Development, Aging and Cancer, Tohoku University, Sendai 980-8577, Japan

Reprint requests to: Toshio Shimizu, Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University, 3, Bunkyo-cho, Hirosaki 036-8561, Japan; e-mail: slsimi{at}si.hirosaki-u.ac.jp; fax: 81-172-39-3638.

(RECEIVED April 15, 2004; FINAL REVISION May 19, 2004; ACCEPTED May 19, 2004)


    Abstract
 
We propose a new method for classifying and identifying transmembrane (TM) protein functions on a proteome scale by applying single-linkage clustering based on TM topology similarity, which is calculated simply by comparing the lengths of loop regions. In this study, we focused on 87 prokaryotic TM proteomes: 31 proteobacteria, 22 gram-positive bacteria, 19 other bacteria, and 15 archaea. Prior to clustering, we categorized individual TM protein sequences as "known," "putative" (similar to "known" sequences), or "unknown" by a homology search and sequence similarity comparison against SWISS-PROT, in order to assess the current status of functional annotation of the TM proteomes based on sequence similarity alone. More than three-quarters (75.7%) of the TM protein sequences are functionally "unknown," with only 3.8% and 20.5% classified as "known" and "putative," respectively. Using our clustering approach based on TM topology similarity, we increased the fraction of TM protein sequences that are functionally classified and identified from 24.3% to 60.9%. The clusters obtained correspond well to functional superfamilies or families. For example, one cluster of TM proteins with six TM segments contains 109 of the 119 sequences annotated as "ATP-binding cassette transporter," together with 122 "unknown" sequences.

Keywords: transmembrane protein; transmembrane topology similarity; functional classification and identification; proteome-wide analysis; prokaryotic genome

Abbreviations: ABC, ATP-binding cassette • n-tms, with n transmembrane segment(s) • ORF, open reading frame • SP, signal peptide • TM, transmembrane • TMS, transmembrane segment

Supplemental material: see www.proteinscience.org

Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.04814404.


    Introduction
 
Genome projects have provided an enormous number of potential protein sequences, whose functions researchers have tried to identify using computer-based methods. Many of these proteins, however, have not yet been annotated: about half of all proteome sequences are classified as functionally "unknown" or "putative" at best (Serres et al. 2001). This is particularly true of transmembrane (TM) proteins, which account for 20%–30% of the proteins in individual proteomes (Boyd et al. 1998; Jones 1998; Wallin and von Heijne 1998; Mitaku et al. 1999; Pasquier and Hamodrakas 1999; Stevens and Arkin 2000; Krogh et al. 2001; Liu and Rost 2001; Arai et al. 2003). As described later, functionally "unknown" sequences make up more than three-quarters of all TM proteomes. Furthermore, of the 70,228 full-length protein sequences with a function annotated as "known" in SWISS-PROT release 41 (containing 122,564 sequences in total; Boeckmann et al. 2003), only 10,796 are TM protein sequences, compared with 59,432 soluble protein sequences (see the Materials and Methods section for details). This shortage of "known" TM protein sequences in SWISS-PROT seriously delays the classification and identification of TM protein functions if sequence similarity is used as the only criterion.

On the other hand, recent studies have revealed that TM protein functions are closely related to TM topology (the number of TM segments [TMSs], the positions of the TMSs, and the N-tail location), and that functions can be classified and identified with high accuracy using TM topology information as the primary basis, even without using sequence similarity directly (Sugiyama et al. 2003; Inoue et al. 2004). Individual functional groups have their own specific TM topologies, that is, characteristic combination patterns of loop lengths. The similarity of TM topologies between two TM protein sequences can be evaluated rather easily by comparing the lengths of corresponding loop regions in the two sequences, as described in detail in the Materials and Methods section. In general, a pair of TM protein sequences with a higher sequence identity also shows a higher TM topology similarity. In some cases, however, the TM topology similarity remains high between two sequences belonging to the same functional group (at the superfamily level) even if the sequence similarity is below the twilight zone. For example, the sequence identity between mouse GABA receptor α6 (GAA6_MOUSE) and human neuronal acetylcholine receptor α5 (ACH5_HUMAN) is only 15.8%, whereas their TM topology similarity is as high as 96.9%. Thus, the classification and identification of TM protein functions on a proteome scale should improve considerably when TM topology information is used in addition to sequence similarity.

One approach to obtaining more reliable and accurate TM topology predictions is the ConPred program (Ikeda et al. 2002; Arai et al. 2004; Xia et al. 2004), which uses a consensus strategy that combines several published prediction methods. For example, in predicting the entire TM topology of prokaryotic TM protein sequences, it raises the accuracy by more than 10 percentage points, from 56.5% (by MEMSAT 1.8 [Jones et al. 1994] and HMMTOP 2.0 [Tusnády and Simon 1998, 2001]) to 68.1% (Arai et al. 2004).

In this study, we propose a new approach for classifying and identifying TM proteome functions by using a clustering method based on TM topology similarity. We focused on predicted TM proteins from 87 completed prokaryotic (72 bacterial and 15 archaean) genome sequences. In this approach, when sequences of unknown function are segregated into a cluster together with sequences of known function, not only functional classification but also functional identification is achieved. Prior to clustering, we first identified the functions of the predicted TM protein sequences by a homology search and sequence comparison, and classified them on the basis of sequence similarity into three categories: "known," "putative" (similar to "known" sequences), and "unknown."


    Results and Discussion
 
Table 1 summarizes the 87 prokaryotic (31 proteobacterial, 22 gram-positive bacterial, 19 other bacterial, and 15 archaean) proteomes used in this study. From the 239,359 protein sequences in the 87 proteomes, 53,053 TM protein sequences (22.2% of all sequences) were obtained together with their TM topologies, following the procedure described in the Materials and Methods section. We focused on the TM proteins with between 1 and 12 TMSs (1–12-tms), because only 3.8% of all the TM proteins in the proteomes have more than 12 TMSs. The number and fraction of predicted 1–12-tms TM proteins in each proteome are also listed in Table 1. Most of the proteomes fall in a narrow range around 21% across the four categories of prokaryotic species, with a few extremes, for example, 13.2% for Buchnera aphidicola and 29.1% for Tropheryma whipplei. The average fraction of TM proteins per proteome is 21.3% over all 87 species. The distribution of the number of TMSs in the 51,044 1–12-tms TM protein sequences is given in the second column of Table 2.


Table 1. The number of ORFs and predicted 1–12-tms TM proteins for 87 prokaryotic genomes
 

Table 2. The current status of the functional identification of the 1–12-tms TM proteins from the 87 prokaryotic genomes based on a homology search and sequence similarity search
 
Current status of the proteome-wide functional identification of TM protein sequences based on sequence similarity only
The current level of functional identification of 1–12-tms TM proteins obtained by sequence homology searches (and similarity comparisons) is shown in Table 2. The fractions of TM protein sequences identified as "known," defined here as identical or almost identical to a SWISS-PROT sequence with an unambiguous function, are extremely low: at most 5.2% (12-tms TM proteins) and 5.0% (9-tms TM proteins), and only 3.8% on average over all 1–12-tms TM proteins.

The fractions of "putative" sequences, the functions of which are inferable from the functionally known sequences in SWISS-PROT, range widely from the minimum, 11.3% for 2-tms, to the maximum, 34.9% for 9-tms TM proteins, with an overall average of 20.5%. The "known" and "putative" sequences added together amount to only 24.3%, that is, about one-quarter of the TM proteomes, indicating the majority (i.e., more than three-quarters) of TM proteomes are still classified as functionally unknown.

The results listed in Table 2 are shown separately for each species in Figure 1. As expected, Escherichia coli has the highest percentage of known sequences, with over 40% of its TM proteome sequences classified as "known." The rate of "known" sequences is similarly high for Shigella flexneri because of its close phylogenetic relationship to E. coli: about three-quarters of the TM protein sequences of S. flexneri are almost identical to those of E. coli. The "known" fractions for the archaean genomes are extremely small, just 1.1% on average over the 15 genomes.



Figure 1. Current status of the functional identification of the 1–12-tms TM proteins from the 87 prokaryotic genomes based on a homology search and sequence similarity comparison, together with the results of the functional classification and identification based on TM topology similarity: black bar, "known" sequences; dark gray bar, "putative" sequences; gray bar, newly classified and identified "unknown" sequences; white bar, still "unknown" sequences. Abbreviations of all 87 species are the same as in Table 1.

 
As for the combined fraction of "known" and "putative" sequences, 10 γ-proteobacterial species (from E. coli to P. multocida in the list) stand out from the other species. This again reflects the large number of "known" E. coli sequences in SWISS-PROT. Overall, the proteobacterial genomes far exceed the other three species categories in the fractions of "known" and "putative" sequences. The archaean TM proteomes have the smallest fractions of "known" plus "putative" sequences, 8.4% on average over the 15 species. Interestingly, 65.1% of the "putative" sequences across all the archaean genomes are annotated on the basis of proteobacterial "known" sequences, whereas only 23.3% are annotated directly on the basis of archaean "known" sequences.

Threshold TM topology similarities and the minimum cluster size
We regarded the proteome-scale functional classification by the clustering approach as successful when more than 50% of all the sequences were included in clusters of at least 10 sequences (the minimum cluster size). The threshold TM topology similarities used as the clustering criteria were determined on this basis. The two conditions (50% coverage and a minimum cluster size of 10) are purely empirical and are not derived from experimental data. They are, however, supported by the relationship between the threshold TM topology similarity and the minimum cluster size: as the minimum cluster size increases, the threshold TM topology similarity decreases rapidly at first and then levels off at a minimum cluster size of around 10 for most numbers of TMSs (see Supplemental Fig. 1). Hereafter, we refer to clusters containing 10 or more sequences as "large clusters," and to all others, including orphan clusters, as "small clusters."

Threshold TM topology similarities thus determined are, for example, 98%, 85%, and 82% for 1-tms, 6-tms, and 12-tms TM protein sequences, respectively, as shown in the third column of Table 3. As expected, stricter threshold similarity values are obtained for the smaller numbers of TMSs.
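The coverage criterion described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `coverage` and `pick_threshold` are hypothetical, the cluster sizes are made-up inputs, and the rule of choosing the strictest threshold that still exceeds 50% coverage is an assumption about how the criterion was applied, not a statement of the authors' procedure.

```python
def coverage(cluster_sizes, min_size=10):
    """Fraction of sequences that fall in clusters of at least min_size."""
    total = sum(cluster_sizes)
    if total == 0:
        return 0.0
    return sum(size for size in cluster_sizes if size >= min_size) / total


def pick_threshold(sizes_by_threshold, min_size=10, min_coverage=0.5):
    """Return the strictest (highest) similarity threshold whose large-cluster
    coverage exceeds min_coverage, or None if no threshold qualifies.

    sizes_by_threshold maps a threshold (in %) to the list of cluster sizes
    obtained when clustering at that threshold (hypothetical input).
    """
    passing = [
        threshold
        for threshold, sizes in sizes_by_threshold.items()
        if coverage(sizes, min_size) > min_coverage
    ]
    return max(passing) if passing else None


# Made-up cluster sizes for 22 sequences clustered at four thresholds.
example = {95: [9, 8, 5], 90: [12, 6, 4], 85: [14, 8], 80: [22]}
chosen = pick_threshold(example)
```

In this toy input, clustering at 95% yields no cluster of 10 or more sequences, whereas at 90% one cluster of 12 covers just over half of the 22 sequences, so 90 is selected.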


Table 3. The results of the functional classification and identification of the 1–12-tms TM proteins from the 87 prokaryotic genomes based on TM topology similarity
 
Comprehensive functional classification and identification of TM protein sequences based on TM topology similarity
The results of the functional classification and identification of TM proteomes using the single-linkage clustering method based on TM topology similarity are summarized in Table 3 for 1–12-tms TM proteins. The number of large clusters generated ranges from 22 to 74, with, in general, more clusters for smaller numbers of TMSs and fewer for larger numbers. These large clusters include more than half of all the TM proteome sequences, and a large majority of the included sequences (69.8%) are "unknown" sequences clustered together with "known" and "putative" sequences, indicating that a large number of "unknown" sequences have been functionally classified and identified by this approach. When the "known" plus "putative" sequences in the small clusters are also counted, the functionally annotated TM protein sequences amount to 60.9% of the TM proteome sequences, a significant improvement over the 24.3% obtained from the sequence homology search plus similarity comparison.

The percentages of sequences newly classified and identified by this approach are displayed in Figure 1 (gray bars) for each species. Averaged over the 31 proteobacteria, 22 gram-positive bacteria, 19 other bacteria, and 15 archaea, 32.3%, 40.3%, 38.9%, and 36.7% of all the sequences are newly classified and identified, respectively. These values correspond to more than half of all the "unknown" sequences in each of the four species categories. Several γ-proteobacteria, namely E. coli, Salmonella typhi, Salmonella typhimurium, Yersinia pestis, S. flexneri, Buchnera sp., B. aphidicola, Wigglesworthia brevipalpis, Haemophilus influenzae, and Pasteurella multocida, have fewer newly classified and identified TM proteins than the other species, although the total levels of functional annotation achieved for them are remarkably high, at around 80%. It is also notable that the fraction of classified and identified archaean sequences increased markedly, from 8.4% to 45.9%.

The following describes the details of the functional classification and identification attained by this approach, taking 6-tms TM proteins as an example.

Table 4 lists the 27 large clusters generated by single-linkage clustering based on TM topology similarity (threshold similarity 85%) for 6-tms TM proteins, in order of cluster size. The largest cluster, Cluster 1, includes 1085 sequences, nearly one-fourth of all the 6-tms TM protein sequences. Its "known" plus "putative" sequences (679 in total) are all annotated as "transport system permease protein" except for one sequence (annotated as photosystem II chlorophyll-binding protein). This implies that the 406 "unknown" sequences (37.4% of the 1085 sequences) in the cluster could also be annotated as transport system permease proteins. By further clustering based on sequence similarity (threshold sequence identity 30%) within Cluster 1, we obtained 46 subclusters that correspond to functional subgroups, for example, "dipeptide transport system permease dppB" (228 sequences in total, including "unknown" sequences), "maltose transport system permease malD" (212 sequences), "lactose transport system permeases lacF" (181 sequences), and "sulfate transport system permease cysT" (118 sequences). This suggests that, in this case, the TM topology-based clustering corresponds to a superfamily- or family-level classification, whereas the sequence similarity-based clustering corresponds to a family- or subfamily-level one.


Table 4. A list of 27 large clusters (with ≥10 sequences) based on a threshold TM topology similarity of 85% for 6-tms TM proteins
 
The top 13 clusters, except for Clusters 9 and 12, contain sequences that distribute over all the species categories, indicating the TM proteins of these functional groups are essential for the life of prokaryotic species. By comparison, Cluster 14 (phage infection protein) contains sequences from only gram-positive bacterial and other bacterial species, and the sequences in Cluster 20 (intracellular separation protein) exist only in proteobacterial and gram-positive bacterial genomes.

Table 4 contains four clusters composed of only "unknown" sequences: Clusters 9, 16, 23, and 27. Of these, Clusters 23 and 27 comprise sequences from only archaean and only proteobacterial species, respectively. These "unknown" protein sequences are likely to represent functional groups that are not only novel but also biologically important. We expect that further experimental studies will characterize these sequences and elucidate their functions in detail.

Cluster 3 (231 sequences, of which 109 are "known" or "putative" assigned as "ATP-binding cassette [ABC] transporters") clearly illustrates how well the TM topology-based clustering works in the functional classification and identification of TM proteins. Out of 119 6-tms sequences annotated as "ABC transporter," 109 sequences (91.6%) are captured properly in this cluster, and the remaining 10 sequences are spread across nine small clusters: one sequence in a cluster with the size of nine (including nine sequences in total, N-in topology), one in a size-four cluster (N-out), two in a size-two (N-in), one in a size-two (N-out, +SP), and five orphan sequences (all N-in). The other 122 sequences are all "unknown," and no sequences with other functions are included in this cluster at all.

TM topology models of the 231 sequences are illustrated in Figure 2, where TMS, cytoplasmic, and noncytoplasmic loop regions are represented with black, gray, and dark gray bars, respectively. The sequences in this cluster have the following TM topology characteristics: (1) no signal peptide (SP) is present; (2) the N-tail loop is located on the cytoplasmic side; (3) the cytoplasmic loops (including the N- and C-tails) are long, and the C-tail loop in particular is extremely long; and (4) the noncytoplasmic loops are short and connect adjacent TMSs to form three typical "helical-hairpin" domains in the TM topology architecture (Gafvelin and von Heijne 1994; Gafvelin et al. 1997). Sequence similarity-based clustering within Cluster 3 segregates these 231 sequences into 23 subclusters (including 13 orphan clusters), which are separated by blank lines in Figure 2. In the largest subcluster (182 sequences), 105 sequences are "known" or "putative" ABC transporters, and the remaining 77 are "unknown." The other 22 subclusters are composed of only "unknown" sequences, with the exception of a few subclusters that contain ABC transporter sequences.



Figure 2. TM topology models of the 231 sequences in 6-tms Cluster 3 ("ABC transporter"; see Table 4), with the 23 subclusters (including orphan clusters) separated by spaces and enumerated in order of subcluster size: black bar, TMS region; gray bar, cytoplasmic loop region; dark gray bar, noncytoplasmic loop region. These TM topologies have traits such as very long cytoplasmic and short noncytoplasmic loops, and seem to be formed from three typical "helical-hairpin" domains in the TM topology architecture.

 
Another typical example is Cluster 10 (52 sequences), whose TM topology models are presented in Figure 3. Among the 30 "known" or "putative" sequences in this cluster, 26 are secD proteins and four have other functions. In total, 33 6-tms sequences were annotated as "secD" (both "known" and "putative") by the homology search plus sequence similarity comparison procedure, so 79% (26 of 33) of the 6-tms secD protein sequences were correctly classified into this cluster by this approach. The remaining seven secD protein sequences are dispersed into two small clusters: six sequences in a size-six cluster, and one orphan sequence, both of which have an N-out topology predicted in error. Among the four non-secD sequences in Cluster 10, one is "known" as "protein-export membrane protein secF," whose sequence is similar to that of secD and which is mostly classified together with secD into a unified secD_secF family in family databases such as Pfam (Bateman et al. 2004). The other three are "putative" sequences annotated as "inner membrane protein creD" with rather low sequence identities (42.9%, 36.8%, and 30.6%) to a hit sequence in SWISS-PROT. There are two other small 6-tms clusters holding two "known" creD sequences each: a size-three cluster whose other member is "unknown" (N-out), and a size-two cluster (N-out, +SP). Because the TM topology of the creD protein has not yet been determined experimentally, it cannot currently be ascertained which predicted TM topology model is the true one, N-out or N-in. Nevertheless, our results incidentally indicate that the two proteins have similar predicted TM topologies except for the N-tail location, that is, N-in for secD and N-out for creD. Although the function of creD (for which an enhancing effect on the transcription of phoA has been suggested; Drury and Buxton 1988) appears to differ from that of secD, the two functions might be related.



Figure 3. TM topology models of the 52 sequences in 6-tms Cluster 10 ("secD protein"; see Table 4), with the 16 subclusters separated by the space lines: black bar, TMS region; gray bar, cytoplasmic loop region; dark gray bar, noncytoplasmic loop region. The leftmost three-letter codes are abbreviations for the species name (underline, functionally "unknown" sequence) and are the same as in Table 1. The Arabic numerals next to the three-letter codes indicate species categories (1, proteobacteria; 2, gram-positive bacteria; 3, other bacteria; 4, archaea).

 
The TM proteins contained in Cluster 10 have the following TM topology characteristics, as seen in Figure 3: (1) the N-tail loop is short; (2) the noncytoplasmic loop connecting the first and second TMSs is extremely long; (3) the second cytoplasmic loop is short; (4) the other two noncytoplasmic loops are short and connect adjacent TMSs to form two "helical-hairpin" domains; and (5) the remaining two cytoplasmic loops are relatively long. As mentioned previously, the secD and secF families are defined as one unified family (SecD_SecF) in the Pfam database, whereas our approach correctly splits them into two clusters (the TM topology models of the "secF" cluster, Cluster 11, are shown in Supplemental Fig. 2). It is interesting that the differences in TM topology between secD and secF proteins are sufficient to separate them, even though the rather high sequence similarity between these protein families has led to their being classified together into one family. Further sequence similarity-based clustering within the cluster (threshold identity 30%) yields 16 subclusters (including nine orphan subclusters). The largest subcluster consists of 22 "known" sequences distributed over three of the four species groups, the second largest consists of seven "unknown" sequences specific to archaean species, and the third consists of four "known" sequences from "other bacteria" only. This is a good example of how our approach is effective in classifying and identifying the functions of TM proteins even in distantly related species. It also demonstrates that TM topologies are more strongly conserved than the amino acid sequences themselves in preserving TM protein functions.


    Materials and methods
 
Data source
We used 239,359 open reading frames (ORFs) from 87 sequenced prokaryotic genomes registered in GenBank (Benson et al. 2004) for this study, as listed in Table 1. The ORFs were downloaded from ftp://ncbi.nlm.nih.gov/genbank/genomes/ on March 6, 2003. The 87 genomes included 31 proteobacteria, 22 gram-positive bacteria, 19 other bacteria, and 15 archaea according to the classification in GenBank.

Prediction of TM protein sequences and their TM topologies from the proteomes
Out of the protein sequences translated from the ORFs, we segregated TM protein sequences and predicted their TM topologies according to the following procedure: (1) prediction of TM protein sequence candidates using SOSUI (≥98% accuracy; Hirokawa et al. 1998); (2) removal of predicted SP regions using DetecSig (88% accuracy; Lao and Shimizu 2001; Lao et al. 2002); and (3) prediction of TM topology by ConPred (68.1% accuracy; Arai et al. 2004). A more detailed description of this procedure is given in our previous article (Arai et al. 2003).

Functional identification of TM protein sequences based on sequence similarity
We first categorized the 114,965 full-length protein sequences in SWISS-PROT release 41 into "known," "putative," or "unknown" groups according to the level of functional annotation. For this categorization, we adopted the simple but rational criteria given in the GTOP database (http://spock.genes.nig.ac.jp/?genome/func.html; Kawabata et al. 2002). The criterion for discriminating sequences with a "known" function requires at least one of the following: (1) more than five letters with functional information in the DE line, (2) at least one informative word in the KW line, or (3) both "-!- FUNCTION" and "-!- CATALYTIC ACTIVITY" in the CC line. Sequence entries meet the "putative" criterion if the entry contains one of the following descriptions: (1) "HOMOLOG," "HOMOLOGY," "HYPOTHETICAL," "POTENTIAL," "POSSIBLE," "PROBABLE," or "PUTATIVE" in the DE line; (2) "BY SIMILARITY," "HYPOTHETICAL," "POTENTIAL," "POSSIBLE," "PROBABLE," or "PUTATIVE" in the "CC -!- FUNCTION" or "CC -!- CATALYTIC ACTIVITY" line; or (3) "HYPOTHETICAL PROTEIN" in the KW line. When only the "known" criterion is satisfied, the sequence is regarded as "known." When both the "known" and "putative" criteria are satisfied, the sequence is classified as "putative." Sequences that do not satisfy the "known" criterion are categorized as "unknown," even if the "putative" criterion is satisfied. Through this procedure, we obtained 70,228 "known" (10,796 TM protein sequences), 39,296 "putative" (6643), and 5441 "unknown" (754) sequences from SWISS-PROT release 41.
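The way the two criteria combine into three annotation levels can be written as a small decision function. The sketch below is a minimal illustration under stated assumptions: `annotation_level` and `de_line_is_putative` are hypothetical helper names, only the DE-line keyword test is shown, and the judgment of whether a DE or KW line carries "functional information" is left to the caller because the text does not define it in machine-checkable terms.

```python
# Keywords listed in the text for the DE line of a SWISS-PROT entry.
PUTATIVE_DE_WORDS = (
    "HOMOLOG", "HOMOLOGY", "HYPOTHETICAL", "POTENTIAL",
    "POSSIBLE", "PROBABLE", "PUTATIVE",
)


def de_line_is_putative(de_line):
    """True if the DE line contains one of the 'putative' keywords.

    Simple case-insensitive substring matching is an assumption.
    """
    text = de_line.upper()
    return any(word in text for word in PUTATIVE_DE_WORDS)


def annotation_level(meets_known, meets_putative):
    """Combine the two criteria as described in the text.

    - 'known' criterion not met          -> "unknown" (even if 'putative' is met)
    - both criteria met                  -> "putative"
    - only the 'known' criterion is met  -> "known"
    """
    if not meets_known:
        return "unknown"
    return "putative" if meets_putative else "known"


level = annotation_level(
    meets_known=True,
    meets_putative=de_line_is_putative("Hypothetical ABC transporter permease"),
)
```

Checking the "known" criterion first mirrors the rule that a sequence failing it is "unknown" regardless of any "putative" keyword.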

Next, we classified the 51,044 predicted TM protein sequences from the 87 prokaryotic genomes into three categories, in agreement with the functional description levels in SWISS-PROT, using a BLAST homology search (Altschul et al. 1990, 1997) and an ALIGN (Myers and Miller 1988) sequence comparison, as illustrated in Figure 4. The BLAST search was carried out with the default settings (first gap penalty, -11; additional gap penalty, -1; substitution matrix, BLOSUM 62; Henikoff and Henikoff 1992) against the grouped full-length sequences from the SWISS-PROT database. If a query sequence matched one of the SWISS-PROT sequences of the "known" group with an E-value less than 10⁻⁵, it was treated as a candidate for the "known" or "putative" category; otherwise, it was classified as "unknown."



Figure 4. Illustration of the procedure to categorize TM protein sequences into "known", "putative", and "unknown" groups according to the level of functional annotation in SWISS-PROT as ascertained by a homology search and sequence similarity comparison using BLAST and ALIGN, respectively. BLAST is used only for detecting candidates for "known" and "putative" sequences against SWISS-PROT; the level of functional annotation is determined in compliance with the value of the sequence identity calculated using ALIGN.

 
Next, each "known" or "putative" candidate sequence from the BLAST search was aligned with its matched SWISS-PROT sequences to calculate the global sequence identities between them, using the ALIGN program with the default settings except for the substitution matrix (BLOSUM 62 was used). The matched SWISS-PROT sequence with the highest identity was taken as the one most similar to the candidate sequence, and the candidate sequence was finally classified into one of the three categories according to the value of the highest identity: "known" (≥95%), "putative" (≥30% and <95%), or "unknown" (<30%). When a query sequence is categorized as "known" or "putative," it is considered to be a functionally identified TM protein, and the function of the matched SWISS-PROT sequence is assigned to the query sequence.
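The two-stage decision (BLAST E-value filter, then ALIGN identity thresholds) can be summarized as follows. This is a sketch rather than the authors' code: `categorize_query` is a hypothetical name, and the input is assumed to be a precomputed list of hits against "known" SWISS-PROT sequences, each carrying the BLAST E-value and the global identity from ALIGN. Neither program is invoked here.

```python
def categorize_query(hits, evalue_cutoff=1e-5):
    """Classify one query sequence from its hits against 'known' SWISS-PROT entries.

    hits: list of (swissprot_id, blast_evalue, global_identity_percent) tuples
          (hypothetical precomputed input).
    Returns (category, id_of_best_match_or_None).
    """
    # Stage 1: keep only hits that pass the BLAST E-value cutoff.
    candidates = [hit for hit in hits if hit[1] < evalue_cutoff]
    if not candidates:
        return "unknown", None

    # Stage 2: the match with the highest global identity decides the category.
    best_id, _, best_identity = max(candidates, key=lambda hit: hit[2])
    if best_identity >= 95.0:
        return "known", best_id
    if best_identity >= 30.0:
        return "putative", best_id
    return "unknown", None


category, match = categorize_query([("P1", 1e-30, 96.2), ("P2", 1e-40, 41.0)])
```

Note that the best match is chosen by identity, not by E-value, so in the usage line the hit with the weaker E-value but 96.2% identity determines the result.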

Functional classification and identification of TM protein sequences based on TM topology similarity
The procedure for classifying and identifying TM protein functions based on TM topology similarity is illustrated in Figure 5. The 51,044 TM protein sequences annotated using a BLAST homology search and ALIGN sequence comparison were divided into 36 data sets according to the number of TMSs, the presence or absence of a signal peptide, and N-tail location. The sequences were clustered within individual data sets by a single-linkage method based on TM topology similarity, which was defined as:



Figure 5. Illustration of the procedure used to classify and identify TM protein functions by single-linkage clustering based on TM topology similarity and sequence similarity using ALIGN (threshold sequence identity 30%). TM topology similarities used in the clustering are determined in the subsection, "Threshold TM topology similarities and the minimum cluster size." Prior to clustering, the predicted 1~12-tms TM protein sequences were divided into 36 data sets according to the number of TMSs, the existence of a signal peptide and N-tail location.

 

(1)

where n is the number of TMSs, l_{1,i} and l_{2,i} are the lengths of the i-th loop in sequences 1 and 2, respectively, and min(l_{1,i}, l_{2,i}) and max(l_{1,i}, l_{2,i}) are the lengths of the shorter and longer of the two loops, respectively.
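To make the role of the min/max loop-length terms concrete, the sketch below computes a similarity score from two lists of loop lengths. The exact form of Equation 1 is not reproduced in this text, so the sketch rests on stated assumptions: the score is taken as the mean of the per-loop ratios min/max, expressed as a percentage, and a pair of zero-length loops is counted as identical. The function name and the choice of which loops (for example, whether the N- and C-terminal tails are included) are also assumptions of the example.

```python
from typing import Sequence


def tm_topology_similarity(loops1: Sequence[int], loops2: Sequence[int]) -> float:
    """Percent similarity between two loop-length profiles (assumed form).

    Both sequences must have the same number of loops, which holds when
    they come from the same data set (same number of TMSs).
    """
    if len(loops1) != len(loops2):
        raise ValueError("sequences must have the same number of loops")
    if not loops1:
        raise ValueError("at least one loop is required")

    ratios = []
    for a, b in zip(loops1, loops2):
        longer = max(a, b)
        # Assumption: two zero-length loops are treated as identical.
        ratios.append(1.0 if longer == 0 else min(a, b) / longer)

    # Assumption: average the per-loop ratios and report as a percentage.
    return 100.0 * sum(ratios) / len(ratios)


if __name__ == "__main__":
    print(tm_topology_similarity([10, 20], [20, 20]))
```

Under these assumptions the score depends only on loop lengths, not on residue identity, so two proteins with unrelated sequences can still score highly if their loops have similar lengths.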

Within the individual TM topology-based clusters, the sequences were further clustered by a single-linkage method based on sequence similarity (threshold sequence identity 30%) using the ALIGN program with the default settings, except for the substitution matrix (BLOSUM 62 was used), to generate subclusters that are expected to correspond to functional subgroups within the TM topology-based clusters, as illustrated in Figure 5.
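Single-linkage clustering with a fixed similarity threshold, as used in both clustering stages, can be sketched with a union-find structure. This is an illustrative sketch rather than the authors' implementation: the function name `single_linkage`, the caller-supplied `similarity` function, and the use of "greater than or equal to" at the threshold are assumptions of the example.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def single_linkage(items: Sequence[T],
                   similarity: Callable[[T, T], float],
                   threshold: float) -> List[List[T]]:
    """Group items so that any pair scoring >= threshold shares a cluster.

    Clusters are the connected components of the "similar enough" graph,
    which is what single linkage with a cutoff produces.
    """
    parent = list(range(len(items)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose similarity reaches the threshold.
    for i, j in combinations(range(len(items)), 2):
        if similarity(items[i], items[j]) >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Collect members by root.
    clusters: Dict[int, List[T]] = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())


if __name__ == "__main__":
    toy_similarity = lambda a, b: 100 - 10 * abs(a - b)
    print(single_linkage([1, 2, 3, 10, 11, 30], toy_similarity, 90))
```

In the toy run, items 1 and 3 score below the threshold against each other but still share a cluster because both link to item 2; this chaining is characteristic of single linkage. For the two-stage procedure, the same routine would be applied first with a TM topology similarity function within each data set, and then within each resulting cluster with a sequence identity function and a threshold of 30.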

Electronic supplementary material
Supplemental materials are (1) lists of the obtained large clusters based on TM topology similarity for 1~12-tms TM proteins (named "Supple_Table1.doc"), (2) Supplemental Figure legends ("Supple_Fig_legends.doc"), (3) Supplemental Figure 1 ("Supple_Fig1.xls"), (4) Supplemental Figure 2 ("Supple_Fig2.ppt"), (5) data sets of 87 prokaryotic TM proteome sequences functionally annotated by homology search plus sequence similarity comparison (e.g., "eco.db" for E. coli), and (6) ID lists of the sequences in the obtained clusters (both the large and small clusters) for individual 1~12-tms TM proteins (e.g., "tms06.db" for 6-tms TM proteins). These files are also available at ftp://bioinfo.si.hirosaki-u.ac.jp/TopClust.


    Acknowledgments
 
This research was supported in part by a Grant-in-Aid for Scientific Research on Priority Areas (C) "Genome Information Science" (no. 15014203) and a Grant-in-Aid for Scientific Research (C) (no. 14580665) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.


    References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.

Arai, M., Ikeda, M., and Shimizu, T. 2003. Comprehensive analysis of trans-membrane topologies in prokaryotic genomes. Gene 304: 77–86.

Arai, M., Mitsuke, H., Ikeda, M., Xia, J.-X., Kikuchi, T., Satake, M., and Shimizu, T. 2004. ConPred II: A consensus prediction method for obtaining transmembrane topology models with high reliability. Nucleic Acids Res. 32: W390–W393.

Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. 2004. The Pfam protein families database. Nucleic Acids Res. 32: D138–D141.

Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L. 2004. GenBank: Update. Nucleic Acids Res. 32: D23–D26.

Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O’Donovan, C., Phan, I., et al. 2003. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31: 365–370.

Boyd, D., Schierle, C., and Beckwith, J. 1998. How many membrane proteins are there? Protein Sci. 7: 201–205.

Drury, L.S. and Buxton, R.S. 1988. Identification and sequencing of the Escherichia coli cet gene which codes for an inner membrane protein, mutation of which causes tolerance to colicin E2. Mol. Microbiol. 2: 109–119.

Gafvelin, G. and von Heijne, G. 1994. Topological "frustration" in multispanning E. coli inner membrane proteins. Cell 77: 401–412.

Gafvelin, G., Sakaguchi, M., Andersson, H., and von Heijne, G. 1997. Topological rules for membrane protein assembly in eukaryotic cells. J. Biol. Chem. 272: 6119–6127.

Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915–10919.

Hirokawa, T., Boon-Chieng, S., and Mitaku, S. 1998. SOSUI: Classification and secondary structure prediction system for membrane proteins. Bioinformatics 14: 378–379.

Ikeda, M., Arai, M., Lao, D.M., and Shimizu, T. 2002. Transmembrane topology prediction methods: A re-assessment and improvement by a consensus method using a data set of experimentally-characterized transmembrane topologies. In Silico Biol. 2: 19–33.

Inoue, Y., Ikeda, M., and Shimizu, T. 2004. Proteome-wide classification and identification of mammalian-type GPCRs by binary topology pattern. Comput. Biol. Chem. 28: 39–49.

Jones, D.T. 1998. Do transmembrane protein superfolds exist? FEBS Lett. 423: 281–285.

Jones, D.T., Taylor, W.R., and Thornton, J.M. 1994. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry 33: 3038–3049.

Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., and Hattori, M. 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32: D277–D280.

Kawabata, T., Fukuchi, S., Homma, K., Ota, M., Araki, J., Ito, T., Ichiyoshi, N., and Nishikawa, K. 2002. GTOP: A database of protein structures predicted from genome sequences. Nucleic Acids Res. 30: 294–298.

Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L. 2001. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305: 567–580.

Lao, D.M. and Shimizu, T. 2001. A method for discriminating a signal peptide and a putative 1st transmembrane segment. In Proceedings of the 2001 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences METMBS’01 (ed. F. Valafar), pp. 119–125. CSREA Press, Las Vegas.

Lao, D.M., Arai, M., Ikeda, M., and Shimizu, T. 2002. The presence of signal peptide significantly affects transmembrane topology prediction. Bioinformatics 18: 1562–1566.

Liu, J. and Rost, B. 2001. Comparing function and structure between entire proteomes. Protein Sci. 10: 1970–1979.

Mitaku, S., Ono, M., Hirokawa, T., Boon-Chieng, S., and Sonoyama, M. 1999. Proportion of membrane proteins in proteomes of 15 single-cell organisms analyzed by the SOSUI prediction system. Biophys. Chem. 82: 165–171.

Myers, E.W. and Miller, W. 1988. Optimal alignments in linear space. Comput. Appl. Biosci. 4: 11–17.

Pasquier, C. and Hamodrakas, S.J. 1999. An hierarchical artificial neural network system for the classification of transmembrane proteins. Protein Eng. 12: 631–634.

Serres, M.H., Gopal, S., Nahum, L.A., Liang, P., Gaasterland, T., and Riley, M. 2001. A functional update of the Escherichia coli K-12 genome. Genome Biol. 2: RESEARCH0035.1–0035.7.

Stevens, T.J. and Arkin, I.T. 2000. Do more complex organisms have a greater proportion of membrane proteins in their genomes? Proteins 39: 417–420.

Sugiyama, Y., Polulyakh, N., and Shimizu, T. 2003. Identification of transmembrane protein functions by binary topology patterns. Protein Eng. 16: 479–488.

Tusnády, G.E. and Simon, I. 1998. Principles governing amino acid composition of integral membrane proteins: Application to topology prediction. J. Mol. Biol. 283: 489–506.

———. 2001. The HMMTOP transmembrane topology prediction server. Bioinformatics 17: 849–850.

Wallin, E. and von Heijne, G. 1998. Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci. 7: 1029–1038.

Xia, J.-X., Ikeda, M., and Shimizu, T. 2004. ConPred_elite: A highly reliable approach to transmembrane topology prediction. Comput. Biol. Chem. 28: 51–60.

