CAMPS 2.0: Exploring the sequence and structure space of prokaryotic, eukaryotic, and viral membrane proteins
Sindy Neumann
Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85354 Freising, Germany
Holger Hartmann
Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, 81377 Munich, Germany
Antonio J. Martin-Galiano
Unidad de Genética Bacteriana, Centro Nacional de Microbiología, Instituto de Salud Carlos III, Majadahonda, 28220 Madrid, Spain
Angelika Fuchs
pRED, Pharma Research and Early Development, pRED Informatics, Roche Diagnostics GmbH, 82377 Penzberg, Germany
Corresponding Author
Dmitrij Frishman
Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85354 Freising, Germany
Department of Genome Oriented Bioinformatics, Technische Universität München, Maximus-von-Imhof-Forum 3, 85354 Freising, Germany
Abstract
Structural bioinformatics of membrane proteins is still in its infancy, and the picture of their fold space is only beginning to emerge. Because only a handful of three-dimensional structures are available, sequence comparison and structure prediction remain the main tools for investigating sequence–structure relationships in membrane protein families. Here we present a comprehensive analysis of the structural families corresponding to α-helical membrane proteins with at least three transmembrane helices. The new version of our CAMPS database (CAMPS 2.0) covers nearly 1300 eukaryotic, prokaryotic, and viral genomes. Using an advanced classification procedure, which is based on high-order hidden Markov models and considers sequence similarity as well as the number of transmembrane helices and loop lengths, we identified 1353 structurally homogeneous clusters roughly corresponding to membrane protein folds. Only 53 clusters are associated with experimentally determined three-dimensional structures, and for these clusters CAMPS is in reasonable agreement with structure-based classification approaches such as SCOP and CATH. We therefore estimate that ∼1300 structures would need to be determined to provide sufficient structural coverage of polytopic membrane proteins. CAMPS 2.0 is available at http://webclu.bio.wzw.tum.de/CAMPS2.0/. Proteins 2011. © 2012 Wiley Periodicals, Inc.
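The ∼1300 estimate follows from the two cluster counts given in the abstract. The short sketch below reproduces that arithmetic, assuming (as the estimate appears to) that one experimentally determined structure per cluster is enough to cover it. The function and constant names are illustrative and are not part of CAMPS.

```python
# Counts quoted in the abstract.
TOTAL_CLUSTERS = 1353          # structurally homogeneous clusters in CAMPS 2.0
CLUSTERS_WITH_STRUCTURE = 53   # clusters linked to an experimental 3D structure


def remaining_targets(total: int, covered: int) -> int:
    """Clusters still lacking a structure.

    Assumption: a single structure per cluster counts as full coverage.
    """
    if not 0 <= covered <= total:
        raise ValueError("covered must lie between 0 and total")
    return total - covered


def coverage_fraction(total: int, covered: int) -> float:
    """Share of clusters that already have at least one structure."""
    if total <= 0:
        raise ValueError("total must be positive")
    return covered / total


if __name__ == "__main__":
    print(remaining_targets(TOTAL_CLUSTERS, CLUSTERS_WITH_STRUCTURE))        # 1300
    print(f"{coverage_fraction(TOTAL_CLUSTERS, CLUSTERS_WITH_STRUCTURE):.1%}")  # 3.9%
```

Under this assumption, roughly 4% of the clusters are structurally covered, which leaves the ∼1300 uncovered clusters quoted above.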
Supporting Information
Additional Supporting Information may be found in the online version of this article.
Filename | Description |
---|---|
PROT_23242_sm_suppinfo.pdf (336.2 KB) | Supporting Information |