Volume 79, Issue 4, pp. 1061-1078
Research Article

Learning generative models for protein fold families

Sivaraman Balakrishnan (Corresponding Author)

Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania

Sivaraman Balakrishnan and Hetunandan Kamisetty contributed equally to this work.

Correspondence: 5000 Forbes Ave., Pittsburgh, PA 15213
Hetunandan Kamisetty (Corresponding Author)

Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania

Correspondence: 5000 Forbes Ave., Pittsburgh, PA 15213
Jaime G. Carbonell

Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania

Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania

Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania
Su-In Lee

Department of Computer Science & Engineering, University of Washington, Seattle, Washington

Department of Genome Sciences, University of Washington, Seattle, Washington
Christopher James Langmead (Corresponding Author)

Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania

Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania

Correspondence: 5000 Forbes Ave., Pittsburgh, PA 15213
First published: 11 November 2010
Citations: 231

Abstract

We introduce a new approach to learning statistical models from multiple sequence alignments (MSA) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Existing techniques for learning graphical models from MSA either make strong, and often inappropriate, assumptions about the conditional independencies within the MSA (e.g., Hidden Markov Models), or else use suboptimal algorithms to learn the parameters of the model. In contrast, GREMLIN makes no a priori assumptions about the conditional independencies within the MSA. We formulate and solve a convex optimization problem, thus guaranteeing that we find a globally optimal model at convergence. The resulting model is also generative, allowing for the design of new protein sequences that have the same statistical properties as those in the MSA. We perform a detailed analysis of covariation statistics on the extensively studied WW and PDZ domains and show that our method outperforms an existing algorithm for learning undirected probabilistic graphical models from MSA. We then apply our approach to 71 additional families from the PFAM database and demonstrate that the resulting models significantly outperform Hidden Markov Models in terms of predictive accuracy. Proteins 2011; © 2011 Wiley-Liss, Inc.
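To make the kind of model the abstract describes concrete, the following is a minimal, illustrative sketch of fitting an undirected pairwise graphical model (per-column fields plus residue-pair couplings) to an integer-encoded MSA by regularized pseudo-likelihood. This is not the paper's implementation: GREMLIN uses block-regularization and structure learning over the full amino acid alphabet, whereas this toy uses a plain L2 penalty, gradient descent, and a two-letter alphabet; all names and hyperparameters here are illustrative.

```python
import numpy as np

def fit_mrf_pseudolikelihood(msa, q, lam=0.01, lr=0.1, epochs=300):
    """Fit fields v[i, a] and couplings w[i, j, a, b] of a pairwise MRF
    to an (N, L) integer MSA with alphabet size q, by minimizing the
    L2-regularized negative pseudo-log-likelihood with gradient descent."""
    N, L = msa.shape
    v = np.zeros((L, q))
    w = np.zeros((L, L, q, q))   # symmetric: w[i, j, a, b] == w[j, i, b, a]
    onehot = np.eye(q)[msa]      # (N, L, q)
    idx = np.arange(L)
    for _ in range(epochs):
        # Conditional logits for column i of sequence n:
        # logits[n, i, a] = v[i, a] + sum_j w[i, j, a, msa[n, j]]
        logits = v[None] + np.einsum('ijab,njb->nia', w, onehot)
        logits -= logits.max(axis=2, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=2, keepdims=True)        # softmax over the alphabet
        diff = p - onehot                        # d(-log PL)/d(logits)
        gv = diff.sum(axis=0) / N + 2 * lam * v
        gw = np.einsum('nia,njb->ijab', diff, onehot) / N + 2 * lam * w
        gw = (gw + gw.transpose(1, 0, 3, 2)) / 2  # keep couplings symmetric
        gw[idx, idx] = 0.0                        # no self-couplings
        v -= lr * gv
        w -= lr * gw
    return v, w

# Toy MSA: columns 0 and 1 co-vary perfectly ("correlated mutation"),
# column 2 is independent noise.
rng = np.random.default_rng(0)
col0 = rng.integers(0, 2, size=200)
col2 = rng.integers(0, 2, size=200)
msa = np.stack([col0, col0, col2], axis=1)
v, w = fit_mrf_pseudolikelihood(msa, q=2)
```

On this toy alignment the learned coupling block between the two correlated columns, `w[0, 1]`, acquires a much larger norm than the couplings to the independent column, which is the signal the paper exploits for contact and covariation analysis. Because the objective is convex in (v, w), gradient descent here converges toward the same global optimum the paper's solver would find for this penalty.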
