Learning generative models for protein fold families
Corresponding Author
Sivaraman Balakrishnan
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania
Sivaraman Balakrishnan and Hetunandan Kamisetty contributed equally to this work.
5000 Forbes Ave., Pittsburgh, PA 15213
Corresponding Author
Hetunandan Kamisetty
Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
Sivaraman Balakrishnan and Hetunandan Kamisetty contributed equally to this work.
5000 Forbes Ave., Pittsburgh, PA 15213
Jaime G. Carbonell
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania
Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania
Su-In Lee
Department of Computer Science & Engineering, University of Washington, Seattle, Washington
Department of Genome Sciences, University of Washington, Seattle, Washington
Corresponding Author
Christopher James Langmead
Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania
5000 Forbes Ave., Pittsburgh, PA 15213
Abstract
We introduce a new approach to learning statistical models from multiple sequence alignments (MSAs) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Existing techniques for learning graphical models from MSAs either make strong, and often inappropriate, assumptions about the conditional independencies within the MSA (e.g., Hidden Markov Models) or use suboptimal algorithms to learn the parameters of the model. In contrast, GREMLIN makes no a priori assumptions about the conditional independencies within the MSA. We formulate and solve a convex optimization problem, thus guaranteeing that we find a globally optimal model at convergence. The resulting model is also generative, allowing for the design of new protein sequences that have the same statistical properties as those in the MSA. We perform a detailed analysis of covariation statistics on the extensively studied WW and PDZ domains and show that our method outperforms an existing algorithm for learning undirected probabilistic graphical models from MSAs. We then apply our approach to 71 additional families from the PFAM database and demonstrate that the resulting models significantly outperform Hidden Markov Models in terms of predictive accuracy. Proteins 2011; © 2011 Wiley-Liss, Inc.
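As a rough illustration of the two kinds of statistics the abstract refers to, the sketch below computes per-column amino acid frequencies (position-specific conservation) and the mutual information between two columns (a simple covariation measure) on a toy alignment. This is not GREMLIN's learning procedure. The function names `column_frequencies` and `mutual_information` and the four-sequence `toy_msa` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def column_frequencies(msa, col):
    """Empirical amino acid frequencies at one alignment column."""
    counts = Counter(seq[col] for seq in msa)
    n = len(msa)
    return {aa: c / n for aa, c in counts.items()}


def mutual_information(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    n = len(msa)
    fi = column_frequencies(msa, i)
    fj = column_frequencies(msa, j)
    # Joint counts of the residue pair observed at (i, j) in each sequence.
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / (fi[a] * fj[b]))
    return mi


# Hypothetical toy alignment: column 1 is fully conserved,
# columns 0 and 3 covary, columns 0 and 2 vary independently.
toy_msa = ["ACDA", "ACEA", "GCDG", "GCEG"]

conserved = column_frequencies(toy_msa, 1)
coupled = mutual_information(toy_msa, 0, 3)
uncoupled = mutual_information(toy_msa, 0, 2)
```

In the toy alignment, columns 0 and 3 always change together, so their mutual information equals log 2, while columns 0 and 2 vary independently and score zero. Pairwise measures like this one cannot separate direct from indirect couplings, which is one motivation for fitting a joint graphical model instead.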