Volume 79, Issue 4, pp. 1061-1078
Research Article

Learning generative models for protein fold families

Sivaraman Balakrishnan (Corresponding Author)

Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania

Sivaraman Balakrishnan and Hetunandan Kamisetty contributed equally to this work.

Correspondence: 5000 Forbes Ave., Pittsburgh, PA 15213
Hetunandan Kamisetty (Corresponding Author)

Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania

Correspondence: 5000 Forbes Ave., Pittsburgh, PA 15213
Jaime G. Carbonell

Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania

Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania

Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania
Su-In Lee

Department of Computer Science & Engineering, University of Washington, Seattle, Washington

Department of Genome Sciences, University of Washington, Seattle, Washington
Christopher James Langmead (Corresponding Author)

Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania

Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania

Correspondence: 5000 Forbes Ave., Pittsburgh, PA 15213
First published: 11 November 2010
Citations: 231

Abstract

We introduce a new approach to learning statistical models from multiple sequence alignments (MSA) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Existing techniques for learning graphical models from MSA either make strong, and often inappropriate, assumptions about the conditional independencies within the MSA (e.g., Hidden Markov Models), or else use suboptimal algorithms to learn the parameters of the model. In contrast, GREMLIN makes no a priori assumptions about the conditional independencies within the MSA. We formulate and solve a convex optimization problem, thus guaranteeing that we find a globally optimal model at convergence. The resulting model is also generative, allowing for the design of new protein sequences that have the same statistical properties as those in the MSA. We perform a detailed analysis of covariation statistics on the extensively studied WW and PDZ domains and show that our method outperforms an existing algorithm for learning undirected probabilistic graphical models from MSA. We then apply our approach to 71 additional families from the PFAM database and demonstrate that the resulting models significantly outperform Hidden Markov Models in terms of predictive accuracy. Proteins 2011; © 2011 Wiley-Liss, Inc.
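To make the kind of model the abstract describes concrete, the following is a minimal, illustrative sketch of fitting an undirected pairwise graphical model (per-column fields plus residue-pair couplings) to an integer-encoded MSA by regularized pseudo-likelihood. This is not the paper's implementation: GREMLIN uses block-regularization and structure learning over the full amino acid alphabet, whereas this toy uses a plain L2 penalty, gradient descent, and a two-letter alphabet; all names and hyperparameters here are illustrative.

```python
import numpy as np

def fit_mrf_pseudolikelihood(msa, q, lam=0.01, lr=0.1, epochs=300):
    """Fit fields v[i, a] and couplings w[i, j, a, b] of a pairwise MRF
    to an (N, L) integer MSA with alphabet size q, by minimizing the
    L2-regularized negative pseudo-log-likelihood with gradient descent."""
    N, L = msa.shape
    v = np.zeros((L, q))
    w = np.zeros((L, L, q, q))   # symmetric: w[i, j, a, b] == w[j, i, b, a]
    onehot = np.eye(q)[msa]      # (N, L, q)
    idx = np.arange(L)
    for _ in range(epochs):
        # Conditional logits for column i of sequence n:
        # logits[n, i, a] = v[i, a] + sum_j w[i, j, a, msa[n, j]]
        logits = v[None] + np.einsum('ijab,njb->nia', w, onehot)
        logits -= logits.max(axis=2, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=2, keepdims=True)        # softmax over the alphabet
        diff = p - onehot                        # d(-log PL)/d(logits)
        gv = diff.sum(axis=0) / N + 2 * lam * v
        gw = np.einsum('nia,njb->ijab', diff, onehot) / N + 2 * lam * w
        gw = (gw + gw.transpose(1, 0, 3, 2)) / 2  # keep couplings symmetric
        gw[idx, idx] = 0.0                        # no self-couplings
        v -= lr * gv
        w -= lr * gw
    return v, w

# Toy MSA: columns 0 and 1 co-vary perfectly ("correlated mutation"),
# column 2 is independent noise.
rng = np.random.default_rng(0)
col0 = rng.integers(0, 2, size=200)
col2 = rng.integers(0, 2, size=200)
msa = np.stack([col0, col0, col2], axis=1)
v, w = fit_mrf_pseudolikelihood(msa, q=2)
```

On this toy alignment the learned coupling block between the two correlated columns, `w[0, 1]`, acquires a much larger norm than the couplings to the independent column, which is the signal the paper exploits for contact and covariation analysis. Because the objective is convex in (v, w), gradient descent here converges toward the same global optimum the paper's solver would find for this penalty.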
