Volume 90, Issue 1 pp. 45-57
RESEARCH ARTICLE

Predicting mutant outcome by combining deep mutational scanning and machine learning

Hagit Sarfati

Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel

Si Naftaly

Avram and Stella Goldstein-Goren Department of Biotechnology Engineering and the National Institute of Biotechnology in the Negev, Ben-Gurion University of the Negev, Be'er Sheva, Israel

Niv Papo

Avram and Stella Goldstein-Goren Department of Biotechnology Engineering and the National Institute of Biotechnology in the Negev, Ben-Gurion University of the Negev, Be'er Sheva, Israel

Chen Keasar

Corresponding Author

Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel

Correspondence

Chen Keasar, Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel.

Email: [email protected]

First published: 22 July 2021

Funding information: H2020 European Research Council, Grant/Award Number: 336041; European Research Council, Grant/Award Number: 336041; Israel Science Foundation, Grant/Award Numbers: 1122/14, 615/14

Abstract

Deep mutational scanning provides an unprecedented wealth of quantitative data on the functional outcome of mutations in proteins. A single experiment may measure properties (e.g., structural stability) of numerous protein variants. Leveraging these experimental data to gain insights into unexplored regions of the mutational landscape is a major computational challenge. Such insights may facilitate further experimental work and accelerate the development of novel protein variants with beneficial therapeutic or industrially relevant properties. Here we present a novel machine learning approach for predicting the functional outcome of mutations in the context of deep mutational screens. Using sequence (one-hot) features of variants with known properties, as well as structural features derived from models of these variants, we train predictive statistical models to estimate the unknown properties of other variants. The utility of the new computational scheme is demonstrated using five sets of mutational scanning data, denoted “targets”: (a) protease specificity of APPI (amyloid precursor protein inhibitor) variants; (b-d) three stability-related properties of IGBPG (immunoglobulin G-binding β1 domain of streptococcal protein G) variants; and (e) fluorescence of GFP (green fluorescent protein) variants. Performance is measured by the overall correlation between the predicted and observed properties, and by enrichment, that is, the ability to predict the most potent variants and thus presumably guide further experiments. Despite the diversity of the targets, the statistical models generalize from the example variants of each target and predict the properties of test variants with both single and multiple mutations.
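
As a rough illustration of two ingredients named in the abstract, the sketch below shows a one-hot encoding of variant sequences and the two kinds of performance measure (correlation and enrichment). It is not the authors' implementation. All function names are hypothetical, the choice of Pearson correlation is an assumption, and the enrichment measure shown (overlap between the top-k predicted and top-k observed variants) is one plausible reading, since the abstract does not define it. The structural features and the predictive model itself are omitted; any regression model could be trained on the encoded vectors.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(sequence):
    """Encode a protein sequence as a flat 0/1 vector, 20 entries per position."""
    vector = []
    for residue in sequence:
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unexpected residue: {residue!r}")
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


def pearson(predicted, observed):
    """Pearson correlation between predicted and observed property values."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / sqrt(var_p * var_o)


def top_k_enrichment(predicted, observed, k):
    """Fraction of the k best observed variants found among the k best predicted.

    Assumed definition, for illustration only.
    """
    def top(values):
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        return set(order[:k])

    return len(top(predicted) & top(observed)) / k


if __name__ == "__main__":
    # Made-up property values for four hypothetical test variants.
    observed = [0.1, 0.9, 0.4, 0.7]
    predicted = [0.2, 0.8, 0.5, 0.3]
    print(len(one_hot("ACD")))  # 3 positions x 20 residues
    print(pearson(predicted, observed))
    print(top_k_enrichment(predicted, observed, k=2))
```

In this toy setting, a correlation near 1 means the predictions track the measurements across all variants, whereas the enrichment score only asks whether the best variants are ranked near the top, which is the more relevant question when choosing candidates for a follow-up experiment.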

DATA AVAILABILITY STATEMENT

Data sharing is not applicable to this article as no new data were created or analyzed in this study.
