Volume 32, Issue 11 e4780

RESEARCH NOTE

Open Access

Zero-shot mutation effect prediction on protein stability and function using RoseTTAFold

Sanaa Mansoor

orcid.org/0009-0004-5992-060X

Department of Biochemistry, University of Washington, Seattle, Washington, WA, USA

Institute for Protein Design, University of Washington, Seattle, Washington, WA, USA

Molecular Engineering Graduate Program, University of Washington, Seattle, Washington, WA, USA

Minkyung Baek

Department of Biochemistry, University of Washington, Seattle, Washington, WA, USA

Institute for Protein Design, University of Washington, Seattle, Washington, WA, USA

School of Biological Sciences, Seoul National University, Seoul, Republic of Korea

David Juergens

Department of Biochemistry, University of Washington, Seattle, Washington, WA, USA

Institute for Protein Design, University of Washington, Seattle, Washington, WA, USA

Molecular Engineering Graduate Program, University of Washington, Seattle, Washington, WA, USA

Joseph L. Watson

Department of Biochemistry, University of Washington, Seattle, Washington, WA, USA

Institute for Protein Design, University of Washington, Seattle, Washington, WA, USA

David Baker

Corresponding Author

orcid.org/0000-0001-7896-6217

Department of Biochemistry, University of Washington, Seattle, Washington, WA, USA

Institute for Protein Design, University of Washington, Seattle, Washington, WA, USA

Howard Hughes Medical Institute, University of Washington, Seattle, Washington, WA, USA

Correspondence

David Baker, Department of Biochemistry, University of Washington, 3963 Stevens Way NE, Seattle, WA 98105 USA.

Email: [email protected]

First published: 11 September 2023

https://doi.org/10.1002/pro.4780

Review Editor: Nir Ben-Tal

Abstract

Predicting the effects of mutations on protein function and stability is an outstanding challenge. Here, we assess the performance of RF_joint, a variant of RoseTTAFold jointly trained for sequence and structure recovery, on mutation effect prediction. Without any further training, RF_joint predicts mutation effects for a diverse set of protein families with accuracy comparable to that of both another zero-shot model (MSA Transformer) and a model that requires specific training on a particular protein family (DeepSequence). Thus, although the architecture of RF_joint was developed to address the protein design problem of scaffolding functional motifs, RF_joint acquired during training an understanding of the mutational landscapes of proteins that is equivalent to that of recently developed large protein language models. The ability to reason simultaneously over protein structure and sequence could enable even more precise mutation effect predictions following supervised training on the task. These results suggest that RF_joint has a broad understanding of protein sequence-structure landscapes and can be viewed as a joint model of protein sequence and structure that could be broadly useful for protein modeling.

1 INTRODUCTION

Accurate prediction of single-point mutation effects using sequence information alone would help relate observed sequence polymorphisms to human disease (Hopf et al., 2017; Shin et al., 2021) and contribute to the design of proteins with higher functional activities. Deep learning methods have recently shown considerable promise for mutation effect prediction. DeepSequence (Riesselman et al., 2018), a probabilistic model for sequence families, obtained high accuracy in mutation effect prediction using latent variables for capturing higher-order interactions between residues in proteins through training on multiple sequence alignments (MSAs) for the target protein of interest. Large protein language models trained on MSAs (MSA Transformer) (Rao et al., 2021) or single sequences (Meier et al., 2021) also perform well at mutation effect prediction using an unsupervised or zero-shot approach. These language models have the advantage over DeepSequence of not requiring specific training on the protein family of interest.

RoseTTAFold was originally developed for protein structure prediction (Baek et al., 2021) and more recently RoseTTAFold Joint (RF_joint) was further trained to solve protein “inpainting” problems (Wang et al., 2022). During the inpainting process using the specifically trained RoseTTAFold network, a pass through the network starts from the functional site and fills in missing sequence and structure, resulting in the creation of a complete and viable protein scaffold. Included in RF_joint training was a masked MSA token recovery task for sequence prediction: predicting the correct amino acid sequence at specific masked positions within the alignment.

To assess RF_joint's understanding of protein mutational landscapes, we investigated whether it could predict experimental mutational data from published deep mutational scanning (DMS) sets (Starita & Fields, 2015) with no further training (i.e., using a “zero-shot” approach). We compared the performance of RF_joint on this task to that of MSA Transformer and DeepSequence. All three are MSA-based methods; RF_joint and MSA Transformer require no further training, while DeepSequence is trained on data from the family of interest. Although RF_joint was not developed specifically for this task, we found that its performance in predicting the effects of single mutations on a set of diverse proteins was slightly better than that of MSA Transformer and comparable to that of the specifically trained DeepSequence.

2 RESULTS

RF_joint was evaluated on a set of 38 deep mutational scans curated by Riesselman et al. (2018). (The original dataset consisted of 42 scans; we excluded the tRNA (TRNA_YEAST), the toxin–antitoxin complex (PARE_PARD), HIS7_YEAST_Kondrashov2017, and the PABP-doubles datasets to focus on single mutations made to monomeric proteins.) Each mutational scan measured a different protein function using a different experimental readout. Because only 2 of the 38 DMS datasets pertain specifically to stability, the evidence for stability change prediction is weaker than that for functional effect prediction. Each dataset was treated as a separate prediction task, and each variant was scored individually. For each target protein, we generated an MSA using iterative sequence search against the UniClust30 database as described in Baek et al. (2021) and used it for both RF_joint and MSA Transformer predictions. For RF_joint, variants were scored by masking the mutation site in the query sequence of the MSA and using the MSA token recovery head to predict the amino acid distribution at the masked position. The predicted effect of the mutation was calculated as the log odds ratio of the mutant amino acid and the wild-type amino acid (Figure 1). Performance on each dataset was assessed by the Spearman correlation between the predictions and the observed experimental values. For DeepSequence, we compared the results of MSA Transformer and RF_joint to the published Spearman rho values (Riesselman et al., 2018), which come from an ensemble of models trained, for each target protein, on a different set of MSAs than those used for MSA Transformer or RF_joint.

FIGURE 1

Zero-shot prediction of mutation effect using RoseTTAFold Joint. The only required input is an MSA, which is masked at the mutation position in the query sequence and fed into RF_joint. Protein structure models may optionally be provided as input as well (this was done for the calculations in Figure 2).

We found that RF_joint predicts mutational effects considerably better than a baseline calculated as the log odds ratio of the frequency of the mutant amino acid and of the wild-type amino acid in the MSA (Figure 2). RF_joint also slightly outperformed MSA Transformer and is comparable to the protein family-specific DeepSequence (Figure 2). RF_joint has the advantage in principle over the purely sequence-based models of also being able to utilize structural template information, but we did not observe a significant improvement with incorporation of template structure information (Supplementary Figure S1; this may be in part because RoseTTAFold generates 3D models from MSA with reasonable accuracy). We also found little dependency of prediction accuracy on MSA depth (Supplementary Figure S2).

3 DISCUSSION

We find that the RoseTTAFold network, developed originally for structure prediction and then extended to protein design, is also able to predict the effects of single mutations with high accuracy. DeepSequence has a slightly higher average Spearman rho than RF_joint but requires training for each protein family individually. Just as large protein language models, like MSA Transformer, provide general models of protein sequence, RoseTTAFold Joint may be viewed as a general joint model of protein sequence and structure. With further directed training, it should be possible to improve mutation effect prediction performance by making better use of protein structural information, which can be readily input into RoseTTAFold Joint but not into pure sequence-based models, and by fine-tuning specifically on the mutant prediction task. The ability of RoseTTAFold to function as a joint model of protein sequence and structure, incorporating any available protein sequence and structure information, could prove useful in applications beyond protein design and mutation effect prediction.

4 MATERIALS AND METHODS

4.1 Deep mutational scanning datasets

RF_joint was evaluated on a subset of 38 deep mutational scans collected by Riesselman et al. (2018). The proteins evaluated perform a wide range of functions, and the experimental measurements differ for each protein. We treat each DMS dataset as a separate prediction task. Performance on each task is evaluated by the Spearman rho correlation of the calculated (baseline), published (DeepSequence), or predicted (RF_joint and MSA Transformer) scores to the experimental values.
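The per-dataset evaluation described above can be sketched as follows. This is an illustrative, dependency-free implementation of the Spearman rank correlation (the Pearson correlation of average ranks), not the authors' evaluation code; the function names and the `datasets` dictionary layout are assumptions made for this example.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(predicted, observed):
    """Spearman rho: Pearson correlation computed on the rank vectors."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(observed)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def evaluate_datasets(datasets):
    """Score each DMS dataset as its own task.

    `datasets` maps a dataset name to a (predicted, observed) pair of lists
    (a hypothetical layout chosen for this sketch).
    """
    return {name: spearman_rho(pred, obs) for name, (pred, obs) in datasets.items()}
```

Because only the rank order matters, scores from different methods can be compared on the same dataset without rescaling them to the units of the experimental assay.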

4.2 MSA generation

The same MSA inputs are used for both RoseTTAFold Joint and MSA Transformer at inference time. The protocol for generating MSAs is adopted from RoseTTAFold (Baek et al., 2021): for each protein, sequences are found by iterative search against UniRef30 (Mirdita et al., 2017) and BFD (Steinegger, Mirdita, & Söding, 2019) using HHblits (Steinegger, Meier, et al., 2019). Sequences are then filtered at a 90% sequence identity cutoff. The E-value cutoff for the sequence search is gradually relaxed (from 1e-10 to 1e-3) until the generated MSA has at least 2000 sequences with 75% coverage or 5000 sequences with 50% coverage. For proteins for which fewer than 5000 sequences were found (at an E-value of 1e-3 and a 50% sequence coverage cutoff), all sequences that the protocol found were used as the input MSA.
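The stopping rule of this protocol can be sketched as a small loop. This is only an outline under stated assumptions: `search` is a hypothetical callable standing in for the HHblits search plus the 90% identity filter, and the intermediate E-value (1e-6) is an assumption, since the text gives only the two endpoints of the schedule.

```python
MIN_DEPTH_COV75 = 2000  # required depth at 75% coverage (from the text)
MIN_DEPTH_COV50 = 5000  # required depth at 50% coverage (from the text)


def build_msa(search, evalues=(1e-10, 1e-6, 1e-3)):
    """Relax the E-value cutoff until the MSA is deep enough.

    `search(evalue, min_coverage)` is a hypothetical stand-in that returns the
    list of aligned sequences found at that E-value, already filtered at 90%
    sequence identity and at the given coverage. The middle value of `evalues`
    is an assumed intermediate step.
    """
    fallback = []
    for evalue in evalues:
        hits_cov75 = search(evalue, 0.75)
        if len(hits_cov75) >= MIN_DEPTH_COV75:
            return hits_cov75
        hits_cov50 = search(evalue, 0.50)
        if len(hits_cov50) >= MIN_DEPTH_COV50:
            return hits_cov50
        fallback = hits_cov50
    # Neither depth criterion was met: use whatever the most permissive
    # setting (last E-value, 50% coverage) found.
    return fallback
```

Passing the search step in as a function keeps the stopping logic testable without running a real homology search.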

4.3 Non-ML baseline setup

To establish the non-ML baseline, we used the input MSA for each protein and calculated the log odds ratio of the frequency of the wild-type amino acid to that of the mutant amino acid at each position (Equation (1)). All sequences of the input MSA were used in this calculation.

y_{\mathrm{baseline}, i} = \log(\mathrm{freq}_{\mathrm{wt}, i}) - \log(\mathrm{freq}_{\mathrm{mt}, i}) \qquad (1)
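Equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name, the pseudocount (added so that a mutant never observed in the column does not produce log(0)), and the choice to ignore gaps and non-standard characters are all assumptions that the text does not specify.

```python
from collections import Counter
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def baseline_score(msa, position, wt, mt, pseudocount=1.0):
    """Log odds of wild-type vs. mutant frequency at one MSA column.

    `msa` is a list of equal-length aligned sequences and `position` is
    0-based. Follows Equation (1): log(freq_wt) - log(freq_mt).
    The pseudocount and the handling of gaps are assumptions of this sketch.
    """
    column = [seq[position] for seq in msa]
    # Count only the 20 standard amino acids; gaps and other symbols are skipped.
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    freq_wt = (counts[wt] + pseudocount) / total
    freq_mt = (counts[mt] + pseudocount) / total
    return log(freq_wt) - log(freq_mt)
```

With the sign convention of Equation (1), a positive score means the wild-type residue is more frequent in the alignment than the mutant residue.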

4.4 RF_joint inference setup

We used the published RF_joint model (Wang et al., 2022) in inference mode for the task of single mutation effect prediction. All weights of the model were frozen and no further training was done. As described in the RF_joint paper (Wang et al., 2022), we split the input MSA into two groups, a small seed MSA and an extra MSA, to reduce the memory cost of the all-sequence-to-all-sequence attention map calculation in the original RoseTTAFold. Up to 256 sequences from the input MSA of a target protein were used as the seed MSA (the input for RF_joint's main three-track blocks), and an additional 1024 extra sequences (the input for RF_joint's ExtraMSAStack) were passed into the model. All default parameters from RF_joint were used and the number of recycles was set to 1. Structures predicted by RoseTTAFold (Baek et al., 2021) for each target protein were used as structural templates for mutation effect prediction. The mutation site of interest was masked in the query sequence of the input MSA, and the masked MSA token recovery head was used to predict the probability of each of the 20 amino acids at that masked position. The predicted effect of a mutation at position i was calculated as the log odds ratio of the probability of the wild-type amino acid to that of the mutant amino acid (Equation (2)). This scoring is zero-shot, that is, the model requires no further training.

y_{\mathrm{RFjoint}, i} = \log(p_{\mathrm{wt}, i}) - \log(p_{\mathrm{mt}, i}) \qquad (2)
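The masking and scoring steps behind Equation (2) can be sketched as below. This is not the RF_joint inference code: `predict` is a hypothetical callable standing in for a forward pass through the network's masked MSA token recovery head, the `?` mask symbol is a placeholder, and the convention that the query is the first row of the MSA is an assumption of this example.

```python
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "?"  # placeholder mask symbol (hypothetical; the real token differs)


def mask_query(msa, position):
    """Return a copy of the MSA with the query masked at `position`.

    Assumes the query sequence is the first row; `position` is 0-based.
    """
    query = msa[0]
    masked_query = query[:position] + MASK + query[position + 1:]
    return [masked_query] + list(msa[1:])


def score_variant(msa, position, mt, predict):
    """Zero-shot score following Equation (2): log(p_wt) - log(p_mt).

    `predict(masked_msa, position)` is a hypothetical stand-in for the model;
    it must return a dict mapping each of the 20 amino acids to its predicted
    probability at the masked position.
    """
    wt = msa[0][position]
    probs = predict(mask_query(msa, position), position)
    return log(probs[wt]) - log(probs[mt])
```

Under the sign convention of Equation (2) as written, a positive score means the model assigns more probability to the wild-type residue than to the mutant.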

4.5 MSA Transformer inference setup

We used the published MSA Transformer (Meier et al., 2021; Rao et al., 2021) loaded with pre-trained weights (annotated as esm_msa1b_t12_100M_UR50S on the public ESM GitHub repository). The default arguments were used, where 400 sequences were randomly sampled from the MSA for inference. We used the masked marginals strategy for scoring mutants with MSA Transformer, which introduces masks at the mutated positions and scores a mutation by its probability relative to that of the wild-type amino acid (Meier et al., 2021). This is similar to the setup that we used for predicting the effect of a mutation with RF_joint (Equation (2)).
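The random subsampling step mentioned above can be sketched as follows. This is an illustration only and does not reproduce the ESM scripts: the function name, the fixed seed, and the choice to always keep the query (assumed to be the first row) and count it toward the 400 sequences are assumptions of this example.

```python
import random


def subsample_msa(msa, num_seqs=400, seed=0):
    """Randomly keep at most `num_seqs` sequences, always including the query.

    Assumes the query is the first row of `msa`. The remaining rows are drawn
    uniformly without replacement and returned in their original order.
    """
    if len(msa) <= num_seqs:
        return list(msa)
    rng = random.Random(seed)  # fixed seed so that repeated runs agree
    picked = sorted(rng.sample(range(1, len(msa)), num_seqs - 1))
    return [msa[0]] + [msa[k] for k in picked]
```

Keeping the query in the sample matters because the mask is placed in the query sequence; without it there would be no position to score.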

AUTHOR CONTRIBUTIONS

Sanaa Mansoor: Conceptualization; Investigation; Methodology; Validation; Visualization; Data curation; Writing—original draft; Writing—review & editing; Formal analysis. Minkyung Baek: Conceptualization; Methodology; Writing—review & editing; Writing—original draft; Formal analysis. David Juergens: Methodology; Writing—original draft. Joseph L. Watson: Methodology; Writing—original draft. David Baker: Conceptualization; Methodology; Validation; Supervision; Writing—original draft.

ACKNOWLEDGMENTS

We would like to thank Justas Dauparas, Ivan Anishchenko, Doug Tischer, Hahnbeom Park, Sergey Ovchinnikov, and Eric Horvitz for helpful comments and suggestions.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflict of interest.

Open Research

DATA AVAILABILITY STATEMENT

Inference code for predicting the effect of single mutations on protein function or stability through this pipeline is available here: https://github.com/RosettaCommons/RFDesign/tree/main/inpainting. All input data (target MSAs, structural templates), and experimental and predicted values of all methods compared are available on Zenodo at this link: https://doi.org/10.5281/zenodo.8106250.

REFERENCES

Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021; 373(6557): 871–876. https://doi.org/10.1126/science.abj8754
Hopf TA, Ingraham JB, Poelwijk FJ, Schärfe CPI, Springer M, Sander C, et al. Mutation effects predicted from sequence co-variation. Nat Biotechnol. 2017; 35(2): 128–135. https://doi.org/10.1038/nbt.3769
Meier J, Rao R, Verkuil R, Liu J, Sercu T, Rives A. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv. 2021; 2021.07.09.450648. https://doi.org/10.1101/2021.07.09.450648
Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017; 45(D1): D170–D176. https://doi.org/10.1093/nar/gkw1081
Rao R, Liu J, Verkuil R, Meier J, Canny JF, Abbeel P, et al. MSA transformer. bioRxiv. 2021; 2021.02.12.430858. https://doi.org/10.1101/2021.02.12.430858
Riesselman AJ, Ingraham JB, Marks DS. Deep generative models of genetic variation capture the effects of mutations. Nat Methods. 2018; 15(10): 816–822. https://doi.org/10.1038/s41592-018-0138-4
Shin JE, Riesselman AJ, Kollasch AW, McMahon C, Simon E, Sander C, et al. Protein design and variant prediction using autoregressive generative models. Nat Commun. 2021; 12(1): 2403. https://doi.org/10.1038/s41467-021-22732-w
Starita LM, Fields S. Deep mutational scanning: a highly parallel method to measure the effects of mutation on protein function. Cold Spring Harb Protoc. 2015; 2015(8): 711–714. https://doi.org/10.1101/pdb.top077503
Steinegger M, Meier M, Mirdita M, Vöhringer H, Haunsberger SJ, Söding J. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics. 2019; 20(1): 473. https://doi.org/10.1186/s12859-019-3019-7
Steinegger M, Mirdita M, Söding J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat Methods. 2019; 16(7): 603–606. https://doi.org/10.1038/s41592-019-0437-4
Wang J, Lisanza S, Juergens D, Tischer D, Watson JL, Castro KM, et al. Scaffolding protein functional sites using deep learning. Science. 2022; 377(6604): 387–394. https://doi.org/10.1126/science.abn2100

Protein Science, Volume 32, Issue 11, November 2023, e4780
