Volume 32, Issue 11 e4780
RESEARCH NOTE
Open Access

Zero-shot mutation effect prediction on protein stability and function using RoseTTAFold

Sanaa Mansoor

Department of Biochemistry, University of Washington, Seattle, Washington, WA, USA

Institute for Protein Design, University of Washington, Seattle, Washington, WA, USA

Molecular Engineering Graduate Program, University of Washington, Seattle, Washington, WA, USA

Minkyung Baek

Department of Biochemistry, University of Washington, Seattle, Washington, WA, USA

Institute for Protein Design, University of Washington, Seattle, Washington, WA, USA

School of Biological Sciences, Seoul National University, Seoul, Republic of Korea

David Juergens

Department of Biochemistry, University of Washington, Seattle, Washington, WA, USA

Institute for Protein Design, University of Washington, Seattle, Washington, WA, USA

Molecular Engineering Graduate Program, University of Washington, Seattle, Washington, WA, USA

Joseph L. Watson

Department of Biochemistry, University of Washington, Seattle, Washington, WA, USA

Institute for Protein Design, University of Washington, Seattle, Washington, WA, USA

David Baker

Corresponding Author

Department of Biochemistry, University of Washington, Seattle, Washington, WA, USA

Institute for Protein Design, University of Washington, Seattle, Washington, WA, USA

Howard Hughes Medical Institute, University of Washington, Seattle, Washington, WA, USA

Correspondence

David Baker, Department of Biochemistry, University of Washington, 3963 Stevens Way NE, Seattle, WA 98105 USA.

Email: [email protected]

First published: 11 September 2023
Citations: 5

Review Editor: Nir Ben-Tal

Abstract

Predicting the effects of mutations on protein function and stability is an outstanding challenge. Here, we assess the performance of a variant of RoseTTAFold jointly trained for sequence and structure recovery, RFjoint, for mutation effect prediction. Without any further training, RFjoint predicts mutation effects for a diverse set of protein families with accuracy comparable to both another zero-shot model (MSA Transformer) and a model that requires specific training on a particular protein family for mutation effect prediction (DeepSequence). Thus, although the architecture of RFjoint was developed to address the protein design problem of scaffolding functional motifs, RFjoint acquired an understanding of the mutational landscapes of proteins during model training that is equivalent to that of recently developed large protein language models. The ability to simultaneously reason over protein structure and sequence could enable even more precise mutation effect predictions following supervised training on the task. These results suggest that RFjoint has a broad understanding of protein sequence-structure landscapes and can be viewed as a joint model for protein sequence and structure that could be broadly useful for protein modeling.

1 INTRODUCTION

Accurate prediction of single-point mutation effects using sequence information alone would help relate observed sequence polymorphisms to human disease (Hopf et al., 2017; Shin et al., 2021) and contribute to the design of proteins with higher functional activities. Deep learning methods have recently shown considerable promise for mutation effect prediction. DeepSequence (Riesselman et al., 2018), a probabilistic model for sequence families, obtained high accuracy in mutation effect prediction using latent variables for capturing higher-order interactions between residues in proteins through training on multiple sequence alignments (MSAs) for the target protein of interest. Large protein language models trained on MSAs (MSA Transformer) (Rao et al., 2021) or single sequences (Meier et al., 2021) also perform well at mutation effect prediction using an unsupervised or zero-shot approach. These language models have the advantage over DeepSequence of not requiring specific training on the protein family of interest.

RoseTTAFold was originally developed for protein structure prediction (Baek et al., 2021) and more recently RoseTTAFold Joint (RFjoint) was further trained to solve protein “inpainting” problems (Wang et al., 2022). During the inpainting process using the specifically trained RoseTTAFold network, a pass through the network starts from the functional site and fills in missing sequence and structure, resulting in the creation of a complete and viable protein scaffold. Included in RFjoint training was a masked MSA token recovery task for sequence prediction: predicting the correct amino acid sequence at specific masked positions within the alignment.

To assess RFjoint's understanding of protein mutational landscapes, we set out to investigate whether it could predict experimental mutational data from published deep mutational scanning (DMS) sets (Starita & Fields, 2015) with no further training (i.e., using a “zero-shot” approach). We compared the performance of RFjoint on this task to that of MSA Transformer and DeepSequence. All three are MSA-based methods; RFjoint and MSA Transformer require no further training, while DeepSequence is trained on data from the family of interest. Although RFjoint was not developed specifically for this task, we found that its performance in predicting the effects of single mutations on a set of diverse proteins was slightly better than that of MSA Transformer and comparable to that of the specifically trained DeepSequence.

2 RESULTS

RFjoint was evaluated on a set of 38 deep mutational scans curated by Riesselman et al. (2018). (The original dataset consisted of 42 scans; we excluded the tRNA (TRNA_YEAST), the toxin–antitoxin complex (PARE_PARD), HIS7_YEAST_Kondrashov2017, and the PABP-doubles datasets to focus on single mutations made to monomeric proteins.) Each of the mutational scans measured a different protein function using a different experimental readout. Because only 2 of the 38 DMS datasets pertain specifically to stability, the evidence for stability change prediction is weaker than that for functional effect prediction. Each dataset was treated as a separate prediction task, and each variant was scored individually. For each target protein, we generated an MSA using iterative sequence search against the UniClust30 database as described in Baek et al. (2021) and used it for both RFjoint and MSA Transformer predictions. For RFjoint, each variant was scored by masking out the mutation site in the query sequence of the MSA and using the MSA token recovery head to predict the amino acid distribution at the masked position. The predicted effect of the mutation was calculated as the log odds ratio of the mutant amino acid and the wild-type amino acid (Figure 1). The performance on each dataset was assessed based on the Spearman correlation of the predictions with the observed experimental values. For DeepSequence, we compared the results of MSA Transformer and RFjoint to the published Spearman rho values (Riesselman et al., 2018), which are from an ensemble of models trained on a different set of MSAs than those used for MSA Transformer or RFjoint for each target protein.

Figure 1. Zero-shot prediction of mutation effect using RoseTTAFold Joint. The only required input is an MSA, which is masked at the mutation position in the query sequence and fed into RFjoint. Protein structure models may optionally be provided as additional input (this was done for the calculations in Figure 2).

We found that RFjoint predicts mutational effects considerably better than a baseline calculated as the log odds ratio of the frequency of the mutant amino acid and of the wild-type amino acid in the MSA (average Spearman rho of 0.497 versus 0.426; Figure 2). RFjoint also outperformed MSA Transformer (0.430) and is comparable to the protein family-specific DeepSequence (0.502) (Figure 2). RFjoint has the advantage in principle over the purely sequence-based models of also being able to utilize structural template information, but we did not observe a significant improvement with incorporation of template structure information (Supplementary Figure S1; this may be in part because RoseTTAFold generates 3D models from MSAs with reasonable accuracy). We also found little dependence of prediction accuracy on MSA depth (Supplementary Figure S2).

Figure 2. Boxplots of Spearman rho correlations on deep mutational scanning datasets. Baseline refers to the non-ML MSA baseline. RFjoint refers to the model trained on a joint sequence and structure recovery task (Wang et al., 2022). Box plots show the median (center line), interquartile range (hinges), and 1.5 times the interquartile range (whiskers); outliers are plotted as individual points. Asterisks indicate significant differences (p < 0.05): Baseline versus DeepSequence and Baseline versus RFjoint. The average Spearman rho correlation is 0.426 for the baseline, 0.502 for DeepSequence, 0.430 for MSA Transformer, and 0.497 for RFjoint.

3 DISCUSSION

We find that the RoseTTAFold network, developed originally for structure prediction and then extended to protein design, is also able to predict the effects of single mutations with quite high accuracy. DeepSequence has a slightly higher average Spearman rho correlation than RFjoint but requires training for each protein family individually. Just as large protein language models such as MSA Transformer provide general models of protein sequence, RoseTTAFold Joint may be viewed as a general joint model of protein sequence and structure. With further directed training, it should be possible to improve mutation effect prediction performance by better utilizing protein structural information, which can be readily input into RoseTTAFold Joint but not into purely sequence-based models, and by fine-tuning specifically on the mutant prediction task. The ability of RoseTTAFold to function as a joint model of protein sequence and structure, incorporating any available protein sequence and structure information, could prove useful in applications beyond protein design and mutation effect prediction.

4 MATERIALS AND METHODS

4.1 Deep mutational scanning datasets

RFjoint was evaluated on a subset of 38 of the deep mutational scans collected by Riesselman et al. (2018). The proteins evaluated perform a wide range of functions, and the experimental measurements differ for each protein. We treat each DMS dataset as a separate prediction task. Performance on each task is evaluated by the Spearman rho correlation of the calculated (baseline) or predicted (RFjoint and MSA Transformer) scores with the experimental values; for DeepSequence, the published correlations are used.
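The per-dataset evaluation can be illustrated with a minimal sketch in plain Python. It is not the authors' evaluation code: the function names are hypothetical, and tied values are assumed to receive average ranks, which is the usual convention for Spearman correlation.

```python
import math
from typing import List, Mapping, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation computed on the ranks."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    return pearson(average_ranks(predicted), average_ranks(observed))


def mean_rho(datasets: Mapping[str, Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Average rho over datasets, each treated as a separate prediction task."""
    rhos = [spearman_rho(pred, obs) for pred, obs in datasets.values()]
    return sum(rhos) / len(rhos)
```

The sketch returns the raw signed rho. A wild-type over mutant score would be expected to correlate negatively with a fitness readout, and how signs were oriented before reporting is not stated here, so any sign flip is left to the caller.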

4.2 MSA generation

The same MSA inputs are used for both RoseTTAFold Joint and MSA Transformer at inference time. The protocol for generating MSAs is adopted from RoseTTAFold (Baek et al., 2021): for each protein, sequences are found by iterative search against UniRef30 (Mirdita et al., 2017) and BFD (Steinegger, Mirdita, & Söding, 2019) using HHblits (Steinegger, Meier, et al., 2019). Sequences are then filtered at a 90% sequence identity cutoff. The E-value cutoff for the sequence search is gradually relaxed (from 1e-10 to 1e-3) until the generated MSA has at least 2000 sequences with 75% coverage or 5000 sequences with 50% coverage. For proteins that fail to reach 5000 sequences (at an E-value of 1e-3 and a 50% sequence coverage cutoff), as many sequences as the protocol can find are used as the input MSA.
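The stopping rule of this protocol can be sketched as a small loop. This is an illustration under stated assumptions, not the RoseTTAFold pipeline itself: `search` is a hypothetical user-supplied callable that runs the actual HHblits search and identity filtering at a given E-value, the intermediate E-value step is a placeholder (only the endpoints 1e-10 and 1e-3 are given above), and coverage is assumed to mean the fraction of non-gap columns in an aligned sequence.

```python
from typing import Callable, List, Sequence


def coverage(aligned: str) -> float:
    """Fraction of aligned columns that are not gaps (assumed definition)."""
    if not aligned:
        raise ValueError("empty aligned sequence")
    return sum(ch != "-" for ch in aligned) / len(aligned)


def count_at_coverage(msa: Sequence[str], min_cov: float) -> int:
    """Number of sequences whose coverage is at least min_cov."""
    return sum(coverage(seq) >= min_cov for seq in msa)


def is_deep_enough(msa: Sequence[str]) -> bool:
    """At least 2000 sequences at 75% coverage or 5000 at 50% coverage."""
    return count_at_coverage(msa, 0.75) >= 2000 or count_at_coverage(msa, 0.50) >= 5000


def build_msa(
    search: Callable[[float], List[str]],
    # Endpoints follow the text; the intermediate 1e-6 step is a placeholder.
    e_values: Sequence[float] = (1e-10, 1e-6, 1e-3),
) -> List[str]:
    """Relax the E-value cutoff until the MSA is deep enough.

    If no cutoff yields a deep enough MSA, the result of the most
    permissive search is returned as-is.
    """
    msa: List[str] = []
    for e_value in e_values:
        msa = search(e_value)  # hypothetical wrapper around the real search
        if is_deep_enough(msa):
            break
    return msa
```

Passing the search as a callable keeps the depth criterion testable without any sequence database.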

4.3 Non-ML baseline setup

To establish the non-ML baseline, we used the input MSA for each protein and calculated the log odds ratio of the frequency of the wild-type amino acid to that of the mutant amino acid at each position (Equation (1)). All sequences of the input MSA were used in this calculation.
$$y_{\mathrm{baseline},i} = \log \mathrm{freq}_{\mathrm{wt},i} - \log \mathrm{freq}_{\mathrm{mt},i} \tag{1}$$
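A minimal sketch of this baseline follows, assuming Equation (1) is the difference of the two log frequencies, that is, log(freq_wt / freq_mt). Two further choices are assumptions that are not stated above: gaps and non-standard characters are ignored when counting, and a pseudocount is added so that amino acids absent from a column do not produce log(0). Positions are 0-based, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(
    msa: Sequence[str], position: int, pseudocount: float = 1.0
) -> Dict[str, float]:
    """Amino acid frequencies in one MSA column (0-based position).

    Gaps and non-standard characters are skipped; the pseudocount is an
    assumption added to keep every frequency strictly positive.
    """
    counts = Counter(seq[position] for seq in msa if seq[position] in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def baseline_score(
    msa: Sequence[str], position: int, wt: str, mt: str, pseudocount: float = 1.0
) -> float:
    """Wild-type over mutant log frequency ratio at one position."""
    freqs = column_frequencies(msa, position, pseudocount)
    return math.log(freqs[wt]) - math.log(freqs[mt])
```

For example, in a column holding two C, one G, and one gap, `baseline_score(msa, 1, "C", "G")` with the default pseudocount evaluates log(3/23) - log(2/23) = log(1.5).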

4.4 RFjoint inference setup

We used the published RFjoint model (Wang et al., 2022) in inference mode for the task of single mutation effect prediction. All weights of the model were frozen and no further training was done. As described in the RFjoint paper (Wang et al., 2022), we split the input MSA into two groups, a small seed MSA and an extra MSA, to reduce the memory cost of the all-sequence-to-all-sequence attention map calculation in the original RoseTTAFold. Up to 256 sequences from the input MSA of a target protein were used as the seed MSA (the input for RFjoint's main three-track blocks), and an additional 1024 sequences were passed into the model as the extra MSA (the input for RFjoint's ExtraMSAStack). All default parameters from RFjoint were used and the number of recycles was set to 1. Structures predicted by RoseTTAFold (Baek et al., 2021) for a target protein were used as structural templates for mutation effect prediction. The mutation site of interest was masked in the query sequence of the input MSA, and the masked MSA token recovery head was used to predict the probability of each of the 20 amino acids at that masked position. The predicted effect of a mutation at position i was calculated as the log odds ratio of the probability of the wild-type amino acid to the mutant amino acid (Equation (2)). This scoring is zero-shot, that is, the model requires no further training.
$$y_{\mathrm{RFjoint},i} = \log p_{\mathrm{wt},i} - \log p_{\mathrm{mt},i} \tag{2}$$
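The masking and scoring steps can be sketched independently of the model. The code below does not call RFjoint: `predict` is a hypothetical callable standing in for the masked MSA token recovery head, the mask symbol is a placeholder, and the seed and extra MSAs are assumed to be taken in order from the top of the alignment (the selection rule is not stated above). Equation (2) is assumed to be the difference of the two log probabilities.

```python
import math
from typing import Callable, List, Mapping, Sequence, Tuple

MASK = "?"  # placeholder mask symbol; the real token depends on the model

# Hypothetical signature: (seed_msa, extra_msa, position) -> {amino acid: probability}
Predictor = Callable[[List[str], List[str], int], Mapping[str, float]]


def mask_query(msa: Sequence[str], position: int) -> List[str]:
    """Mask one position (0-based) in the query, assumed to be the first row."""
    query = msa[0]
    masked = query[:position] + MASK + query[position + 1:]
    return [masked, *msa[1:]]


def split_seed_extra(
    msa: Sequence[str], n_seed: int = 256, n_extra: int = 1024
) -> Tuple[List[str], List[str]]:
    """Split into a seed MSA (up to 256) and an extra MSA (up to 1024).

    Taking sequences in their existing order is an assumption.
    """
    return list(msa[:n_seed]), list(msa[n_seed:n_seed + n_extra])


def log_odds(probs: Mapping[str, float], wt: str, mt: str) -> float:
    """Wild-type over mutant log probability ratio."""
    return math.log(probs[wt]) - math.log(probs[mt])


def score_variant(msa: Sequence[str], position: int, mt: str, predict: Predictor) -> float:
    """Mask the mutation site, query the model once, and score the mutant."""
    wt = msa[0][position]
    seed, extra = split_seed_extra(mask_query(msa, position))
    probs = predict(seed, extra, position)
    return log_odds(probs, wt, mt)
```

Because the score for a position depends only on the predicted distribution at the masked site, one forward pass per position is enough to score all 19 substitutions there.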

4.5 MSA Transformer inference setup

We used the published MSA Transformer (Meier et al., 2021; Rao et al., 2021) loaded with pre-trained weights (annotated as esm_msa1b_t12_100M_UR50S in the public ESM GitHub repository). The default arguments were used, with 400 sequences randomly sampled from the MSA for inference. We scored mutants with the masked marginals strategy, which introduces masks at the mutated positions and computes the score for a mutation from its probability relative to that of the wild-type amino acid (Meier et al., 2021). This is similar to the setup that we used for predicting the effect of a mutation with RFjoint (Equation (2)).
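The subsampling and masked marginals steps can be sketched without loading the model. Several points are assumptions: the query is taken to be the first row and to count toward the 400 sequences, the random seed is arbitrary, and the masked marginals score is assumed to take the form of a sum over mutated positions of log p(mutant) minus log p(wild type), with the per-position probabilities supplied by the caller rather than computed here.

```python
import math
import random
from typing import List, Mapping, Sequence, Tuple


def subsample_msa(msa: Sequence[str], n_total: int = 400, seed: int = 0) -> List[str]:
    """Keep the query (first row) and randomly sample the remaining rows.

    Sampling is without replacement; shallow MSAs are returned unchanged.
    """
    query, others = msa[0], list(msa[1:])
    if len(others) <= n_total - 1:
        return [query, *others]
    rng = random.Random(seed)
    return [query, *rng.sample(others, n_total - 1)]


def masked_marginal_score(
    probs_by_position: Mapping[int, Mapping[str, float]],
    mutations: Sequence[Tuple[str, int, str]],
) -> float:
    """Sum of log p(mt) - log p(wt) over the mutated (masked) positions.

    probs_by_position maps a position to the model's predicted amino acid
    probabilities at that position when it is masked; each mutation is a
    (wild type, position, mutant) tuple.
    """
    return sum(
        math.log(probs_by_position[pos][mt]) - math.log(probs_by_position[pos][wt])
        for wt, pos, mt in mutations
    )
```

Under this assumed form the score is oriented mutant over wild type, the opposite of a wild-type over mutant log ratio, so the two scores differ in sign for a single substitution.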

AUTHOR CONTRIBUTIONS

Sanaa Mansoor: Conceptualization; Investigation; Methodology; Validation; Visualization; Data curation; Writing—original draft; Writing—review & editing; Formal analysis. Minkyung Baek: Conceptualization; Methodology; Writing—review & editing; Writing—original draft; Formal analysis. David Juergens: Methodology; Writing—original draft. Joseph L. Watson: Methodology; Writing—original draft. David Baker: Conceptualization; Methodology; Validation; Supervision; Writing—original draft.

ACKNOWLEDGMENTS

We would like to thank Justas Dauparas, Ivan Anishchanka, Doug Tischer, Hahnbeom Park, Sergey Ovchinnikov, and Eric Horvitz for helpful comments and suggestions.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT

Inference code for predicting the effect of single mutations on protein function or stability through this pipeline is available here: https://github.com/RosettaCommons/RFDesign/tree/main/inpainting. All input data (target MSAs, structural templates), and experimental and predicted values of all methods compared are available on Zenodo at this link: https://doi.org/10.5281/zenodo.8106250.
