Volume 33, Issue 3 e4898
RESEARCH ARTICLE

Systematic enhancement of protein crystallization efficiency by bulk lysine-to-arginine (KR) substitution

Nooriel E. Banayan

Department of Biological Sciences, 702A Sherman Fairchild Center, MC2434, Columbia University, New York, New York, USA

Contribution: Software, Investigation, Formal analysis, Writing - original draft, Visualization, Writing - review & editing, Validation, Data curation, Methodology

Blaine J. Loughlin

Department of Biological Sciences, 702A Sherman Fairchild Center, MC2434, Columbia University, New York, New York, USA

Contribution: Investigation

Shikha Singh

Department of Biological Sciences, 702A Sherman Fairchild Center, MC2434, Columbia University, New York, New York, USA

Contribution: Methodology, Investigation

Farhad Forouhar

Department of Biological Sciences, 702A Sherman Fairchild Center, MC2434, Columbia University, New York, New York, USA

Contribution: Investigation

Guanqi Lu

Department of Biological Sciences, 702A Sherman Fairchild Center, MC2434, Columbia University, New York, New York, USA

Contribution: Software

Kam-Ho Wong

Department of Biological Sciences, 702A Sherman Fairchild Center, MC2434, Columbia University, New York, New York, USA

Contribution: Investigation

Matthew Neky

Department of Biological Sciences, 702A Sherman Fairchild Center, MC2434, Columbia University, New York, New York, USA

Contribution: Investigation

Henry S. Hunt

Department of Physics, Stanford University, Stanford, California, USA

Contribution: Software

Larry B. Bateman Jr

Accendero Software, Idaho Falls, Idaho, USA

Contribution: Software

Angel Tamez

Accendero Software, Idaho Falls, Idaho, USA

Contribution: Software

Samuel K. Handelman

Department of Biological Sciences, 702A Sherman Fairchild Center, MC2434, Columbia University, New York, New York, USA

Contribution: Methodology

W. Nicholson Price

Department of Biological Sciences, 702A Sherman Fairchild Center, MC2434, Columbia University, New York, New York, USA

Contribution: Methodology

John F. Hunt

Corresponding Author

Department of Biological Sciences, 702A Sherman Fairchild Center, MC2434, Columbia University, New York, New York, USA

Correspondence

John F. Hunt, Department of Biological Sciences, 702A Sherman Fairchild Center, MC2434, Columbia University, New York, NY 10027, USA.

Email: [email protected]

Contribution: Conceptualization, Software, Formal analysis, Funding acquisition, Project administration, Writing - original draft, Methodology, Data curation, Validation, Supervision, Visualization, Resources, Writing - review & editing

First published: 15 February 2024

Nooriel E. Banayan and Blaine J. Loughlin contributed equally to the work reported in this paper.

Review Editor: Jeanine Amacher

Abstract

Structural genomics consortia established that protein crystallization is the primary obstacle to structure determination using x-ray crystallography. We previously demonstrated that crystallization propensity is systematically related to primary sequence, and we subsequently performed computational analyses showing that arginine is the most overrepresented amino acid in crystal-packing interfaces in the Protein Data Bank. Given the similar physicochemical characteristics of arginine and lysine, we hypothesized that multiple lysine-to-arginine (KR) substitutions should improve crystallization. To test this hypothesis, we developed software that ranks lysine sites in a target protein based on the redundancy-corrected KR substitution frequency in homologs. This software can be run interactively on the worldwide web at https://www.pxengineering.org/. We demonstrate that three unrelated single-domain proteins can tolerate 5–11 KR substitutions with at most minor destabilization, and, for two of these three proteins, the construct with the largest number of KR substitutions exhibits significantly enhanced crystallization propensity. This approach rapidly produced a 1.9 Å crystal structure of a human protein domain refractory to crystallization with its native sequence. Structures from Bulk KR-substituted domains show that the engineered arginine residues frequently make hydrogen bonds across crystal-packing interfaces. We thus demonstrate that Bulk KR substitution represents a rational and efficient method for probabilistic engineering of protein surface properties to improve crystallization.

1 INTRODUCTION

More than 50 years after the solution of the first protein crystal structure (Kendrew, 1959; Kendrew et al., 1958; Kendrew & Perutz, 1948), protein crystallization remains a hit-or-miss proposition. Synergistic developments in crystallographic methods (Hendrickson et al., 1990; Liebschner et al., 2019; Liu & Hendrickson, 2017; Otwinowski & Minor, 1997; Sheldrick, 2010; Terwilliger, 2001), synchrotron beamlines (Grimes et al., 2018; Hendrickson, 2000; Sanishvili & Fischetti, 2017; Wilson, 2022), and high-speed computing have made structure solution and refinement routine, even for massive complexes, but only if high-quality crystals are available. However, there has been comparatively little progress in improving methods for protein crystallization. Structural genomics consortia systematically confirmed that most naturally occurring proteins do not readily yield high-quality crystals suitable for x-ray structure determination and that crystallization is the major obstacle to the determination of protein structures using diffraction methods (Canaves et al., 2004; Price 2nd et al., 2009; Slabinski et al., 2007). While numerous methods have been developed that have some efficacy in improving protein crystallization properties (Anstrom et al., 2005; Cieslik & Derewenda, 2009; Cooper et al., 2007; Czepas et al., 2004; Derewenda, 2004a; Derewenda, 2004b; Derewenda & Godzik, 2017; Derewenda & Vekilov, 2006; Janda et al., 2004; Longenecker et al., 2001; Mateja et al., 2002; Qiu & Janson, 2004), none work with sufficiently high efficiency to have been applied with significant frequency by practicing crystallographers. 
While the development of AlphaFold has provided a de facto solution to the protein-folding problem for many sequence families (Jumper et al., 2021; Jumper & Hassabis, 2022; Senior et al., 2020), its occasional failure, limited stereochemical accuracy, and inability to date to model ligand complexes mean that protein crystallography remains widely practiced (Chowdhury et al., 2022; Hendrickson, 2023; Oeffner et al., 2022; Terwilliger et al., 2022; Terwilliger et al., 2023), especially for structure-based drug discovery projects (Bijak et al., 2023). We therefore set out to develop more efficient methods for rational engineering of protein surface properties to improve crystallization propensity.

The first phase of our research identified a large number of local primary sequence patterns, which we called crystallization epitopes, that are strongly overrepresented in crystal-packing interfaces (Naumov et al., 2019). We demonstrated that introducing these epitopes individually into proteins generally increases their crystallization propensity and that introducing multiple such epitopes progressively increases crystallization propensity. The cumulative nature of the observed improvements suggested that multiple simultaneous mutations could potentially produce definitive improvements in crystallization propensity in a single protein construct based on large-scale probabilistic engineering of protein surface properties. We herein present an efficient method to achieve this goal while preserving protein stability and solubility.

Our efforts to develop rational methods to improve protein crystallization properties are grounded in sequence and structural analyses of historical crystallization results and associated thermodynamic studies. Our published analyses of large-scale experimental studies showed that several surface properties of proteins are significant determinants of protein crystallization propensity (Price 2nd et al., 2009). These studies demonstrated that overall thermodynamic stability is not a major determinant of protein crystallization propensity. They identified a number of primary sequence properties that correlate with successful crystal structure determination, including significant anticorrelations with predicted backbone disorder and the sidechain entropy of predicted solvent-exposed residues as well as significant positive correlations with the fractional content of several individual amino acids (Price 2nd et al., 2009). In follow-up studies, we analyzed 87,683 crystal structures from the Protein Data Bank (PDB) and identified contiguous amino acid patterns strongly overrepresented in crystal packing interfaces (Naumov et al., 2019). This analysis also generated data on the relative overrepresentation of individual amino acids in crystal-packing interfaces segregated by protein secondary structure (Figure 1), and these data suggested the streamlined approach reported in this paper that enhances protein crystallization propensity based on multiple simultaneous surface mutations.

FIGURE 1. Overrepresentation ratios (Naumov et al., 2019) of amino acids in crystal-packing interfaces normalized to overall surface composition in 87,684 crystal structures deposited in the Protein Data Bank (PDB). The thick dotted line schematizes the gain in packing probability produced by K-to-R mutations, while the thin dotted lines schematize the changes from N-to-Q and D-to-E mutations. The overrepresentation ratios are segregated by protein 2° structure as assessed by DSSP (Kabsch & Sander, 1983), and the amino acids are ordered in decreasing order of overrepresentation ratio in α-helical secondary structure.

Our computational analysis of crystal-packing interactions in the PDB showed a substantially higher probability for arginine to mediate inter-molecular packing contacts than lysine (Figure 1), consistent with our expectations based on earlier analyses of correlations between primary sequence features and protein crystallization propensity (Price 2nd et al., 2009). The observation that arginine mediates crystal-packing contacts more frequently than lysine is particularly notable because the entropy of the arginine sidechain is estimated to be somewhat higher than that of lysine (Bhowmick & Head-Gordon, 2015; DuBay & Geissler, 2009; Srinivasan & Rose, 1999; Sternberg & Chickos, 1994), which implies its immobilization in an intermolecular interface should tend to incur a higher entropic penalty that reduces its probability of making crystal-packing contacts (Cooper et al., 2007; Czepas et al., 2004; Derewenda, 2004b; Janda et al., 2004; Price 2nd et al., 2009). Therefore, the more frequent occurrence of arginine compared to lysine in crystal-packing contacts suggests that the guanidino group on arginine is substantially “stickier” than the primary amine on lysine in terms of intermolecular interaction free energy.

This inference concerning the comparative interaction potential of arginine vs. lysine is supported by research in physical biochemistry (Auton et al., 2007; Auton et al., 2011; Auton & Bolen, 2007; Beck et al., 2007; Bennion & Daggett, 2003; Bolen & Rose, 2008; Ferreon & Bolen, 2004; Holthauzen et al., 2010; Hu et al., 2010; Scott et al., 2008; Vener et al., 2015; Wetlaufer & Lovrien, 1964; Yang et al., 2000). One straightforward contributor is likely to be the greater hydrogen-bonding (H-bonding) potential of the guanidino group in arginine compared to the primary amine in lysine (i.e., the presence of five H-bond donor protons compared to three). The significance of this factor is supported by molecular dynamics simulations showing multiple stable interaction geometries for salt bridges containing arginine (Liu et al., 2022; Vener et al., 2015). Arginine is much more effective than lysine in inhibiting protein aggregation, which is believed to reflect strong solvation interactions between arginine and protein surfaces (Arakawa et al., 2007; Arakawa & Tsumoto, 2003; Tischer et al., 2010; Tischer et al., 2014; Tsumoto et al., 2005), including at apolar sites (Das et al., 2007; Mitchell et al., 1994). The guanidinium ion, a close analog of the guanidino group in arginine, shows thermodynamically significant interactions with some proteins in their native conformations (Courtenay et al., 2000; Courtenay et al., 2001; Ferreon & Bolen, 2004; Makhatadze & Privalov, 1992; Zarrine-Afsar et al., 2006), and it also enhances the solvation (Venkatesu et al., 2007) and solubility (Wetlaufer et al., 1964) of apolar groups. The greater interaction potential of arginine is further supported by the well-known properties of the guanidinium ion as a potent protein denaturant (Auton et al., 2011; Bolen & Rose, 2008; Pace et al., 2004; Schellman, 1987), a property that is not shared by primary amines. 
The thermodynamics of the denaturation process involve an interplay between enthalpically favorable solvation interactions with protein groups (Courtenay et al., 2000; Courtenay et al., 2001; Ferreon & Bolen, 2004; Liu & Bolen, 1995; Nandi & Robinson, 1984; Robinson & Jencks, 1963; Robinson & Jencks, 1965; Schrier & Schrier, 1976; Scott et al., 2008; Venkatesu et al., 2007; Zheng et al., 2016), perturbations in water structure (Scott et al., 2008) that weaken the hydrophobic effect (Scott et al., 2008; Wetlaufer et al., 1964), and competition with polar groups on the protein for H-bonding to water. Experimental measurements of the free energy of transfer (Nozaki & Tanford, 1970) of model peptides (Courtenay et al., 2000; Courtenay et al., 2001; Liu & Bolen, 1995; Nandi & Robinson, 1984; Robinson & Jencks, 1963; Robinson & Jencks, 1965; Venkatesu et al., 2007) support strong solvation of the backbone being a dominant contributor to guanidinium-induced protein denaturation (Auton et al., 2011; Ferreon & Bolen, 2004; Liu & Bolen, 1995; Schrier & Schrier, 1976). NMR measurements of backbone amide exchange rates show that guanidinium does not H-bond to the backbone (Lim et al., 2009), while quantum chemical calculations support its favorable solvation of the backbone deriving from strong enthalpic interactions with the Cβ atom of the amino acids (Scott et al., 2008). Finally, molecular dynamics simulations show stabilizing interactions increasing the duration of non-covalent associations of guanidinium with both the backbone and many sidechains (Zheng et al., 2016).

Based on the multifaceted evidence that arginine mediates stronger intermolecular interactions with protein groups than lysine, we hypothesized that introducing multiple lysine-to-arginine (KR) substitutions in a protein would enhance crystallization propensity. We furthermore hypothesized that, given the very similar physicochemical properties of arginine and lysine in terms of size and polarity, multiple simultaneous substitutions would be tolerated without significantly impairing thermodynamic stability.

We herein report the results of biophysical studies that support the validity of this hypothesis. We developed a computer program that automates the selection of sites for KR mutagenesis based on the frequency of such substitutions in naturally occurring homologs, which should tend to avoid sites where lysine is critical for function or structural stability. We furthermore characterized the effects of introducing multiple simultaneous KR mutations on the thermodynamic stability, solubility, and crystallization propensity of three unrelated test proteins, one of which crystallizes readily and two of which are recalcitrant to crystallization with their native sequences. These studies demonstrate that introducing multiple KR mutations into a protein, which we call Bulk KR substitution, is a simple and effective method to improve crystallization propensity. Physicochemical analyses have thus guided the development of an efficient method for large-scale probabilistic engineering of protein surface properties to improve crystallization, which was historically considered a stochastic phenomenon refractory to rational experimental manipulation.

2 RESULTS

2.1 KR mutation site-selection algorithm and software

Sites for Bulk KR substitution are ranked and selected based on the frequency of these substitutions in naturally evolved sequences in a phylogenetic alignment (Figure 2). The procedure is fully automated in Python code that is available for download and that can also be run interactively via the worldwide web using our protein crystallization engineering webserver (see Section 4 for details). The algorithm implemented by the program ranks sites based on a redundancy-compensated estimate (explained below) of the frequency of KR substitutions in homologous sequences, which are divided into mutually exclusive bins with progressively lower levels of overall percent identity relative to the target sequence. The first bin includes sequences with less than 99% identity (to avoid mutant variants of the target sequence) and greater than or equal to 90% identity. The next bin includes sequences with less than 90% identity and greater than or equal to 80% identity, while subsequent mutually exclusive bins reduce the range of identity levels in 10% steps down to a minimum of 30%. The algorithm steps through these bins in order from highest to lowest percent identity. In each bin, it selects first the site with the highest estimated number of independent arginine substitutions among up to the seven most remotely related homologs in that bin (Figure 2d). It then selects additional sites that show KR substitutions in the same percent identity bin in decreasing order of their estimated number of independent arginine substitutions among the same set of most remote homologs, stopping when it hits a user-adjustable minimum count.

FIGURE 2. Representative output from the Bulk-R webserver analyzing lysine-to-arginine substitution patterns in homologous proteins. Results are shown for the hPDIa domain. All sequence analyses are conducted on sets of proteins spanning 10% ranges (bins) of sequence identity relative to the target protein, with the exception of the highest identity bin which spans 90%–99% identity to avoid including mutant protein sequences. (a) The top graph shows the Shannon entropy for each lysine site in the target protein in the indicated mutually exclusive sequence % ID bin. The middle graph shows the fraction of residues other than lysine, while the bottom graph shows the ratio of arginine/lysine residues at those sites in the same bins. (b) The total number of sequences in each mutually exclusive % ID bin in which arginine is observed at a lysine site in the target protein. (c) A heuristically redundancy-reduced count of the number of arginine substitutions in each of those bins based on the algorithm described in Section 4.3. (d) The expected number of independent substitutions of arginine for lysine in the course of evolution between up to the seven most diverged sequences in each of the indicated % ID bins. The graph in panel (c) displays the results of the first redundancy-correction calculation described in the text and in Section 4.3, while the graph in panel (d) displays the results of the second, which are used to rank lysine sites in the target protein for mutagenesis. (e) Table showing those rankings to illustrate the site-selection algorithm.

This threshold count is imposed to avoid selecting a site based on an arginine substitution in a single sequence that could potentially be inaccurate or present only in a small number of very closely related sequences, in which case they could potentially share a function-impairing or stability-impairing mutation. The threshold count defaults to a value of 1.1, which ensures observation of a KR substitution in at least two sequences with no more than ~93% identity to one another. Given the details of our heuristic sequence-divergence metric described in Section 4.3, every 0.1 increase in the threshold count is equivalent to either requiring ~7% lower identity in two sequences having the same substitution or having an additional pair of homologs at the same divergence level with the same substitution.
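
The relationship between the threshold count and sequence identity quoted above can be illustrated with a short calculation. The heuristic itself is defined in Section 4.3, which is not reproduced here, so the linear form below is only an assumption, chosen because it reproduces the two numbers given in the text (a count of 1.1 at ~93% identity and ~7% identity per 0.1 step). The function names and the `COUNT_PER_PERCENT_DIVERGENCE` constant are hypothetical.

```python
# Assumed linear relationship: two sequences sharing a K-to-R substitution
# contribute a count of 1 when identical and gain 0.1 for every ~7% of
# divergence. This is an illustration, not the metric from Section 4.3.
COUNT_PER_PERCENT_DIVERGENCE = 0.1 / 7.0


def pair_count(percent_identity: float) -> float:
    """Hypothetical redundancy-corrected count for one pair of homologs."""
    divergence = 100.0 - percent_identity
    return 1.0 + COUNT_PER_PERCENT_DIVERGENCE * divergence


def max_identity_for_threshold(threshold: float) -> float:
    """Highest pairwise identity at which a single pair reaches the threshold."""
    return 100.0 - (threshold - 1.0) / COUNT_PER_PERCENT_DIVERGENCE


if __name__ == "__main__":
    for threshold in (1.1, 1.2, 1.3):
        identity = max_identity_for_threshold(threshold)
        print(f"threshold {threshold:.1f}: pair identity of at most ~{identity:.0f}%")
```

Under this assumed form, raising the threshold from the default of 1.1 to 1.2 would require the two sequences to be no more than ~86% identical to one another.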

After hitting the specified threshold or selecting all sites showing KR substitutions above the threshold count in the bin being evaluated, the algorithm progresses to the next lower percent identity bin and implements the same site-selection protocol in that bin. This selection algorithm continues until the same site-selection protocol is executed in the final mutually exclusive bin, which includes sequences with less than 40% identity and greater than or equal to 30% identity to the target protein. The algorithm thus provides a rank-order for mutation of all lysine sites in the target protein at which mutations to arginine are observed above the threshold count in any of the evaluated percent identity bins. The program outputs the complete list of sites selected by the algorithm ranked according to their order of selection (e.g., as shown in Figure 2e).
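
The bin structure and selection order described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the released program: the function names are hypothetical, the redundancy-corrected counts are taken as precomputed inputs, a site is assumed to qualify when its count is greater than or equal to the threshold, and ties are broken by site number (also an assumption).

```python
from typing import Dict, List, Optional, Tuple


def identity_bins() -> List[Tuple[float, float]]:
    """Return (upper, lower) percent-identity bounds, highest bin first.

    Each bin holds sequences with lower <= identity < upper.
    """
    bins = [(99.0, 90.0)]  # top bin excludes >= 99% to skip mutant variants
    for upper in range(90, 30, -10):
        bins.append((float(upper), float(upper - 10)))
    return bins


def assign_bin(percent_identity: float,
               bins: List[Tuple[float, float]]) -> Optional[int]:
    """Index of the bin containing this identity, or None if outside all bins."""
    for index, (upper, lower) in enumerate(bins):
        if lower <= percent_identity < upper:
            return index
    return None


def rank_sites(counts_by_bin: List[Dict[int, float]],
               threshold: float = 1.1) -> List[int]:
    """Rank lysine sites for K-to-R mutation.

    counts_by_bin[i] maps a lysine site number to its estimated number of
    independent arginine substitutions in bin i (highest identity first).
    """
    ranked: List[int] = []
    for counts in counts_by_bin:
        # Keep sites at or above the threshold that were not already selected
        # in a higher-identity bin.
        candidates = [(site, count) for site, count in counts.items()
                      if count >= threshold and site not in ranked]
        # Highest count first; site number breaks ties.
        candidates.sort(key=lambda item: (-item[1], item[0]))
        ranked.extend(site for site, _ in candidates)
    return ranked


if __name__ == "__main__":
    # Made-up counts for four hypothetical lysine sites in three bins.
    example = [{10: 2.0, 25: 1.5, 40: 1.0}, {40: 1.3, 10: 3.0}, {55: 1.2}]
    print(rank_sites(example))
```

Because the bins are visited from highest to lowest identity, a site seen in close homologs is ranked ahead of one seen only in distant homologs, even if the latter has a larger count.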

The software provides graphical displays of summary parameters characterizing the amino acid distribution in the homologs in each of the percent-identity bins at every lysine site in the target sequence (Figure 2), as well as a graphical display of the overall sequence diversity in each of the bins (Figure S1). The displayed summary parameters are the Shannon entropy of the amino-acid frequency distribution, the frequency of all residues other than lysine, the KR ratio, the total count of sequences with an arginine residue at the site, and two different estimates of that count after compensation for redundancy between those sequences. Both redundancy-compensation calculations use the same heuristic estimate of the degree of mutational resampling between pairs of sequences, as described in Section 4.3, which provides explanations of the details of the two algorithms. In brief, the first redundancy-reduced count evaluates all sequences using a calculation that has rigorously correct behavior in the cases of full redundancy and full independence between the sequences but is otherwise approximate. The second count provides a rigorous probabilistic estimate of the number of independent KR substitutions between the seven most remotely related sequence pairs that have arginine at that site in the percent identity bin being evaluated. Extending this calculation to more sequences is computationally prohibitive, but the estimate based on a limited set of the most diverged homologs provides a highly effective method to ensure that multiple independently determined protein sequences have an arginine residue at the lysine site in the target protein, which is the essential goal of the redundancy-compensation calculations. This second calculation is used for the automated site-ranking algorithm described above.
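
The first four summary parameters listed above are simple functions of the residues observed at one lysine site within one percent-identity bin. The sketch below shows one way to compute them. It is not taken from the released software: the function name is hypothetical, the entropy is assumed to be in bits (log base 2), and alignment gaps are assumed to have been removed beforehand. The two redundancy-compensated counts depend on Section 4.3 and are not attempted here.

```python
import math
from collections import Counter
from typing import Dict, Optional


def site_summary(residues: str) -> Dict[str, Optional[float]]:
    """Summarize the residues aligned to one lysine site in one identity bin.

    residues: one-letter amino acid codes, one per homolog (no gaps).
    """
    if not residues:
        raise ValueError("at least one residue is required")
    counts = Counter(residues)
    total = sum(counts.values())

    # Shannon entropy of the amino acid frequency distribution (bits assumed).
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())

    lysines = counts.get("K", 0)
    arginines = counts.get("R", 0)
    return {
        "shannon_entropy": entropy,
        "fraction_non_lysine": 1.0 - lysines / total,
        # The ratio is undefined when no homolog retains lysine at the site.
        "arg_lys_ratio": arginines / lysines if lysines else None,
        "arginine_count": float(arginines),
    }


if __name__ == "__main__":
    # Six hypothetical homologs: three keep lysine, two have arginine.
    print(site_summary("KKKRRQ"))
```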

The program additionally provides a ranking of sites for introducing aspartate-to-glutamate (DE) and asparagine-to-glutamine (NQ) mutations together with a record of which of those sites have potential salt-bridging or H-bonding partners in the target sequence, which would tend to reduce the entropy of the longer sidechains in β-sheet (i ± 2) or α-helical (i ± 3, i ± 4) secondary structures (Donald et al., 2011; Olson et al., 2001; Vener et al., 2015). (The rationale behind this approach is described in Section 3 below.) Lysine, arginine, and histidine are considered potential salt-bridging (ionic interaction) partners for glutamate and H-bonding partners for glutamine. Asparagine, glutamine, serine, and threonine are considered potential H-bonding partners for both glutamate and glutamine, while aspartate and glutamate are also considered potential H-bonding partners for glutamine.
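
The partner rules in the preceding paragraph translate directly into a lookup over sequence offsets. The sketch below is illustrative only: the function name, the 0-based indexing, and the use of DSSP-style one-letter codes (`E` for β-strand, `H` for α-helix) for a per-residue secondary structure string are assumptions, and how the actual program obtains secondary structure is not stated in this section.

```python
from typing import List, Tuple

# Sequence offsets that place sidechains on the same face of each element.
OFFSETS = {"E": (2,), "H": (3, 4)}

# Partner classes as listed in the text, keyed by the residue being introduced.
PARTNERS = {
    "E": {"salt bridge": set("KRH"), "H-bond": set("NQST")},
    "Q": {"H-bond": set("KRH") | set("NQST") | set("DE")},
}


def potential_partners(sequence: str, secondary_structure: str,
                       index: int, new_residue: str) -> List[Tuple[int, str, str]]:
    """List potential partners for a D-to-E or N-to-Q mutation at `index`.

    Returns (position, residue, interaction) tuples, positions 0-based.
    """
    if new_residue not in PARTNERS:
        raise ValueError("new_residue must be 'E' or 'Q'")
    offsets = OFFSETS.get(secondary_structure[index], ())
    positions = sorted(index + sign * offset
                       for offset in offsets for sign in (-1, 1))
    found = []
    for position in positions:
        if 0 <= position < len(sequence):
            for interaction, residues in PARTNERS[new_residue].items():
                if sequence[position] in residues:
                    found.append((position, sequence[position], interaction))
    return found


if __name__ == "__main__":
    # Hypothetical helical segment: D at index 1, K at index 4 (i + 3).
    print(potential_partners("ADAAKAAR", "HHHHHHHH", 1, "E"))
```

Sites in coil regions return an empty list in this sketch because no offset rule is given for them in the text.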

Our site-selection strategy, which is based on making amino acid substitutions observed in multiple non-redundant homologs, will tend to preserve activity because evolutionary selection for organismic fitness tends to preserve protein function. Consistent with the observed plasticity in homologous protein sequences in the course of biological evolution, large-scale saturation mutagenesis studies using next-generation sequencing methods support most amino acid substitutions at most positions preserving qualitative protein function (Gupta & Varadarajan, 2018; Hopf et al., 2017; Kelsic et al., 2016; Nisthal et al., 2019). Based on these data, a physicochemically conservative substitution observed in strongly homologous proteins is unlikely to abrogate activity. Nonetheless, multiple conservative point mutations could impair protein function, so direct experimental evaluation of protein activity would be required to verify the functional competency of proteins harboring Bulk KR mutations. Lysine residues known or suspected to be important for activity or involved in oligomeric interactions should generally be avoided when implementing the method.

2.2 Test protein selection and expression

We chose to test the Bulk KR substitution approach using three proteins with different crystallization properties. The hPDIa domain is a human drug target (Hoffstrom et al., 2010; Khan et al., 2011) that represents the first of four domains in the endoplasmic-reticulum-resident human Protein Disulfide Isomerase (hPDI) protein. The hPDIa domain had never successfully been crystallized on its own, but its structure was known from a relatively low-resolution crystal structure of a much longer multi-domain construct containing hPDIa (Wang et al., 2013). This crystal structure enables evaluation of the impact of Bulk KR substitutions on the conformation of the hPDIa domain, as reported below. Escherichia coli RNaseH is difficult to crystallize in the absence of ligands stabilizing active site structure but has had its crystal structure determined by groups studying its enzymological mechanism and folding (Goedken & Marqusee, 2001; Katayanagi et al., 1992; Katayanagi, Ishikawa, et al., 1993; Katayanagi, Okumura, et al., 1993; Liao et al., 2022; Yang et al., 1990). MA_2137, an S-adenosyl-methionine-dependent RNA methyltransferase from Methanosarcina acetivorans, crystallizes well in the presence of S-adenosyl-homocysteine (SAH), the product of the methyltransferase reaction that it catalyzes. We included this last protein because we previously demonstrated that increasing hit count in high-throughput crystallization screening is strongly correlated with the probability of successful crystal-structure determination (Price 2nd et al., 2009), which implies quantification of hit count for a protein that crystallizes well is an effective assay for crystallization propensity. KR mutations were introduced into the D65R mutant of MA_2137 because we had previously demonstrated that this single mutation improves the crystallization of this protein, and we wanted to determine whether Bulk KR substitutions could improve it even further.

We introduced 2–11 KR mutations into these proteins (Table 1), and we first examined the expression and solubility levels of the full set of mutant constructs when expressed from a pET plasmid using T7 RNA polymerase in E. coli, which yields high-level expression of the three parental proteins in the form of efficiently purified monodisperse monomers. The largest number of KR mutations that we tested preserved high-yield protein production in a monodisperse state for both hPDIa and MA_2137-D65R (i.e., the hPDIa-9KR and MA_2137-D65R-11KR constructs). Only two native lysine residues remain in each of these constructs. The RNaseH-2KR and RNaseH-5KR constructs similarly preserved high-yield protein production in a monodisperse state. However, the RNaseH-7KR construct yielded polydisperse protein that co-purified with the Hsp33 molecular chaperone protein (Graf et al., 2004; Moayed et al., 2020), while the RNaseH-11KR construct was completely insoluble even though it expressed at a high level (data not shown). The stability studies presented in the next section confirm earlier research (Goedken et al., 2000; Ishikawa et al., 1993) showing that RNaseH has a low thermal melting temperature of ~45°C, making it marginally stable, which likely explains its tolerance for fewer KR mutations than the other target proteins.

TABLE 1. Summary of expression, stability, and crystallization results for Bulk KR mutant proteins.

| Target protein | Construct | KR mutation sites | Expression | Solubility | Apparent Tm (°C) | Apparent ΔHvH (kcal/mol) | # Hits at 4 weeks | Crystal resolution (Å) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hPDIa (120 amino acids with 11 native lysines) | WT | | +++ | +++ | 59.9, 67.9 | 204 ± 36, 131 ± 9 | 0 | n/p |
| | 2KR | 42, 114 | +++ | +++ | 58.9, 68.6 | 201 ± 40, 117 ± 5 | n/a | n/a |
| | 5KR | 2KR + 130, 69, 71 | +++ | +++ | 48.2, 65.7 | 47.7 ± 2, 125 ± 3 | n/a | n/a |
| | 7KR | 5KR + 31, 131 | +++ | +++ | 49.5, 64.1 | 89.8 ± 5, 156 ± 5 | n/a | n/a |
| | 9KR | 7KR + 57, 65 | +++ | +++ | 60.4 | 161 ± 3 | 9 | 1.89 |
| MA_2137 (194 amino acids with 13 native lysines) | WT | | +++ | +++ | 69.0 | 283 ± 13.5 | 75 | n/a |
| | D65R | | +++ | +++ | 68.7 | 250 ± 4.60 | 126 | 1.60 |
| | D65R-3KR | 126, 129, 194 | +++ | +++ | 67.1 | 194 ± 3.30 | n/a | n/a |
| | D65R-5KR | 3KR + 52, 71 | +++ | +++ | 67.4 | 153 ± 7.10 | n/a | n/a |
| | D65R-7KR | 5KR + 172, 133 | +++ | +++ | 65.5 | 193 ± 3.90 | n/a | n/a |
| | D65R-11KR | 7KR + 8, 64, 142, 155 | +++ | +++ | 63.4 | 150 ± 2.60 | 238 | 1.91 |
| RNaseH (166 amino acids with 11 native lysines) | WT | | +++ | +++ | 45.2 | 135 ± 2.20 | 0 | n/p |
| | 2KR | 31, 90 | +++ | +++ | n/a | n/a | n/a | n/a |
| | 5KR | 2KR + 66, 89, 35 | +++ | +++ | 45.5 | 128 ± 2.60 | 0 | n/p |
| | 7KR | 5KR + 107, 124 | +++ | + | n/p | n/p | n/p | n/p |
| | 11KR | 7KR + 111, 62, 88, 101 | +++ | | n/p | n/p | n/p | n/p |

Note: The constructs harboring increasing numbers of KR mutations also include all of those in the constructs of the same protein harboring fewer KR mutations. The order of addition of KR mutations for hPDIa is slightly different from the rankings produced by the automated site-selection algorithm (Figure 2e) because this experiment was initiated before development of that software was completed. The code “n/p” stands for not possible, while the code “n/a” stands for not applicable or not attempted. The Tm and ΔHvH values are labeled as apparent because the reversibility of the unfolding transitions was not assessed in the thermal denaturation experiments.

2.3 KR mutations are generally only minimally destabilizing

The thermal stabilities of all the successfully purified Bulk KR constructs were characterized using circular dichroism (CD) spectroscopy. These assays show a variable but generally very small degree of destabilization by KR mutations (Figure 3 and Table 1). RNaseH-5KR shows an approximately unaltered apparent Tm compared to the wild-type (WT) protein, demonstrating that KR mutations can have a completely neutral effect on stability. hPDIa-9KR shows an ~8° reduction compared to the 68°C apparent Tm of the WT domain, while MA_2137-D65R-11KR shows an ~6° reduction compared to the 69°C apparent Tm of the parental protein. Considering the entire set of mutant proteins in our study that could be purified, which includes 25 different KR mutations (Table 1), there is on average a 0.54 ± 0.30° reduction in apparent Tm per KR mutation. Therefore, KR mutations are generally very well tolerated, although large sets of mutations tend to produce modest reductions in protein stability (Sokalingam et al., 2012) that can reduce soluble protein yield in vivo when the stability of the WT protein is relatively low.
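
To give a sense of what the apparent Tm and van't Hoff enthalpy (ΔHvH) values in Table 1 imply about the shape of a melting curve, the sketch below evaluates the standard two-state unfolding model. The fitting procedure used for the CD data is not described in this section, so this is an assumption: the model takes a single cooperative transition with a temperature-independent enthalpy (no heat capacity change), and the function name is hypothetical.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal / (mol * K)


def fraction_unfolded(temp_c: float, tm_c: float, dh_vh_kcal: float) -> float:
    """Fraction of unfolded protein in a two-state model.

    ln K = (dH_vH / R) * (1 / Tm - 1 / T), with K = [unfolded] / [folded],
    so K = 1 (half unfolded) at T = Tm.
    """
    temp_k = temp_c + 273.15
    tm_k = tm_c + 273.15
    ln_k = (dh_vh_kcal / GAS_CONSTANT) * (1.0 / tm_k - 1.0 / temp_k)
    return 1.0 / (1.0 + math.exp(-ln_k))


if __name__ == "__main__":
    # Apparent Tm and dH_vH for WT MA_2137 as listed in Table 1.
    for temp in (60.0, 65.0, 69.0, 73.0, 80.0):
        print(f"{temp:5.1f} °C  fraction unfolded = "
              f"{fraction_unfolded(temp, 69.0, 283.0):.4f}")
```

In this model a larger ΔHvH gives a sharper transition around Tm, which is why the two parameters are reported together. Constructs with two listed Tm values in Table 1 would need a sum of two such transitions.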

Thermodynamic stability of Bulk KR mutants characterized using thermal denaturation experiments monitored by circular dichroism spectroscopy. Experiments were conducted using protein samples at 2 mg/mL scanned at a rate of 3°/min in a buffer containing 100 mM NaCl and 10 mM Tris-Cl, pH 7.5, with the addition of 1 mM SAH for the MA_2137 constructs. The suppression of the low-temperature transition in the hPDIa-9KR construct could be attributable to formation of an intra-helical salt bridge between the sidechains of residues E62 and R65. The latter residue is one of the KR mutations in this construct not shared by the hPDIa-7KR construct, and the sidechain of residue K65 does not make any H-bonds at all in the crystal structure of the multidomain construct (PDB id 4EKZ). Therefore, the sidechain salt bridge produced by the K65R mutation could potentially stabilize the local structure in the hPDIa domain. KR, lysine-to-arginine.

2.4 Bulk KR mutations enhance crystallization propensity and yield strongly diffracting crystals

The purified protein constructs harboring the largest number of KR mutations (i.e., PDIa-9KR and MA_2137-D65R-11KR) along with matched controls were screened for crystallization at the National Crystallization Center at the Hauptman-Woodward Institute (HWI) using their automated, high-throughput 1536-condition screen. This well-documented (Budziszewski et al., 2023; Luft et al., 2001; Luft et al., 2003; Luft, Snell, et al., 2011; Luft, Wolfley, et al., 2011; Lynch et al., 2023) microbatch under-oil screen was employed for initial crystallization screening by the Northeast Structural Genomics Consortium (Acton et al., 2011; Boel et al., 2016; Everett et al., 2016; Xiao et al., 2010) (www.nesg.org), which used it to generate 664 crystal structures deposited in the PDB. Neither the WT nor 5KR construct of RNaseH yielded any crystallization hits in a screen intentionally conducted without any ligands stabilizing active site structure in order to provide the most exacting test of protein crystallization propensity; the lack of success for this protein was potentially influenced by the high 15 mg/mL protein concentration used for screening, which produced pervasive amorphous precipitation in the screen at the earliest observation times (data not shown). However, the hPDIa-9KR and MA_2137-D65R-11KR constructs both yielded significantly more crystallization hits than the control proteins. MA_2137-D65R-11KR yielded hits under twice as many conditions as the MA_2137-D65R control protein, while hPDIa-9KR yielded nine high-quality hits compared to no hits at all for the WT construct (Figure 4 and Table 1). A small number of hit conditions for each protein were chosen for optimization, which very rapidly yielded 1.9 Å structures for both Bulk KR constructs based on a single session of remote synchrotron diffraction screening and data collection (Figures 5 and S1 and Table S1). 
Therefore, for both target proteins, crystallization screening only had to be conducted on the soluble construct harboring the largest number of Bulk KR mutations to rapidly obtain high-quality crystal structures.

Crystallization hit counts from the 1536-condition high-throughput automated microbatch-under-oil crystallization screen at the National Crystallization Center at the Hauptman-Woodward Institute (HWI). These screens were conducted during the summer of 2021 using generation 19 of the HWI crystallization cocktail collection. The proteins were at ~15 mg/mL in the stock solutions used for screening, which contained 100 mM NaCl, 10 mM DTT, and 10 mM Tris-Cl, pH 7.5. The screening plates were maintained at 4°C for the first week and then moved to 25°C for the remainder of the screening period (Luft et al., 2003; Luft, Snell, et al., 2011; Luft, Wolfley, et al., 2011; Snell et al., 2008).
Crystal-packing interactions in structures from protein constructs containing bulk lysine-to-arginine (KR) mutations and the parental protein constructs. The protein backbone is shown in ribbon representation, colored gray for symmetry mates, and shades of green, blue, cyan, and teal for the subunits modeled in the asymmetric unit of each crystal structure. The sidechains of the native arginines (magenta) and the residues at the KR mutation sites (red) are shown in space-filling representation on the left and in stick representation in the zoomed-in views of local packing interactions in the boxes to the right. Therefore, red residues in the hPDI(abb′xa′) and MA_2137-D65R structures are the native lysine residues, while the red residues in the hPDIa-9KR and MA_2137-D65R-11KR structures are the mutated arginine residues. Asp, Asn, Glu, and Gln residues making H-bonds to the illustrated lysine and arginine residues in the boxes are shown in blue stick representation, while the backbone atoms making H-bonds to those residues are shown in ball-and-stick representation. Figure S2 presents magnified stereo views of the crystal-packing interactions in all four structures.

2.5 Bulk KR mutations do not perturb protein structure and frequently make H-bonds in crystal-packing interfaces

The 1.9 Å crystal structures of our Bulk-KR-substituted constructs (Figures 5 and S2 and Table S1) show 0.32–0.33 Å root-mean-square deviations for their backbone Cα atoms compared to the reference structures (Figure S3) (i.e., the much larger multidomain hPDI(abb′xa′) construct for hPDIa because the isolated domain has never successfully been crystallized before and the parental MA_2137-D65R construct for MA_2137-D65R-11KR). The observed deviations are close to the expected coordinate error in well-refined crystal structures in the operative resolution range (Cruickshank, 1960), indicating our Bulk KR substitution method does not significantly perturb protein conformation for either of our targets. Moreover, the detailed analyses of protein conformation presented in Figure S3 demonstrate that the distribution of the local backbone deviations at the KR mutation sites is equivalent to that observed for all residues in each structure.
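For readers less familiar with the metric, the root-mean-square deviation over paired Cα atoms can be computed as in the minimal sketch below. It assumes the two structures have already been superimposed and that their Cα coordinates are supplied as equal-length lists of matched (x, y, z) tuples. The superposition step and the function name are not taken from the paper, and the sketch is not the software used for Figure S3.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD in Å between two pre-superimposed, paired lists of Cα coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy example with two atoms displaced by 0.3 Å and 0.4 Å.
example = ca_rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0.3), (1, 0.4, 0)])
print(f"{example:.3f} Å")
```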

Detailed analyses of the intermolecular interactions in our crystal structures demonstrate that the engineered arginine sidechains make extensive crystal-packing contacts. Their contact counts consistently exceed the number of van der Waals contacts and especially H-bonds made by the native arginine sidechains in the same constructs, and they greatly exceed the number of both kinds of contacts made by lysine sidechains in the parental constructs (Table 2 and Figure 5). The larger number of crystal-packing contacts made by the engineered vs. native arginine residues could potentially reflect greater sequestration of the native residues in local surface interactions reducing the probability of reaching across a packing interface to make an energetically stabilizing interaction with a neighboring molecule in the crystal lattice. More extensive experimentation will be required to evaluate this possibility and also to establish the statistical robustness of the trends documented in Table 2. The observed trends nonetheless support the premise underlying our Bulk KR substitution strategy, which was based on the substantially stronger overrepresentation of arginine versus lysine in crystal-packing interfaces in our large-scale analysis of crystal structures previously deposited in the PDB (Figure 1).

TABLE 2. Crystal-packing contacts in reference and Bulk KR protein structures.
| Construct | WT | 9KR | 9KR | D65R | D65R-11KR | Totals |
|---|---|---|---|---|---|---|
| PDB ID (chain) | 4EKZ (A) | 8GDY (A) | 8GDY (B) | 6MRO (A) | 8GDU (A) | |
| Resolution (Å) | 2.51 | 1.93 | 1.93 | 1.6 | 1.95 | |
| Crystal Solvent Content | 44.2% | 37.4% | 37.4% | 35.2% | 51.2% | |
| Domain | hPDIa | hPDIa | hPDIa | MA_2137 | MA_2137 | |
| # Residues in domain | 120 | 118 | 120 | 194 | 194 | 746 |
| # Ordered surface residues | 75 | 74 | 74 | 118 | 118 | 459 |
| # Ordered surface residues not K or R | 58 | 57 | 57 | 94 | 97 | 363 |
| # Disordered K | 0 | 0 | 0 | 0 | 0 | 0 |
| # Ordered K (all surface-exposed) | 11 | 2 | 2 | 13 | 2 | 30 |
| # Disordered native R | 0 | 0 | 0 | 0 | 2 | 2 |
| # Ordered native R (all surface-exposed) | 6 | 6 | 6 | 11 | 11 | 40 |
| # Disordered engineered R | 0 | 0 | 0 | 0 | 1 | 1 |
| # Ordered engineered R (all surface-exposed) | n/a | 9 | 9 | n/a | 8 | 26 |
| BB vdW contacts per residue in domain | 0.72 | 0.37 | 0.43 | 0.45 | 0.27 | 0.43 |
| BB vdW per ordered surface residue | 1.15 | 0.59 | 0.70 | 0.75 | 0.44 | 0.70 |
| BB vdW contacts per ordered surface residue not K or R | 1.41 | 0.68 | 0.84 | 0.84 | 0.48 | 0.81 |
| BB vdW contacts per ordered surface K | 0.27 | 0 | 0 | 0.69 | 0 | 0 |
| BB vdW contacts per ordered surface native R | 0.17 | 0 | 0 | 0 | 0.09 | 0.05 |
| BB vdW contacts per ordered surface engineered R | n/a | 0.56 | 0.44 | n/a | 0.50 | 0.50 |
| BB H-bond per residue in domain | 0.08 | 0 | 0.03 | 0.04 | 0.04 | 0.04 |
| BB H-bond per ordered surface residue | 0.12 | 0 | 0.05 | 0.07 | 0.06 | 0.06 |
| BB H-bonds per ordered surface residue not K or R | 0.16 | 0 | 0.05 | 0.06 | 0.06 | 0.07 |
| BB H-bonds per ordered surface K | 0 | 0 | 0 | 0 | 0 | 0 |
| BB H-bonds per ordered surface native R | 0 | 0 | 0 | 0 | 0 | 0 |
| BB H-bonds per ordered surface engineered R | n/a | 0 | 0.11 | n/a | 0.13 | 0.08 |
| SC vdW per residue in domain | 0.99 | 1.42 | 1.24 | 1.16 | 0.99 | 1.14 |
| SC vdW per ordered surface residue | 1.59 | 2.26 | 2.01 | 1.92 | 1.63 | 1.86 |
| SC vdW contacts per ordered surface residue not K or R | 1.83 | 1.75 | 1.56 | 1.80 | 1.29 | 1.62 |
| SC vdW contacts per ordered surface K | 0.09 | 0 | 0 | 0.46 | 0 | 0 |
| SC vdW contacts per ordered surface native R | 2.00 | 3.33 | 3.33 | 4.64 | 3.09 | 3.43 |
| SC vdW contacts per ordered surface engineered R | n/a | 5.22 | 4.44 | n/a | 4.13 | 4.62 |
| SC H-bonds per residue | 0.09 | 0.25 | 0.26 | 0.22 | 0.05 | 0.16 |
| SC H-bond per ordered surface residue | 0.15 | 0.39 | 0.42 | 0.36 | 0.08 | 0.27 |
| SC H-bonds per ordered surface residue not K or R | 0.12 | 0.28 | 0.32 | 0.32 | 0 | 0.20 |
| SC H-bonds per ordered surface K | 0.09 | 0 | 0 | 0.15 | 0 | 0.10 |
| SC H-bonds per ordered surface native R | 0.50 | 0.67 | 0.50 | 0.91 | 0.09 | 0.53 |
| SC H-bonds per ordered surface engineered R | n/a | 1.00 | 1.11 | n/a | 1.13 | 1.08 |
| # Ordered K making 2, 1, 0 SC H-bonds | 0, 1, 10 | 0, 0, 2 | 0, 0, 2 | 1, 0, 12 | 0, 0, 2 | 1, 1, 28 |
| # Ordered native R making 5, 4, 3, 2, 1, 0 SC H-bonds | 0, 0, 0, 1, 1, 4 | 0, 1, 0, 0, 0, 5 | 0, 0, 1, 0, 0, 5 | 1, 0, 0, 1, 1, 7 | 0, 0, 0, 0, 1, 10 | 1, 1, 1, 1, 5, 31 |
| # Ordered engineered R making 5, 4, 3, 2, 1, 0 SC H-bonds | n/a | 1, 0, 0, 1, 2, 5 | 0, 1, 1, 1, 1, 5 | n/a | 0, 0, 1, 2, 2, 3 | 1, 1, 2, 4, 5, 13 |
  • a Statistics are tabulated separately for backbone (BB) and sidechain (SC) atoms in the indicated amino acids. The abbreviation n/a stands for not applicable. Note: Rows with a white background give counts relative to all residues in the protein construct. Rows highlighted in dark colors with entries in plain text give counts relative to all ordered surface-exposed residues, while rows highlighted in the corresponding light colors with italicized entries give counts segregated according to amino acid type as indicated by the title on each line. Rows highlighted in shades of blue provide overall counts of surface-exposed residues, while rows highlighted in shades of orange provide counts of van der Waals contacts, and rows highlighted in shades of yellow provide counts of H-bonds. Ratios per engineered arginine reflect exclusively ordered residues and exclude the three disordered arginine residues in the MA_2137-D65R-11KR structure. Ordered residues are defined as having sufficient electron density to be included in the refined coordinate model deposited in the PDB. Candidate H-bonds were initially identified based on the participating heteroatoms having an internuclear separation ≤3.5 Å, but they were included in the count only if visually confirmed to have reasonable interaction geometry. Among the five crystal structures analyzed here, only two potential H-bonding interactions in crystal-packing interfaces fulfilled the distance criterion but failed the geometric evaluation. Individual atoms fulfilling the basic distance and geometric criteria with two different potential H-bonding partner atoms were counted as contributing two H-bonds (Vener et al., 2015), consistent with Coulomb's law being additive. Atom pairs with internuclear separation ≤4.0 Å not meeting the H-bonding criteria were counted as van der Waals contacts.
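The counting rules in this footnote can be summarized as a small decision procedure. The sketch below is only an illustration of those rules. The function names and the two boolean inputs are assumptions, and the geometric check is represented by a flag because the footnote describes it as a visual inspection rather than a numerical criterion.

```python
from collections import Counter

HBOND_MAX = 3.5  # Å, distance cutoff for candidate H-bonds (from the footnote)
VDW_MAX = 4.0    # Å, distance cutoff for van der Waals contacts (from the footnote)


def classify_contact(distance, heteroatom_pair, geometry_ok):
    """Classify one intermolecular atom pair.

    distance        : internuclear separation in Å
    heteroatom_pair : True if both atoms are potential H-bonding heteroatoms
    geometry_ok     : result of the manual geometric inspection (assumed input)
    """
    if heteroatom_pair and distance <= HBOND_MAX and geometry_ok:
        return "hbond"
    if distance <= VDW_MAX:
        # Includes pairs within 3.5 Å that failed the geometric evaluation.
        return "vdw"
    return None


def tally(pairs):
    """Count contacts per category; each atom pair is counted separately, so
    one atom with two valid partners contributes two H-bonds."""
    labels = (classify_contact(*pair) for pair in pairs)
    return Counter(label for label in labels if label is not None)


# Hypothetical atom pairs: (distance, heteroatom_pair, geometry_ok).
counts = tally([(2.9, True, True), (3.1, True, True),
                (3.4, True, False), (3.8, False, False), (4.2, True, True)])
print(dict(counts))
```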

On average, the well-ordered engineered arginine sidechains in our Bulk KR structures make 1.08 H-bonds each to a neighboring protein molecule in the crystal lattice, compared to 0.48 each for the native arginine residues (Table 2). In comparison, the well-ordered lysine sidechains make an average of 0.10 H-bonds each to a neighboring protein molecule in the native structures and none in our Bulk KR structures. These results support our hypothesis for the physicochemical basis of the greater overrepresentation of arginine compared to lysine in crystal-packing interfaces in the PDB, which is that the guanidino group in arginine is substantially more efficacious than the primary amine group in lysine in mediating energetically stabilizing H-bonds in the relevant stereochemical contexts (Figure 5).
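The averages quoted in this paragraph can be reproduced from the Totals column of the H-bond distribution rows at the bottom of Table 2. The sketch below does that arithmetic. The dictionary layout and function name are illustrative, and the counts are copied directly from the table.

```python
# Totals column of Table 2: number of ordered residues (values) making a
# given number of sidechain H-bonds across a crystal-packing interface (keys).
distributions = {
    "engineered R": {5: 1, 4: 1, 3: 2, 2: 4, 1: 5, 0: 13},
    "native R": {5: 1, 4: 1, 3: 1, 2: 1, 1: 5, 0: 31},
    "K": {2: 1, 1: 1, 0: 28},
}


def mean_hbonds(dist):
    """Average number of sidechain H-bonds per ordered residue."""
    residues = sum(dist.values())
    bonds = sum(n_bonds * n_res for n_bonds, n_res in dist.items())
    return bonds / residues


means = {name: mean_hbonds(dist) for name, dist in distributions.items()}
for name, value in means.items():
    print(f"{name}: {value:.3f} H-bonds per ordered residue")
```

These distributions correspond to 28 H-bonds over 26 engineered arginines, 19 over 40 native arginines, and 3 over 30 lysines.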

The number of van der Waals contacts per ordered sidechain follows a similar trend. Engineered and native arginine residues make on average 4.62 and 3.43 contacts each, respectively. In contrast, lysine residues in the native structures each make on average 0.28 contacts, while the lysine residues in the engineered structures make none (Table 2). The greater number of intermolecular van der Waals contacts made by the arginine sidechains could potentially be influenced by their greater H-bonding propensity leading to more frequent occurrence in crystal-packing interfaces, but additional research will be required to determine the relative energetic contributions of their van der Waals versus H-bonding interactions to lattice stabilization. Notably, the numbers of backbone H-bonding and van der Waals interactions made by arginine versus lysine residues in our reference and engineered structures do not show any clear trends (Table 2).

2.6 Influence of Bulk KR mutations on protein solubility in PEG3350 solutions

Thermodynamic solubility assays using polyethylene glycol 3350 (PEG3350) to induce protein precipitation (Arakawa & Timasheff, 1985; Bhat & Timasheff, 1992; Kita et al., 1994) assess the relative free energy of the hydrated state of individual protein molecules compared to the most favorable self-associated state under conditions of constant ionic strength but reduced water activity (effective concentration). In practice, these assays monitor optical density at 280 nm in the supernatant of solutions containing different concentrations of protein in the presence of increasing concentrations of PEG3350 after centrifugation to remove large particulate molecular assemblies. Therefore, they effectively measure the equilibrium concentration of protein that remains soluble as water activity is reduced. The observed results depend intrinsically on the free energy of the self-associated state, which varies significantly for different proteins in different solvent environments and can include crystalline phases and liquid–liquid phase separated (LLPS) phases in addition to heterogeneous amorphously precipitated phases. This factor can complicate the interpretation of thermodynamic solubility assays, but they nonetheless provide insight into physicochemical properties that ultimately control protein crystallization behavior.

WT hPDIa and the 9KR mutant show no significant difference in their behavior in PEG3350 precipitation assays (Figure S4a), indicating that the Bulk KR substitutions in this protein domain do not alter its thermodynamic solubility under these conditions even though they enable crystallization and high-resolution structure determination of a domain that does not crystallize at all with its native sequence. Even when harboring the 9KR mutations, hPDIa crystallizes only under a very small fraction (0.6%) of the solution conditions explored in high-throughput crystallization screening (Figure 4) while showing amorphous precipitation in many of them (data not shown). Therefore, our thermodynamic solubility assays on the hPDIa constructs are likely measuring the free energy of the hydrated state of individual protein molecules compared to amorphously precipitated phases, and they demonstrate that the physicochemical properties controlling the formation of such phases are likely different from those controlling protein crystallization behavior.

PEG3350 precipitation assays on our MA_2137 constructs demonstrate more complex phase behavior (Figure S4b,c) likely reflecting different physical forms of self-association under assay conditions. Notably, the WT and D65R mutant could not be precipitated by the highest 35% (v/v) concentration of PEG3350 that was assayed (Figure S4b). These protein constructs instead showed some tendency to exhibit a small increase in optical density at low PEG3350 concentration, likely reflecting light scattering due to some form of protein self-association in a low-density state that does not sediment during low-speed centrifugation. During crystallization screening, these constructs showed clear evidence of LLPS without any apparent amorphous precipitation in many reaction conditions (Figure S5). Therefore, the inability to precipitate these constructs at high PEG3350 concentration likely reflects LLPS being energetically more favorable for this protein under conditions of low water activity than amorphous precipitation. Our crystal structures of MA_2137 constructs show clear and well-ordered electron density for every residue in this 202-residue protein except for the C-terminal hexahistidine tag that was added to enable purification using NiNTA affinity chromatography and a 12-residue internal loop that is disordered in the MA_2137-D65R-11KR structure, although well ordered by Ca++ ions from the mother liquor in the structure of the parental MA_2137-D65R construct (Figure S3b). Furthermore, our CD thermal melting data demonstrate that the protein is very stably folded (Figure 3). Therefore, our solubility data (Figures S4c and S5) combined with our crystallization screening data (Figure 4b) suggest that MA_2137 undergoes LLPS in an essentially fully folded conformational state.

In contrast to the behavior of the WT and the D65R constructs, the 5KR and 11KR constructs of MA_2137 show precipitation at the highest PEG3350 concentrations used in our solubility assays, with the 11KR construct showing stronger precipitation than the 5KR construct (Figure S4c). These results indicate the free energy of these MA_2137 constructs is lower in the precipitated state than in the LLPS state under conditions of very low water activity, reflecting a reduction in thermodynamic solubility. However, these constructs both crystallize extremely promiscuously, with the 5KR and 11KR constructs yielding crystallization hits in ~8% and ~15% of screened conditions, respectively (Figure 4). These results raise the possibility that the precipitate formed by the 5KR and 11KR constructs at very high PEG3350 concentration could be in a microcrystalline state rather than amorphously precipitated state due to the high efficacy of the Bulk KR mutations in promoting crystallization. Further research will be needed to determine whether the reduced solubility of the 5KR and 11KR constructs reflects the stabilization of crystalline states or amorphously precipitated states of MA_2137-D65R.
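To put the hit rates for the 1536-condition screen on a common footing, the percentages and counts quoted in this section can be converted into one another as sketched below. The nine hits for hPDIa-9KR and the ~8% and ~15% rates for the MA_2137 constructs come from the text. The derived condition counts are approximate because the quoted percentages are rounded, and the function names are illustrative.

```python
SCREEN_SIZE = 1536  # number of conditions in the HWI screen


def hit_percentage(hits, total=SCREEN_SIZE):
    """Percentage of screened conditions that gave a hit."""
    return 100.0 * hits / total


def approx_hit_count(percentage, total=SCREEN_SIZE):
    """Approximate number of hit conditions implied by a rounded percentage."""
    return round(total * percentage / 100.0)


hpdia_9kr_rate = hit_percentage(9)
ma2137_5kr_hits = approx_hit_count(8)
ma2137_11kr_hits = approx_hit_count(15)
print(f"hPDIa-9KR: {hpdia_9kr_rate:.1f}% of conditions")
print(f"MA_2137 5KR: ~{ma2137_5kr_hits} conditions; 11KR: ~{ma2137_11kr_hits} conditions")
```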

3 DISCUSSION

The results presented in this paper demonstrate the efficacy of a new method for probabilistic engineering of protein surface properties to enhance crystallization propensity based on the substitution of multiple lysine (K) residues with arginine (R). The rationale behind this “Bulk KR” substitution method is that lysine and arginine have very similar physicochemical properties, but arginine shows substantially higher overrepresentation than lysine in a large-scale computational analysis we performed of crystal structures deposited in the PDB (Figure 1) (Naumov et al., 2019). We have developed software to rank lysine sites for substitution based on the redundancy-corrected count of KR substitutions observed in homologous proteins with the highest level of sequence identity (Figure 2), based on the rationale that biological evolution selects against destabilizing and function-impairing mutations. We demonstrate that mutations selected this way are only minimally destabilizing (Figure 3 and Table 1) and significantly enhance crystallization propensity for two of three test proteins (Figure 4). The crystals yielded by our Bulk KR method diffract strongly and enabled efficient determination of a 1.9 Å crystal structure (Table S1) for the hPDIa protein domain that does not crystallize at all with its native sequence (Figure 4 and Table 1). Our crystal structures of Bulk KR substituted proteins show no significant conformational or stereochemical differences versus reference proteins (Figure S3). Furthermore, the engineered arginine residues, like the native ones, make both van der Waals contacts and H-bonds in crystal-packing interfaces at substantially higher frequencies than either lysine residues or other residues (Table 2). These crystal structures were produced by the Bulk KR constructs harboring the highest number of substitutions, which were the only constructs for which any diffraction data were measured. 
These results support the efficacy of a streamlined pipeline for crystal structure determination in which solubility is tested for a set of constructs with an increasing number of KR mutations, but purification and crystallization screening are only performed on the construct harboring the largest number of mutations. In summary, the biophysical results presented in this paper support bulk KR substitution being a rational and effective probabilistic strategy to engineer protein surface properties to enhance protein crystallization propensity.
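The site-ranking idea summarized above, which counts how often homologs carry arginine where the target has lysine, can be illustrated with a minimal sketch. This is not the authors' software. It assumes the homologs have already been filtered for high sequence identity and aligned to the target (equal-length strings, gaps as '-'), and it stands in for the redundancy correction by simply collapsing identical homolog sequences, which is an assumption made purely for illustration. The function name and toy sequences are hypothetical.

```python
def rank_kr_sites(target, homologs):
    """Rank lysine positions in `target` by how many aligned homologs carry
    arginine in the same alignment column.

    target   : aligned target sequence (gaps as '-')
    homologs : aligned homolog sequences of the same length as `target`
    Returns a list of (residue_number, arginine_count), best candidates first.
    """
    unique = set(homologs)  # crude redundancy correction (assumption)
    scores = []
    residue_number = 0
    for col, aa in enumerate(target):
        if aa == "-":
            continue  # gap columns do not advance target numbering
        residue_number += 1
        if aa != "K":
            continue
        count = sum(1 for seq in unique if seq[col] == "R")
        scores.append((residue_number, count))
    # Highest substitution count first; ties broken by residue number.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Toy alignment (hypothetical sequences).
ranking = rank_kr_sites("MKAKEK", ["MRAKER", "MRAKER", "MRAREK", "MKAKER"])
print(ranking)
```

In a scheme like this, the top-ranked sites would be mutated first, and constructs with progressively more substitutions would be built by moving down the list.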

Our Bulk KR method focuses on large-scale modification of protein surface properties using mutations between amino acids that conserve qualitative physicochemical properties. Many previous studies have demonstrated that mutation of individual surface residues generally changes crystallization behavior and can significantly improve it, but the strategies evaluated in the past have led to mixed results (Anstrom et al., 2005; Cieslik & Derewenda, 2009; Cooper et al., 2007; Czepas et al., 2004; Derewenda, 2004a; Derewenda, 2004b; Derewenda & Godzik, 2017; Derewenda & Vekilov, 2006; Janda et al., 2004; Longenecker et al., 2001; Mateja et al., 2002; Qiu & Janson, 2004). We therefore focused on the simultaneous introduction of multiple putative crystallization-enhancing mutations based on the hypothesis that stronger and more consistent improvements in crystallization behavior are likely to be promoted by more extensive changes in surface properties, assuming the individual changes tend to increase crystallization probability. This probabilistic surface-engineering strategy requires reliable information on the relative influence of different amino acids on crystallization propensity, and it also requires that the individual mutations do not significantly reduce protein stability so that large-scale mutagenesis does not impair protein folding and prevent effective purification. We used our computational analyses summarized in Figure 1 (Banayan et al., 2023; Naumov et al., 2019) to guide the selection of crystallization-enhancing amino acid substitutions, leading us to focus initially on KR substitutions because of the equivalent charge and similar volume and entropy of lysine and arginine sidechains.

KR mutations have been explored in the past both for their ability to modulate protein stability (Sokalingam et al., 2012) and crystallization propensity (Czepas et al., 2004). Bulk KR substitution in GFP was shown to greatly reduce the amount of soluble protein expressed in vivo in E. coli and also reduce the fluorescence level of the protein that could be purified, although the mutations slowed the rate of unfolding by chemical denaturants (Sokalingam et al., 2012). However, this study only examined 14KR and 19KR mutations at sites selected based on diffuse criteria. Our studies show that KR mutations at 25 sites selected based on the frequency of substitution observed in homologs show on average a 0.54°C reduction in apparent Tm per KR mutation (Figure 3 and Table 1).

A series of previous experimental studies examined the effects of mutating surface-exposed lysine residues on crystallization propensity (Anstrom et al., 2005; Czepas et al., 2004). These studies were guided by different conceptual premises that prioritize fundamentally different kinds of amino acid substitutions. The surface entropy reduction (SER) method, which represented pioneering research on the use of rational surface mutagenesis to improve protein crystallization propensity, focused on mutations that replace high entropy sidechains with low entropy sidechains, especially lysine-to-alanine (KA) mutations, in order to reduce the free energy penalty incurred upon immobilizing flexible surface residues in crystal-packing interfaces (Cieslik & Derewenda, 2009; Cooper et al., 2007; Czepas et al., 2004; Derewenda, 2004a; Derewenda, 2004b; Derewenda & Godzik, 2017; Derewenda & Vekilov, 2006; Janda et al., 2004; Longenecker et al., 2001; Mateja et al., 2002). The alternative approach was based on a computational analysis of the amino acids making crystal-packing interactions in a set of 233 protein crystal structures (Dasgupta et al., 1997). This groundbreaking analysis, which used different computational methods, produced different conclusions compared to our computational analysis of a much larger set of 87,684 crystal structures (Banayan et al., 2023; Naumov et al., 2019) (Figure 1). When comparing results for all 20 amino acids, there is not a statistically significant correlation between the single amino-acid crystal-packing propensities deduced from the two analyses (Figure S6). The most salient difference is that the earlier study concluded that lysine, glutamate, and tryptophan are all disfavored in crystal-packing contacts (Dasgupta et al., 1997), which contradicts our results reported in Figure 1 (Banayan et al., 2023; Naumov et al., 2019). 
Supporting the validity of our analysis, we have successfully used introduction of both lysine and glutamate residues to improve protein crystallization (Naumov et al., 2019). The earlier computational analysis did conclude that arginine and glutamine are favored in crystal-packing contacts (Dasgupta et al., 1997), consistent with our results (Banayan et al., 2023; Naumov et al., 2019) (Figure 1), leading the authors of the earlier analysis to suggest that KR and KQ substitutions could improve protein crystallization propensity.

Two different groups subsequently tested these proposals (Anstrom et al., 2005; Czepas et al., 2004). One group tested nine single, double, or triple KR mutations, five of which produced diffracting crystals, but only one of which produced a crystal that diffracted to higher resolution than the WT protein (Czepas et al., 2004). Their overall conclusion was that KR mutations show lower efficacy in improving protein crystallization than KA mutations (Czepas et al., 2004). The other group conducted a more extensive study that effectively led to the opposite conclusion regarding the efficacy of KA mutations (Anstrom et al., 2005). They systematically examined the substitution of 15 surface-exposed lysine residues with either glutamine or alanine, which did not yield any well-diffracting crystals. This rigorous study demonstrated that the surface mutations consistently changed the crystallization conditions that yielded hits, and weakly diffracting crystals were obtained from one KQ and one KA mutant protein, while none were obtained from the WT protein using the same crystallization screens. However, the KQ mutations yielded hits in those screens at the same rate as the WT protein within experimental error, while the KA mutations yielded hits at a significantly reduced rate that was ~2/3 that of the WT protein (Anstrom et al., 2005). These experimental results are consistent with our computational results reported in Figure 1 (Banayan et al., 2023; Naumov et al., 2019), which show glutamine has a somewhat higher probability of participating in crystal-packing contacts than lysine, while alanine has a significantly lower probability of participating in crystal-packing contacts. 
The experimental studies summarized above highlight the probabilistic nature of the influence of single amino acid substitutions on crystallization propensity, which contributed to the development of our strategy focused on engineering more consistent improvements in crystallization behavior via simultaneous introduction of multiple putative crystallization-enhancing surface mutations.

This large-scale surface-mutagenesis strategy requires that individual mutations preserve or at most minimally perturb protein stability, which has led us to focus on physicochemically conservative substitutions and initially KR mutations. One complication with KA mutations is that the grossly different physicochemical properties of alanine compared to lysine are more likely to produce significant protein destabilization. This effect impedes the introduction of multiple KA mutations, while our results (Tables 1 and 2) and previously reported results (Czepas et al., 2004) show that the introduction of multiple mutations is an effective strategy to improve crystallization propensity. Our computational analyses raise additional questions about the KA method. Alanine is significantly underrepresented in crystal-packing interfaces, while lysine is significantly overrepresented (Figure 1). These observations are consistent with the results of computational analyses demonstrating that, for all amino acids, increasing solvent exposure correlates strongly with increasing the probability of making a crystal-packing interaction (Banayan, 2023). The low solvent exposure of alanine residues at most sites in proteins (Banayan, 2023; Rost & Sander, 1994; Tien et al., 2013) is therefore consistent with the strong underrepresentation of alanine in crystal-packing interfaces (Banayan et al., 2023; Naumov et al., 2019) (Figure 1). Additional computational analyses show that some alanine-containing sequences located in α-helix-capping motifs and short surface loops are significantly overrepresented in crystal-packing interfaces (Naumov et al., 2019; Price 2nd et al., 2009). 
These results suggest that the influence of KA mutations on crystallization propensity is likely to depend strongly on local structural context (Cieslik & Derewenda, 2009; Cooper et al., 2007; Czepas et al., 2004; Derewenda, 2004a; Derewenda, 2004b; Derewenda & Godzik, 2017; Derewenda & Vekilov, 2006; Janda et al., 2004; Longenecker et al., 2001; Mateja et al., 2002) and may involve more complex physicochemical effects than SER.

The conceptual foundation of the crystallization engineering method reported in this paper is fundamentally different from that of earlier studies. As indicated above, it focuses on probabilistic reengineering of protein surface properties via simultaneous substitution of multiple amino acids with similar physicochemical properties but different propensity to make crystal-packing interactions based on large-scale computational analyses of previously determined crystal structures (Banayan et al., 2023; Naumov et al., 2019) (Figure 1). We demonstrated significant improvement in the crystallization properties of two out of three target proteins when the method was applied using a streamlined experimental design in which a single mutant construct of each target was purified and subjected to crystallization screening (Figures 4 and 5, Tables 1 and 2, and Table S1). For each target, the expression and solubility properties were evaluated for a series of constructs containing an increasing number of physicochemically conservative crystallization-enhancing mutations, which is readily done in parallel for at least a dozen constructs, but exclusively the soluble construct with the greatest number of crystallization-enhancing mutations was purified, which is substantially more labor-intensive. This streamlined experimental protocol led to solution of a high-resolution crystal structure (Figure 5b and Table S1) for the hPDIa domain, a human drug target that does not yield any crystal hits with its native sequence (Figure 4b).

Given the success of the Bulk KR method in improving protein crystallization behavior, our computational analysis of crystal-packing interactions in the PDB (Figure 1) suggests several related strategies with promise to improve crystallization behavior based on the same conceptual approach. Aspartate and glutamate frequently substitute for one another in the course of evolution (Henikoff & Henikoff, 1993; Morcos et al., 2011; Thomas et al., 2008) due to their very similar physicochemical properties, but glutamate shows over twofold higher overrepresentation in crystal-packing interfaces (Figure 1). A similar trend relative to crystal packing interactions is observed for asparagine and glutamine, which also have very similar physicochemical properties. These observations suggest bulk DE and NQ substitutions are also likely to improve crystallization propensity.

In the case of these substitutions, the higher entropy of the sidechain with greater crystal-packing propensity will tend to thermodynamically oppose immobilization in a crystal-packing interface, while this factor does not apply to bulk KR substitution due to the very similar entropy of these sidechains. However, our computational analyses of crystal-packing interactions in the PDB (Banayan, 2023; Naumov et al., 2019) show that high-entropy sidechains mediating crystal-packing interactions tend to participate in salt-bridging and H-bonding interactions (Donald et al., 2011; Olson et al., 2001; Vener et al., 2015) with nearby residues in the primary sequence, especially at ±3 and ±4 positions in α-helices and ±2 positions in β-strands. These interactions likely reduce the entropy of the sidechains in the isolated protein molecules, which will reduce or eliminate entropy loss due to immobilization in a crystal-packing interface. Therefore, bulk DE and NQ substitution seems likely to be most effective when the residues prioritized for mutation have potential salt-bridging or H-bonding partners at ±3 and ±4 positions in α-helices or ±2 positions in β-strands.
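The positional rule in the preceding paragraph can be illustrated with a minimal sketch. It is not part of the published software: the function name, the single-letter secondary-structure string ("H" for α-helix, "E" for β-strand), and the choice of partner residues are all assumptions made for illustration.

```python
# Sketch only: flag residues of a given type that have a potential partner at
# +/-3 or +/-4 (alpha-helix) or +/-2 (beta-strand). Names are hypothetical.
OFFSETS = {"H": (3, 4), "E": (2,)}  # assumed secondary-structure codes


def candidate_sites(seq, ss, target, partners):
    """Return 1-based positions of `target` residues with a partner nearby.

    seq      -- amino acid sequence (one-letter codes)
    ss       -- secondary-structure string of the same length ("H", "E", other)
    target   -- residue to be mutated, e.g. "D" for a D-to-E substitution
    partners -- residues treated as potential partners (an assumption)
    """
    if len(seq) != len(ss):
        raise ValueError("sequence and secondary structure must be equal length")
    hits = []
    for i, (aa, state) in enumerate(zip(seq, ss)):
        if aa != target or state not in OFFSETS:
            continue
        neighbors = [i + sign * d for d in OFFSETS[state] for sign in (-1, 1)]
        if any(0 <= j < len(seq) and seq[j] in partners for j in neighbors):
            hits.append(i + 1)
    return hits


helix_hits = candidate_sites("ADAAKAAD", "HHHHHHHH", "D", set("KR"))
strand_hits = candidate_sites("DAKAD", "EEEEE", "D", set("KR"))
coil_hits = candidate_sites("ADAAKAAD", "--------", "D", set("KR"))
```

Here lysine and arginine stand in as salt-bridging partners for aspartate; an analogous partner set for asparagine would have to be chosen by the user.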

Future research by the structural biology community will be required to rigorously assess the efficacy of the Bulk KR method and the related DE and NQ bulk substitution methods proposed above. Given the relatively small sizes of the proteins evaluated in our experiments, one important question to be addressed in future studies is how the number of KR mutations tolerated by a protein and the number needed to achieve substantial improvement in crystallization properties scale with protein size. Future studies will also be needed to establish the most efficient paradigm for combining KR, DE, and NQ mutations to maximize crystallization hit rate and crystal quality while minimizing the number of constructs needed to obtain a diversity of different crystal forms and a high-resolution crystal structure. Nonetheless, our bulk substitution method focused on large-scale probabilistic remodeling of protein surface properties to enhance crystallization propensity already shows significant efficacy for rational engineering of proteins to improve their crystallization properties.

4 MATERIALS AND METHODS

4.1 Site-selection software and input sequence alignment format

Our Python program that automatically generates a suggested order for mutation of lysine residues to arginine incorporates routines from the BioPython package (Cock et al., 2009). The program code is available for download in our Github archive (https://github.com/huntmolecularbiophysicslab/pxengineering), and it can also be run interactively using our protein crystallization engineering webserver (http://www.pxengineering.org). The program requires the input of an alignment of homologous sequences in Clustal (Thompson et al., 1994) format. In brief, the first line of files in this ASCII format must start with the words “CLUSTAL W” or “CLUSTALW.” Blank lines then separate sequential blocks of text containing aligned sequence segments with one sequence per line. Each of those lines starts with a string containing the sequence name, which is followed by blank spaces and then up to 60 single-letter amino acid codes. Each line can contain additional blank spaces followed by a residue number. Each block optionally ends with a line containing characters encoding the degree of sequence conservation at each position in the alignment.
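As an illustration of the file layout described above, the following is a minimal pure-Python reader for this format. It is a sketch, not the parser used by our program (which relies on BioPython routines); the function name and the example alignment are hypothetical, and it assumes that conservation lines begin with whitespace.

```python
# Sketch only: a small reader for the Clustal layout described in the text.
def parse_clustal(text):
    """Return {sequence name: concatenated aligned sequence}."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(("CLUSTAL W", "CLUSTALW")):
        raise ValueError("first line must start with 'CLUSTAL W' or 'CLUSTALW'")
    sequences = {}
    for line in lines[1:]:
        # Skip blank separator lines and conservation lines, which are
        # assumed here to start with whitespace.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, segment = fields[0], fields[1]  # a trailing residue number is ignored
        sequences[name] = sequences.get(name, "") + segment
    return sequences


# Hypothetical two-block alignment used only to exercise the reader.
EXAMPLE = "\n".join([
    "CLUSTAL W (1.83) multiple sequence alignment",
    "",
    "target    MKTAYK-AK 9",
    "homolog1  MRTAYRLAK 9",
    "          *:***: **",
    "",
    "target    KLV 12",
    "homolog1  RLV 12",
])
alignment = parse_clustal(EXAMPLE)
```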

4.2 GPU acceleration of sequence identity calculation

Our Python program accelerates the calculation of the absolute percent identity between two sequences by parallelizing site-by-site comparison on a graphics processing unit (GPU) using a custom reduction kernel (Kirk & Hwu, 2023) written using the CuPy library (Okuta et al., 2017). In brief, each pair of aligned amino acids in two sequences is sent to a separate GPU core. If the one-character amino acid codes in both sequences match each other and neither is empty due to a gap in the sequence alignment, that core stores a value of 1 in its register, which is otherwise set to 0. The values in the registers for all sites are then summed using parallel reduction to count the number of identical amino acids in the two sequences. Details can be found in the code at https://github.com/huntmolecularbiophysicslab/pxengineering.
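The logic of the comparison and summation steps can be sketched on the CPU in plain Python. This is not the CuPy kernel in our repository: the function name and the gap character "-" are assumptions, and the pairwise summation loop only mimics the order of operations in a tree-style parallel reduction.

```python
# Sketch only: CPU illustration of the per-site flag and tree-style summation.
def count_identical(seq_a, seq_b, gap="-"):
    """Count aligned positions where both sequences hold the same residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # Per-site step: 1 if the codes match and the site is not a gap, else 0.
    flags = [1 if a == b and a != gap else 0 for a, b in zip(seq_a, seq_b)]
    # Reduction step: repeatedly add neighboring pairs until one value remains.
    while len(flags) > 1:
        if len(flags) % 2:
            flags.append(0)  # pad so every element has a neighbor
        flags = [flags[k] + flags[k + 1] for k in range(0, len(flags), 2)]
    return flags[0] if flags else 0


n_identical = count_identical("MKTAYK-AK", "MRTAYRLAK")
n_gap_only = count_identical("A-", "A-")
```

A position where both sequences contain a gap is not counted as identical, consistent with the description above.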

4.3 Prioritization of mutation sites based on redundancy-corrected counts of KR mutations observed in homologous proteins

To compensate for outright redundancy as well as inhomogeneous phylogenetic sampling in sequence databases, our software performs two redundancy-compensation calculations on the set of sequences in each percent-identity bin having an arginine substitution at a specific lysine site in the target protein. Both calculations use the same heuristic estimate for the probability of evolutionary resampling at an aligned site between two sequences $i$ and $j$ with an overall fraction $f_{id}$ of identical residues at all aligned sites:
$$ P_{\text{site-resampling}}\left(f_{id}\right)_{ij} = \begin{cases} 1.0 & \text{for } f_{id} \le f_{min} \\ \dfrac{1 - f_{id}}{1 - f_{min}} & \text{for } f_{id} > f_{min} \end{cases}. $$
We assume $f_{min} = 0.3$. One calculation gives a redundancy-reduced estimate $C_R$ of arginine counts using the following formula in which the summation is performed over all unique pairs of the $N$ sequences in the bin having an arginine substitution at one lysine site in the target protein:
$$ C_R = \left( \frac{2}{N} \sum_{i<j} P_{\text{site-resampling}}\left(f_{id}\right)_{ij} \right) + 1. $$
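The two formulas above can be written as a short Python sketch. The function names and the matrix-of-fractions input are hypothetical and are not taken from our repository. Returning 0 for an empty bin is also an assumption, since the formula is undefined for N = 0.

```python
# Sketch only: direct transcription of the resampling probability and C_R.
from itertools import combinations


def p_site_resampling(f_id, f_min=0.3):
    """Heuristic probability of evolutionary resampling at an aligned site."""
    if f_id <= f_min:
        return 1.0
    return (1.0 - f_id) / (1.0 - f_min)


def redundancy_reduced_count(f_id_matrix, f_min=0.3):
    """C_R for N sequences, given a symmetric N x N matrix of identity fractions."""
    n = len(f_id_matrix)
    if n == 0:
        return 0.0  # assumption: an empty bin contributes no counts
    total = sum(
        p_site_resampling(f_id_matrix[i][j], f_min)
        for i, j in combinations(range(n), 2)  # all unique pairs i < j
    )
    return 2.0 * total / n + 1.0


def uniform_matrix(n, value):
    """Helper for the examples: every off-diagonal identity equals `value`."""
    return [[1.0 if i == j else value for j in range(n)] for i in range(n)]


c_redundant = redundancy_reduced_count(uniform_matrix(3, 1.0))
c_independent = redundancy_reduced_count(uniform_matrix(3, 0.2))
c_intermediate = redundancy_reduced_count(uniform_matrix(3, 0.65))
```

Three fully redundant sequences give a count of 1 and three sequences at or below the identity threshold give a count of 3, which are the two limiting cases of the formula.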

This calculation is extremely rapid and includes all homologs but is only rigorously accurate in the limiting cases of full redundancy or full independence. The second calculation gives a rigorous estimate of the expectation value for the number of independent observations of arginine based on combinations of the heuristic probabilities of being resampled or not being resampled for all unique pairs among the seven most diverged sequences in each bin that have arginine at a specific lysine site. These sequences are identified by taking the single sequence with the lowest percent identity to the target sequence and then progressively adding the sequence with the lowest average percent identity to those already selected. The details of the implementation can be found in the code at https://github.com/huntmolecularbiophysicslab/pxengineering.
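The selection of the most diverged sequences described above can be sketched as a greedy loop. The function name and input layout are hypothetical, ties are broken here by input order (an assumption), and the subsequent expectation-value calculation is not reproduced, since its details are given only in the code repository.

```python
# Sketch only: greedy choice of up to k mutually diverged sequences.
def select_most_diverged(id_to_target, pairwise_id, k=7):
    """Return indices of up to k sequences chosen greedily for divergence.

    id_to_target -- identity of each sequence to the target sequence
    pairwise_id  -- symmetric matrix of identities between the sequences
    """
    n = len(id_to_target)
    if n == 0:
        return []
    # Start from the sequence least identical to the target.
    selected = [min(range(n), key=lambda i: id_to_target[i])]
    while len(selected) < min(k, n):
        remaining = [i for i in range(n) if i not in selected]
        # Add the sequence with the lowest average identity to those selected.
        best = min(
            remaining,
            key=lambda i: sum(pairwise_id[i][j] for j in selected) / len(selected),
        )
        selected.append(best)
    return selected


# Hypothetical identities for four homologs.
to_target = [0.9, 0.4, 0.5, 0.8]
pairwise = [
    [1.00, 0.45, 0.50, 0.95],
    [0.45, 1.00, 0.90, 0.40],
    [0.50, 0.90, 1.00, 0.55],
    [0.95, 0.40, 0.55, 1.00],
]
top_three = select_most_diverged(to_target, pairwise, k=3)
all_ranked = select_most_diverged(to_target, pairwise)
```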

4.4 Protein expression and purification

Protein coding sequences harboring Bulk KR substitutions were synthesized (Twist Bioscience, South San Francisco, CA) with a short C-terminal hexahistidine tag (LEHHHHHH), cloned under the control of the T7 RNA polymerase promoter in the pET21_NESG expression plasmid (https://dnasu.org/DNASU/GetCloneDetail.do?cloneid=336944), and then transformed into BL21(DE3) Rosetta E. coli cells (MilliporeSigma, Burlington, MA, USA). Protein expression was induced with 1 mM IPTG for 4 h at 30°C in Terrific Broth (Terrific Broth, 2015). Cells were pelleted by centrifugation at 4000 rpm for 25 min at 4°C and resuspended on ice in 10 mM imidazole, 300 mM NaCl, 1 mM TCEP, 5% (w/v) glycerol, and 50 mM NaH2PO4, pH 7.5, before cell lysis by probe sonication. The supernatant following 15,000 rpm centrifugation at 4°C was mixed with Ni-NTA resin and incubated at 4°C for 1 h. The mixture was then transferred into a column and washed with the same buffer containing a higher imidazole concentration (100 mM) before elution of the protein in 6 mL of the same buffer containing 250 mM imidazole. A 1-mL aliquot of eluted protein was concentrated to 500 μL using an Amicon 10 kDa centrifugal filter (MilliporeSigma, Burlington, MA, USA) and loaded via a 1-mL loop onto a Superdex 200 Increase 10/300 GL gel filtration column equilibrated in 100 mM NaCl, 10 mM DTT, 10 mM Tris-Cl, pH 7.5. Protein-containing fractions were concentrated to ~15 mg/mL based on OD280nm values measured using a Nanodrop spectrophotometer (ThermoFisher, Waltham, MA, USA) and a priori sequence-based extinction coefficients (Gill & von Hippel, 1989), and the concentrated protein was immediately flash-frozen in aliquots in liquid nitrogen before storage at −80°C pending use.

4.5 Thermal stability assays using CD spectroscopy

An Applied Photophysics (Leatherhead, UK) Chirascan V100 spectropolarimeter with a Peltier-jacketed cell holder was used to collect serial CD spectra spanning 200–250 nm continuously during a 3°C/min thermal ramp nominally running from 10°C to 84°C. Data were measured from protein samples in a 0.5-mm quartz cuvette in 1-nm increments using a 0.25-s integration time per point and a 1-nm bandwidth, corresponding to 35 s per spectrum. Direct measurements of the cell temperature during the experiment, which were used for data display and analysis, indicated the actual range of the temperature ramp was from ~11°C to 78°C for all samples. Protein samples were diluted to 2 mg/mL using gel filtration buffer lacking DTT (i.e., 100 mM NaCl, 10 mM Tris-Cl, pH 7.5) with the addition of 1 mM SAH for the MA_2137 constructs. Global curve fitting of spectral data during the thermal ramp was performed from 215 to 230 nm using the program GLOBAL3 (Applied Photophysics) with double linear baseline correction (i.e., before and after the observed transitions) to extract the thermodynamic parameters and melting temperatures (Table 1 and Figure 3). Each dataset was analyzed using the smallest number of transitions showing approximately random directions for the residuals for adjacent points in the CD versus measured-temperature surface. Because the reversibility of the unfolding transitions was not evaluated, the inferred Tm and ΔHvH values listed in Table 1 and cited in the text are labeled as apparent.
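For readers unfamiliar with this type of analysis, the sketch below shows the standard textbook model for a single two-state transition with linear pre- and post-transition baselines, parameterized by an apparent Tm and van't Hoff enthalpy. It is not the GLOBAL3 algorithm, it handles one wavelength and one transition only, it neglects any heat-capacity change, and all function names and example values are assumptions.

```python
# Sketch only: two-state van't Hoff model with linear baselines.
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)


def fraction_unfolded(temp_c, tm_c, dh_vh_kj):
    """Unfolded fraction at temp_c for an apparent Tm (deg C) and dH_vH (kJ/mol)."""
    t = temp_c + 273.15
    tm = tm_c + 273.15
    ln_k = (dh_vh_kj / R_KJ) * (1.0 / tm - 1.0 / t)  # ln of unfolding constant
    return 1.0 / (1.0 + math.exp(-ln_k))


def cd_signal(temp_c, tm_c, dh_vh_kj, native, unfolded):
    """CD signal as a population-weighted mix of two linear baselines.

    native, unfolded -- (intercept, slope) pairs for the baselines versus deg C
    """
    f_u = fraction_unfolded(temp_c, tm_c, dh_vh_kj)
    native_signal = native[0] + native[1] * temp_c
    unfolded_signal = unfolded[0] + unfolded[1] * temp_c
    return (1.0 - f_u) * native_signal + f_u * unfolded_signal


# Hypothetical parameters: Tm = 60 deg C, dH_vH = 400 kJ/mol, flat baselines.
midpoint = fraction_unfolded(60.0, 60.0, 400.0)
signal_at_tm = cd_signal(60.0, 60.0, 400.0, native=(-20.0, 0.0), unfolded=(-5.0, 0.0))
```

In a fit, the Tm, ΔHvH, and baseline parameters would be adjusted to minimize the residuals between this model and the measured signal at each recorded cell temperature.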

4.6 Solubility assays

Protein stock solutions were diluted to working concentration in the same buffer used for gel-filtration chromatography containing different weight/volume concentrations of PEG3350. Following a 60-min incubation at room temperature, the samples were spun for 10 min at 14,000 rpm in a microfuge to pellet particulates, and the concentration of protein in the supernatant was determined from the optical density at 280 nm measured in a Nanodrop spectrophotometer (ThermoFisher) based on the a priori extinction coefficient (Gill & von Hippel, 1989). Centrifugation and measurement of concentration in the supernatant were repeated 24 h later to ensure equilibrium had been reached.
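The conversion from optical density to protein concentration can be sketched as follows. The per-residue coefficients (5690, 1280, and 120 M^-1 cm^-1 for tryptophan, tyrosine, and cystine) are the values commonly quoted from Gill & von Hippel (1989) and should be checked against that reference. The function names, the example sequence, and the assumption of a 1-cm-equivalent absorbance reading are hypothetical.

```python
# Sketch only: sequence-based extinction coefficient and Beer-Lambert conversion.
def extinction_coefficient(seq, n_cystines=0):
    """Approximate molar extinction coefficient at 280 nm in M^-1 cm^-1."""
    # Coefficients are assumed values commonly attributed to Gill & von Hippel.
    return 5690 * seq.count("W") + 1280 * seq.count("Y") + 120 * n_cystines


def mass_concentration(a280, seq, mw_da, path_cm=1.0, n_cystines=0):
    """Protein concentration in mg/mL from absorbance at 280 nm."""
    eps = extinction_coefficient(seq, n_cystines)
    if eps == 0:
        raise ValueError("sequence has no W, Y, or cystine; A280 is uninformative")
    molar = a280 / (eps * path_cm)  # Beer-Lambert law, mol/L
    return molar * mw_da  # g/L, numerically equal to mg/mL


# Hypothetical peptide and molecular weight used only to exercise the functions.
eps_example = extinction_coefficient("MWYYC")
conc_example = mass_concentration(0.825, "MWYYC", mw_da=10000.0)
```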

4.7 Protein crystallization screening

For each target protein, the parental construct and the construct containing the largest number of KR mutations that could be purified, but not the constructs containing fewer KR mutations, were submitted for crystallization screening using the standard protocol (Budziszewski et al., 2023; Luft et al., 2001; Luft et al., 2003; Luft, Snell, et al., 2011; Luft, Wolfley, et al., 2011; Lynch et al., 2023) at the High-Throughput Crystallization Screening Center at the Hauptman-Woodward Medical Research Institute (https://hwi.buffalo.edu/high-throughput-crystallization-center/). Protein stocks were adjusted to ~15 mg/mL concentration before mixing 1:1 with the well solution for microbatch under-oil crystallization screening. Crystallization reactions were imaged in a Rock Imager before protein addition and at 1 day and 1, 2, 3, 4, and 6 weeks after reaction setup using parallel brightfield microscopy, ultraviolet two-photon-excited fluorescence microscopy, and SONICC (second-order harmonic imaging of chiral crystals) microscopy. Hits were identified based on visual observation of refractive supramolecular aggregates in brightfield microscopy images that had corresponding features in either fluorescence or SONICC microscopy images or both. Hits were characterized as “High Quality” based on the judgement of an expert protein crystallographer that it would likely be straightforward to reproduce the crystals and optimize them to sufficient size to mount for measurement of x-ray diffraction data.

4.8 Protein crystal optimization

A small subset of the crystal hits for hPDIa-9KR and MA_2137-D65R-11KR was reproduced using the microbatch under-oil method at 4°C and 18°C, and they were subsequently optimized by seeding. The crystal used for determination of the hPDIa-9KR structure was grown by mixing the 15 mg/mL stock solution at a 2:1 volume ratio with a crystallization reagent comprising 24% (w/v) PEG 20k, 0.1 M potassium thiocyanate, and 0.1 M MES, pH 6. The crystal used for determination of the MA_2137-D65R-11KR structure was grown by mixing the 15 mg/mL stock solution at a 1:1 volume ratio with a crystallization reagent comprising 30% (w/v) PEG 1k and 0.1 M HEPES, pH 7.5. All crystals were transferred into a similar crystallization solution supplemented with 20% (v/v) ethylene glycol before mounting and flash-freezing in liquid nitrogen.

4.9 Crystal structure determination and refinement

X-ray diffraction data were collected from single crystals of hPDIa-9KR and MA_2137-D65R-11KR using, respectively, the NE-CAT 24-ID-E and 24-ID-C beam lines at the Advanced Photon Source (Table S1). The images were processed and scaled using XDS (Kabsch, 1988a; Kabsch, 1988b; Kabsch, 2010a; Kabsch, 2010b). The structure of hPDIa-9KR was solved by molecular replacement using the program MOLREP (Vagin & Teplyakov, 2010) employing a search model comprising the first domain in the crystal structure of full-length hPDI (PDB id 4EKZ). The structure of MA_2137-D65R-11KR was solved using the same methods employing the structure of MA_2137-D65R (PDB id 6MRO) as the search model. Both structures were refined (Table S1) using PHENIX (Liebschner et al., 2019) in conjunction with manual rebuilding in XtalView (McRee, 1999) and COOT (Casanal et al., 2020).

AUTHOR CONTRIBUTIONS

Nooriel E. Banayan: Software; investigation; formal analysis; writing – original draft; visualization; writing – review and editing; validation; data curation; methodology. Blaine J. Loughlin: Investigation. Shikha Singh: Methodology; investigation. Farhad Forouhar: Investigation. Guanqi Lu: Software. Kam-Ho Wong: Investigation. Matthew Neky: Investigation. Henry S. Hunt: Software. Larry B. Bateman: Software. Angel Tamez: Software. Samuel K. Handelman: Methodology. W. Nicholson Price: Methodology. John F. Hunt: Conceptualization; software; formal analysis; funding acquisition; project administration; writing – original draft; methodology; data curation; validation; supervision; visualization; resources; writing – review and editing.

ACKNOWLEDGMENTS

This work was supported by a grant from the US NIH-NIGMS to JFH (GM127883). We thank Accendro Inc. for assistance with web programming and G.T. Montelione, R. Xiao, and the other members of the Northeast Structural Genomics Consortium for long-term collaboration and advice.

    CONFLICT OF INTEREST STATEMENT

    JFH is a member of the scientific advisory board of Nexomics Biosciences.
