Tools for Protein Science

CHARMING: Harmonizing synonymous codon usage to replicate a desired codon usage pattern

Gabriel Wright

Department of Computer Science & Engineering, University of Notre Dame, Notre Dame, Indiana, USA

Contribution: Conceptualization (lead), Formal analysis (equal), Investigation (equal), Methodology (lead), Visualization (supporting), Writing - original draft (supporting), Writing - review & editing (supporting)

Anabel Rodriguez

Department of Chemistry & Biochemistry, University of Notre Dame, Notre Dame, Indiana, USA

Contribution: Conceptualization (supporting), Formal analysis (equal), Investigation (equal), Methodology (supporting), Visualization (lead), Writing - original draft (supporting), Writing - review & editing (supporting)

Jun Li

Department of Applied and Computational Mathematics & Statistics, University of Notre Dame, Notre Dame, Indiana, USA

Contribution: Funding acquisition (equal), Methodology (supporting), Supervision (supporting)

Tijana Milenkovic

Department of Computer Science & Engineering, University of Notre Dame, Notre Dame, Indiana, USA

Contribution: Funding acquisition (equal), Methodology (supporting), Supervision (lead)

Corresponding Author

Scott J. Emrich

[email protected]

Department of Electrical Engineering & Computer Science, University of Tennessee, Knoxville, Tennessee, USA

Correspondence

Patricia L. Clark, Department of Chemistry & Biochemistry, University of Notre Dame, Notre Dame, IN 46556, USA.

Email: [email protected]

Scott J. Emrich, Department of Electrical Engineering & Computer Science, University of Tennessee, Knoxville, TN 37996, USA.

Email: [email protected]

Corresponding Author

Patricia L. Clark

[email protected]

orcid.org/0000-0001-5462-8248

Department of Chemistry & Biochemistry, University of Notre Dame, Notre Dame, Indiana, USA

Contribution: Funding acquisition (equal), Supervision (lead), Writing - original draft (lead), Writing - review & editing (lead)

First published: 04 November 2021

https://doi.org/10.1002/pro.4223

Citations: 3

Funding information: National Institute of General Medical Sciences, Grant/Award Number: DP1 GM146256; National Institutes of Health, Grant/Award Number: R01 GM120733; W. M. Keck Foundation

Abstract

There is a growing appreciation that synonymous codon usage, although historically regarded as phenotypically silent, can instead alter a wide range of mechanisms related to functional protein production, a term we use here to describe the net effect of transcription (mRNA synthesis), mRNA half-life, translation (protein synthesis) and the probability of a protein folding correctly to its active, functional structure. In particular, recent discoveries have highlighted the important role that sub-optimal codons can play in modifying co-translational protein folding. These results have drawn increased attention to the patterns of synonymous codon usage within coding sequences, particularly in light of the discovery that these patterns can be conserved across evolution for homologous proteins. Because synonymous codon usage differs between organisms, for heterologous gene expression it can be desirable to make synonymous codon substitutions to match the codon usage pattern from the original organism in the heterologous expression host. Here we present CHARMING (for Codon HARMonizING), a robust and versatile algorithm to design mRNA sequences for heterologous gene expression and other related codon harmonization tasks. CHARMING can be run as a downloadable Python script or via a web portal at http://www.codons.org.

1 INTRODUCTION

Synonymous codon substitutions alter an mRNA sequence but not its encoded amino acid sequence. For this reason, synonymous substitutions were historically considered to be phenotypically silent. In recent years, however, it has become clear that substitutions between synonymous codons can lead to a variety of effects on functional protein production, a term we use here to describe the net effect of transcription (mRNA synthesis), mRNA half-life, translation (protein synthesis) and the probability of a protein folding correctly to its active, functional structure, versus alternative outcomes like aggregation or degradation. Although a comprehensive understanding of the precise mechanisms and their interplay is still emerging, synonymous codon substitutions have been shown to affect each step of functional protein production.

A well-studied example of the effects of codon usage on functional protein production is the impact of synonymous substitutions amongst codons with different codon usage frequencies. Although the genetic code is (essentially) universal, synonymous codons tend to be used with different frequencies in different organisms. A particularly striking difference in codon usage is the arginine codon AGA, which is rarely used in Escherichia coli but is the most common of all six arginine codons in human coding sequences. The tRNA that decodes AGA is present at very low abundance in E. coli, which slows translation of AGA by the ribosome and can in turn lead to mistranslation.¹ Recombinant expression of human coding sequences in E. coli can therefore be increased by substituting AGA with a synonymous arginine codon used more frequently in E. coli. These and other related observations led to the widespread practice of “codon optimization,” which involves systematically substituting all or most codons in a coding sequence with the synonymous codon most commonly used in the heterologous expression host.^{2-6} However, codon optimization is not always successful: although these negative results are rarely reported, it is widely acknowledged that codon “optimization” can instead be detrimental to functional protein production.⁷ Codon optimization can fail for a variety of reasons, including changing the rate of synthesis of the nascent peptide chain. Changes to protein synthesis rate can alter co-translational folding mechanisms, which can lead to misfolding, aggregation and/or degradation of the encoded protein (Figure 1).^{8-12}

FIGURE 1

Synonymous substitutions to more common codons (green) or codons otherwise perceived as “optimal” can have a negative impact on functional protein production. In this example, synonymous mutations increase the amount of protein synthesized but the disruption of the codon usage pattern in the wild type gene (red) impairs the formation of the functional protein structure, due to changes to co-translational folding.
*Source*: This figure was modified from Reference ⁸

The limitations of codon optimization as a universal approach for producing high levels of functional protein have sparked an appreciation of the importance of patterns of synonymous codon usage in coding sequences. For example, even many highly expressed genes include significant clusters of rare or otherwise sub-optimal codons,¹³ the locations of which are often conserved between homologous coding sequences.^{14, 15} Disrupting local patterns of synonymous codon usage can lead to altered folding of the encoded protein,^{9, 10, 16, 17} even when gene-wide average codon usage metrics are preserved.^{11, 12, 18} These results have led to an alternative approach for maximizing recombinant gene expression called “codon harmonization.” Harmonization involves substituting synonymous codons with the goal of closely replicating the codon usage pattern (e.g., rare vs. common) in the original gene/organism (Figure 2a, b).^{8, 11, 19-21}

Harmonizing the codon usage values of a sequence of interest from one organism to another requires an algorithm that can take an mRNA sequence and the codon usage values from both the origin organism and a proposed destination (heterologous host) organism, analyze the values in the sequence using a measure of interest, and then harmonize the synonymous codon usage pattern in the heterologous host to match the pattern in the origin organism by systematically altering the mRNA coding sequence (Figure 2). Here we present such an algorithm, called CHARMING, for “Codon HARMonizING”, a major expansion and refinement of a simple codon harmonizing procedure we previously developed.⁸ CHARMING can be run as a Python script or via a web-based portal at http://www.codons.org. Included with the script and the web-based portal are options to analyze coding sequences using two commonly used codon measures: (a) ORFeome-wide codon usage frequency (calculated using %MinMax)^{8, 13} for E. coli, yeast, human and several other common organisms, or (b) CAI values²² for E. coli and yeast codons, which are calculated as a geometric mean. Both can be analyzed using a sliding window of adjustable size. Users can also use either of these two calculations to harmonize to values other than those provided, by uploading their own tables of codon usage values. Below we describe the structure and operations of CHARMING and its applications.

2 GENERAL CONSIDERATIONS

The codon usage patterns described in the introduction to this article arise due to differences in global codon usage frequencies across all coding sequences. However, it is important to note that codon usage can instead be evaluated using a variety of other measures, including those based on tRNA abundance (tAI),²³ tRNA supply versus codon demand (nTE)²⁴ and the codon usage in only highly expressed genes (Codon Adaptation Index, CAI).^{22, 25} Although all of these codon usage measures are somewhat correlated with one another, the underlying differences, as well as mathematical differences between how the most common codon calculators compare values to one another (see Box 1), can lead to distinct codon usage patterns within a coding sequence (Figure 2). Currently, there is no broad consensus regarding which codon usage measure is most predictive of translation rate or functional protein production.^{26, 27} Hence it can be useful to compare codon patterns generated by different codon usage measures. Here, we focus primarily on CAI, as it is one of the earliest and most widely used codon usage measures, and %MinMax, which has been shown to be predictive of the effect of local ribosome elongation rate on co-translational protein folding in E. coli.¹⁶ Of note, calculating %MinMax for a gene of interest requires knowing only genome-wide codon usage frequencies,¹³ which are readily available for many organisms.^{28, 29}

BOX 1. Common codon calculators measure different things and scale values differently

When evaluating codon usage, it is important to appreciate that distinctions between codon usage measures and their underlying scaling calculations mean that, depending on the specific harmonization application, one calculation may be more well-suited than another. Key differences include: (a) the biological consideration that defines codon optimality and subsequent codon value assignment, (b) the strategy used to normalize codon values relative to each other, (c) the strategy used to scale codon usage values relative to each other, and (d) whether usage bias is summarized using a geometric mean or average, across either the entire gene or as a sliding window (gene profile).

Assigning codon values

As outlined in the main text, different codon usage measures evaluate distinct codon features when defining a synonymous codon to be optimal or non-optimal for translational efficiency, which is defined here as the amount of protein synthesized from an mRNA transcript per unit time. For example, CAI defines as optimal the synonymous codon with the highest usage frequency in only highly expressed genes,^{22, 25} whereas tAI defines as optimal the synonymous codon with the highest cognate tRNA concentration (typically estimated from tRNA gene copy number).²³ Similarly, nTE defines as optimal the codon with the highest ratio of tRNA supply (estimated concentration) to the cognate codon abundance in the transcriptome,²⁴ whereas %MinMax considers the most optimal synonymous codon to be the one used most frequently in the ORFeome.^{8, 13} Each codon measure assigns a value to each codon, based on its optimality considerations.

Normalizing codon values: Absolute versus relative comparisons

After individual codon values are assigned, there is an additional distinction between how values are normalized. This normalization step has important implications for interpreting synonymous codon patterns in gene sequences. Broadly speaking, codon usage values can be normalized either (a) in an absolute sense, as compared to all 61 sense (non-stop) codons (this approach is used by tAI and nTE), or (b) in a relative sense, versus only the synonymous codons that encode the same amino acid (used by CAI) or amino acid sequence window (%MinMax). This distinction means that codon usage measures that use absolute normalization include in their codon values contributions from amino acid usage bias. For example, the amino acid cysteine is rarely used in proteins, whereas the amino acid leucine is abundant. Hence in an absolute comparison, both cysteine codons will have a lower normalized value than most (if not all) of the six codons that encode leucine. In contrast, when normalized relative to only the codons within the same synonymous codon group, the more commonly used cysteine codon will be ranked as more optimal than the rarest leucine codon.

Scaling relative-normalized codon values

There is an additional distinction between how the two most commonly used relative codon calculators, CAI and %MinMax, scale normalized codon values. CAI scales the highest codon value within each synonymous codon group to 1, whereas %MinMax scales the highest value as +100 and the lowest as −100, relative to an average value of zero in a codon window. The consequence of this difference is that, by virtue of the differences in scaling alone, %MinMax gives equal weight to differences between usage frequencies that are below average as those that are above average. In contrast, scaling from 1 to 0 (as CAI does) compresses the percentage difference between values that are below average.

As a simple example, imagine there are four codons in a codon usage group (A, B, C, and D) with usage frequencies of 50, 25, 10, and 5 codons per 1,000 total codons, respectively. If we assume the usage frequencies of A, B, C, and D are identical in the ORFeome and highly expressed genes and a window size of one, usage frequency values for CAI and %MinMax are equal:

Codon name	Usage frequency value (per 1,000 codons)	CAI scaled value	%MinMax scaled value
A	50	1	+100
B	25	0.5	+9
C	10	0.2	−71
D	5	0.1	−100

The sum of these usage frequencies is 90 and the average is 22.5 (90/4). Yet due to the different scaling calculations for CAI and %MinMax, codon C, which appears twice as frequently as D, is scaled by CAI only 0.1 higher than D (0.2 vs. 0.1, or 10% of the 0-to-1 scale). In contrast, using %MinMax there is a ~15% difference in the total scale (from +100 to −100) between the scaled values for C and D (−71 vs. −100). In other words, %MinMax scaling returns the percentage of the deviation from the average, regardless of whether that deviation is above or below average.
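The arithmetic in this example can be reproduced with a short Python sketch. It assumes, as the example does, a single synonymous codon group and a window size of one; the function names `cai_scale` and `minmax_scale` are illustrative and are not taken from the CHARMING code base.

```python
def cai_scale(freqs):
    """Scale each frequency relative to the most-used synonym (maximum -> 1)."""
    top = max(freqs.values())
    return {codon: f / top for codon, f in freqs.items()}


def minmax_scale(freqs):
    """Scale to +100 (most used) .. -100 (least used) around the group average.

    Assumes the frequencies are not all identical (otherwise the
    denominators below would be zero).
    """
    values = list(freqs.values())
    avg = sum(values) / len(values)
    hi, lo = max(values), min(values)
    scaled = {}
    for codon, f in freqs.items():
        if f >= avg:
            scaled[codon] = 100.0 * (f - avg) / (hi - avg)
        else:
            scaled[codon] = -100.0 * (avg - f) / (avg - lo)
    return scaled


# Usage frequencies (per 1,000 codons) from the example above.
freqs = {"A": 50, "B": 25, "C": 10, "D": 5}
cai = cai_scale(freqs)
mm = minmax_scale(freqs)
for codon in freqs:
    print(codon, round(cai[codon], 2), round(mm[codon]))
```

Codons C and D differ by 0.1 on the CAI scale but by roughly 29 points on the 200-point %MinMax scale, which is the compression effect described above.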

Summarizing usage bias: geometric mean versus average, and single value versus sliding window

Historically, some codon measures (most notably CAI and tAI) calculated a geometric mean of individual codon values to generate for each coding sequence a single value, with a higher value describing a sequence that is more optimal (higher translational efficiency, defined above):

\text{Geometric mean} = \sqrt[n]{C_{1} \times C_{2} \times \dots \times C_{n-1} \times C_{n}} \qquad (1)

where C_j is the corresponding codon usage value of the jth codon in the sequence, and n is the total number of codons in the sequence. More recently, as interest in intra-gene codon usage patterns has increased, these codon usage measures have been adapted to generate for each coding sequence a codon usage profile, by calculating the geometric mean of a sliding window of codons along the coding sequence. %MinMax does not use a geometric mean calculation; instead, %MinMax calculates the average value of a sliding window of adjustable length. Describing synonymous codon usage with a single value can be useful for making correlations in large datasets. Describing codon usage bias with a sliding window profile is useful for identifying regions within a sequence with the largest contributions to translational efficiency or for comparing codon usage across orthologs or a library of synonymous mutants.
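As a minimal sketch of the two summaries described here, the following Python computes a single gene-wide geometric mean and then a sliding-window profile using either a geometric mean (as in windowed CAI or tAI) or an arithmetic average. The per-codon values and the helper names are hypothetical, and the geometric mean is evaluated in log space, which requires all values to be greater than zero.

```python
import math


def geometric_mean(values):
    """n-th root of the product of the values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    """Plain average of the values."""
    return sum(values) / len(values)


def sliding_profile(values, window, summarize):
    """Apply `summarize` to every full window of `window` consecutive values."""
    return [
        summarize(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


# Hypothetical per-codon values for a six-codon sequence.
codon_values = [1.0, 0.5, 0.2, 0.1, 1.0, 0.5]

whole_gene = geometric_mean(codon_values)            # one value per gene
geo_profile = sliding_profile(codon_values, 3, geometric_mean)
avg_profile = sliding_profile(codon_values, 3, arithmetic_mean)
print(whole_gene, geo_profile, avg_profile)
```

Because a geometric mean never exceeds the arithmetic mean of the same positive values, a window containing one very low codon value is pulled down more strongly in the geometric profile than in the averaged one.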

We previously developed a simple three-step algorithm for harmonizing the codon usage of a coding sequence from one organism to another.⁸ This simple algorithm, which we refer to here as “Rodriguez initialization,” operates by first ranking each codon in order of usage frequency relative to its synonyms in each of the two organisms, then mapping the ranked codons from one organism to the other, and finally, where necessary, substituting the codons in a coding sequence with the usage frequency-ranked match in the destination organism. This extremely simple harmonization procedure alone can do a remarkably good job of matching the general trends of codon usage frequency patterns,⁸ as can other, related codon-ranking procedures.²⁰ However, for many applications a more robust and versatile algorithm is necessary, to outperform frequency-ranked matching alone and/or to explore the vastness of synonymous codon space. CHARMING addresses this need, providing the first (to our knowledge) open-source algorithm of its kind for versatile and robust codon harmonization. Together, Rodriguez initialization and CHARMING provide a powerful tool to design well-harmonized synonymous coding sequences.
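The three steps of this rank-matching procedure can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: the usage frequencies are invented, only two amino acids are included, the function names are hypothetical, and ties in frequency are broken by table order.

```python
def rank_synonyms(usage):
    """Sort one amino acid's synonymous codons from most to least used."""
    return sorted(usage, key=lambda codon: usage[codon], reverse=True)


def build_rank_map(origin_tables, dest_tables):
    """Map each origin codon to the destination codon of the same usage rank."""
    mapping = {}
    for amino_acid, origin_usage in origin_tables.items():
        origin_ranked = rank_synonyms(origin_usage)
        dest_ranked = rank_synonyms(dest_tables[amino_acid])
        mapping.update(zip(origin_ranked, dest_ranked))
    return mapping


def rodriguez_initialize(codons, origin_tables, dest_tables):
    """Substitute every codon with its rank-matched destination codon."""
    mapping = build_rank_map(origin_tables, dest_tables)
    return [mapping[codon] for codon in codons]


# Hypothetical usage frequencies (per 1,000 codons), keyed by amino acid.
origin = {"K": {"AAA": 30.0, "AAG": 12.0}, "F": {"TTT": 10.0, "TTC": 20.0}}
dest = {"K": {"AAA": 10.0, "AAG": 25.0}, "F": {"TTT": 22.0, "TTC": 16.0}}

initialized = rodriguez_initialize(["AAA", "TTC", "AAG"], origin, dest)
print(initialized)
```

In this toy case the most-used lysine codon in the origin table (AAA) is replaced by the most-used lysine codon in the destination table (AAG), so the rare-versus-common pattern is preserved even though the codons change.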

3 EXPLORING THE VASTNESS OF SYNONYMOUS CODON SPACE

Codon harmonization is a computationally difficult problem. Because the length of the average coding sequence in E. coli is ~300 codons and the average number of codons used to encode each amino acid is 3.05 (61/20), the resulting potential solution space (roughly 3.05³⁰⁰, or 2 × 10¹⁴⁵, synonymous sequences per gene) is simply too large for a brute-force search through all options to find the globally optimal (most closely harmonized) sequence. Moreover, results of a codon calculation are typically averaged over a sliding window of codons, which vastly increases the number of synonymous sequence solutions that are equally well-harmonized to a desired codon usage pattern. CHARMING provides a computationally inexpensive solution to the intractable brute-force problem of identifying synonymous sequences with closely harmonized codon patterns.
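The size estimate quoted above can be checked with a few lines of Python. The calculation works in log space because the number itself is far larger than any practical enumeration; the variable names are illustrative only.

```python
import math

mean_synonyms = 61 / 20      # average number of sense codons per amino acid
codons_per_gene = 300        # approximate average E. coli coding sequence

# log10 of mean_synonyms ** codons_per_gene, split into mantissa and exponent.
log10_size = codons_per_gene * math.log10(mean_synonyms)
exponent = math.floor(log10_size)
mantissa = 10 ** (log10_size - exponent)
print(f"about {mantissa:.2f} x 10^{exponent} synonymous sequences")
```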

4 THE CHARMING WORKFLOW

At its core, the CHARMING algorithm is a nested series of iterative steps that systematically substitute codons in a wild type (WT) coding sequence to recapitulate in a destination organism (typically, a heterologous expression host) the general pattern of high and low codon values experienced when the WT coding sequence is expressed in its origin organism (Figure 3). In doing so, CHARMING recapitulates a codon usage pattern for a desired codon usage measure from one organism (with one set of values) in a different organism with different values, as in the example shown in Figure 4. CHARMING therefore fulfills a key task of codon harmonization for heterologous gene expression.

CHARMING is a deterministic algorithm, meaning that for a given input coding sequence it will always return the same harmonized coding sequence. However, because synonymous sequence space for even a single sequence is astronomically vast (see previous section), and because most codon usage patterns are analyzed as a sliding window average across multiple codons, there are potentially many synonymous sequences that will harmonize equally well to the desired codon usage pattern. For users interested in sampling the vastness of possible synonymous sequence-space solutions for a specific harmonization task, CHARMING includes an option to generate a set of N (where N is defined by the user) harmonized synonymous sequences. This set is produced by first generating a set of random synonymous starting sequences to use as input into the algorithm, which in turn generates distinct sequences harmonized to the same sequence feature pattern. See Section 5.1 for more details.

CHARMING can be used either online (at codons.org) or as a Python script (downloadable from github.com/wrightgs/CHARMIING). In both versions, the user must specify parameters before CHARMING is run. These parameters include the sequence to be harmonized, which codon usage measure to use, the window size for the codon usage measure, how many harmonized mutants to generate as output, and the codon usage tables for the origin and destination species (if applicable—see Section 5.1).

After the user specifies the desired parameters, CHARMING is run. CHARMING comprises seven steps (Steps 0 to 6, shown schematically in Figure 3), including the optional Rodriguez initialization step:

4.1 Step 0: Initialization (optional)

We found that the performance of CHARMING improves when it begins with Rodriguez initialization, the simplified harmonization procedure we previously developed⁸ (see General considerations, above). Step 1 of CHARMING (below) then uses this initial harmonized sequence as the “current synonymous mutant.” We found that including Rodriguez initialization improves CHARMING harmonization by an average of 10.5 percentage points (from a 76.6% to an 87.1% reduction in net deviation), as detailed in the harmonization case study (below) (see also Figures 4 and 5). Despite this improvement, some codon harmonization applications are incompatible with Rodriguez initialization, hence it is only used by the algorithm when the user specifies parameters that benefit from initialization. Two examples of applications not suitable for initialization are described in a later section (Section 5.1).

4.2 Step 1: Calculate the codon usage pattern for the synonymous mutant sequence

For the desired codon usage measure, CHARMING analyzes the current synonymous mutant sequence using codon usage values from the destination species (e.g., a heterologous expression host like E. coli). Each codon usage measure analyzes per-codon usage values over a sliding window of M codons, where M is set by the user. In the examples shown here (Figures 1 and 4), window sizes of 17 and 9 codons were used; note that a window size of 10 codons was found to correlate most closely with ribosome footprint counts.^{27, 30} CHARMING then assigns the resulting value for a particular sliding window to the middle codon in the sliding window (or, in the case of even-sized windows, the right-most center codon), creating per-codon output values. These per-codon values are then compared to the target values (i.e., the per-codon values of the WT sequence when analyzed with the same measure, using codon usage values from the origin species). If this is the first iteration of the algorithm and the optional initialization step was not used, the “current synonymous mutant” is the WT coding sequence or a randomly generated synonymous mutant (as described before).
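The window-to-codon assignment described in this step can be sketched as below. The sketch uses an arithmetic average as the window summary (as %MinMax does) and assumes that positions too close to either end to sit at the center of a full window simply receive no value; that end-handling choice and the function name are assumptions, not details given in the text.

```python
def windowed_profile(values, window):
    """Return {codon position: window average}, keyed by each window's center.

    For an odd window the key is the middle codon; for an even window it is
    the right-most of the two center codons, as described in Step 1.
    """
    profile = {}
    for start in range(len(values) - window + 1):
        center = start + window // 2
        profile[center] = sum(values[start:start + window]) / window
    return profile


# Hypothetical per-codon values, analyzed with odd and even window sizes.
per_codon = [1, 2, 3, 4, 5]
odd = windowed_profile(per_codon, 3)
even = windowed_profile(per_codon[:4], 4)
print(odd, even)
```

The same function applied to the WT sequence with origin-species values would give the target profile, so the two profiles can be compared position by position.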

4.3 Step 2: Choose candidate codon positions for alteration

CHARMING next marks each sequence region of ≥10 consecutive codon positions where the output values for the current synonymous mutant are consistently above or below the target values as a consecutive deviation region (CDR). A minimum CDR length of 10 codons was chosen because larger values generally resulted in less well-correlated harmonized outputs, while smaller values led to only incremental reductions in the harmonization deviation at the cost of increased runtimes. The example sequence shown in Figure 3 has one CDR of length = 15 (colored orange). Next, CHARMING selects candidate sections for codon alteration within each CDR. CHARMING calculates the number of candidate sections for each CDR as (CDR length)/10, rounded down to the nearest integer. Each candidate section consists of five consecutive codon positions, spaced evenly within each CDR. In the Figure 3 example, the CDR of length 15 generates one candidate section (colored red), placed at the middle five codon positions of the CDR.
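A possible way to implement this step is sketched below. Detecting the runs follows the description directly; the exact placement rule for multiple candidate sections is not spelled out in the text, so the sketch assumes the CDR is split into equal blocks with one five-codon section centered in each. Positions where output equals target are assumed to break a run. All function names are hypothetical.

```python
def find_cdrs(output, target, min_length=10):
    """Return (start, end, sign) runs where output stays above (+1) or
    below (-1) target for at least `min_length` positions; `end` is exclusive."""
    cdrs, start, sign = [], None, 0
    n = min(len(output), len(target))
    for i in range(n):
        s = (output[i] > target[i]) - (output[i] < target[i])
        if s != sign:
            if sign != 0 and i - start >= min_length:
                cdrs.append((start, i, sign))
            start, sign = i, s
    if sign != 0 and n - start >= min_length:
        cdrs.append((start, n, sign))
    return cdrs


def candidate_sections(start, end, section_len=5, codons_per_section=10):
    """Place floor(length / 10) five-codon sections evenly inside a CDR.

    Assumption: the CDR is divided into equal blocks and one section is
    centered in each block.
    """
    length = end - start
    count = length // codons_per_section
    if count == 0:
        return []
    block = length / count
    sections = []
    for k in range(count):
        block_center = start + (k + 0.5) * block
        first = int(block_center - section_len / 2)
        sections.append((first, first + section_len))
    return sections


# A 15-codon stretch above target, flanked by positions that match it.
target = [0.0] * 25
output = [0.0] * 5 + [1.0] * 15 + [0.0] * 5
cdrs = find_cdrs(output, target)
sections = [candidate_sections(s, e) for s, e, _ in cdrs]
print(cdrs, sections)
```

For the 15-codon CDR in this toy input, the single candidate section covers the middle five positions of the region, matching the Figure 3 example.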

4.4 Step 3: Choose a specific codon for alteration

Based on the algorithm's current iteration, CHARMING then selects one of the five positions within each candidate section for possible alteration. Specifically, the index of the selected position is the remainder of the iteration number divided by the number of candidate positions (i.e., the iteration number modulo 5). In the Figure 3 example, the first codon is chosen because the algorithm is in its first iteration.

4.5 Step 4: Alter the selected codon, if possible

For CDRs above the corresponding target values, CHARMING changes the codon at the selected position to the synonymous codon with the next lowest rank for the codon measure of interest. Likewise, for CDRs below the corresponding target values CHARMING selects the synonymous codon with the next highest value to replace the selected codon. This step “pulls” sequence regions with consistent deviations closer to the target values. If the codon position cannot be altered in the desired direction (i.e., if the codon occupying the selected position is already the extreme value in the desired direction), then CHARMING makes no change and the algorithm advances to Step 5. In the Figure 3 example, the selected codon is lowered from the highest ranked synonymous codon under the given measure to the second highest ranked synonymous codon (which is also the lowest ranked synonymous codon for the corresponding amino acid). If instead the codon at the selected position was already the lowest ranked of its synonyms, no change would be made.
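Steps 3 and 4 can be illustrated together with the following sketch. It assumes a zero-based iteration counter (so that the first iteration selects the first codon of a section, as in the Figure 3 example) and a list of synonymous codons already sorted from highest to lowest value in the destination organism; the codon ranking shown is invented and the function names are hypothetical.

```python
def select_position(section_start, iteration, section_len=5):
    """Pick one position in a candidate section: iteration modulo section length."""
    return section_start + iteration % section_len


def shift_codon(codon, ranked_synonyms, direction):
    """Move one rank down (direction=-1) or up (direction=+1).

    `ranked_synonyms` is ordered from highest to lowest value. The codon is
    returned unchanged if it is already the extreme in the requested direction.
    """
    index = ranked_synonyms.index(codon)
    new_index = index - direction  # raising the value means a smaller index
    if 0 <= new_index < len(ranked_synonyms):
        return ranked_synonyms[new_index]
    return codon


# Hypothetical ranking of three synonymous codons in the destination organism.
ranked = ["CTG", "CTC", "TTA"]

position = select_position(section_start=10, iteration=0)
lowered = shift_codon("CTG", ranked, direction=-1)   # CDR above target
raised = shift_codon("CTC", ranked, direction=+1)    # CDR below target
unchanged = shift_codon("TTA", ranked, direction=-1) # already lowest
print(position, lowered, raised, unchanged)
```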

4.6 Step 5: If the codon alteration reduces net deviation, keep it

CHARMING next recalculates the values for the altered sequence under the given codon usage measure, using codon usage values from the destination species. As detailed in Step 1, these values represent the average over a sliding window of M codons, which means that codon alterations can affect values at neighboring, unaltered positions. CHARMING keeps the codon alteration if the net deviation from the target values decreases because of the codon alteration. Net deviation is defined in Equation (2):

\text{Net deviation} = \sum_{i = 0}^{\text{Sequence length}} \left| \text{Target value}_{i} - \text{Harmonized value}_{i} \right| \qquad (2)

If net deviation does not decrease, then CHARMING discards the codon change. In the Figure 3 example, the change reduced net deviation from 479 to 411, hence this change will be kept in the synonymous mutant.

If a CDR is long enough to contain multiple candidate sections or there are multiple CDRs in the current synonymous mutant sequence (or both), the sequence will have multiple selected codons during each iteration. In this case, CHARMING repeats Steps 3–5 independently for each set of candidate positions.
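The accept-or-discard rule of Step 5 can be written compactly, as in the sketch below. Net deviation follows Equation (2); the `profile_fn` argument stands in for whichever windowed codon usage calculation is in use, and the toy per-codon values and names are assumptions made for illustration.

```python
def net_deviation(target, harmonized):
    """Sum of absolute per-position differences, as in Equation (2)."""
    return sum(abs(t - h) for t, h in zip(target, harmonized))


def keep_if_better(current_seq, candidate_seq, target, profile_fn):
    """Keep the candidate only if it strictly lowers net deviation.

    Returns (sequence to carry forward, its net deviation, whether it changed).
    """
    current_dev = net_deviation(target, profile_fn(current_seq))
    candidate_dev = net_deviation(target, profile_fn(candidate_seq))
    if candidate_dev < current_dev:
        return candidate_seq, candidate_dev, True
    return current_seq, current_dev, False


# Toy stand-in for a codon usage measure: one fixed value per codon symbol.
toy_values = {"X": 1.0, "Y": 0.4}


def toy_profile(seq):
    return [toy_values[c] for c in seq]


target_profile = [0.5, 0.5]
kept, deviation, changed = keep_if_better(
    ["X", "X"], ["Y", "X"], target_profile, toy_profile
)
print(kept, deviation, changed)
```

Recomputing the whole profile for each candidate is the simplest correct approach; because a single substitution only affects windows that overlap it, an implementation could restrict the recalculation to that neighborhood.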

4.7 Step 6: Iterate or terminate (output the final harmonized sequence)

After each set of candidate positions has been processed through Steps 3–5, CHARMING increments the iteration number by one and a new iteration of the algorithm begins at Step 1. Alternatively, the algorithm terminates (i.e., the synonymous sequence is output as the final harmonized sequence) after five consecutive iterations where no beneficial codon change is found. This stopping condition implies that all five candidate positions of each CDR have been chosen for alteration, and no beneficial alteration could be found.
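The outer loop and its stopping condition might be organized as in this sketch. The callable `try_iteration` stands in for one pass through Steps 1 to 5 and reports whether any codon change was kept; it, the `max_iterations` safety cap and the toy example are assumptions added for illustration and are not described in the text.

```python
def run_until_stalled(sequence, try_iteration, patience=5, max_iterations=100_000):
    """Repeat iterations until `patience` consecutive ones find no improvement.

    `try_iteration(sequence, iteration)` must return (new_sequence, improved).
    `max_iterations` is a safety cap assumed here, not part of the description.
    """
    iteration, stalled = 0, 0
    while stalled < patience and iteration < max_iterations:
        sequence, improved = try_iteration(sequence, iteration)
        stalled = 0 if improved else stalled + 1
        iteration += 1
    return sequence, iteration


def toy_iteration(value, iteration):
    """Toy stand-in: 'improve' by counting down to zero, then stall."""
    if value > 0:
        return value - 1, True
    return value, False


final_value, iterations_run = run_until_stalled(3, toy_iteration)
print(final_value, iterations_run)
```

With a patience of five and five candidate positions per section, five consecutive unproductive iterations mean every position in every candidate section has been tried once without success, which is the stopping condition stated above.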

5 CASE STUDY: HARMONIZING RELATIVE CODON USAGE FREQUENCIES FROM YEAST TO E. COLI

To computationally evaluate the performance of CHARMING, we created synonymous mutants for 3,358 E. coli coding sequences, harmonized to closely match Saccharomyces cerevisiae ORFeome-wide relative codon usage frequencies, as calculated by %MinMax^{8, 13} using codon usage frequency tables from Kazusa²⁸ (tables “Escherichia coli O157:H7 EDL933 [gbbct]” and “Saccharomyces cerevisiae [gbpln]”, respectively). Harmonizing from E. coli to S. cerevisiae is a good test of CHARMING as these two organisms have similar but distinct codon usage frequencies, exemplified by their difference in %GC base composition: 51.5% for E. coli versus 39.8% for S. cerevisiae.

For each coding sequence, we created three harmonized mutants, designed using:

only Rodriguez initialization⁸
only CHARMING (no initialization)
Rodriguez initialization prior to CHARMING harmonization

The net deviation between the codon usage measure values for each of these designed sequences and the target values was calculated using Equation (2). On average, Rodriguez initialization alone reduced the deviation relative to the wild-type sequence by 68.5%, CHARMING with no initialization reduced it by 76.6%, and CHARMING used with Rodriguez initialization reduced it by 87.1% (Figure 5). Figure 4 shows a representative example of the progression of CHARMING, with or without initialization, when harmonizing the Plasmodium falciparum sequence encoding the ubiquitin conjugating enzyme E2 for expression in E. coli.

5.1 Beyond heterologous expression

In addition to heterologous gene expression, there are other applications where codon harmonization can be a useful tool. However, it is important to note that some alternative applications, including the two described below, are incompatible with Rodriguez initialization (Step 0, above). In these cases, CHARMING skips the initialization step.

First, a user may desire to sample—albeit very sparsely—the vastness of synonymous codon space by designing not just one well-harmonized codon sequence but a small set of distinct synonymous sequences that harmonize equally well (or closely so) to the target sequence. Such a set could be useful, for example, to ensure that a harmonized sequence is not cross-correlated with another measure under consideration, and/or to ensure the generation of a harmonized sequence that avoids undesirable features, such as nuclease recognition sequences. To accomplish this, CHARMING includes an option to generate a set of synonymous sequences harmonized to the same sequence feature pattern. However, because CHARMING is deterministic, it will always output the same harmonized sequence for a given input sequence (assuming all other parameters remain unchanged). Therefore, when a user requests multiple harmonized synonymous mutant sequences, Rodriguez initialization is used as input to CHARMING for one sequence and the WT sequence is used as input to CHARMING for another. Any additional remaining output harmonized sequences are created by CHARMING first generating random synonymous sequences [technically, unweighted random reverse translations (RRTs)^{13, 14}] to serve as the input for Step 1 of CHARMING.
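
A minimal sketch of how the starting sequences for multiple harmonized outputs could be assembled is given below. It assumes the standard genetic code and treats the stop codon like any other position; the function names (`random_reverse_translation`, `starting_sequences`) are hypothetical and are not taken from the CHARMING code base.

```python
import itertools
import random
from collections import defaultdict
from typing import List

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}
SYNONYMS = defaultdict(list)
for _codon, _aa in CODON_TO_AA.items():
    SYNONYMS[_aa].append(_codon)


def codons(cds: str) -> List[str]:
    """Split a coding sequence into codons."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds: str) -> str:
    return "".join(CODON_TO_AA[c] for c in codons(cds))


def random_reverse_translation(cds: str, rng: random.Random) -> str:
    """Unweighted RRT: every synonymous codon is equally likely,
    regardless of codon usage frequencies."""
    return "".join(rng.choice(SYNONYMS[CODON_TO_AA[c]]) for c in codons(cds))


def starting_sequences(
    wild_type: str, initialized: str, n: int, rng: random.Random
) -> List[str]:
    """Inputs for n harmonization runs: the initialized sequence first,
    then the wild type, then unweighted RRTs for any remaining runs."""
    seeds = [initialized, wild_type][:n]
    while len(seeds) < n:
        seeds.append(random_reverse_translation(wild_type, rng))
    return seeds
```

Because the harmonization itself is deterministic, the only source of diversity among the outputs is the choice of starting sequence; seeding the random number generator (for example `random.Random(0)`) keeps such a set reproducible.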

In other applications, a user may wish to design an alternative, equally well harmonized synonymous sequence for use in the origin species. Because Rodriguez initialization chooses codons of equal rank based on usage rankings in the native and the host species, initialization will always return the WT sequence when the origin species is also the destination. To overcome this limitation, a random synonymous sequence is first generated to artificially create areas of “deviation” for harmonization, akin to the procedure described above for generating multiple harmonized mutants.
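
The reason rank-based initialization cannot help when the origin and destination species are the same can be seen in a simplified sketch. The code below is one plausible reading of "choosing codons of equal rank" and may omit details of the actual Rodriguez initialization (such as tie handling); the frequency tables are made-up toy values, not real codon usage data.

```python
from typing import Dict, List

FreqTable = Dict[str, Dict[str, float]]  # amino acid -> {codon: relative frequency}


def rank_codons(freqs: FreqTable) -> Dict[str, List[str]]:
    """Order each amino acid's codons from most to least used
    (ties broken alphabetically, which is an assumption)."""
    return {
        aa: sorted(table, key=lambda codon: (-table[codon], codon))
        for aa, table in freqs.items()
    }


def rank_match(cds: str, origin: FreqTable, host: FreqTable) -> str:
    """Replace each codon with the host codon of the same usage rank."""
    origin_rank = rank_codons(origin)
    host_rank = rank_codons(host)
    codon_to_aa = {c: aa for aa, table in origin.items() for c in table}
    out = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        aa = codon_to_aa[codon]
        rank = origin_rank[aa].index(codon)
        out.append(host_rank[aa][rank])
    return "".join(out)


# Toy two-amino-acid tables (hypothetical values for illustration only).
ORIGIN = {"K": {"AAA": 0.7, "AAG": 0.3}, "F": {"TTT": 0.6, "TTC": 0.4}}
HOST = {"K": {"AAA": 0.4, "AAG": 0.6}, "F": {"TTT": 0.45, "TTC": 0.55}}

cds = "AAATTCAAG"
print(rank_match(cds, ORIGIN, HOST))    # AAGTTTAAA
print(rank_match(cds, ORIGIN, ORIGIN))  # AAATTCAAG (unchanged)
```

When both tables are identical, every codon maps to itself, so the initialized sequence equals the WT sequence and contains no deviation for later steps to reduce. Starting from a random synonymous sequence sidesteps this.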

ACKNOWLEDGEMENTS

We thank the members of our laboratories for helpful input on the development of the CHARMING algorithm. We are grateful to Kristina Davis for her assistance with Figure 3. This project was supported by grants from the National Institutes of Health (R01 GM120733 to T.M., P.L.C., S.J.E. and J.L. and DP1 GM146256 to P.L.C.) and the W. M. Keck Foundation (to P.L.C.).

AUTHOR CONTRIBUTIONS

Gabriel Wright: Conceptualization (lead); formal analysis (equal); investigation (equal); methodology (lead); software (lead); visualization (supporting); writing – original draft (supporting); writing – review and editing (supporting). Anabel Rodriguez: Conceptualization (supporting); formal analysis (equal); investigation (equal); methodology (supporting); software (supporting); visualization (lead); writing – original draft (supporting); writing – review and editing (supporting). Jun Li: Funding acquisition (equal); methodology (supporting); supervision (supporting). Tijana Milenkovic: Funding acquisition (equal); methodology (supporting); supervision (lead). Patricia L. Clark: Funding acquisition (equal); project administration (equal); supervision (lead); writing – original draft (lead); writing – review and editing (lead).

CONFLICT OF INTEREST

The authors declare no conflicts of interest.

REFERENCES

1. Kane JF. Effects of rare codon clusters on high-level expression of heterologous proteins in Escherichia coli. Curr Opin Biotechnol. 1995; 6: 494–500. doi:10.1016/0958-1669(95)80082-4
2. Grote A, Hiller K, Scheer M, et al. JCat: A novel tool to adapt codon usage of a target gene to its potential expression host. Nucleic Acids Res. 2005; 33: W526–W531. doi:10.1093/nar/gki376
3. Lanza AM, Curran KA, Rey LG, Alper HS. A condition-specific codon optimization approach for improved heterologous gene expression in Saccharomyces cerevisiae. BMC Syst Biol. 2014; 8: 33. doi:10.1186/1752-0509-8-33
4. Puigbo P, Guzman E, Romeu A, Garcia-Vallve S. OPTIMIZER: A web server for optimizing the codon usage of DNA sequences. Nucleic Acids Res. 2007; 35: W126–W131. doi:10.1093/nar/gkm219
5. Villalobos A, Ness JE, Gustafsson C, Minshull J, Govindarajan S. Gene designer: A synthetic biology tool for constructing artificial DNA segments. BMC Bioinformatics. 2006; 7: 285. doi:10.1186/1471-2105-7-285
6. Withers-Martinez C, Carpenter EP, Hackett F, et al. PCR-based gene synthesis as an efficient approach for expression of the A+T-rich malaria genome. Protein Eng. 1999; 12: 1113–1120. doi:10.1093/protein/12.12.1113
7. Welch M, Villalobos A, Gustafsson C, Minshull J. Designing genes for successful protein expression. Meth Enzymol. 2011; 498: 43–66. doi:10.1016/B978-0-12-385120-8.00003-6
8. Rodriguez A, Wright G, Emrich S, Clark PL. %MinMax: A versatile tool for calculating and comparing synonymous codon usage and its impact on protein folding. Protein Sci. 2018; 27: 356–362. doi:10.1002/pro.3336
9. Buhr F, Jha S, Thommen M, et al. Synonymous codons direct cotranslational folding toward different protein conformations. Mol Cell. 2016; 61: 341–351. doi:10.1016/j.molcel.2016.01.008
10. Komar AA, Lesnik T, Reiss C. Synonymous codon substitutions affect ribosome traffic and protein folding during in vitro translation. FEBS Lett. 1999; 462: 387–391. doi:10.1016/S0014-5793(99)01566-5
11. Spencer PS, Siller E, Anderson JF, Barral JM. Silent substitutions predictably alter translation elongation rates and protein folding efficiencies. J Mol Biol. 2012; 422: 328–335. doi:10.1016/j.jmb.2012.06.010
12. Walsh IM, Bowman MA, Soto Santarriaga IF, Rodriguez A, Clark PL. Synonymous codon substitutions perturb cotranslational protein folding in vivo and impair cell fitness. Proc Natl Acad Sci U S A. 2020; 117: 3528–3534. doi:10.1073/pnas.1907126117
13. Clarke TF, Clark PL. Rare codons cluster. PLoS ONE. 2008; 3: e3412. doi:10.1371/journal.pone.0003412
14. Chaney JL, Steele A, Carmichael R, et al. Widespread position-specific conservation of synonymous rare codons within coding sequences. PLoS Comput Biol. 2017; 13: e1005531. doi:10.1371/journal.pcbi.1005531
15. Jacobs WM, Shakhnovich EI. Evidence of evolutionary selection for cotranslational folding. Proc Natl Acad Sci U S A. 2017; 114: 11434–11439. doi:10.1073/pnas.1705772114
16. Sander IM, Chaney JL, Clark PL. Expanding Anfinsen's principle: Contributions of synonymous codon selection to rational protein design. J Am Chem Soc. 2014; 136: 858–861. doi:10.1021/ja411302m
17. Zhou M, Guo J, Cha J, et al. Non-optimal codon usage affects expression, structure and function of clock protein FRQ. Nature. 2013; 495: 111–115. doi:10.1038/nature11833
18. Zhang G, Hubalewska M, Ignatova Z. Transient ribosomal attenuation coordinates protein synthesis and co-translational folding. Nat Struct Mol Biol. 2009; 16: 274–280. doi:10.1038/nsmb.1554
19. Angov E, Hillier CJ, Kincaid RL, Lyon JA. Heterologous protein expression is enhanced by harmonizing the codon usage frequencies of the target gene with those of the expression host. PLoS ONE. 2008; 3: e2189. doi:10.1371/journal.pone.0002189
20. Claassens NJ, Siliakus MF, Spaans SK, et al. Improving heterologous membrane protein production in Escherichia coli by combining transcriptional tuning and codon usage algorithms. PLoS ONE. 2017; 12: e0184355. doi:10.1371/journal.pone.0184355
21. Rehbein P, Berz J, Kreisel P, Schwalbe H. "CodonWizard" - an intuitive software tool with graphical user interface for customizable codon optimization in protein expression efforts. Protein Expr Purif. 2019; 160: 84–93. doi:10.1016/j.pep.2019.03.018
22. Sharp PM, Li WH. The codon adaptation index: A measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987; 15: 1281–1295. doi:10.1093/nar/15.3.1281
23. Waldman YY, Tuller T, Shlomi T, Sharan R, Ruppin E. Translation efficiency in humans: Tissue specificity, global optimization and differences between developmental stages. Nucleic Acids Res. 2010; 38: 2964–2974. doi:10.1093/nar/gkq009
24. Pechmann S, Frydman J. Evolutionary conservation of codon optimality reveals hidden signatures of cotranslational folding. Nat Struct Mol Biol. 2013; 20: 237–243. doi:10.1038/nsmb.2466
25. Shah P, Gilchrist MA. Explaining complex codon usage patterns with selection for translational efficiency, mutation bias, and genetic drift. Proc Natl Acad Sci U S A. 2011; 108: 10231–10236. doi:10.1073/pnas.1016719108
26. Ranaghan MJ, Li JJ, Laprise DM, Garvie CW. Assessing optimal: Inequalities in codon optimization algorithms. BMC Biol. 2021; 19: 36. doi:10.1186/s12915-021-00968-8
27. Wright G, Rodriguez A, Li J, Clark PL, Milenkovic T, Emrich SJ. Analysis of computational codon usage models and their association with translationally slow codons. PLOS ONE. 2020; 15: e0232003. doi:10.1371/journal.pone.0232003
28. Nakamura Y, Gojobori T, Ikemura T. Codon usage tabulated from international DNA sequence databases: Status for the year 2000. Nucleic Acids Res. 2000; 28: 292. doi:10.1093/nar/28.1.292
29. Alexaki A, Kames J, Holcomb DD, et al. Codon and codon-pair usage tables (CoCoPUTs): Facilitating genetic variation analyses and recombinant gene design. J Mol Biol. 2019; 431: 2434–2441. doi:10.1016/j.jmb.2019.04.021
30. Tunney R, McGlincy NJ, Graham ME, Naddaf N, Pachter L, Lareau LF. Accurate design of translational output by a neural network model of ribosome distribution. Nat Struct Mol Biol. 2018; 25: 577–582. doi:10.1038/s41594-018-0080-2

Volume 31, Issue 1

Special Issue: Tools 2022

January 2022

Pages 221-231

This article also appears in:

Experimental Biology 2022
