Volume 34, Issue 6 e70157
RESEARCH ARTICLE
Open Access

Cis-nonProline peptides: Genuine occurrences and their functional roles

Jane S. Richardson

Corresponding Author

Department of Biochemistry, Duke University, Durham, North Carolina, USA

Correspondence

Jane S. Richardson, 132 Nanaline Duke Building, Duke University, Durham NC 27710, USA.

Email: [email protected]

Contribution: Conceptualization, Data curation, Formal analysis, Validation, Supervision, Funding acquisition, Visualization, Writing - original draft, Writing - review & editing

Lizbeth L. Videau

Department of Biochemistry, Duke University, Durham, North Carolina, USA

Contribution: Data curation, Investigation, Validation, Visualization, Resources, Writing - review & editing

Christopher J. Williams

Department of Biochemistry, Duke University, Durham, North Carolina, USA

Contribution: Conceptualization, Software, Investigation, Formal analysis, Validation, Writing - original draft, Writing - review & editing

Bradley J. Hintze

Department of Biochemistry, Duke University, Durham, North Carolina, USA

Duke Institute of Health Innovation, Duke University Medical Center, Durham, North Carolina, USA

Contribution: Formal analysis, Investigation, Software, Writing - review & editing

Steven M. Lewis

Department of Biochemistry, Duke University, Durham, North Carolina, USA

Computational Discovery, Cyrus Biotechnology, Seattle, Washington, USA

Contribution: Conceptualization, Resources, Writing - review & editing

David C. Richardson

Department of Biochemistry, Duke University, Durham, North Carolina, USA

Contribution: Conceptualization, Funding acquisition, Project administration, Software, Supervision, Writing - review & editing

First published: 24 May 2025

Review Editor: Nir Ben-Tal

Abstract

While cis peptides preceding proline can occur about 5% of the time, cis peptides preceding any other residue (“cis-nonPro” peptides) are an extremely rare feature in protein structures, of considerable importance for two opposite reasons. On one hand, their genuine occurrences are mostly found at sites critical to biological function, from the active sites of carbohydrate enzymes to rare adjacent-residue disulfide bonds. On the other hand, a cis-nonPro can easily be misfit into weak or ambiguous electron density, which led to a high incidence of unjustified cis-nonPro over the 2006–2015 decade. This paper uses high-resolution crystallographic data and especially stringent quality-filtering at the residue level to identify genuine occurrences of cis-nonPro and to survey both individual examples and broad patterns of their functionality. We explain the procedure developed to identify genuine cis-nonPro examples with almost no false positives. We then survey a large sample of the varied functional roles and structural contexts of cis-nonPro, including the uses of specific amino acids for particular purposes. We emphasize aspects not previously covered: that cis-nonPro essentially always (except for vicinal disulfides) occurs in well-ordered structure, and especially the great concentration of occurrence in proteins that process or bind carbohydrates (identified by occurrence on the CAZy website).

1 INTRODUCTION

Peptide bonds in protein structures have a partial double-bond character, which keeps them fairly close to planar. The majority adopt the trans backbone conformation, with the ω dihedral angle near 180°, while some are cis with ω near 0°. Peptides preceding prolines have somewhat less unfavorable energy difference and a lower barrier to cis/trans rotation (reviewed in Pal & Chakrabarti, 1999), and the reported occurrence frequency of cis-Pro has converged over the years to a bit over 5% (Jabs et al., 1999; MacArthur & Thornton, 1991; Stewart et al., 1990), more common in β-sheet proteins than in helical ones. Cis-Pro play structural roles, as exposed turns or to accommodate tightly packed regions in the interior. They also play static functional roles such as positioning active-site residues in carbonic anhydrase and in pectate lyase (Videau et al., 2004). Such roles give them rather high conservation within a protein family. On the dynamic side, a need to achieve this rarer isomer produces a slow step in the protein folding process (Brandts et al., 1975; Schmid, 1992), and there are well-studied proline isomerase enzymes that aid in that step (Wawra & Fischer, 2006). The slow cis/trans conversion has been exploited by evolution for its suitability in biological timing circuits (Lu et al., 2007).
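The cis/trans distinction above rests on the ω dihedral angle, defined by four backbone atoms: Cα and C of residue i, then N and Cα of residue i+1. As a minimal sketch, the code below computes a dihedral from four 3D points and labels the peptide. The function names and the ±30° tolerance are illustrative assumptions, not values taken from this paper.

```python
import math

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four 3D points.

    For omega, pass CA(i), C(i), N(i+1), CA(i+1) in that order.
    """
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0 = sub(p0, p1)
    b1 = sub(p2, p1)
    b2 = sub(p3, p2)
    # Unit vector along the central bond
    norm = math.sqrt(dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Components of the outer bonds perpendicular to the central bond
    v = sub(b0, tuple(dot(b0, b1) * x for x in b1))
    w = sub(b2, tuple(dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))

def classify_peptide(omega_deg, tolerance=30.0):
    """Label a peptide by its omega angle; the tolerance is an assumed cutoff."""
    a = abs(omega_deg)
    if a <= tolerance:
        return "cis"
    if a >= 180.0 - tolerance:
        return "trans"
    return "twisted"
```

With idealized planar coordinates, a peptide whose two Cα atoms lie on the same side of the C–N bond gives ω near 0° (cis), and one with the Cα atoms on opposite sides gives ω near 180° (trans).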

In contrast, genuine cis peptides preceding non-proline residues are extremely scarce. Historically, the first examples were three cis-nonPro near the active site of carboxypeptidase A (Rees et al., 1981) and the Gly-cis-Gly peptide that helps bind both NADPH (nicotinamide adenine dinucleotide phosphate) and inhibitors in dihydrofolate reductase (Bolin et al., 1982). Figure 1 shows that site at 1.09 Å resolution in the later 1kms human DHFR (Klon et al., 2002).

FIGURE 1. Walleye stereo of one of the first cis-nonPro identified, in dihydrofolate reductase, as shown in the later 1.09 Å resolution 1kms. The cis peptide is marked by a lime trapezoid, while the all-atom contacts are shown as pairs of dots for van der Waals contacts and darker green pillows for H-bonds.

Two reviews that covered cis-nonPro (as opposed to just cis-Pro) appeared in 1999. One (Pal & Chakrabarti, 1999) mainly considered cis-Pro, so it limited the resolution considered to <2 Å and gave little detail about its 29 examples of cis-nonPro. The other (Jabs et al., 1999) concentrated on cis-nonPro, and it included cases up to 3.5 Å in order to get 43 examples to work with. It was the first to comment on the possible involvement with carbohydrate enzymes. It emphasized that there were many fewer cis-nonPro at low resolution than the simple energetics available then would predict, which was true but did not allow for considerations such as selection against slow folding, so the effects balanced out and the authors got the correct 0.03% occurrence value by favorable accident. Neither review quality-filtered by residue. Although there have been many careful studies of cis-nonPro peptides in specific structures, as far as we can tell there has been no other systematic review of cis-nonPro since then.

In 2006, cis-nonPro examples were included in the fragment libraries, first for COOT (Emsley et al., 2010) and later for other graphics programs, with no consideration of their very low prior probability. This began the decade of extreme overuse (often as much as 30 times too many) discovered by Croll (2015), which was stopped by explicitly flagging them in nearly all graphics programs. More recently, specific subcategories of genuine cis-nonPro in vicinal disulfides (Richardson et al., 2017) and of systematically incorrect cis-nonPro for 2-in-a-row and at chain termini (Williams et al., 2022) have been described.

The current work is the first systematic description of the cis-nonPro functional roles. It is also the first to ensure that essentially all occurrences are genuine, even in a good set of overall structures, by using strict quality filters at the residue level: by all-atom contacts, geometry, B-factor, map value, and real-space correlation. It includes a list of 439 cis-nonPro residues with comments in supplementary Tables as a pdf for human readability and a csv for findable, accessible, interoperable, and reusable machine access.

2 RESULTS

2.1 Reliability filtering for genuine occurrences

Any search for an unusual conformation needs quality filtering, at the residue level as well as the file level (or structural model level), in order to achieve a robust structural-bioinformatics treatment (Williams et al., 2022). Even more rigorous filtering is required for cis-nonPro peptides, given the epidemic of over-use in experimental structures, in poor density at all resolutions (Croll, 2015; Williams & Richardson, 2015). Therefore, as well as steric, geometrical, and B-factor criteria, for this analysis we also use criteria based on electron density quality (see “Methods” section), modified from those first used in our latest sidechain rotamer analysis (Hintze et al., 2016). This added filter will reject some genuine cases, but it accepts extremely few incorrect cases, and so is effective for choosing a list of reliably correct cis-nonPro examples for evaluating factors that affect occurrence preferences and for studying the real functional roles of this conformation.

The resulting reliability-filtered list contains 439 cis-nonPro peptides in 378 different protein chains, compiled from the Top8000 reference dataset of 6765 protein chains at <2.0 Å resolution, < 50% sequence identity, with deposited diffraction data, and additional residue-level filters. This cis-nonPro list, with annotations, is given in Tables S1 and S2 of the supplementary material, where the pdf version is human-readable and the csv version is machine-readable. It forms the primary basis of the analyses in this paper, with a few exceptions that investigate relationships or conservation across a wider range.

2.2 Preferential occurrence of cis-nonPro in carbohydrate-active proteins

It has been known for quite some time that a number of carbohydrate enzymes make use of cis-nonPro conformations (Jabs et al., 1999), and we are now in a position to judge whether this apparent connection is significant. Our source for automated identification of carbohydrate-processing or binding proteins is inclusion on the CAZy web site (Carbohydrate Active enZymes; http://www.cazy.org; Lombard et al., 2014). A total of 43 different CAZy families are represented on the CAZy list, and 13 distinct folds.

99 of our 378 distinct protein chains with a genuine cis-nonPro (26%) are in CAZy, while only 6% of the 6765 total chains in our reference dataset (the Top8000_50%_SF) are in CAZy. Carbohydrate-active protein chains are thus over-represented by a factor of >4 in this list. Of the 439 reliably genuine cis-nonPro peptides, 144 examples (33%) occur in proteins that are in CAZy. Another way of expressing this contrast is that 76% (33/43) of the protein chains with more than one authenticated cis-nonPro peptide are in CAZy. Those 33 chains are only 0.49% of the reference dataset, but they contain 77 (17.5%) of the cis-nonPro peptides, an over-representation of 36-fold. Thus carbohydrate-active enzymes are very much more likely than other proteins to have more than a single genuine cis-nonPro per chain, as well as much more likely to have any at all.
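The over-representation factors in the paragraph above follow from simple ratios of the quoted counts. The short sketch below reproduces that arithmetic. The variable names are our own, and the 6% CAZy background is used as quoted because the underlying chain count is not given in the text.

```python
# Counts quoted in the text
chains_with_cnp = 378            # chains with a genuine cis-nonPro
chains_with_cnp_in_cazy = 99     # of those, chains listed in CAZy
reference_chains = 6765          # chains in the reference dataset
cazy_background_fraction = 0.06  # "only 6%" of reference chains, as quoted
total_cnp = 439                  # reliably genuine cis-nonPro peptides
multi_cnp_cazy_chains = 33       # CAZy chains with more than one cis-nonPro
cnp_in_multi_cazy_chains = 77    # cis-nonPro peptides in those 33 chains


def fold_enrichment(observed_fraction, background_fraction):
    """Ratio of an observed fraction to its background expectation."""
    return observed_fraction / background_fraction


# CAZy chains among cis-nonPro-containing chains vs. among all reference chains
fold_any = fold_enrichment(
    chains_with_cnp_in_cazy / chains_with_cnp, cazy_background_fraction
)

# Share of cis-nonPro held by the 33 chains vs. their share of the dataset
fold_multi = fold_enrichment(
    cnp_in_multi_cazy_chains / total_cnp,
    multi_cnp_cazy_chains / reference_chains,
)

print(f"any cis-nonPro: {fold_any:.1f}-fold; multiple cis-nonPro: {fold_multi:.0f}-fold")
```

The first ratio comes out a little above 4 and the second rounds to 36, matching the values stated in the text.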

Of the 43 protein chains with >1 authenticated cis-nonPro, 15 (34%) are chitinases or chitinase-like: 9 of them have three cis-nonPros and 6 have two (see Table S1). We found a later-deposited protein chain with 4 genuine cis-nonPro, also a chitinase: 3wd0. Chitinases are remarkably widespread and sequence-diverse throughout phyla, occurring in bacteria, fungi, insects, plants, animals, and archaea (Adrangi & Faramarzi, 2013; Nakamura et al., 2007). They serve for nutrition, for protection against chitin-containing arthropods and especially fungi, or for other uses of chitin. Their diversity explains how so many of them can occur in a dataset with <50% sequence identity. (There are nearly as many at <30% sequence identity, so that additional step would significantly reduce our general dataset, without affecting the chitinases much at all.)

The positions of cis-nonPro peptides in a sample of chitinases are shown in a cross-sectional slice in Figure 2. All family 18 chitinases have a Trp-cis-X cis-nonPro at the C-terminal end of β-strand 8 in their TIM-barrel fold, where the Trp is well packed and seems to act by positioning the backbone and residue X, which is usually the catalytic acid. In the triple cis-nonPro chitinases, the other two well-conserved cis-nonPros are at or near the ends of β-strands 2 and 4, and are involved in substrate binding (Terwisscha van Scheltinga et al., 1996).

FIGURE 2. Six chitinase TIM barrels superimposed, showing the completely conserved cis-nonPro at the end of β-strand 8 (all Trp-cis-X) and β-strand 2, and the less conserved ones at the end of β-strand 4. They are 1goi Serratia marcescens (Kolstad et al., 2002), 1w9p Aspergillus fumigatus (Hurtado-Guerrero & van Aalten, 2007), 3fy1 acidic mammalian (Olland et al., 2009), 3alf Nicotiana tabacum (Ohnuma et al., 2011), 3n11 Bacillus cereus, missing the cis-nonPro at strand β4 (Hsieh et al., 2010), and 2uy2 Saccharomyces cerevisiae, missing the cis-nonPro at strand β4 (Rao et al., 2005), at 1.2–1.7 Å resolution.

Besides the chitinases, chains with 3 cis-nonPro peptides include 2 β-galactosidases (1yq2 and 3cmg), in CAZy, and 3 carboxypeptidases (2piy, 3d4u, and 3i1u), not in CAZy. The chains with 2 cis-nonPro peptides include a wider variety of functions, 14 in and 6 out of CAZy.

The cis-nonPro-containing carbohydrate enzymes are unevenly distributed across the many structure/function families defined in CAZy. There are 17 cis-nonPro protein chains in family GH18 (glycoside hydrolase 18): the 15 chitinases and 2 xylanase inhibitors. There are 13 in family GH5, mostly mannanases, 11 in family GH1, mostly β-glucosidases, 8 in family GH10, all xylanases, 5 in family GH2, and 5 in different PL families (polysaccharide lyases). In contrast, the remaining 41 CAZy cis-nonPro chains are spread across 34 different families (see Table S1). The xylanases present an interesting combination of conservation versus divergence. The 8 in family GH10 all use a His-cis-Thr cis peptide at the end of β-strand 3 of their TIM barrels. The 3 xylanases in other GH families each have a different cis-peptide sequence, located at the end of β-strand 1 (2ddx), β-strand 4 (2y8k), or β-strand 5 (1nof) of their TIM barrels.

Lectins, or carbohydrate-binding modules (CBMs), are listed currently on CAZy only when they are part of a protein that also includes a carbohydrate enzyme, either as a separate domain or a separate chain. Therefore, we did not have an automated way of identifying all of them. However, anecdotally, lectins such as the prototypical concanavalin A are fairly well represented on the cis-nonPro list (14 of them), but not as over-represented as the carbohydrate enzymes. The CBMs mostly have all-β antiparallel folds. If the 14 lectins in our list, plus the 47 whose names indicate action on a carbohydrate, are added to the 99 enzymes that are in CAZy, then 160 of the 378 cis-nonPro-containing chains, or 42%, are carbohydrate related. Since we made these additional assignments mainly from the titles and abstracts on the PDB site, they are not necessarily complete or correct, but even as an approximation this means that the predominance of carbohydrate relatedness is even stronger than the pure CAZy survey indicates.

The non-CAZy cis-nonPro proteins include 17 phosphoribosyl transferases, another highly sequence-diverse group (the PRT family) with only a 13-residue conserved sequence signature recognized. They share an essential cis-nonPro on the “PPi loop” that is conserved in conformation but not in sequence. As shown in Figure 3 for the 1.05 Å 1fsg PRT structure (Heroux et al., 2000), the backbone of the cis-nonPro and its adjacent peptides are used to bind a phosphate oxygen of PRPP (phosphoribosyl pyrophosphate) and, through waters, a Mg ion. In at least some PRTs, a trans to cis change accompanies substrate binding (Shi et al., 2002).

FIGURE 3. Walleye stereo of 1fsg, a 1.05 Å phosphoribosyl transferase (PRT) enzyme, showing the many H-bonds between the cis-nonPro and the substrate (in pink). Gray balls are Mg ions.

2.3 Preferential occurrence of cis-nonPro peptides in TIM-barrel proteins

As well as their enzymatic activity on carbohydrates, another characteristic shared by many of the cis-nonPro containing protein chains is a (β/α)8 TIM barrel fold. Over half the relevant domains of the 99 CAZy proteins are TIM barrels, plus five (β/α)8 domains in phosphodiesterases. However, in the non-CAZy chains, TIM barrels are not especially over-represented, and indeed most TIM barrels (including the eponymous triose phosphate isomerase itself) do not include any cis-nonPro peptides. It seems, then, that the strong preference for (β/α)8 folds is probably not an independent factor, but is in some way correlated with presence in the carbohydrate-processing system.

The more generic preference for β structure, and for location at or near the C-terminal end of a β-strand, does hold for non-CAZy as well as CAZy proteins.

2.4 Sequence preferences for first and second positions, on CAZy and on non-CAZy examples

Table 1 shows the numbers of examples for each amino acid for all four categories (first and second, CAZy and non-CAZy), listed in the same order they occur for the UniProtKB generic examples, shown in the final column. Each position has several amino acids that have quite significantly different frequencies than normal, showing that the two positions are different and that CAZy and non-CAZy have different requirements for these highly unusual conformations. Note that the second residue position is forbidden for Pro, since these are defined as cis-nonPro conformations.

TABLE 1. Occurrences of each amino acid in the first and second positions for CAZy and non-CAZy proteins, compared to general amino acid frequencies, listed in the order of general frequencies. Significant up or down departures are flagged as described in the note below the table.
Amino acid   CAZy first res   CAZy second res   Non-CAZy first res   Non-CAZy second res   UniProt (%)
Leu          1                2                 16                   13                    9.85
Ala          12               11                23                   14                    9.03
Gly          22               6                 77                   30                    7.23
Val          1                6                 12                   18                    6.86
Ser          9                26                12                   29                    6.73
Glu          13               12                15                   15                    6.16
Ile          0                0                 5                    13                    5.72
Arg          2                3                 6                    7                     5.69
Thr          5                13                6                    32                    5.58
Asp          2                8                 28                   35                    5.44
Lys          3                2                 5                    9                     5.03
Pro          3                0                 26                   0                     4.85
Phe          6                20                12                   7                     3.93
Asn          2                5                 7                    17                    3.91
Gln          3                4                 7                    3                     3.80
Tyr          4                7                 10                   26                    2.94
Met          2                2                 2                    2                     2.39
His          9                11                11                   11                    2.20
Trp          44               5                 6                    8                     1.29
Cys          1                2                 7                    4                     1.22
Total        144              144               295                  295                   100.00
Note: The last column is the general occurrence value for UniProtKB (http://www.ebi.ac.uk/uniprot/TrEMBLstats). Values 3σ below UniProtKB are shown in italic font, values 3σ–6σ above UniProtKB are shown in boldface, and values 12σ and more above UniProtKB are shown in bold and italics.
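The table note expresses departures from the UniProtKB background in units of σ, but the text does not say how σ was computed. As one plausible reading, the sketch below assumes a binomial model in which each of the n peptides in a column independently draws its residue with the UniProtKB frequency p, so that σ = sqrt(n·p·(1−p)). Both that model and the function names are assumptions made for illustration. The thresholds are applied as the note states them, so a value between 6σ and 12σ above background receives no flag.

```python
import math


def sigma_departure(count, column_total, background_fraction):
    """Departure of an observed count from expectation, in binomial sigma units.

    Assumes a binomial model; the paper does not state how sigma was defined.
    """
    expected = column_total * background_fraction
    sigma = math.sqrt(column_total * background_fraction * (1.0 - background_fraction))
    return (count - expected) / sigma


def flag(z):
    """Map a sigma departure onto the font flags described in the table note."""
    if z <= -3.0:
        return "italic"
    if z >= 12.0:
        return "bold italic"
    if 3.0 <= z <= 6.0:
        return "bold"
    return ""  # includes the 6-12 sigma range, which the note does not cover


# Two entries from the CAZy first-residue column (144 peptides in total)
z_trp = sigma_departure(44, 144, 0.0129)  # Trp, UniProt 1.29%
z_leu = sigma_departure(1, 144, 0.0985)   # Leu, UniProt 9.85%
print(f"Trp: {z_trp:.1f} sigma ({flag(z_trp)}); Leu: {z_leu:.1f} sigma ({flag(z_leu)})")
```

Under this assumed model, Trp in the CAZy first position lies far above the 12σ threshold, while Leu in the same column lies more than 3σ below expectation.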

2.5 Five-dimensional ϕ, ψ, ω, ϕ, ψ local conformational and sequence preferences

A useful study tool is to plot datapoints for the genuine cis-nonPro examples in the expanded Ramachandran five-dimensional space of ϕ1, ψ1, ω, ϕ2, ψ2 selectable by their first and second amino-acid identities. These plots can be viewed three dimensions at a time in the Mage interactive graphics program (Richardson & Richardson, 2001). ϕ1, ψ1 is almost entirely β, which reflects the great tendency for cis-nonPro to occur on or at the end of β-strands. Not surprisingly, there is little spread from 0° in ω, but somewhat more than in trans conformation because of the long tails of the distribution, especially on the negative side (see Supplementary Table 1). ψ1, ϕ2 provides the most diagnostic projection, shown in Figure 4a for all cis-nonPro and in Figure 4b for selected clusters.

FIGURE 4. ψ1, ϕ2 plots, taken as the most diagnostic two-dimensional projections from five-dimensional plots. (a) Includes all cis-nonPro and shows the very strong preference for extended structure, as well as the X-cis-G, G-cis-X, and first-residue alpha that are the only exceptions. (b) Shows the 3 TIM barrel clusters, the G-cis-G DHFR and arginase clusters, and the first-residue alpha clusters discussed in the text.

2.5.1 Trp-X cis-nonPro

Trp and Cys are the least frequent amino acids in globular proteins (~1.3%). However, there are 49 Trp-cis-X cis-nonPro in our dataset of 439, or 11%, an enrichment by a factor of 8, the highest ratio for any amino acid type. The preference is asymmetrical, with only 13 X-cis-Trp and no Trp-cis-Trp. This seems to be a functional rather than an energetic preference since nearly all are on β-strand 8 of a TIM-barrel carbohydrate-cleaving CAZy enzyme (14 chitinases, 7 glucosidases, 6 mannanases, 5 endoglucanase cellulases, and 6 other types in our dataset). Those ϕ, ψ values are nearly all closely clustered in all 5 dimensions near the TIM β8 points of Figure 4b, and the Trp sidechain adopts a t-90 rotamer that lets it usually make some contact with the X sidechain (usually the active-site residue) and often with the adjacent β-strand. Trp-cis-Ser is the most common sequence, followed by Trp-cis-Glu.

Five more Trp-cis-X cis-nonPros are Trp-cis-Thr in non-CAZy glycerol-phosphodiester phosphodiesterases, at the end of β-strand 8 of 8-stranded TIM-like barrels. Only one Trp-cis-X is in a protein with no carbohydrate relationship at all: 1qgu nitrogenase.

2.5.2 First-residue helical ϕ, ψ

A cis peptide cannot occur inside a regular helix, since it would break the helix. Even single-residue helical ϕ, ψ values are very unusual for cis-nonPro, especially in the first amino acid position. This loose cluster of 8 examples, however, turned out to have nothing else in common. Most examples are some sort of turn, loop, or corner, while 3pb6 Asp-cis-Ser makes a bulge in a helix and ligands the Zn. Two are especially unusual and interesting. 2eab A His760-cis-Ala-cis-Pro762 (Nagae et al., 2007) is an unprecedented type of tight turn formed by successive cis-nonPro and cis-Pro peptides, confirmed by definitive 1.12 Å electron density. It has no evident functionality, and we suspect it might not be conserved, but could not tell at the time of the Top8000's construction because its crystal structure defined a new family (GH95) of fucosidases. Since then, there have been 4 further structures done in the GH95 family for other species (7kmq, 2rdy, 4ufc, and 7znz), all of which have the structure His-cis-Pro-cis-Pro, with the much commoner cis-Pro in both positions.

The one α-first case that has a definite biological function is 3hhs chain A Glu-cis-Ala354 (Figure 5) and chain B Glu-cis-Ser352 (Li et al., 2009). They are conserved in related prophenyl-oxidase enzymes. The cis-nonPro forms the C-cap of one helix and makes two backbone H-bonds to the first turn of the next helix, separated by a short, meandering loop. The protruding cis peptide puts a bend of about 30° between the two helices. Glu 352 of the cis-nonPro stabilizes one of the 6 His ligands to the bi-nuclear Cu site, and the second helix contributes two more His ligands.

FIGURE 5. The biologically relevant first-residue-alpha cis-nonPro in 3hhs 1.97 Å prophenyl-oxidase. The Glu is positioned to make an H-bond through water (small red sphere) to a ligand of the bi-nuclear Cu site (large copper-colored spheres), while 2 other ligands are provided by the second helix. H-bonds are shown as green pillows of dots.

2.5.3 DHFRs and other Gly-cis-Gly

Gly is enriched generally by about three-fold in cis-nonPro, at least partly because a Gly-containing peptide is easier to fit incorrectly as cis and thus their numbers are inflated in unfiltered data. There is also a functional preference for Gly in some contexts, especially for Gly-cis-Gly. Expectation for Gly-cis-Gly would be 3.4 examples in our dataset, and there are actually 19. Ten of them are in dihydrofolate reductases (DHFRs) as a conserved and essential feature of the active site. Figure 1 illustrates the local conformation and interactions of the DHFR Gly-cis-Gly in the 1.09 Å 1kms human DHFR, showing how it is a major factor in binding both the NADPH cofactor and the methotrexate inhibitor, using H-bonds from the cis and surrounding peptides and van der Waals contacts from the essential Gly H “sidechains.” These features are all conserved in DHFRs, from the classic early structures to the later, higher-resolution structures of our dataset, where they are shown even more cleanly.

The other Gly-cis-Gly cis-nonPros are quite varied, including 3 in arginase-superfamily enzymes and 3 in carbohydrate enzymes with β-helix folds. Their conformational clusters are in orange in Figure 4b, quite distinct from the DHFR cluster.

There are 59 Gly-cis-X and only 15 X-cis-Gly. We do not know of a specific reason for this, but certainly the relationship of the glycines to the rest of the structure is quite different in the two positions.

2.5.4 Asp-cis-Asp

Most of the nine Asp-cis-Asp cis-nonPro peptides arrange their two sidechain carboxyl groups to form one side of a divalent-cation binding site, where the other side is bound by substrate. Figure 6 shows an example from 2xjp at 0.85 Å resolution, with bound Ca++ and mannose. This flocculin in Saccharomyces cerevisiae promotes the self-aggregation called flocculation, one of the steps in brewing beer.

FIGURE 6. The 2xjp structure of flocculin at 0.85 Å, where both Asps bind the Ca++, which binds the mannose (Veelders et al., 2010). Electron density contours are at 1.2 and 3.0σ.

Four other Asp-cis-Asp examples bind the Zn++ (3ife, 3iib, 3pfe) or Mn++ (2pok) at the active site of a metallopeptidase. In the 2jdi F1-ATPase α-chain, just Asp 269 binds the Mg++ through a water.

2rb7 is a metallopeptidase from Desulfovibrio desulfuricans G2 with no metal binding in the crystal and no publication (Joint Center for Structural Genomics, 2007), but the Asp-cis-Asp is part of a suggestive cluster of 3 Asp, 2 Glu, and 2 His deep in a cleft. At 30%–40% homology, there is a new pdb file 7m6u at 2.59 Å resolution of what is now called carboxypeptidase G2 (Yachnin et al., 2022), and it has an Asp-cis-Asp in the equivalent position, where the first Asp binds both Zn++. With a closely related structure but no detectable overall sequence identity, the 8vkt 1.4 Å structure of DapE (Terrazas-Lopez et al., 2024) has an Asp-cis-Met in the equivalent local position, where the Asp binds both Zn++.

2.5.5 Cys-cis-Cys: Vicinal disulfides

The cis-nonPro list includes two examples of Cys-cis-Cys that are SS-bonded as sequence-adjacent, or “vicinal,” disulfides. Figure 7 shows the cis vicinal SS in 1wd4, where the disulfide makes van der Waals contact with bound arabinofuranose. The trans conformation is also possible, and more frequent, for vicinal disulfides. Therefore, we have analyzed the occurrence patterns, the possible conformations (2 cis and 2 trans), and the varied functional roles of vicinal SS in a separate paper (Richardson et al., 2017). They can bind ligands (usually the undecorated side of a ring, as in Figure 7), stabilize structure, or provide the switch for a large conformational change.

FIGURE 7. The 1wd4 arabinofuranosidase at 2.04 Å resolution (Miyanaga et al., 2004). The vicinal SS bond makes extensive van der Waals contact with the undecorated side of the sugar ring, preventing binding of a decorated ring. The atoms of the SS bond are gold spheres, and the N and O of the cis bond are blue and red spheres, both pointed in the same direction. Green dots are close contacts and orange spikes are small overlaps. Electron density contours are at 1.0 and 3.0σ.

3 DISCUSSION

The most novel and striking conclusion of this study is also very puzzling. Why are cis-nonPro peptides selected by evolution 4 times more often in carbohydrate-active enzymes than in other proteins, and multiple cis-nonPro 36 times as often? As far as we have seen, the cis peptide never itself performs the catalysis; instead, it seems to position a catalytic sidechain or to bind the carbohydrate, by many quite distinct strategies. Only a subset of CAZy enzymes, and very few other enzymes, take advantage of these cis-nonPro capabilities. Our best guess is that there could be different transition states in carbohydrate chemistry (which is distinct from protein and nucleic acid chemistry) that can make use of the change to a trans state to promote more favorable binding (Du et al., 2024). This could begin to be investigated by seeing whether there are any structures of a transition state for these enzymes. There may also be non-proline peptide isomerases that have been coupled into the systems that produce and control carbohydrate-related enzymes. A great many prolyl-peptide isomerases are known, but apparently a generic peptide cis-trans isomerase activity has so far been described only for DnaK (Schiene-Fischer et al., 2002), and not with a carbohydrate connection. A search for such enzymes might well be productive and informative.

To support their biological functionality, some cis-nonPro position both sidechains (see Figures 6 and 7), while some position just one sidechain (Figures 2b and 5a). Other examples use interactions with the NH and/or CO of the cis peptide itself (Figures 1 and 3), and those interactions can be either direct or through a water molecule. The two sidechains and the functional groups of the peptide are approximately on opposite sides of the motif. Cis-nonPro almost always occurs in well-ordered regions of the structure, usually with one side open for business and the other side held by tight contacts, so a cis-nonPro in a partly disordered loop is highly suspect. One exception is vicinal disulfides, which can occur and function on very mobile loops, since the disulfide makes a cis-trans transition extremely difficult (Richardson et al., 2017). Two situations that have always been found to be wrong are those at chain ends and those that are two-in-a-row (Williams et al., 2022).

As is usually true for rare motifs, nearly all genuine cis-nonPro are functionally important. They are rare because they are energetically quite unfavorable relative to trans, so they are not conserved by evolution unless they are useful and needed. Of many dozens of examples examined, only a few had no evident connection with function, such as the fourth one in the all-β domain of chitinase 3wd0 and His-cis-Ala 761 in 2eab. A major motivation for avoiding or fixing incorrect cis-nonPro peptides is to improve the signal-to-noise ratio for recognizing the genuine ones and investigating their functional roles.

4 METHODS

The starting dataset of cis-nonPro examples is from a version of the “Top8000” that includes only the best chain in each RCSB PDB 50% homology cluster, defined as the best average of resolution and MolProbity score (Chen et al., 2010) with both <2.0 Å and requiring deposited diffraction data. Importantly, the dataset is further quality-filtered at the residue level, aiming to cull out incorrect or unjustified cases and leave only genuine cis-nonPro examples. It removes all residues with any backbone atom (including Cβ) with B-factor >40, real-space correlation coefficient <0.7, 2mFo-DFc map value at atom position <1.2σ, a covalent geometry outlier >4σ, or an all-atom clash ≥0.5 Å. B-factor, RSCC, and map value used phenix.real-space-correlation detail = atom, and geometry outliers and all-atom clashes used the MolProbity website (Williams et al., 2018).
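The residue-level criteria listed above amount to a simple reject-if-any-fails test. The sketch below encodes the stated thresholds. The `ResidueQuality` record and its field names are hypothetical, standing in for per-residue values that would in practice be collected from the Phenix and MolProbity outputs mentioned in the text. Each field is assumed to hold the worst value over the residue's backbone atoms (including Cβ).

```python
from dataclasses import dataclass


@dataclass
class ResidueQuality:
    """Worst-case backbone metrics for one residue (hypothetical field names)."""
    max_b_factor: float                # highest B-factor among backbone atoms
    min_rscc: float                    # lowest real-space correlation coefficient
    min_map_sigma: float               # lowest 2mFo-DFc map value, in sigma
    max_geometry_outlier_sigma: float  # largest covalent-geometry deviation, in sigma
    max_clash_overlap: float           # largest all-atom clash overlap, in Angstroms


def passes_residue_filter(r: ResidueQuality) -> bool:
    """Return True only if none of the rejection criteria from the text applies."""
    rejected = (
        r.max_b_factor > 40.0
        or r.min_rscc < 0.7
        or r.min_map_sigma < 1.2
        or r.max_geometry_outlier_sigma > 4.0
        or r.max_clash_overlap >= 0.5
    )
    return not rejected


# Example: a well-ordered residue passes; the same residue with a 0.5 A clash does not.
good = ResidueQuality(18.0, 0.95, 2.4, 1.5, 0.1)
clashing = ResidueQuality(18.0, 0.95, 2.4, 1.5, 0.5)
print(passes_residue_filter(good), passes_residue_filter(clashing))
```

Combining the criteria with a logical OR means a single failing metric removes the residue, which matches the stated aim of accepting very few incorrect cases at the cost of rejecting some genuine ones.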

We could not provide the huge residue-filtered dataset at the time when we did the initial work. However, a later and larger dataset with residue-level filtering is now freely available on Zenodo at https://doi.org/10.5281/zenodo.4626149 (Williams et al., 2022).

For the cis-nonPro work, we first looked at the unfiltered but high-resolution data for cases that had more than one cis-nonPro, to check how well the residue filtering worked. 42 examples were clearly correct, with good density for the real backbone position, a gap along the Cα–Cα line, and no outliers. 44 were clearly unjustified, in very fragmented density and with many outliers in bond geometry and torsions. Just two were ambiguous, in reasonable but not excellent density and with slight clashes. The residue-level filters were doing a good job, and there was a clear separation between disordered structure (which is where most of the incorrect cis-nonPros occurred during the epidemic of overuse) and well-ordered structure. To check this further, 127 filtered cases were examined in 3D along with their electron density, validation markup, and literature references, which identified only two incorrect examples (the same ones as before; this time we corrected and refined them to show that trans was indeed preferable). Both contained a Gly, which in hindsight was reasonable because the lack of a Cβ to provide extra density for a cross-check increases the likelihood of a trans-to-cis misfit. After finding this, all 84 additional cases that contained a Gly were examined, and 6 cases were found to be inadequately supported and were also removed. All but one of the unsupported ones were deposited during the period of epidemic overuse of cis-nonPro, from 2006 to 2015. Individual examples were examined in the KiNG display and modeling program (Chen et al., 2009), using 2mFo-DFc electron density maps and difference-density maps. Cases on partially disordered loops, or where the equivalent peptide in identical chains is not cis, were also rejected.

Beyond the enormous number of unjustified cis-nonPro added during the epidemic of overuse, we were curious how many genuine cis-nonPro had been overlooked in our high-resolution dataset. Touw and colleagues had produced a thorough and detailed tool for finding cis-trans and peptide flips for the case of overlooked cis-Pro (Touw et al., 2015), so we asked them to run the tool on our quality-filtered dataset for the case of cis-nonPro. They kindly agreed, finding only 13 possible examples, listed in Supplementary Table-S3.overlooked-cnP_in-Top2018.pdf. That is a reassuringly small number and will not change our conclusions. However, we have not added them to our analysis because we want all our examples to be accessible directly from the PDB.

All figures except Figure 4 were produced in KiNG. The Mage graphics program (Richardson & Richardson, 2001) was used for the five-dimensional dihedral-angle analysis, because of its cluster-defining functionality and because it can support the 52 lists per subgroup needed to select first- and second-residue sequence and conformation combinations for study.

Occurrence frequencies for the amino acids in cis-nonPro peptides were normalized relative to expectation by comparing them with frequencies in the general protein sequence population, as given in the UniProt knowledge base at http://www.ebi.ac.uk/uniprot/TrEMBLstats, version as of November 2017. As described in the text, carbohydrate-active enzymes were identified by their inclusion in the CAZy database at http://www.cazy.org, accessed over mid to late 2016.
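The normalization described above amounts to an observed-over-expected ratio for each amino acid. The following is a minimal sketch under stated assumptions, not the code used for the analysis: the function name `relative_occurrence`, the counts, and the background frequencies are all hypothetical placeholders, and real background values would come from the UniProt statistics cited above.

```python
def relative_occurrence(observed_counts, background_freq):
    """Return the observed/expected frequency ratio for each amino acid.

    observed_counts: dict of amino acid -> count in the cis-nonPro sample
    background_freq: dict of amino acid -> fractional frequency in the
                     general sequence population
    """
    total = sum(observed_counts.values())
    if total == 0:
        raise ValueError("no observations to normalize")
    ratios = {}
    for aa, expected in background_freq.items():
        if expected <= 0:
            continue  # a ratio is undefined without a positive expectation
        observed = observed_counts.get(aa, 0) / total
        ratios[aa] = observed / expected
    return ratios


# Made-up numbers for illustration only; NOT values from this study or UniProt.
observed = {"Gly": 30, "Ala": 10, "Ser": 10}
background = {"Gly": 0.10, "Ala": 0.10, "Ser": 0.05}
ratios = relative_occurrence(observed, background)
```

A ratio above 1 indicates that the residue occurs in the sample more often than its general frequency would predict, and a ratio below 1 indicates that it occurs less often.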

In the text and the Supplemental table, PDB identifiers are written in a convention that is legible in any font (regardless of 0-O, I-1-l confusions): all lower case except for L.
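As a small illustration of this convention, assuming it is applied to four-character PDB identifiers, the sketch below lower-cases every letter except L, which is upper-cased. The helper name `format_pdb_id` is hypothetical and not part of any published tool.

```python
def format_pdb_id(pdb_id: str) -> str:
    """Write an identifier in all lower case except for the letter L."""
    # Lower-case everything first, then restore L to upper case so it
    # cannot be mistaken for the digit 1 or an upper-case I.
    return pdb_id.lower().replace("l", "L")


examples = [format_pdb_id(code) for code in ("1UBQ", "3lzt", "1GLG")]
```

Here `examples` holds `1ubq`, `3Lzt`, and `1gLg`. Because all other letters are lower case, a `0` can only be the digit zero and an `l`-like stroke can only be the digit one.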

AUTHOR CONTRIBUTIONS

Jane S. Richardson: Conceptualization; data curation; formal analysis; validation; supervision; funding acquisition; visualization; writing – original draft; writing – review and editing. Lizbeth L. Videau: Data curation; investigation; validation; visualization; resources; writing – review and editing. Christopher J. Williams: Conceptualization; software; investigation; formal analysis; validation; writing – original draft; writing – review and editing. Bradley J. Hintze: Formal analysis; investigation; software; writing – review and editing. Steven M. Lewis: Conceptualization; resources; writing – review and editing. David C. Richardson: Conceptualization; funding acquisition; project administration; software; supervision; writing – review and editing.

ACKNOWLEDGMENTS

This work was supported by National Institutes of Health grants P01-GM063210 Project IV to JSR and R01-GM073919 to DCR, and the Phenix Industrial Royalties.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT

Our data is shared in the supplementary tables.
