α-Helical coiled coils are common tertiary and quaternary elements of protein structure. In coiled coils, two or more α helices wrap around each other to form bundles. This apparently simple structural motif can generate many architectures and topologies. Coiled coil-forming sequences can be predicted from heptad repeats of hydrophobic and polar residues, hpphppp, although this is not always reliable. Alternatively, coiled-coil structures can be identified using the program SOCKET, which finds knobs-into-holes (KIH) packing between side chains of neighboring helices. SOCKET also classifies coiled-coil architecture and topology, thus allowing sequence-to-structure relationships to be garnered. In 2009, we used SOCKET to create a relational database of coiled-coil structures, CC⁺, from the RCSB Protein Data Bank (PDB). Here, we report an update of CC⁺ following an update of SOCKET (to Socket2), the recent explosion of structural data, and the success of AlphaFold2 in predicting protein structures from genome sequences. With the most-stringent SOCKET parameters, CC⁺ contains ≈12,000 coiled-coil assemblies from experimentally determined structures, and ≈120,000 potential coiled-coil structures within single-chain models predicted by AlphaFold2 across 48 proteomes. CC⁺ allows these and other less-stringently defined coiled coils to be searched at various levels of structure, sequence, and side-chain interactions. The identified coiled coils can be viewed directly from CC⁺ using the Socket2 application, and their associated data can be downloaded for further analyses. CC⁺ is available freely at http://coiledcoils.chm.bris.ac.uk/CCPlus/Home.html. It will be updated automatically. We envisage that CC⁺ could be used to understand coiled-coil assemblies and their sequence-to-structure relationships, and to aid protein design and engineering.

1 INTRODUCTION

Protein–protein interactions (PPIs) are critical for all biological processes (Kuriyan et al., 2012). There are various classes of PPI involving the common protein secondary structure elements (α helices and β strands), less-well-defined turns, loops, and intrinsically disordered regions, and many types of protein tertiary structure. Gathering sequence and structural data on PPIs (McDowall et al., 2009; Pagel et al., 2005; Szklarczyk et al., 2023; Xenarios et al., 2000) is important and necessary if we are to understand protein networks, target them for biomedical applications, and exploit them in synthetic biology and biotechnology.

One class of PPI that is well defined both at the sequence and structural levels is the α-helical coiled coil (CC). As a result, CCs can be readily identified and examined. In turn, this provides insight into protein structure and function, and a solid basis for protein design and engineering. Indeed, structure, design, and engineering studies of CCs are relatively mature and have been reviewed extensively (Hartmann, 2017; Korendovych & DeGrado, 2020; Lapenta et al., 2018; Lupas et al., 2017; Lupas & Bassler, 2017; Woolfson, 2017, 2023). Therefore, this introduction focusses on the most-relevant points that underpin the work presented in this paper. One thing lagging behind these advances is an understanding of the biological diversity and functions of the many natural CCs (Woolfson, 2023). One of our aims for the revised database of CC structures and associated data presented here is that it will help others reach a full understanding of this particularly widespread and diverse class of PPI.

CCs are assemblies of two or more α helices that wrap around each other to form rope-like, or supercoiled, helical bundles. At the most basic level, these assemblies are encoded by seven-residue (heptad) sequence repeats in which hydrophobic (h) residues are spaced alternately three and four residues apart. The intervening residues are often polar (p) resulting in hpphppp repeats usually labeled abcdefg. Combined with the 3.6 residues per turn of the α helix, these patterns encode amphipathic helices with distinct hydrophobic (a + d) and polar faces. In aqueous media, multiple copies of such helices come together to bury their hydrophobic faces and form hydrophobic cores that stabilize the assemblies. The helices supercoil around each other because the average hydrophobic spacing of 3.5 residues falls short of the fixed 3.6 residues per turn of the α helix. Moreover, and because of this combination of sequence repeat and α-helical geometry, residues at a and d form knobs that can insert into holes formed by four residues of a neighboring helix (e.g., d-g-a-d or a-d-e-a, respectively, in parallel CCs). This so-called knobs-into-holes (KIH) packing is the true signature of CC structures and assemblies.
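The arithmetic behind the heptad repeat and the resulting supercoil can be checked with a short sketch. The function names below are illustrative only and are not part of SOCKET, Socket2, or CC⁺; the sketch assumes an ideal α helix with exactly 3.6 residues per turn.

```python
HEPTAD = "abcdefg"
RESIDUES_PER_TURN = 3.6  # ideal alpha helix, as assumed in the text


def heptad_register(length, start="a"):
    """Return a register string of the given length, starting at `start`."""
    offset = HEPTAD.index(start)
    return "".join(HEPTAD[(offset + i) % 7] for i in range(length))


def hydrophobic_spacings(register):
    """Gaps between successive a and d sites in a register string."""
    sites = [i for i, position in enumerate(register) if position in "ad"]
    return [j - i for i, j in zip(sites, sites[1:])]


# Four heptads plus the next a site, so that the 3- and 4-residue gaps pair up.
register = heptad_register(4 * 7 + 1)
spacings = hydrophobic_spacings(register)
mean_spacing = sum(spacings) / len(spacings)

# Seven residues at 360/3.6 = 100 degrees each cover 700 degrees, which is
# 20 degrees short of two full turns; the hydrophobic seam therefore drifts
# around the helix, and the helices supercoil to keep the seams in contact.
drift_per_heptad = 7 * (360 / RESIDUES_PER_TURN) - 720

print(register)
print(spacings, mean_spacing)
print(drift_per_heptad)
```

The alternating 3,4-spacing averages 3.5 residues, which is the shortfall relative to 3.6 residues per turn described above.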

However, this apparent simplicity masks underlying complexities (Lupas et al., 2017; Lupas & Gruber, 2005). For instance, although heptad sequence repeats are the most common, other repeats based on 3,4-spacings of hydrophobic residues are possible, and these lead to different CC supercoils. Furthermore, expanding the helix–helix interfaces from the canonical a + d sites to include the g and e positions accesses oligomer states beyond the dimers, trimers, and tetramers that are most common in nature. As well as these different CC oligomers from dimers upwards, the component helices of CC assemblies can be in all-parallel, antiparallel, or mixed arrangements. And, though many CCs are homo-oligomers, hetero-oligomers are also common. Finally, as well as these oligomeric CCs, many CCs are formed within single chains, that is, they form parts of tertiary structures. Access to complete and robust databases of CCs would help assess the breadth of CC structural space, and the interrogation of CCs en masse would give a deeper understanding of sequence-to-structure/function relationships. In turn, this would aid structural molecular biology, chemical and synthetic biology, protein design, and other fields and applications.

Some time ago, we wrote the program SOCKET to identify KIH packing between α helices in 3D coordinates of protein structures, that is, RCSB PDB files (Bittrich et al., 2022; Walshaw & Woolfson, 2001). SOCKET has allowed us to identify CCs in the whole of the PDB and create a relational database of CC structures, CC⁺ (Testa et al., 2009). Using these resources, we have been able to classify CCs in a Periodic Table of Coiled Coils (Moutevelis & Woolfson, 2009) and a graph-based Atlas of Coiled Coils (Heal et al., 2018). Others have used SOCKET to generate similar databases and useful resources for collating and analyzing CC structures; notably, SamCC-Turbo (Szczepaniak et al., 2020). Web-based resources for predicting, analyzing, and categorizing CCs more widely have been reviewed elsewhere (Woolfson, 2023).

Recently, we have updated SOCKET to Socket2 (Kumar & Woolfson, 2021). This includes improvements in the SOCKET algorithm itself to capture CC structures that SOCKET could not find, and a new visualizer to allow users to view and analyze identified CCs directly in real time. Here, we describe an update and overhaul of the CC⁺ database using Socket2. Moreover, we include CCs identified by Socket2 in both the experimentally validated structures of the RCSB PDB and those found in tertiary structures predicted by AlphaFold2 (Jumper et al., 2021) collated in the EMBL-EBI AlphaFold Protein Structure Database (https://alphafold.ebi.ac.uk/), which at the time of updating CC⁺ had models for 48 proteomes (Table S1) (Varadi et al., 2022). We envision that the new CC⁺ database and its associated tools will be of use to expert and non-expert users interested in all aspects of CC biology, structure, and design.

2 DESIGN, ARCHITECTURE, AND POPULATION OF THE CC⁺ DATABASE

The new CC⁺ database has three components: First, an automatically updated backend houses the core data on the identified CC structures/models and their associated sequences and structural parameters. Second, as described in the next section, an accessible frontend allows users to access these data using a wide range of search parameters and criteria. Third, following user-defined searches, CC⁺ can generate search-specific data on the fly, including: position-specific scoring matrices (PSSMs) from the selected sequences; and models of the CC regions in the context of the whole protein structure, which can be visualized and interrogated using the interactive GUI of Socket2 (Kumar & Woolfson, 2021).

The design and architecture of the backend borrow from the original database reported in 2009 (Testa et al., 2009): that is, it is an updated rather than a completely rewritten resource. The process of populating this is outlined in Figure 1 and described below.

FIGURE 1

Flow chart detailing the process of compiling the CC⁺ database and website. The CC⁺ database is compiled from the structures from the RCSB PDB (Bittrich et al., 2022) and models from the AlphaFold Protein Structure Database (https://alphafold.ebi.ac.uk/) (Varadi et al., 2022). These files are scanned using DSSP (Joosten et al., 2011; Kabsch & Sander, 1983) and then Socket2 (Kumar & Woolfson, 2021) to identify KIH interactions and assign CCs. The organized output is stored in a MySQL database comprising tables of CCs and associated data. MultAlin (Corpet, 1988) and CD-HIT (Fu et al., 2012; Li & Godzik, 2006) are used to align sequences and calculate redundancy, respectively. A user-friendly website has been developed to provide easy access to and searching of the stored data at http://coiledcoils.chm.bris.ac.uk/CCPlus/Home.html.

PDB-formatted files were obtained from the January 2023 release of the RCSB PDB (Bittrich et al., 2022), and predicted models for 48 proteomes were downloaded from the AlphaFold Protein Structure Database (https://alphafold.ebi.ac.uk/; Varadi et al., 2022). For the PDB structures, the corresponding asymmetric and biological units were also considered in the following process. DSSP (Joosten et al., 2011; Kabsch & Sander, 1983) and Socket2 (Kumar & Woolfson, 2021) were used to identify α-helical regions and to assign CCs, respectively. MultAlin (Corpet, 1988) was then used to determine if partnering helices had the same or different sequences. CD-HIT (Fu et al., 2012; Li & Godzik, 2006) was used to categorize sequences into four groups based on redundancy: ≤50% identity, ≤70% identity, non-redundant (<100% identity), and redundant (i.e., all examples are included, including those with identical sequences). As described below, these categories can be used to focus user-defined searches. Biopython (Cock et al., 2009) was employed to generate PDB files for just the CC regions. CC assignments and associated data were stored in a MySQL database for easy access and searchability. The full set of these assignments, or “default structures,” is summarized in Figure 2, where they are broken down according to the number and orientation of α helices in each CC.

The user-accessible frontend of CC⁺ was developed using HTML, JavaScript, and CSS, and the backend using CGI/Perl, HTML, JavaScript, and CSS. The MySQL database was used to create the associated sequence and parameter tables. For on-the-fly sequence analyses, the NumPy and pandas packages of Python were used to generate PSSMs. The SciPy Stats module of Python was used for the χ² tests. Matplotlib (Hunter, 2007) and PyMOL (Schrödinger, 2021) were used to generate plots and figures for protein structures, respectively. The CC⁺ database is available at http://coiledcoils.chm.bris.ac.uk/CCPlus/Home.html.

3 USING THE CC⁺ DATABASE

3.1 Overview

The CC⁺ Homepage provides links to two form-based “Dynamic Interface” tabs, one each for customizable searches of CCs in PDB structures and AlphaFold2 models, CCPlus-PDB (Figure 3), and CCPlus-AlphaFold, respectively. These are described in detail below. The “Statistics” tab provides users with quick-glance summaries of the content of the current version of the database in terms of the number of CCs categorized as intra- (same chain) and inter-molecular (multiple chains), the number of helices/oligomeric state in the CC assemblies, and the arrangement of helices within these (parallel or antiparallel). These summaries can be tailored by users selecting the Socket2 cut-off value (7, 7.5, 8, 8.5, or 9 Å) and the sequence redundancy or similarity (50%, 70%, non-identical, redundant) used to find the CCs. Finally, the “Documentation” tab provides users with comprehensive descriptions of Socket2 (Kumar & Woolfson, 2021), and advice on using the Dynamic Interface, what can be done when searches fail or return no hits, and how data can be downloaded and used.

3.2 Searching the dynamic interfaces

A major advance in protein science since the launch of the CC⁺ database in 2009 has been the success of AlphaFold2 in predicting protein structures (Jumper et al., 2021) and its application to 48 complete genomes (https://alphafold.ebi.ac.uk/) (Varadi et al., 2022). Therefore, along with CC structures found by Socket2 (Kumar & Woolfson, 2021) in the RCSB PDB (Bittrich et al., 2022), for the updated CC⁺, we have included those found in AlphaFold2 models predicted from these 48 proteomes. However, to avoid confusion between experimental structures and predicted models, and to give users control over how they use (separate or combine) these data, we have separated the backend data and frontend searches of the two groups via the CCPlus-PDB and CCPlus-AlphaFold “Dynamic Interface” tabs. As detailed below, some of the subtabs used to search these are common to both parts of the CC⁺ database, but others are unique to each arm of the database. In all cases, searches are initiated by clicking the Search CC⁺ button, and default values for each of the parameters can be regained with the Reset button. Searches using default values include any number of canonical helices that are over 11 residues in length, in any orientation, with any type of partnering, from any number of chains, and with 50% sequence identity or less.

3.2.1 Specify PDB IDs

This subtab applies only to CCPlus-PDB. It can be used to find any CCs in a chosen PDB file, and it can be combined with other search parameters. However, to avoid missing any CCs, we advise setting the Redundancy parameter to “redundant” and leaving the other parameters in the “Specify Structures” subtab at their default settings.

3.2.2 Specify structures

This subtab has undergone major updates. For both the CCPlus-PDB (Figure 4a) and CCPlus-AlphaFold (Figure 4b) tabs, a slider allows users to choose the Socket2 cut-off value for identifying CCs. The default value is 7 Å, which we recommend using, as the other values (7.5, 8, 8.5, and 9 Å) are increasingly less stringent and may pull in non-coiled-coil regions.
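The effect of the cut-off on stringency can be illustrated with a deliberately simplified sketch. It is not the Socket2 algorithm, whose full criteria are described in the Socket2 publication. It assumes only that a candidate knob is a residue with at least four residues of a neighboring helix within the cut-off distance, measured here between hypothetical side-chain centre coordinates.

```python
import math


def knob_candidates(helix_a, helix_b, cutoff=7.0):
    """Indices of residues in helix_a with >= 4 residues of helix_b in range.

    helix_a, helix_b: lists of (x, y, z) side-chain centres in angstroms.
    This is a toy criterion for illustration, not the Socket2 definition.
    """
    knobs = []
    for index, centre in enumerate(helix_a):
        near = [other for other in helix_b if math.dist(centre, other) <= cutoff]
        if len(near) >= 4:
            knobs.append(index)
    return knobs


# Toy coordinates: one residue with neighbours at 6.0, 6.5, 7.2, and 8.0 angstroms.
centre = [(0.0, 0.0, 0.0)]
neighbours = [(6.0, 0.0, 0.0), (0.0, 6.5, 0.0), (0.0, 0.0, 7.2), (-8.0, 0.0, 0.0)]

for cutoff in (7.0, 7.5, 8.0):
    print(cutoff, knob_candidates(centre, neighbours, cutoff))
```

With these toy coordinates, the residue qualifies only at the 8 Å cut-off, which mirrors how looser cut-offs admit contacts that the default 7 Å value excludes.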

The “CCPlus-PDB” tab (Figure 4a) offers several search parameters for CCs, including: Redundancy, as introduced above; the number of α Helices and their relative Orientation; whether the Partnering helices have the same (homo-mers) or different (hetero-mers) sequences; whether the helices are from the same or different polypeptide Chains; if the underlying CC sequence Repeats are heptad or non-heptad based; and the minimum Length of the CC helices. In addition to these options adopted from the original CC⁺, users can now specify the Protein Type (membrane or globular); the Experiment Type used to solve the parent structure; and request specific Modified Residues, that is, non-proteinogenic residues. When Experiment Type is defined, a Resolution range can be added by the user.

As AlphaFold2 predicted models are for single chains and contain only the 20 standard proteinogenic residues, the “CCPlus-AlphaFold” tab (Figure 4b) offers similar search parameters to the above but without options for the number of Chains, Protein Type, Experiment Type, and Modified Residues. These searches can also be filtered by a minimum-to-maximum range of pLDDT Score for the predicted CC regions.

3.2.3 Specify sequences

Searches can be defined further using this subtab (Figure 5a) to find CCs that contain specified sequences or sequence patterns. Users can enter plain text for a query sequence or use PROSITE (Sigrist et al., 2010) notation to find sequence patterns. Furthermore, these sequences and patterns can be requested to fall at specified positions of the heptad repeats using the a–g heptad notation.
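As an illustration of this kind of query, the sketch below converts a PROSITE-style pattern into a Python regular expression and then keeps only matches that begin at a chosen heptad position. How CC⁺ implements these searches internally is not described here, so the function names and the supported subset of the PROSITE syntax (x, bracketed sets, braces for exclusions, repeat counts, and the < and > anchors) are assumptions made for this example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchor to the N terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchor to the C terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        # Bracketed sets such as [ILV] and single letters pass through unchanged.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_in_register(sequence, register, pattern, first_position):
    """Start indices of matches whose first residue is at `first_position`."""
    # A lookahead is used so that overlapping matches are also reported.
    compiled = re.compile("(?=(%s))" % prosite_to_regex(pattern))
    return [
        match.start()
        for match in compiled.finditer(sequence)
        if register[match.start()] == first_position
    ]


print(prosite_to_regex("L-x(2)-[IV]-x(3)-{P}"))
print(find_in_register("AALKEIEELA", "fgabcdefga", "L-x(2)-[IV]", "a"))
```

Here the sequence and register strings are assumed to be aligned, one register letter per residue.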

The Register field of this subtab can be used to search for non-canonical repeats. For instance, 11-residue, hendecad repeats can be found by entering “abcdefgdefg.” This is because Socket2 locates KIH interactions and then assigns them as a–g sites only, as heptad repeats predominate in CC sequences and structures. In the case of hendecad repeats, the hijk positions are analogous to an addition of defg to a heptad repeat.
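The relabeling of hendecad positions can be made explicit with a small sketch. The mapping follows the statement above that the hijk positions are analogous to an added defg; the function names and the example register string are hypothetical.

```python
# Map the eleven hendecad positions (a-k) onto the a-g labels.
HENDECAD_TO_HEPTAD = dict(zip("abcdefghijk", "abcdefgdefg"))


def as_heptad_labels(hendecad_register):
    """Rewrite an a-k register string using a-g labels only."""
    return "".join(HENDECAD_TO_HEPTAD[position] for position in hendecad_register)


def find_repeat(register, unit):
    """Start indices of every occurrence of `unit` in a register string."""
    hits, start = [], register.find(unit)
    while start != -1:
        hits.append(start)
        start = register.find(unit, start + 1)
    return hits


query = as_heptad_labels("abcdefghijk")
# Hypothetical register: a heptad, then a hendecad, then another heptad.
register = "abcdefg" + "abcdefgdefg" + "abcdefg"

print(query)
print(find_repeat(register, query))
```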

3.2.4 Specify interactions

Finally, searches can be defined even further using this subtab (Figure 5b), which allows users to identify residue–residue interactions within a specified distance in a CC search. This has the option to specify the register positions of the interacting residues. Thus, this feature allows the compilation of CC datasets with potential residue–residue interactions that underpin sequence-to-structure relationships in CCs.

3.3 Displaying and using results from CC⁺ searches

On completion of a search, the results are displayed in a new page (Figure 6). Here, users can choose several options to display and analyze the data. Each “Results page” has four main features: (i) “Selected Values for the Current Search,” which simply tabulates the current selection criteria and their values; (ii) the “Results” section itself, which is explained below; (iii) an associated “Gallery” of cartoons for the CCs returned, which allows quick access to their structures and sequences; and (iv) a “Search Again” tab, which allows the user to search the database with a new set of parameters.

In more detail, the “Results” section has two main features: (a) a list of the returned CCs; and (b) options to download associated data from the search. Subtabs display the list of CCs in “Summary,” “Gallery,” or “Compact” forms. “Gallery” is the default view, and it provides clickable images summarizing the architecture of each CC. Users can select the number of CCs displayed, with 25 as the default. In all three displays, the images or PDB codes are also clickable links to a Socket2-based GUI for visualizing the CC in the context of the whole structure. This is described in more detail below.

Above these display items, users are given options for downloading data from the search. For instance, the “Profiles” button links to PSSMs for the selected CCs; that is, 20 × 7 tables for the occurrence of each amino acid at the seven positions of the heptad repeats, which can be normalized internally or using amino-acid frequencies from UniProt/Swiss-Prot (UniProt, 2023). In addition, the adjacent drop-down menu and “Save” button allow users to download flat CSV files for the “Summary and CC Sequences,” or a zipped file giving the “3D-Coordinates” of the CC regions only in PDB file format. For the latter, the CCs are extended by one residue at each terminus of the constituent helices. These summary, sequence, and coordinate files are provided to facilitate further user-specified analyses and applications of the CC datasets.
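A 20 × 7 profile of this kind can be sketched in plain Python as follows. The exact normalization used by CC⁺ is not specified here, so the propensity below (frequency at a heptad position divided by a background frequency) is an assumption, as are the function names. When no background is supplied, the residue frequencies of the dataset itself are used as one possible form of internal normalization.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HEPTAD = "abcdefg"


def heptad_counts(pairs):
    """Count residues at each heptad position.

    pairs: iterable of (sequence, register) strings of equal length.
    """
    counts = {position: Counter() for position in HEPTAD}
    for sequence, register in pairs:
        for residue, position in zip(sequence, register):
            if residue in AMINO_ACIDS and position in counts:
                counts[position][residue] += 1
    return counts


def propensities(counts, background=None):
    """Convert raw counts into a 7 x 20 table of propensities.

    background: optional {residue: frequency}, for example from an external
    sequence database; if None, frequencies are taken from `counts`.
    """
    if background is None:
        totals = Counter()
        for row in counts.values():
            totals.update(row)
        n_total = sum(totals.values())
        background = {aa: totals[aa] / n_total for aa in AMINO_ACIDS}
    table = {}
    for position, row in counts.items():
        n_position = sum(row.values())
        table[position] = {
            aa: (row[aa] / n_position) / background[aa]
            if n_position and background.get(aa)
            else 0.0
            for aa in AMINO_ACIDS
        }
    return table


# Toy input: two seven-residue segments with their register assignments.
pairs = [("LKELEEK", "abcdefg"), ("IKKLEEK", "abcdefg")]
counts = heptad_counts(pairs)
table = propensities(counts)
print(counts["d"]["L"], round(table["d"]["L"], 2))
```

Values above 1 indicate residues that are enriched at a position relative to the background.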

Finally, if the database does not provide any hits for a set of search parameters, the “Results” section will display the message “No Results Satisfying the Input Criteria.” In such cases, users are encouraged to redefine their search criteria or to visit the “Documentation” tab.

3.4 Improvements on the original 2009 webserver

The database and webserver have undergone significant changes as detailed below.

3.4.1 New user interface

The user interface has been updated to ease searches and visualization of results. The “Dynamic Interface” allows users to search for CCs in the PDB or the AlphaFold Protein Structure Database. The “Documentation” tab has been updated accordingly.

3.4.2 New search parameters

The updated CC⁺ uses Socket2 to provide broader searches, for example, CCs that contain glycine or modified residues. As noted above, users can now specify Protein Type as listed in the RCSB PDB, and Resolution (in the range 0 – 100 Å) for appropriate Experiment Types.

3.4.3 Sequences with modified residues

The MODRES record of the PDB file format provides information about the modified residue and the corresponding proteinogenic amino acid. Socket2 uses this record, allowing CCs that contain non-proteinogenic residues to be identified. CC⁺ can be searched for examples via the “Specify Structures” tab by providing the three-letter code for the non-proteinogenic residue. Typing the first letter of the code activates a drop-down box of non-proteinogenic residues for the user to choose from.

As of January 2023, in the redundant set of CCs found at a cut-off value of 7 Å, 2347 CCs were found with modified residues. Of these, modified methionine (MSE) was the most common, accounting for ≈90% of the examples. Others included modified phenylalanine and lysine residues. The highest number of non-proteinogenic residues that we found in a CC helix was 7 in some α/β-peptide foldamers (PDB IDs: 2oxj, 2oxk, and 3c3g; Horne et al., 2007, 2008).

3.4.4 Ability to query the AlphaFold Protein Structure Database

The new CC⁺ contains CCs found in the AlphaFold2 predicted models from 48 proteomes downloaded from https://alphafold.ebi.ac.uk/ (Varadi et al., 2022). Users can search for CCs from individual proteomes or across all 48 proteomes. However, as the predicted models are for single chains and include only proteinogenic residues, some search options of the CCPlus-PDB tab are not available.

When searching CCPlus-AlphaFold, users are given the option to specify ranges of the predicted local distance difference test (pLDDT Score). This is a per-residue confidence score for the AlphaFold2 prediction scaled 0 – 100. Scores >90 indicate high confidence, while scores <50 indicate low-confidence predictions. When specified, this option will search for all CCs in CC⁺ with an average pLDDT value within the input range. This average is calculated for the residues of the participating helices rather than over the whole protein structure.
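The helix-restricted average can be sketched as follows. AlphaFold2 model files store the per-residue pLDDT in the B-factor field of the PDB format (columns 61–66), and the sketch reads it from the Cα atom of each residue. The function names are illustrative, and whether CC⁺ averages per residue or per atom is an assumption of this example.

```python
def mean_plddt(pdb_lines, segments):
    """Mean pLDDT over the residues of the given helical segments.

    pdb_lines: lines of an AlphaFold2 model in PDB format (single chain).
    segments: list of (first_residue, last_residue) pairs, inclusive.
    Returns None if no residue falls inside the segments.
    """
    scores = []
    for line in pdb_lines:
        # One value per residue: use the C-alpha atom record only.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue_number = int(line[22:26])
            if any(first <= residue_number <= last for first, last in segments):
                scores.append(float(line[60:66]))  # B-factor field holds pLDDT
    return sum(scores) / len(scores) if scores else None


def in_plddt_range(average, minimum, maximum):
    """True if a segment average lies within the user-defined range."""
    return average is not None and minimum <= average <= maximum
```

For a model read with `open(path).readlines()`, where `path` is a placeholder for a downloaded file, calling `mean_plddt(lines, [(12, 40), (55, 83)])` would average over two hypothetical helices spanning residues 12–40 and 55–83 only, ignoring the rest of the chain.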

Currently, CCPlus-AlphaFold only contains CCs that are within the same polypeptide chain; that is, only intramolecular CCs. This is a necessary consequence of the publicly available AlphaFold2 models being limited to predictions of tertiary structures. As discussed below, we are working to remedy this by including AlphaFold2-based predictions of homomeric quaternary structures at least.

3.4.5 Downloadable resources

As introduced above, users have the option to download the results of CC⁺ as a flat CSV file. Furthermore, using Biopython (Cock et al., 2009), PDB files containing only the selected CC regions can be downloaded; in this case, the helices are extended by a single residue at each terminus. By clicking on the “Profiles” button of the “Results” page, users can download the raw counts for the occurrence of the 20 standard residues at each heptad position and the Swiss-Prot or internally normalized propensity tables (PSSMs).
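The one-residue extension can be illustrated with a minimal sketch. CC⁺ uses Biopython for this step; the version below swaps in plain fixed-column parsing of PDB records so that it has no dependencies, and the function name and input layout are assumptions made for this example.

```python
def extract_cc_region(pdb_lines, helices, extend=1):
    """Keep coordinate records for the CC helices, padded at both termini.

    pdb_lines: lines of a PDB-format file.
    helices: list of (chain_id, first_residue, last_residue) tuples.
    extend: number of residues added at each terminus of every helix.
    """
    kept = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        chain_id = line[21]  # column 22 of the PDB format
        residue_number = int(line[22:26])  # columns 23-26
        for chain, first, last in helices:
            if chain_id == chain and first - extend <= residue_number <= last + extend:
                kept.append(line)
                break
    return kept
```

For a helix assigned to residues 3–5 of chain A, the default call returns the records for residues 2–6 of that chain. This simple residue-number test ignores insertion codes and alternate locations, which a full parser such as Biopython handles.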

3.4.6 Visualization of the results

On the “Results” page, each search is presented as “Summary,” “Gallery,” and “Compact” views. Here, the figure and the PDB ID act as clickable links to an interactive GUI running Socket2 (Kumar & Woolfson, 2021) to visualize and download images of that CC. As detailed in the Socket2 publication (Kumar & Woolfson, 2021), the GUI gives multiple options for visualizing the component CCs of a structure/model and their associated sequences. Via subtabs, it also provides some analysis of the structures such as CC “Register,” “Angle Between Helices,” and “Core-packing Angles” for the KIH packing.

3.4.7 Regular updates

The webserver is designed to update automatically on the first Thursday of each month. However, updates for the CCPlus-AlphaFold part of the website require manual intervention. We aim to perform these updates regularly to keep the database current with any changes in the AlphaFold Protein Structure Database (https://alphafold.ebi.ac.uk/; Varadi et al., 2022).

4 RESULTS

As described above, the new CC⁺ database allows searches of CC structures and predicted models from the RCSB PDB and AlphaFold Protein Structure Database, respectively. The new features that we have introduced allow users to search for CCs that are either broadly defined or highly specified in terms of structure, sequence, protein type, experimental methods for structure determination, organism, and so on. We have implemented many of these features to give flexibility to users, and because we could not possibly anticipate all searches that users might choose to run. Therefore, here we do not give an exhaustive list or overview of what searches can be done and the data that might be retrieved. Instead, we show a few examples of the shape of the data that CC⁺ contains and how this can be accessed by users.

4.1 Distribution of different CCs across dataset

As summarized in the “Statistics” tab and shown in Figure 2, two-helix CCs make up almost 80% of the total CCs from the PDB, and 87% in the AlphaFold Protein Structure Database. Three-helix CCs constitute 14% and 11% of the total in PDB and AlphaFold2 datasets, respectively, whereas four-helix CCs represent 4% and 1%, respectively. For the CCs from the PDB, intramolecular antiparallel interactions dominate, although the proportions of parallel and antiparallel interactions even out with increasing numbers of helices in the assemblies. Indeed, the higher-order CC structures in the PDB with >5 helices are mostly parallel, intermolecular assemblies. Because the current models in the AlphaFold Protein Structure Database are for tertiary structures only, the predicted CCs are within the same chain and predominantly have antiparallel helices. Nonetheless, among the higher-order CC predictions, at least those captured by the most-stringent Socket2 cut-off of 7 Å, there are six predicted parallel five-helix CCs.

Interestingly, there are some very large CC assemblies with >6 helices, which Socket2 can now identify (Kumar & Woolfson, 2021). Using the default parameters and ≤ 70% sequence identity, a total of 24 such CCs were found in the CCPlus-PDB Database. Of these, the largest true assembly has 15 parallel helices, which is in the biological unit of the rotor ring (c subunit) of the proton-dependent ATP synthase (PDB ID, 2wie; Pogoryelov et al., 2009). In this structure, 15 helical hairpins are arranged side-by-side to form two concentric rings. The inner helices contributed by each hairpin have KIH interactions extending for just eight residues and form a 15-helix coiled-coil barrel. This class of protein presents an interesting case. For instance, the ATP synthase from Bacillus pseudofirmus OF4 has a rotor ring with 13 parallel helices and a slightly longer 11-residue stretch of KIH interactions (PDB ID, 4cbj; Preiss et al., 2014). There is a note of caution for Socket2 users too. The program is generally accurate in identifying CCs. However, it can misassign oligomeric state/number of helices. For instance, for the same part from the Escherichia coli ATP synthase (PDB ID, 5t4o; Sobti et al., 2016), Socket2 indicates a 20-helix CC. However, upon inspection, the structure comprises two concentric 10-helix rings. Similarly, the rotor ring of the mycobacterial ATP synthase (PDB ID, 4v1g; Preiss et al., 2015) is a nonamer, but Socket2 identifies it as an 18-mer. The source of the misclassification is that the rings have KIH packing within them and between them; effectively, there are rings of three-helix bundles, which Socket2 interprets as a single, contiguous, larger ring. This would be difficult to correct for such a small, though interesting, class of structures. Therefore, at this stage, we advise using the visualizer in Socket2 or an external viewer to inspect and verify unusual CC classifications from Socket2 and CC⁺.

4.2 Comparison of CCs in PDB structures and predicted AlphaFold2 models

We note that for the current version of CC⁺, direct comparisons between the PDB- and AlphaFold2-derived datasets are limited because CCPlus-PDB contains both intra- and inter-molecular CCs, whereas, at present, CCPlus-AlphaFold necessarily only has the former. We discuss how this is being addressed below. Nonetheless, the currently available structures and models do allow some comparisons to be made.

Using search parameters of 7 Å Socket2 cut-off and ≤ 70% sequence identity, CCPlus-PDB and CCPlus-AlphaFold currently house ≈9000 structures and ≈91,000 models, respectively. Both datasets predominantly have CCs with canonical (heptad-based) sequence repeats. As CC⁺ allows searches of CCs with non-canonical repeats, we asked if the proportions of these differed between the two datasets. Out of 8838 and 90,669 CCs, we found 755 (8.54%) and 6116 (6.75%) non-canonical CCs in CCPlus-PDB and CCPlus-AlphaFold datasets (Figure 2), respectively.
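The proportions quoted above follow directly from the counts, and a difference between two such proportions can be assessed with a χ² test on the 2 × 2 table of canonical versus non-canonical CCs. The sketch below uses a hand-written Pearson statistic rather than SciPy so that it is self-contained. It is offered as an illustration of the approach only; whether this particular comparison was tested in the original analysis is not stated.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]].

    No continuity correction is applied.
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Counts quoted in the text (7 angstrom cut-off, <= 70% sequence identity).
pdb_total, pdb_noncanonical = 8838, 755
af_total, af_noncanonical = 90669, 6116

pdb_percent = 100 * pdb_noncanonical / pdb_total
af_percent = 100 * af_noncanonical / af_total

statistic = chi_square_2x2(
    pdb_noncanonical,
    pdb_total - pdb_noncanonical,
    af_noncanonical,
    af_total - af_noncanonical,
)

print(round(pdb_percent, 2), round(af_percent, 2))
print(statistic)
```

The statistic can be compared with the critical value of 3.84 for one degree of freedom at the 5% significance level.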

Next, we compared the experimental and modeled CCs as a function of minimum length binned into five categories: (i) <11, (ii) 11–14, (iii) 14–21, (iv) 21–28, and (v) >28 residues (Figure 7). For the two-helix structures, the length distributions of the CCs from the two datasets were similar. However, above this, the predicted CCs from the AlphaFold2 dataset tended to be shorter than the corresponding classes from the PDB dataset. We suspect that this is related to the CCs predicted by AlphaFold2 being intra-molecular. Therefore, it will be interesting to see if and how these distributions change when predicted quaternary structures become available.

Both datasets have examples of CCs with helices >100 residues. For instance, in the PDB dataset, a dimeric CC in a cryo-EM structure of the motor-protein dynein tail-dynactin-BICD2N complex (PDB ID, 5afu) (Urnavicius et al., 2015) has a helix spanning 165 residues. In the AlphaFold2 dataset, a tropomyosin-like protein (UniProt ID: A0A077ZIM1) from Trichuris trichiura is predicted to have a two-helix antiparallel CC with a helix of 176 residues, although the average pLDDT score is in the confidence range of 70–90.

5 CONCLUSION

We have described the structure, main features, and a small number of many possible uses of an updated database of coiled-coil (CC) structures and predicted models, the CC⁺ database. The CCs are found using the program Socket2 (Kumar & Woolfson, 2021), which identifies the signature KIH packing between neighboring α helices of CCs. Therefore, it does not rely on, and is not biased by, sequence-based CC predictions, which are not always consistent or reliable (Simm et al., 2021). The new CC⁺ database includes both experimentally derived CCs from the RCSB PDB (Bittrich et al., 2022) and predicted AlphaFold2 models for 48 proteomes (https://alphafold.ebi.ac.uk/; Varadi et al., 2022). This represents a significant expansion of CC⁺ since its inception in 2009.

We recognize that the PDB and AlphaFold2 parts of the new CC⁺ database are not directly comparable because currently available AlphaFold2 predictions are only for protomers, that is, single-chain, tertiary structures. We are working to remedy this, and we aim to link the two parts of the database fully in the future. For instance, recent work has predicted homo-oligomeric quaternary-structure models based on AlphaFold2 tertiary-structure predictions for four proteomes (Schweke et al., 2023). Interestingly, application of Socket2 to these models indicates that CCs are major enablers of PPIs, particularly in eukaryotes. However, because this study considers PPIs and quaternary structures more widely, and is distinct from the update of the CC⁺ database presented here, it will be presented elsewhere (Schweke et al., 2023).

The new version of CC⁺ is available at http://coiledcoils.chm.bris.ac.uk/CCPlus/Home.html. It can be searched in a wide variety of ways at the sequence and structural levels to generate user-defined datasets. In turn, the identified CCs can be visualized and analyzed in a user-friendly GUI as part of Socket2. There are also options for analyzing datasets within CC⁺. Alternatively, the datasets (sequences, coordinates, and metadata) can be downloaded in bulk for offline analysis. Thus, we hope that CC⁺ will be useful to many users interested in an array of CC chemistry, structure, and biology. Possible uses range from gathering examples of related CC structures for basic biological research to garnering sequence-to-structure/function relationships to underpin protein design and engineering projects.

AUTHOR CONTRIBUTIONS

Project conceptualization: Prasun Kumar and Derek N. Woolfson. Database and front-end development: Prasun Kumar and Derek N. Woolfson. Data analyses: All authors. Writing of the manuscript: Prasun Kumar, Rokas Petrenas, Emmanuel D. Levy, and Derek N. Woolfson. All authors have read and commented on the manuscript, including the final version.

ACKNOWLEDGMENTS

Prasun Kumar was supported by the Biotechnology and Biological Sciences Research Council (BBSRC) grant to Derek N. Woolfson (BB/R00661X/1). Rokas Petrenas was supported by a BBSRC-funded PhD studentship (SWBio DTP). William M. Dawson and Derek N. Woolfson were funded by a European Research Council Advanced Grant (340764) and a subsequent European Research Council Proof of Concept Grant (787173). Derek N. Woolfson was also supported by BrisSynBio, a BBSRC/Engineering and Physical Sciences Research Council-funded Synthetic Biology Research Centre (BB/L01386X/1). Emmanuel D. Levy and Hugo Schweke acknowledge support from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 819318), the Israel Science Foundation (grant no. 1452/18), and the Abisch-Frenkel Foundation.

REFERENCES

Bittrich S, Rose Y, Segura J, Lowe R, Westbrook JD, Duarte JM, et al. RCSB protein data Bank: improved annotation, search and visualization of membrane protein structures archived in the PDB. Bioinformatics. 2022; 38(5): 1452–1454.
10.1093/bioinformatics/btab813
Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009; 25(11): 1422–1423.
10.1093/bioinformatics/btp163
Corpet F. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 1988; 16(22): 10881–10890.
10.1093/nar/16.22.10881
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28(23): 3150–3152.
10.1093/bioinformatics/bts565
Hartmann MD. Functional and structural roles of coiled coils. Subcell Biochem. 2017; 82: 63–93.
10.1007/978-3-319-49674-0_3
Heal JW, Bartlett GJ, Wood CW, Thomson AR, Woolfson DN. Applying graph theory to protein structures: an atlas of coiled coils. Bioinformatics. 2018; 34(19): 3316–3323.
10.1093/bioinformatics/bty347
Horne WS, Price JL, Gellman SH. Interplay among side chain sequence, backbone composition, and residue rigidification in polypeptide folding and assembly. Proc Natl Acad Sci U S A. 2008; 105(27): 9151–9156.
10.1073/pnas.0801135105
Horne WS, Price JL, Keck JL, Gellman SH. Helix bundle quaternary structure from alpha/beta-peptide foldamers. J Am Chem Soc. 2007; 129(14): 4178–4180.
10.1021/ja070396f
Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007; 9(3): 90–95.
10.1109/MCSE.2007.55
Joosten RP, te Beek TA, Krieger E, Hekkelman ML, Hooft RWW, Schneider R, et al. A series of PDB related databases for everyday needs. Nucleic Acids Res. 2011; 39(Database issue): D411–D419.
10.1093/nar/gkq1105
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021; 596(7873): 583–589.
10.1038/s41586-021-03819-2
Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983; 22(12): 2577–2637.
10.1002/bip.360221211
Korendovych IV, DeGrado WF. De novo protein design, a retrospective. Q Rev Biophys. 2020; 53:e3.
10.1017/S0033583519000131
Kumar P, Woolfson DN. Socket2: a program for locating, visualising, and analysing coiled-coil interfaces in protein structures. Bioinformatics. 2021; 37(23): 4575–4577.
10.1093/bioinformatics/btab631
Kuriyan J, Konforti B, Wemmer D. The molecules of life: physical and chemical principles. Garland Science, New York; 2012.
10.1201/9780429258787
Lapenta F, Aupic J, Strmsek Z, Jerala R. Coiled coil protein origami: from modular design principles towards biotechnological applications. Chem Soc Rev. 2018; 47(10): 3530–3542.
10.1039/C7CS00822H
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006; 22(13): 1658–1659.
10.1093/bioinformatics/btl158
Lupas AN, Bassler J. Coiled coils - a model system for the 21st century. Trends Biochem Sci. 2017; 42(2): 130–140.
10.1016/j.tibs.2016.10.007
Lupas AN, Bassler J, Dunin-Horkawicz S. The structure and topology of alpha-helical coiled coils. Subcell Biochem. 2017; 82: 95–129.
10.1007/978-3-319-49674-0_4
Lupas AN, Gruber M. The structure of alpha-helical coiled coils. Adv Protein Chem. 2005; 70: 37–78.
10.1016/S0065-3233(05)70003-6
McDowall MD, Scott MS, Barton GJ. PIPs: human protein–protein interaction prediction database. Nucleic Acids Res. 2009; 37(Database issue): D651–D656.
10.1093/nar/gkn870
Moutevelis E, Woolfson DN. A periodic table of coiled-coil protein structures. J Mol Biol. 2009; 385(3): 726–732.
10.1016/j.jmb.2008.11.028
Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, et al. The MIPS mammalian protein–protein interaction database. Bioinformatics. 2005; 21(6): 832–834.
10.1093/bioinformatics/bti115
Pogoryelov D, Yildiz O, Faraldo-Gomez JD, Meier T. High-resolution structure of the rotor ring of a proton-dependent ATP synthase. Nat Struct Mol Biol. 2009; 16(10): 1068–1073.
10.1038/nsmb.1678
Preiss L, Langer JD, Hicks DB, Liu J, Yildiz Ö, Krulwich TA, et al. The c-ring ion binding site of the ATP synthase from bacillus pseudofirmus OF4 is adapted to alkaliphilic lifestyle. Mol Microbiol. 2014; 92(5): 973–984.
10.1111/mmi.12605
Preiss L, Langer JD, Yildiz O, Eckhardt-Strelau L, Guillemont JEG, Koul A, et al. Structure of the mycobacterial ATP synthase Fo rotor ring in complex with the anti-TB drug bedaquiline. Sci Adv. 2015; 1(4):e1500106.
10.1126/sciadv.1500106
Schrödinger LLC. The PyMOL molecular graphics system open-source, Version 2.4.0. 2021.
Schweke H, Levin T, Pacesa M, Goverde CA, Kumar P, Duhoo Y, et al. An atlas of protein homo-oligomerization across domains of life. bioRxiv. 2023. https://doi.org/10.1101/2023.06.09.544317
Sigrist CJ, Cerutti L, de Castro E, Langendijk-Genevaux PS, Bulliard V, Bairoch A, et al. PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res. 2010; 38(Database issue): D161–D166.
10.1093/nar/gkp885
Simm D, Hatje K, Waack S, Kollmar M. Critical assessment of coiled-coil predictions based on protein structure data. Sci Rep. 2021; 11(1):12439.
10.1038/s41598-021-91886-w
Sobti M, Smits C, Wong AS, Ishmukhametov R, Stock D, Sandin S, et al. Cryo-EM structures of the autoinhibited E. Coli ATP synthase in three rotational states. Elife. 2016; 5:e21598.
10.7554/eLife.21598
Szczepaniak K, Bukala A, da Silva Neto AM, Ludwiczak J, Dunin-Horkawicz S. A library of coiled-coil domains: from regular bundles to peculiar twists. Bioinformatics. 2020; 36(22–23): 5368–5376.
10.1093/bioinformatics/btaa1041
Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023; 51(D1): D638–D646.
10.1093/nar/gkac1000
Testa OD, Moutevelis E, Woolfson DN. CC+: a relational database of coiled-coil structures. Nucleic Acids Res. 2009; 37(Database issue): D315–D322.
10.1093/nar/gkn675
UniProt C. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023; 51(D1): D523–D531.
10.1093/nar/gkac1052
Urnavicius L, Zhang K, Diamant AG, Motz C, Schlager MA, Yu M, et al. The structure of the dynactin complex and its interaction with dynein. Science. 2015; 347(6229): 1441–1446.
10.1126/science.aaa4080
Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022; 50(D1): D439–D444.
10.1093/nar/gkab1061
Walshaw J, Woolfson DN. Socket: a program for identifying and analysing coiled-coil motifs within protein structures. J Mol Biol. 2001; 307(5): 1427–1450.
10.1006/jmbi.2001.4545
Woolfson DN. Coiled-coil design: updated and upgraded. Subcell Biochem. 2017; 82: 35–61.
10.1007/978-3-319-49674-0_2
Woolfson DN. Understanding a protein fold: the physics, chemistry, and biology of alpha-helical coiled coils. J Biol Chem. 2023; 299(4):104579.
10.1016/j.jbc.2023.104579
Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D. DIP: the database of interacting proteins. Nucleic Acids Res. 2000; 28(1): 289–291.
10.1093/nar/28.1.289

Volume 32, Issue 11, November 2023, e4789

This article also appears in: Tools for Protein Science 2024
