Volume 33, Issue 9 e5136
TOOLS FOR PROTEIN SCIENCE
Open Access

ARCIMBOLDO at low resolution: Verification for coiled coils and globular proteins

Iracema Caballero

Instituto de Biología Molecular de Barcelona (IBMB-CSIC), Barcelona Science Park, Barcelona, Spain

Contribution: Investigation, Writing - original draft, Software, Methodology, Data curation, Validation

Albert Castellví

Instituto de Biología Molecular de Barcelona (IBMB-CSIC), Barcelona Science Park, Barcelona, Spain

Contribution: Investigation, Methodology

Josep Triviño

Instituto de Biología Molecular de Barcelona (IBMB-CSIC), Barcelona Science Park, Barcelona, Spain

Contribution: Software, Investigation, Methodology

Elisabet Jiménez

Instituto de Biología Molecular de Barcelona (IBMB-CSIC), Barcelona Science Park, Barcelona, Spain

Contribution: Investigation, Methodology, Software

Nicolas Soler

Instituto de Biología Molecular de Barcelona (IBMB-CSIC), Barcelona Science Park, Barcelona, Spain

Contribution: Investigation, Methodology

Rafael Junqueira Borges

Instituto de Biología Molecular de Barcelona (IBMB-CSIC), Barcelona Science Park, Barcelona, Spain

Contribution: Writing - original draft, Investigation, Methodology, Software

Isabel Usón

Corresponding Author

Instituto de Biología Molecular de Barcelona (IBMB-CSIC), Barcelona Science Park, Barcelona, Spain

ICREA: Institució Catalana de Recerca i Estudis Avançats, Barcelona, Spain

Correspondence

Isabel Usón, Instituto de Biología Molecular de Barcelona (IBMB-CSIC), Barcelona Science Park, Baldiri Reixach 15, Barcelona 08028, Spain.

Email: [email protected]

Contribution: Software, Investigation, Writing - original draft, Funding acquisition, Supervision, Methodology

First published: 16 August 2024

Review Editor: Nir Ben-Tal

Abstract

Crystallography at low resolution must determine the atomic model from fewer experimental observations, which is challenging in the absence of a model. In addition, model bias is more severe when independent experimental data are scarce. Our methods solve the phase problem by combining the location of accurate model fragments using Phaser with density modification and interpretation of the resulting maps using SHELXE. From a partial, correct structure, the density modification process and the stereochemical constraints draw the rest of the structure, validating the result. This same principle is now exploited at low resolution. Coiled coils are important, ubiquitous structures but notoriously difficult to phase and to predict. Correct and incorrect solutions are poorly discriminated by the crystallographic figures of merit as long as helices are correctly oriented. We incorporate coiled-coil verification, designed to set up competing, incompatible structural hypotheses, both to probe the results and to establish the power of the data to discriminate them. Efficiency of coiled-coil phasing and validation in test cases from 3 to 4 Å is demonstrated in ARCIMBOLDO_LITE, placing single helices, and in ARCIMBOLDO_SHREDDER, with fragments derived from AlphaFold models. SHELXE tracing at low resolution has been enhanced, maintaining its local character but extending the environment assessment. For non-helical structures, verification is demonstrated in the fragment location process. Its use is exemplified with the solution of the VSR1 structure at 3.5 Å, depending on LLG optimization and the emergence of new features in the electron density. Relying on verification, we have extended the use of the ARCIMBOLDO software to low resolution.

1 INTRODUCTION

Crystallography provides an accurate experimental atomic structure, but on account of the phase problem, errors in the model may bias the determination (Ramachandran & Srinivasan, 1961; Terwilliger, 2004). Only the intensities of the diffracted beams are recorded in the diffraction experiment. The phases needed to compute the electron density map from which the model is built are usually provided by a starting model, used in molecular replacement (MR) phasing (Read, 2001; Rossmann, 1972). The use of predicted models (Baek et al., 2021; Jumper et al., 2021) has been extensively incorporated in crystallographic determination methods (Medina et al., 2022; Simpkin et al., 2023). Low resolution reduces the number of independent experimental data, and the crystallographic determination relies heavily on stereochemical prior knowledge (Urzhumtsev & Lunin, 2019). Typically at around 3 Å, taking solvent into account, experimental data are outnumbered by the parameters expressing the atomic positions and their average displacement (Wlodawer et al., 2008). This raises the question, at low resolution, of whether the final model is truly experimental, adding further information, or whether it is merely an initial, virtually complete model that the data could not disprove.

The absence of a model precludes model bias, and thus ab initio phasing from the recorded intensities alone was long a quest, achieved by enforcing atomicity as a constraint (Usón & Sheldrick, 1999). The use in ARCIMBOLDO of ubiquitous fragments, such as main chain alpha helices or libraries of small beta sheets, to substitute the atomicity constraint in direct methods extended ab initio phasing to medium resolution around 2.5 Å (Millán et al., 2015). A partial, accurate structure, correctly located in the asymmetric unit with Phaser (Read & McCoy, 2016), can be extended into a full structure from the density-modified map (Usón & Sheldrick, 2018), further enhanced through model building (Usón & Sheldrick, 2024). Structure completion and expansion, marked by a high correlation coefficient (Fujinaga & Read, 1987) of the SHELXE trace, is used as an indication of a successful solution. This principle forms the core of the ARCIMBOLDO methods (Rodríguez et al., 2009), which have been embraced and diversified by software tools like AMPLE (Bibby et al., 2012) and Fragon (Jenkins, 2018). A minimal starting hypothesis reveals previously unknown structural features and thus validates the solution. Conversely, a wrong solution will not evolve correct inferences.

The same principle applies at low resolution, between 3 and 6 Å, but unsurprisingly, partial models need to provide more scattering, and atomic model extension into the new density becomes uncertain (Borges et al., 2020). Proof of principle where SHELXE density modification was instrumental in drawing a solution from a partial model at low resolution was attained by phasing FtsH (Vostrukhina et al., 2015) or FasR (Lara et al., 2020).

The trade-off between starting information and available resolution is well understood for the assembly of a starting model (McCoy et al., 2017) and can be expressed through the eLLG (the expected Log-Likelihood Gain, relative to an uninformative model, that a particular model could reach for particular experimental data) (Oeffner et al., 2018). For the expansion of the partial structure, atomic interpretation of a coarser map by sequentially building amino acids becomes unreliable, and model bias becomes a prime concern.

Agreement with the experimental data and good stereochemistry (Chen et al., 2010) are the main validation criteria to judge the correctness of the model established in a crystallographic determination. High resolution is desirable but is inherently limited by the crystals; at low resolution, the lack of experimental data has always compromised the determination of the atomic model (Jorda et al., 2016; Urzhumtsev et al., 2000). The recent advent of predicted atomic models has eased structure solution but a practically complete model, globally judged by limited experimental data, and a priori conforming to proper stereochemistry may mask errors.

Crystallography builds a model upon consistency with experimental data and prior knowledge. Frequently, a popular figure of merit is relied on, such as a correlation coefficient (CC) between model and data above 30%. Furthermore, the emergence of new correct features in the electron density supports a partial model. When figures of merit become unreliable or the hypothesis leaves little to infer, a more active method to challenge the determination is needed.

We have expanded the resolution scope for reliable structure solution, implementing dedicated verification strategies in ARCIMBOLDO. In particular, we have extended the resolution limit to 4 Å for phasing coiled coils in ARCIMBOLDO_LITE with ideal helices or using AlphaFold multimeric predictions as models in ARCIMBOLDO_SHREDDER. In globular proteins, LLG-based fragment verification is demonstrated in the solution process from density modified maps.

2 RESULTS AND DISCUSSION

2.1 Verification in ARCIMBOLDO

Verification can be defined as a systematic attempt to disprove the model produced. The approach will vary depending on the scenario and on the errors most commonly expected, but the core idea is that while there are many ways in which a structure can be wrong, the correct solution must be unique. Hence, competing solutions are set up and scored in comparison. The process may correct errors, leading perturbed starts to develop into correct solutions. Equivalently, correct solutions should score higher than incorrect ones. When incompatible solutions reach comparable scores, at most one can be right, and possibly none is. In any case, disambiguation will need to come from a different use of the data or from prior knowledge. If data are not able to add information, it is questionable whether we have an experimental determination or an initial model our data could not disprove (Read et al., 2020).

The general mode of ARCIMBOLDO (Rodríguez et al., 2012) imposed a resolution limit of 2.5 Å, intended to prevent false solutions that might go unidentified. To overcome this concern, we introduce verification in our programs, which is here extended to low resolution.

In the case of predicted models at medium resolution, we introduced a procedure in ARCIMBOLDO_SHREDDER to systematically eliminate the starting model in favor of its inferences (Medina et al., 2022). Surely, verification is most needed in cases where data are limited by resolution or by errors. Phasing peptides with microED data where resolution, accuracy, and completeness are compromised, we set up ARCIMBOLDO_BORGES (Sammito et al., 2013) to use heterogeneous fragment libraries rather than geometric variations on a common fold (Richards et al., 2023). Consistency in model selection by phasing success is used as verification.

Coiled coils are ubiquitous structures involved in many cellular processes, and their determination is typically limited by data quality and lack of models (they remain difficult to predict). Given their high helical content, fragment-based phasing would appear suitable (Thomas et al., 2020). Besides ARCIMBOLDO, alternative implementations can be found in the CCsolve (Rämisch et al., 2015) and AMPLE (Sánchez Rodríguez et al., 2020; Thomas et al., 2015) pipelines, the latter reporting cases up to 3.3 Å resolution (Thomas et al., 2020). Even though phasing may succeed at lower resolution, a severe obstacle is that wrong solutions, featuring mistranslated or reversed helices, typically render high figures of merit. Hence, the need for a verification procedure was recognized and tentatively introduced for the structure solution of coiled coils up to 3 Å (Caballero et al., 2018). Our verification procedure generates perturbations on the substructure leading to the best solution and compares their scores after submitting them to the same density modification and autotracing procedure.

Even at resolutions close to 2 Å, we have observed that single ideal helices were occasionally placed in the correct position but in a reversed direction, as helical periodicity accounts for the main low resolution diffraction features in either direction. ARCIMBOLDO_LITE addresses this issue by phasing with substructures containing reversed helices. The primary reason to generate the perturbed substructures was to overcome the lock arising in model building, where the biased map would be traced with wrongly reversed helices again and again. Setting up all possible directions and letting them compete showed improved convergence. These same helices are then used to generate realistic perturbations for verification. However, when using multimeric predictions, generating a model traced backwards would be contrived, nor have we seen a case where tracing reversed parts of the model. Thus, random translations, which constitute realistic perturbations occurring in practice in wrong solutions, are used as a method for verification in ARCIMBOLDO_SHREDDER. They also occur in single helix searches and are used as well in ARCIMBOLDO_LITE, as a baseline for a wrong solution.

In ARCIMBOLDO_LITE (Sammito et al., 2015), perturbations are generated in two ways: random translation and reversing the direction of helices (Figure 1), whereas in ARCIMBOLDO_SHREDDER (Millán et al., 2018; Sammito et al., 2014), only a randomly translated solution is used. The whole solution is shifted, except for the space group P1, where the origin choice is unconstrained and half of the helices are translated with respect to the others. The resulting phase differences are assessed to validate randomness. From the Phaser substructure that led to the best CC after SHELXE, a sparse but systematic reversal of helices is performed. Subsequently, rigid-body refinement and rescoring in Phaser and CC assessment in SHELXE are used to select a subset along with the randomly translated solution and the best solution. Results are compared after the expansion procedure. If the discrimination between the best solution and the random solution persists or the final solutions are equivalent, confidence in this solution will be justified. Thus, the best solution is validated if it can be clearly discriminated from the random solution or if different perturbations develop into a group of equivalent solutions. Conversely, it is not validated if it cannot be discriminated from the random solution, and it is inconclusive when structurally different extensions are characterized by comparable figures of merit.

Figure 1. Perturbations induced in the verification step. (a) Substructure leading to the best solution. (b) Random solution generated applying on the best solution (transparent) the fractional translation vector (0.1, 0.1, 0.1). (c) Sparse group of substructures with reversed helices, selected from the 16 possibilities with 4 fragments. Helices are shown as sticks; arrows indicate helix direction.
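The two kinds of perturbation described above can be illustrated with a minimal sketch. It is not the ARCIMBOLDO implementation: the function names are hypothetical, the reversal is represented only as a pattern of flags (the geometric operation that flips a helix is not shown), and the sparse selection of a subset is left out. The translation vector (0.1, 0.1, 0.1) is the one quoted in the Figure 1 caption.

```python
from itertools import product


def reversal_patterns(n_helices):
    """Enumerate which helices to reverse (True) or keep (False).

    The all-False pattern is the unperturbed substructure, so it is excluded.
    """
    return [p for p in product((False, True), repeat=n_helices) if any(p)]


def translate_fractional(coords, shift=(0.1, 0.1, 0.1)):
    """Apply a fractional translation to a list of (x, y, z) fractional coordinates."""
    return [tuple(x + s for x, s in zip(xyz, shift)) for xyz in coords]


# With 4 placed helices there are 2**4 = 16 orientation combinations,
# i.e. the original substructure plus 15 perturbed ones.
patterns = reversal_patterns(4)
print(len(patterns) + 1)  # 16

# Random (wrong) solution: the whole substructure shifted by the same vector.
shifted = translate_fractional([(0.25, 0.50, 0.75)])
```

In practice only a sparse subset of the reversal patterns would be carried forward, as the text describes; how that subset is chosen is not specified here.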

Figure 2 illustrates how to interpret the verification plot output by ARCIMBOLDO. SHELXE may successfully rebuild all the helices originally reversed (Figure 2a), all the helices or none (Figure 2b), all the helices, some helices or none (Figure 2c). In all these cases, the best solution is clearly discriminated from the random solution by a CC difference exceeding 15%. In the unsolved structures, all the traces are random, and the figures of merit are similar to the wrong solution (Figure 2d). In these cases, the best solution cannot be discriminated from the random solution, and the difference between their CC falls under 9%.

Figure 2. ARCIMBOLDO_LITE verification plots. The abscissa is the correlation coefficient (CC) of the trace (%) and the ordinate is the mean phase difference (MPD) (°) with respect to the best scoring substructure, also colored from green (structurally similar) to red (structurally different). (a) PDBid 1deb (2.4 Å), two groups are differentiated, the random solution and the group where all traces become equivalent and correct. (b) PDBid 3efg (2.0 Å) shows two groups, an inconsistent group with the random solution and part of the perturbations that diverged, and the best solution joined by equivalent, corrected solutions. (c) PDBid 4oh8 (2.3 Å) with a dispersed group where SHELXE has corrected all the helices, some of them or none. (d) PDBid 3tul (2.8 Å) that remains unsolved shows a single group close to the random solution. (e) PDBid 3vir (2.7 Å), where expansion led to equivalent correct solutions since tracing had reversed the incorrect portions. The minor differences in CC or MPD derived from slight differences in the geometry or extension of the trace are irrelevant. (f) PDBid 4pna (2.1 Å), the extension of inconsistent solutions leads to inconclusive results with structurally different structures characterized by comparable CC.

There is an interval where the difference between the CC of the best and the wrong solution does not discriminate whether a structure is solved or not. The group of substructures with reversed helices makes it possible to assess whether the top CC corresponds to a structurally unique solution or whether different structures render comparable scores. If it is unique, the sensitivity of the CC to discriminate can be trusted, and the structure can be concluded to be solved. Otherwise, the structure presumably remains unsolved. Error remediation shows the power of the data to discriminate.

The difference between the minimum and maximum MPD (Mean Phase Difference calculated with respect to the best solution) values of the reverse group is smaller in solved structures and larger in unsolved structures. Furthermore, for solved structures, the difference between the MPD of the random solution and the maximum MPD value of the reverse group is larger than in the case of unsolved structures. This is illustrated by the cases of PDB entries 3vir (solved) and 4pna (not solved) in Figure 2e,f, respectively.
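The two MPD quantities compared in this paragraph can be written down explicitly. The sketch below only computes them; the function name and the example numbers are illustrative assumptions, and no cut-off values are implied, since the text does not state any.

```python
def mpd_summary(mpd_reversed, mpd_random):
    """Summarize MPD values (degrees), all measured against the best solution.

    mpd_reversed: MPD of each substructure in the reversed-helix group.
    mpd_random:   MPD of the randomly translated (wrong) solution.

    Returns (spread, gap), where
      spread = max - min MPD within the reversed group
               (smaller in solved structures),
      gap    = MPD of the random solution - max MPD of the reversed group
               (larger in solved structures).
    """
    spread = max(mpd_reversed) - min(mpd_reversed)
    gap = mpd_random - max(mpd_reversed)
    return spread, gap


# Hypothetical numbers for a well-behaved case: the reversed group stays
# close to the best solution while the random solution is far from it.
print(mpd_summary([5.0, 12.0, 20.0], 88.0))  # (15.0, 68.0)
```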

Figure 3 shows a decision workflow to classify solved, unsolved, or inconclusive results. The verification step will only be performed if the best scoring solution from ARCIMBOLDO_LITE or ARCIMBOLDO_SHREDDER has reached a CC above 25%. From our tests, a structure is considered solved when the CC difference between the best (CCbest) and a random solution (CCw) exceeds 15% and not solved if this difference falls below 9%. In ARCIMBOLDO_SHREDDER for intermediate values, no conclusive discrimination is reached. In ARCIMBOLDO_LITE, evolution of the substructures with reversed helices is assessed in terms of CC discrimination and MPD consistency.

Figure 3. Decision workflow analyzing verification in (a) ARCIMBOLDO_LITE and (b) ARCIMBOLDO_SHREDDER. Colors label the structure as solved (green), not solved (red), or not discriminable (yellow). MPDw is the Mean Phase Difference (MPD) between the best solution and the random one. CCbest is the Correlation Coefficient (CC) of the best solution. CCw is the CC of the random or wrong solution. CV_CC is the coefficient of variation of the reverse group, calculated by dividing the standard deviation of the CC values of the reverse group by their mean and multiplying by 100. diff_CC is the difference between the highest and the lowest CC value of the reverse group. maxMPDr and minMPDr are the maximum and the minimum values of MPD of the reverse group.
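The CC-based part of this workflow, together with the two reverse-group statistics defined in the caption, can be sketched as follows. This is an illustration under stated assumptions, not the ARCIMBOLDO code: the function names and returned labels are hypothetical, the population standard deviation is assumed for CV_CC (the caption does not say which estimator is used), and the additional reverse-group criteria applied by ARCIMBOLDO_LITE to intermediate cases are not reproduced because their thresholds are not given in the text.

```python
from statistics import mean, pstdev

CC_MIN_FOR_VERIFICATION = 25.0  # verification is only run at or above this CC (%)
CC_DIFF_SOLVED = 15.0           # CCbest - CCw above this: solved
CC_DIFF_NOT_SOLVED = 9.0        # CCbest - CCw below this: not solved


def classify_by_cc(cc_best, cc_random):
    """Classify a run from the CC of the best and of the random solution (%)."""
    if cc_best < CC_MIN_FOR_VERIFICATION:
        return "verification not run"
    diff = cc_best - cc_random
    if diff > CC_DIFF_SOLVED:
        return "solved"
    if diff < CC_DIFF_NOT_SOLVED:
        return "not solved"
    # Intermediate values: inconclusive in ARCIMBOLDO_SHREDDER; in
    # ARCIMBOLDO_LITE the reversed-helix group is examined further.
    return "inconclusive"


def reverse_group_stats(cc_values):
    """Return (CV_CC, diff_CC) for the CC values (%) of the reverse group."""
    cv_cc = 100.0 * pstdev(cc_values) / mean(cc_values)  # assumption: population SD
    diff_cc = max(cc_values) - min(cc_values)
    return cv_cc, diff_cc


print(classify_by_cc(41.0, 22.5))   # solved (difference 18.5)
print(classify_by_cc(33.0, 27.0))   # not solved (difference 6.0)
print(classify_by_cc(35.0, 24.0))   # inconclusive (difference 11.0)
```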

2.2 Phasing coiled coils at low resolution

Coiled-coil dedicated modes for low resolution phasing, up to 4 Å, have been implemented both in ARCIMBOLDO_LITE for use of individual alpha helices as search models and in ARCIMBOLDO_SHREDDER employing fragments from predicted models. In both programs, verification fulfills its aim. It identifies a solution (Figure 4a,d), flags possible but inconclusive cases to avoid false positives (Figure 4b,e), and discards wrong solutions (Figure 4c,f).

Figure 4. Verification plots for low resolution cases. In (a, d), the verification determined that the structure is solved; in (b, e), the verification cannot reach a conclusive distinction; in (c, f), according to verification the structure is not solved.

2.2.1 Results for ARCIMBOLDO_LITE: Advantages of searching single helices

Phasing coiled-coil structures is complicated by data modulation and anisotropy, generating problems to differentiate genuine intermolecular tNCS from Patterson artifacts (Caballero et al., 2021), overlapping solutions, helical placement in the correct position but in reversed direction, poor side chain discrimination, and wrong solutions with high figures of merit.

Coiled-coil particularities are addressed in a specific mode (Caballero et al., 2018), which involves deactivating the placement of tNCS-related helices, implementing a new packing filter in Phaser to overcome overlapping solutions, addressing generation and testing of reversed helices, and improving the autotracing in SHELXE. Finally, to verify solutions, perturbations are made and compared for discrimination.

In addition, the lack of a suitable MR model has traditionally been a problem, given that in an extended structure small changes in torsion angles give rise to large deviations in the overall geometry. Coiled coils may also be versatile in their association (parallel or antiparallel; number of helices involved), the same sequence giving rise to different architectures, both in nature and in different crystal forms (Leonardo et al., 2021). Their structure prediction may thus fail or simply not correspond to the species in the crystal.

Figure 5a summarizes the performance of the coiled_coil mode implemented in ARCIMBOLDO_LITE on a set of 30 test structures in the resolution range spanning 3–4 Å. An anisotropic diffraction limit and scaling, possibly with STARANISO (Tickle et al., 2018), had been applied to 18 of these datasets. In 9 cases, the diffraction limit of the deposited data was isotropic and remained unchanged by STARANISO. The other three cases were corrected by STARANISO, and both corrected and uncorrected datasets were used.

Figure 5. Verification performance in coiled coil mode on 30 test structures within (a) ARCIMBOLDO_LITE and (b) ARCIMBOLDO_SHREDDER. The abscissa represents the correlation coefficient (CC) of the partial structure against the experimental data, and the ordinate represents the weighted Mean Phase Error (wMPE) compared to the reference deposited in the PDB. A thick horizontal line delineates the boundary for considering a structure solved or unsolved (a structure is deemed solved when the wMPE is below 65°), and a thin vertical line marks the CC threshold for verification, which is conducted only if the CC is 25% or higher. Green indicates verification-confirmed solutions, red indicates cases where verification determined that structures were not solved, and orange indicates cases where verification yielded inconclusive results.

Previously, the verification step allowed us to extend the resolution limit from 2.5 to 3 Å. Here, we assess the validity of verification in the range of 3–4 Å. A structure was considered solved when achieving: a weighted mean phase error (wMPE) versus the reference deposited with the PDB below 65°, a CC of the partial structure against the experimental data of 25% or higher, and confirmation through positive verification. According to this, of the 30 structures, 19 (63.3%) were solved using the default parameterization of the coiled_coil mode. Importantly, we did not encounter any case where verification falsely indicated a wrong structure to be solved. The only parameters that varied across runs were the number (2–4) and the length (18–40 residues) of the helices, adapted to the expected ASU contents. In general, fragment selection would rely on the eLLG, but for example, 3mqb, a 423 residue structure at 3.2 Å resolution, yields an encouraging CC of 40% for a random phase set characterized by MPE 88.3° after placement of helical fragments of 60 residues, rendering very high LLGs of 154, 279, 384, and 464, respectively. A first search configured to find four helices of 18 residues, demonstrated in our previous study, was used for sampling; indeed, 10 of 16 solved proteins were thus solved. The other six were solved with longer helices of 30–40 residues. Regarding anisotropy correction, in two of the three cases where comparison was possible, no practical difference was observed in our runs; for one of them (PDBid: 4zry), results improved for the anisotropically truncated data, and a verified solution was reached in ARCIMBOLDO_LITE. A table details the characteristics and results for each of the structures probed (Table S1).
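The benchmark criterion stated above combines three conditions, which the following sketch makes explicit. The helper names are hypothetical, and the weighting scheme passed to the phase error is an assumption (the text does not specify which weights enter the wMPE); origin and enantiomorph ambiguities between phase sets are ignored.

```python
def wrapped_phase_error(phi, phi_ref):
    """Absolute phase difference in degrees, wrapped into [0, 180]."""
    return abs((phi - phi_ref + 180.0) % 360.0 - 180.0)


def weighted_mean_phase_error(phases, ref_phases, weights):
    """Weighted mean of the wrapped phase errors (degrees)."""
    total = sum(weights)
    return sum(
        w * wrapped_phase_error(p, r)
        for p, r, w in zip(phases, ref_phases, weights)
    ) / total


def counts_as_solved(wmpe, cc, verification):
    """Benchmark criterion: wMPE < 65 deg, CC >= 25 %, positive verification."""
    return wmpe < 65.0 and cc >= 25.0 and verification == "solved"


# Two reflections, the first weighted three times as much as the second.
print(weighted_mean_phase_error([0.0, 90.0], [0.0, 0.0], [3.0, 1.0]))  # 22.5

# A high CC alone is not sufficient: a CC of 40 % with a phase error of
# 88.3 deg (the 3mqb example in the text) does not count as solved.
print(counts_as_solved(88.3, 40.0, "solved"))  # False
```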

At lower resolution, increasing the signal from side chains by using polyserine (Schwarzenbacher et al., 2004) instead of polyalanine helices might be expected to improve the discrimination of helix directionality and phasing success. We construct an ideal polyserine helix using the most frequent rotamers of the amino acids represented. Results with these search models showed no improvement versus polyalanine helices. Indeed, in 4 of 16 cases, the structure solution was not accomplished with the polyserine helices (Table S1, Data S1).

Benchmarking and computing time reduction limiting the number of clusters are given in Data S1.

2.2.2 Results for ARCIMBOLDO_SHREDDER using fragments from multimeric predictions

ARCIMBOLDO_SHREDDER, originally developed for phasing using fragments from remote homologs, has been adapted to use predicted models (Medina et al., 2022). The predicted_model mode automatically preprocesses the AlphaFold or RoseTTAFold model, extracts overlapping compact fragments of a size determined from the expected LLG (Oeffner et al., 2018), and gives them internal degrees of freedom to improve the model (McCoy et al., 2018). Finally, if the structure is a multimer and expansion of a first placement does not suffice to provide a solution, the sequential search for several copies (multicopy) will be activated. This mode should be activated along with the new SHREDDER coiled_coil mode to activate its particular verification and other dedicated features.

AlphaFold multimer (Liu et al., 2023) was used to predict the coiled coils, subsequently employed as search models, selecting those with the highest average of per-residue model confidence score (pLDDT). Figure 5b summarizes the performance of ARCIMBOLDO_SHREDDER on the same set of test structures. Among these structures, the verification step determined successful solution in 16 cases (53.3%), all exhibiting wMPE versus their reference PDB below 65°, indicative of correct solutions. In seven cases, verification determined that the structures were not solved; in all of these cases, the wMPE was above 65°. Furthermore, of the seven cases where the verification step was inconclusive, six have wMPE below 65° and one above 65°. This cautious approach is adopted to minimize the risk of false positives. Finally, it is worth mentioning that two structures (PDBid: 1t8b and 5f4y) could only be solved through the multicopy search. Dataset characteristics and results are compiled in Table S1.

The default pre-processing in ARCIMBOLDO_SHREDDER trims the side chains to alanine residues and sets a common B-value of 25 Å² for all atoms, resulting in a library of fragment models with equivalent scattering. In deep-learning protein predictions, the side chains are typically accurate as long as the backbone prediction is accurate (Jumper et al., 2021), so for predicted models side chains are preserved by default instead of being trimmed to polyalanine. Furthermore, the B factors are set to a common value of 25 Å² for the main chain and 50 Å² for the side chains. Finally, H atoms are removed.
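The preprocessing of predicted models described here (common B factors for main chain and side chains, removal of H atoms) can be sketched on fixed-column PDB records. This is an illustration, not the ARCIMBOLDO_SHREDDER code: the function names are hypothetical, and treating N, CA, C, and O as the main chain is an assumption made for the example.

```python
MAIN_CHAIN_ATOMS = {"N", "CA", "C", "O"}  # assumption: backbone definition used here


def is_hydrogen(atom_name, element):
    """Use the element column when present, else fall back to the atom name."""
    if element:
        return element == "H"
    return atom_name.lstrip("0123456789").startswith("H")


def preprocess_pdb_lines(lines, b_main=25.0, b_side=50.0):
    """Reset B factors and drop hydrogens in ATOM/HETATM records.

    PDB fixed columns (1-based): atom name 13-16, B factor 61-66, element 77-78.
    Other records are passed through unchanged (without trailing newline).
    """
    processed = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.startswith(("ATOM", "HETATM")):
            processed.append(line)
            continue
        padded = line.ljust(80)
        atom_name = padded[12:16].strip()
        element = padded[76:78].strip().upper()
        if is_hydrogen(atom_name, element):
            continue  # H atoms are removed
        b_value = b_main if atom_name in MAIN_CHAIN_ATOMS else b_side
        processed.append((padded[:60] + f"{b_value:6.2f}" + padded[66:]).rstrip())
    return processed
```

A typical use would read the lines of a predicted model file, pass them through `preprocess_pdb_lines`, and write the result back out one record per line. Trimming to polyalanine, the default for non-predicted models, is not covered by this sketch.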

A comparison of the performance of the predicted models with and without side chains is shown through the wMPE of the ARCIMBOLDO_SHREDDER solution against the deposited structure (Table S1). In two cases, side chains were crucial to solve the structure. For PDBid 1ovu, this resulted in a reduction of 25° in wMPE (from 90° to 64.1°). Similarly, for PDBid 6ixg, the wMPE improves by 11.6° (from 72.6° to 61.1°). However, in another case, the use of side chains prevented solution: for PDBid 4w80, wMPE increased from 39.4° to 90°. This outcome is explained by the fact that, notwithstanding an average pLDDT of 86, 63% of the model superposes with the deposited structure, rendering an RMSD of 3.77 Å, in a position where the sequences do not coincide at all.

Fragment location applies the standard packing filter to exclude probes occupying the same space as some symmetry equivalents. For fragments, given the smaller fraction of the asymmetric unit occupied, this filter is not as effective as for complete models. Even in the absence of clashes, some packing arrangements may appear intuitively unlikely, for example, coplanar, orthogonal arrangements of close helices. The simplest and most robust idea, implemented in ARCIMBOLDO_SHREDDER, superposes the original model onto the corresponding residues in the probe and calculates the clashes for this full model. For globular proteins, this has a significant effect with minimal assumptions, as it helps to discard wrong probes at early stages, prioritizing the correct ones for expansion while saving run time. This smart packing option is unsuitable for coiled coils, due to two recurring situations, illustrated in Figure S1. The large RMSD encountered in predicted or homologous models, along with the extended packing contacts, may cause a large number of clashes even for close to correct placements and models. Also, fragments of a helix bundle may fit the structure at several places and better account for the scattering at a location different from the one the fragment would match in the final model. This would be unlikely for nonperiodic structures, but in the case displayed in Figure S2, the two misplaced helices render a phase set characterized by 60° wMPE, comparable to placement on the corresponding residues, which produces 55° wMPE. Hence, smart packing is disabled by default for coiled coils.

Phase combination of consistent datasets with ALIXE (Millán et al., 2020) was also deactivated: when we tested its impact on structure solution, no significant differences were observed other than an increase in computing time.

2.2.3 Comparison between performance of ARCIMBOLDO_LITE and ARCIMBOLDO_SHREDDER

A comparison between the performance of ARCIMBOLDO_LITE and ARCIMBOLDO_SHREDDER is shown in Figure 6. Both programs succeed in 13 cases out of 30. Furthermore, 6 cases were only solved with ARCIMBOLDO_LITE and 3 cases were only solved with ARCIMBOLDO_SHREDDER. For only 8 structures did verification conclude, with both programs, that they were not solved or that the result was inconclusive.

Figure 6. Comparison between the performance of ARCIMBOLDO_LITE and ARCIMBOLDO_SHREDDER. Abscissa represents the resolution and ordinates the asymmetric unit content. Structures where verification confirmed solutions with both ARCIMBOLDO_SHREDDER and ARCIMBOLDO_LITE are shown in green. Structures where verification concluded that they were not solved or were inconclusive with both programs are marked in red. Structures only solved with ARCIMBOLDO_LITE are depicted in blue, while structures only solved with ARCIMBOLDO_SHREDDER are represented in yellow.
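As a simple consistency check, the four categories of this comparison can be tallied against the per-program totals reported in the two preceding sections. The counts are taken from the text; the variable names are only illustrative.

```python
# Counts reported for the 30 test structures.
counts = {"both": 13, "lite_only": 6, "shredder_only": 3, "neither": 8}

total = sum(counts.values())
lite_solved = counts["both"] + counts["lite_only"]
shredder_solved = counts["both"] + counts["shredder_only"]


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


print(total)                                            # 30
print(lite_solved, percent(lite_solved, total))         # 19 63.3
print(shredder_solved, percent(shredder_solved, total)) # 16 53.3
```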

For example, PDBid 4w80 (Figure 7a) was only solved with ARCIMBOLDO_LITE, along with PDBid 6c4x with an average pLDDT of 46 and an incorrect coiled-coil association (Figure 7b). Since pLDDT in coiled-coil predictions is boosted by the accurate secondary structure, a bad prediction can have a high confidence score. Conversely, three cases were only solved by ARCIMBOLDO_SHREDDER; for PDBid 6gbr, the average pLDDT of the predicted model is 87, similar to the previous case (4w80), but the RMSD compared to 98% of the deposited structure is 2.46 Å. For the other two structures, 1t8b and 5f4y, composed of 416 and 319 residues, respectively, solution was only accomplished by searching two copies in the asymmetric unit. ARCIMBOLDO_LITE with ideal helices is advantageous regarding the more sophisticated verification and model independence, but ARCIMBOLDO_SHREDDER may succeed on larger structures or in the presence of several copies in the asymmetric unit.

Differences between AF model (pLDDT-colored) and experimental determination (gray) precluding SHREDDER solution. (a) 4w80 and (b) 6c4x.

2.2.4 SHELXE with clustering of helical seeds

The autotracing algorithm in SHELXE has been enhanced to be effective in map interpretation and extension of coiled-coil structures at low resolution. At the same time, the atomic character of the algorithm has been retained, since accuracy is essential for structure extension. Torsions within tripeptides are refined with a short helical fragment tethered at the end to avoid losing connectivity in weaker map regions. This choice is automatically triggered within the coiled_coil mode and leads to all autotracing cycles apart from the last being seeded from longer helices, with extension of the main chain using helical restraints for Ramachandran angles or helical sliding. Because the best-scoring seeds tended to map to the same small region of the structure, a more diverse sampling was incorporated and seeding on solvent-assigned voxels is skipped. Also, larger radius values are used for the sphere of influence, and extrapolation of unmeasured data is used in all cycles (Usón et al., 2007).

Hence, we developed a new helical tracing algorithm in SHELXE, based on a different choice of seeds: seeds placed in voxels assigned as solvent are blocked, and seeds are clustered rather than simply ranked, since the highest-scoring ones all corresponded to the same helix and the search needed to be broadened. Finally, the extension algorithm was more tightly constrained, refining torsions of helical amino acids linked to a short helical rigid body of five residues to avoid losing connectivity in weaker map regions.
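The seed selection idea can be illustrated with a simple greedy scheme. This is a sketch under stated assumptions, not the SHELXE implementation: the function name, the distance threshold, and the `is_solvent` callback are hypothetical, and the actual clustering criterion used in SHELXE is not described in this text.

```python
import math


def select_diverse_seeds(seeds, is_solvent, min_separation, max_seeds):
    """Greedy pick of high-scoring seeds that are spread over the map.

    seeds: iterable of (score, (x, y, z)) tuples, coordinates in angstroms.
    is_solvent: callable returning True if a position lies in solvent.
    """
    chosen = []
    for score, pos in sorted(seeds, key=lambda s: s[0], reverse=True):
        if is_solvent(pos):
            continue  # block seeds placed in solvent-assigned voxels
        # Skip seeds that fall in a region already represented by a better seed.
        if any(math.dist(pos, kept) < min_separation for _, kept in chosen):
            continue
        chosen.append((score, pos))
        if len(chosen) == max_seeds:
            break
    return chosen


# Toy example: three top seeds crowd one site, one seed sits in "solvent".
toy_seeds = [
    (9.0, (0.0, 0.0, 0.0)),
    (8.5, (1.0, 0.0, 0.0)),
    (8.0, (0.0, 1.0, 0.0)),
    (7.0, (50.0, 50.0, 50.0)),
    (5.0, (20.0, 0.0, 0.0)),
]
picked = select_diverse_seeds(
    toy_seeds, is_solvent=lambda p: p[0] > 40.0, min_separation=5.0, max_seeds=3
)
```

With plain ranking, the three best seeds would all describe the same site. Here only one of them is kept, and a lower-scoring seed from a different region is selected instead.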

The improvement over the previous SHELXE version is reflected in practice in two of the ARCIMBOLDO_LITE cases, where successful solutions previously ruled out (one as not solved and the other inconclusive) are now recognized as solved. In ARCIMBOLDO_SHREDDER, four successful solutions previously deemed inconclusive are now recognized as solved, and two successful solutions previously considered not solved have been upgraded to inconclusive (Table S1).

2.3 Phasing globular structures at low resolution

2.3.1 SHELXE low resolution extension in the case of the Zika virus NS5

An ARCIMBOLDO solution was obtained for the orthorhombic form of the Zika virus NS5 structure (Ferrero et al., 2019). Merging 11 partial datasets rendered complete data, where the resolution extended to 4 Å in the best direction but reached only 7.4 Å in the worst direction due to severe anisotropy, as estimated and corrected by STARANISO. The asymmetric unit contained six copies of the full-length NS5 protein, composed of two domains, leaving space for 80% solvent. ARCIMBOLDO_LITE typically uses main chain helices, but it can take any custom model. In this case, an experimental model was available for the methyltransferase domain (5KQR; Coloma et al., 2016), but the homolog from Japanese encephalitis virus had to be used for the RNA-dependent RNA-polymerase domain (JEV RdRP; PDBid 4k6m; Lu & Gong, 2013). Default MR placement of the 12 domains did not work. The coordinates of the partial solution provided starting phases for a map, which, upon density modification with SHELXE (Sheldrick, 2002), allowed manual fitting of the remaining domains (Figure 8).

Electron density map calculated from the partial NS5 Zika Virus structure solution. Placed methyltransferase domains (blue) (a) before density modification (b) after density modification showing the manually fitted absent RdRP domain (green) in the map. Electron density maps are shown at 1.5σ.

As the case illustrates, even with good models, in multidomain structures it may be difficult to locate the ones characterized by higher B-values, contributing less to the overall scattering. In practice, this can be achieved with a combination of ARCIMBOLDO_LITE to place the components of a partial solution and likelihood-based docking in Phaser (Millán et al., 2023) using the resulting SHELX maps.

2.3.2 ARCIMBOLDO methodology to solve the full lumenal region of vacuolar sorting receptor 1 at 3.5 Å

For the full lumenal region of Vacuolar Sorting Receptor 1 (VSR1) at 3.5 Å resolution, which could not be solved by other methods (Borges et al., 2024), we developed a dedicated methodology within the ARCIMBOLDO principle. It targets this low resolution scenario through systematic verification: a more constrained map interpretation and comparative scoring of alternative hypotheses.

We use the density-modified map calculated with SHELXE from a partial and reliable solution to assemble structural hypotheses consistent with prior knowledge. Multiple alternative fragments derived from remote homologs and secondary structure prediction are evaluated, considering different secondary structure elements, fragment reversion, and side chain inclusion using the most probable sequences generated with the SEQUENCE SLIDER (Borges et al., 2020; Borges et al., 2022) functions. Fragments were rigid-body refined, allowing internal degrees of freedom using Gimble in Phaser (McCoy et al., 2018). Each fragment was scored based on its LLGcontribution to the overall value, roughly estimated from LLG change upon fragment omission. Relative differences in the LLGcontribution of each fragment guided the choice between different possibilities. Fragments with lower values were optimized by changing their size, orientation, curvature, or secondary structure. Emergence of new features in unmodeled map regions confirmed partial model correctness.
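The omission-based scoring described above can be sketched as a leave-one-out loop. This is an illustration only: `llg_of` is a hypothetical stand-in for a real likelihood calculation (in the study, Phaser), and the toy values below are not results from the paper.

```python
def llg_contributions(fragments, llg_of):
    """Estimate each fragment's contribution as the LLG drop on omitting it.

    fragments: dict mapping a fragment name to its model object.
    llg_of: callable scoring a list of fragment models together; a
            hypothetical interface standing in for a real LLG calculation.
    """
    names = list(fragments)
    total = llg_of([fragments[n] for n in names])
    contributions = {}
    for name in names:
        rest = [fragments[n] for n in names if n != name]
        contributions[name] = total - llg_of(rest)
    return total, contributions


def weakest_fragment(contributions):
    """Return the fragment with the lowest contribution, a candidate for rebuilding."""
    return min(contributions, key=contributions.get)


# Toy stand-in: each "model" is just a number and the score is additive.
# A real LLG is not additive, which is why the estimate is only rough.
toy = {"helix_A": 12, "helix_B": 4, "helix_C": 11}
total, contrib = llg_contributions(toy, llg_of=sum)
candidate = weakest_fragment(contrib)
```

In this sketch the fragment with the lowest contribution is flagged for optimization, mirroring how relative differences guided the choice between alternative sizes, orientations, or secondary structure types.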

The VSR1 structure contains a protease-associated (PA) domain (164 residues), a Central domain (209 residues), and three epidermal growth factor repeats (EGFs, 158 residues). The PA domain in its complex state (PDBid 4txj; Luo et al., 2014) rendered a partial solution of VSR1 by MR with Phaser. No models were available for the Central domain at the time, but it shares 15% identity over 150 residues with the extracellular metalloproteinase from Aspergillus fumigatus (PDBid 4k90; Fernandez et al., 2013) and with Staphylococcus aureus DsbA (PDBid 3bci; Heras et al., 2008), as identified in an HHPRED alignment (Söding et al., 2005). Both structures share a roughly parallel four-helix bundle orientation (Figure S3). This fold was placed in the initial density-modified map, rigid body refinement was performed, and all helices scored LLGcontribution above 10, except the one shown in blue (3.8) (Figure 9a). Guided by this discrepancy, we optimized its orientation, and its LLGcontribution increased to 10.6 (Figure 9b). We manually fitted VSR1 helices to the DsbA model and, as the green helix matched a DsbA strand (Figure 9c), we modeled this alternative secondary structure, improving the LLGcontribution from 10.6 (for the 16-residue helix) to 21.6 (for the 10-residue strand) (Figure 9d).

Phasing Vacuolar Sorting receptor 1 (VSR1) and adding central domain fragments. Electron density maps are shown at 1.5σ. In (a, b, d), the density-modified map is generated from the coordinates of the PA domain (red sticks); (a) four parallel helices are modeled in the region corresponding to the central domain and scored on LLGcontribution; (b) optimization of the blue helix location increases its LLGcontribution; (c) DsbA (3bci) (magenta) superimposed on the four helices of the VSR1 hypothesis, the green helix in the central domain corresponds to a strand of DsbA; (d) the green helix is replaced by a strand and its LLGcontribution increases; (e) density-modified map generated from the PA domain and central domain fragments, extending side chains on strands raises their LLGcontribution versus polyalanine; (f) features develop on density-modified map at the EGF region generated with 153 residues of the PA domain (left) versus 210 residues including fragments from the central domain (right).

After adding the Central domain fragments, whose confidence was supported by LLG calculation and DsbA superposition, we included them in map calculation to reveal new features. The green strand was expanded to a sheet of three strands, and the inclusion of side chains modeled by the SEQUENCE SLIDER function improved their LLGcontribution (Figure 9e). With the significant improvement in the Central domain fragments, we inspected the electron density of a region devoid of atoms, corresponding to the EGF domain. The first density-modified map generated solely from the PA domain was almost featureless in contrast to the map generated using also the Central domain fragments (57 residues), which shows continuous electron density (Figure 9f). This independent control supported fragment correctness. From the new features revealed in the map, we manually extended and joined different fragments of the Central domain and assigned their sequence with SEQUENCE SLIDER.

This methodology within the ARCIMBOLDO framework, and its use to phase the VSR1 structure, establish a low resolution verification for globular proteins. Building into the SHELXE maps with only the guide from the remote homologs was challenging and involved testing alternative, manually built fragment types. With one or more model alternatives, the Shred_LLG estimation in ARCIMBOLDO_SHREDDER SEQUENTIAL (Sammito et al., 2014) could be used to estimate the contribution of each residue or secondary structure element to the overall LLG, providing a basis for verification.

3 CONCLUSION

Low resolution phasing can exploit the ARCIMBOLDO method: the extension of an accurate partial structure through interpretation of its density-modified electron density map. Fragments, rather than individual residues, need to be rated to give enough signal. This is illustrated for globular proteins by two cases: the Zika virus NS5 protein, with models for both domains, and plant VSR1, where ab initio extension from one quarter of the structure is achieved through comparative scoring of alternative fragments by estimating their LLGcontribution, comparison with superposition of low-identity templates, and inspection of a region devoid of atoms.

Coiled coils remain a class of proteins for which predicted models may fail, and generation of pseudo-solutions is a typical pitfall. ARCIMBOLDO has been adapted to solve and verify coiled-coil structures at resolutions up to 4 Å, both ab initio and exploiting predicted models. The verification step successfully ruled out 7 false positives in the case of ARCIMBOLDO_LITE and 4 false positives in the case of ARCIMBOLDO_SHREDDER. The development of a suitable map tracing algorithm in SHELXE for this scenario has been instrumental. Verification, defined as a mechanism aiming to disprove an apparent solution against competing alternatives, will also make it possible to establish the capacity of the data to back the structure, in other words, to validate the experimental character of the determination.

4 MATERIALS AND METHODS

4.1 Computing setup

Structure solution and tests were run on a local HTCondor v.8.7.10 (Tannenbaum et al., 2001) grid composed of 146 nodes totaling 237 GFlops. Jobs were submitted from a single workstation with one Intel i9-12900KF processor with 16 physical cores and 64 GB RAM, running Linux Debian 11.

AlphaFold2 predictions were performed on a workstation with Intel Core i9-9980XE, GeForce GTX 1080 8 GB, 64 GB RAM, and Debian 10, with a local installation of the code distributed through https://github.com/deepmind/alphafold (Jumper et al., 2021).

4.2 Software versions and figures of merit used

ARCIMBOLDO_LITE and ARCIMBOLDO_SHREDDER are deployed for Linux and Macintosh and are accessible through the Python Package Index (PyPI) (https://pypi.org/project/arcimboldo/) for use in the XDSGUI interface (Brehm et al., 2023) or as part of the CCP4 program suite starting from release 7.0 (Agirre et al., 2023; Winn et al., 2011). They require Python 3, Phaser version 2.8 or higher from the CCP4 distribution for fragment placement, and SHELXE (Usón & Sheldrick, 2024) version 2024 or higher from the SHELX distribution server for density modification and autotracing.

SEQUENCE SLIDER models side chains on partial polypeptide traces in a brute force approach. All probable sequence assignments allowed by the known sequence may be assembled and individually tried. The sequence may be matched to the trace based on the secondary structure prediction to reduce the number of possibilities. Possible models are refined, and crystallographic indicators are used for discrimination. Model extension and improvement of phases for correct models promote solutions. Iteration reveals better discrimination among the possibilities evaluated.

XPREP v.2021/1 was used for data analysis (Sheldrick, 2008). Phaser (McCoy et al., 2007) was employed to calculate the anisotropic delta-B factor for the coiled coils and to perform molecular replacement for VSR1. Manual modeling of VSR1 was performed with Coot (Emsley et al., 2010) and PyMOL (Schrödinger, LLC, n.d.). Figures were prepared with the PyMOL molecular graphics system (Schrödinger, LLC, n.d.) and Matplotlib v.1.5.3 (Hunter, 2007).

The figures of merit used in decision-making were the Phaser intensity-based log-likelihood gain (LLG; Read & McCoy, 2016) and the correlation coefficient between observed and calculated normalized intensities (CC; Fujinaga & Read, 1987), calculated by SHELXE (Sheldrick, 2010). Structure-amplitude weighted mean phase errors (wMPE; Lunin & Woolfson, 1993) were calculated with SHELXE against the models deposited in the PDB (Berman et al., 2000) to assess performance. The mean phase difference (MPD) is formally a wMPE calculated against a different structure.
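As an illustration of the wMPE figure of merit, the following minimal sketch computes an amplitude-weighted mean of absolute phase differences. It assumes both phase sets are already on the same origin and hand, which SHELXE handles internally; the function name and the toy reflections are hypothetical.

```python
def weighted_mean_phase_error(amplitudes, phases, reference_phases):
    """Amplitude-weighted mean phase error in degrees.

    The three sequences are indexed by the same reflections; phases are in
    degrees. Returns 0 for identical phase sets.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for f, phi, phi_ref in zip(amplitudes, phases, reference_phases):
        delta = abs(phi - phi_ref) % 360.0
        if delta > 180.0:
            delta = 360.0 - delta  # phase differences are periodic
        weighted_sum += f * delta
        weight_total += f
    if weight_total == 0.0:
        raise ValueError("no reflections with non-zero amplitude")
    return weighted_sum / weight_total


# Toy reflections: differences of 0, 20 (across the 0/360 wrap), and 180 degrees.
wmpe = weighted_mean_phase_error([10.0, 5.0, 1.0], [0.0, 350.0, 90.0], [0.0, 10.0, 270.0])
```

Weighting by amplitude means that strong reflections, which dominate the map, also dominate the error estimate. Applied against a structure other than the deposited one, the same calculation gives the MPD.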

4.3 Coiled-coil test set

The 30 coiled-coil crystal structures used in this study correspond to PDB entries: 6n6s, 6zff, 2oto, 3f6n, 1ovu, 4bxt, 6gbr, 3fx0, 3mqb, 3vem, 4w80, 6yek, 1t8b, 6dlc, 5f4y, 4zry, 4ut5, 3mtt, 5le0, 5zvk, 1s94, 6dma, 2xg7, 3rk3, 3wuq, 6c4x, 6jn2, 3frv, 6ixg, and 3frt.

The resolution limits of the datasets, as defined by the original authors, lie between 3 and 4 Å (23 from 3 to 3.5 Å and 7 from 3.5 to 4 Å). Their structure factors (CIF files) were obtained from the PDB. Upon introducing these data into the STARANISO Server, we observed that diffraction for several datasets was reported as likely truncated due to inappropriate isotropic or anisotropic cut-offs applied in previous processing (6zff, 1ovu, 4bxt, 3mqb, 3vem, 4w80, 1t8b, 6dlc, 5f4y, 4u5t, 5zvk, 6dma, 2xg7, 3rk3, 6jn2, 3frv, 6ixg, and 3frt); for others, no reflections were removed by the anisotropic cut-off (6n6s, 3f6n, 6gbr, 3fx0, 6yek, 3mtt, 5le0, 1s94, and 3wuq). We performed anisotropic correction using STARANISO on the remaining three datasets: 2oto, 6c4x, and 4zry.

Size lies between 78 and 640 residues, distributed in the asymmetric unit in one to eight chains. Twenty-one different space groups are represented, with P2₁2₁2₁ and P3₁21 predominating. Helical content above 75% was established with ALEPH (Medina et al., 2020). Furthermore, the coiled-coil domains were confirmed with SOCKET (Walshaw & Woolfson, 2001), whose algorithm recognizes their characteristic knobs-into-holes association, distinguishing coiled coils among the variety of helix–helix packing arrangements observed in globular domains. Otherwise, sequences and architectures of coiled coils are diverse within their characteristic association.

4.4 VSR1 data

Data to a diffraction limit of 3.5 Å were collected at beamline i03 at the Diamond Light Source to determine the functional N-terminus of the Vacuolar Sorting Receptor 1 (VSR1) grown at 4°C (Borges et al., 2024). Crystals belong to space group P2₁3, with cell constants a = b = c = 141.3 Å, and contain one monomer and 70% solvent in the asymmetric unit. The PDBid is 8r4y.
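The reported solvent content can be checked with a rough Matthews-type estimate. The sketch below makes several loud assumptions that are not stated in this section: the residue count is taken as the sum of the three domain sizes given in section 2.3.2, the mass is approximated as 110 Da per residue with no glycans or ligands, and the protein partial specific volume is folded into the usual constant 1.23.

```python
def solvent_fraction(cell_volume, n_symops, n_copies, mol_weight):
    """Matthews coefficient (cubic angstroms per Da) and estimated solvent fraction."""
    vm = cell_volume / (n_symops * n_copies * mol_weight)
    return vm, 1.0 - 1.23 / vm


a = 141.3                         # cubic cell edge from the text, in angstroms
n_residues = 164 + 209 + 158      # assumption: sum of the three domain sizes
mol_weight = 110.0 * n_residues   # assumption: about 110 Da per residue
n_symops = 12                     # the cubic space group reported has 12 symmetry operations

vm, solvent = solvent_fraction(a ** 3, n_symops, n_copies=1, mol_weight=mol_weight)
print(f"Vm = {vm:.2f} A^3/Da, solvent = {100 * solvent:.0f}%")
```

Under these assumptions, one monomer per asymmetric unit gives an estimate close to the 70% solvent reported, whereas two copies would leave far less solvent.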

AUTHOR CONTRIBUTIONS

Iracema Caballero: Investigation; writing – original draft; software; methodology; data curation; validation. Albert Castellví: Investigation; methodology. Josep Triviño: Software; investigation; methodology. Elisabet Jiménez: Investigation; methodology; software. Nicolas Soler: Investigation; methodology. Rafael Junqueira Borges: Writing – original draft; investigation; methodology; software. Isabel Usón: Software; investigation; writing – original draft; funding acquisition; supervision; methodology.

ACKNOWLEDGMENTS

This work was supported by STFC-UK/CCP4 “Agreement for the integration of methods into the CCP4 software distribution, ARCIMBOLDO_LOW” and Grant PID2021-128751NB-I00, RE2019-087953 (Ministry of Science and Innovation/Spanish State Research Agency/European Regional Development Fund/European Union). We thank George M. Sheldrick, Randy J. Read, Kay Diederichs, and Claudia Millán for helpful discussion.
