Volume 24, Issue 9 pp. 1528-1542

Methods and Applications

Residue-level global and local ensemble-ensemble comparisons of protein domains

Sarah A. Clark

Department of Biochemistry and Biophysics, Oregon State University, Corvallis, Oregon, 97331

Dale E. Tronrud

Department of Biochemistry and Biophysics, Oregon State University, Corvallis, Oregon, 97331

P. Andrew Karplus (Corresponding Author)

Department of Biochemistry and Biophysics, Oregon State University, Corvallis, Oregon, 97331

Correspondence to: P. Andrew Karplus, Professor, Biochemistry & Biophysics, Oregon State University, 2011 Ag Life Sciences Bldg, Corvallis, OR 97331. E-mail: [email protected]

First published: 01 June 2015

https://doi.org/10.1002/pro.2714

Abstract

Many methods of protein structure generation such as NMR-based solution structure determination and template-based modeling do not produce a single model, but an ensemble of models consistent with the available information. Current strategies for comparing ensembles lose information because they use only a single representative structure. Here, we describe the ENSEMBLATOR and its novel strategy to directly compare two ensembles containing the same atoms to identify significant global and local backbone differences between them on per-atom and per-residue levels, respectively. The ENSEMBLATOR has four components: eePREP (ee for ensemble-ensemble), which selects atoms common to all models; eeCORE, which identifies atoms belonging to a cutoff-distance dependent common core; eeGLOBAL, which globally superimposes all models using the defined core atoms and calculates for each atom the two intraensemble variations, the interensemble variation, and the closest approach of members of the two ensembles; and eeLOCAL, which performs a local overlay of each dipeptide and, using a novel measure of local backbone similarity, reports the same four variations as eeGLOBAL. The combination of eeGLOBAL and eeLOCAL analyses identifies the most significant differences between ensembles. We illustrate the ENSEMBLATOR's capabilities by showing how using it to analyze NMR ensembles and to compare NMR ensembles with crystal structures provides novel insights compared to published studies. One of these studies leads us to suggest that a “consistency check” of NMR-derived ensembles may be a useful analysis step for NMR-based structure determinations in general. The ENSEMBLATOR 1.0 is available as a first generation tool to carry out ensemble-ensemble comparisons.

Introduction

Biological macromolecules are highly dynamic entities that sample many conformations through vibrational motions, rotations around backbone and side chain torsion angles, and hinge bending motions, and may even have different low energy conformations depending on their environment. In addition, experimental and predictive modeling approaches are associated with uncertainties that vary for different parts of the model. For these reasons, an ensemble of structures is often a more accurate and complete way to represent a molecular structure than is a single discrete conformation. Although X-ray crystallographic determinations of macromolecular structures most commonly lead to a single set of coordinates representing an average structure, it has been noted that an ensemble of conformations created based on a single crystal structure analysis, or a composite of multiple crystal structures, or an ensemble of predicted or NMR derived structures may better represent a protein's native state.1-8 The variability that exists between models in an ensemble provides a wealth of information about the reliability and/or the molecular motions and variability. For instance, it has recently been shown that modeling alternate conformations marginally visible in room temperature protein crystal structures, effectively making an ensemble of models, offers insight into conformational changes related to the protein's functionality.9

With the widespread use of ensembles, especially to represent macromolecular structures determined by NMR, have come some basic protocols for their analysis. In particular, precision on a per-atom basis is routinely defined as the root-mean-square (rms) deviation of the model members from an energy-minimized “average” structure.10, 11 However, even for this purpose, the best way to overlay the ensemble members to calculate the “average” structure has not been standardized. Overlays are often based on all backbone or Cα atoms, or on a subset of such atoms from “ordered” residues subjectively defined by casual inspection of an initial overlay based on all atoms.12-15 Alternatively, one may use a program that weights or selects atoms more objectively to optimize the superimposition. Three such programs are: THESEUS, which superimposes structures using all atoms with maximum likelihood derived weights16; CYRANGE, part of the CYANA software package, which automatically identifies residue ranges to guide superposition14; and SuperPose, a web server that defines regions to use for superposition by comparing difference distance matrices based on α-carbons.13

As far as we are aware, however, visual inspection (as in Fig. 6 of Ref. 4) is the only common approach for directly comparing full protein ensembles with each other or with single models. Quantitative comparisons typically replace the ensemble with a representative single structure such as a minimized “average” or “medial” model,3, 17, 18 despite the loss of information this entails. Also, often only a single global comparison measure is reported, such as a root-mean-square deviation or the global distance test score (GDT.TS), which does not provide residue or atom level details of how the models/ensembles differ.19-21 Residue level information would be useful for discovering differences that are potentially related to molecular function or to shortcomings in experimental restraints or in the energy functions used to derive the structures.

Here, we describe a strategy to overcome these limitations by directly comparing protein structure ensembles in a way that allows rapid identification at the atom and residue level of all of the most significant global and local backbone conformational differences between them. The essential strategy, as implemented in the program the ENSEMBLATOR, is to compare the levels of intraensemble and interensemble variation and to identify the regions of greatest similarity between any members of the two ensembles. For global comparisons, the ENSEMBLATOR uses a novel approach to define a set of common core atoms by which all models are overlaid. Here, we first describe the method and then provide illustrative applications of the ENSEMBLATOR to four cases: (i) a standard precision analysis of an NMR-derived ensemble of Ribonuclease (RNase) Sa22, (ii) a comparison of that NMR ensemble with two chains from a crystal structure of RNase Sa24, (iii) a consistency analysis of the NMR-derived ensemble of RNase Sa, and (iv) comparisons to a reference crystal structure of three ensembles for a protein generated by different NMR structure refinement approaches. These examples illustrate the utility of direct ensemble-ensemble comparisons as a part of validating, refining, and analyzing ensembles.

Strategy

In this article, we use the word ensemble to refer to any set of models that are treated as one group for the purposes of a given comparison. We use the variables m and n, respectively, to refer to the numbers of models in the first and second ensembles being compared in a given ENSEMBLATOR run [Fig. 1(A)]. If either m or n is zero, only a single ensemble is being analyzed. The essential information generated by the ENSEMBLATOR for each atom or residue comprises the levels of intraensemble variation for the two ensembles being compared, the interensemble variation, and the closest approach of any members of the two ensembles [Fig. 1(A)]. These four quantities can be compared to discover regions of systematic difference. The most highly significant differences between the ensembles are those regions for which the closest approach of any member of one ensemble to the other is larger than the internal variation of both individual ensembles. Additional differences, less striking but still of potential importance, are those regions for which the interensemble variation exceeds the variation of both of the individual ensembles. This strategy was developed by one of us (PAK) in the early 1990s and encoded as a set of Fortran programs (eeCORE, eeGLOBAL, and eeLOCAL); these programs are now coordinated with each other and with some Python graphing routines by the ENSEMBLATOR shell script. The ENSEMBLATOR code can be downloaded from http://biochem.science.oregonstate.edu/structural-resources/.
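The four per-atom quantities and the two significance criteria described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Fortran code of eeGLOBAL: the function names, the use of NumPy, and the choice of the rms pairwise distance as the measure of variation are ours, and the coordinates are assumed to be already superimposed. The second ensemble is assumed to contain at least one model.

```python
import numpy as np

def rms_pairwise(a, b=None):
    """Per-atom rms distance over model pairs (illustrative measure of spread).

    a, b: arrays of shape (n_models, n_atoms, 3), already superimposed.
    With b=None, pairs are drawn within a; otherwise one model from each.
    """
    if b is None:
        i, j = np.triu_indices(a.shape[0], k=1)  # distinct pairs within a
        diff = a[i] - a[j]
    else:
        diff = (a[:, None] - b[None, :]).reshape(-1, a.shape[1], 3)
    if diff.shape[0] == 0:  # a single model has no internal pairs
        return np.zeros(a.shape[1])
    return np.sqrt((diff ** 2).sum(axis=-1).mean(axis=0))

def closest_approach(a, b):
    """Per-atom minimum distance between any model of a and any model of b."""
    dist = np.linalg.norm(a[:, None] - b[None, :], axis=-1)  # (m, n, n_atoms)
    return dist.min(axis=(0, 1))

def compare_ensembles(ens1, ens2):
    """Return the four quantities plus boolean flags for the two criteria."""
    spread1 = rms_pairwise(ens1)
    spread2 = rms_pairwise(ens2)
    spread1v2 = rms_pairwise(ens1, ens2)
    closest1v2 = closest_approach(ens1, ens2)
    internal = np.maximum(spread1, spread2)
    strong = closest1v2 > internal            # closest approach exceeds both spreads
    weaker = (spread1v2 > internal) & ~strong  # only interensemble variation does
    return {"spread1": spread1, "spread2": spread2, "spread1v2": spread1v2,
            "closest1v2": closest1v2, "strong": strong, "weaker": weaker}
```

For a comparison against a single structure (n = 1), `spread2` is zero everywhere in this sketch, so the first criterion reduces to the closest approach exceeding the spread of the first ensemble.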

**Figure 1**

The ENSEMBLATOR strategy. A: Given two ensembles as input, one containing m structures (purple) and one containing n structures (pink), the ENSEMBLATOR strategy involves calculating the four quantities listed in points 1–4, which could be described in shorthand as spread1, spread2, spread1v2, and closest1v2. These four quantities are calculated at both the level of global conformation and the level of local backbone conformation. B: A flowchart depicting the steps and programs involved in carrying out the ENSEMBLATOR strategy for identifying regions of global and local variability.

A run of the ENSEMBLATOR requires as input a multiple-model NMR-type PDB file containing m plus n models containing exactly the same atoms in the same order. Three distinct types of runs cover the range of typical applications: first, to overlay and determine the levels of variation in a single group of structures, m would be the number of models in the group and n would be zero; second, to compare a group of structures to a single structure, m would be the number of models in the group and n would be one; and third, to compare any two groups of structures, m would be the number of models in the first group and n would be the number in the second group. In the second and third types of comparisons, the groups of structures compared may truly be sets of independently determined structures, or they may be subsets of what might in another context be considered as a single ensemble.

The steps involved in an ENSEMBLATOR run [Fig. 1(B)] begin with file preparation. If coordinates from X-ray crystal structures are to be used, the PDB file must first be converted to the format of an NMR-type PDB file, as can be done using the script xray.PREP. Once all the structures to be compared are in NMR-type PDB files, these files are input to eePREP (ee for ensemble-ensemble), which selects the set of coordinates common to all models and writes them to a new file for the ENSEMBLATOR analyses. The global analyses are done by the pair of programs eeCORE and eeGLOBAL. eeCORE defines a cutoff (d_cut) dependent core of consistently positioned atoms, and eeGLOBAL uses that set of atoms to globally overlay all of the structures and produce three files: one containing the overlaid coordinates, a second containing a set of averaged coordinates, and a third containing the various intraensemble and interensemble global comparison statistics. The local conformation analyses are done by the program eeLOCAL, which produces a file reporting the intraensemble and interensemble local comparison statistics. For convenience, a set of graphs is also generated that shows the eeLOCAL results and summarizes the eeGLOBAL comparisons on a per-residue basis for the backbone, even though the raw results of eeGLOBAL are on a per-atom basis and also include the side chain atoms. These plots have up to four curves and use a standard color scheme, with the intraensemble variations of the first and second groups being blue and green, the closest approach between any member of the two groups being solid red, and the intergroup variations being dashed red. In the following two sections, we describe in more detail the approaches used for the global and the local comparisons.
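The atom selection attributed to eePREP above can be pictured as a set intersection over models. The sketch below assumes each model has already been parsed into a dictionary keyed by a hypothetical atom identifier such as (chain, residue number, atom name); the function name and data layout are illustrative assumptions and are not taken from the ENSEMBLATOR distribution.

```python
def common_atoms(models):
    """Keep only atoms present in every model, in the order of the first model.

    models: list of dicts mapping an atom key, e.g. ("A", 12, "CA"),
            to an (x, y, z) tuple. The key layout is an assumption.
    Returns one list of (key, xyz) pairs per model, all in the same order.
    """
    shared = set(models[0])
    for model in models[1:]:
        shared &= set(model)  # drop atoms missing from any model
    order = [key for key in models[0] if key in shared]
    return [[(key, model[key]) for key in order] for model in models]
```

After this step every model lists exactly the same atoms in the same order, which is the input requirement stated for an ENSEMBLATOR run.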

The global overlay approach

Carrying out global comparisons of a set of structures requires that they all be overlaid in a consistent manner. The eeCORE strategy involves selecting a core set of atoms that are positioned similarly in all m + n structures to be compared, with a user-defined cutoff value, d_cut in Å, defining how close atoms must be to be considered as similarly positioned. To select this core [Fig. 2(A)], every pair of structures is subjected to a series of sequential overlays, first based on all atoms, and in each subsequent overlay based only on the subset of atoms that overlaid within d_cut of each other. The process is complete when the atoms used to calculate the overlay match the atoms that have d ≤ d_cut, and exactly this set of atoms is the self-consistent core for that pair of structures and that d_cut. The program records for each pair the number of atoms qualifying and the root-mean-square (rms) deviation both of the core atoms and of all atoms. Tables of these numbers provide information about the structures that allows similar subgroups of the structures to be recognized [Fig. 2(B,C)]. For instance, in Figure 2(B,C) there are two natural groups of structures, one comprising structures 1–2 and the other comprising structures 3–6. This aspect of eeCORE was once used in a standalone manner to ascertain that a set of roughly 1000 conformations of a protein that obeyed certain distance restraints could be clustered into about 25 families of structures.25

After all pairs of structures have been compared in this way, the common core of the complete group is defined as the set of atoms that are present in every one of the pairwise self-consistent cores. We note that in this process all non-hydrogen atoms from both the backbone and side chains are used, accessing additional information compared with methods that are based only on Cα or backbone comparisons. These common core atoms are recorded in an output file (combined_coordinates_*.flag) for input to eeGLOBAL to guide the overlay on which its comparisons will be based. The eeCORE results will, of course, depend on d_cut, and the most informative d_cut to use will depend on the structures being analyzed and the purpose of the comparison. For this reason, the ENSEMBLATOR script has been written to automatically run through a series of 10 d_cut values [Fig. 1(B)] spanning the range from a generous value that will typically allow all atoms to qualify for the common core (d_cut = 50 Å) to a stringent value that will typically allow very few if any atoms to qualify in the common core (d_cut = 0.5 Å). With these results in hand, users can decide which d_cut to use for their primary analyses and can also carry out studies to see how robust their conclusions are to variations in d_cut.
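The core selection described in this section can be sketched as follows. This is a minimal illustration, not the eeCORE source: it assumes a standard Kabsch least-squares superposition for each overlay, lets atoms re-enter the working set if they fall back within d_cut, returns an empty core when fewer than three atoms remain, and caps the number of iterations in case the selection oscillates. All function names and those safeguards are our assumptions.

```python
import numpy as np
from itertools import combinations

def kabsch(mobile, target):
    """Least-squares rotation R and translation t mapping mobile onto target."""
    mc, tc = mobile.mean(axis=0), target.mean(axis=0)
    u, _, vt = np.linalg.svd((mobile - mc).T @ (target - tc))
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return rot, tc - rot @ mc

def pair_core(a, b, d_cut, max_iter=100):
    """Self-consistent core for one pair of structures, as a boolean mask.

    a, b: (n_atoms, 3) arrays with atoms in the same order.
    """
    mask = np.ones(len(a), dtype=bool)  # first overlay uses all atoms
    for _ in range(max_iter):
        if mask.sum() < 3:  # too few atoms to define an overlay (assumption)
            return np.zeros(len(a), dtype=bool)
        rot, t = kabsch(a[mask], b[mask])
        dist = np.linalg.norm(a @ rot.T + t - b, axis=1)
        new_mask = dist <= d_cut
        if np.array_equal(new_mask, mask):  # atoms used == atoms within d_cut
            return mask
        mask = new_mask
    return mask  # iteration cap reached; result may not be self-consistent

def common_core(models, d_cut):
    """Atoms present in every pairwise self-consistent core.

    models: (n_models, n_atoms, 3) array.
    """
    core = np.ones(models.shape[1], dtype=bool)
    for i, j in combinations(range(len(models)), 2):
        core &= pair_core(models[i], models[j], d_cut)
    return core
```

Calling `common_core` once per cutoff in a list of d_cut values would mimic the scan over cutoffs that the ENSEMBLATOR script performs, although the specific ten values are not given in the text and are not assumed here.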

The local conformation comparison approach

The local backbone conformation of a residue in a protein is most commonly defined by its φ and ψ torsion angles, assuming, as is true for most peptide units, that the peptide adopts a trans (ω ∼ 180°) conformation. However, in comparing how similar two residues are in conformation, having two angles to compare is not ideal, as it is neither intuitive nor obvious how variations in these two angles translate into structural differences. For this reason, we developed a simple distance-based quantity that does not define individual conformations but defines how closely two conformations compare.23 We call this quantity the locally overlaid dipeptide residual (LODR), and it is calculated as follows [Fig. 3(A)]: the dipeptides are first overlaid based on the Cα, C, O, N, and Cα atoms of the peptide unit preceding the residue, and then the LODR-score is defined as the sum of the distances between the C, O, N, and Cα atoms in the subsequent peptide unit. Given this definition, no LODR values will exist for the first and last residues in a protein (as there are not complete peptide units on both sides of these residues), or for residues bordering chain breaks. The LODR distances appear to capture the structural impact of variations in φ,ψ-angles in a conceptually intuitive way, with the pattern being largely independent of the reference φ,ψ-angle and changes in φ having a much greater impact than changes in ψ [Fig. 3(B)]. The LODR distances range from 0 Å for identical conformations to ∼5 Å for conformations having their φ-values about 180° different.
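The LODR calculation described above can be illustrated with a short sketch. It follows the wording of this section literally (overlay on the five atoms of the preceding peptide unit, then sum the four distances in the subsequent peptide unit). The residue indexing in the docstring, the function names, and the use of a Kabsch least-squares overlay are our assumptions rather than details confirmed by the text.

```python
import numpy as np

def kabsch(mobile, target):
    """Least-squares rotation R and translation t mapping mobile onto target."""
    mc, tc = mobile.mean(axis=0), target.mean(axis=0)
    u, _, vt = np.linalg.svd((mobile - mc).T @ (target - tc))
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return rot, tc - rot @ mc

def lodr(dipep_a, dipep_b):
    """Locally overlaid dipeptide residual between two conformations of residue i.

    Each input is a (9, 3) array ordered (assumed indexing) as
    CA(i-1), C(i-1), O(i-1), N(i), CA(i), C(i), O(i), N(i+1), CA(i+1).
    """
    a = np.asarray(dipep_a, dtype=float)
    b = np.asarray(dipep_b, dtype=float)
    rot, t = kabsch(a[:5], b[:5])  # overlay on the preceding peptide unit
    moved = a @ rot.T + t
    # sum of distances over C, O, N, CA of the subsequent peptide unit
    return float(np.linalg.norm(moved[5:] - b[5:], axis=1).sum())
```

Because the score depends only on coordinates after a local overlay, it is insensitive to where the residue sits in the global superposition, which is what allows eeLOCAL to report local differences that a global comparison can miss.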

Results

In this section, we use test cases to explore the behavior of the novel global overlay strategy and how it compares with other available approaches, and to illustrate the utility of the ENSEMBLATOR strategy for providing useful information in the four applications listed at the end of the introduction. The test cases use two proteins. The first is RNase Sa, a 96 amino acid enzyme having a 20-member ensemble (PDB entry 1C54) from an NMR-based solution structure determination22 and a crystal structure solved at 1.2 Å resolution (PDB entry 1RGG) with two molecules in the asymmetric unit.24 The second is Ygdr, an uncharacterized 50 amino acid protein with an NMR-derived 20-member ensemble (PDB entry 2JN0) and a crystal structure solved at 2.7 Å resolution (PDB entry 3FIF) with eight molecules in the asymmetric unit, both solved by the Northeast Structural Genomics Consortium and used in recent tests of how the Rosetta energy function can be used to improve the quality of NMR-derived structures.26

Variation in the common core size and overlay behavior as a function of d_cut

Running the RNase Sa 20-member ensemble through eeCORE and eeGLOBAL with the set of 10 standard d_cut values reveals the pattern of behavior we expect to see for most protein ensembles. The number of atoms in the common core decreases monotonically as d_cut decreases, going from 100% of atoms at d_cut = 50 Å to 0% at d_cut = 0.5 Å [Fig. 4(A)]. Gratifyingly, a series of plots of the backbone rms spread of the ensemble along the chain for each d_cut [Fig. 4(B)] is rather consistent for most of the d_cut values. Notable variations occur at d_cut below 1.5 Å, for which fewer than 50% of the atoms are included in the common core. A third plot showing how the final rms backbone spread of each residue varied as a function of d_cut [Fig. 4(C)] shows that the overlay of the large majority of protein atoms is moderately consistent for d_cut between 2 and 4 Å, implying that the choice of any of these values would yield similar results.

Comparison of global overlay quality with THESEUS, CYRANGE, and SuperPose

Using an eeCORE d_cut value of 2.5 Å, we can compare the RNase Sa superposition calculated by the ENSEMBLATOR with the superpositions calculated by three other programs currently used for the same purpose: THESEUS,16 CYRANGE,14 and SuperPose.13 As seen in Figure 5(A), the qualitative patterns of variation in the ensemble along the chain yielded by the four programs are the same. Two families of curves are seen, with the statistics reported by THESEUS and CYRANGE being in a higher group and those from SuperPose being in a lower group. The eeGLOBAL output includes two measures of ensemble spread, and one of these—the rms spreads of the ensemble around the average structure—closely matches the SuperPose results, and the other—the pairwise distances between the members of the ensemble—closely matches the results of THESEUS and CYRANGE. We conclude that all four methods produce roughly equivalent results in a typical case such as this and that this provides validation for the novel eeCORE approach for defining a core to guide overlays.
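A possible reason why the two measures of spread fall into separate families of curves can be seen from a simple identity, sketched below under the assumption that both are rms quantities computed on the same overlaid coordinates. For N models, the rms of all pairwise distances for an atom equals the rms deviation about the arithmetic mean position multiplied by sqrt(2N/(N-1)), which is about 1.45 for N = 20. The function names are illustrative, and the averaged structures used by the programs compared here may not be simple arithmetic means.

```python
import numpy as np

def rms_about_mean(coords):
    """rms deviation of one atom's positions from their arithmetic mean.

    coords: (n_models, 3) positions of the same atom in overlaid models.
    """
    dev = coords - coords.mean(axis=0)
    return np.sqrt((dev ** 2).sum(axis=1).mean())

def rms_pairwise(coords):
    """rms of the distances between all distinct pairs of models."""
    i, j = np.triu_indices(len(coords), k=1)
    return np.sqrt(((coords[i] - coords[j]) ** 2).sum(axis=1).mean())

def expected_ratio(n_models):
    """Ratio rms_pairwise / rms_about_mean implied by the identity."""
    return np.sqrt(2.0 * n_models / (n_models - 1))
```

Under these assumptions the two measures carry the same information and differ only by a constant factor that depends on the number of models.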

Application 1: Global and local precision analysis of a single NMR ensemble

As an example of the analysis of an NMR-based ensemble, the 20 member ensemble of RNase Sa was run through the ENSEMBLATOR using m,n = 20,0 to assess the global and local precision along the chain (Fig. 5). Since there is no second group of structures to be compared, the ENSEMBLATOR output only includes the rms variations for each atom or residue of the single ensemble.

Global precision analysis

The global variation plot created using d_cut = 2.5 Å [Fig. 5(A)] reveals nothing new, but it shows, in agreement with the analyses in the original article,22 that there are three regions of larger global variability (i.e., poorer precision) in the NMR ensemble: the N-terminus, residues 25–33, and residues 44–50. All three regions reside in loops and have variation in the 1–2 Å range.

Local precision analysis

The plot provided by eeLOCAL [Fig. 5(B)] reports the LODR score for each residue (except the first and last) and provides additional information compared with the conventional global analysis. This plot shows much sharper variability and reveals exactly which residues have conformational variation that gives rise to the global variation. For instance, the variability observed in the eeGLOBAL analysis of regions 25–33 and 44–50 can be attributed largely to local variation at residues 30, 32, and 45–47. From a practical perspective, eeLOCAL identifies these residues as places where additional restraints could be sought in order to improve the precision of the structure. Conversely, while residues 39–42 and 60–62 are identified as regions of medium global variation, the eeLOCAL analysis reveals that these residues are actually quite precisely defined in terms of their local conformation.

Application 2: Comparison of an NMR ensemble with a crystal structure or crystal structure ensemble

It is common to compare solution structures to crystal structures, and in the original RNase Sa NMR structure determination, the NMR-derived ensemble was compared separately to chains A and B of a crystal structure of the same protein (see Fig. 7 of Ref. 22). Only small differences for the strands and helices were noted, and larger differences occurring in loops near residues 32, 46, and 63 and in side chains near residue 76, were explained as being due to the influences of crystal packing or, in the case of the 60s loop, to phosphate binding or salt concentration. The authors concluded that “no significant differences were detected.”

Using the ENSEMBLATOR to compare the NMR ensemble separately with chain A or chain B (using runs with m,n = 20,1), and also with a two member crystal structure ensemble consisting of both chains A and B (using a run with m,n = 20,2) provides a different picture. In the global comparisons with chains A or B [Fig. 6(A,B)], the most robustly defined differences are where the closest approach of any NMR model to the crystal structure (red solid line) exceeds the intraensemble variations of the NMR models (blue line). As seen in these plots, such differences occur for both chains A and B near residues 5, 17–21, 38–42, and 76–82; these include residues in secondary structural elements and none of them were commented on in the original report.22 The two differences originally noted in loops near residues 32 and 46 are less robustly defined because of their higher variability in the NMR structure; the difference originally noted in the 60s loop appears significant in the comparison with chain A, but not in the comparison with chain B. This variation between chains A and B emphasizes the potential value of being able to compare with both crystal structures at the same time. Such an ensemble-ensemble comparison [Fig. 6(C)] includes a fourth curve representing the intragroup variation of the X-ray models (green) and identifies residues 61–63 as a point of major variation between the X-ray structures. In this way, the ensemble-ensemble comparison effectively deemphasizes the differences with the NMR structures in the 61–63 region while reinforcing the reality and consistency of the backbone differences near residues 5, 17–21, 38–42, and 76–82.

Furthermore, the eeLOCAL output for comparing the NMR ensemble with the two chain X-ray ensemble [Fig. 6(D)] identifies many residues for which the difference in local backbone conformation between the closest models (red solid curve) substantially exceeds the intraensemble variations (blue and green curves); these include residues 4, 12, 23, 38, 43, 55–57, 61–63, 75, 78, 82–83, and 89–91. Especially striking is residue 83, for which the LODR score is near the maximal possible value of 5 Å, even though the global comparison [Fig. 6(C)] does not show much discrepancy. A closer examination of this segment (Fig. 7) reveals that between the crystal and solution structures the Cα-path is similar, but the 82–83 peptide is flipped, making the ψ-value for residue 82 and the φ-value for residue 83 nearly 180° different. A search using the Protein Geometry Database (PGD27) of diverse crystal structures solved at 1.8 Å resolution or better found 4011 and 38 examples of peptides having φ,ψ-angles within ±30° of those seen for residues 82 and 83 in the crystal and the NMR-based structures, respectively. In this light, both conformations are plausible, but the conformation modeled in the NMR structure is so much rarer that it would be deserving of scrutiny.

Although investigating the origins of these differences (i.e., deciphering which are real and which are a result of data or modeling limitations), such as by carrying out a joint X-ray/NMR refinement,28 is beyond the scope of this report, the example illustrates the utility of ensemble-ensemble comparisons in deriving powerfully informative residue-level information about areas of global and local difference, and also highlights how worthwhile it is to have an effective way to account for the variability that exists among crystal structures of a given protein.

Application 3: Assessing the self-consistency of an NMR ensemble

A further analysis of the RNase Sa solution structure ensemble that illustrates the utility of ENSEMBLATOR comparisons is to assess the extent to which the 20 models deposited in the PDB provide a robust sampling of the relevant conformational space.1 Because the individual models of such ensembles are supposed to be of similar quality in terms of their levels of agreement with the NMR data and their energetics, any half of an ensemble should sample a similar range of conformational space as the other half. To test how well this holds true for the RNase Sa ensemble, we compared the first 10 models in the ensemble to the last 10 (i.e., using m,n = 10,10). For the global [Fig. 8(A)] comparison, the closest approach curve (red solid) is lower than all other curves, showing that the two halves of the ensemble have no consistent systematic differences. Also, the variation among the first and second 10 models is similar, except near residue 30 where the second 10 models (green) show about five-fold the variation of the first 10 (blue).

To localize the structures causing the difference, we examined the tabular eeCORE output of the rms deviations for each pair of models [Fig. 8(B)], and these showed that models 19 and 20 are similar to each other and differ from the first 18 models. We then ran the ENSEMBLATOR again to compare models 1–18 with 19–20 (i.e., using m,n = 18,2). The eeLOCAL output of this comparison [Fig. 8(C)] showed distinct differences at Tyr30 and Gln32 with these residues having LODR values of 2.5–3.5 Å for the most similar conformations in the two groups (red curve) compared with internal variations near or below 1.0 Å (blue and green curves).

Looking at the structures [Fig. 9(A)] and a φ,ψ-plot [Fig. 9(B)] confirms that residues 30–32 adopt a different conformation in models 19–20 compared with 1–18, and that the conformation seen in models 1–18 is more similar to the conformation in the crystal structures. The φ,ψ-plot further shows that residues 31 and 33 in models 19–20 adopt uncommon outlier conformations.29 As above, using the PGD27 to survey diverse proteins solved at 1.8 Å resolution or better, we found 96, 0, and 1176 segments with φ,ψ values within ±30° of the conformations seen in models 1–18, models 19–20, and the crystal structures, respectively. Based on this, we suggest that the conformation of this region of models 19 and 20 is not plausible. Also, given that models are typically ranked by their relative energies so that models 19 and 20 are, we presume, the highest energy models in this ensemble, we further suggest that such self-consistency checks of NMR-derived (and other) ensembles combined with PGD searches could be useful during ensemble preparation to decide at what point higher energy structures in the ensemble are not truly of comparable quality to the rest of the structures and should be deemed unacceptable.

Application 4: Comparing NMR ensembles derived by different protocols

The ENSEMBLATOR can also be used to gain insight into how different structure determination protocols can impact structure quality/accuracy. Rosetta refinement has been shown in specific cases to improve NMR structure quality (e.g., Ref. 30), and recently Mao et al.26 tested 40 NMR and crystal structure pairs to see how much refinement with a standard Rosetta protocol alone or in conjunction with NMR-based restraints could increase the similarity of the ensembles with crystal structures. The level of agreement of an NMR ensemble with its corresponding crystal structure was measured only by a single overall value: the GDT.TS (Ref. 31). Here, we use the ENSEMBLATOR to reanalyze the models of Ygdr, for which restrained Rosetta refinement increased the GDT.TS from 0.77 to 0.81, a small numerical increase that provides no information about how the refinement impacted the model. As for the analyses above, we used d_cut = 2.5 Å for global comparisons.

The global comparison of the original NMR ensemble with a crystal structure-based ensemble [Fig. 10(A)] shows that other than the N- and C-termini, there are four broad regions of variability within the original NMR ensemble (blue curve) at residues 7–9, 15–17, 22–24, and 32–35, and that residues 15–18 is the segment matching worst with the crystal structure. The eeLOCAL output (not shown) highlights Lys17 as having particularly poor agreement with respect to the crystal structure. After Rosetta refinement without NMR restraints [Fig. 10(B)], the internal variations near residues 17, 23, and 34 increase and the spread near residue 8 decreases dramatically, while remarkably in all four places the closest agreement of any of the NMR models with a crystal structure model improves somewhat. This implies that the Rosetta energy function is improving the sampling of conformations close to those adopted in the crystal structure. As one example, the eeLOCAL results show that at Lys17 the closest models decrease from LODR = 3.0 Å to 2.2 Å (not shown).

Combining the NMR restraints with the Rosetta refinement [Fig. 10(C)] results in broad decreases in ensemble spread, especially at residues 22–24, indicating that the Rosetta energy function and the NMR-based restraints carry distinct and synergistic information. Interestingly, at the loop near residue 33 where the crystal structure ensemble shows the greatest variation, the use of Rosetta actually makes the model slightly less similar to the crystal structures. In contrast, the two regions where Rosetta most dramatically decreased the ensemble spread and increased its agreement with the crystal structures are both β-strands (7–9, 22–24). We conclude that this type of analysis, made easy by the ENSEMBLATOR, provides much more information than does the single overall GDT.TS score reported by Mao et al.,26 and that such insights about how the Rosetta energy function does and does not improve models could help guide improvements in energy functions as well as structure determination protocols.

Discussion

As illustrated by the four example applications presented above, the ENSEMBLATOR approach to structural comparisons provides a useful tool that is notably more informative than current approaches used for identifying systematic differences between an ensemble and a single structure or another ensemble. In addition to the above examples, we note that a recently published study by Nyarko et al.32 used the ENSEMBLATOR to analyze the NMR ensembles of natural active and inactive variants of a fungal toxin of wheat that differ by a few point mutations. Those analyses identified a region of subtle difference that had notably higher rms spreads in the inactive form, and the authors hypothesized that the additional flexibility in this loop region was in part responsible for the inactivity.

The two most novel concepts of the ENSEMBLATOR analyses are the calculation of a “closest-approach-between-any-members-of-the-ensemble” statistic and the eeLOCAL analyses that use the LODR score to provide a convenient single distance-based measure of local backbone conformational similarity. The former innovation enhances the sensitivity with which the most significant differences can be identified in both the global and local comparisons, and the eeLOCAL analyses help pinpoint the local variations that underlie global differences. A third novel aspect of the ENSEMBLATOR is the eeCORE approach of carrying out all pairwise comparisons to identify a common self-consistent core, as well as to provide statistics that can allow subgroups of similar structures to be identified among the compared models (e.g., Ref. 24) [Fig. 8(B)]. The choice to have the ENSEMBLATOR run eeCORE for a series of d_cut values for every comparison (rather than picking 2.5 Å as a generally useful standard value) provides a purposeful reinforcement of the oft-neglected reality that any given “best overlay” is just one of many that could be chosen, and as noted above, allows users to easily assess how robust their results are to the choice of the overlay parameters.
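
The per-atom quantities discussed above can be illustrated with a minimal Python sketch. It assumes the two ensembles have already been placed in one common frame by a global overlay, and it takes each variation to be the rms of pairwise distances between copies of the same atom. That definition, and the names `rms` and `atom_statistics`, are assumptions made for illustration only; this is not the ENSEMBLATOR source code, and the coordinates are toy values.

```python
import math
from itertools import combinations, product


def rms(values):
    """Root-mean-square of a sequence; 0.0 for an empty sequence."""
    values = list(values)
    if not values:
        return 0.0
    return math.sqrt(sum(v * v for v in values) / len(values))


def atom_statistics(positions_m, positions_n):
    """Compare one atom across two ensembles that share a global overlay.

    positions_m, positions_n: non-empty lists of (x, y, z) tuples giving the
    position of the same atom in every model of ensemble M and ensemble N.
    """
    # Distances between all pairs of models within each ensemble.
    within_m = [math.dist(a, b) for a, b in combinations(positions_m, 2)]
    within_n = [math.dist(a, b) for a, b in combinations(positions_n, 2)]
    # Distances between every model of M and every model of N.
    between = [math.dist(a, b) for a, b in product(positions_m, positions_n)]
    return {
        "intra_m": rms(within_m),          # spread within ensemble M
        "intra_n": rms(within_n),          # spread within ensemble N
        "inter": rms(between),             # spread between the ensembles
        "closest_approach": min(between),  # best agreement of any M/N pair
    }


# Toy coordinates (hypothetical, in Å) for one atom in two 2-model ensembles.
stats = atom_statistics(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(3.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
)
print(stats)
```

In this toy case the closest approach (2.0 Å) is larger than the spread within the first ensemble (1.0 Å) and equal to the spread within the second (2.0 Å), the kind of pattern that would mark an atom as differing between ensembles. A single-model "ensemble" such as one crystal structure simply yields an intraensemble value of 0.0.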

One aspect of our work with immediate implications is our unexpected finding that the last two models of the RNase Sa NMR ensemble have a segment that adopts implausible ϕ,ψ-angles. We see no reason to suspect that this is a problem unique to this particular ensemble, and suggest that doing such consistency checks during NMR ensemble generation would be a useful general application of the ENSEMBLATOR. As far as we are aware, no standard protocols exist for deciding on the most appropriate size or constitution of an ensemble, or for instance, how large an energy gap between the lowest and highest energy model is acceptable. In this light, we further suggest that it would be useful to record the relative energies of the individual models in an ensemble as part of PDB depositions.

An additional application of such ensemble comparisons would be to more readily find differences between modeled and experimental protein ensembles (e.g., Ref. 4), and potentially to guide the improvement of the forcefields associated with modeling programs. It was, for instance, recently shown that identifying details of protein structure that were consistently predicted incorrectly very effectively guided specific improvements in the Rosetta forcefield.33 Similarly specific guidance could be obtained by using eeLOCAL to compare ensembles of template-based models against crystal structures (or, even better, crystal structure ensembles) to locate specific systematic backbone differences that could reflect inadequacies in the forcefield.

We note that the requirement of the current version of the ENSEMBLATOR that all input models contain exactly the same set of atoms is not a limitation in principle; it could be overcome by a more sophisticated front end that assigns equivalent atoms between nonidentical structures. In concept, creating an ENSEMBLATOR version that can handle nonidentical proteins may be as straightforward as using an existing program like SuperPose13 as a front end to define equivalent atoms and overlay the structures, and then feeding those files to modified versions of eeGLOBAL and eeLOCAL. The other limitation worth explicitly noting is that because the eeCORE and eeGLOBAL comparisons use a single overlay, protein domains that move relative to one another should be analyzed separately, as is true for other current ensemble analysis approaches. This limitation, however, does not apply to the eeLOCAL analyses, and in fact, eeLOCAL analyses can be useful for locating specific “hinge” residues that are involved in conformational changes.
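
The same-atoms requirement can be made concrete with a short Python sketch that keeps only the atoms present in every model. It assumes each model is represented as a dictionary keyed by a hypothetical (chain, residue number, atom name) tuple, and the function names are likewise hypothetical. It only handles models that share a numbering and naming scheme, so it illustrates the requirement rather than the alignment-based front end suggested above, and it is not the eePREP implementation.

```python
def common_atom_keys(models):
    """Return the sorted atom keys that are present in every model.

    models: list of dicts mapping (chain, residue_number, atom_name)
    to an (x, y, z) coordinate tuple.
    """
    if not models:
        return []
    shared = set(models[0])
    for model in models[1:]:
        shared &= set(model)  # keep only keys also found in this model
    return sorted(shared)


def restrict_to_common(models):
    """Return copies of the models trimmed to their shared atoms."""
    keys = common_atom_keys(models)
    return [{key: model[key] for key in keys} for model in models]


# Toy models (hypothetical coordinates): the second lacks residue 2.
model_a = {
    ("A", 1, "N"): (0.0, 0.0, 0.0),
    ("A", 1, "CA"): (1.5, 0.0, 0.0),
    ("A", 2, "N"): (2.5, 1.0, 0.0),
}
model_b = {
    ("A", 1, "N"): (0.1, 0.0, 0.0),
    ("A", 1, "CA"): (1.4, 0.1, 0.0),
}
trimmed = restrict_to_common([model_a, model_b])
print(common_atom_keys([model_a, model_b]))
```

Handling nonidentical proteins would require replacing the key intersection with an equivalence assignment from a structure-alignment program, which is beyond this sketch.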

Acknowledgment

The authors thank Blaine Roberts and Donnie Berkholz for contributions to the source code.

1 The original report [22] carried out analyses on an ensemble of 40 structures, but we do not have access to all of these as only the top 20 were deposited in the PDB.

References

1. Hartmann H, Parak F, Steigemann W, Petsko GA, Ponzi DR, Frauenfelder H (1982) Conformational substates in a protein: structure and dynamics of metmyoglobin at 80 K. Proc Natl Acad Sci USA 79: 4967–4971. doi:10.1073/pnas.79.16.4967
2. Burnley BT, Afonine PV, Adams PD, Gros P (2012) Modelling dynamics in protein crystal structures by ensemble refinement. Elife 1: e00311. doi:10.7554/eLife.00311
3. Best RB, Lindorff-Larsen K, DePristo MA, Vendruscolo M (2006) Relation between native ensembles and experimental structures of proteins. Proc Natl Acad Sci USA 103: 10901–10906. doi:10.1073/pnas.0511156103
4. Tyka MD, Keedy DA, Andre I, Dimaio F, Song Y, Richardson DC, Richardson JS, Baker D (2011) Alternate states of proteins revealed by detailed energy landscape mapping. J Mol Biol 405: 607–618. doi:10.1016/j.jmb.2010.11.008
5. Vögeli B, Orts J, Strotz D, Chi C, Minges M, Walti MA, Guntert P, Riek R (2014) Towards a true protein movie: a perspective on the potential impact of the ensemble-based structure determination using exact NOEs. J Magn Reson 241: 53–59. doi:10.1016/j.jmr.2013.11.016
6. Gros P, van Gunsteren WF, Hol WG (1990) Inclusion of thermal motion in crystallographic structures by restrained molecular dynamics. Science 249: 1149–1152. doi:10.1126/science.2396108
7. Terwilliger TC, Grosse-Kunstleve RW, Afonine PV, Adams PD, Moriarty NW, Zwart P, Read RJ, Turk D, Hung LW (2007) Interpretation of ensembles created by multiple iterative rebuilding of macromolecular models. Acta Crystallogr D Biol Crystallogr 63: 597–610. doi:10.1107/S0907444907009791
8. Brünger AT (1997) Free R value: cross-validation in crystallography. Methods Enzymol 277: 366–396. doi:10.1016/S0076-6879(97)77021-6
9. Fenwick RB, van den Bedem H, Fraser JS, Wright PE (2014) Integrated description of protein dynamics from room-temperature X-ray crystallography and NMR. Proc Natl Acad Sci USA 111: E445–E454. doi:10.1073/pnas.1323440111
10. Levin EJ, Kondrashov DA, Wesenberg GE, Phillips GN Jr. (2007) Ensemble refinement of protein crystal structures: validation and application. Structure 15: 1040–1052. doi:10.1016/j.str.2007.06.019
11. Jewett AI, Huang CC, Ferrin TE (2003) MINRMS: an efficient algorithm for determining protein structure similarity using root-mean-squared-distance. Bioinformatics 19: 625–634. doi:10.1093/bioinformatics/btg035
12. Liu YS, Fang Y, Ramani K (2009) Using least median of squares for structural superposition of flexible proteins. BMC Bioinformatics 10: 29. doi:10.1186/1471-2105-10-29
13. Maiti R, Van Domselaar GH, Zhang H, Wishart DS (2004) SuperPose: a simple server for sophisticated structural superposition. Nucleic Acids Res 32: W590–594. doi:10.1093/nar/gkh477
14. Kirchner DK, Guntert P (2011) Objective identification of residue ranges for the superposition of protein structures. BMC Bioinformatics 12: 170. doi:10.1186/1471-2105-12-170
15. Kelley LA, Gardner SP, Sutcliffe MJ (1996) An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally related subfamilies. Protein Eng 9: 1063–1065. doi:10.1093/protein/9.11.1063
16. Theobald DL, Wuttke DS (2006) THESEUS: maximum likelihood superpositioning and analysis of macromolecular structures. Bioinformatics 22: 2171–2172. doi:10.1093/bioinformatics/btl332
17. Zagrovic B, Pande VS (2004) How does averaging affect protein structure comparison on the ensemble level? Biophys J 87: 2240–2246. doi:10.1529/biophysj.104.042184
18. Cheetham JC, Smith DM, Aoki KH, Stevenson JL, Hoeffel TJ, Syed RS, Egrie J, Harvey TS (1998) NMR structure of human erythropoietin and a comparison with its receptor bound conformation. Nat Struct Biol 5: 861–866. doi:10.1038/2302
19. Eickholt J, Wang Z, Cheng J (2011) A conformation ensemble approach to protein residue-residue contact. BMC Struct Biol 11: 38. doi:10.1186/1472-6807-11-38
20. Hegler JA, Latzer J, Shehu A, Clementi C, Wolynes PG (2009) Restriction versus guidance in protein structure prediction. Proc Natl Acad Sci USA 106: 15302–15307. doi:10.1073/pnas.0907002106
21. Roy A, Perez A, Dill KA, Maccallum JL (2014) Computing the relative stabilities and the per-residue components in protein conformational changes. Structure 22: 168–175. doi:10.1016/j.str.2013.10.015
22. Laurents D, Perez-Canadillas JM, Santoro J, Rico M, Schell D, Pace CN, Bruix M (2001) Solution structure and dynamics of ribonuclease Sa. Proteins 44: 200–211. doi:10.1002/prot.1085
23. Sevcik J, Dauter Z, Lamzin VS, Wilson KS (1996) Ribonuclease from Streptomyces aureofaciens at atomic resolution. Acta Crystallogr D Biol Crystallogr 52: 327–344. doi:10.1107/S0907444995007669
24. DeWitte RS, Michnick SW, Shakhnovich EI (1995) Exhaustive enumeration of protein conformations using experimental restraints. Protein Sci 4: 1780–1791. doi:10.1002/pro.5560040913
25. Spezio MLR (1994) The crystal structure of the catalytic domain of cellulase E2 from Thermomonospora fusca at atomic (1.18 Å) resolution. Cornell University.
26. Mao B, Tejero R, Baker D, Montelione GT (2014) Protein NMR structures refined with Rosetta have higher accuracy relative to corresponding X-ray crystal structures. J Am Chem Soc 136: 1893–1906. doi:10.1021/ja409845w
27. Berkholz DS, Krenesky PB, Davidson JR, Karplus PA (2010) Protein Geometry Database: a flexible engine to explore backbone conformations and their relationships to covalent geometry. Nucleic Acids Res 38: D320–325. doi:10.1093/nar/gkp1013
28. Rinaldelli M, Ravera E, Calderone V, Parigi G, Murshudov GN, Luchinat C (2014) Simultaneous use of solution NMR and X-ray data in REFMAC5 for joint refinement/detection of structural differences. Acta Crystallogr D Biol Crystallogr 70: 958–967. doi:10.1107/S1399004713034160
29. Davis IW, Murray LW, Richardson JS, Richardson DC (2004) MOLPROBITY: structure validation and all-atom contact analysis for nucleic acids and their complexes. Nucleic Acids Res 32: W615–W619. doi:10.1093/nar/gkh398
30. Ramelot TA, Raman S, Kuzin AP, Xiao R, Ma LC, Acton TB, Hunt JF, Montelione GT, Baker D, Kennedy MA (2009) Improving NMR protein structure quality by Rosetta refinement: a molecular replacement study. Proteins 75: 147–167. doi:10.1002/prot.22229
31. Zemla A (2003) LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res 31: 3370–3374. doi:10.1093/nar/gkg571
32. Nyarko A, Singarapu KK, Figueroa M, Manning VA, Pandelova I, Wolpert TJ, Ciuffetti LM, Barbar E (2014) Solution NMR structures of Pyrenophora tritici-repentis ToxB and its inactive homolog reveal potential determinants of toxin activity. J Biol Chem 289: 25946–25956. doi:10.1074/jbc.M114.569103
33. Song Y, Tyka M, Leaver-Fay A, Thompson J, Baker D (2011) Structure-guided forcefield optimization. Proteins 79: 1898–1909. doi:10.1002/prot.23013
34. Kabsch W (1978) A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr A34: 827–828. doi:10.1107/S0567739478001680

Volume 24, Issue 9, September 2015, Pages 1528-1542