Volume 40, Issue 9 pp. 1270-1279
SPECIAL ARTICLE

Using secondary structure to predict the effects of genetic variants on alternative splicing

Robert Wang (Corresponding Author)
Department of Bioengineering, University of California, Berkeley, California
Department of Plant and Microbial Biology, University of California, Berkeley, California
Correspondence: Robert Wang, Department of Bioengineering, University of California, Berkeley, CA 94720. Email: [email protected]

Yaqiong Wang
Department of Plant and Microbial Biology, University of California, Berkeley, California

Zhiqiang Hu
Department of Plant and Microbial Biology, University of California, Berkeley, California
First published: 10 May 2019
Citations: 4

For the CAGI5 Special Issue.

Abstract

Accurate interpretation of genomic variants that alter RNA splicing is critical to precision medicine. We present a computational framework, Prediction of variant Effect on Percent Spliced In (PEPSI), that predicts the splicing impact of coding and noncoding variants for the Fifth Critical Assessment of Genome Interpretation (CAGI5) “Vex-seq” challenge. PEPSI is a random forest regression model trained on multiple layers of features associated with sequence conservation and regulatory sequence elements. Compared to other splicing defect prediction tools from the literature, our framework integrates secondary structure information in predicting variants that disrupt splicing regulatory elements (SREs). We applied our model to classify splice-disrupting variants among 2,094 single-nucleotide polymorphisms from the Exome Aggregation Consortium using model-predicted changes in percent spliced in (ΔPSI) associated with tested variants. Benchmarking our model against widely used state-of-the-art tools, we demonstrate that PEPSI achieves comparable performance in terms of sensitivity and precision. Moreover, we also show that using secondary structure context can help resolve several cases where changes in the counts of SREs do not correspond with the directionality of ΔPSI measured for tested variants.

1 INTRODUCTION

Alternative splicing is a major regulatory mechanism that accounts for the macromolecular and cellular complexity observed in eukaryotic organisms (Nilsen & Graveley, 2010). During alternative splicing, the exons of primary transcripts are spliced together in different arrangements by the spliceosome, yielding a structurally and functionally diverse population of messenger RNAs (mRNAs) and proteins. The activity of the spliceosome is primarily regulated through interactions with cis-acting RNA sequence elements, including the donor splice site, acceptor splice site, and branch point site. There are also additional regulatory sequences known as splicing regulatory elements (SREs) that influence choice of adjacent splice sites. By convention, SREs are organized by localization and effect on splicing as exonic splicing enhancers (ESEs), exonic splicing silencers (ESSs), intronic splicing enhancers, and intronic splicing silencers. Intriguingly, studies have demonstrated that molecular recognition of SREs is influenced by RNA secondary structure and chromatin states, which mediate sequence accessibility (Fu & Ares, 2014).

Given the complex regulation of alternative splicing, it is unsurprising that genetic variants that impact pre-mRNA splicing efficiency are implicated in many diseases. In fact, it has been suggested that approximately one-third of all disease-causing mutations alter pre-mRNA splicing (Lim, Ferraris, Filloux, Raphael, & Fairbrother, 2011). Therefore, developing methods that can accurately identify genomic variants that alter splicing is critical to advancing personalized medicine.

A splicing reporter mini-gene assay is an experimental strategy to systematically evaluate the effects of genetic variants on splicing of a certain exon. Recently, a high-throughput reporter system called Vex-seq was developed to determine the splicing impact of exonic and intronic variants for the same exon simultaneously (Adamson, Zhan, & Graveley, 2018). Vex-seq compares the percent spliced-in (PSI or Ψ), a metric representing the fraction of transcripts harboring a given exon, between constructs containing a reference sequence and constructs containing a particular variant. The change in PSI (ΔΨ) is then calculated for each tested variant. While such assays are straightforward and reliable for detecting splicing defects, technical limitations prohibit their use in routine clinical practice. Moreover, such assays do not fully represent true in vivo splicing, given that they cover only one or several exons and lack the context of the entire gene.

Over the past several years, a number of in silico prediction tools for detecting mutations that alter splicing have been developed. One class of tools, including MutPredSplice (Mort et al., 2014) and Human Splicing Finder (Desmet et al., 2009), integrates multiple layers of regulatory sequence features to output a score representing the probability that a given variant disrupts splicing. However, the outputs of these tools fail to capture the resulting physical effects of variants on splicing, such as the extent to which a variant increases or decreases the frequency of alternative exon inclusion.

Alternatively, there are other computational models, including HAL (Rosenberg, Patwardhan, Shendure, & Seelig, 2015) and SPANR (Xiong et al., 2014), that directly predict the effects of genetic variants on the relative amounts of alternatively spliced isoforms. However, these tools are often limited in terms of the types of variants that they can evaluate. For example, SPANR is limited to analyzing single nucleotide changes, whereas HAL is limited to variants within the alternative exon. Additionally, while these tools do claim predictive power, their predictions are still far from reliable enough for clinical translation.

A common approach used by many tools for predicting the impact of mutations on splicing is to analyze a given sequence for hexamers that may putatively function as SREs. However, it is possible that for a given sequence context, an identified SRE may not be active because it is occluded by secondary structure. This derives from a long-standing view that sequence-specific RNA binding proteins have limited accessibility to the bases within the major groove of double-stranded RNA, which is narrower than that of double-stranded DNA (Mattaj & Nagai, 1994). It has been experimentally shown that several major splicing factors, such as SR proteins and hnRNP proteins, exhibit reduced binding efficiency to sequence motifs that form secondary structures (Buratti et al., 2004; Damgaard, Tange, & Kjems, 2002). Under this framework, we hypothesized that methods that solely count the number of SREs lost/gained as a result of a mutation may yield several false positives and false negatives when predicting splice-altering variants.

To investigate this idea, we developed a computational framework for predicting the splicing impact of coding and noncoding variants as part of the Fifth Critical Assessment of Genome Interpretation (CAGI5) “Vex-seq” challenge. We trained a random forest regression model, Prediction of variant Effect on Percent Spliced In (PEPSI), using multiple layers of features associated with sequence conservation and regulatory sequence elements. In particular, for features related to SREs, we leverage RNA secondary structure information to characterize the impact of variants that disrupt putatively identified SREs. Compared to state-of-the-art splicing prediction tools, PEPSI achieves comparable sensitivity and precision in classifying splice-disrupting variants based on predicted ΔΨ of genomic variants. Moreover, we demonstrate that RNA secondary structure information can help resolve several cases where changes in the counts of SREs do not correspond with the directionality of ΔΨ measured for tested variants.

2 MATERIALS AND METHODS

2.1 Training and test datasets

The results of the Vex-seq experiment were provided as training and test sets in the CAGI5 “Vex-seq” challenge to evaluate computational approaches for predicting the impact of genetic variants on splicing.

The Vex-seq experiment measured the ΔΨ of 2,055 variants from the Exome Aggregation Consortium (ExAC; Kircher et al., 2014) using a library of reporter constructs transfected into HepG2 cells (Adamson et al., 2018). Variants on chromosomes 1–8 were assigned to the training set, and variants on chromosomes 9–22 and chromosome X were assigned to the test set. There were 957 variants within or adjacent to 52 exons in the training set, including 488 exonic single-nucleotide variants (SNVs), 14 exonic insertions/deletions (indels), 425 intronic SNVs, and 30 intronic indels. In contrast, there were 1,098 variants within or adjacent to 58 exons in the test set, of which there were 563 exonic SNVs, nine exonic indels, 495 intronic SNVs, and 31 intronic indels (Figure 1a).
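The chromosome-based partition described above can be illustrated with a minimal Python sketch. The function name `assign_split` and the handling of a `chr` prefix are assumptions made for illustration; they are not taken from the challenge materials.

```python
# Chromosomes 1-8 go to training; 9-22 and X go to test (as stated in the text).
TRAIN_CHROMS = {str(i) for i in range(1, 9)}
TEST_CHROMS = {str(i) for i in range(9, 23)} | {"X"}


def assign_split(chrom: str) -> str:
    """Return 'train' or 'test' for a chromosome name such as '3', 'chr3', or 'chrX'."""
    # Assumption: names may or may not carry a 'chr' prefix.
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    name = name.upper()
    if name in TRAIN_CHROMS:
        return "train"
    if name in TEST_CHROMS:
        return "test"
    raise ValueError(f"chromosome {chrom!r} is not part of either set")
```

Splitting by chromosome rather than by random sampling keeps all variants of a given exon in the same partition, so no exon contributes to both training and evaluation.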

Figure 1. Training and test data sets. (a) Pie chart showing the distribution of exonic SNVs, intronic SNVs, exonic indels, and intronic indels in the Vex-seq training and test sets. (b) Density plot showing the distribution of ΔΨ for variants in the Vex-seq training set and for variants in the Vex-seq test set. SNV, single-nucleotide variant

The median ΔΨ among the variants in the training set was urn:x-wiley:10597794:media:humu23790:humu23790-math-0007, whereas the median ΔΨ among the variants in the test set was urn:x-wiley:10597794:media:humu23790:humu23790-math-0009. Given that these values are reasonably close to 0%, we concluded that variants in both the training set and the test set were equally likely to increase or decrease the frequency of alternative exon inclusion. In both the training set and test set, the distribution of ΔΨ among the variants revealed that more than 50% of the variants had a measured urn:x-wiley:10597794:media:humu23790:humu23790-math-0011 (Figure 1b). Based on this threshold (Xiong et al., 2014), a large proportion of the variants from ExAC did not appear to disrupt splicing in HepG2 cells.

2.2 Features

We assembled 20 features characterizing each variant (Table 1), including distance to the nearest splice site, exon length, seven sequence-based features, and 11 conservation scores derived from annotations used in version 1.3 of CADD (Kircher et al., 2014). We chose to incorporate conservation features in our model given that previous studies have demonstrated how functional sequence elements associated with alternative splicing are phylogenetically and spatially conserved (Minovitsky, Gee, Schokrpur, Dubchak, & Conboy, 2005).

Table 1. Variant features used as predictors by PEPSI
Name | Description | Type | Source
GC | Percent GC in a window of ±75 bp | num | CADD v1.3
priPhCons | Primate PhastCons conservation score (excl. human) | num | CADD v1.3
mamPhCons | Mammalian PhastCons conservation score (excl. human) | num | CADD v1.3
verPhCons | Vertebrate PhastCons conservation score (excl. human) | num | CADD v1.3
priPhyloP | Primate PhyloP score (excl. human) | num | CADD v1.3
mamPhyloP | Mammalian PhyloP score (excl. human) | num | CADD v1.3
verPhyloP | Vertebrate PhyloP score (excl. human) | num | CADD v1.3
GerpN | Neutral evolution score defined by GERP++ | num | CADD v1.3
GerpS | Rejected Substitution score defined by GERP++ | num | CADD v1.3
bStatistic | Background selection score | int | CADD v1.3
mutIndex | Genome-wide mutability index | int | CADD v1.3
fitCons | fitCons score | num | CADD v1.3
nearest_ss_dist | Distance to nearest splice site | int |
exon_length | Length of exon | int |
exon | Variant is located within an exon | bool |
MaxEntScan_5ss | Difference in MaxEntScan::score5ss scores between mutated and reference sequences | num | MaxEntScan
MaxEntScan_3ss | Difference in MaxEntScan::score3ss scores between mutated and reference sequences | num | MaxEntScan
SVM_BP | Difference in SVM-BP branch point scores between mutated and reference sequences | num | SVM-BP
ESEseq | Weighted loss/gain of ESE sequence motifs | num | Ke et al. (2011)
ESSseq | Weighted loss/gain of ESS sequence motifs | num | Ke et al. (2011)
Abbreviations: ESE, exonic splicing enhancer; ESS, exonic splicing silencer; PEPSI, Prediction of variant Effect on Percent Spliced In; SVM, support vector machine.

To measure the impact of variants on splice site strength, we reported the difference in MaxEntScan (Yeo & Burge, 2004) scores between the mutated and reference sequences (MaxEntScan_5ss and MaxEntScan_3ss from Table 1). Specifically, we ran MaxEntScan::score5ss for variants that were either within 3 base pairs (bp) upstream or within 6 bp downstream of the splice donor site. We ran MaxEntScan::score3ss for variants that were either within 20 bp upstream or within 3 bp downstream of the splice acceptor site. Variants that were not within the specified window for either MaxEntScan tool were assigned a score difference of 0.
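The windowing rule for the two splice site features can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the signed `offset` convention (-1 is the last base upstream of the junction, +1 the first base downstream, no position 0) and the `scorer` callable, which stands in for MaxEntScan::score5ss or MaxEntScan::score3ss, are hypothetical.

```python
from typing import Callable


def in_donor_window(offset: int) -> bool:
    """True if the variant lies within 3 bp upstream or 6 bp downstream of the donor site."""
    return -3 <= offset <= 6 and offset != 0


def in_acceptor_window(offset: int) -> bool:
    """True if the variant lies within 20 bp upstream or 3 bp downstream of the acceptor site."""
    return -20 <= offset <= 3 and offset != 0


def splice_site_delta(ref_site: str, mut_site: str, offset: int,
                      in_window: Callable[[int], bool],
                      scorer: Callable[[str], float]) -> float:
    """Mutated minus reference splice site score; 0 for variants outside the window."""
    if not in_window(offset):
        return 0.0
    # `scorer` is a placeholder for the real MaxEntScan scoring call.
    return scorer(mut_site) - scorer(ref_site)
```

The windows match the sequence lengths MaxEntScan scores: 9 bases around the donor site (3 exonic, 6 intronic) and 23 bases around the acceptor site (20 intronic, 3 exonic).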

To assess the impact of variants on branch point sequence recognition, we reported the difference in SVM-BPfinder (Corvelo, Hallegger, Smith, & Eyras, 2010) scores between the mutated and reference sequences (SVM_BP from Table 1) for variants upstream of the splice acceptor site. For each oligo sequence used to assemble a Vex-seq splicing reporter, we ran SVM-BPfinder on the subsequence upstream of the 3′-splice site to identify candidate branch point sequences. We used the score associated with the “best” branch point sequence when calculating the score difference between mutated and reference sequences. Variants that were not upstream of the splice acceptor site were assigned a score difference of 0.

To measure the impact of variants on SRE recognition, we considered datasets of ESE hexamers (ESEseq from Table 1) and ESS hexamers (ESSseq from Table 1) that were identified using a 3-exon minigene splicing assay (Ke et al., 2011). For each test exon associated with a given variant, we considered a sequence window from 40 bp upstream of the 3′-splice site to 40 bp downstream of the 5′-splice site. A score representing the weighted loss/gain of sequence motifs was calculated for each data set of SREs as follows:

Let $S_{\mathrm{ref}}$ be the reference sequence window, and let $S_{\mathrm{mut}}$ be the mutated sequence window. For a list of SREs, $L$, let $L_{\mathrm{ref}}$ be the subset of SREs in $L$ that are present in the exonic region of $S_{\mathrm{ref}}$, and let $L_{\mathrm{mut}}$ be the subset of SREs in $L$ that are present in the exonic region of $S_{\mathrm{mut}}$. The weighted loss/gain of sequence motifs from $L$ for this variant is given by:
$$\Delta C_{w} = C_{w}(L_{\mathrm{mut}}, S_{\mathrm{mut}}) - C_{w}(L_{\mathrm{ref}}, S_{\mathrm{ref}}) \tag{1}$$
where $C_{w}(L_{i}, S_{i})$ for $i \in \{\mathrm{ref}, \mathrm{mut}\}$ is a weighted count of sequence motifs from $L$ present in $S_{i}$, defined as
$$C_{w}(L_{i}, S_{i}) = \sum_{m \in L_{i}} P_{\mathrm{unpaired}}(m, S_{i}) \tag{2}$$
where we define $P_{\mathrm{unpaired}}(m, S_{i})$ to be the probability that motif $m$ from $L_{i}$ is unpaired within the sequence context of $S_{i}$. $P_{\mathrm{unpaired}}(m, S_{i})$ is calculated using RNAplfold (Lorenz et al., 2011), which estimates the probability that a region is unpaired by calculating local-pair probabilities for bases with a maximum span of L nucleotides via a sliding window of size W nucleotides along the input sequence $S_{i}$. The parameters W and L were set to 80 and 40, respectively, having been previously optimized for small interfering RNA binding predictions (Tafer et al., 2008).
To assess the contribution of secondary structure information to predicting variants that disrupt SREs, we developed a control method to measure the unweighted loss/gain of SREs as follows:
$$\Delta C = |L_{\mathrm{mut}}| - |L_{\mathrm{ref}}| \tag{3}$$
where $S_{\mathrm{ref}}$ and $S_{\mathrm{mut}}$ are the reference and mutated sequence windows, respectively, $L$ is a list of SREs, and $L_{i}$ is the subset of SREs in $L$ that are present in the exonic region of sequence $S_{i}$ for $i \in \{\mathrm{ref}, \mathrm{mut}\}$. We separately trained a control model, PEPSI-noSS, using scores for unweighted loss/gain of SREs.
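The weighted and unweighted SRE scores described above can be contrasted in a short Python sketch. Several parts are assumptions made for illustration: the `p_unpaired` callable is a hypothetical stand-in for a lookup into RNAplfold output, every motif occurrence is counted separately, and the input sequences are assumed to be already restricted to the exonic region.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical signature: (sequence, motif start, motif length) -> probability unpaired.
UnpairedFn = Callable[[str, int, int], float]


def motif_hits(seq: str, motifs: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (start, motif) for every, possibly overlapping, motif occurrence in seq."""
    hits = []
    for motif in set(motifs):
        start = seq.find(motif)
        while start != -1:
            hits.append((start, motif))
            start = seq.find(motif, start + 1)
    return hits


def weighted_count(seq: str, motifs: Iterable[str], p_unpaired: UnpairedFn) -> float:
    """Sum the unpaired probability over all motif occurrences in seq."""
    return sum(p_unpaired(seq, start, len(m)) for start, m in motif_hits(seq, motifs))


def sre_change(ref_seq: str, mut_seq: str, motifs: Iterable[str],
               p_unpaired: Optional[UnpairedFn] = None) -> float:
    """Mutated minus reference motif count; unweighted when p_unpaired is None."""
    if p_unpaired is None:
        def p_unpaired(seq: str, start: int, length: int) -> float:
            return 1.0  # every occurrence counts fully (control scoring)
    motifs = list(motifs)
    return weighted_count(mut_seq, motifs, p_unpaired) - weighted_count(ref_seq, motifs, p_unpaired)
```

With the default weight of 1.0, the function reduces to a plain difference in motif counts, which corresponds to the control scoring used for PEPSI-noSS. Supplying structure-derived probabilities lets a motif that is mostly base-paired contribute little to the score.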

2.3 Computational framework

The workflow for our computational framework is summarized in Figure 2. Variants are first annotated using the 20 features described in Table 1. We used median imputation within the training set to fill in missing feature annotations. After annotating each training variant, we trained our model to predict the Δlogit(Ψ) for each variant. We observed that interpreting the impact of a variant with a given ΔΨ will depend on the reference Ψ of the alternative exon. Two variants with the same ΔΨ value affecting two different exons will have different consequences if one exon is weakly spliced in (low reference Ψ) and the other exon is frequently spliced in (high reference Ψ). Given that model features should directly reflect consequences to splicing, we chose to predict on Δlogit(Ψ) values to numerically distinguish such cases in which two exons with very different reference Ψ are each affected by a variant with the same measured ΔΨ. For a given reference Ψ ($\Psi_{\mathrm{ref}}$) and ΔΨ associated with a particular variant, we calculate the Δlogit(Ψ) as follows:
$$\Delta\mathrm{logit}(\Psi) = \mathrm{logit}(\Psi_{\mathrm{ref}} + \Delta\Psi) - \mathrm{logit}(\Psi_{\mathrm{ref}}) \tag{4}$$
Figure 2. Computational framework. The computational framework first annotates variants using a total of 20 features related to sequence conservation and regulatory sequence elements. A random forest regression model predicts the Δlogit(Ψ) for each annotated variant, and a separate module converts Δlogit(Ψ) to predicted ΔΨ. PEPSI, Prediction of variant Effect on Percent Spliced In

We restricted the range of values for $\Psi_{\mathrm{ref}}$ and $\Psi_{\mathrm{ref}} + \Delta\Psi$ to urn:x-wiley:10597794:media:humu23790:humu23790-math-0054 to avoid getting infinitely large/small values upon taking the logit transformation. Predicted Δlogit(Ψ) values for variants in the test set were converted back to ΔΨ values as follows:
$$\Delta\Psi = \mathrm{logit}^{-1}\left(\mathrm{logit}(\Psi_{\mathrm{ref}}) + \Delta\mathrm{logit}(\Psi)\right) - \Psi_{\mathrm{ref}} \tag{5}$$
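The conversion between ΔΨ and Δlogit(Ψ), and back, can be sketched in Python as below. This is a minimal illustration that assumes Ψ is expressed as a fraction between 0 and 1. The clipping bound `EPS` is a hypothetical value chosen for the example and is not the range used in the study.

```python
import math

EPS = 1e-3  # hypothetical clipping bound, chosen only for this example


def clip(psi: float, eps: float = EPS) -> float:
    """Keep psi inside [eps, 1 - eps] so that the logit stays finite."""
    return min(max(psi, eps), 1.0 - eps)


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def delta_logit_psi(psi_ref: float, delta_psi: float, eps: float = EPS) -> float:
    """Training target: logit of the mutant PSI minus logit of the reference PSI."""
    return logit(clip(psi_ref + delta_psi, eps)) - logit(clip(psi_ref, eps))


def delta_psi_from_logit(psi_ref: float, d_logit: float, eps: float = EPS) -> float:
    """Convert a predicted change on the logit scale back to a change in PSI."""
    ref = clip(psi_ref, eps)
    return inv_logit(logit(ref) + d_logit) - ref
```

For instance, a ΔΨ of 0.05 maps to a larger Δlogit(Ψ) for an exon with reference Ψ of 0.9 than for one with reference Ψ of 0.5, which is the distinction between exons that motivates predicting on the logit scale.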

Our random forest regression model used a total of 2,000 trees and randomly sampled seven variables as candidates at each split. The importance of each feature during training was calculated as the total decrease in node impurity over all splits involving that feature within each tree, averaged across all trees in the forest as implemented in the randomForest package in R (Breiman, 2001).

2.4 Benchmarking against state-of-the-art splicing defect prediction tools

We benchmarked our method against commonly used state-of-the-art models for predicting the impact of variants on splicing, including HAL (Rosenberg et al., 2015), SPANR (Xiong et al., 2014), and MutPredSplice (Mort et al., 2014). HAL is a linear model trained on over two million random sequences to predict the splicing impact of exonic variants in terms of ΔΨ. SPANR is another method that predicts the ΔΨ of exonic and intronic SNVs using a Bayesian deep learning algorithm trained on exon skipping events with 1,393 genomic feature annotations. On the other hand, MutPredSplice is a random forest model trained on various sequence-based and conservation-related features that reports the probability that a given variant is splice-altering.

To carry out this benchmark analysis, we used splicing functional assay data from the Multiplexed Functional Assay of Splicing using Sort-seq (MFASS) experiment (Cheung et al., 2019). The MFASS experiment assayed a total of 27,733 ExAC SNPs within or adjacent to 2,339 exons. Similar to the Vex-seq experiment, MFASS uses a set of three-exon, two-intron reporters in which skipping of the middle exon leads to reconstitution of fluorescence. Fluorescence-activated cell sorting was used to separate the pooled library of splicing reporters into separate bins based on observed fluorescence representing different splicing behavior. An exon inclusion index (EI) was then calculated for each tested sequence based on a weighted average of normalized read counts multiplied by the average exon inclusion level across all bins. The change in inclusion index urn:x-wiley:10597794:media:humu23790:humu23790-math-0060 for a particular library sequence between the reference and mutant was determined for each assayed variant.

Splice-disrupting variants (SDVs) were defined in the MFASS experiment as variants that change the inclusion index of a tested exon by at least 0.5. Based on this threshold, the MFASS experiment determined a total of 1,050 SDVs out of all 27,733 scored variants. We constructed a separate test set consisting of 2,094 ExAC SNPs measured in the MFASS experiment, consisting of 1,047 SNVs classified as splice-disrupting and 1,047 SNVs not considered as splice-disrupting.

We trained PEPSI and PEPSI-noSS on all 2,055 variants from the Vex-seq experiment and predicted the ΔΨ of SNVs in the test set that we constructed from the results of the MFASS experiment. Given that the Vex-seq experiment and the MFASS experiment both assayed variants from ExAC, we made sure that our curated test set from the MFASS experiment did not include any variants from the Vex-seq experiment. We then classified variants as splice-disrupting if the predicted urn:x-wiley:10597794:media:humu23790:humu23790-math-0062 (Xiong et al., 2014).

SPANR predictions for MFASS test set variants were determined using pre-computed annotation scores for SPANR (SPIDEX) that were downloaded from http://www.openbioinformatics.org/annovar/spidex_download_form.php. We used the HAL webserver (http://splicing.cs.washington.edu/SE) and MutPredSplice webserver (http://www.mutdb.org/mutpredsplice/submit.htm) to obtain respective tool predictions for the variants in the MFASS test set. We proceeded to classify variants as splice-disrupting based on scores produced by each tool as follows. For tools that characterize the splicing impact of variants using ΔΨ, such as SPANR and HAL, we applied the same threshold of urn:x-wiley:10597794:media:humu23790:humu23790-math-0064 to classify variants as splice-disrupting. For MutPredSplice, we used the default tool threshold for general scores, in which variants with general scores urn:x-wiley:10597794:media:humu23790:humu23790-math-0065 were considered as splice-disrupting.
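The thresholding and evaluation steps can be sketched in Python as follows. This is an illustration under assumptions: the threshold is applied to the absolute value of the score, and the threshold is passed as a parameter rather than fixed, so any value used with this sketch is the caller's choice and not the value used in the study.

```python
from typing import List, Sequence, Tuple


def classify(scores: Sequence[float], threshold: float) -> List[bool]:
    """Call a variant splice-disrupting when |score| >= threshold (assumed rule)."""
    return [abs(s) >= threshold for s in scores]


def sensitivity_precision(predicted: Sequence[bool],
                          truth: Sequence[bool]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); precision = TP / (TP + FP)."""
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum((not p) and t for p, t in zip(predicted, truth))
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, precision
```

Because the same functions apply to any tool that reports a per-variant score, the comparison between tools reduces to running `classify` with each tool's threshold and comparing the resulting sensitivity and precision on the same labeled variants.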

3 RESULTS AND DISCUSSION

3.1 Incorporating secondary structure into predicting variants that disrupt SREs

We first trained PEPSI and PEPSI-noSS on the Vex-seq training set and predicted the splicing impact of variants from the Vex-seq test set. For PEPSI, we observed a Pearson correlation of urn:x-wiley:10597794:media:humu23790:humu23790-math-0066 between predicted ΔΨ and experimentally measured ΔΨ (Figure 3a). Interestingly, for our control model PEPSI-noSS, we observed a slightly better Pearson correlation of urn:x-wiley:10597794:media:humu23790:humu23790-math-0069 between predicted ΔΨ and experimentally measured ΔΨ (Figure 3b). For both PEPSI and PEPSI-noSS, MaxEntScan scores and distance to nearest splice site were considered to be the most predictive features during model training. Interestingly, we also observed that our SRE features, ESEseq and ESSseq, were ranked with greater importance during model training for PEPSI than during model training for PEPSI-noSS (Tables S1 and S2).

Figure 3. Performance on Vex-seq test set. (a) Measured ΔΨ (x-axis) vs. ΔΨ values predicted by PEPSI (y-axis) for variants in Vex-seq test set. (b) Measured ΔΨ (x-axis) vs. ΔΨ values predicted by PEPSI-noSS (y-axis) for variants in Vex-seq test set. PEPSI, Prediction of variant Effect on Percent Spliced In

To analyze how the incorporation of secondary structure influences the prediction accuracy of SREs, such as ESEs and ESSs, we compared the distribution of ΔΨ for variants that gain more ESEs than ESSs (ESEseq > ESSseq) and for variants that gain more ESSs than ESEs (ESEseq < ESSseq). In particular, we focused on 1,002 exonic variants that were more than two nucleotides away from the nearest splice site to avoid cases where the measured ΔΨ could be attributed to disruption of splice sites or branch point sequences instead of ESEs/ESSs. For PEPSI, where SRE count changes were weighted by the probability of secondary structure formation, we observed that ESEseq > ESSseq variants have a significantly more positive ΔΨ distribution compared to ESEseq < ESSseq variants, urn:x-wiley:10597794:media:humu23790:humu23790-math-0079 (Wilcoxon rank-sum test; Figure 4a).

Figure 4. Relationship between ESEseq scores, ESSseq scores, and HepG2 ΔΨ. (a) Distribution of HepG2 ΔΨ of variants with a greater ESEseq score than ESSseq score and that of variants with a smaller ESEseq score than ESSseq score, as measured in the PEPSI framework. (b) Distribution of HepG2 ΔΨ of variants with a greater ESEseq score than ESSseq score and that of variants with a smaller ESEseq score than ESSseq score, as measured in the PEPSI-noSS framework. (c) Distribution of HepG2 ΔΨ of variants with a greater ESEseq score than ESSseq score and that of variants with a smaller ESEseq score than ESSseq score, measured in a modified framework of PEPSI where the probabilities of single-strandedness used to weight motif count changes were randomly generated based on a shuffling of the original sequence window. ESE, exonic splicing enhancer; ESS, exonic splicing silencer; PEPSI, Prediction of variant Effect on Percent Spliced In

For PEPSI-noSS, where SRE count changes were unweighted, we also observed a significantly more positive ΔΨ distribution among ESEseq > ESSseq variants compared to ESEseq < ESSseq variants, urn:x-wiley:10597794:media:humu23790:humu23790-math-0083 (Wilcoxon rank-sum test; Figure 4b). To determine whether the probabilities used in weighting SRE count changes were introducing random noise, we recalculated SRE count changes using probabilities of secondary structure formation derived from a randomly shuffled version of the original sequence window. Using these recalculated scores, we did not observe any statistically significant difference in the ΔΨ distribution between ESEseq > ESSseq variants and ESEseq < ESSseq variants, urn:x-wiley:10597794:media:humu23790:humu23790-math-0087 (Wilcoxon rank-sum test; Figure 4c). Because the separation between the two groups disappears when the probabilities come from shuffled sequences, this suggests that the original probabilities used in adjusting SRE count changes in PEPSI carry sequence-specific signal rather than merely adding random noise to the model.

Interestingly, for PEPSI-noSS, we observed multiple variants in which count changes in SREs did not correspond with the directionality of the measured ΔΨ. Out of the 1,002 exonic variants, 160 ESEseq > ESSseq variants had a negative ΔΨ and 219 ESEseq < ESSseq variants had a positive ΔΨ. Integrating secondary structure information into calculating SRE count changes for these 379 variants yielded 113 cases in which weighted count changes of SREs corresponded with the directionality of measured ΔΨ (Table S3).

Some of these resolved cases yield interesting insights into the role of secondary structure in SRE recognition. For example, the variant rs771094081:A>C, which had a measured urn:x-wiley:10597794:media:humu23790:humu23790-math-0094, was associated with the loss of two putative ESS motifs, AGGAGG and AGGTGG. However, based on variant annotations by PEPSI, the weighted count change in ESS motifs indicates a net increase in the number of ESS motifs free of secondary structure. We hypothesized that the two putative ESS motifs that were lost as a result of the mutation were occluded by secondary structure in the original sequence context, thus rendering them inactive. As a result, the loss of these two motifs should not directly correspond to a positive ΔΨ. Instead, it is possible that the point mutation alters the landscape of secondary structures such that a pre-existing ESS motif sequestered by secondary structure in the original sequence context is more accessible for binding in the mutated sequence context. This could potentially explain the negative ΔΨ that was measured for this particular variant.

To validate our hypothesis, we considered the original and mutated 95 bp sequences for the exon harboring the mutation NC_000001.11:g.24339710A>C. RNA secondary structures for both the original and mutated sequences were predicted using the Mfold web server (Zuker, 2003). Folding was simulated at a temperature of 37°C and an ionic condition of 1 M NaCl. If the Mfold web server produced multiple secondary structures, the structure with the most negative free energy (that is, the most stable predicted structure) was chosen as the representative structure. Comparing the secondary structures for the reference and mutant sequences (Figure 5a,b), we found that the ESS motif TTGAGG, present in both sequences, was more likely to be single-stranded in the mutant sequence than in the reference sequence.
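The selection and comparison steps above can be sketched in a few lines, assuming each predicted structure is available as a dot-bracket string paired with its free energy. The helper names, the toy 16-nt sequences, the structures, and the energies are invented for illustration and are not Mfold output for the real exon.

```python
from typing import List, Tuple

Structure = Tuple[str, float]  # (dot-bracket string, free energy in kcal/mol)


def pick_representative(structures: List[Structure]) -> Structure:
    """Return the candidate with the most negative free energy."""
    return min(structures, key=lambda s: s[1])


def unpaired_fraction(seq: str, dot_bracket: str, motif: str) -> float:
    """Fraction of the motif's bases that are unpaired ('.') in the structure.
    Only the first occurrence of the motif is considered."""
    start = seq.find(motif)
    if start < 0:
        raise ValueError(f"{motif} not found in sequence")
    window = dot_bracket[start:start + len(motif)]
    return window.count(".") / len(motif)


# Toy sequences: TTGAGG can pair with CCTCAA in the reference only.
ref_seq = "TTGAGGAAAACCTCAA"
mut_seq = "TTGAGGAAAACCGCAA"

# Invented candidate structures and energies.
ref_candidates = [("((((((....))))))", -6.2), ("..((((....))))..", -3.1)]
mut_candidates = [("((............))", -0.4), ("................", 0.0)]

ref_structure, _ = pick_representative(ref_candidates)
mut_structure, _ = pick_representative(mut_candidates)

ref_access = unpaired_fraction(ref_seq, ref_structure, "TTGAGG")
mut_access = unpaired_fraction(mut_seq, mut_structure, "TTGAGG")
print(ref_access, round(mut_access, 2))
```

A single representative structure gives only a coarse view of accessibility. Averaging over the whole ensemble of predicted structures would be a natural refinement, but it is outside what the text describes.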

Figure 5. The variant rs771094081:A>C results in the loss of two putative ESS motifs but still decreases exon splicing efficiency. (a) The wild-type sequence of the exon harboring the mutation rs771094081:A>C has a putative ESS motif TTGAGG that is occluded by secondary structure. (b) The mutated sequence of the same exon results in a secondary structure conformation that increases the binding accessibility of putative ESS motif TTGAGG. ESS, exonic splicing silencer.

3.2 Benchmarking against state-of-the-art splicing defect prediction tools

We next compared the performance of PEPSI and PEPSI-noSS in classifying SDVs among SNPs assayed in the MFASS experiment (Cheung et al., 2019) against several state-of-the-art tools, including HAL (Rosenberg et al., 2015), SPANR (Xiong et al., 2014), and MutPredSplice (Mort et al., 2014). Because SPANR is capable of scoring both exonic and intronic SNPs, we first compared the sensitivity and precision of PEPSI and PEPSI-noSS to those of SPANR in classifying SDVs from all 2,094 SNPs in the curated test set. Both PEPSI and PEPSI-noSS demonstrated higher sensitivity and precision than SPANR in classifying SDVs in the MFASS test set (Table 2). Given that SPANR was trained to predict Ψ of exons using human RNA-seq data (Xiong et al., 2014), predictions made by SPANR may reflect actual in vivo splicing more accurately than they reflect the splicing activity observed in mini-gene splicing assays such as Vex-seq and MFASS. This could explain why SPANR underperformed PEPSI and PEPSI-noSS in classifying SDVs in the MFASS test set.

Table 2. Precision, recall, accuracy, and F1 scores of PEPSI, PEPSI-noSS, and SPANR upon predicting splice-disrupting variants among 2,094 SNVs assayed in the MFASS experiment
Model TP TN FP FN Recall (%) Precision (%) Accuracy (%) F1 score
PEPSI 528 981 66 519 50.43 88.89 72.06 0.6435
PEPSI-noSS 604 978 69 443 57.69 89.75 75.55 0.7023
SPANR 398 958 89 649 38.01 81.72 64.76 0.5189
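
The derived columns in Table 2 follow from the four confusion-matrix counts by the standard definitions of recall, precision, accuracy, and F1. The short sketch below recomputes them from the TP, TN, FP, and FN values in the table; the function and variable names are chosen here for illustration.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),                    # sensitivity
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),           # harmonic mean of the two
    }


# (TP, TN, FP, FN) as listed in Table 2.
table2_counts = {
    "PEPSI": (528, 981, 66, 519),
    "PEPSI-noSS": (604, 978, 69, 443),
    "SPANR": (398, 958, 89, 649),
}

metrics = {name: classification_metrics(*counts)
           for name, counts in table2_counts.items()}

for name, m in metrics.items():
    print(f"{name}: recall={m['recall']:.2%} precision={m['precision']:.2%} "
          f"accuracy={m['accuracy']:.2%} F1={m['f1']:.4f}")
```

The same function applies unchanged to the counts for the 575 exonic SNVs.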

On the other hand, HAL and MutPredSplice were only able to make predictions on the exonic subset of variants in the MFASS test set: HAL could score 1,045 variants (49.9% of the original test set) and MutPredSplice only 575 variants (27.5% of the original test set). We therefore restricted our benchmark analysis of PEPSI, PEPSI-noSS, SPANR, HAL, and MutPredSplice to the 575 exonic SNVs that were classifiable by all five tools (Table 3). For exonic variants, PEPSI and PEPSI-noSS again demonstrated better sensitivity and precision in predicting SDVs compared to SPANR. Moreover, PEPSI and PEPSI-noSS exhibited higher precision than MutPredSplice while maintaining at least a comparable level of sensitivity. In addition, while HAL achieved the greatest sensitivity in predicting SDVs among the tools tested, it also produced the largest number of false-positive predictions.

Table 3. Precision, recall, accuracy, and F1 scores of PEPSI, PEPSI-noSS, SPANR, HAL, and MutPredSplice upon predicting splice-disrupting variants among 575 exonic SNVs assayed in the MFASS experiment
Model TP TN FP FN Recall (%) Precision (%) Accuracy (%) F1 score
PEPSI 77 293 19 186 29.28 80.21 64.35 0.4290
PEPSI-noSS 119 291 21 144 45.25 85.00 71.30 0.5906
SPANR 57 275 37 206 21.67 60.64 57.74 0.3193
HAL 192 179 133 71 73.00 59.08 64.52 0.6531
MutPredSplice 79 269 43 184 30.04 64.75 60.52 0.4104

Given that the Vex-seq experiment was carried out using HepG2 cells whereas the MFASS experiment was carried out using Hek293T cells, we hypothesized that differences in transcriptome profiles between the two cell lines could explain incorrect predictions made by both PEPSI and PEPSI-noSS. To approximate the agreement in splicing behavior between HepG2 cells and Hek293T cells, we identified a total of 18 ExAC SNVs that were measured in both the Vex-seq experiment and the MFASS experiment. We then computed the Spearman correlation coefficient between the Vex-seq ΔΨ values and the MFASS ΔEI values for these 18 SNVs (Figure 6). Based on this information, it is possible that PEPSI and PEPSI-noSS would achieve better predictions if our test set involved variants whose change in exon inclusion index was measured in HepG2 cells instead of Hek293T cells.
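For readers who want to reproduce this kind of comparison, a Spearman correlation is the Pearson correlation of the ranks of the two measurements. The dependency-free sketch below uses five made-up value pairs, not the 18 SNVs from the study, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired measurements for five variants (illustration only).
hepg2_delta_psi = [-0.30, -0.10, 0.00, 0.05, 0.20]
hek293t_delta_ei = [-0.25, 0.02, -0.05, 0.10, 0.15]

rho = spearman(hepg2_delta_psi, hek293t_delta_ei)
print(round(rho, 3))  # 0.9 for these made-up values
```

With only 18 shared variants, a coefficient estimated this way carries wide uncertainty, so it is best read as a rough indication of cross-cell-line agreement.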

Figure 6. Correlation between HepG2 ΔΨ and Hek293T ΔEI for 18 ExAC SNVs. Measured HepG2 ΔΨ values (x-axis) vs. measured Hek293T ΔEI values (y-axis) for 18 ExAC SNVs that were assayed in both the Vex-seq experiment and the MFASS experiment. EI, exon inclusion index; ExAC, Exome Aggregation Consortium; SNV, single nucleotide variant.

Interestingly, our control model PEPSI-noSS demonstrated greater sensitivity than PEPSI while maintaining the same level of precision in predicting SDVs. Several limitations remain in the SRE scoring method used by the PEPSI framework that could account for the model's lower sensitivity. First, the region that is free to fold behind the transcribing RNA polymerase is not always of a fixed size, as PEPSI assumes for convenience of calculation. The size of this region and the secondary structures that can form are influenced by the speed of transcription and the local concentration of RNA binding proteins (Schroeder, Grossberger, Pichler & Waldsich, 2002). Moreover, it is possible that some splicing factors exclusively recognize RNA secondary structural elements. It is also possible that proteins that bind in the proximity of secondary structural elements influence their single-strandedness. In such cases, the probability of single-strandedness may underestimate the true activity of the SRE. Devising a more nuanced method that considers the different ways in which secondary structure influences SRE recognition, as outlined above, could not only help explain directionality changes in measured ΔΨ for variants, but also improve the sensitivity of predictions for splice-disrupting variants.

4 CONCLUSION

In the CAGI5 “Vex-seq” challenge, we investigated whether the use of secondary structure information could help improve predictions for genomic variants that disrupt SREs. We demonstrated that secondary structure information can help resolve cases where direct count changes in SREs do not correspond with the directionality of measured ΔΨ values. Moreover, in a benchmark analysis involving other state-of-the-art splice prediction tools, we showed that the PEPSI framework achieves comparable sensitivity and precision in predicting variants that disrupt splicing. However, the approach that PEPSI uses in weighting SRE count changes by the probability of secondary structure formation has several limitations that may restrict its sensitivity in detecting splice-disrupting variants.

There are also several limitations in using PEPSI to model in vivo splicing. First, our model was trained on data from massively parallel splicing assays. The design of these assays lacks the full context of an entire gene and its chromatin states, which have been shown to regulate in vivo splicing (Luco, Allo, Schor, Kornblihtt, & Misteli, 2011). Moreover, our model was not designed to predict ΔΨ of variants in a cell-type specific manner. Given that the transcriptome profiles of HepG2 cells and Hek293T cells are likely to differ, training our model on Vex-seq experimental data may have led to misleading predictions of splice-disrupting variants detected in the MFASS experiment. Developing a cell-type specific model would provide a more accurate characterization of variant impact on exon splicing, especially in the context of certain diseases.

The predictive power of our computational model can likely be improved further by refining our approach to SRE modeling using secondary structure and by the increased availability of splicing assay data for different cell types. Our model is openly available at https://github.com/rwang916/PEPSI for researchers to use freely for downstream analysis. All source code for PEPSI was built and tested on a Linux system.

ACKNOWLEDGEMENTS

We thank the organizers of the Fifth Critical Assessment of Genome Interpretation, especially Steven Brenner, John Moult, and Gaia Andreoletti, for coordinating and hosting these challenges. We are grateful to Scott Adamson and Brenton Graveley from the University of Connecticut for providing the Vex-seq experimental data in the “Vex-seq” challenge. We are also grateful to Steven Brenner for helpful discussions. This study was conducted in the laboratory of Steven Brenner at the University of California, Berkeley, Department of Plant and Microbial Biology. The CAGI experiment coordination is supported by NIH U41 HG007346 and the CAGI conference is supported by NIH R13 HG006650, U19 HD077627.

DATA AVAILABILITY STATEMENT

Our tool is freely available to researchers at: https://github.com/rwang916/PEPSI.
