Volume 40, Issue 9 pp. 1270-1279

SPECIAL ARTICLE

Using secondary structure to predict the effects of genetic variants on alternative splicing

Robert Wang,

Corresponding Author

Robert Wang

[email protected]

orcid.org/0000-0003-2614-5956

Department of Bioengineering, University of California, Berkeley, California

Department of Plant and Microbial Biology, University of California, Berkeley, California

Correspondence Robert Wang, Department of Bioengineering, University of California, Berkeley, CA 94720. Email: [email protected]

Yaqiong Wang,

Yaqiong Wang

Department of Plant and Microbial Biology, University of California, Berkeley, California

Zhiqiang Hu,

Zhiqiang Hu

orcid.org/0000-0001-8854-3410

Department of Plant and Microbial Biology, University of California, Berkeley, California

First published: 10 May 2019

https://doi.org/10.1002/humu.23790

For the CAGI5 Special Issue.

Abstract

Accurate interpretation of genomic variants that alter RNA splicing is critical to precision medicine. We present a computational framework, Prediction of variant Effect on Percent Spliced In (PEPSI), that predicts the splicing impact of coding and noncoding variants for the Fifth Critical Assessment of Genome Interpretation (CAGI5) “Vex-seq” challenge. PEPSI is a random forest regression model trained on multiple layers of features associated with sequence conservation and regulatory sequence elements. In contrast to other splicing defect prediction tools from the literature, our framework integrates secondary structure information when predicting variants that disrupt splicing regulatory elements (SREs). We applied our model to classify splice-disrupting variants among 2,094 single-nucleotide polymorphisms from the Exome Aggregation Consortium using model-predicted changes in percent spliced in (ΔPSI) associated with tested variants. Benchmarking our model against widely used state-of-the-art tools, we demonstrate that PEPSI achieves comparable performance in terms of sensitivity and precision. We also show that using secondary structure context can help resolve several cases where changes in the counts of SREs do not correspond with the directionality of ΔPSI measured for tested variants.

1 INTRODUCTION

Alternative splicing is a major regulatory mechanism that accounts for the macromolecular and cellular complexity observed in eukaryotic organisms (Nilsen & Graveley, 2010). During alternative splicing, the exons of primary transcripts are spliced together in different arrangements by the spliceosome, yielding a structurally and functionally diverse population of messenger RNAs (mRNAs) and proteins. The activity of the spliceosome is primarily regulated through interactions with cis-acting RNA sequence elements, including the donor splice site, acceptor splice site, and branch point site. There are also additional regulatory sequences known as splicing regulatory elements (SREs) that influence choice of adjacent splice sites. By convention, SREs are organized by localization and effect on splicing as exonic splicing enhancers (ESEs), exonic splicing silencers (ESSs), intronic splicing enhancers, and intronic splicing silencers. Intriguingly, studies have demonstrated that molecular recognition of SREs is influenced by RNA secondary structure and chromatin states, which mediate sequence accessibility (Fu & Ares, 2014).

Given the complex regulation of alternative splicing, it is unsurprising that genetic variants that impact pre-mRNA splicing efficiency are implicated in many diseases. In fact, it has been suggested that approximately one-third of all disease-causing mutations alter pre-mRNA splicing (Lim, Ferraris, Filloux, Raphael, & Fairbrother, 2011). Therefore, developing methods that can accurately identify genomic variants that alter splicing is critical to advancing personalized medicine.

A splicing reporter mini-gene assay is an experimental strategy to systematically evaluate the effects of genetic variants on splicing of a certain exon. Recently, a high-throughput reporter system called Vex-seq was developed to determine the splicing impact of exonic and intronic variants for the same exon simultaneously (Adamson, Zhan, & Graveley, 2018). Vex-seq compares the percent spliced-in (PSI or $\Psi$), a metric representing the fraction of transcripts harboring a given exon, between constructs containing a reference sequence and constructs containing a particular variant. The change in PSI ($\Delta\Psi$) is then calculated for each tested variant. While such assays are straightforward and reliable for detecting splicing defects, there are technical limitations that prohibit their use in routine clinical practice. Moreover, such assays do not fully represent true in vivo splicing given that they only represent one or several exons and lack the context of the entire gene.
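The PSI and change-in-PSI definitions above can be illustrated with a minimal sketch. It assumes PSI is estimated from counts of exon-including and exon-skipping reads; the function names and the read counts are hypothetical, and the actual Vex-seq quantification pipeline is not described in this section.

```python
from typing import Tuple


def percent_spliced_in(inclusion_reads: int, skipping_reads: int) -> float:
    """Fraction (0-1) of reads supporting inclusion of the exon."""
    total = inclusion_reads + skipping_reads
    if total == 0:
        raise ValueError("no reads cover this exon")
    return inclusion_reads / total


def delta_psi(ref_counts: Tuple[int, int], var_counts: Tuple[int, int]) -> float:
    """PSI of the variant construct minus PSI of the reference construct.

    Each argument is an (inclusion_reads, skipping_reads) pair.
    """
    return percent_spliced_in(*var_counts) - percent_spliced_in(*ref_counts)


# Hypothetical counts: the reference construct includes the exon in 80 of 100
# reads, the variant construct in 50 of 100 reads.
change = delta_psi(ref_counts=(80, 20), var_counts=(50, 50))
print(f"delta PSI = {change:+.2f}")
```

A negative value indicates that the variant reduces exon inclusion relative to the reference construct.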

Over the past several years, a number of in silico prediction tools for detecting mutations that alter splicing have been developed. One class of tools, including MutPredSplice (Mort et al., 2014) and Human Splicing Finder (Desmet et al., 2009), integrates multiple layers of regulatory sequence features to output a score representing the probability that a given variant disrupts splicing. However, the outputs of these tools fail to capture the resulting physical effects of variants on splicing, such as the extent to which a variant increases or decreases the frequency of alternative exon inclusion.

Alternatively, there are other computational models, including HAL (Rosenberg, Patwardhan, Shendure, & Seelig, 2015) and SPANR (Xiong et al., 2014), that directly predict the effects of genetic variants on the relative amounts of alternatively spliced isoforms. However, these tools are often limited in terms of the types of variants that they can evaluate. For example, SPANR is limited to analyzing single nucleotide changes, whereas HAL is limited to variants within the alternative exon. Additionally, while these tools do claim predictive power, their predictions are still far from reliable enough for clinical translation.

A common approach used by many tools for predicting the impact of mutations on splicing is to analyze a given sequence for hexamers that may putatively function as SREs. However, it is possible that for a given sequence context, an identified SRE may not be active because it is occluded by secondary structure. This derives from a long-standing view that sequence-specific RNA binding proteins have limited accessibility to the bases within the major groove of double-stranded RNA, which is narrower than that of double-stranded DNA (Mattaj & Nagai, 1994). It has been experimentally shown that several major splicing factors, such as SR proteins and hnRNP proteins, exhibit reduced binding efficiency to sequence motifs that form secondary structures (Buratti et al., 2004; Damgaard, Tange, & Kjems, 2002). Under this framework, we hypothesized that methods that solely count the number of SREs lost/gained as a result of a mutation may yield several false positives and false negatives when predicting splice-altering variants.

To investigate this idea, we developed a computational framework for predicting the splicing impact of coding and noncoding variants as part of the Fifth Critical Assessment of Genome Interpretation (CAGI5) “Vex-seq” challenge. We trained a random forest regression model, Prediction of variant Effect on Percent Spliced In (PEPSI), using multiple layers of features associated with sequence conservation and regulatory sequence elements. In particular, for features related to SREs, we leverage RNA secondary structure information to characterize the impact of variants that disrupt putatively identified SREs. Compared to state-of-the-art splicing prediction tools, PEPSI achieves comparable sensitivity and precision in classifying splice-disrupting variants based on predicted $\Delta\Psi$ of genomic variants. Moreover, we demonstrate that RNA secondary structure information can help resolve several cases where changes in the counts of SREs do not correspond with the directionality of $\Delta\Psi$ measured for tested variants.

2 MATERIALS AND METHODS

2.1 Training and test datasets

The results of the Vex-seq experiment were provided as training and test sets in the CAGI5 “Vex-seq” challenge to evaluate computational approaches for predicting the impact of genetic variants on splicing.

The Vex-seq experiment measured the $\Delta\Psi$ of 2,055 variants from the Exome Aggregation Consortium (ExAC; Kircher et al., 2014) using a library of reporter constructs transfected into HepG2 cells (Adamson et al., 2018). Variants on chromosomes 1–8 were assigned to the training set, and variants on chromosomes 9–22 and chromosome X were assigned to the test set. There were 957 variants within or adjacent to 52 exons in the training set, including 488 exonic single-nucleotide variants (SNVs), 14 exonic insertions/deletions (indels), 425 intronic SNVs, and 30 intronic indels. In contrast, there were 1,098 variants within or adjacent to 58 exons in the test set, of which there were 563 exonic SNVs, nine exonic indels, 495 intronic SNVs, and 31 intronic indels (Figure 1a).

Figure 1. Training and test data sets. (a) Pie chart showing the distribution of exonic SNVs, intronic SNVs, exonic indels, and intronic indels in the Vex-seq training and test sets. (b) Density plot showing the distribution of ΔΨ for variants in the Vex-seq training set and for variants in the Vex-seq test set. SNV, single-nucleotide variant
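The chromosome-based split described above can be sketched as follows. This is an illustration only: the function name and the handling of a "chr" prefix are assumptions, while the chromosome ranges and variant counts are taken from the text.

```python
# Chromosomes 1-8 go to the training set; 9-22 and X go to the test set.
TRAIN_CHROMS = {str(c) for c in range(1, 9)}
TEST_CHROMS = {str(c) for c in range(9, 23)} | {"X"}


def assign_split(chrom: str) -> str:
    """Return 'train' or 'test' for a chromosome name such as '7' or 'chrX'."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    name = name.upper()
    if name in TRAIN_CHROMS:
        return "train"
    if name in TEST_CHROMS:
        return "test"
    raise ValueError(f"chromosome {chrom!r} is not part of either set")


# Variant counts per category as reported in the text.
COUNTS = {
    "train": {"exonic_snv": 488, "exonic_indel": 14, "intronic_snv": 425, "intronic_indel": 30},
    "test": {"exonic_snv": 563, "exonic_indel": 9, "intronic_snv": 495, "intronic_indel": 31},
}
totals = {split: sum(by_type.values()) for split, by_type in COUNTS.items()}
print(totals, sum(totals.values()))
```

Summing the reported categories reproduces the 957 training variants, 1,098 test variants, and 2,055 variants overall.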

The median $\Delta\Psi$ among the variants in the training set was $urn:x-wiley:10597794:media:humu23790:humu23790-math-0007$ whereas the median $\Delta\Psi$ among the variants in the test set was $urn:x-wiley:10597794:media:humu23790:humu23790-math-0009$ . Given that these values are reasonably close to 0%, we concluded that variants in both the training set and the test set were equally likely to increase or decrease the frequency of alternative exon inclusion. In both the training set and test set, the distribution of $\Delta\Psi$ among the variants revealed that more than 50% of the variants had a measured $urn:x-wiley:10597794:media:humu23790:humu23790-math-0011$ (Figure 1b). Based on this threshold (Xiong et al., 2014), a large proportion of the variants from ExAC did not appear to disrupt splicing in HepG2 cells.

2.2 Features

We assembled 20 features characterizing each variant (Table 1), including distance to the nearest splice site, exon length, seven sequence-based features, and 11 conservation scores derived from annotations used in version 1.3 of CADD (Kircher et al., 2014). We chose to incorporate conservation features in our model given that previous studies have demonstrated how functional sequence elements associated with alternative splicing are phylogenetically and spatially conserved (Minovitsky, Gee, Schokrpur, Dubchak, & Conboy, 2005).

Table 1. Variant features used as predictors by PEPSI

Name	Description	Type	Source
GC	Percent GC in a window of ±75 bp	num	CADD v1.3
priPhCons	Primate PhastCons conservation score (excl. human)	num	CADD v1.3
mamPhCons	Mammalian PhastCons conservation score (excl. human)	num	CADD v1.3
verPhCons	Vertebrate PhastCons conservation score (excl. human)	num	CADD v1.3
priPhyloP	Primate PhyloP score (excl. human)	num	CADD v1.3
mamPhyloP	Mammalian PhyloP score (excl. human)	num	CADD v1.3
verPhyloP	Vertebrate PhyloP score (excl. human)	num	CADD v1.3
GerpN	Neutral evolution score defined by GERP++	num	CADD v1.3
GerpS	Rejected Substitution score defined by GERP++	num	CADD v1.3
bStatistic	Background selection score	int	CADD v1.3
mutIndex	Genome-wide mutability index	int	CADD v1.3
fitCons	fitCons score	num	CADD v1.3
nearest_ss_dist	Distance to nearest splice site	int
exon_length	Length of exon	int
exon	Variant is located within an exon	bool
MaxEntScan_5ss	Difference in MaxEntScan::score5ss scores between mutated and reference sequences	num	MaxEntScan
MaxEntScan_3ss	Difference in MaxEntScan::score3ss scores between mutated and reference sequences	num	MaxEntScan
SVM_BP	Difference in SVM-BP branch point scores between mutated and reference sequences	num	SVM-BP
ESEseq	Weighted loss/gain of ESE sequence motifs	num	Ke et al. (2011)
ESSseq	Weighted loss/gain of ESS sequence motifs	num	Ke et al. (2011)

Abbreviations: ESE, exonic splicing enhancer; ESS, exonic splicing silencer; PEPSI, Prediction of variant Effect on Percent Spliced In; SVM, support vector machine.

To measure the impact of variants on splice site strength, we reported the difference in MaxEntScan (Yeo & Burge, 2004) scores between the mutated and reference sequences (MaxEntScan_5ss and MaxEntScan_3ss from Table 1). Specifically, we ran MaxEntScan::score5ss for variants that were either within 3 base pairs (bp) upstream or within 6 bp downstream of the splice donor site. We ran MaxEntScan::score3ss for variants that were either within 20 bp upstream or within 3 bp downstream of the splice acceptor site. Variants that were not within the specified window for either MaxEntScan tool were assigned a score difference of 0.
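The window rule for the splice site features can be sketched as below. The offset convention (negative values upstream of the exon/intron junction, positive values downstream, no position 0) and all function names are assumptions made for illustration. `toy_scorer` is a placeholder and does not reproduce MaxEntScan; in practice the `scorer` argument would wrap a call to the real tool.

```python
from typing import Callable


def in_donor_window(offset: int) -> bool:
    """Within 3 bp upstream (-3..-1) or 6 bp downstream (+1..+6) of the donor site."""
    return offset != 0 and -3 <= offset <= 6


def in_acceptor_window(offset: int) -> bool:
    """Within 20 bp upstream (-20..-1) or 3 bp downstream (+1..+3) of the acceptor site."""
    return offset != 0 and -20 <= offset <= 3


def splice_site_score_difference(
    offset: int,
    ref_site: str,
    mut_site: str,
    in_window: Callable[[int], bool],
    scorer: Callable[[str], float],
) -> float:
    """Mutated minus reference score; 0 when the variant lies outside the window."""
    if not in_window(offset):
        return 0.0
    return scorer(mut_site) - scorer(ref_site)


def toy_scorer(site: str) -> float:
    """Placeholder scorer (counts G bases); NOT the MaxEntScan model."""
    return float(site.count("G"))


# Hypothetical donor site with a G>A change at position +1.
diff_inside = splice_site_score_difference(1, "CAGGTAAGT", "CAGATAAGT", in_donor_window, toy_scorer)
# The same sequences labelled with an offset outside the donor window score 0.
diff_outside = splice_site_score_difference(7, "CAGGTAAGT", "CAGATAAGT", in_donor_window, toy_scorer)
print(diff_inside, diff_outside)
```

Passing the scorer as a callable keeps the window logic separate from whichever splice site model is used.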

To assess the impact of variants on branch point sequence recognition, we reported the difference in SVM-BPfinder (Corvelo, Hallegger, Smith, & Eyras, 2010) scores between the mutated and reference sequences (SVM_BP from Table 1) for variants upstream of the splice acceptor site. For each oligo sequence used to assemble a Vex-seq splicing reporter, we ran SVM-BPfinder on the subsequence upstream of the 3′-splice site to identify candidate branch point sequences. We used the score associated with the “best” branch point sequence when calculating the score difference between mutated and reference sequences. Variants that were not upstream of the splice acceptor site were assigned a score difference of 0.

To measure the impact of variants on SRE recognition, we considered datasets of ESE hexamers (ESEseq from Table 1) and ESS hexamers (ESSseq from Table 1) that were identified using a 3-exon minigene splicing assay (Ke et al., 2011). For each test exon associated with a given variant, we considered a sequence window from 40 bp upstream of the 3′-splice site to 40 bp downstream of the 5′-splice site. A score representing the weighted loss/gain of sequence motifs was calculated for each dataset of SREs as follows:

Let $S_{ref}$ be the reference sequence window, and let $S_{mut}$ be the mutated sequence window. For a list of SREs, $L$, let $L_{ref}$ be the subset of SREs in $L$ that are present in the exonic region of $S_{ref}$, and let $L_{mut}$ be the subset of SREs in $L$ that are present in the exonic region of $S_{mut}$. The weighted loss/gain of sequence motifs from $L$ for this variant is given by:

$\Delta N_w = N_w(S_{mut}) - N_w(S_{ref})$ (1)

where $N_w(S_i)$ for $i \in \{ref, mut\}$ is a weighted count of sequence motifs from $L$ present in $S_i$, defined as

$N_w(S_i) = \sum_{m \in L_i} P_{unpaired}(m, S_i)$ (2)

where we define $P_{unpaired}(m, S_i)$ to be the probability that motif $m$ from $L_i$ is unpaired within the sequence context of $S_i$. $P_{unpaired}(m, S_i)$ is calculated using RNAplfold (Lorenz et al., 2011), which estimates the probability that a region is unpaired by calculating local-pair probabilities for bases with a maximum span of L nucleotides (the RNAplfold parameter, not the SRE list $L$) via a sliding window of size W nucleotides along the input sequence $S_i$. The parameters W and L were set to 80 and 40, respectively, having been previously optimized for small interfering RNA binding predictions (Tafer et al., 2008).
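A minimal sketch of the weighted score follows. It assumes the unpaired probabilities have already been computed (for example from RNAplfold output) and are supplied as a list indexed by hexamer start position. It also assumes that every occurrence of a motif is counted and that the exon occupies the same coordinates in both windows, as for an SNV. The motif, sequences, and probabilities are hypothetical.

```python
from typing import Sequence, Set

K = 6  # SRE motifs are hexamers


def weighted_motif_count(
    window: str,
    exon_start: int,
    exon_end: int,
    motifs: Set[str],
    unpaired_prob: Sequence[float],
) -> float:
    """Sum of unpaired probabilities over motif occurrences inside the exon.

    exon_start is inclusive, exon_end is exclusive; unpaired_prob[i] is the
    probability that the hexamer starting at position i is unpaired.
    """
    total = 0.0
    for i in range(exon_start, exon_end - K + 1):
        if window[i:i + K] in motifs:
            total += unpaired_prob[i]
    return total


def weighted_sre_change(ref_window, mut_window, exon_start, exon_end, motifs, ref_prob, mut_prob) -> float:
    """Weighted count in the mutated window minus that in the reference window."""
    return (
        weighted_motif_count(mut_window, exon_start, exon_end, motifs, mut_prob)
        - weighted_motif_count(ref_window, exon_start, exon_end, motifs, ref_prob)
    )


motifs = {"AGGAGG"}  # hypothetical one-motif list
ref = "TTAGGAGGTT"
mut = "TTAGGAGGTA"  # SNV outside the motif, so the motif itself is retained
ref_prob = [0.5] * len(ref)
mut_prob = [0.5] * len(mut)
ref_prob[2] = 0.1  # motif mostly paired in the reference context
mut_prob[2] = 0.7  # motif mostly unpaired in the mutated context
change = weighted_sre_change(ref, mut, 0, len(ref), motifs, ref_prob, mut_prob)
print(round(change, 3))
```

In this toy case the motif count is unchanged, yet the weighted score rises because the retained motif becomes more accessible.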

To assess the contribution of secondary structure information to predicting variants that disrupt SREs, we developed a control method to measure the unweighted loss/gain of SREs as follows:

$\Delta N = |L_{mut}| - |L_{ref}|$ (3)

where $S_{ref}$ and $S_{mut}$ are the reference and mutated sequence windows, respectively, $L$ is a list of SREs, and $L_i$ is the subset of SREs in $L$ that are present in the exonic region of sequence $S_i$ for $i \in \{ref, mut\}$. We separately trained a control model, PEPSI-noSS, using scores for unweighted loss/gain of SREs.
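The unweighted control score can be sketched in the same style. As an assumption, each occurrence of a listed hexamer inside the exon is counted once, including overlapping occurrences; the motif and sequences are hypothetical.

```python
from typing import Set


def motif_count(window: str, exon_start: int, exon_end: int, motifs: Set[str], k: int = 6) -> int:
    """Number of positions in [exon_start, exon_end) where a listed k-mer starts and fits."""
    return sum(
        1
        for i in range(exon_start, exon_end - k + 1)
        if window[i:i + k] in motifs
    )


def unweighted_sre_change(ref_window: str, mut_window: str, exon_start: int, exon_end: int, motifs: Set[str]) -> int:
    """Motif count in the mutated window minus motif count in the reference window."""
    return motif_count(mut_window, exon_start, exon_end, motifs) - motif_count(
        ref_window, exon_start, exon_end, motifs
    )


motifs = {"AGGAGG"}  # hypothetical one-motif list
# A>C change inside the motif removes its only occurrence.
loss = unweighted_sre_change("TTAGGAGGTT", "TTAGGCGGTT", 0, 10, motifs)
# Overlapping occurrences are counted separately.
overlapping = motif_count("AGGAGGAGG", 0, 9, motifs)
print(loss, overlapping)
```

Unlike the structure-weighted score, this count cannot change when a variant leaves every motif intact.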

2.3 Computational framework

The workflow for our computational framework is summarized in Figure 2. Variants are first annotated using the 20 features described in Table 1. We used median imputation within the training set to fill in missing feature annotations. After annotating each training variant, we trained our model to predict the $\Delta\operatorname{logit}(\Psi)$ for each variant. We observed that interpreting the impact of a variant with a given $\Delta\Psi$ will depend on the reference $\Psi$ of the alternative exon. Two variants with the same $\Delta\Psi$ value affecting two different exons will have different consequences if one exon is weakly spliced in (low reference $\Psi$) and the other exon is frequently spliced in (high reference $\Psi$). Given that model features should directly reflect consequences to splicing, we chose to predict on $\Delta\operatorname{logit}(\Psi)$ values to numerically distinguish such cases in which two exons with very different reference $\Psi$ are each affected by a variant with the same measured $\Delta\Psi$. For a given reference $\Psi_{ref}$ and $\Delta\Psi$ associated with a particular variant, we calculate the $\Delta\operatorname{logit}(\Psi)$ as follows:

$\Delta\operatorname{logit}(\Psi) = \operatorname{logit}(\Psi_{ref} + \Delta\Psi) - \operatorname{logit}(\Psi_{ref})$ (4)

where $\operatorname{logit}(p) = \ln\left(p / (1 - p)\right)$. We restricted the range of values for $\Psi_{ref}$ and $\Psi_{ref} + \Delta\Psi$ to $urn:x-wiley:10597794:media:humu23790:humu23790-math-0054$ to avoid getting infinitely large/small values upon taking the logit transformation. Predicted $\Delta\operatorname{logit}(\Psi)$ values for variants in the test set were converted back to $\Delta\Psi$ values as follows:

$\Delta\Psi = \operatorname{logit}^{-1}\left(\operatorname{logit}(\Psi_{ref}) + \Delta\operatorname{logit}(\Psi)\right) - \Psi_{ref}$ (5)
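The forward and inverse transformations can be sketched as below, assuming PSI is expressed on a 0 to 1 scale. The clipping bound `eps` is a hypothetical value chosen for illustration, since the range used in the study is not reproduced here, and the function names are likewise assumptions.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def clip(p: float, eps: float) -> float:
    """Keep p inside [eps, 1 - eps] so that logit(p) stays finite."""
    return min(max(p, eps), 1.0 - eps)


def delta_logit_psi(psi_ref: float, delta_psi: float, eps: float = 1e-3) -> float:
    """Change in logit(PSI) between the variant and reference constructs."""
    ref = clip(psi_ref, eps)
    mut = clip(psi_ref + delta_psi, eps)
    return logit(mut) - logit(ref)


def delta_psi_from_logit(psi_ref: float, delta_logit: float, eps: float = 1e-3) -> float:
    """Convert a predicted change in logit(PSI) back to a change in PSI."""
    ref = clip(psi_ref, eps)
    return inv_logit(logit(ref) + delta_logit) - ref


# The same delta PSI of -0.10 on a highly included exon and a half-included exon.
high = delta_logit_psi(0.95, -0.10)
mid = delta_logit_psi(0.50, -0.10)
print(round(high, 3), round(mid, 3))
print(round(delta_psi_from_logit(0.95, high), 3))
```

The two exons receive different targets on the logit scale even though the measured change in PSI is identical, which is the distinction the transformation is meant to capture.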

Our random forest regression model used a total of 2,000 trees and randomly sampled seven variables as candidates at each split. The importance of each feature during training was calculated as the total decrease in node impurity over all splits involving that feature within each tree, averaged across all trees in the forest as implemented in the randomForest package in R (Breiman, 2001).

2.4 Benchmarking against state-of-the-art splicing defect prediction tools

We benchmarked our method against commonly used state-of-the-art models for predicting the impact of variants on splicing, including HAL (Rosenberg et al., 2015), SPANR (Xiong et al., 2014), and MutPredSplice (Mort et al., 2014). HAL is a linear model trained on over two million random sequences to predict the splicing impact of exonic variants in terms of $\Delta\Psi$. SPANR is another method that predicts the $\Delta\Psi$ of exonic and intronic SNVs using a Bayesian deep learning algorithm trained on exon skipping events with 1,393 genomic feature annotations. On the other hand, MutPredSplice is a random forest model trained on various sequence-based and conservation-related features that reports the probability that a given variant is splice-altering.

To carry out this benchmark analysis, we used splicing functional assay data from the Multiplexed Functional Assay of Splicing using Sort-seq (MFASS) experiment (Cheung et al., 2019). The MFASS experiment assayed a total of 27,733 ExAC SNPs within or adjacent to 2,339 exons. Similar to the Vex-seq experiment, MFASS uses a set of three-exon, two-intron reporters in which skipping of the middle exon leads to reconstitution of fluorescence. Fluorescence-activated cell sorting was used to separate the pooled library of splicing reporters into separate bins based on observed fluorescence representing different splicing behavior. An exon inclusion index (EI) was then calculated for each tested sequence based on a weighted average of normalized read counts multiplied by the average exon inclusion level across all bins. The change in inclusion index ($\Delta\mathrm{EI}$) for a particular library sequence between the reference and mutant was determined for each assayed variant.

Splice-disrupting variants (SDVs) were defined in the MFASS experiment as variants that change the inclusion index of a tested exon by at least 0.5. Based on this threshold, the MFASS experiment identified a total of 1,050 SDVs out of all 27,733 scored variants. We constructed a separate test set of 2,094 ExAC SNPs measured in the MFASS experiment, comprising 1,047 SNVs classified as splice-disrupting and 1,047 SNVs not classified as splice-disrupting.

We trained PEPSI and PEPSI-noSS on all 2,055 variants from the Vex-seq experiment and predicted the $\Delta\Psi$ of SNVs in the test set that we constructed from the results of the MFASS experiment. Given that the Vex-seq experiment and the MFASS experiment both assayed variants from ExAC, we made sure that our curated test set from the MFASS experiment did not include any variants from the Vex-seq experiment. We then classified variants as splice-disrupting if the predicted $urn:x-wiley:10597794:media:humu23790:humu23790-math-0062$ (Xiong et al., 2014).
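One way to assemble a balanced benchmark set that excludes training variants is sketched below. The text states only that the set contains equal numbers of splice-disrupting and non-disrupting SNVs and no Vex-seq variants; the random down-sampling of non-disrupting variants, the function name, and the toy identifiers are assumptions.

```python
import random
from typing import Dict, List, Set, Tuple


def build_balanced_test_set(
    mfass_labels: Dict[str, bool],
    vexseq_ids: Set[str],
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return (splice-disrupting IDs, equally many sampled non-disrupting IDs)."""
    # Drop any variant that was also assayed in the training experiment.
    eligible = {v: is_sdv for v, is_sdv in mfass_labels.items() if v not in vexseq_ids}
    positives = sorted(v for v, is_sdv in eligible.items() if is_sdv)
    negatives = sorted(v for v, is_sdv in eligible.items() if not is_sdv)
    if len(negatives) < len(positives):
        raise ValueError("not enough non-disrupting variants to balance the set")
    # Assumed step: random down-sampling with a fixed seed for reproducibility.
    sampled_negatives = random.Random(seed).sample(negatives, len(positives))
    return positives, sampled_negatives


# Toy data: var0-var2 are splice-disrupting; var2 and var9 overlap the training set.
toy_labels = {f"var{i}": (i < 3) for i in range(10)}
toy_vexseq = {"var2", "var9"}
sdvs, non_sdvs = build_balanced_test_set(toy_labels, toy_vexseq)
print(sdvs, non_sdvs)
```

Filtering before sampling guarantees that no training variant can be drawn into either class.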

SPANR predictions for MFASS test set variants were determined using pre-computed annotation scores for SPANR (SPIDEX) that were downloaded from http://www.openbioinformatics.org/annovar/spidex_download_form.php. We used the HAL webserver (http://splicing.cs.washington.edu/SE) and MutPredSplice webserver (http://www.mutdb.org/mutpredsplice/submit.htm) to obtain respective tool predictions for the variants in the MFASS test set. We proceeded to classify variants as splice-disrupting based on scores produced by each tool as follows. For tools that characterize the splicing impact of variants using $\Delta\Psi$, such as SPANR and HAL, we applied the same threshold of $urn:x-wiley:10597794:media:humu23790:humu23790-math-0064$ to classify variants as splice-disrupting. For MutPredSplice, we used the default tool threshold for general scores, in which variants with general scores $urn:x-wiley:10597794:media:humu23790:humu23790-math-0065$ were considered as splice-disrupting.

3 RESULTS AND DISCUSSION

3.1 Incorporating secondary structure into predicting variants that disrupt SREs

We first trained PEPSI and PEPSI-noSS on the Vex-seq training set and predicted the splicing impact of variants from the Vex-seq test set. For PEPSI, we observed a Pearson correlation of $urn:x-wiley:10597794:media:humu23790:humu23790-math-0066$ between predicted $\Delta\Psi$ and experimentally measured $\Delta\Psi$ (Figure 3a). Interestingly, for our control model PEPSI-noSS, we observed a slightly better Pearson correlation of $urn:x-wiley:10597794:media:humu23790:humu23790-math-0069$ between predicted $\Delta\Psi$ and experimentally measured $\Delta\Psi$ (Figure 3b). For both PEPSI and PEPSI-noSS, MaxEntScan scores and distance to nearest splice site were considered to be the most predictive features during model training. Interestingly, we also observed that our SRE features, ESEseq and ESSseq, were ranked with greater importance during model training for PEPSI than during model training for PEPSI-noSS (Tables S1 and S2).

To analyze how the incorporation of secondary structure influences the prediction accuracy of SREs, such as ESEs and ESSs, we compared the distribution of $\Delta\Psi$ for variants that gain more ESEs than ESSs (ESEseq $>$ ESSseq) and for variants that gain more ESSs than ESEs (ESEseq $<$ ESSseq). In particular, we focused on 1,002 exonic variants that were more than two nucleotides away from the nearest splice site to avoid cases where the measured $\Delta\Psi$ could be attributed to disruption of splice sites or branch point sequences instead of ESEs/ESSs. For PEPSI, where SRE count changes were weighted by the probability of secondary structure formation, we observed that ESEseq $>$ ESSseq variants have a significantly more positive $\Delta\Psi$ distribution compared to ESEseq $<$ ESSseq variants, $urn:x-wiley:10597794:media:humu23790:humu23790-math-0079$ (Wilcoxon rank-sum test; Figure 4a).

For PEPSI-noSS, where SRE count changes were unweighted, we also observed a significantly more positive $\Delta\Psi$ distribution among ESEseq $>$ ESSseq variants compared to ESEseq $<$ ESSseq variants, $urn:x-wiley:10597794:media:humu23790:humu23790-math-0083$ (Wilcoxon rank-sum test; Figure 4b). To determine if the probabilities used in weighting SRE count changes were introducing random noise, we recalculated SRE count changes using probabilities of secondary structure formation derived from a randomly shuffled version of the original sequence window. Using these recalculated scores, we did not observe any statistically significant difference in the $\Delta\Psi$ distribution between ESEseq $>$ ESSseq variants and ESEseq $<$ ESSseq variants, $urn:x-wiley:10597794:media:humu23790:humu23790-math-0087$ (Wilcoxon rank-sum test; Figure 4c). This suggests that the original probabilities used in adjusting SRE count changes in PEPSI did not simply introduce random noise to the model.
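The shuffled-sequence control can be sketched as a base-level permutation of the window. This is an illustration under assumptions: the text does not say whether mononucleotide or higher-order shuffling was used, and the function name, seed handling, and example sequence are hypothetical.

```python
import random


def shuffle_window(window: str, seed: int = 0) -> str:
    """Randomly permute the bases of a sequence window, preserving composition."""
    bases = list(window)
    random.Random(seed).shuffle(bases)
    return "".join(bases)


original = "TTAGGAGGTTCAGGTAAGTACCGA"  # hypothetical sequence window
shuffled = shuffle_window(original, seed=1)
print(original)
print(shuffled)
```

Unpaired probabilities would then be recomputed on the shuffled window and used in place of the original weights, so that any remaining signal cannot be attributed to the true local structure.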

Interestingly, for PEPSI-noSS, we observed multiple variants in which count changes in SREs did not correspond with the directionality of the measured $\Delta\Psi$. Out of the 1,002 exonic variants, 160 ESEseq $>$ ESSseq variants had a negative $\Delta\Psi$ and 219 ESEseq $<$ ESSseq variants had a positive $\Delta\Psi$. Integrating secondary structure information into calculating SRE count changes for these 379 variants yielded 113 cases in which weighted count changes of SREs corresponded with the directionality of measured $\Delta\Psi$ (Table S3).
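The bookkeeping behind these counts can be sketched as follows. The sketch assumes each variant is summarized by the difference ESEseq minus ESSseq under the unweighted and the weighted scoring, together with its measured change in PSI; the function names and toy values are hypothetical.

```python
from typing import Iterable, Tuple


def direction_agrees(sre_balance: float, delta_psi: float) -> bool:
    """True when the SRE balance (ESEseq - ESSseq) and delta PSI share a sign."""
    return sre_balance * delta_psi > 0


def direction_conflicts(sre_balance: float, delta_psi: float) -> bool:
    """True when the SRE balance and delta PSI have opposite signs."""
    return sre_balance * delta_psi < 0


def count_resolved(variants: Iterable[Tuple[float, float, float]]) -> Tuple[int, int]:
    """Return (conflicts under unweighted scoring, conflicts resolved by weighting).

    Each variant is (unweighted_balance, weighted_balance, delta_psi).
    """
    conflicts = 0
    resolved = 0
    for unweighted_balance, weighted_balance, delta_psi in variants:
        if direction_conflicts(unweighted_balance, delta_psi):
            conflicts += 1
            if direction_agrees(weighted_balance, delta_psi):
                resolved += 1
    return conflicts, resolved


toy_variants = [
    (2.0, -0.3, -0.10),   # conflict that the weighted score resolves
    (1.0, 0.4, -0.05),    # conflict that remains
    (-1.0, -0.2, 0.08),   # conflict that remains
    (1.0, 0.5, 0.12),     # no conflict
]
print(count_resolved(toy_variants))
```

Applied to the study data, the first number corresponds to the 379 discordant variants (160 + 219) and the second to the 113 resolved cases.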

Some of these resolved cases yield interesting insights into the role of secondary structure in SRE recognition. For example, the variant rs771094081:A>C, which had a measured $urn:x-wiley:10597794:media:humu23790:humu23790-math-0094$ , was associated with the loss of two putative ESS motifs, AGGAGG and AGGTGG. However, based on variant annotations by PEPSI, the weighted count change in ESS motifs indicates a net increase in the number of ESS motifs free of secondary structure. We hypothesized that the two putative ESS motifs that were lost as a result of the mutation were occluded by secondary structure in the original sequence context, thus rendering them inactive. As a result, the loss of these two motifs should not directly correspond to a positive $\Delta\Psi$. Instead, it is possible that the point mutation alters the landscape of secondary structures such that a pre-existing ESS motif sequestered by secondary structure in the original sequence context is more accessible for binding in the mutated sequence context. This could potentially explain the negative $\Delta\Psi$ that was measured for this particular variant.

To validate our hypothesis, we considered the original and mutated 95 bp sequences for the exon harboring the mutation NC_000001.11:g.24339710A>C. RNA secondary structures for both the original and mutated sequences were predicted using the Mfold web server (Zuker, 2003). Folding was simulated at a temperature of 37°C and ionic conditions of 1 M NaCl. If the Mfold web server produced multiple secondary structures, the structure with the lowest (most negative) free energy was chosen to be the representative structure. Comparing the secondary structures for the reference and mutant sequences (Figure 5a,b), we discovered an ESS motif TTGAGG present in both sequences that was more likely to be single-stranded in the mutant sequence compared to the reference sequence.

3.2 Benchmarking against state-of-the-art splicing defect prediction tools

We next compared the performance of PEPSI and PEPSI-noSS in classifying SDVs among SNPs assayed in the MFASS experiment (Cheung et al., 2019) against several state-of-the-art tools, including HAL (Rosenberg et al., 2015), SPANR (Xiong et al., 2014), and MutPredSplice (Mort et al., 2014). We first compared the sensitivity and precision of PEPSI and PEPSI-noSS to SPANR in classifying SDVs from all 2,094 SNPs in the curated test set, given that SPANR is capable of scoring both exonic and intronic SNPs. Both PEPSI and PEPSI-noSS demonstrated higher sensitivity and precision in classifying SDVs in the MFASS test set compared to SPANR (Table 2). Given that SPANR was trained to predict PSI of exons using human RNA-seq data (Xiong et al., 2014), predictions made by SPANR may more accurately reflect actual in vivo splicing than splicing activity observed in mini-gene splicing assays such as Vex-seq and MFASS. This could explain why SPANR underperformed in classifying SDVs in the MFASS test set compared to PEPSI and PEPSI-noSS.

Table 2. Precision, recall, accuracy, and F1 scores of PEPSI, PEPSI-noSS, and SPANR upon predicting splice-disrupting variants among 2,094 SNVs assayed in the MFASS experiment

Model	TP	TN	FP	FN	Recall (%)	Precision (%)	Accuracy (%)	F1 score
PEPSI	528	981	66	519	50.43	88.89	72.06	0.6435
PEPSI-noSS	604	978	69	443	57.69	89.75	75.55	0.7023
SPANR	398	958	89	649	38.01	81.72	64.76	0.5189
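
The derived columns in Table 2 follow from the confusion counts using the standard definitions of recall, precision, accuracy, and F1. The short sketch below recomputes them from the counts in the table; the function name is ours and is used only for illustration.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)        # sensitivity
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall_pct": round(100 * recall, 2),
        "precision_pct": round(100 * precision, 2),
        "accuracy_pct": round(100 * accuracy, 2),
        "f1": round(f1, 4),
    }

# Counts copied from Table 2 (2,094 SNVs in total).
table2 = {
    "PEPSI": (528, 981, 66, 519),
    "PEPSI-noSS": (604, 978, 69, 443),
    "SPANR": (398, 958, 89, 649),
}
for model, counts in table2.items():
    print(model, classification_metrics(*counts))
```

For PEPSI, for example, this gives a recall of 528/1,047 = 50.43% and a precision of 528/594 = 88.89%, matching the table.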

On the other hand, HAL and MutPredSplice were only able to make predictions on a subset of variants in the MFASS test set that were exonic, with HAL able to score 1,045 variants (49.9% of the original test set) and MutPredSplice able to score only 575 variants (27.5% of the original test set). We restricted our benchmark analysis of PEPSI, PEPSI-noSS, SPANR, HAL, and MutPredSplice to the 575 exonic SNVs that were classifiable by all five tools (Table 3). For exonic variants, PEPSI and PEPSI-noSS again demonstrated better sensitivity and precision in predicting SDVs compared to SPANR. Moreover, compared to MutPredSplice, PEPSI exhibited higher precision at a similar level of sensitivity, and PEPSI-noSS exhibited both higher precision and higher sensitivity. In addition, while HAL achieved the greatest sensitivity in predicting SDVs among the tools tested, HAL also produced the most false-positive predictions.

Table 3. Precision, recall, accuracy, and F1 scores of PEPSI, PEPSI-noSS, SPANR, HAL, and MutPredSplice upon predicting splice-disrupting variants among 575 exonic SNVs assayed in the MFASS experiment

Model	TP	TN	FP	FN	Recall (%)	Precision (%)	Accuracy (%)	F1 score
PEPSI	77	293	19	186	29.28	80.21	64.35	0.4290
PEPSI-noSS	119	291	21	144	45.25	85.00	71.30	0.5906
SPANR	57	275	37	206	21.67	60.64	57.74	0.3193
HAL	192	179	133	71	73.00	59.08	64.52	0.6531
MutPredSplice	79	269	43	184	30.04	64.75	60.52	0.4104

Given that the Vex-seq experiment was carried out using HepG2 cells whereas the MFASS experiment was carried out using Hek293T cells, we hypothesized that differences in transcriptome profiles between the two cell lines could explain incorrect predictions made by both PEPSI and PEPSI-noSS. To approximate the correlation in transcriptome expression between HepG2 cells and Hek293T cells, we identified a total of 18 ExAC SNPs that were measured in both the Vex-seq experiment and the MFASS experiment. The Spearman correlation coefficient between the Vex-seq ΔPSI values and the MFASS ΔPSI values for these 18 SNVs was $urn:x-wiley:10597794:media:humu23790:humu23790-math-0100$ (Figure 6). Based on this information, it is possible that better predictions by PEPSI and PEPSI-noSS could be achieved if our test set involved variants whose changes in exon inclusion index were measured within HepG2 cells instead of Hek293T cells.
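
For readers who wish to repeat this kind of comparison on their own paired measurements, a Spearman coefficient can be computed as the Pearson correlation of the ranks. The sketch below is a dependency-free illustration; the function names and the four paired ΔPSI values are invented and are not the 18 measurements analyzed here.

```python
import math
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented paired values standing in for two assays of the same variants.
assay_a = [0.1, 0.2, 0.3, 0.4]
assay_b = [0.2, 0.1, 0.4, 0.3]
rho = spearman(assay_a, assay_b)
print(round(rho, 3))
```

With only a handful of shared variants, such a coefficient carries wide uncertainty and is best read as a rough indication of agreement between assays.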

Interestingly, our control model PEPSI-noSS demonstrated greater sensitivity while maintaining the same level of precision in predicting SDVs compared to PEPSI. There still remain several limitations in the SRE scoring method used by the PEPSI framework that could account for the model's underperformance in sensitivity. First, the region that is free to fold behind the transcribing RNA polymerase is not always of a fixed size, which PEPSI assumes for convenience of calculation. The size of this region and the availability of secondary structures that can form are influenced by the speed of transcription and local concentration of RNA binding proteins (Schroeder, Grossberger, Pichler, & Waldsich, 2002). Moreover, it is possible that there are splicing factors that exclusively recognize RNA secondary structural elements. It is also possible that proteins that bind in the proximity of secondary structural elements can influence their single-strandedness. In such cases, the probability of single-strandedness may underestimate the true activity of the SRE. Devising a more nuanced method that considers the different ways in which secondary structure influences SRE recognition, as outlined above, could not only help explain directionality changes in measured ΔPSI for variants, but also improve the sensitivity of predictions for splice-disrupting variants.

4 CONCLUSION

In the CAGI5 “Vex-seq” challenge, we investigated whether the use of secondary structure information could help improve predictions for genomic variants that disrupt SREs. We demonstrated that secondary structure information can help resolve cases where direct count changes in SREs do not correspond with the directionality of measured ΔPSI values. Moreover, in a benchmark analysis involving other state-of-the-art splice prediction tools, we show that the PEPSI framework achieves comparable sensitivity and precision in predicting variants that disrupt splicing. However, the approach that PEPSI uses in weighting SRE count changes by the probability of secondary structure formation has several limitations that may restrict its sensitivity in detecting splice-disrupting variants.

There are also several limitations in using PEPSI to model in vivo splicing. First, our model was trained on data from massively parallel splicing assays. The design of these assays lacks the full context of an entire gene and chromatin states, which have been shown to regulate in vivo splicing (Luco, Allo, Schor, Kornblihtt, & Misteli, 2011). Moreover, our model was not designed to predict ΔPSI of variants in a cell-type specific manner. Given that the transcriptome profiles of HepG2 cells and Hek293T cells are likely to differ, training our model on Vex-seq experimental data may have led to misleading predictions of splice-disrupting variants detected in the MFASS experiment. Developing a cell-type specific model would provide a more accurate characterization of variant impact on exon splicing, especially in the context of certain diseases.

The predictive power of our computational model can likely be further improved by refining our approaches to SRE modeling using secondary structure and by the increased availability of splicing assay data for different cell types. Our model is openly available at https://github.com/rwang916/PEPSI for researchers to use freely for downstream analysis. All source code for PEPSI was built and tested on a Linux system.

ACKNOWLEDGEMENTS

We thank the organizers of the Fifth Critical Assessment of Genome Interpretation, especially Steven Brenner, John Moult, and Gaia Andreoletti, for coordinating and hosting these challenges. We are grateful to Scott Adamson and Brenton Graveley from the University of Connecticut for providing the Vex-seq experimental data in the “Vex-seq” challenge. We are also grateful to Steven Brenner for helpful discussions. This study was conducted in the laboratory of Steven Brenner at the University of California, Berkeley, Department of Plant and Microbial Biology. The CAGI experiment coordination is supported by NIH U41 HG007346 and the CAGI conference is supported by NIH R13 HG006650, U19 HD077627.

Open Research

DATA AVAILABILITY STATEMENT

Our tool is freely available to researchers at: https://github.com/rwang916/PEPSI.

REFERENCES

Adamson, S. I., Zhan, L., & Graveley, B. R. (2018). Vex-seq: High-throughput identification of the impact of genetic variation on pre-mRNA splicing efficiency. Genome Biology, 19(1), 71.
10.1186/s13059-018-1437-x
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
10.1023/A:1010933404324
Buratti, E., Muro, A. F., Giombi, M., Gherbassi, D., Iaconcig, A., & Baralle, F. E. (2004). RNA folding affects the recruitment of SR proteins by mouse and human polypurinic enhancer elements in the fibronectin EDA exon. Molecular and Cellular Biology, 24(3), 1387–1400.
10.1128/MCB.24.3.1387-1400.2004
Cheung, R., Insigne, K. D., Yao, D., Burghard, C. P., Jones, E. M., Goodman, D. B., & Kosuri, S. (2019). Many rare genetic variants have unrecognized large-effect disruptions to exon recognition. Molecular Cell, 73, 183–194.
10.1016/j.molcel.2018.10.037
Corvelo, A., Hallegger, M., Smith, C. W., & Eyras, E. (2010). Genome-wide association between branch point properties and alternative splicing. PLOS Computational Biology, 6(11), e1001016.
10.1371/journal.pcbi.1001016
Damgaard, C. K., Tange, T. O., & Kjems, J. (2002). hnRNP A1 controls HIV-1 mRNA splicing through cooperative binding to intron and exon splicing silencers in the context of a conserved secondary structure. RNA, 8(11), 1401–1415.
10.1017/S1355838202023075
Desmet, F. O., Hamroun, D., Lalande, M., Collod-Béroud, G., Claustres, M., & Béroud, C. (2009). Human Splicing Finder: An online bioinformatics tool to predict splicing signals. Nucleic Acids Research, 37(9), e67.
10.1093/nar/gkp215
Fu, X. D., & Ares, M. (2014). Context-dependent control of alternative splicing by RNA-binding proteins. Nature Reviews Genetics, 15(10), 689–701.
10.1038/nrg3778
Ke, S., Shang, S., Kalachikov, S. M., Morozova, I., Yu, L., Russo, J. J., … Chasin, L. A. (2011). Quantitative evaluation of all hexamers as exonic splicing elements. Genome Research, 21(8), 1360–1374.
10.1101/gr.119628.110
Kircher, M., Witten, D. M., Jain, P., O'Roak, B. J., Cooper, G. M., & Shendure, J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics, 46(3), 310–315.
10.1038/ng.2892
Lim, K. H., Ferraris, L., Filloux, M. E., Raphael, B. J., & Fairbrother, W. G. (2011). Using positional distribution to identify splicing elements and predict pre-mRNA processing defects in human genes. Proceedings of the National Academy of Sciences of the United States of America, 108(27), 11093–11098.
10.1073/pnas.1101135108
Lorenz, R., Bernhart, S. H., Höner Zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P. F., & Hofacker, I. L. (2011). ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6, 26.
10.1186/1748-7188-6-26
Luco, R. F., Allo, M., Schor, I. E., Kornblihtt, A. R., & Misteli, T. (2011). Epigenetics in alternative pre-mRNA splicing. Cell, 144(1), 16–26.
10.1016/j.cell.2010.11.056
Mattaj, I. W., & Nagai, K. (1994). RNA-protein interactions. Oxford: Oxford University Press.
Minovitsky, S., Gee, S. L., Schokrpur, S., Dubchak, I., & Conboy, J. G. (2005). The splicing regulatory element, UGCAUG, is phylogenetically and spatially conserved in introns that flank tissue-specific alternative exons. Nucleic Acids Research, 33(2), 714–724.
10.1093/nar/gki210
Mort, M., Sterne-Weiler, T., Li, B., Ball, E. V., Cooper, D. N., Radivojac, P., … Mooney, S. D. (2014). MutPred Splice: Machine learning-based prediction of exonic variants that disrupt splicing. Genome Biology, 15(1), R19.
10.1186/gb-2014-15-1-r19
Nilsen, T. W., & Graveley, B. R. (2010). Expansion of the eukaryotic proteome by alternative splicing. Nature, 463, 457–463.
10.1038/nature08909
Rosenberg, A. B., Patwardhan, R. P., Shendure, J., & Seelig, G. (2015). Learning the sequence determinants of alternative splicing from millions of random sequences. Cell, 163, 698–711.
10.1016/j.cell.2015.09.054
Schroeder, R., Grossberger, R., Pichler, A., & Waldsich, C. (2002). RNA folding in vivo. Current Opinion in Structural Biology, 12, 296–300.
10.1016/S0959-440X(02)00325-1
Tafer, H., Ameres, S. L., Obernosterer, G., Gebeshuber, C. A., Schroeder, R., Martinez, J., & Hofacker, I. L. (2008). The impact of target site accessibility on the design of effective siRNAs. Nature Biotechnology, 26(5), 578–583.
10.1038/nbt1404
Xiong, H. Y., Alipanahi, B., Lee, L. J., Bretschneider, H., Merico, D., Yuen, R. K., … Frey, B. J. (2014). RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease. Science, 347(6218), 1254806–1254806.
10.1126/science.1254806
Yeo, G., & Burge, C. B. (2004). Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. Journal of Computational Biology, 11, 377–394.
10.1089/1066527041410418
Zuker, M. (2003). Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research, 31(13), 3406–3415.
10.1093/nar/gkg595
