Volume 40, Issue 9 pp. 1243-1251

SPECIAL ARTICLE

CAGI 5 splicing challenge: Improved exon skipping and intron retention predictions with MMSplice

Jun Cheng (Corresponding Author)
[email protected]
orcid.org/0000-0001-5573-9791
Department of Informatics, Technical University of Munich, Garching, Germany
Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, Munich, Germany

Muhammed Hasan Çelik
Department of Informatics, Technical University of Munich, Garching, Germany

Thi Yen Duong Nguyen
Department of Informatics, Technical University of Munich, Garching, Germany

Žiga Avsec
Department of Informatics, Technical University of Munich, Garching, Germany
Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, Munich, Germany

Julien Gagneur (Corresponding Author)
[email protected]
Department of Informatics, Technical University of Munich, Garching, Germany
Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, Munich, Germany

Correspondence: Jun Cheng and Julien Gagneur, Department of Informatics, Technical University of Munich, Boltzmannstraße 3, 85748 Garching, Germany.

Email: [email protected] (J.C.) and [email protected] (J.G.)

First published: 09 May 2019

https://doi.org/10.1002/humu.23788

Abstract

Pathogenic genetic variants often primarily affect splicing. However, it remains difficult to quantitatively predict whether and how genetic variants affect splicing. In 2018, the fifth edition of the Critical Assessment of Genome Interpretation proposed two splicing prediction challenges based on experimental perturbation assays: Vex-seq, assessing exon skipping, and MaPSy, assessing splicing efficiency. We developed a modular modeling framework, MMSplice, the performance of which was among the best on both challenges. Here we provide insights into the modeling assumptions of MMSplice and its individual modules. We furthermore illustrate how MMSplice can be applied in practice for individual genome interpretation, using the MMSplice VEP plugin and the Kipoi variant interpretation plugin, which are directly applicable to VCF files.

1 INTRODUCTION

RNA splicing is a process that removes intronic sequences from precursor RNAs to form mature RNAs. Alternative splicing occurs when exons are concatenated in alternative combinations (Alberts et al., 2008). Alternative splicing has been shown to be important for tissue development (Baralle & Giudice, 2017). The most common type of alternative splicing in humans is exon skipping (Y. Wang et al., 2015). Skipping of an exon can be quantified with percent spliced-in (PSI, Ψ), which is defined as the fraction of transcripts that include the exon (Goldstein et al., 2016). Another frequently used splicing metric is the splicing efficiency, which we here define as the fraction of spliced transcripts among spliced and unspliced transcripts (Braberg et al., 2013; Soemedi et al., 2017; Wilhelm et al., 2008). With RNA-seq data, the splicing efficiency can be defined for every splice site by considering the number of reads spanning exon-exon junctions and the number of reads spanning exon-intron boundaries. Splicing efficiency captures intron retention. Unlike Ψ, which is only relevant for splice sites involved in alternative splicing, splicing efficiency is relevant for all splice sites.
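
The two metrics can be illustrated with read counts. The sketch below is a simplification under stated assumptions: the function and argument names are hypothetical, and Ψ is computed from raw counts of inclusion-supporting and skipping-supporting reads without the normalization that dedicated quantification tools apply.

```python
def percent_spliced_in(inclusion_reads: int, skipping_reads: int) -> float:
    """Psi: fraction of informative reads that support inclusion of the exon."""
    total = inclusion_reads + skipping_reads
    if total == 0:
        raise ValueError("no informative reads for this exon")
    return inclusion_reads / total


def splicing_efficiency(spliced_reads: int, unspliced_reads: int) -> float:
    """Fraction of spliced reads among spliced and unspliced reads at a splice site.

    spliced_reads: reads spanning the exon-exon junction.
    unspliced_reads: reads spanning the exon-intron boundary.
    """
    total = spliced_reads + unspliced_reads
    if total == 0:
        raise ValueError("no informative reads for this splice site")
    return spliced_reads / total


# Illustrative counts only (not taken from any dataset).
psi = percent_spliced_in(inclusion_reads=80, skipping_reads=20)
efficiency = splicing_efficiency(spliced_reads=90, unspliced_reads=10)
```

Both quantities are fractions in [0, 1]; a drop in splicing efficiency at a splice site corresponds to more intron retention there.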

Splicing defects are among the most frequent causes of Mendelian disorders (Li et al., 2016; López-Bigas, Audit, Ouzounis, Parra, & Guigó, 2005). Moreover, thousands of splicing QTLs have been identified and linked to common diseases (Consortium et al., 2015; Li et al., 2016).

Genetic variants affect splicing in two common ways. They can change alternative splicing, in particular exon skipping, and they can change splicing efficiency. Various methods have been developed to predict the effect of variants on splicing. Early methods focused on scoring splice regulatory elements, such as splice sites (Yeo & Burge, 2004), exonic splicing enhancers and silencers, intronic splicing enhancers and silencers (Fairbrother, 2002; Fairbrother et al., 2004; Z. Wang, Xiao, Van Nostrand, & Burge, 2006; Zhang & Chasin, 2004; Zhang & Kangsamaksin, 2005), and branch points (Bretschneider, Gandhi, Deshwar, Zuberi, & Frey, 2018; Paggi & Bejerano, 2018). With these methods, the potential impact of a variant can be assessed as the difference in scores between the reference and the alternative sequence. Other methods focus on predicting Ψ directly. One of the early successful Ψ prediction methods was developed by Barash et al. (2010) using mouse transcriptome data. The model learned a “splicing code” from variations of Ψ across exons and across tissues. Although the model was trained only with the reference genome and not with genetic variants, it could predict the effect of variants on splicing. A similar model, SPANR, was later developed for the human genome (Xiong et al., 2015). SPANR was successful in predicting pathogenic variants for several diseases. Even though learning the splicing code from the reference sequence was successful, such models may suffer from evolutionary confounding and fail to learn causal features. To address this issue, large-scale perturbation assays, such as massively parallel reporter assays (MPRA) and saturation mutagenesis screens, have been developed (Barash et al., 2010; Xiong et al., 2015; Rosenberg, Patwardhan, Shendure, & Seelig, 2015; Adamson, Zhan, & Graveley, 2018; Ke et al., 2018). In particular, Rosenberg, Patwardhan, Shendure, and Seelig (2015) probed millions of exonic and intronic random sequences to test their impact on splicing. Their model, HAL, improved upon the state-of-the-art performance at predicting variant effects on exon skipping and alternative donor usage.

Perturbation data are ideal to benchmark computational methods for their predictive power on causal effects. The fifth Critical Assessment of Genome Interpretation (CAGI 5) had two splicing challenges with data from such assays: the Vex-seq (Adamson et al., 2018) challenge and the MaPSy (Soemedi et al., 2017) challenge. The tasks of the two challenges were related yet distinct. The Vex-seq experiment assayed 2,059 natural genetic variants, including exonic and intronic single-nucleotide variants (SNVs) and insertions/deletions (indels). The measured quantity was Ψ. The MaPSy experiment measured the impact of 5,761 exonic disease-causing missense mutations on splicing. The assay was performed both in vivo and in vitro. Approximately 10% of the mutations significantly altered splicing (intron retention) both in vivo and in vitro. Such variants were defined as exonic splicing mutations (ESM). The measured quantity was splicing efficiency (Section 2). Although the two challenges have different measured quantities, we assumed that variants disrupting splicing could affect both Ψ and splicing efficiency. Therefore, we applied a modular modeling approach, MMSplice (Cheng et al., 2019), where the modules score different gene regions and are shared across challenges. The predictors proposed for each challenge differed only in how they combine the scores of the individual modules.

We have described MMSplice and the modular modeling strategy previously (Cheng et al., 2019). In this CAGI special issue, we focus on the application of MMSplice to the CAGI 5 challenges. In particular, we provide insights into the modeling assumptions and the module architecture. We also emphasize model and variant interpretation, as these are relevant for downstream human genetic applications.

2 METHODS

2.1 Modular modeling approach for the Vex-seq and MaPSy challenges

The Vex-seq data covered variants from both exons and introns. The training data provided by CAGI were limited for both challenges, with 957 training data points for Vex-seq and 4,964 for MaPSy. It is probably difficult to train a model that captures many of the splicing regulatory elements directly from these data. Therefore, we used richer complementary data from different sources (Cheng et al., 2019). We used the GENCODE 24 annotation to train a module to score donor sites and, similarly, a module to score acceptor sites. In total, 524,569 training data points and 131,143 evaluation data points were used to train the donor module, while 566,822 training data points and 141,706 evaluation data points were used to train the acceptor module. The modules were trained as classifiers to distinguish annotated splice sites from random sequences around the selected splice sites (with some bias toward sequences with splice dinucleotides [Cheng et al., 2019]). We further used data from a massively parallel reporter assay (MPRA) that probed the effect of 2 million random sequences on splicing (Rosenberg et al., 2015). The MPRA data had exonic and intronic random sequences, from which we trained modules to score exons and modules to score introns (Cheng et al., 2019). In total, we trained six modules: donor, acceptor, 5′ exon, 3′ exon, 5′ intron, and 3′ intron. The detailed descriptions of all modules and their training methods are given in Cheng et al. (2019). To score variants for their effect on Ψ (∆Ψ) and splicing efficiency, we trained separate linear models on the predictions of a common set of modules (Figure 1). Our modules collectively consider the sequence of the whole exon and 100 nt of flanking intron on both sides and therefore score variants in this range.

Figure 1. MMSplice model for Vex-seq and MaPSy challenges

2.2 Vex-seq challenge

2.2.1 Data processing

The Vex-seq data tested 2,059 variants from the Exome Aggregation Consortium for their effect on Ψ (Adamson et al., 2018). For each variant-exon pair, Ψ for the reference sequence and the alternative sequence was measured on minigene reporters with RNA-seq. The assessed variants included SNVs as well as short indels from both exonic and intronic regions. The Vex-seq CAGI challenge provided 957 variants from chromosomes 1 to 8 for training. For each variant, the tested exon coordinates and the associated reference Ψ and ∆Ψ were provided. The test data consisted of 1,054 variants from chromosomes 9–22 and chromosome X. The reference Ψ values for the exons with reference sequences were provided. The predictors had to predict ∆Ψ for each variant.

2.2.2 Vex-seq model

To predict ∆Ψ for each variant-exon pair from Vex-seq, five modules were applied to the reference sequence and the alternative sequence separately. These were the donor module, the acceptor module, the 5′ exon module, the 5′ intron module, and the 3′ intron module. A score difference (∆Score) between the reference sequence and the alternative sequence was calculated for each module. A linear model was trained with the Vex-seq training data to predict the log odds ratio of Ψ (∆logit(Ψ)) from the five ∆Scores, using interaction terms between scores of overlapping regions. Writing logit for the log odds function (the inverse of the logistic function), the model reads:

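$\Delta\mathrm{logit}(\Psi) = \beta_0 + \beta_1\,\Delta S_{\mathrm{acceptor}} + \beta_2\,\Delta S_{5'\,\mathrm{exon}} + \beta_3\,\Delta S_{\mathrm{donor}} + \beta_4\,\Delta S_{3'\,\mathrm{intron}} + \beta_5\,\Delta S_{5'\,\mathrm{intron}} + \beta_6\,\Delta S_{5'\,\mathrm{exon}}\,\mathbb{1}(\text{exon overlaps splice site}) + \beta_7\,\Delta S_{3'\,\mathrm{intron}}\,\mathbb{1}(\text{3′ intron overlaps acceptor}) + \beta_8\,\Delta S_{5'\,\mathrm{intron}}\,\mathbb{1}(\text{5′ intron overlaps donor})$ (1)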

The difference of Ψ in the natural scale, ∆Ψ, was predicted using the reference value $\Psi_{\mathrm{ref}}$ and the predicted log odds ratios (Cheng et al., 2019).

$\Delta\Psi = \sigma\left(\Delta\mathrm{logit}(\Psi) + \mathrm{logit}(\Psi_{\mathrm{ref}})\right) - \Psi_{\mathrm{ref}}$ (2)

where

$\mathrm{logit}(\Psi) = \log\left(\frac{\Psi}{1 - \Psi}\right)$ (3)

$\sigma(x) = \frac{1}{1 + e^{-x}}$ (4)
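
The back-transformation from the log odds scale to ∆Ψ can be sketched in a few lines. This is a minimal illustration, assuming the form ∆Ψ = σ(∆logit(Ψ) + logit(Ψ_ref)) − Ψ_ref; the function names and the clipping constant `eps`, used here to keep the logit finite at Ψ_ref = 0 or 1, are assumptions made for the example.

```python
import math


def logit(p: float) -> float:
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Logistic function, the inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def delta_psi(psi_ref: float, delta_logit_psi: float, eps: float = 1e-5) -> float:
    """Convert a predicted log odds ratio into a change of Psi.

    psi_ref is clipped to [eps, 1 - eps] (an assumption of this sketch)
    so that logit(psi_ref) stays finite.
    """
    p = min(max(psi_ref, eps), 1.0 - eps)
    psi_alt = sigmoid(logit(p) + delta_logit_psi)
    return psi_alt - p


# The same log odds ratio gives different changes of Psi
# depending on the reference level.
effect_mid = delta_psi(psi_ref=0.5, delta_logit_psi=-1.99)
effect_high = delta_psi(psi_ref=0.95, delta_logit_psi=-1.99)
```

Because the alternative Ψ is always the output of the logistic function, the predicted Ψ_ref + ∆Ψ cannot leave the [0, 1] interval, whatever the size of the predicted log odds ratio.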

2.3 MaPSy challenge

2.3.1 Data processing

The MaPSy experiment tested 5,761 disease-causing exonic variants from the Human Gene Mutation Database for their impact on RNA splicing efficiencies (Equation 5) both in vivo and in vitro (Soemedi et al., 2017), quantified as:

$\log_2(\text{allelic ratio}) = \log_2\left(\frac{m_o / m_i}{w_o / w_i}\right)$ (5)

where $m_o$ is the mutant spliced RNA read count, $m_i$ is the mutant input (unspliced) RNA read count, $w_o$ is the wild-type spliced RNA read count, and $w_i$ is the wild-type input RNA read count (Cheng et al., 2019; Soemedi et al., 2017). Transcripts with skipped exons or mis-splicing were ignored.
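
As a worked example of this quantity, the sketch below computes a log allelic ratio from the four read counts. It assumes a base-2 logarithm of the ratio between mutant and wild-type spliced-to-input fractions; the argument names `m_o`, `m_i`, `w_o`, and `w_i` are labels chosen for the example, and the zero-count check is an addition of this sketch.

```python
import math


def log2_allelic_ratio(m_o: int, m_i: int, w_o: int, w_i: int) -> float:
    """Log2 ratio of mutant to wild-type splicing efficiency.

    m_o, w_o: spliced (output) read counts for mutant and wild type.
    m_i, w_i: input (unspliced) read counts for mutant and wild type.
    """
    if min(m_o, m_i, w_o, w_i) <= 0:
        raise ValueError("all read counts must be positive")
    mutant_efficiency = m_o / m_i
    wild_type_efficiency = w_o / w_i
    return math.log2(mutant_efficiency / wild_type_efficiency)


# A mutant that splices half as efficiently as the wild type.
ratio = log2_allelic_ratio(m_o=50, m_i=100, w_o=100, w_i=100)
```

Negative values indicate that the mutant allele is spliced less efficiently than the wild type; zero indicates no difference.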

2.3.2 MaPSy model

In vivo

The in vivo experiment of MaPSy used a three-exon construct with the test exon in the middle. As all variants are exonic, we used three modules that overlap exons: the donor module, the acceptor module and the 5′ exon module. A linear model was trained with the ∆Scores of these three modules to predict splicing efficiency (Equation 6).

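$\log_2(\text{allelic ratio})_{\text{in vivo}} = \beta_0 + \beta_1\,\Delta S_{\mathrm{acceptor}} + \beta_2\,\Delta S_{5'\,\mathrm{exon}} + \beta_3\,\Delta S_{\mathrm{donor}}$ (6)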

In vitro

The in vitro experiment of MaPSy used a two-exon construct with the test exon being the second exon in the transcript. Therefore, the test exons did not have donor sites. Consequently, we applied two modules: the acceptor module and the 5′ exon module. A linear model was trained similarly as for the in vivo model (Equation 7).

$\log_2(\text{allelic ratio})_{\text{in vitro}} = \beta_0 + \beta_1\,\Delta S_{\mathrm{acceptor}} + \beta_2\,\Delta S_{5'\,\mathrm{exon}}$ (7)

Exon splicing mutation classification

To classify ESMs, we trained a logistic regression model with the predicted in vitro splicing efficiency change and eight other features:

MMSplice 5′ exon module score for the wild-type sequence
MMSplice donor module score for the wild-type sequence
MMSplice acceptor module score for the wild-type sequence
Experiment exon length, which is the exon length in the experimental construct and may differ from the annotated genomic exon
Log-transformed wild-type in vitro input
Log-transformed mutant in vitro input
Target exon phastCons conservation score (Siepel et al., 2005)
Target exon flanking intron length in the Ensembl 75 annotation

Features were selected with threefold cross-validation. Besides the above features, scores from CADD, SIFT, phastCons, LoFtool, and GC content change were also initially considered but not selected because they did not improve the prediction performance.

2.4 VEP plugin

We have developed an Ensembl VEP (McLaren et al., 2016) plugin, which integrates the functionalities of our algorithm into VEP. The VEP plugin allows direct analysis of VCF files using the VEP database and services, with an API common to existing VEP plugins and pipelines. The plugin is written in Perl based on the 'BaseVepPlugin' interface recommended by Ensembl. During the analysis, the plugin executes the following steps: it matches corresponding exons for each variant, obtains reference and alternative sequences using VEP APIs, sends those sequences to the MMSplice Python package through standard input, and fetches the associated scores from standard output. We found this to be the simplest way for the plugin to communicate with the MMSplice Python package. Moreover, a Docker container that contains all the dependencies, including VEP, is provided to facilitate installation and usage of the plugin at https://github.com/gagneurlab/MMSplice/tree/master/VEP_plugin.
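
The standard input and output exchange described above can be sketched on the Python side as a small loop. This is not the plugin's actual protocol: the tab-separated line format, the function names, and the toy scorer (a GC-content difference standing in for a real model) are all hypothetical and serve only to illustrate the communication pattern.

```python
import io


def toy_score(ref_seq: str, alt_seq: str) -> float:
    """Hypothetical stand-in for a model: change in GC fraction."""
    def gc_fraction(seq: str) -> float:
        return sum(base in "GC" for base in seq) / len(seq)
    return gc_fraction(alt_seq) - gc_fraction(ref_seq)


def serve(in_stream, out_stream, scorer=toy_score) -> None:
    """Read 'id<TAB>ref<TAB>alt' lines and write 'id<TAB>score' lines.

    The line format is an assumption of this sketch, not the format
    used by the actual VEP plugin.
    """
    for line in in_stream:
        line = line.strip()
        if not line:
            continue
        variant_id, ref_seq, alt_seq = line.split("\t")
        out_stream.write(f"{variant_id}\t{scorer(ref_seq, alt_seq):.4f}\n")
        out_stream.flush()  # let the calling process read each score promptly


# In-memory streams stand in for the pipes between the two processes.
request = io.StringIO("var1\tACGT\tACGG\nvar2\tAAAA\tAAAA\n")
response = io.StringIO()
serve(request, response)
```

In a real deployment the same function would be called with `sys.stdin` and `sys.stdout`, so that a caller written in another language, such as a Perl plugin, can drive the Python process through pipes.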

2.5 Evaluation

For all regression tasks, we chose the Pearson correlation (R) as the primary evaluation metric. However, as the Pearson correlation is invariant to affine transformations, we also report the root-mean-square error (RMSE), which measures the deviation between predicted and measured values. For all classification tasks with a strong class imbalance, we report the precision-recall curve and the area under that curve (auPR). For the cases where the classes are balanced, we chose to use the receiver operating characteristic (ROC) curve and report the area under the ROC curve (auROC).
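
The reason for reporting both metrics can be shown with a small numerical example. The sketch below uses made-up values and standard-library code only; it shows that rescaling and shifting predictions (with a positive slope) leaves the Pearson correlation unchanged while the RMSE grows.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def rmse(predicted, measured):
    """Root-mean-square error between predictions and measurements."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only (not taken from the challenge data).
measured = [-0.30, -0.10, 0.00, 0.05, 0.20]
predicted = [-0.25, -0.05, 0.02, 0.00, 0.15]
rescaled = [2.0 * p + 0.5 for p in predicted]  # affine transformation

r_original = pearson_r(measured, predicted)
r_rescaled = pearson_r(measured, rescaled)
rmse_original = rmse(predicted, measured)
rmse_rescaled = rmse(rescaled, measured)
```

Here `r_original` and `r_rescaled` agree up to floating-point error, whereas `rmse_rescaled` is much larger than `rmse_original`, so only the RMSE penalizes predictions that are on the wrong scale.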

3 RESULTS

3.1 Training performance of modular models

The donor and acceptor modules were trained by classifying annotated splice sites versus random sequences selected around annotated splice sites of the GENCODE 24 genome annotation (Cheng et al., 2019). Both modules were able to distinguish annotated splice sites with high accuracy on the validation dataset (auROC = 0.98 for both donor and acceptor modules, Figure 2a,b).

We evaluated our exon modules and intron modules on predicting Ψ₅ and Ψ₃ measured by the MPRA experiment (Cheng et al., 2019; Soemedi et al., 2017). Our 3′ exon module and 5′ intron module predicted Ψ₃ for the A5SS library with a correlation of 0.77 and 0.31, respectively (Figure 2c,d). Note that all the predictions were made with a single module, ignoring all other information. This approach is therefore not comparable to the approach of Rosenberg et al. (2015), which used complete sequence information.

3.2 Vex-seq data does not support additive variant effects on the natural scale

The Vex-seq challenge asked participants to predict ∆Ψ. However, Ψ is bounded to [0,1]. This constrains the predictions. For instance, for a reference Ψ close to 1, ∆Ψ cannot be largely positive. The CAGI 5 organizers therefore also provided the reference Ψ level. MMSplice models additive effects in the log odds scale (∆logit(Ψ), Equation 2; Section 2). Application of the logistic function ensures that the predictions of the alternative Ψ are bounded to the [0,1] interval. An alternative approach would have been to model additive effects in the natural scale and to cap all predictions to the [0,1] interval.

We investigated whether the Vex-seq data would support the additive natural scale model. To this end, we looked first at all Vex-seq data for which (a) the reference Ψ level was lower than 0.5 and (b) ∆Ψ was positive. If the effects of variants were additive in the natural scale and independent of the reference Ψ level, then we would expect larger deviations for the constructs with Ψ_ref close to 0 as they can increase by as much as 1, compared with constructs with Ψ_ref close to 0.5, which are bounded to increase by not more than 0.5. In fact, we observed the opposite trend as ∆Ψ values for variants with Ψ_ref close to 0 were significantly smaller (P = 2.2e−08, Figure 3a). The same was also observed for the Vex-seq data for which (a) the reference Ψ level was larger than 0.5 and (b) ∆Ψ was negative (Figure 3b). Hence, the effects of variants appeared to be larger for Ψ_ref close to 0.5 than for Ψ_ref close to 0 or 1. This observation further motivated modeling Ψ as a result of the logistic function, which has the smallest gradient around 0 and 1 and the largest gradient at 0.5.
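
The gradient argument can be made explicit with a short derivation. It assumes Ψ = σ(z) with z = logit(Ψ), as in the log odds model; the first-order approximation in the last line is an illustration and not part of the original analysis.

```latex
\begin{align}
\Psi &= \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \mathrm{logit}(\Psi) \\
\frac{d\Psi}{dz} &= \sigma(z)\,\bigl(1 - \sigma(z)\bigr) = \Psi\,(1 - \Psi) \\
\Delta\Psi &\approx \Psi_{\mathrm{ref}}\,(1 - \Psi_{\mathrm{ref}})\;\Delta\mathrm{logit}(\Psi)
\end{align}
```

The factor Ψ_ref(1 − Ψ_ref) peaks at 0.25 for Ψ_ref = 0.5 and vanishes as Ψ_ref approaches 0 or 1, so the same shift on the log odds scale produces the largest ∆Ψ at intermediate inclusion levels.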

3.3 Vex-seq challenge: Predicting variant effect on exon skipping level

We trained a linear model from the modular predictions to predict ∆Ψ from the 957 training variants provided by the Vex-seq challenge. As the Vex-seq variants originated from both introns and exons, five potentially overlapping modules were used: 3′ intron, acceptor, 5′ exon, donor, and 5′ intron. We used the 5′ exon module instead of the 3′ exon module because it performed better on the Vex-seq training data (McLaren et al., 2016). In total, nine parameters were trained (Section 2).

On the Vex-seq training data, MMSplice was able to score all variants including indels. When separating the variants into 3′ intron, exon and 5′ intron, MMSplice had a good performance in all three regions (3′ intron: R = 0.78, RMSE = 0.09; Exon: R = 0.61, RMSE = 0.11; 5′ intron: R = 0.72, RMSE = 0.11; Figure 4; Table S1).

On the unseen test data, MMSplice performed similarly to the training data (R = 0.68, RMSE = 0.1; Cheng et al., 2019), indicating that we did not overfit the training data. Moreover, we outperformed the state-of-the-art methods SPANR (R = 0.26, RMSE = 0.14; Xiong et al., 2015) and HAL (R = 0.44, RMSE = 0.28; Rosenberg et al., 2015), indicating that the modular approach was effective (Cheng et al., 2019). This model ranked first on the Vex-seq challenge.

3.4 MaPSy challenge

Encouraged by the results on Vex-seq data, we trained linear models similarly on the MaPSy challenge training data for in vivo and in vitro separately with a log-allelic ratio as the response variable (Section 2). We first focused on training MMSplice for predicting the log-allelic ratio (splicing efficiency change, Section 2). On the training data, MMSplice accurately predicted variant effects on splicing efficiency both in vivo (R = 0.59, RMSE = 1.02) and in vitro (R = 0.56, RMSE = 0.04; Figure 5a,b; Table S2). On the unseen test data, MMSplice was still accurate (Cheng et al., 2019). Our log-allelic ratio prediction was the most accurate one in the MaPSy challenge.

We then trained a classifier to classify ESMs (Section 2). On the training data, the classifier had auPR 0.3 (Figure 5c). On the unseen test variants, the classifier had auPR 0.19 (Figure 5c; Table S2).

3.5 Variant interpretation

To support the interpretation of the predictions made by MMSplice, we followed the in silico mutagenesis approach. In silico mutagenesis computes predictions for every possible SNV of a given input sequence and displays the predictions in a heat map called a mutation map. The mutation map allows assessing the relative importance of a variant compared with other possible variants in its vicinity. The MMSplice implementation followed the Kipoi API (version 0.65), a programmatic standard for predictive models in genomics (Avsec et al., 2019). In particular, it is compatible with the Kipoi variant effect prediction plugin, which allows the generation of mutation maps. As an illustrative example, we considered the variant rs746677712 (Figure 6a). This variant lies 5 nucleotides inside the intron near the donor site of exon 5 of the gene FCGR2B. MMSplice predicts this variant to increase the skipping of this exon compared with the reference sequence, with an odds ratio of 0.14 (log odds ratio = −1.99). The mutation map also shows that, for the considered sequence, only SNVs on the canonical 5′ dinucleotide GT or the last two bases of the exon AG can lead to effects on exon skipping of similar amplitude (Figure 6a). Similarly, the variant rs773534127 close to the acceptor of exon 5 was predicted to strongly decrease the exon inclusion level, with an odds ratio of 0.20 (log odds ratio = −1.59), nearly as strong as the predicted effect of SNVs on the canonical dinucleotide AG (Figure 6b). Mutation maps are also useful to identify possible splicing regulatory elements as consecutive nucleotides that are predicted to have a strong impact on splicing when mutated. One illustrative example is provided with the mutation map around the variant rs751723286 (Figure 6c). This SNV affects the motif TAGGG, which is the binding site of Heterogeneous Nuclear Ribonucleoprotein A1 (HNRNPA1), an important splicing regulatory RNA binding protein (Burd & Dreyfuss, 1994). The mutation map shows that every mutation on this motif is predicted to increase the exon inclusion level, consistent with the repressive role of HNRNPA1 on exon inclusion (Mayeda & Krainer, 1992).
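
The in silico mutagenesis procedure can be sketched independently of any particular model. In the example below, the scorer is a hypothetical toy that treats each TAGGG occurrence as a silencer lowering the score by one; it stands in for a real predictor and is not the MMSplice or Kipoi implementation.

```python
BASES = "ACGT"


def toy_score(seq: str) -> float:
    """Hypothetical scorer: each TAGGG motif lowers the inclusion score by 1."""
    return -float(seq.count("TAGGG"))


def mutation_map(seq: str, scorer=toy_score) -> dict:
    """Score every possible SNV of seq relative to the reference sequence.

    Returns a dict mapping each base to a list with one entry per position:
    the score difference (alternative minus reference) when that position
    is mutated to that base. Entries for the reference base are 0.0.
    """
    ref_score = scorer(seq)
    rows = {base: [] for base in BASES}
    for pos, ref_base in enumerate(seq):
        for base in BASES:
            if base == ref_base:
                rows[base].append(0.0)
            else:
                alt_seq = seq[:pos] + base + seq[pos + 1:]
                rows[base].append(scorer(alt_seq) - ref_score)
    return rows


# The motif occupies positions 2 to 6 (0-based) of this toy sequence.
sequence = "CCTAGGGCC"
scores = mutation_map(sequence)
```

With this toy scorer, every substitution inside the motif yields a positive difference and every substitution outside it yields zero, which is the kind of contiguous block that marks a candidate regulatory element in a mutation map.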

4 CONCLUSION

We have participated in two CAGI 5 splicing prediction challenges, Vex-seq and MaPSy, with a single modeling framework, MMSplice, which ranked among the best on both challenges. The reasons for the success of MMSplice are multiple. First, we have trained the model mostly on richer complementary functional genomics data with about three orders of magnitude more data points than the CAGI challenge data. We have used the CAGI data only to fit very few parameters for each model. Second, we have worked on the log odds scale rather than on the natural scale. This was justified not only by mathematical convenience but also by the Vex-seq data, which showed that the largest variant effects were found for intermediate levels of splicing. Third, we made use of an existing high-throughput perturbation assay (Rosenberg et al., 2015) to fit the model. Because the number of publications with MPRAs keeps increasing, such datasets will play a major role in building predictive models for genomics in the future as they allow capturing causal effects. Fourth, we have used a modular approach so that we could reuse elements of the model for one challenge in the other challenge.

Depending on the assay and the genomic region of the variants, the correlation between MMSplice predicted changes and the measured changes varies between 0.56 (in vitro MaPSy) and 0.78 (3′ intronic variants for Vex-seq). The effects of many assayed variants are small and therefore cannot be precisely predicted because their estimates are likely dominated by noise. Moreover, MMSplice might be improved by overcoming the following limitations. First, Ψ is also affected by the stabilities (half-lives) of different isoforms. Second, the effect of certain splicing motifs depends on their position with respect to splice sites (Erkelenz et al., 2012), which we did not model. Third, our exon and intron modules were learned from alternative 5′ and 3′ splicing events instead of exon skipping directly. Hence, they may not fully capture the biology of exon skipping. Fourth, as splicing is tissue-specific, a model that is specific for the target tissue or cell type might perform better.

Our current model will likely, and hopefully, be soon superseded by future models integrating more data. However, we hope that some of the principles identified here will be useful. In particular, we believe that if models would adopt a modular structure and satisfy some reasonable degree of compatibility, the community could more efficiently leverage models from each other. We provide MMSplice and all the individual modules in the model repository Kipoi, which could be helpful to this end.

ACKNOWLEDGEMENT

The CAGI experiment coordination is supported by NIH U41 HG007346 and the CAGI conference by NIH R13 HG006650. J.C. was supported by the Competence Network for Technical, Scientific High-Performance Computing in Bavaria KONWIHR. Z.A. and J.C. were supported by a Deutsche Forschungsgemeinschaft fellowship through the Graduate School of Quantitative Biosciences Munich.

CONFLICT OF INTERESTS

The authors declare that there are no conflicts of interest.

Supporting Information

REFERENCES

Adamson, S. I., Zhan, L., & Graveley, B. R. (2018). Vex-seq: High-throughput identification of the impact of genetic variation on pre-mRNA splicing efficiency. Genome Biology, 19(1), 1–12. https://doi.org/10.1186/s13059-018-1437-x
10.1186/s13059-018-1437-x
PubMed Web of Science® Google Scholar
Alberts, B., Lewis, J., Roberta, K., Johnson, A., Raff, M., & Walter, P. (2008). Molecular Biology of the Cell, Molecular Biology of the Cell ( 6.). New York: Garland Science.
Google Scholar
Avsec, Z., Kreuzhuber, R., Israeli, J., Xu, N., Cheng, J., Shrikumar, A. … Gagneur, J. (2019). The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nature Biotechnology, 1–31.
PubMed Web of Science® Google Scholar
Baralle, F. E., & Giudice, J. (2017). Alternative splicing as a regulator of development and tissue identity. Nature Reviews Molecular Cell Biology, 18(7), 437–451. https://doi.org/10.1038/nrm.2017.27
10.1038/nrm.2017.27
CAS PubMed Web of Science® Google Scholar
Barash, Y., Calarco, J. A., Gao, W., Pan, Q., Wang, X., Shai, O., … Frey, B. J. (2010). Deciphering the splicing code. Nature, 465(7294), 53–59. https://doi.org/10.1038/nature09000
10.1038/nature09000
CAS PubMed Web of Science® Google Scholar
Braberg, H., Jin, H., Moehle, E. A., Chan, Y. A., Wang, S., Shales, M., … Krogan, N. J. (2013). From structure to systems: High-resolution, quantitative genetic analysis of RNA polymerase II. Cell, 154(4), 775–788. https://doi.org/10.1016/j.cell.2013.07.033
10.1016/j.cell.2013.07.033
CAS PubMed Web of Science® Google Scholar
Bretschneider, H., Gandhi, S., Deshwar, A. G., Zuberi, K., & Frey, B. J. (2018). COSSMO: Predicting competitive alternative splice site selection using deep learning. Bioinformatics, 34(13), i429–i437. https://doi.org/10.1093/bioinformatics/bty244
10.1093/bioinformatics/bty244
CAS PubMed Web of Science® Google Scholar
Burd, C. G., & Dreyfuss, G. (1994). RNA binding specificity of hnRNP A1: Significance of hnRNP A1 high-affinity binding sites in pre-mRNA splicing. The EMBO Journal, 13(5), 1197–1204.
10.1002/j.1460-2075.1994.tb06369.x
CAS PubMed Web of Science® Google Scholar
Cheng, J., Nguyen, T. Y. D., Cygan, K. J., Çelik, M. H., Fairbrother, W. G., Avsec, Ž., & Gagneur, J. (2019). MMSplice: Modular modeling improves the predictions of genetic variant effects on splicing. Genome Biology, 20(1), 48. https://doi.org/10.1186/s13059-019-1653-z
10.1186/s13059-019-1653-z
PubMed Web of Science® Google Scholar
Consortium, T. G., Ardlie, K., Deluca, D. S., Segre, A. V., Sullivan, T. J., Young, T. R., … Lockhart, N. C. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235), 648–660. https://doi.org/10.1126/science.1262110
10.1126/science.1262110
CAS PubMed Web of Science® Google Scholar
Erkelenz, S., Mueller, W. F., Evans, M. S., Busch, A., Schöneweis, K., Hertel, K. J., & Schaal, H. (2012). Position-dependent splicing activation and repression by SR and hnRNP proteins rely on common mechanisms. RNA, 19(1), 96–102. https://doi.org/10.1261/rna.037044.112
10.1261/rna.037044.112
PubMed Web of Science® Google Scholar
Fairbrother, W. G. (2002). Predictive Identification of exonic splicing enhancers in human genes. Science, 297(5583), 1007–1013. https://doi.org/10.1126/science.1073774
10.1126/science.1073774
CAS PubMed Web of Science® Google Scholar
Fairbrother, W. G., Yeo, G. W., Yeh, R., Goldstein, P., Mawson, M., Sharp, P. A., & Burge, C. B. (2004). RESCUE-ESE identifies candidate exonic splicing enhancers in vertebrate exons. Nucleic Acids Research, 32, W187–W190. https://doi.org/10.1093/nar/gkh393
Goldstein, L. D., Cao, Y., Pau, G., Lawrence, M., Wu, T. D., Seshagiri, S., & Gentleman, R. (2016). Prediction and quantification of splice events from RNA-seq data. PLoS One, 11(5), 1–18. https://doi.org/10.1371/journal.pone.0156132
Ke, S., Anquetil, V., Zamalloa, J. R., Maity, A., Yang, A., Arias, M. A., … Chasin, L. A. (2018). Saturation mutagenesis reveals manifold determinants of exon definition. Genome Research, 28(1), 11–24. https://doi.org/10.1101/gr.219683.116
Li, Y. I., van de Geijn, B., Raj, A., Knowles, D. A., Petti, A. A., Golan, D., & Pritchard, J. K. (2016). RNA splicing is a primary link between genetic variation and disease. Science, 352(6285), 600–604. https://doi.org/10.1126/science.aad9417
López-Bigas, N., Audit, B., Ouzounis, C., Parra, G., & Guigó, R. (2005). Are splicing mutations the most frequent cause of hereditary disease? FEBS Letters, 579(9), 1900–1903. https://doi.org/10.1016/j.febslet.2005.02.047
Mayeda, A., & Krainer, A. R. (1992). Regulation of alternative pre-mRNA splicing by hnRNP A1 and splicing factor SF2. Cell, 68(2), 365–375. https://doi.org/10.1016/0092-8674(92)90477-T
McLaren, W., Gil, L., Hunt, S. E., Riat, H. S., Ritchie, G. R. S., Thormann, A., … Cunningham, F. (2016). The Ensembl variant effect predictor. Genome Biology, 17(1). https://doi.org/10.1186/s13059-016-0974-4
Paggi, J. M., & Bejerano, G. (2018). A sequence-based, deep learning model accurately predicts RNA splicing branchpoints. RNA, 24(12), 1647–1658. https://doi.org/10.1261/rna.066290.118
Rosenberg, A. B., Patwardhan, R. P., Shendure, J., & Seelig, G. (2015). Learning the sequence determinants of alternative splicing from millions of random sequences. Cell, 163(3), 698–711. https://doi.org/10.1016/j.cell.2015.09.054
Siepel, A., Bejerano, G., Pedersen, J. S., Hinrichs, A. S., Hou, M., Rosenbloom, K., … Haussler, D. (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15, 1034–1050. https://doi.org/10.1101/gr.3715005
Soemedi, R., Cygan, K. J., Rhine, C. L., Wang, J., Bulacan, C., Yang, J., … Fairbrother, W. G. (2017). Pathogenic variants that alter protein code often disrupt splicing. Nature Genetics, 49(6), 848–855. https://doi.org/10.1038/ng.3837
Wang, Y., Liu, J., Huang, B. O., Xu, Y.-M., Li, J., Huang, L.-F., … Wang, X.-Z. (2015). Mechanism of alternative splicing and its regulation. Biomedical Reports, 3(2), 152–158. https://doi.org/10.3892/br.2014.407
Wang, Z., Xiao, X., Van Nostrand, E., & Burge, C. B. (2006). General and specific functions of exonic splicing silencers in splicing control. Molecular Cell, 23(1), 61–70. https://doi.org/10.1016/j.molcel.2006.05.018
Wilhelm, B. T., Marguerat, S., Watt, S., Schubert, F., Wood, V., Goodhead, I., … Bähler, J. (2008). Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature, 453(7199), 1239–1243. https://doi.org/10.1038/nature07002
Xiong, H. Y., Alipanahi, B., Lee, L. J., Bretschneider, H., Merico, D., Yuen, R. K. C., … Frey, B. J. (2015). The human splicing code reveals new insights into the genetic determinants of disease. Science, 347(6218), 1254806. https://doi.org/10.1126/science.1254806
Yeo, G., & Burge, C. B. (2004). Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. Journal of Computational Biology, 11(2–3), 377–394. https://doi.org/10.1089/1066527041410418
Zhang, X. H. F., & Chasin, L. A. (2004). Computational definition of sequence motifs governing constitutive exon splicing. Genes and Development, 18(11), 1241–1250. https://doi.org/10.1101/gad.1195304
Zhang, X., & Kangsamaksin, T. (2005). Exon inclusion is dependent on predictable exonic splicing enhancers. Molecular and Cellular Biology, 25(16), 7323–7332. https://doi.org/10.1128/mcb.25.16.7323-7332.2005

Filename	Description
humu23788-sup-0001-Supplementary_Table_S1.xlsx (31 KB)	Supplementary information
humu23788-sup-0002-Supplementary_Table_S2.xlsx (379 KB)	Supplementary information