Volume 37, Issue 6 pp. 579-597
Review

Variation Interpretation Predictors: Principles, Types, Performance, and Choice

Abhishek Niroula

Department of Experimental Medical Science, Lund University, BMC B13, Lund, SE-22184 Sweden

Mauno Vihinen

Corresponding Author

Department of Experimental Medical Science, Lund University, BMC B13, Lund, SE-22184 Sweden

Correspondence to: Mauno Vihinen, Protein Structure and Bioinformatics Group, Department of Experimental Medical Science, BMC B13, Lund University, SE-22184 Lund, Sweden. E-mail: [email protected]
First published: 14 March 2016

Contract grant sponsors: Lund University; Barncancerfonden; Vetenskapsrådet.

For the 25th Anniversary Commemorative Issue

ABSTRACT

Next-generation sequencing methods have revolutionized the speed of generating variation information. Sequence data have a plethora of applications and will increasingly be used for disease diagnosis. Interpretation of the identified variants is usually not possible with experimental methods. This has caused a bottleneck that many computational methods aim at addressing. Fast and efficient methods for explaining the significance and mechanisms of detected variants are required for efficient precision/personalized medicine. Computational prediction methods have been developed in three areas to address the issue. There are generic tolerance (pathogenicity) predictors for filtering harmful variants. Gene/protein/disease-specific tools are available for some applications. Mechanism- and effect-specific computer programs aim at explaining the consequences of variations. Here, we discuss the different types of predictors and their applications. We review available variation databases and prediction methods useful for variation interpretation. We discuss how the performance of methods is assessed and summarize existing assessment studies. A brief introduction is provided to the principles of the methods developed for variation interpretation, as well as guidelines for how to choose the optimal tools and where the field is heading in the future.

Introduction

Next-generation sequencing (NGS) methods produce a wealth of detailed information, including large variation datasets. Every human genome contains about 3 million single nucleotide substitutions in comparison with a reference genome, but only a few of those are disease-related. In addition, there are other more complex variations, fewer in number but challenging to interpret. The 1000 Genomes project estimated that each genome codes for about 11,000 amino acid substitutions, approximately 12,500 synonymous substitutions, and hundreds of insertions and deletions [Abecasis et al., 2010].

DNA sequencing has been automated very effectively and thus is fast and inexpensive, especially in the case of resequencing the human genome. The $1,000 genome sequence is already a reality. Sequencing is widely used for diagnostic purposes, but is often done for individual genes. As the price of exome and complete genome sequencing is falling below the cost of sequencing one or a few specific genes, there is increased interest in using genome sequencing in health care.

The fall in sequencing costs has exposed major deficits in our understanding of the pathogenic significance of most variants. This reduces clinical utility and greatly limits the potential for disease treatment and prevention. The problematic lack of knowledge arises because large numbers of the identified variations are novel or without proper annotation, and knowledge about disease association is missing. Data analysis and variation interpretation are the most time-consuming steps in sequencing projects. Whole-genome or whole-exome sequencing will provide valuable new information about an individual only if this information can be utilized and interpreted in a reliable and meaningful fashion. The genome-wide interpretation of variants is only feasible with computational prediction methods. The bottleneck in the utilization of sequence information has shifted from obtaining the sequences to understanding and interpreting them.

Several groups are working on developing guidelines for how to utilize sequencing information for clinical purposes. Recommendations by both the European Society of Human Genetics (ESHG) and the American College of Medical Genetics (ACMG) include the use of prediction methods [Matthijs et al., 2015; Richards et al., 2015]. For certain diseases, sequencing is the only method for differential diagnosis. Diagnosis and prognosis in many diseases are highly dependent on sequence information. Numerous computational tools have been developed for the analysis, prioritization, and interpretation of variations and their effects. End users of these tools are often puzzled about which method to choose, which ones are the most reliable, and which are the most suitable for a given application. This paper aims at describing what types of predictions are currently possible, how the tools can be compared and their performance assessed, as well as the principles one needs to understand in order to evaluate the methods. Finally, tips are provided for how to choose the most suitable computational methods and what we can expect in the future. Although computational approaches are very helpful and vital for variation interpretation, one has to bear in mind that these methods alone cannot define whether a variant is pathogenic or not; other types of information (such as patient clinical information, family history, etc.) are also needed for making such decisions. The type(s) of additional information varies for different diseases.

Several organizations are working toward providing and promoting reliable variation information. The Human Genome Variation Society (HGVS, http://www.hgvs.org/) has been the scientific society for the variation database and interpretation community, variation nomenclature being one of their major achievements [den Dunnen and Antonarakis, 2001]. The Human Variome Project (HVP, http://www.humanvariomeproject.org/) “is working to ensure that all information on genetic variation and its effect on human health can be collected, curated, interpreted, and shared freely and openly.” Recently, the Global Alliance for Genomics and Health (GA4GH, http://genomicsandhealth.org/) was founded to enable sharing of genomic and clinical data.

Variations can have their effects on many levels. The most studied are amino acid substitutions; however, other types of variations, such as insertions and deletions, are also important. Amino acid substitutions can have numerous effects and the mechanisms behind them are diverse [Steward et al., 2003; Thusberg and Vihinen, 2009; Vihinen, 2015c]. They may disrupt or block critical sites for protein function, such as catalytic residues or ligand-binding pockets. They can lead to alterations in the protein structural properties, causing, for example, abnormal folding, structural instability, or aggregation of the protein. Even small changes in the size or chemical or physical nature of an amino acid side chain can be critical. Amino acid substitutions may affect protein posttranslational modifications by generating or deleting such sites, for example, for phosphorylation, glycosylation, or protease cleavage, or by altering protein localization targeting signals. DNA and RNA variations often affect important functional sites, such as binding sites for transcription factors or microRNAs, or, in the case of RNA, the structure of the molecule. For a detailed systematic organization of the variation types and effects, see the Variation Ontology (VariO) [Vihinen, 2014b, 2015c].

Most studies concentrate on loss-of-function variations; however, there are also gain-of-function alterations due to irregular or altered binding of ligands or loss of specificity. Many proteins are robust and can tolerate alterations and insertions at numerous sites without any, or with only a minor, effect on protein function [Poussu et al., 2004]; on the other hand, there are sites that do not tolerate any variations. Recently, some computational studies have been published in which all possible amino acid substitutions or those caused by single-nucleotide changes were analyzed. These include kinase domain variants in Bruton tyrosine kinase (BTK) [Väliaho et al., 2015], variants in four mismatch repair (MMR) proteins [Niroula and Vihinen, 2015a], and predicted protein solubility affecting variations in human interleukin-1 β [Yang et al., 2016]. In addition, similar studies have been performed for mitochondrial tRNA molecules [Kondrashov, 2005; Niroula and Vihinen, 2016]. These studies predicted the effects of all possible amino acid/nucleotide substitutions in the proteins/RNAs and provided important information about the types of variations and their distribution along the molecule sequence and structure, highlighting both structural and functional aspects. The frequency of predicted harmful variants varies greatly. Depending on the location and the type of variations, the outcomes are different. The BTK kinase domain has the highest published ratio of harmful variants, with 67% of the studied cases [Väliaho et al., 2015]. In MMR proteins, the proportions of harmful amino acid substitutions varied from 14.6% (in PMS2) to 40.4% (in MSH2) [Niroula and Vihinen, 2015a]. There are only a few experimental studies of massive numbers of amino acid substitutions in single proteins. The ratios of harmful variations vary widely among them; see Väliaho et al. [2015].

Several reviews have been published about prediction methods and their use [Thusberg and Vihinen, 2009; Cline and Karchin, 2011; Capriotti et al., 2012; Peterson et al., 2013], but none of them have covered the full scope of the discussion as done here.

Variation Databases

When investigating the effects of variations, one has to find out whether the identified alteration has been previously found in patients or in healthy individuals. If a variant is frequent in healthy individuals in populations, possibly in several ethnic groups, it is likely not harmful, although there are some exceptions. Central variation databases, such as the Human Gene Mutation Database (HGMD) [Stenson et al., 2014], Online Mendelian Inheritance in Man (OMIM) [Hamosh et al., 2005], and ClinVar [Landrum et al., 2014], include variants for many human genes (Table 1). The UniProtKB/SwissProt [Uniprot Consortium, 2015] database contains manually annotated protein entries with some variants [Yip et al., 2008] available at the SwissVar service (http://swissvar.expasy.org/). For a comprehensive review of the current state and future challenges of genotype–phenotype databases, see Brookes and Robinson [2015].

Table 1. Representative list of variation databases
Database URL Reference
General variation databases
ClinVar http://www.ncbi.nlm.nih.gov/clinvar/ Landrum et al. [2014]
Database of immunodeficiency-causing variations (IDbases) http://structure.bmc.lu.se/idbase/ Piirilä et al. [2006]
Database of short genetic variations (dbSNP) http://www.ncbi.nlm.nih.gov/SNP/ Sherry et al. [2001]
Ensembl Variation Database http://www.ensembl.org/info/genome/variation/index.html
Exome Aggregation Consortium (ExAC) http://exac.broadinstitute.org/
Exome Variant Server (EVS) http://evs.gs.washington.edu/EVS/ Fu et al. [2013]
Human Gene Mutation Database (HGMD) http://www.hgmd.cf.ac.uk/ac/index.php Stenson et al. [2014]
Leiden Open Variation Databases (LOVD) http://www.lovd.nl/3.0/home Fokkema et al. [2011]
MITOMAP http://www.mitomap.org/MITOMAP Lott et al. [2013]
Online Mendelian Inheritance in Man (OMIM) http://www.ncbi.nlm.nih.gov/omim Hamosh et al. [2005]
SwissVar http://swissvar.expasy.org/ Yip et al. [2008]
Universal Mutation Database (UMD) http://www.umd.be/ Beroud et al. [2000]
Databases of genomic structural variations
dbVar http://www.ncbi.nlm.nih.gov/dbvar/ Lappalainen et al. [2013]
Database of Genomic Variants (DGV) http://dgv.tcag.ca/dgv/app/home MacDonald et al. [2014]
Database of Genomic Variants archive (DGVa) http://www.ebi.ac.uk/dgva Lappalainen et al. [2013]
Human Polymorphic Inversion Database (InvFEST) http://invfestdb.uab.cat/ Martinez-Fundichely et al. [2014]
Databases of cancer variations
Catalogue of Somatic Mutations in Cancer (COSMIC) http://cancer.sanger.ac.uk/cosmic Forbes et al. [2011]
Database of Curated Mutations (DoCM) http://docm.genome.wustl.edu/
Kin-Driver http://kin-driver.leloir.org.ar/ Simonetti et al. [2014]
Databases of benchmark datasets
VariBench http://structure.bmc.lu.se/VariBench/ Nair and Vihinen [2013]
VariSNP http://structure.bmc.lu.se/VariSNP/ Schaafsma and Vihinen [2015]
Specific databases
Frequency INherited Disorders database (FINDbase) http://www.findbase.org/ Papadopoulos et al. [2014]
Thermodynamic Database for Proteins and Mutants (ProTherm) http://www.abren.net/protherm/ Kumar et al. [2006]

Locus-specific variation databases (LSDBs) list variants in specific genes/diseases and are typically manually annotated. Many LSDBs also contain other information, sometimes even very detailed clinical data. LSDBs exist in the Leiden Open Variation Database (LOVD) system for all human genes [Fokkema et al., 2011]; many of them are still empty and without curators. One of the problems is that there can be more than one database for some genes and diseases, and these can contain overlapping data [Mitropoulou et al., 2010]. LSDBs are usually the primary and most trusted variation information sources as they are curated and maintained by experts in the genes and diseases. The currently existing LSDBs are listed at the HGVS Website (http://www.hgvs.org/locus-specific-mutation-databases), the LOVD site (http://grenada.lumc.nl/LSDB_list/lsdbs), the GEN2PHEN server (http://www.gen2phen.org/data/lsdbs) [Thorisson et al., 2009], and the Web Analysis of the Variome (http://bioinformatics.ua.pt/WAVe/) [Lopes et al., 2011]. Other large collections of LSDBs include IDbases [Piirilä et al., 2006] and those maintained on the Universal Mutation Database (UMD) platform [Beroud et al., 2000].

Several guidelines and recommendations have been published for LSDBs, including how to establish a database [Vihinen et al., 2012], their curation [Celli et al., 2012], overall contents [Kohonen-Corish et al., 2010], ethics [Cotton et al., 2005; Povey et al., 2010], data collection [Cotton et al., 2007; Cotton et al., 2009], somatic variations [Olivier et al., 2009], interpretation and reporting of variants [Plon et al., 2008; Richards et al., 2015], data sharing [den Dunnen et al., 2009], and nomenclature [den Dunnen and Antonarakis, 2001; Taschner and den Dunnen, 2011]. Besides variant descriptions, LSDBs can also contain clinically relevant information, including where and by whom the variant was found (contact information, publication details), how it was detected (screening methodology), and whether it influences the function of the gene or its product. Some LSDBs contain details of the phenotype of the patient.

The dbSNP database [Sherry et al., 2001] is a very large database for short variants and contains data from The 1000 Genomes Project (http://www.1000genomes.org/). Note that dbSNP contains both disease-causing and benign variants. The NHLBI Exome Sequencing Project (ESP) Exome Variant Server (EVS, http://evs.gs.washington.edu/EVS/) provides exome data from 200,000 individuals from African-American and European-American populations [Fu et al., 2013]. Individuals with specific traits related to blood, heart diseases, and lung diseases as well as controls are included in the EVS project. The Exome Aggregation Consortium (ExAC, http://exac.broadinstitute.org) has data from 60,706 unrelated individuals and the Allele Frequency Community (AFC, http://www.allelefrequencycommunity.org/) currently contains data for about 100,000 exomes/genomes. Some databases annotate human variation data with phenotype variations and protein structural and functional information such as KMDB/MutationView (http://mutview.dmb.med.keio.ac.jp/MutationView/jsp/index.jsp). The University of California, Santa Cruz (UCSC) Genome Browser [Kent et al., 2002], the National Center for Biotechnology Information (NCBI) Map Viewer [Wheeler et al., 2003], the Ensembl Genome Browser [Stalker et al., 2004], and others provide information about genes, their products, and sequence variants. The Encyclopedia of DNA Elements (ENCODE) project [Sloan et al., 2016] aims at identifying and mapping functional and regulatory elements in the human genome also outside protein coding regions. PhenCode [Giardine et al., 2007] connects human phenotype and clinical data in LSDBs with genomic data from the ENCODE project and other resources in the UCSC Genome Browser.

Additional types of variation databases include national and ethnic databases [Patrinos, 2006] such as ETHNOS [van Baal et al., 2010], variation frequency databases including FINDbase [Papadopoulos et al., 2014], chromosomal structural variation databases such as The Database of Genomic Variants (DGV) [MacDonald et al., 2014], Database of Genomic Variants archive (DGVa, http://www.ebi.ac.uk/dgva), dbVar (http://www.ncbi.nlm.nih.gov/dbvar) [Lappalainen et al., 2013], InvFEST database of inversions [Martinez-Fundichely et al., 2014], Mitelman Database of chromosome aberrations and gene fusions (http://cgap.nci.nih.gov/Chromosomes/Mitelman), and European Cytogeneticists Association Register of Unbalanced Chromosome Aberrations (ECARUCA) database containing cytogenetic and clinical information of rare chromosomal disorders including microdeletions and microduplications [Feenstra et al., 2006] (Table 1). Databases dedicated to certain types of variations or to an effect or mechanism are available, for example, the ProTherm database for protein stability-affecting variations [Kumar et al., 2006], and The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) and the Catalogue of Somatic Mutations in Cancer (COSMIC, http://cancer.sanger.ac.uk/cosmic) [Forbes et al., 2011] databases for genetic variations in cancer (Table 1).

From Cafe Variome [Lancaster et al., 2015], one can find out whether a variant has been previously published, but without revealing further information. For more information about the variant, one needs to contact the relevant data owner/provider. One additional type of database is highly relevant for variation effect predictions, namely, benchmark databases. VariBench [Nair and Vihinen, 2013] and VariSNP [Schaafsma and Vihinen, 2015] collect benchmark variation datasets from different sources and distribute datasets that provide gold standards for method developers and assessors.

It is recommended to use systematics in variation databases whenever possible, including those for phenotype such as the Human Phenotype Ontology (HPO) [Robinson et al., 2008] and the Variation Ontology (VariO) [Vihinen, 2014b] for a systematic description of effects, mechanisms, and consequences of variants. The usage of these systematic approaches facilitates efficient searches, data integration, as well as the development of computational systems for data analysis and interpretation. Several de facto standards are useful for variation databases. These include systematic gene names available from the HUGO Gene Nomenclature Committee (HGNC) [Gray et al., 2015]. Reference sequences have to be specified including version numbers, unless Locus Reference Genomic (LRG) [Dalgleish et al., 2010] entries are used. LRGs are recommended for human sequences as they are stable and allow unambiguous mapping of positions. For naming variations, the HGVS nomenclature [den Dunnen and Antonarakis, 2001] should be used. It is widely used in the literature and also by several computer programs including those interpreting variations. The Mutalyzer tool can generate systematic names automatically and performs a number of consistency checks [Wildeman et al., 2008].
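
To illustrate the recommended nomenclature, a few generic, hypothetical examples of HGVS-style descriptions are shown below; they are not tied to any particular gene or reference sequence and follow the general patterns of the nomenclature.

```
c.76A>T              substitution at position 76 of the coding DNA reference sequence
c.76_78del           deletion of nucleotides 76-78
c.76dup              duplication of nucleotide 76
p.(Trp26Cys)         predicted amino acid substitution at the protein level
p.(Arg97ProfsTer23)  predicted frameshift starting at residue 97
```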

The HVP has released quality assessment criteria to evaluate the quality of genetic variation databases and to stimulate database curators to make improvements where they are needed (http://www.humanvariomeproject.org/finish/19/255.html) [Vihinen et al., 2016]. Once the quality scheme is implemented, the quality information will allow users to check how reliable, up-to-date, and user-friendly a certain database is.

Principles

It is important for users of any method, whether performed in the laboratory or run on a computer, to know how it works. Only then can one evaluate the significance of the observed findings. Prediction methods can be rather complex and complicated, especially when considering the details; however, the general principles of these methods can be understood by any scientist. Bioinformaticians seek to find the best solutions for their methods. This means that novel approaches are implemented to improve the performance of the methods and new methods are released frequently.

Machine Learning

A majority of the recent methods in the variation interpretation field are based on machine learning (ML). ML is a form of artificial intelligence where a computer system can learn from given data (Fig. 1). Among the different types of ML techniques, supervised ML classification is widely used for variation interpretation. In supervised ML classification, a computer model is trained to distinguish between two or more classes using given examples. The most common ML algorithms include neural networks, support vector machines, and random forests. The ML methods are highly dependent on the quality of data used for their training. Typically, the methods have been binary, that is, providing two classes of predictions. Some of the methods have more than two classes as variation pathogenicity is far from the binary, black and white, pathogenic or not dichotomy. Binary classifiers can still be useful for many applications. Some methods do not categorize the variations into classes but provide continuous scores.

Figure 1. General framework for developing a machine learning-based method. A blind test dataset should be separated before performing feature selection and algorithm parameter optimization. The splitting should be done so that all variations in the same protein, or preferably protein family, are kept in either the training or the test dataset.

ML methods are widely used to tackle complex phenomena, which would be otherwise difficult to handle. For a more detailed discussion, see Vihinen [2012], for example. Variations can have large numbers of effects that would be difficult to take into account in other types of predictors. An important aspect of ML method development is the choice of the approach. Among the numerous approaches, there is not one superior architecture. The quality of a predictor depends on how the training is done, which features are used to explain the phenomenon, and how the method is optimized.

Features are used to describe the characteristics of the investigated phenomenon. If several features are available, it is important to choose those that best capture the phenomenon (Fig. 1). This is partly due to the curse of dimensionality, which means that much more data are needed when the number of features increases. In fact, the volume of the feature space grows exponentially with the dimensionality such that the data become sparse and insufficient to adequately describe the feature space. Another problem is overfitting, which means that the algorithm describes noise or random features in the training dataset instead of the real phenomenon due to sparse data, the complexity of the model, an excessive learning procedure or any combination of them. Overfitting leads to a decreased performance on real cases.

ML methods are trained to detect differences in datasets. For that purpose, a good quality training dataset is required. The dataset should represent the space of possible cases [Nair and Vihinen, 2013]. This space is huge for genetic variations as there are so many different effects and underlying mechanisms. ML predictors are trained with known positive and negative instances in an approach called supervised learning that leads to reorganization of the system—learning (Fig. 1). Once the method has learned to distinguish between the given examples, it can be used to classify unknown cases. The training set should contain about equal numbers of cases in each class. An imbalance in the numbers of cases in the classes can reduce the performance of the method [Wei and Dunbrack, 2013].

The same data cannot be used both for method training and testing (Fig. 1). The trick is to partition the dataset. This can be done in different ways, cross-validation being the most popular. The dataset is divided into disjoint partitions, one of which is used for testing and the others for training. This is repeated until all the partitions have been used as the test set. The average performance measures computed from the splits are used to describe the overall prediction performance. The prediction performance would be biased if the algorithm parameters and feature sets were optimized using the same data or a part of it. Therefore, the use of an independent test dataset is highly recommended. The data in this set should not be used during any of the previous steps of method development (Fig. 1).
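
The sketch below illustrates this splitting strategy with toy data; it is our example using scikit-learn, not the pipeline of any cited method. A blind test set is held out before any optimization, and variants from the same protein are never split between training and test data.

```python
# A minimal sketch of grouped splitting and cross-validation for a variant
# classifier; all data here are randomly generated toy values.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # hypothetical feature vectors (e.g., conservation scores)
y = rng.integers(0, 2, size=500)          # 0 = neutral, 1 = pathogenic (toy labels)
proteins = rng.integers(0, 50, size=500)  # protein (or protein family) of each variant

# Hold out a blind test set; whole proteins go to one side or the other.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=proteins))

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# Cross-validation on the training portion only, again grouped by protein.
cv_scores = cross_val_score(clf, X[train_idx], y[train_idx],
                            groups=proteins[train_idx], cv=GroupKFold(n_splits=5))
print("Cross-validation accuracy:", cv_scores.mean())

# The blind test set is touched only once, after all optimization is finished.
clf.fit(X[train_idx], y[train_idx])
print("Blind test accuracy:", clf.score(X[test_idx], y[test_idx]))
```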

Conservation Scores

Evolutionary conservation-based features are powerful for predicting the impact of variations. Disease-causing variants appear frequently at evolutionarily conserved positions. These positions are usually essential for the structure or function of the protein [Miller and Kumar, 2001; Vitkup et al., 2003; Shen and Vihinen, 2004]. On the other hand, disease-associated variations are underrepresented at positions that are variable during evolution [Miller and Kumar, 2001]. Evolutionary conservation is estimated based on multiple sequence alignment (MSA) of evolutionarily related sequences, called homologous sequences. Homologous sequences have a common ancestor and are separated by speciation (called orthologs) or by duplication (called paralogs) during evolution. Homologous sequences often have a high sequence similarity compared with unrelated sequences. The Basic Local Alignment Search Tool (BLAST) is often used to find similar sequences; however, the obtained sequences may not necessarily be evolutionarily related. Sequences annotated as homologs and orthologs in sequence databases are also used to obtain MSAs. Several measures have been derived from MSAs including normalized frequencies of amino acids [Ng and Henikoff, 2001], evolutionary rates [Cooper et al., 2005], selective pressures [Niroula et al., 2015], sequence profiles [Calabrese et al., 2009], position-specific scoring matrix scores [Johansen et al., 2013], and position-specific independent count scores [Sunyaev et al., 1999]. Other conservation measures include pairwise alignment scores of homologous sequences [Choi et al., 2012], and scores derived from alignments of protein families such as substitution position-specific evolutionary conservation (subPSEC) [Thomas and Kejariwal, 2004] and hidden Markov Model (HMM) PHC scores [Karchin et al., 2005]. Most methods developed to predict the impact of variations use conservation scores as features. Some methods are based entirely on conservation scores, which are optimized to predict the impact of variations [Ng and Henikoff, 2001; Choi et al., 2012].
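
As a toy illustration, and not any specific published score, the snippet below computes two of the simplest conservation measures for a single alignment column: the frequency of the wild-type residue and the Shannon entropy of the column. The alignment is a made-up example.

```python
# Toy conservation features for one MSA column containing a variant position.
import math
from collections import Counter

msa = [                      # hypothetical aligned homologous sequences
    "MKTAYIAKQR",
    "MKTAHIAKQR",
    "MRTAYVAKQR",
    "MKTAYIGKQR",
]

def column_conservation(msa, pos, wild_type):
    """Return (wild-type frequency, Shannon entropy) for one alignment column."""
    column = [seq[pos] for seq in msa if seq[pos] != "-"]
    counts = Counter(column)
    n = len(column)
    wt_freq = counts.get(wild_type, 0) / n
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return wt_freq, entropy

# Position 4 (0-based) with wild-type residue Y; a high wild-type frequency and
# a low entropy suggest an evolutionarily constrained site.
print(column_conservation(msa, 4, "Y"))
```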

Energy Functions

Proteins form complex structures that are stabilized by various interactions. Variations can disrupt or form these interactions and therefore change the stability of a protein. Several stability predictors use energy functions, whereas others are ML-based methods. The energy function-based methods can be grouped into physical potential function, statistical potential function, and empirical potential function methods [Guerois et al., 2002; Capriotti et al., 2004; Parthiban et al., 2006]. These methods calculate the free energy change (∆∆G), that is, the free energy difference between the normal and variant protein. The most common applications utilize statistical potential calculations since physical potential calculations are very time-consuming [Gilis and Rooman, 2000]. The empirical potential methods combine and weigh physical and statistical energy components together with, for example, structure-related features [Guerois et al., 2002]. Some methods have also combined ML and energy function for stability prediction [Dehouck et al., 2009; Masso and Vaisman, 2010; Laimer et al., 2015].
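
Written out as a formula, the predicted quantity is the difference in folding free energy between the variant and the wild-type protein, where each folding free energy is itself the difference between the folded and unfolded states; note that the sign convention for destabilizing changes varies between tools.

```latex
% DeltaDeltaG as used by stability predictors; each DeltaG_folding is the free
% energy difference between the folded and unfolded states of the protein.
\Delta\Delta G = \Delta G_{\mathrm{folding}}^{\mathrm{variant}} - \Delta G_{\mathrm{folding}}^{\mathrm{wild\;type}}
```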

Hybrid Methods

Hybrid methods combine diverse features. Here, we discuss hybrid methods that utilize evidence from experimental tests and clinical features. Hybrid methods integrate evidence for clinical features with scores obtained from predictors. The first step involves predicting scores for variation impact. In the second step, the scores from the first step are used as priors and integrated with evidence to estimate the posterior probability. The final classification of variations is based on the posterior probability. This approach has been used to classify variants in BRCA1 and BRCA2 genes in breast cancer [Goldgar et al., 2004, 2008; Lindor et al., 2012] as well as MMR genes in Lynch syndrome [Thompson et al., 2013a; Thompson et al., 2014]. Recently, this approach was implemented for classification of mitochondrial tRNA variations [Niroula and Vihinen, 2016].
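
The sketch below illustrates this two-step scheme in a simplified form; it is our example, and the prior, likelihood ratios, and numbers are hypothetical rather than taken from any cited study. A predictor score serves as the prior probability of pathogenicity and is combined with likelihood ratios derived from independent evidence, such as co-segregation or co-occurrence data.

```python
# Simplified posterior probability calculation for a hybrid classification scheme.
def posterior_probability(prior, likelihood_ratios):
    """Combine a prior probability with evidence given as likelihood ratios."""
    posterior_odds = prior / (1.0 - prior)      # convert prior probability to odds
    for lr in likelihood_ratios:
        posterior_odds *= lr                    # multiply in each piece of evidence
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical example: a predictor gives prior 0.30; co-segregation supports
# pathogenicity (LR = 8) while co-occurrence data argue against it (LR = 0.5).
print(posterior_probability(0.30, [8.0, 0.5]))  # about 0.63
```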

Types of Tools

Before starting variation interpretation, one has to assume that the sequencing-related steps have been professionally and properly performed. Steps such as alignment, variation calling, and assessment of read quality are essential prerequisites. Variation interpretation cannot correct mistakes made in earlier steps; therefore, it is essential that the users of these tools are aware of the quality of the detected variants. In this respect, issues such as coverage and sequencing technologies used are very important; however, these are outside the scope of this article. Bear in mind that the interpretations of variation callers do not always agree [O'Rawe et al., 2013].

The effects of many deleterious variants are straightforward to explain, such as large deletions, protein truncations, amphigoric amino acid insertions and deletions that change the sequence after the variation site, and some other types. Usually, the most difficult ones to interpret are minor alterations, most often single-nucleotide substitutions and consequent amino acid changes. The effects on proteins are more difficult to study experimentally than those on DNA or RNA. Hybridization and sequence complementarity-based methods work (almost) equally well for all nucleotide sequences, whereas a dedicated analysis protocol is needed for every individual protein, and quite often even for individual variants; see Storz and Zera [2011], Yates and Sternberg [2013], and Perniola and Musco [2014], for example. Experimental methods are the first choice to study the detailed effects of variants. Such methods are often expensive and time-consuming and cannot be performed at the scale at which exome and complete genome sequencing produce variants. Computational methods are thus useful for prioritizing variants for experimental studies [Thusberg and Vihinen, 2009; Zhang et al., 2012; Kucukkal et al., 2014]. Several computational tools have been developed for variation prioritization. The methods vary widely in principle, implementation, and application.

Tolerance Predictors

Predictors for amino acid substitutions

The largest number of predictors is for tolerance prediction, that is, pathogenicity of variants. They vary widely in their concept and implementation [Karchin, 2009; Thusberg and Vihinen, 2009; Capriotti et al., 2012]. Most predictors are available for amino acid substitutions, which are often erroneously called missense variations [Vihinen, 2015a]. The methods utilize evolutionary information (alone or together with other features) from MSAs to estimate the conservation of the variant position. Practically all the existing methods utilize sequence comparison data in one way or other. Some methods compute conservation scores and optimize them to classify the variations [Ng and Henikoff, 2001; Choi et al., 2012]. Other methods use the conservation scores as features for training. Additional features used by methods include physicochemical properties of the amino acids (e.g., hydropathy, charge, size, secondary structure propensities), annotations in sequence and structure databases, information from three-dimensional protein structures, gene ontology (GO) annotations, and so on. These methods include CADD [Kircher et al., 2014], MutationTaster2 [Schwarz et al., 2014], MutPred [Li et al., 2009], nsSNPAnalyzer [Bao et al., 2005], PolyPhen-2 [Adzhubei et al., 2010], PON-P2 [Niroula et al., 2015], SNPs&GO [Calabrese et al., 2009], and VEST [Carter et al., 2013]. Some representative methods are listed in Supp. Table S1. When considering the types of amino acid changes, disease-causing variants typically have more drastic changes in their physicochemical properties than tolerated variants [Steward et al., 2003; de Beer et al., 2013]. Variations at functionally and structurally important sites have a more severe impact than those at nonfunctional sites. Although GO annotation is a protein-specific feature, it has been shown to be useful for classifying deleterious and benign variations [Calabrese et al., 2009; Niroula et al., 2015]. Some methods use features derived from three-dimensional protein structures. However, since experimental structures are not available for all proteins, these features cannot be used for variations in all proteins unless the structure or the features are predicted, which might introduce problems and reduce method performance. Some metapredictors that utilize the results of other tools have been implemented including Condel [Gonzalez-Perez and Lopez-Bigas, 2011], PON-P [Olatubosun et al., 2012], Meta-SNP [Capriotti et al., 2013], and PredictSNP [Bendl et al., 2014].

The majority of these tools utilize ML algorithms. They are trained and tested on variations with known impact. Usually, the datasets have been different but recently benchmark datasets of sufficient size have become available and are being used increasingly. The VariBench database [Nair and Vihinen, 2013] contains positive and negative datasets for several types and effects of variants, whereas the VariSNP database contains only benign variants extracted from dbSNP [Schaafsma and Vihinen, 2015].

Methods intended to handle NGS datasets should, in addition to being accurate, be able to process large datasets in reasonable time; only some of the tools can do so. In a test, the elapsed time ranged from about 10 min to more than 2 weeks for 30,000 variants [Niroula et al., 2015]. In addition, the applicability of such methods also depends on the possibility of submitting data in several formats, including those used by sequencers, such as variant call format (VCF) files. A programmatic interface (API), available for PON-P2 [Niroula et al., 2015], and the possibility of local installation of some other programs allow seamless integration into variation analysis pipelines.
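
As a reminder of the input format such pipelines typically start from, the snippet below parses the fixed fields of a single VCF data line; this is a generic illustration of the format, not the submission interface of any particular predictor.

```python
# Parse one tab-delimited VCF data record (VCF 4.x fixed fields) into a dict.
vcf_line = "1\t12345\t.\tA\tG\t50\tPASS\tDP=100"   # hypothetical record

fields = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
record = dict(zip(fields, vcf_line.rstrip("\n").split("\t")))
record["POS"] = int(record["POS"])

print(record["CHROM"], record["POS"], record["REF"], ">", record["ALT"])
```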

Predictors for synonymous variations

Although amino acid substitutions are the most common variants associated with disease, other types of variations including synonymous variations (that do not alter amino acid sequence) are associated with several diseases [Sauna and Kimchi-Sarfaty, 2011], for example, by altering splicing, changing mRNA structure, or changing translation speed [Hunt et al., 2014]. Synonymous variations at the splice sites as well as in exonic splicing regulatory regions may lead to splicing defects. Several splice-site effect predictors (discussed later) predict the impact of both intronic and exonic variations on splicing. The Silent Variation Analyzer (SilVA) is a method for prioritization of harmful synonymous variations [Buske et al., 2013]. The majority of the variations in the training data lead to a splicing defect, but there are also variations that alter the methylation pattern or translational efficiency. The method suffers from the small number of known disease-causing variations: just 33 and six variants were available for training and validation, respectively.

Predictors for insertions and/or deletions

Insertions and/or deletions in coding regions can lead to three types of variations in a protein sequence: amphigoric alteration of the amino acid sequence due to RNA frameshift; insertion or deletion of one or more amino acid(s) without changing the reading frame; or termination of amino acid sequence by introducing an mRNA stop codon at the variation site or soon after it. Insertions or deletions of length divisible by 3 do not lead to translation frame alteration (i.e., do not change the amino acid sequence after the variation site). Tools in this domain include KD4i [Bermejo-Das-Neves et al., 2014] and SIFT Indel [Hu and Ng, 2013] that predict the impact of nonframeshifting insertions and deletions. Some other tools predict the impact of frameshifting as well as nonframeshifting insertions and deletions [Zia and Moses, 2011; Hu and Ng, 2012; Zhao et al., 2013; Liu et al., 2014; Douville et al., 2016] (Supp. Table S1). CADD [Kircher et al., 2014], MutationTaster2 [Schwarz et al., 2014], PROVEAN [Choi et al., 2012], and VEST [Carter et al., 2013; Douville et al., 2016] can predict impacts of insertions, deletions, as well as amino acid substitutions.
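
The frame distinction can be expressed as a one-line arithmetic check; the helper below is a generic illustration and not part of any cited tool.

```python
# Classify a coding insertion/deletion as in-frame or frameshifting based on
# the net length change between the reference and alternative alleles.
def indel_effect_on_frame(ref_allele: str, alt_allele: str) -> str:
    net_change = abs(len(alt_allele) - len(ref_allele))
    return "in-frame" if net_change % 3 == 0 else "frameshift"

print(indel_effect_on_frame("A", "AGTC"))  # 3-nt insertion -> in-frame
print(indel_effect_on_frame("ATG", "A"))   # 2-nt deletion  -> frameshift
```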

Predictors for variations in noncoding regions

Although variation interpretation methods largely focus on coding regions and protein sequences, some methods have been developed to predict the impact of variations in noncoding regions including regulatory regions [Macintyre et al., 2010; Manke et al., 2010; Ritchie et al., 2014; Lee et al., 2015; Zhou and Troyanskaya, 2015] (Supp. Table S1). Certain generic tolerance prediction tools including CADD [Kircher et al., 2014] and MutationTaster2 [Schwarz et al., 2014] can predict the impact of variations in coding as well as noncoding regions. The ENCODE project [Sloan et al., 2016] has compiled experimental and predicted information for vast amounts of noncoding sites, which can be used for prediction purposes.

Numerous tools and services have been established to collect and disseminate predictions of multiple predictors. The database of human nonsynonymous single-nucleotide variations (dbNSFP v3.0) contains results for 14 predictors, altogether for 83 million variants [Liu et al., 2016]. Annotation tools including ANNOVAR [Yang and Wang, 2015], AVIA v2.0 [Vuong et al., 2015], SnpEff [Cingolani et al., 2012], Variant Effect Predictor (VEP) [McLaren et al., 2010], and so on provide predictions of several predictors.

Domain-, Protein-, and Disease-Specific Predictors

Several databases and annotation tools are specific for certain genes, protein domains, or regions, for example, primary immunodeficiency-causing genes [Piirilä et al., 2006; Samarghitean et al., 2007; Ortutay and Vihinen, 2009], MMR genes [Thompson et al., 2014], protein kinase domain [Stenberg et al., 2000; Ortutay et al., 2005; Vazquez et al., 2016], phosphorylation sites [Safaei et al., 2011; Hornbeck et al., 2012], and many others. Specific predictors are available for variations in MMR genes [Chao et al., 2008; Ali et al., 2012; Thompson et al., 2013b; Thompson et al., 2014; Niroula and Vihinen, 2015a], the cystic fibrosis transmembrane conductance regulator protein [Masica et al., 2012], cytochrome P450 enzymes [Fechter and Porollo, 2014], hypertrophic cardiomyopathy-related proteins [Jordan et al., 2011], protein kinase domains [Torkamani and Schork, 2007; Väliaho et al., 2015; Vazquez et al., 2016], phosphorylation sites [Wagih et al., 2015], signal peptides [Hon et al., 2009], and others (Supp. Table S1). The size of the training dataset for such specific predictors is usually small, which may cause overfitting unless specifically taken care of. As it is difficult to find sufficient numbers of validated cases for evaluating specific predictors, it is essential to check how the performance of the methods has been determined. The methods should be tested on independent test datasets and all the recommended performance scores (see below) should be reported.

Prediction of the Effect or Mechanism of a Variant

Tolerance predictors detect variants that can be harmful but cannot explain what is wrong with them or in which way. Variations can have various effects, consequences, and mechanisms (Fig. 2); for those in amino acids, see Vihinen [2015c]. Variation effect analyses have to be done separately as there is no tool to combine the different predictions and interpret their results together. As the methods have widely varying speeds, it is beneficial to run a fast tolerance predictor first and then use tools to predict the effects and mechanisms for the potentially harmful variants. This is also warranted by the fact that mechanism- and effect-specific tools typically have higher error rates than the best tolerance tools and can be substantially slower to run. Therefore, filtering variants by tolerance predictors could improve their performance.

Figure 2. Types of variation effects and predictions.

For some of the effects, including protein stability, aggregation, and solubility, there are several tools, some with tested performance. The biggest problem for the end-user with many of these predictors is that the method performance has often been presented in a somewhat biased or selective way and without comparison to related tools. Thus, it may be impossible to interpret the quality of the obtained predictions. The situation is improving due to the use of benchmark databases.

Protein stability predictions

Stability affects function, activity, and regulation of biomolecules. Incorrect folding and decreased stability are the major consequences of pathogenic amino acid substitutions [Wang and Moult, 2001; Ferrer-Costa et al., 2002; Stefl et al., 2013; Peng and Alexov, 2016]. Substitutions can cause, for example, a reduction in the hydrophobic area, overpacking, backbone strain, or loss of electrostatic interactions, thereby affecting protein stability. Alterations to intramolecular interactions affect the free energy difference (ΔΔG) between the folded and unfolded states of proteins.

Several methods predict stability effects by comparing ΔΔG between a wild-type protein and its variants. Some of these methods rely on energy functions, whereas others apply ML approaches. Methods utilizing energy functions can be subdivided into physical potential approaches, statistical potential approaches, and empirical potential energy approaches [Guerois et al., 2002; Parthiban et al., 2006]. All the ML-based methods are trained with data from the ProTherm database [Kumar et al., 2006]. They use features derived from tertiary protein structure, evolutionary conservation, sequence environment, network topology, and properties of amino acid residues to describe each variation [Capriotti et al., 2005; Cheng et al., 2006; Capriotti et al., 2008; Teng et al., 2010; Chen et al., 2013; Yang et al., 2013; Folkman et al., 2014; Giollo et al., 2014; Pires et al., 2014; Fariselli et al., 2015]. It is evident that different features are important for tolerance and stability predictors. Some methods combine ML algorithms and energy functions [Dehouck et al., 2009; Masso and Vaisman, 2010; Laimer et al., 2015]. While most of the methods predict the stability changes for individual variations, some predict the combined effect of variation pairs [Huang and Gromiha, 2009].

Protein localization predictions

To function properly, a protein must be translocated to the appropriate cellular compartment. Proteins are directed to their locations by short targeting signal peptide sequences. Substitutions in signal peptides can disrupt or alter the signal and prevent localization or redirect the produced protein to the wrong cellular compartment. When a protein is not transported to the correct subcellular location, central reactions can be inactivated or signaling pathways misregulated. An active mislocalized protein is likely to have harmful effects by exerting its function in the wrong environment.

Although numerous predictors are available for predicting protein subcellular localization, they do not predict the impact of variations on localization. Using WoLF PSORT [Horton et al., 2007] and a pipeline of several tools [Emanuelsson et al., 2007], disease-causing variations were studied in relation to protein localization [Laurila and Vihinen, 2009]. Among the 22,416 disease-causing variations tested, hundreds were found likely to change protein localization. PROlocalizer was later developed to predict the cellular localization as well as the impact of variations on localization [Laurila and Vihinen, 2011].

Splice-site effect predictors

Splicing of nascent premessenger RNA (pre-mRNA) is an essential step in mRNA maturation. Alternative splicing increases mRNA and protein diversity by generating multiple mRNA products from the same pre-mRNA. Hence, one gene can code for multiple protein products. At least about 95% of human genes have more than one transcript [Pan et al., 2008]; however, many of them are extremely rare and their biological significance has not been established. About 10% of the disease-causing variations in the HGMD occur within splice sites [Krawczak et al., 2007], which is likely an underestimation since exonic variations outside the splice sites and variants at deep intronic sites could also disrupt splicing [Cartegni et al., 2002; Hunt et al., 2014]. Among the primary immunodeficiency-causing variants in IDbases, 15% are splice-site variants [Piirilä et al., 2006]. Exonic RNA substitutions are classified as missense, nonsense, or silent variations and their role in splicing is not generally analyzed. Several methods have been developed to predict splice sites [Hebsgaard et al., 1996; Pertea et al., 2001; Yeo and Burge, 2004] and splicing regulatory sites [Fairbrother et al., 2002; Wang et al., 2004; Goren et al., 2006; Smith et al., 2006]. There are also some tools to predict the effects of variations on splicing (Supp. Table S1). The Human Splicing Finder (HSF) combines results from several splice site and splice regulatory site prediction tools to estimate the impact of variations on splicing motifs [Desmet et al., 2009]. HSF and the commercial program ASSEDA [Nalla and Rogan, 2005] can predict the splice-site defects of exonic as well as intronic variations. MutPredSplice [Mort et al., 2014] and SKIPPY [Woolfe et al., 2010] predict splicing defects for exonic variations only.

Protein disorder predictors

Although proteins are often considered static molecules, they undergo all kinds of fluctuations at different time scales. Some proteins or parts of proteins do not have an ordered structure at all. These intrinsically disordered regions and proteins are involved in several activities [Dunker et al., 2002]. The disordered regions are difficult to crystallize because they are highly flexible. Although numerous methods are available for predicting protein disorder, tools to predict the impact of variations on protein disorder were released only recently. Disease-causing amino acid substitutions are more often associated with disorder-to-order changes than neutral amino acid substitutions [Vacic and Iakoucheva, 2012]. PON-Diso is an ML-based method for prediction of variation effects on protein disorder [Ali et al., 2014]. The method uses features representing evolutionary conservation and biochemical properties of amino acids.

Protein aggregation predictors

Protein aggregation is characterized by an increased level of β-sheet structure. Amyloid fibrils and amorphous aggregates are formed by protein aggregation. These fibrils and aggregates are caused by irreversible structural alterations and are associated with neurodegenerative diseases such as Alzheimer, Huntington, and Parkinson disease. Normal proteins can become toxic upon fibrillation [Bucciantini et al., 2004] and other proteins not related to amyloid diseases can aggregate under destabilizing conditions [Chiti et al., 1999]. Numerous tools predict amyloidogenic regions in protein sequences [Tartaglia et al., 2008; Garbuzynskiy et al., 2010; Maurer-Stroh et al., 2010; Emily et al., 2013]. Amino acid substitutions can increase the tendency of amyloidogenic proteins to aggregate. Methods for predicting the effect of amino acid substitutions on amyloidogenic proteins include AGGRESCAN [Conchillo-Sole et al., 2007], PASTA2.0 [Walsh et al., 2014], TANGO [Fernandez-Escamilla et al., 2004], and Aggrescan3D [Zambrano et al., 2015] (Supp. Table S1). Aggrescan3D requires three-dimensional protein structure to predict the effect of amino acid substitutions to aggregation, whereas the others utilize peptide sequences.

Protein solubility predictors

Protein solubility has been of great interest because of its relevance to protein (over)expression and protein structural studies. Both nuclear magnetic resonance and X-ray crystallography require the protein of interest to be soluble at a high concentration. Low protein solubility has also been associated with diseases. Methods for predicting the effects of variations on protein solubility include OptSolMut [Tian et al., 2010], CamSol [Sormanni et al., 2015], and PON-Sol [Yang et al., 2016] (Supp. Table S1). OptSolMut and CamSol require a known three-dimensional protein structure, whereas PON-Sol is based on sequence context, evolutionary conservation, properties of amino acids, and other sequence-based features. Solubility and aggregation are related physical phenomena. Aggregation is an irreversible process due to structural and other changes, whereas precipitated proteins may be dissolved by dilution [Arakawa and Timasheff, 1985].

Additional mechanisms that amino acid substitutions can affect and for which prediction methods are available include electrostatics, residue-residue contacts, side-chain rotamers, cavity formation, and steric clashes [Thusberg and Vihinen, 2009; Vihinen, 2015c]. When a residue is replaced, its chemical and physical properties may be altered, causing structural alterations, especially when the original residue is smaller than the substituting one.

Prediction of Cancer Variants

Cancer is a systemic disease where the accumulation of genetic variations leads to alterations of multiple phenomena and processes. Depending on the cancer, the numbers of somatic variations range from a few to more than one million. A majority of these are so-called passengers and have little or no effect on the disease. Driver variants essential for initiation, development, progression, and/or maintenance of tumors are typically small in number.

Large amounts of cancer genomic data are available from genomic projects including the Cancer Genome Project (CGP, https://www.sanger.ac.uk/research/projects/cancergenome/), The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/), and the International Cancer Genome Consortium (ICGC, https://icgc.org/). These projects collect and provide various types of genetic data for large numbers of cancer samples. These massive datasets provide unprecedented possibilities for data analysis. Various approaches have already been used to study mechanisms of tumorigenesis. Still, distinguishing driver variants from the vast majority of nonfunctional somatic variations remains a major challenge in cancer genomics.

Despite a large number of studies of genetic variation in numerous cancers, relatively little is known about the true cancer driver variants. The Cancer Gene Census (CGC, http://www.sanger.ac.uk/genetics/CGP/Census/) lists genes implicated in cancer [Futreal et al., 2004]. Variations identified in these genes are available in the COSMIC database [Forbes et al., 2011]; however, they have seldom been experimentally validated to be drivers. In fact, even in the genes listed in the CGC, most variations are predicted to be harmless [Niroula and Vihinen, 2015b].

Several approaches have been taken to search for the driver variations, genes, networks, and pathways. Computational methods and tools for interpreting variations in cancers have been reviewed previously [Gonzalez-Perez et al., 2013; Ding et al., 2014; Raphael et al., 2014; Chen et al., 2015; Tian et al., 2015]. Here, we briefly discuss the commonly used methods and tools in cancer variation analysis. Tolerance prediction methods are often applied for analysis of somatic variations found in cancer genomes. Some of them have been used to develop cancer-specific predictors. Specific tools including CHASM [Wong et al., 2011], transFIC [Gonzalez-Perez et al., 2012], CanPredict [Kaminker et al., 2007], SPF-Cancer [Capriotti and Altman, 2011b], cancer-specific FATHMM [Shihab et al., 2013], CanDrA [Mao et al., 2013], and others prioritize variants in cancer (Supp. Table S1).

Most of these methods are trained using frequent somatic variations in COSMIC and other cancer-related variations and putative neutral variations from various sources. Only a few databases contain validated cancer variants [Niroula and Vihinen, 2015b] (Table 1). The Database of Curated Mutations (DoCM) (http://docm.genome.wustl.edu/sources) contains disease-causing variations in cancers. The TP53 mutation database contains somatic variations in TP53 together with information about the effect on protein activity [Edlund et al., 2012]. Kin-Driver is a manually curated database of validated driver variations [Simonetti et al., 2014].

Frequencies and patterns of variations have been used to identify driver genes. These methods compare the variation frequency with the expected background mutation rate. MutSigCV [Lawrence et al., 2013], MuSiC [Dees et al., 2012], and InVEx [Hodis et al., 2012] identify genes containing a significantly higher variation frequency compared with the background mutation rate. DrGaP [Hua et al., 2013] uses background mutation rates to identify significant genes and pathways, whereas OncodriveFM [Gonzalez-Perez and Lopez-Bigas, 2012] and MADGiC [Korthauer and Kendziorski, 2015] utilize the functional impact of variations. OncodriveCLUST [Tamborero et al., 2013] identifies genes that have significant clusters of variations in proteins. The 20/20 rule was derived to identify oncogenes and tumor-suppressor genes based on whether there are recurrent variations or the variations are distributed along the protein sequences [Vogelstein et al., 2013]. According to the rule, in oncogenes at least 20% of the variations are amino acid substitutions at recurrent sites, whereas in tumor-suppressor genes at least 20% of the total variants are inactivating. Based on a similar concept, SomInaClust [Van den Eynden et al., 2015] and DOTS-Finder [Melloni et al., 2014] identify tumor-suppressor genes and oncogenes. ContrastRank [Tian et al., 2014] prioritizes putative driver genes based on mutation rates in the 1000 Genomes Project and normal samples in the TCGA, and ActiveDriver [Reimand and Bader, 2013] identifies genes containing significantly higher variation frequencies at active sites.
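
A simplified sketch of the 20/20 rule is shown below; the input representation and helper function are our own illustration, and the cited tools are considerably more elaborate.

```python
# Toy classification of a gene by the 20/20 rule using per-variant annotations.
from collections import Counter

def twenty_twenty(variants):
    """variants: list of (position, kind) tuples for one gene, where kind is
    'missense', 'inactivating' (nonsense, frameshift, splice site), or 'other'."""
    total = len(variants)
    if total == 0:
        return "undetermined"
    missense_positions = Counter(pos for pos, kind in variants if kind == "missense")
    recurrent_missense = sum(n for n in missense_positions.values() if n > 1)
    inactivating = sum(1 for _, kind in variants if kind == "inactivating")
    if recurrent_missense / total >= 0.20:
        return "oncogene-like"
    if inactivating / total >= 0.20:
        return "tumor suppressor-like"
    return "undetermined"

# Hypothetical gene: 8 of 10 variants are missense and recur at two hotspots.
toy = [(12, "missense")] * 5 + [(61, "missense")] * 3 + [(100, "other")] * 2
print(twenty_twenty(toy))  # oncogene-like
```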

Certain methods recognize gene/protein interaction networks and pathways affected in cancer samples. NetBox identifies significant network modules and driver variations in cancer [Cerami et al., 2010]. VarWalker identifies sample-specific gene interactors and integrates them to prioritize driver genes [Jia and Zhao, 2014]. Mutual exclusivity of variations and/or genes across cancer samples has been used to identify pathways affected in large numbers of samples. This approach is based on the observation that variations affecting the same signaling pathways do not co-occur in the same samples [The Cancer Genome Atlas Research Network, 2008; Ciriello et al., 2012]. Dendrix [Vandin et al., 2012], MEMo [Ciriello et al., 2013], Multi-Dendrix [Leiserson et al., 2013], CoMEt [Leiserson et al., 2015], MEMCover [Kim et al., 2015], NBM [Wu et al., 2015], RME [Miller et al., 2011], CoMDP [Zhang et al., 2014], and others [Lu et al., 2015] use mutual exclusivity to identify affected pathways and genes.

As the large cancer genomic projects have collected heterogeneous data from large numbers of samples, data integration has become essential. Methods that integrate different types of data, for example, genome, transcriptome, proteome, and epigenome, include DawnRank [Hou and Ma, 2014], DriverNet [Bashashati et al., 2012], OncoIMPACT [Bertrand et al., 2015], and Pathway Relevance Ranking [Verbeke et al., 2015].

Cancer genomics is expanding rapidly, and so is the number of tools for analyzing cancer genomes. Predictors are important for prioritizing cancer-related variants, genes, networks, and pathways; however, they have limitations. Functional impact predictors vary in performance, and even minor differences in performance translate into large numbers of differently predicted variations when applied to large datasets. Additional complexity arises because variations co-occur and their combined impact cannot be reliably predicted.

The frequency of a variation is not a direct indication of its relevance for disease. Some sites, such as CpG sites, are simply prone to variation [Ollila et al., 1996], but this does not necessarily mean that the variants are harmful. Variation spectra also differ markedly between cancers: those originating from the action of carcinogens, such as melanomas and lung cancers, accumulate large numbers of variants. In addition, the background mutation rate varies up to 100-fold across the genome [Hodgkinson and Eyre-Walker, 2011]. Large numbers of false positives can arise from high mutation rates in many cancers, and many false negatives from tumor heterogeneity [Lawrence et al., 2013]. Frequency-based methods are less efficient in cancers with a lower variation frequency, yet infrequent variations can affect the same pathways as frequent ones. Network- and pathway-based methods can point out the affected mechanisms, but since human protein interaction network and pathway databases are far from complete, these methods also have limitations.

Method Performance Assessment

When choosing a prediction method, one should look at performance scores, preferably from independent benchmark studies. Three approaches can be used for testing method performance [Vihinen, 2012]. The first is challenge-based testing: the Critical Assessment of Genome Interpretation (CAGI, http://genomeinterpretation.org/) is a challenge for method developers in the field of phenotypic impacts of genomic variation. Challenges do not aim at a systematic analysis of predictions; instead, they assess what is currently doable, providing proof of concept, charting where to direct future efforts, and identifying new areas where predictive approaches are needed.

The second strategy is the testing typically performed by method developers. Most often these performance tests are not comprehensive, and the results are not comparable with those of other methods because developer-collected test sets are used. The third approach, systematic analysis, uses benchmark data and several evaluation measures to describe method performance.

Numerous measures should be used to describe predictor performance, since a single measure cannot capture all aspects of the performance. Typically, prediction methods are used as classifiers to define whether a case has the investigated feature or not. Results of such binary predictions are presented in a 2×2 confusion (contingency) table or matrix. Based on the four descriptors in the contingency table, several measures can be calculated (Fig. 3). Sensitivity, also called true positive rate (TPR) or recall, and specificity (true negative rate, TNR) show the proportions of pathogenic and neutral cases, respectively, that are predicted correctly. Positive predictive value (PPV/precision) and negative predictive value (NPV) are the conditional probabilities that cases predicted as pathogenic or neutral truly are pathogenic or neutral, respectively. All these measures are calculated by using only half of the information in the contingency table. Accuracy and the Matthews correlation coefficient (MCC) utilize the whole matrix and are therefore more balanced, representative, and comprehensive. To obtain a full picture of predictor performance, it is important to evaluate all six measures together [Vihinen, 2012].

Figure 3. Recommended performance measures for binary predictors. Accuracy and MCC use all four values from the contingency table.
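The six recommended measures can be computed directly from the four contingency-table counts, as in the minimal sketch below; the counts in the example call are invented for illustration.

```python
# Sketch: the six recommended measures from a 2x2 contingency table
# (TP, FP, TN, FN). The example counts are invented.
from math import sqrt

def binary_measures(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)            # TPR / recall
    specificity = tn / (tn + fp)            # TNR
    ppv = tp / (tp + fp)                    # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    mcc = ((tp * tn - fp * fn) /
           sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return dict(sensitivity=sensitivity, specificity=specificity,
                ppv=ppv, npv=npv, accuracy=accuracy, mcc=mcc)

print(binary_measures(tp=850, fp=150, tn=820, fn=180))
```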

Unless sufficient and similar numbers of positive and negative cases are used, the values of NPV and PPV may be biased, even meaningless. Accuracy and the MCC are affected by class imbalance only in extreme cases. The usual requirement is that the numbers of positive and negative cases be equal. Different approaches can be taken to overcome the class imbalance problem [Vihinen, 2012]: the bigger class can be pruned to the size of the smaller one, or the values in the contingency table can be normalized. Normalization makes sense only if the dataset is representative.
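One simple way to normalize an imbalanced contingency table, sketched below, is to rescale each true class to the same total before computing PPV and NPV; the target total and the example counts are arbitrary illustrative choices.

```python
# Sketch: rescale each true class (positives = TP+FN, negatives = TN+FP)
# to the same total before recomputing PPV/NPV. Counts are invented.
def normalize_counts(tp, fn, tn, fp, target=1000.0):
    pos, neg = tp + fn, tn + fp
    return (tp * target / pos, fn * target / pos,
            tn * target / neg, fp * target / neg)

# Example: ten times more neutral than pathogenic test cases.
print(normalize_counts(tp=90, fn=10, tn=900, fp=100))  # (900.0, 100.0, 900.0, 100.0)
```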

Receiver operating characteristic (ROC) analysis is a visualization of prediction performance that indicates the tradeoff between sensitivity and specificity. The ROC curve can be drawn when the predictor provides a score for the classification. The faster the curve rises and the higher it reaches in the beginning, the better the method (Fig. 4). The area under the ROC curve (AUC) has been used as a measure of the goodness of predictions, as it approximates the probability of ranking a randomly chosen positive instance higher than a randomly chosen negative one. A value of 0.5 indicates random classification, whereas 1 indicates a perfect binary classifier. The ROC curve shows the method's ranking potential, which is related to overall performance.

Figure 4. Example of a ROC curve. The shaded area is the area under the curve (AUC); it ranges from 0 to 1, where a random predictor has an AUC of 0.5 and a perfect predictor an AUC of 1.
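The AUC can be estimated without drawing the curve by using the rank-based (Mann-Whitney) formulation it approximates, as in the short sketch below; the scores and labels are invented.

```python
# Sketch: rank-based AUC estimate (Mann-Whitney U). Scores/labels are invented.
def roc_auc(scores, labels):
    """labels: 1 for pathogenic (positive), 0 for neutral (negative)."""
    pairs = sorted(zip(scores, labels))          # ascending by score
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1                               # pairs[i:j] share the same score
        avg_rank = (i + 1 + j) / 2.0             # average 1-based rank of tied block
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

print(roc_auc([0.9, 0.8, 0.4, 0.35, 0.1], [1, 1, 0, 1, 0]))  # 0.833...
```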

If there are more than two predicted classes, the measures discussed above cannot be applied directly [Vihinen, 2012]. The data can still be presented in an N × N contingency table, and many performance scores used for binary classification can be generalized to the multiclass case. The data can be divided into several binary partitions, and row- and column-wise ratios can be calculated. The MCC is a special case of the linear correlation coefficient for binary data, and its general form can be used for several classes. The generalized squared correlation (GC2) can also be used for more than two classes [Baldi et al., 2000].

Method Performance Evaluations

Several studies have assessed the performance of prediction methods (Table 2); here, we summarize their results. There have been mainly two types of evaluations: larger benchmark-based studies and single gene, protein, or family-related investigations.

Table 2. Prediction method performance assessment studies
Number of variations (a) | Evaluated tools | Reference
Tolerance predictors
40,505 | MutPred, nsSNPAnalyzer, PANTHER, PhD-SNP, PolyPhen, PolyPhen-2, SIFT, SNAP, SNPs&GO | Thusberg et al. [2011]
43,882; 32,776; 3,497; 11,994 | MAPP, nsSNPAnalyzer, PANTHER, PhD-SNP, PolyPhen, PolyPhen-2, PredictSNP, SIFT, SNAP | Bendl et al. [2014]
25,656; 2,906; 1,200; 15,818; 1,804 | CADD, Condel, MutationTaster2, PolyPhen-2, PON-P, PON-P2, PROVEAN, SIFT, SNAP | Niroula et al. [2015]
40,389; 8,850; 10,226; 16,098; 12,729 | CADD, FATHMM, GERP++, LRT, MutationAssessor, MutationTaster2, phyloP, PolyPhen-2, SIFT | Grimm et al. [2015]
30; 2,314 | CADD, Condel, MutationAssessor, PANTHER, PolyPhen-2, SIFT | Miosge et al. [2015]
Protein stability predictors
2,156 | CC/PBSA, EGAD, FoldX, I-Mutant2.0, Rosetta, Hunter | Potapov et al. [2009]
2,716 | CUPSAT, Dmutant, FoldX, I-Mutant2.0, I-Mutant3.0, Mupro, MultiMutate, Scide, SRide, Scpred | Khan and Vihinen [2010]
Splicing predictors
272 | ESE finder, Human Splicing Finder, MaxEntScan, NNSplice, RESCUE-ESE, Splice Site Finder | Houdayer et al. [2012]
623 | GeneSplicer, GENSCAN, Human Splicing Finder, MaxEntScan, NNSplice, SplicePort, SplicePredictor, SpliceView, SROOGLE | Desmet et al. [2010]
2,959 | GeneSplicer, GENSCAN, Human Splicing Finder, MaxEntScan, NetGene2, NNSplice, Position Weight Matrix, SplicePredictor | Jian et al. [2014]
Cancer variation predictors
989 | CanDrA (breast, lung, melanoma), CHASM (breast, lung, melanoma), Condel, FATHMM (cancer), FATHMM (missense), MutationAssessor, MutationTaster, PolyPhen-2, PROVEAN, SIFT, VEST | Martelotto et al. [2014]
(a) Semicolons separate the numbers of variations when a study used more than one dataset.

Protein tolerance predictors

A dataset of over 40,000 variants, 19,335 positive and 21,170 negative cases, was used to test the performance of nine widely used tolerance methods [Thusberg et al., 2011]. The programs were used to predict the entire dataset and the predictions were compared with the experimentally verified outcomes. The results indicated wide variation in method performance. The best performing methods, SNPs&GO [Calabrese et al., 2009] and MutPred [Li et al., 2009], had MCC values close to 0.65. The performance differences have a big practical impact in exome- and genome-wide analyses, where the number of correctly predicted variants can vary by several hundreds or thousands per genome depending on the selected method.

The predictors were further tested for their performance on secondary structural elements and protein structural classes, and for the effects of residue accessibility [Thusberg et al., 2011]. All programs predicted the effects of substitutions in different secondary structures with accuracy and precision almost equal to those for the variations in general. Solvent accessibility of the positions did not markedly affect predictions; however, all methods were more sensitive to the effects of substitutions at buried positions. The performance varied significantly depending on the structural class of the protein, with generally better results for proteins in the α-β class.

In another study, the authors generated benchmark datasets to evaluate the performance of eight methods and compared them with their new tool, PredictSNP [Bendl et al., 2014]. The benchmark datasets were obtained by combining several previously used datasets and eliminating the variants used for training the evaluated methods. Three independent datasets were generated, the largest containing 19,800 deleterious and 24,082 neutral variations. On the largest dataset, PhD-SNP and PredictSNP showed the best performance, both reaching an accuracy of 0.75 and an MCC of 0.49.

The performance of seven prediction methods was evaluated using three different datasets and compared with PON-P2 [Niroula et al., 2015]. The two best performing methods were PON-P2 and MutationTaster2. The accuracy and MCC for PON-P2 were 0.86 and 0.71, respectively, on the PON-P2 test dataset. On the MutationTaster2 test dataset, the accuracy and MCC of PON-P2 were 0.95 and 0.90, respectively; for MutationTaster2, they were 0.89 and 0.79.

A study of 10 prediction methods evaluated with five datasets [Grimm et al., 2015] revealed that method performance decreases when variations present in the methods' training data are eliminated from the test data. It is well known that the performance of a method is overestimated when it is tested on cases present in its training data, a situation called "Type 1 circularity." Therefore, guidelines for method testing demand the use of disjoint datasets for training and testing [Vihinen, 2013]. Some methods still performed well after Type 1 circularity was avoided; however, performance decreased significantly when variations in proteins present in the training dataset were excluded. When the evaluation data and the training data were disjoint at the protein level, the performance of all the methods was reduced and the highest MCC was 0.36. The differences in the performance scores were thus due to two types of circularity. The study showed that appropriate measures, chosen according to the features used to train the methods, are needed to obtain their actual performance. This study did not include PON-P2, which was trained and tested on data in which proteins from the same family appeared either in the training or in the test dataset, but never in both; thus, the results presented above for PON-P2 are free of circularity. The datasets for all the studies mentioned above are available in VariBench [Nair and Vihinen, 2013].

The performance of six prediction methods was assessed using 30 variants in 23 primary immunodeficiency-related proteins and 2,314 variants in the TP53 protein [Miosge et al., 2015]. The methods showed low false negative rates but very high false positive rates, 42% for PolyPhen-2 and 45% for CADD for the TP53 variants. A comparison of transcriptional activities showed that the false positive variants had only modestly reduced transcriptional activity, 86% of that of the true negative variants. Hence, the compared methods overestimated the deleteriousness of the tested variants. The full set of measures described above was not used in the assessment.

Protein stability predictors

Six programs were compared using 2,156 single variations from ProTherm [Potapov et al., 2009] (Table 2). The goal of the study was to compare the performance of the methods in ΔΔG prediction; the effect on protein function was not addressed. The only measure used was the correlation between the experimental and predicted ΔΔG values. The highest correlation coefficient, 0.59, was obtained for EGAD. Combining the methods did not improve the correlation coefficients.

In another study, 11 methods were evaluated [Khan and Vihinen, 2010]. The dataset contained 1,784 variations in 80 proteins, with 1,154 positive cases, of which 931 were destabilizing (ΔΔG ≥ 0.5 kcal/mol) and 222 were stabilizing (ΔΔG ≤ −0.5 kcal/mol); the remaining 631 cases were neutral (−0.5 kcal/mol ≤ ΔΔG ≤ 0.5 kcal/mol). The overall performance was best for I-Mutant3.0 (structure version) [Capriotti et al., 2008], Dmutant [Zhou and Zhou, 2002], and FoldX [Schymkowitz et al., 2005], with accuracies from 0.54 to 0.64. However, the MCC values were poor for all the tested predictors, the best being 0.27 for I-Mutant3.0 (structure version).
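The three-class labelling used in this dataset can be expressed in a few lines of code; the sketch below assumes the sign convention in which positive ΔΔG values are destabilizing, in line with the thresholds quoted above.

```python
# Sketch of the three-class labelling (thresholds from the text; sign
# convention assumes positive ΔΔG = destabilizing).
def stability_class(ddg_kcal_per_mol, cutoff=0.5):
    if ddg_kcal_per_mol >= cutoff:
        return "destabilizing"
    if ddg_kcal_per_mol <= -cutoff:
        return "stabilizing"
    return "neutral"

print([stability_class(x) for x in (1.2, -0.8, 0.1)])
# ['destabilizing', 'stabilizing', 'neutral']
```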

All the programs succeeded better when predicting either stability-increasing or stability-decreasing variations separately [Khan and Vihinen, 2010]. The majority of the programs predicted variants in different secondary structural elements with almost equal accuracy. All the programs predicted the effects at buried sites with higher accuracy and specificity than those at surface positions. When the original residue is replaced by one with a smaller volume, a cavity may form in the protein interior, with a detrimental effect on stability; large volume changes were predicted better than small ones. For destabilizing variations, there were no significant performance differences between the methods for different charge changes, and the predictors performed similarly irrespective of how much the volume or charge changed between the original residue and the variant.

Splice-site predictors

Splicing effect predictors have been assessed in several studies [Houdayer et al., 2008; Holla et al., 2009; Vreeswijk et al., 2009; Thery et al., 2011; Houdayer et al., 2012; Grodecka et al., 2014]. The methods have high performance for variations at the canonical AG/GT splice sites and lower performance for variations further away from the splice site. Six splice-site predictors were tested using 17 variations at splice-site junctions and 22 variations outside the junctions [Houdayer et al., 2008] (Table 2). The effects of variants at the splice-site junctions were correctly predicted, but there were greater inconsistencies in the scoring of variations outside the junctions. In another study, the performance of the same six tools was assessed using 272 variations in the BRCA1 and BRCA2 genes [Houdayer et al., 2012]. The variations were grouped into canonical AG/GT splice sites, the splicing consensus region (11 base pairs at the 5′ splice site and 14 base pairs at the 3′ splice site), the intronic region outside the consensus region, and the exonic region outside the consensus region; 65 of the variants occurred within the consensus splicing region. The difference between the scores for the reference sequence and the altered sequence (called the variant score) was used to define splice-site defects. Due to differences in the scoring systems, the tools performed best at different cutoffs of the variant score: MaxEntScan (MES) had the best sensitivity (0.96) and specificity (0.74) at a 15% cutoff, whereas Splice Site Finder (SSF) [Shapiro and Senapathy, 1987] had the best sensitivity (0.91) and specificity (0.87) at a 5% cutoff. The combination of MES and SSF had the highest sensitivity (0.96) and specificity (0.83) for variants in the consensus splicing region.
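The variant-score idea can be illustrated as the relative drop between the splice-site scores of the reference and variant sequences, compared against a tool-specific cutoff (15% for MES, 5% for SSF in the study above); the raw scores in the example call are invented.

```python
# Sketch of the "variant score": percentage drop between the splice-site score
# of the reference and the variant sequence, compared with a tool-specific
# cutoff. Raw scores are invented for illustration.
def splice_defect(ref_score, alt_score, cutoff_percent):
    if ref_score <= 0:
        return False
    variant_score = 100.0 * (ref_score - alt_score) / ref_score
    return variant_score >= cutoff_percent   # True = predicted splicing defect

print(splice_defect(ref_score=9.2, alt_score=6.1, cutoff_percent=15))  # True
```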

In another study, the performance of nine predictors was assessed using 538 variations affecting splicing and 85 variations that do not affect splice sites [Desmet et al., 2010]. Most of the tested tools showed high accuracies (up to 100%) for variants at the invariant 3′ and 5′ positions. The methods showed high accuracies for variants at the +3 and +5 positions at the 5′ end, but slightly lower accuracies for variants at other positions at the 5′ and 3′ ends. Among exonic variations, the methods predicted most accurately those at the last bases of exons. MES showed the highest accuracy for intronic variants, except for those outside the immediate 5′ and 3′ ends but within 100 bp of them; MES also had a higher false positive rate (25%) than the other methods. Overall, the methods showed very different prediction accuracies for variants at different locations.

Recently, a larger dataset containing 1,164 positive and 1,795 negative cases was used to evaluate eight splice-site predictors [Jian et al., 2014]. The dataset contained single-nucleotide substitutions in the splicing consensus region. Overall, the compared methods showed high accuracies, ranging from 0.83 to 0.91; the position weight matrix approach and MES had the best performance, with accuracies of 0.911 and 0.895, respectively. These assessments show that the predictors perform well for variants in consensus splicing regions, as expected given the high conservation of these sites. Since different performance measures have been used in the assessments and none of the studies reports all the recommended measures, the results remain descriptive.

Cancer variation impact predictors

The performance of both cancer-specific tools and generic tolerance predictors on cancer variants was assessed using putative variations with high minor allele frequencies and recurrent variations in the COSMIC database [Gnad et al., 2013]. There are, however, problems with the design of the study: recurrent variations from COSMIC have been used to train cancer-specific predictors, which causes circularity, and the relevance to cancer of the high-frequency variations used as the neutral dataset is unknown. In another evaluation, of 15 prediction methods, experimentally validated variants causing Li-Fraumeni syndrome, Li-Fraumeni-like syndrome, and early-onset breast and ovarian cancer were used [Martelotto et al., 2014]. In this assessment, most methods showed high PPVs but the NPVs were variable. CanDrA and CHASM can take the tissue of origin into account; predictions of CanDrA for specific tissues showed variable performance, whereas CHASM performed similarly for all three tested tissue types. Recently, the tolerance predictor PON-P2 was assessed using 1,058 cancer variants, of which 69.3% (733) were predicted as harmful, 0.3% (4) as benign, and the remaining 321 as variants of unknown significance [Niroula and Vihinen, 2015b]. The method could not be tested with noncancer variants due to a lack of data. As experimental evidence for the involvement of specific variants in cancer is scarce, it has been difficult to develop cancer predictors and to test their performance. This is further complicated by the fact that several alterations are simultaneously required for cancer development.

Problems and Remedies

Problems in the application of computational methods to variation analysis can appear at several stages and affect the interpretation of the results [Rogan and Zou, 2013; Vihinen, 2013, 2014a]. Before using any method, one should understand how it is used and for what purpose. As variation interpretation has become an ever more important topic, many methods have been developed, sometimes with a "me too" attitude in which methods originally developed in another domain are simply reapplied. This may cause problems, especially when the application domain and the datasets are not well known to the developers, and quite poor datasets have been used for training many predictors. Methods developed in one application area can nevertheless be useful in another domain when properly trained.

Another problematic area has been overfitting; see, for example, Grimm et al. [2015]. If the training data and features are not representative of the full spectrum of true cases, the generalization ability of an ML algorithm is poor and the method is overfitted. Overfitting may also occur for other reasons, described in the earlier section on ML methods. The same data should never be used for both training and testing [Vihinen, 2012], since this introduces circularity [Grimm et al., 2015]. Even the use of highly similar cases, such as variants in the same protein or protein family, may bias the method and yield good test performance but poor performance on unseen real-life cases [Capriotti and Altman, 2011a; Bendl et al., 2014; Niroula et al., 2015]. A protein-level split, as sketched below, is one way to avoid this.
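A protein-level disjoint split can be obtained, for example, with grouped splitting; the sketch below uses scikit-learn's GroupShuffleSplit, and the variable names and proportions are our illustrative choices rather than the procedure of any particular predictor.

```python
# Sketch of a protein-level split: variants from the same protein (or family)
# go to either training or testing, never both. Data layout is assumed.
from sklearn.model_selection import GroupShuffleSplit

def protein_disjoint_split(variants, groups, test_size=0.2, seed=42):
    """variants: array-like of feature rows; groups: protein (or family) IDs,
    one per variant."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=seed)
    train_idx, test_idx = next(splitter.split(variants, groups=groups))
    return train_idx, test_idx
```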

Many variation effects are rare or very rare; therefore, it may be difficult to obtain sufficient numbers of cases for method training. It has been shown that the best predictor performance is obtained with a balanced training dataset, that is, one with equal numbers of positive and negative cases [Wei and Dunbrack, 2013]. If this is not done, the predictor will be biased toward the bigger class: for example, a predictor trained on a dataset containing more positive than negative cases will have a high TPR but a low TNR. The training dataset can be balanced by downsampling the bigger class, oversampling the smaller class, or assigning class weights [Vihinen, 2012]. It is also recommended to use a balanced test dataset for performance evaluation; however, imbalance does not have as great an impact in testing as in training, and balanced performance scores such as balanced accuracy should be used in such cases.
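The sketch below illustrates two of these remedies: downsampling the larger class before training, or assigning class weights within the learner. The data handling and the choice of classifier are arbitrary and shown only as examples.

```python
# Sketch: two ways to handle class imbalance in training.
import random
from sklearn.ensemble import RandomForestClassifier

def downsample(X, y, seed=0):
    """Randomly reduce the larger class to the size of the smaller one."""
    pos = [i for i, lab in enumerate(y) if lab == 1]
    neg = [i for i, lab in enumerate(y) if lab == 0]
    big, small = (pos, neg) if len(pos) > len(neg) else (neg, pos)
    rng = random.Random(seed)
    keep = small + rng.sample(big, len(small))
    return [X[i] for i in keep], [y[i] for i in keep]

# Alternative: keep all data but weight classes inversely to their frequency.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
```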

It is not uncommon in the literature for authors to present overblown statements about the quality of their methods [Vihinen, 2014a]; therefore, it is important to look for independent quality assessments. Another problem is the selective presentation of quality measures, showing only those for which the tool performs well. Many methods are not described in sufficient detail for readers to get a full picture of their quality. Many journals limit method details in articles, so the supplementary materials may be much longer than the actual article. In principle this is not problematic; however, quite often the supplements are not as carefully produced as the main text.

Human Mutation has published requirements for both prediction method developers and users [Vihinen, 2013] (see Boxes 1 and 2). For method users, these include describing the choice and details of the method employed, its version, parameters, program options, and so on. In addition, program-generated statistical values should be provided, and special attention needs to be paid to the interpretation of the obtained results. Without these details, readers cannot fully understand and evaluate the reported results.

Box 1. Publication guidelines for prediction method developers

Adapted from Vihinen [2013] and Human Mutation submission requirements.

General requirements

  • The method should be novel and have good performance; clearly improved performance; or describe novel useful parameters or applications and have reasonable performance

Method description

  • Has to be in sufficient detail, including (depending on the approach): description of training, optimization, input parameters and their selection, details of applied computational tools, hardware requirements, and time complexity

Datasets

  • Description of training and test datasets, preferably the largest high-quality benchmark datasets
  • Description of how data were collected, quality controls, details of sources, sizes of datasets
  • Distribution, preferably via a benchmark database

Method performance assessment

  • Report of all appropriate performance measures. In the case of binary classifiers, six measures characterizing the contingency matrix (Fig. 3), if possible, including ROC analysis
  • Statistically valid evaluation with sufficiently large and unbiased dataset(s) that contains both positive and negative cases and comparison to related methods
  • Training and test sets have to be disjoint
  • Dataset imbalance needs to be mitigated

Implementation

  • Detailed description of the method including user guidelines and examples
  • Availability of the program, either to download or as a Web service, and preferably open source
  • Batch submission possibility is highly recommended

Sometimes methods are used outside their initial application area, without knowledge of their suitability for the new application [for more details, see Vihinen, 2014a]. One example is the use of generic protein disorder predictors to investigate the effects of amino acid substitutions. The applicability of these tools for this purpose was recently assessed and found to be very poor; only PON-Diso, a tool dedicated to disorder-affecting variants, had reasonable performance [Ali et al., 2014].

Box 2. Publication guidelines for prediction method users

Adapted from Vihinen [2013] and Human Mutation submission requirements.

  • Choose a method designed for the task and with proven performance.
  • Sometimes it may be beneficial to use several methods but beware of possible problems due to using similar or same data.
  • Report in detail the method used including citation, URL, version, used parameters, and program options.
  • Additional information such as user-generated multiple sequence alignment should be made available.
  • Report output data plus P values, confidence intervals, and all other reliability measures provided by the tool(s).
  • Be careful with data interpretation.
  • Understand the principle of the method, its use, limitations, and applications.

Since many methods are available for predicting pathogenicity, it may be tempting to combine the results from several tools. This can be done, but with certain caveats: the consensus approach is problematic if the same or similar data dominate the constituent methods [see examples in Vihinen, 2014a]. Predictions from several methods have been combined in metapredictors such as Condel [Gonzalez-Perez and Lopez-Bigas, 2011], PON-P [Olatubosun et al., 2012], Meta-SNP [Capriotti et al., 2013], and PredictSNP [Bendl et al., 2014], which train ML methods to weigh the outputs of the individual predictors in different ways. Metapredictors should be trained on cases not used for training the constituent methods to avoid overfitting.
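A metapredictor can be as simple as a classifier trained on the scores of the constituent tools, as in the sketch below; the predictor scores, labels, and the use of logistic regression are illustrative and do not reproduce any of the cited methods.

```python
# Sketch of a metapredictor: a logistic regression trained to weigh the scores
# of individual tolerance predictors. All data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: scores from three constituent predictors for one variant.
X_train = np.array([[0.95, 0.80, 0.70],
                    [0.10, 0.20, 0.30],
                    [0.85, 0.60, 0.90],
                    [0.05, 0.15, 0.10]])
y_train = np.array([1, 0, 1, 0])          # 1 = pathogenic, 0 = neutral

meta = LogisticRegression().fit(X_train, y_train)
print(meta.predict_proba([[0.9, 0.7, 0.8]])[:, 1])  # combined pathogenicity probability
```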

Choice of Tools

When assessing variant pathogenicity, start by looking at known variants in databases, including LSDBs, ClinVar, HGMD (especially if a commercial license is available), and others (Table 1). Many variants have heterogeneous effects; thus, these databases may not be fully conclusive. When using prediction methods, start with generic tolerance predictors or, if available, gene/protein/disease-specific predictors (Supp. Table S1). In the next phase, mechanism/effect-specific tools can be applied to investigate the actual reason for the phenotype. Remember that no method is better than its input data; therefore, be aware of the quality of your variation data.

Choosing suitable methods can be difficult; however, it is important and will pay off. Reliable tools provide the best possible starting point for variation interpretation for both clinical and research purposes. Many tools are constantly being developed; therefore, it is important to follow the updates, as methods that were good a year or two ago may no longer have cutting-edge performance. The community should not accept black-box approaches; unfortunately, these are common in commercial solutions [Vihinen, 2015b], and many companies do not provide sufficient details to allow understanding of how their methods work and what their true performance is. For a checklist of items to consider when choosing methods, see Box 3.

Box 3. Checklist for choosing methods

Adapted from Vihinen [2012].

Items to check when estimating method performance and comparing performance of different methods:

  • Is there a comprehensive method description?
  • Have established databases and benchmarks been used (if available)? If not, are the used datasets publicly available?
  • Is the version of the method mentioned (if several versions exist)?
  • Is the contingency table available with all measures and values included?
  • Have the developers reported the six performance measures: sensitivity (or true positive rate, TPR), specificity (or true negative rate, TNR), positive predictive value, negative predictive value, accuracy, and Matthews correlation coefficient for binary predictors?
  • Has cross-validation or some other partitioning method been used in method testing? Use of an independent test dataset is highly recommended.
  • Are the training and test datasets disjoint?
  • Are the results in balance?
  • Has the ROC curve been drawn based on the entire test data? Inspect the ROC curve and AUC.
  • How does the method compare with others in all the measures?
  • Does the method provide probabilities for predictions?

In addition to choosing the right method, the user has to be able to interpret the obtained results and understand the effect of adjustable program parameters. This means understanding the principles of the algorithm. This is probably the most difficult of the requirements for many biologists and medical scientists; however, they are instrumental and not impossible to master. The performance of the methods should be properly and comprehensively reported in articles. As a user, be sure that the method is applicable for the intended purpose. Use only methods with proven performance, preferably tested on established benchmark datasets or at least tested on the available datasets. Studies addressing method performance are in this respect very useful. Pathogenicity predictors provide one type of data for variants. Carefully consider the importance and relevance of the predictions and combine with other kinds of data, whether clinical signs or laboratory results.

The Way Ahead

Progress in variation interpretation has been fast and is not likely to slow down, especially since the need is growing due to the increased application of sequencing technologies in research and clinics. Gene- and protein-specific tools will likely become more common and they can be trained when more data become available. There will still be a need for accurate tolerance predictors. Benchmark datasets need to be improved and expanded, developed for new application areas, and used for predictor training and testing. Only experimentally validated cases should be used for training. To continue to sell their products, companies need to become more open and unveil details about how their methods have been developed and how they perform on benchmark data.

Predictors should provide information that can be easily used for diagnostic decision making. Integration of various data items from several sources will be one of the themes in the future. This could create additional requirements for method developers, as all the data items may not always be available. Approaches to handle incomplete data have to be devised. Standardization of variation data and annotation has to be improved. The use of HPO and other ontologies for pathogenicity will allow easy utilization of phenotype data and VariO annotations will facilitate generation of accurate datasets for training and test purposes. Other standards are needed and they have to be applied strictly. An example of failed use of a standard data format is the VCF format, of which there are probably dozens of different versions that cause problems when transferring data between programs.

Genome sequencing is increasingly provided as a direct-to-consumer service. There are several problems related to data interpretation and especially in the communication of the significance of the findings. This should be done by professional counsellors, a service that these companies do not provide.

Protein and likely also RNA structural information will be increasingly used for the interpretation of variant effects. There are already tools available, but the throughput speed is usually not suitable for large-scale studies. Structural insight will allow for new ways of understanding variation effects as well as for designing therapies to treat pathologies. For example, many variants that have stable structures affect molecular interactions [Sahni et al., 2015].

Bioinformatic tools are increasingly compiled into pipelines and workflows, such as Galaxy [Goecks et al., 2010], Taverna [Wolstencroft et al., 2013], and Anduril [Ovaska et al., 2010]. These have numerous benefits, especially in routine use, but users also have to be aware of situations in which a pipeline is inappropriate. Cloud computing has been a buzzword in recent times. For most users of variation interpretation tools, it does not matter where the data are analyzed, but for clinical applications, security is a prime concern. Due to potential security and privacy issues with distributed computing, interpretation via cloud services may need to be approached according to local and national laws. Improvements in online security are, however, being made at a constant pace.

Quality of interpretation in prediction algorithms will likely be the issue of greatest importance in the future. In fact, it applies to the whole process of genomic medicine. The smallest possible error rate must be achieved in sequencing, variant calling, annotation, and interpretation. Prediction methods have to be highly accurate, and at the same time very fast, to cope with exome and complete genome sequence data for numerous individuals. Interpretation has become and likely will remain the bottleneck for the use of variation data in clinical practice. Intelligent solutions are already available and will be further improved in the future. Computer-assisted variation interpretation will be part of the toolbox for everyone working on sequencing data generation and interpretation.

Acknowledgments

We thank Gerard Schaafsma for proofreading the manuscript.

Disclosure statement: There is no conflict of interest to declare.
