Phenotype-driven approaches to enhance variant prioritization and diagnosis of rare disease
Abstract
Rare disease diagnostics and disease gene discovery have been revolutionized by whole-exome and genome sequencing but identifying the causative variant(s) from the millions in each individual remains challenging. The use of deep phenotyping of patients and reference genotype−phenotype knowledge, alongside variant data such as allele frequency, segregation, and predicted pathogenicity, has proved an effective strategy to tackle this issue. Here we review the numerous tools that have been developed to automate this approach and demonstrate the power of such an approach on several thousand diagnosed cases from the 100,000 Genomes Project. Finally, we discuss the challenges that need to be overcome if we are going to improve detection rates and help the majority of patients that still remain without a molecular diagnosis after state-of-the-art genomic interpretation.
1 INTRODUCTION
Rare diseases (RDs) affect a substantial proportion of the population, estimated at 6% by one study, although exact numbers vary considerably depending on definitions of RD, methodologies, and sources of data (Ferreira, 2019; Haendel et al., 2019). In addition, most RD patients undergo considerable medical odysseys before a diagnosis (Splinter et al., 2018). Next-generation sequencing has started to transform RD diagnostics and research, and numerous programs have demonstrated improved diagnostic yields from large-scale whole-exome and genome sequencing (WES and WGS) studies as well as efficient identification of novel disease−gene associations: Care4Rare (Dyment et al., 2015), Centers for Mendelian Genomics (Posey et al., 2019), Undiagnosed Diseases Network (Splinter et al., 2018). In particular, the UK 100,000 Genomes Project has transformed the way that genomics is used in the UK's National Health Service (NHS) for RDs, with WGS now the standard genetic test for many types of RD (Smedley et al., 2021).
Despite these successes, the causative mutations in the genomes of the majority of patients remain undetected after WES or WGS, with diagnostic yields of 25%–50% (Clark et al., 2018; de Ligt et al., 2012; Rauch et al., 2012; Tammimies et al., 2015; Y. Yang et al., 2013, 2014; Zhu et al., 2015). Numerous variants in an affected individual typically remain after filtering WES or WGS data using standard strategies for RD candidacy. These include identifying variants that: (i) are extremely rare according to population sequencing databases such as gnomAD (Karczewski et al., 2020), (ii) segregate with disease in extended pedigrees, and (iii) are predicted to be pathogenic by in silico algorithms such as REVEL (Ioannidis et al., 2016), MVP (Qi et al., 2018), PolyPhen-2 (Adzhubei et al., 2010), CADD (Kircher et al., 2014), MutationTaster (Schwarz et al., 2010), and SIFT (P. C. Ng & Henikoff, 2001). In high-throughput and often under-resourced healthcare settings, the causative variant can often be overlooked in this background. One increasingly adopted approach is to collect detailed clinical phenotype data on each affected individual using the Human Phenotype Ontology (HPO; Köhler et al., 2021) and compare that to reference phenotypic knowledge associated with each candidate variant and gene to narrow down the search further. The majority of the projects described above have successfully used this approach to improve diagnostic outcomes and many groups have built computational frameworks and pipelines to automate the phenotypic comparisons (summarized in Table 1).
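The three standard filtering strategies listed above (rarity, segregation, predicted pathogenicity) can be sketched in a few lines. This is a minimal illustration only, not the implementation of any tool discussed here: the `Variant` fields, the gene names, the thresholds (allele frequency of 0.1%, pathogenicity score of 0.5), and the example records are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str             # hypothetical gene label
    allele_freq: float    # highest population allele frequency observed (e.g. from gnomAD)
    pathogenicity: float  # in silico score scaled to 0..1 (e.g. a REVEL-like score)
    segregates: bool      # True if the genotype pattern is consistent with the pedigree


def passes_standard_filters(v, max_af=0.001, min_pathogenicity=0.5):
    """Apply the three filters: rare, segregating, predicted pathogenic."""
    return (
        v.allele_freq <= max_af
        and v.segregates
        and v.pathogenicity >= min_pathogenicity
    )


variants = [
    Variant("GENE_A", 0.00001, 0.92, True),
    Variant("GENE_B", 0.05, 0.95, True),     # too common in the population
    Variant("GENE_C", 0.0, 0.10, True),      # predicted benign
    Variant("GENE_D", 0.0002, 0.80, False),  # does not segregate with disease
]

candidates = [v.gene for v in variants if passes_standard_filters(v)]
print(candidates)
```

In a real genome many variants survive such filters, which is why the phenotype comparison described next is needed to rank the survivors.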
Software | Low-throughput web access | High-throughput programmatic access | GRCh38 analysis | Family-based analysis | SNV analysis | SV analysis | Noncoding analysis | Novel disease gene discovery through model organism, pathway, PPI data etc. | Natural language processing of latest literature | Reference data update (first published) |
---|---|---|---|---|---|---|---|---|---|---|
Exomiser framework (Smedley et al., 2015) including PhenIX (Zemojtel et al., 2014) and Genomiser (Smedley et al., 2016) | Yes | Yes | Yes | Yes | Yes | Yes | Yes (Genomiser only) | Yes | No | 2021 (2014) |
AMELIE (Birgmeier et al., 2020) | Yes | Yes | No | Yes | Yes | No | No | No | Yes | 2021 (2020) |
AnnotSV (Geoffroy et al., 2021) | Yes | Yes | Yes | No | Yes | Yes | Yes | No | No | 2021 (2018) |
SvAnna (Danis et al., 2021a) | No | Yes | Yes | No | No | Yes | Yes | No | No | 2021 (2021) |
LIRICAL (Robinson et al., 2020) | No | Yes | Yes | No | Yes | No | No | No | No | 2021 (2021) |
xRare (Q. Li, Zhao, et al., 2019) | No | Yes | No | No | Yes | No | No | Yes | No | 2018 (2019) |
VARPP (Anderson et al., 2019) | No | Yes | No | No | Yes | No | No | No | No | 2019 (2019) |
DeepPVP (Boudellioua et al., 2019) | No | Yes | No | No | Yes | No | No | Yes | No | 2018 (2019) |
MutationDistiller (Hombach et al., 2019) | Yes | No | No | No | Yes | No | No | Yes | No | 2018 (2019) |
GenIO (Koile et al., 2018) | Yes | No | No | No | Yes | No | No | No | No | 2017 (2018) |
wAnnovar (H. Yang & Wang, 2015) | Yes | No | Yes | No | Yes | No | No | Yes | No | 2017 (2015) |
QueryOR (Bertoldi et al., 2017) | Yes | No | No | Yes | Yes | No | No | Yes | No | 2017 (2017) |
BierApp (Alemán et al., 2014) | Yes | No | No | Yes | Yes | No | No | No | No | 2016 (2014) |
OVA (Antanaviciute et al., 2015) | Yes | No | No | No | Yes | No | No | Yes | No | 2015 (2015) |
Phen-Gen (Javed et al., 2014) | No | Yes | No | Yes | Yes | No | Yes | Yes | No | 2013 (2014) |
eXtasy (Sifrim et al., 2013) | Yes | Yes | No | No | Yes | No | No | Yes | Yes | 2013 (2013) |
- Note: Peer-reviewed, freely available (to academics/nonprofits at a minimum) software offering HPO-based prioritization of variants from rare disease case-based VCF files. Software was reviewed for a range of features required for accurate, up-to-date interpretation at scale.
- Abbreviations: SNV, single nucleotide variant; SV, structural variant.
A whole range of computational algorithms has been deployed in these tools incorporating natural language processing, machine learning, and artificial intelligence including deep neural networks, semantic similarity, and statistical probability approaches such as likelihood ratios. Each of the published tools also varies in terms of licensing, whether high-throughput programmatic use is possible and whether they support features such as human genome assembly GRCh38 and family-based analysis (Table 1). However, only a handful of tools, including Exomiser, AMELIE, and LIRICAL, show evidence of active maintenance with underlying databases updated since 2019. Caution should be exercised when using the other tools as any of the numerous, recently discovered new disease−gene associations will likely not be detectable. Further illustrating the problems with long-term maintenance of academic software, many of the tools were no longer available at their published locations and are therefore not included in Table 1: PhenoPro (Z. Li, Zhang, et al., 2019), OMIMExplorer (James et al., 2016), Phenoxome (Wu et al., 2019).
In this article, we first explore how phenotype-driven methods can improve diagnostic yields for RD using a large cohort of 4877 affected individuals who had received a molecular diagnosis (i.e., solved cases) from the 100,000 Genomes Project (Turnbull et al., 2018) and a set of 184 causative structural variants and corresponding phenotypic data curated from the literature. We then discuss some of the future challenges in the field that need to be overcome to address the overwhelming numbers of RD patients that still do not receive a molecular diagnosis after the current standard of care analysis of their WES and WGS samples.
2 CLINICAL PHENOTYPES ARE CRITICAL FOR AUTOMATED DETECTION OF RD DIAGNOSES
We explored the potential of phenotype-driven variant prioritization software on 4877 molecularly diagnosed cases from the 100,000 Genomes Project. This cohort represents diagnoses in some 1315 different genes for probands recruited under eligibility criteria for 257 broad clinical indications across all major branches of RD, for example, cardiovascular, ciliopathies, dermatological, dysmorphic and congenital abnormalities, endocrine, gastroenterological, growth, hematological, hearing, metabolic, neurology and neurodevelopmental, ophthalmological, renal and urinary tract, respiratory, rheumatological, skeletal, and finally tumor syndromes. Varying numbers of affected and unaffected family members were recruited and sequenced alongside the proband, bringing the total number of genomes analyzed in this cohort to 10,887. HPO terms were collected from the recruiting clinicians for each participant: median of 4 positive terms and range 1−61 per participant. Previous studies have shown that having more HPO terms per patient increases the chances of a diagnostic variant being ranked top by phenotype-based variant prioritization tools, but using more than five terms only improves performance slightly (Thompson et al., 2019).
To analyze this large cohort we required software that could be run on both GRCh37 and GRCh38 single nucleotide variant (SNV)/insertion-deletion (indel) Variant Call Format (VCF) files (Danecek et al., 2011), offered local installation in the Genomics England research environment, as well as high-throughput, programmatic querying. Exomiser and LIRICAL were the only two tools that satisfied these conditions and the performance of both is shown in Figure 1. Overall, Exomiser was able to prioritize 82.6%, 91.3%, 92.4%, and 93.6% of the 4877 diagnoses in the top, top 3, top 5, and top 10 ranked candidates. This demonstrates the effectiveness of a phenotype-driven approach, across the whole breadth of RD, in automatically detecting the diagnostic variant(s) from the several million variants in the family WGS samples. Performance was similar for the more challenging singleton samples (N = 1591), demonstrating that sequencing of family members is not necessarily critical to identify a disease-causative variant when deep, clinical phenotypes are collected. LIRICAL can currently only be run on singleton samples and, despite showing slightly reduced performance relative to Exomiser, still achieved efficient prioritization of diagnoses with 85.2% of diagnoses detected in the top 5 compared to 94.3% by Exomiser for these samples. Exomiser is able to use local frequency data available for the 100,000 Genomes Project to remove many false-positive variant calls, which likely explains much of this difference in the recall. Where diagnoses were not recalled by the automated software, this was due to variants being filtered out as they were flagged as low quality in the VCF (1%), had unusually high minor allele frequencies (2%), or were incompletely penetrant (3%).
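The "top, top 3, top 5, and top 10" figures quoted above are top-k recall values. As a minimal sketch of how such a metric is computed, assuming each solved case is reduced to the rank at which the known diagnosis appeared (or `None` if it was filtered out), with invented example ranks rather than project data:

```python
def top_k_recall(ranks, ks=(1, 3, 5, 10)):
    """Fraction of cases whose known diagnosis is ranked at or above each k.

    ranks: one entry per solved case, giving the rank of the diagnosed
    candidate, or None when the diagnosis was not returned at all.
    """
    n = len(ranks)
    return {
        k: sum(1 for r in ranks if r is not None and r <= k) / n
        for k in ks
    }


# Invented ranks for ten hypothetical cases
example_ranks = [1, 1, 2, 1, 4, None, 1, 7, 1, 12]
recall = top_k_recall(example_ranks)
print(recall)
```

Cases where the diagnosis was removed by a filter still count in the denominator, so they lower recall at every k.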

Exomiser, like most of the methods described above, combines variant- and phenotype-associated data into a single combined score or probability (Figure 2). The variant-based filtering and scoring utilize minor allele frequencies from local and population sequencing sources, in silico predicted pathogenicity, variant molecular consequence for the gene, and segregation across affected and unaffected members. The phenotype-based scoring is obtained from the semantic similarity between the proband's phenotype and the phenotypic profiles of human diseases and model organisms associated with the gene or nearby neighbors in a protein−protein interaction network.
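To illustrate the idea of semantic similarity between a patient's phenotype and a disease profile, the sketch below scores overlap between two term sets after expanding each term to include its ancestors in an ontology. The hierarchy, term names, and the use of a Jaccard index are assumptions made for illustration; this is a simple stand-in and is not claimed to be the measure used by Exomiser or any other tool in Table 1.

```python
# Toy is-a hierarchy (child -> parents). Names are invented, not real HPO terms.
PARENTS = {
    "root": [],
    "eye": ["root"],
    "heart": ["root"],
    "cataract": ["eye"],
    "glaucoma": ["eye"],
    "septal_defect": ["heart"],
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def closure(terms):
    """Union of the ancestor sets of every term in a profile."""
    result = set()
    for t in terms:
        result |= ancestors(t)
    return result


def jaccard_similarity(patient_terms, disease_terms):
    """Shared ancestors divided by all ancestors of either profile."""
    a, b = closure(patient_terms), closure(disease_terms)
    return len(a & b) / len(a | b)


patient = ["cataract", "septal_defect"]
sim_x = jaccard_similarity(patient, ["glaucoma", "septal_defect"])
sim_y = jaccard_similarity(patient, ["glaucoma"])
print(sim_x, sim_y)
```

Because ancestors are included, a patient term and a disease term that differ but share a parent (here two eye abnormalities) still contribute to the score, which is the property that makes ontology-based matching tolerant of imprecise phenotyping.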

The importance of combining both variant and phenotypic data is seen in Figure 3a where the recall and precision for detecting the diagnoses in the 4877 cases are shown across the full range of Exomiser's variant, phenotype, and combined score cutoffs. Although a high recall (0.92) can be obtained using a variant score threshold of 0.8, the precision is poor (0.04) meaning an average of 25 variants per case have to be reviewed by a clinical geneticist before a report can be issued. In contrast, a phenotype score cutoff of 0.6 can be used to achieve a better precision (0.15) but with a considerably reduced recall (0.80). Combining both into the Exomiser score with a threshold of 0.7 allows accurate recall (0.89) with reasonable precision (0.15). Figure 3b summarizes how combining variant and phenotype data into an Exomiser score is critical for efficient variant prioritization with 82% of diagnoses recalled as the top hit compared to only 33% and 55% using the variant and phenotype scores respectively. In practice, we recommend users review the top 5 Exomiser candidates regardless of the score where a recall and precision of 0.92 and 0.18, respectively, can be obtained (Figure 3c).
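The relationship between a score cutoff, recall, precision, and reviewing workload can be made concrete with a small sketch. The candidate scores below are invented, and the function is an illustrative assumption rather than the evaluation code used for Figure 3. The reciprocal of precision gives the average number of candidates reviewed per true diagnosis found, which is how a precision of 0.04 corresponds to roughly 25 variants.

```python
def precision_recall(cases, threshold):
    """Precision and recall of known diagnoses at a score cutoff.

    cases: list of cases, each a list of (score, is_diagnosis) candidates.
    """
    true_pos = retained = total_diagnoses = 0
    for candidates in cases:
        for score, is_diagnosis in candidates:
            total_diagnoses += is_diagnosis
            if score >= threshold:
                retained += 1
                true_pos += is_diagnosis
    precision = true_pos / retained if retained else 0.0
    recall = true_pos / total_diagnoses if total_diagnoses else 0.0
    return precision, recall


# Three hypothetical solved cases with invented candidate scores
cases = [
    [(0.95, True), (0.85, False), (0.40, False)],
    [(0.75, True), (0.72, False), (0.10, False)],
    [(0.60, True), (0.90, False)],
]

precision, recall = precision_recall(cases, threshold=0.7)
reviewed_per_diagnosis = 1 / precision
print(precision, recall, reviewed_per_diagnosis)
```

Raising the threshold removes false candidates (higher precision) but eventually also removes true diagnoses such as the third case here (lower recall), which is the trade-off traced out across cutoffs in Figure 3a.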

For many of the tools, the fact that they have not been upgraded to GRCh38 since publication prevented their evaluation here, and this is likely a reflection of the challenges of software maintenance in academia. Although it is difficult to predict the relative performance of the tools if they could be updated to work with the latest genome assembly and disease data, we do expect that, in general, all tools would demonstrate that a combined variant and phenotype-based approach is highly effective.
3 CHALLENGES IN RD INTERPRETATION
Although most of the phenotype-based variant prioritization methods have demonstrated impressive recall and precision on known molecular diagnoses, there still remain a proportion of those diagnoses that are not detected at all, for example, ~5% in the 100,000 Genomes Project benchmarking of Exomiser, as well as the much larger problem of most patients still not receiving a molecular diagnosis after a comprehensive analysis of their WES/WGS data, for example, 75% of 100,000 Genomes Project probands (Smedley et al., 2021). Better methods are needed to detect these missed diagnoses, methods that: (i) improve the detection and prioritization of noncoding and structural variants, (ii) identify causative variants in genes that have not previously been associated with human disease, and (iii) deal with more complex genetic scenarios such as incomplete penetrance. Improvements are also required to allow easier reinterpretation of unsolved cases, simpler sharing of phenotype data, and diagnostics in a prenatal context. Here we will discuss the latest advances in these areas.
3.1 Prioritization of noncoding variants
A substantial proportion of RD diagnoses are likely to involve noncoding variants, for example, 4% of molecular diagnoses reported in the 100,000 Genomes Project pilot paper, demonstrating that WGS can accurately detect such diagnoses (Smedley et al., 2021). However, most pipelines are not routinely pursuing such diagnoses, largely due to the problem of overwhelming numbers of variants to interpret and validate. Phenotype-based algorithms such as Genomiser (Smedley et al., 2016, part of the Exomiser framework) can automatically highlight candidate variants across the whole genome including enhancers, promoters, untranslated regions, and introns; previous benchmarking revealed that 77% of known noncoding molecular diagnoses could be recalled as the top candidate in WGS samples (Smedley et al., 2016). However, the issue of how to efficiently perform functional validation of novel noncoding variants limits the wider application of such approaches.
Researchers have therefore focussed their efforts on variants that change mRNA splicing as these are much more amenable to high-throughput validation through techniques such as transcriptomics. The simplest definition of a splice variant includes variants that affect the most conserved AG/GT dinucleotides of the intron termini. However, variants at other splice site positions or variants located outside of the splice sites were also shown to cause defective splicing by introducing cryptic splice sites or by disrupting splicing regulatory element binding sites (Boichard et al., 2008). Recent algorithms such as SQUIRLS (Danis et al., 2021b) and SpliceAI (Jaganathan et al., 2019) have revolutionized the detection of such variants in WGS. Being able to integrate these new variant-based algorithms into the phenotype-based tools promises to deliver many additional diagnoses, as in some genes up to 50% of all disease-causing variants are splice variants (Ars et al., 2000). Exomiser allows new variant deleteriousness or pathogenicity algorithms to be immediately incorporated into the analysis as tabix-format score files. Initial exploration of this approach using SpliceAI and SQUIRLS on unsolved cases from the 100,000 Genomes Project has revealed tens of thousands of predicted pathogenic, cryptic splice variants within genes known to be associated with the patient's condition. It can be anticipated that intersecting these candidate variants with large-scale transcriptomic analysis will allow the detection of many new molecular diagnoses. The direct integration of transcriptomic analysis into existing phenotype-based variant prioritization software would also make this process much more efficient and powerful, building on existing gene prioritization approaches such as GADO (Deelen et al., 2019).
3.2 Prioritization of structural variants
Similarly, many unsolved RD cases are thought to involve structural variants (SVs), either alone or in combination with SNVs/indels. Even with the current limitations of calling SVs from short-read WGS samples, 8% of the diagnoses reported by the 100,000 Genomes Project involved SVs (Smedley et al., 2021).
The challenge with SV prioritization ultimately stems from the primary technological challenges of sequencing, assembly, and calling of structural variants compared to short sequence variants, especially using short-read technologies (Mahmoud et al., 2019). SV callers for both long- and short-read technologies have varied performance depending on the class of SV they are calling, with insertions being a particularly troublesome class for reliable detection (Kosugi et al., 2019). In general, long-read sequencing offers improved detection of SVs. However, while whole-genome short-read sequencing costs have dramatically reduced over the past decade, long-read costs are still beyond what would be tolerated for routine diagnosis.
While the VCF specification has support for describing SVs, it is less well-specified compared to sequence variants, with several open tickets (https://github.com/samtools/hts-specs/issues/544) under discussion for v4.4. Moreover, callers often follow the specification in an idiosyncratic manner, which makes it exceptionally difficult for variant prioritization software to reliably utilize the calls. One of the most powerful metrics for judging variant pathogenicity is variant frequency, as pathogenic variants are often absent or present at very low frequencies in databases such as gnomAD. gnomAD-SV (Collins et al., 2020) now offers a reference database produced from high-coverage sequencing to perform this task for SVs as well. However, the recent gnomAD-SV data set was created from 14,891 individuals and is far smaller than the original gnomAD SNV/indel data set with around 140,000 individuals. This reduces the power of frequency-based filtering with gnomAD-SV. Other resources such as DECIPHER, DGV, and dbVar also contain SVs but data such as SV type, insertion length, and copy number are not always recorded consistently within or between resources. A further problem with trying to utilize these resources is that SVs are harder to categorize, far longer, and often have imprecise boundaries when compared to small variants, and are therefore significantly harder to look up in reference databases. Guidelines for reporting clinical pathogenicity of structural variants have only been introduced recently (Riggs et al., 2020) compared to the long-established ones for SNVs (Richards et al., 2015), leading to fewer high-quality clinical assertions in ClinVar (Landrum et al., 2018) for tools to reference.
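Because SV breakpoints are imprecise, an exact coordinate lookup of the kind used for SNVs rarely works. One common workaround, assumed here for illustration and not taken from any tool described in this review, is to treat two SVs as the same event when they are of the same type and their intervals show a high reciprocal overlap. The 80% threshold and the example coordinates below are arbitrary assumptions.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Smaller of the two fractions of each interval covered by the overlap."""
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def same_event(query, reference, min_overlap=0.8):
    """query/reference: (chrom, start, end, sv_type) tuples."""
    q_chrom, q_start, q_end, q_type = query
    r_chrom, r_start, r_end, r_type = reference
    return (
        q_chrom == r_chrom
        and q_type == r_type
        and reciprocal_overlap(q_start, q_end, r_start, r_end) >= min_overlap
    )


query = ("1", 10_000, 20_000, "DEL")
ref_close = ("1", 10_500, 20_400, "DEL")  # slightly shifted breakpoints
ref_far = ("1", 15_000, 40_000, "DEL")    # much larger, partly overlapping event

print(same_event(query, ref_close), same_event(query, ref_far))
```

Requiring the overlap to be reciprocal prevents a small query SV from matching a much larger reference SV that merely contains it, which would otherwise inflate the apparent population frequency of the query.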
Despite these challenges, several phenotype-based prioritization tools have recently emerged that offer SV prioritization. SvAnna (Danis et al., 2021a) focuses on SVs called from long-read technologies. AnnotSV (Geoffroy et al., 2021) in contrast has a short-read focus. The latest release of Exomiser (13.0.0) allows phenotype-based prioritization of SVs alongside SNVs/indels so that the impact of the SVs on the coding regions of one or more genes is assessed alongside any rare, predicted damaging SNVs/indels present under various segregation models for each affected individual.
The ability of Exomiser to prioritize known molecular diagnoses involving an SV is shown in Figure 4. Previously described phenopackets (https://phenopacket-schema.readthedocs.io/en/latest/index.html) representing curated phenotypic and pedigree data from the literature (Danis et al., 2021a) were used as input to Exomiser alongside corresponding VCFs containing the curated variant(s) added to a control WGS VCF file based on either short- or long-read technologies. The former used an Illumina short-read sample with SVs called using Manta and Canvas. For the long-read benchmarking we used a Genome in a Bottle (GIAB) sample generated by PacBio sequencing and pbsv calling (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/PacBio_pbsv_05212019/HG002_GRCh38.pbsv.vcf). Exomiser was able to prioritize 74% of the SV diagnoses as the top-ranked candidate and 89% in the top 5 for the short-read samples. The long-read samples were more problematic with performance dropping to 61% and 78% for the top and top 5 ranked candidates, but still showing relatively effective prioritization of SV diagnoses. Only 14 SV diagnoses were completely missed by Exomiser: nine involving SVs that disrupt noncoding regions which are not currently handled and five unspecified breakend (BND)-type SVs that again are not currently supported. Twenty-eight of the 184 curated known SV diagnoses involve an SNV/indel in compound heterozygosity with an SV and in all cases Exomiser was able to detect both variants and prioritize the diagnosis effectively in the top 3 ranked candidates. The same phenopackets have already been assessed for SvAnna and AnnotSV using a different set of long-read, pbsv called VCFs and showed 61% and 86% in the top and top 5 candidates for SvAnna and 60% and 65% for AnnotSV (Danis et al., 2021a).

3.3 Incomplete penetrance
Exomiser and Phen-Gen are the only phenotype-based variant prioritization tools to offer the option of allowing for incomplete penetrance. In Exomiser 13.0.0, the user can configure the analysis to retain variants in unaffected family members instead of removing them as part of the standard filtering pipeline. Phen-Gen has a stringency setting to adjust the level of penetrance. Allowing for incomplete penetrance obviously leads to more candidates to review per case. Both Exomiser and Phen-Gen apply these settings across the genome and future improvements to restrict to a curated set of genes with known incomplete penetrance would reduce the number of candidates and improve performance. We were able to benchmark Exomiser on 35 families from the 100,000 Genomes Project with incompletely penetrant molecular diagnoses and show that 54% were still detected as the top-ranked candidate, 77% in the top 3 with a further 14% found outside the top 10. Phen-Gen benchmarking was not possible on these samples as GRCh38 analysis is not enabled. Without accounting for the incomplete penetrance, none of these diagnoses would have been detected.
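The effect of allowing for incomplete penetrance can be sketched as a relaxation of a segregation check. The function below is a deliberately simplified dominant-model check written for illustration; the function name, its parameters, and the example family are assumptions, and it does not reproduce the Exomiser or Phen-Gen implementations.

```python
def segregates_dominant(carriers, affected, allow_incomplete_penetrance=False):
    """Check whether a variant is compatible with a dominant model.

    carriers: sample -> True if the sample carries the variant.
    affected: sample -> True if the sample is clinically affected.
    """
    for sample, carries in carriers.items():
        if affected[sample] and not carries:
            return False  # every affected member must carry the variant
        if carries and not affected[sample] and not allow_incomplete_penetrance:
            return False  # unaffected carrier rules the variant out under full penetrance
    return True


# Hypothetical trio: the unaffected mother carries the variant seen in the proband
affected = {"proband": True, "mother": False, "father": False}
carriers = {"proband": True, "mother": True, "father": False}

strict = segregates_dominant(carriers, affected)
relaxed = segregates_dominant(carriers, affected, allow_incomplete_penetrance=True)
print(strict, relaxed)
```

Under the strict check this variant would be discarded, which matches the observation above that none of the incompletely penetrant diagnoses would be detected without the relaxed setting; the cost is that many more inherited variants survive filtering.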
3.4 Novel disease−gene discovery through phenotype-based methods
The usual route for disease gene discovery involves identifying pathogenic/likely pathogenic variants in the same gene in several unrelated families with the same phenotype and then performing functional validation. The phenotype-based tools that incorporate model organism data, pathway, and/or protein-protein network approaches can prioritize variants in genes that have not previously been associated with human disease and potentially support this functional validation step. This is enabled by making use of existing knowledge, for example, from large-scale efforts such as the International Mouse Phenotyping Consortium (Lloyd et al., 2020) that are characterizing the function of every protein-coding gene through systematic mouse knockouts and phenotyping. For example, the Children's Hospital Los Angeles demonstrated the successful discovery of diagnoses in novel disease genes using a semi-automated pipeline involving Exomiser (Ji et al., 2019). In another example, ANKRD17 was identified as a candidate gene for cases of intellectual disability in the 100,000 Genomes Project through the identification of highly ranked Exomiser candidate de novo variants based on protein−protein interaction evidence. Subsequent identification of further cases worldwide and functional characterization have now confirmed this association (Chopra et al., 2021).
3.5 Reinterpretation of unsolved cases
- (i) programmatic access allowing high-throughput analysis and, in the case of Exomiser and LIRICAL, simple, local installation so security around data transfer is not an issue (local reinstallation of latest versions required though). AMELIE offers the advantage of natural language processing of the latest literature to identify reference genotype to phenotype knowledge. Other tools such as Exomiser and LIRICAL rely on the curation of the latest disease−gene associations by OMIM and Orphanet and the associated phenotypes by the HPO team before this knowledge is available to the software;
- (ii) simple configuration, including sensible presets for exome- or genome-based analysis in the case of Exomiser, so only the bare minimum of patient-level information needs to be entered;
- (iii) standardized input using VCF files, HPO terms, and, in the case of Exomiser and LIRICAL, compatibility with the Global Alliance for Genomics and Health (GA4GH) approved Phenopacket standard that will allow future direct connection to electronic health record (EHR) systems;
- (iv) fast run times (<30 s for a WES, <5 min for a WGS) making regular reinterpretation feasible;
- (v) JavaScript Object Notation (JSON) output for incorporation into bioinformatics pipelines and, in the case of Exomiser and LIRICAL, user-friendly HyperText Markup Language (HTML) output.
3.6 Standardized phenotype representation
Phenotype-driven RD genome analysis tools have benefited enormously from standardized formats for capturing genomic variation from next-generation sequencing technologies (VCF), yet until recently had no analog for describing patient phenotype. While the Human Phenotype Ontology (Köhler et al., 2021) has become the accepted standard for capturing patient phenotype from deep-phenotyping for use in analysis, there is no standardized way of conveying this information to bioinformatics tools. Most tools rely on a simple list of phenotype terms or a disease identifier (e.g., from OMIM (Amberger et al., 2019), Orphanet (Pavan et al., 2017), or MONDO (http://obofoundry.org/ontology/mondo)) to try and convey this information, but this method cannot convey a complete description of an individual's phenotype including modifiers such as severity, laterality, and age of onset for each phenotype as well as their progression over time. The GA4GH Phenopacket (https://phenopacket-schema.readthedocs.io/en/v2/) aims to solve this by providing a standardized, structured format for describing patient-level phenotypic features, allowing for a rich description of each feature including the absence, severity and time of onset. Since its initial release, several tools (LIRICAL, SvAnna, Exomiser, Phen2Gene (Zhao et al., 2020)) support the standard which offers significantly increased portability of phenotype data between these tools.
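As a rough illustration of what a structured phenotype description adds over a flat list of terms, the sketch below builds a Phenopacket-like record with one observed feature carrying an onset and one explicitly excluded feature. The field names follow our reading of the version 2 schema and should be checked against the schema documentation; the identifiers, onset value, and metadata values are hypothetical, and the record is not validated (a complete Phenopacket would, for example, also list the ontology resources used).

```python
import json

phenopacket = {
    "id": "example-case-1",            # hypothetical identifier
    "subject": {"id": "example-fetus-1"},  # hypothetical identifier
    "phenotypicFeatures": [
        {
            # Observed feature with a time of onset (value is hypothetical)
            "type": {"id": "HP:0011429", "label": "Short fetal humerus length"},
            "onset": {"gestationalAge": {"weeks": 20}},
        },
        {
            # Feature that was looked for and found to be absent
            "type": {"id": "HP:0025116", "label": "Fetal distress"},
            "excluded": True,
        },
    ],
    "metaData": {
        "created": "2021-01-01T00:00:00Z",   # hypothetical timestamp
        "createdBy": "example-curator",      # hypothetical author
        "phenopacketSchemaVersion": "2.0",
    },
}

serialized = json.dumps(phenopacket, indent=2)

# A flat term list, as accepted by most tools, keeps only the observed term IDs
observed_terms = [
    feature["type"]["id"]
    for feature in phenopacket["phenotypicFeatures"]
    if not feature.get("excluded", False)
]
print(serialized)
print(observed_terms)
```

Reducing the record to `observed_terms` shows what is lost with a simple list: the excluded finding and the onset information disappear, even though both can help discriminate between candidate diseases.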
3.7 Interpretation of prenatal cases
While becoming routine for pediatric and adult diagnosis, the use of phenotype-driven RD analysis for prenatal diagnosis is a developing area. Currently, the HPO has 151 terms in the subhierarchy starting from Abnormality of prenatal development or birth [HP:0001197]. Out of the 199,197 annotations to 7902 Mendelian diseases currently present in the HPO, roughly 0.5% refer to terms from the subhierarchy HP:0001197, such as Fetal distress [HP:0025116] or Short fetal humerus length [HP:0011429]. However, current knowledge of the prenatal manifestations of Mendelian disease remains limited. Prenatal genomic testing is becoming increasingly common for fetuses with suspected Mendelian disease but the interpretation of expanded prenatal sequencing is reliant on deeper fetal phenotyping (Gray et al., 2019). The HPO project is currently conducting a series of workshops in this area to expand the depth and breadth of relevant coverage.
4 CONCLUSIONS
WES and WGS are now widely used in both diagnostic and research settings. A large driver for the successful adoption of these strategies has been the collection of deep phenotype data using HPO terms and software allowing automated prioritization of variants. Without these tools, clinicians and researchers would be overwhelmed, in most cases, by the sheer number of candidate variants to interpret. Phenotype-driven RD diagnosis is now a part of routine clinical practice in the United Kingdom and many other healthcare systems. There are still numerous challenges to overcome before we can efficiently deliver on the promise of genomics to fully transform the diagnosis and eventual treatment of RD. Further development and adoption of standards are needed to connect EHR systems and the variant prioritization tools. More research and development of the tools are needed to identify the overlooked molecular diagnoses that are present in existing genomic samples as well as those that will emerge through further advances in omics technologies. However, we have come a remarkable distance in the last decade since the first reported WES successes (S. B. Ng et al., 2010), and we expect considerable advances in all these areas in the next few years.
ACKNOWLEDGMENTS
This study was supported by the National Institutes of Health (NIH) grants 1R24OD011883, U54 HG006370, and NIH, National Institute of Child Health and Human Development 1R01HD103805-01. This study was made possible through access to the data and findings generated by the 100,000 Genomes Project. The 100,000 Genomes Project is managed by Genomics England Limited (a wholly owned company of the Department of Health and Social Care). The 100,000 Genomes Project uses data provided by patients and collected by the National Health Service as part of their care and support.
CONFLICTS OF INTEREST
Julius Jacobsen and Damian Smedley declare they previously acted as part-time consultants for Congenica Ltd. The other authors declare no other potential conflicts of interest.
DATA AVAILABILITY STATEMENT
All data described in the paper are already provided in the paper except for access to the 100,000 Genomes Project samples which is by application to Genomics England.