Actionable Genes, Core Databases, and Locus-Specific Databases
Contract grant sponsors: Aix-Marseille Université, INSERM, European Union Seventh Framework Program (grant no. 305444).
For the Next Generation Sequencing special issue
ABSTRACT
Adoption of next-generation sequencing (NGS) in a diagnostic context raises numerous questions with regard to identification and reports of secondary variants (SVs) in actionable genes. To better understand the whys and wherefores of these questioning, it is necessary to understand how they are selected during the filtering process and how their proportion can be estimated. It is likely that SVs are underestimated and that our capacity to label all true SVs can be improved. In this context, Locus-specific databases (LSDBs) can be key by providing a wealth of information and enabling classifying variants. We illustrate this issue by analyzing 318 SVs in 23 actionable genes involved in cancer susceptibility syndromes identified through sequencing of 572 participants selected for a range of atherosclerosis phenotypes. Among these 318 SVs, only 43.4% are reported in Human Gene Mutation Database (HGMD) Professional versus 71.4% in LSDB. In addition, 23.9% of HGMD Professional variants are reported as pathogenic versus 4.8% for LSDB. These data underline the benefits of LSDBs to annotate SVs and minimize overinterpretation of mutations thanks to their efficient curation process and collection of unpublished data.
Introduction
Progress in sequencing technologies have led to the rapid adoption of next-generation sequencing (NGS) in a research context to facilitate the identification of disease-causing genes, especially in the field of rare genetic diseases. Based on their successes, these technologies have been transferred to diagnosis. This switch was not transparent but rather has been accompanied by new ethical issues. In fact, patients are addressed for a specific set of symptoms associated to a particular disease spectrum such as a neuromuscular disease. In the course of the Whole-Exome Sequencing (WES) now routinely proposed in many countries, it is frequent to identify potentially harmful mutations in genes unrelated to these symptoms but which may be of importance for patient follow-up. These discoveries have been named “secondary findings,” “incidental findings,” or “secondary variants (SVs)” and have to be distinguished from “unsolicited findings” that are found in the genes linked to the tested disease [Matthijs et al., 2016]. Depending on national guidelines, it may be mandatory or not to look for these “SVs” (for more information, see dedicated paper in this issue). Because these findings usually target genes involved in a completely different clinical field, such as cancer predisposing genes, the diagnostic laboratory may not be an expert of these genes while the interpretation of results requires a strong expertise. During the last 25 years, Locus-specific databases (LSDBs) have been slowly developed and maintained to ensure optimal quality of data to facilitate data interpretation. Here, we will review the various resources, LSDB and other databases, that could be used to facilitate the interpretation of these findings and discuss assets and drawbacks.
Materials and Methods
Variant list from Johnston et al. (2012) is available as Supp. Table S1 in Johnston et al. (2012) and Matthijs et al. (2016). List of actionable genes is available in Green et al. (2013). Reports in LSDBs for each of the 318 variants were searched for with an in-house designed Perl script for LOVD Beacon (http://mcupak.github.io/beacon-of-beacons/queries.html) and LOVD Share (http://databases.lovd.nl/shared/genes) core databases. Data from each UMD-LSDB (APC, BRCA1, BRCA2, MEN1, MLH1, MSH2, MSH6, MUTYH, TP53, and VHL genes) were searched online at http://www.umd.be. Finally, manual search was performed in the 250 other LSDBs reported for each of the 23 genes and are listed in Supp. Table S2.
To homogenize variant classification, Human Gene Mutation Database (HGMD) variant types have been matched as following: disease-causing mutation (“DM”) = class 5, disease-causing mutation? (“DM?”) = class 3, disease-associated polymorphism (DP) = class 1, functional polymorphism (FP) = class 2, and disease functional polymorphism (DFP) = class 2. When two or more databases had different variant classification, variant was classified as variant of unknown significance (VUS).
Reported frequencies from each variant from Exome Sequencing Project (ESP), Exome Aggregation Consortium (ExAC), 1000G, dbSNP (build 144) were extracted from the file provided by the Annovar Tool [Wang et al., 2010] as well as information from ClinVar (06-2015) (http://www.ncbi.nlm.nih.gov/clinvar/).
In silico predictions were performed with the UMD-Predictor tool [Salgado et al., 2016] through the corresponding Web service. Finally, data were merged into one table using a homemade Perl script.
What Are SVs and What Is the Importance of Reporting Them?
In 2013, the American College of Medical Genetics and Genomics (ACMG) recommended identification and return of SVs collected through NGS techniques such as WES and whole-genome sequencing (WGS) in diagnostic settings from a minimum set of 56 actionable genes as these variants, unrelated to the indication for which sequencing is ordered, are of medical value for patient care [Christenhusz et al., 2013; Green et al., 2013; ACMG Board of Directors, 2015]. These variants should be reported regardless of the age of the patient as preventive measures and/or treatment are available and individuals with pathogenic mutations might be asymptomatic for long periods of time. It was expected that the clinician would contextualize these variants for the patient in light of personal and family histories and physical examination.
Identification and reporting of these SVs led to a broad discussion in the last few years notably on: (1) clinicians’ obligations or not to report them [Biesecker, 2013; Clayton et al., 2013; Gliwa and Berkman, 2013; van El et al., 2013], (2) patient's right “not to know” [Andorno, 2004; American College of Medical Genetics and Genomics, 2013; Scheuner et al., 2015], (3) extra workload needed for variant interpretation and confirmation [Dorschner et al., 2013; Hegde et al., 2015], (4) uncertain accuracy of genotypic predictions in the absence of familial segregation data [Burke et al., 2013], (5) or possibility of inadequate depth and breadth of sequencing coverage at clinically relevant locations [Park et al., 2015], but also (6) cost-effectiveness of this detection [Douglas et al., 2016], and finally (7) whether this effort would be compensated [Hegde et al., 2015]. One of the emerging questioning is our real capacity to label all true SVs.
How SVs Are Selected During the Variant Filtering Process?
- dbSNP: The Single-Nucleotide Polymorphism database (dbSNP) (http://www.ncbi.nlm.nih.gov/snp) was established in September 1998, to address the need for a general catalog of genomic variation [Sherry et al., 2001]. dbSNP was initially composed of small-scale locus-specific submissions defined by flanking invariant sequences. Following the advent of high-throughput sequencing and the availability of complete genome assemblies for many organisms, dbSNP now receives a greater number of variants defined by sequence change at asserted locations on a reference sequence. dbSNP data evolved according to submissions from public laboratories and private organizations and now contains data from patients and controls of various ethnic groups. At present, dbSNP combines results from HapMap, 1000 Genomes, EVS, and ExAC projects (see below).
- 1000 Genomes Project: 1000 Genomes (1000G) Project (http://www.1000genomes.org) includes today individual-level genotype data from 2,504 individuals from 26 populations [1000 Genomes Project Consortium et al., 2015]. Data are reconstructed genomes using a combination of low-coverage WGS, deep exome sequencing, and dense microarray genotyping. Populations are distributed as follows: 504 individuals with East Asian Ancestry, 489 with South Asian Ancestry, 661 with African Ancestry, 503 with European Ancestry, and 347 with American Ancestry. All these individuals are assumed to be healthy.
- ESP: Due to its goals, the NHLBI GO ESP (http://evs.gs.washington.edu/EVS/) contains in its last release (ESP6500SI-V2) exome variant data from 6,503 patients presenting with heart, lung, and blood disorders [Fu et al., 2013]. A subset of these data (ESP2500) having more stringent filtering criteria is available in the latest release of dbSNP (build 134) [Tennessen et al., 2012]. Samples are from unrelated individuals (samples showing first-degree to third-degree relatedness have been removed). Large-scale validation of the variants was not performed. However, sequencing validation of a small number of singletons (∼200) and high-frequency SNP calls (∼800) was performed [Tennessen et al., 2012]. The complete set of the SNP calls from the NHLBI ESP project is included in the dbSNP build-138.
- ExAC: ExAC (http://evs.gs.washington.edu/EVS/) aggregates and harmonizes exome sequencing data from 60,706 unrelated individuals sequenced as part of various disease-specific and population-genetic studies [Lek et al., 2015]. Individuals affected by severe pediatric diseases have been removed so this data set could serve as a reference set of allele frequencies for severe disease studies. ExAC contains today 7,404,909 high-quality variants, including 317,381 indels. Although 1000G and ESP are contributing projects, the majority of variants have very low frequency and 72% are absent from both 1000G and ESP [Lek et al., 2015].

- HGMD Professional: The HGMD (http://www.hgmd.org) is a comprehensive collection of germline mutations in nuclear genes that underlie, or are associated with, human-inherited disease [Stenson et al., 2014]. HGMD is available in two versions: one public (permanently 3 years out of date and without any of the additional annotations) and one “Professional” obtainable by subscription (up to date version with curatorial comments). The mutation collection process is performed by automatic data mining systems that extract mutations from the various publication sources and checks their validity in comparison to reference sequences using the international nomenclature. Manual curation is provided when necessary. By February 2016, the database contained over 127,000 different lesions detected in over 4,860 different genes in the public version and over 179,000 lesions in 7,189 different genes in the Professional version.
- ClinVar: ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/) is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence [Landrum et al., 2016]. ClinVar is seeded with records based on allelic variants described in OMIM (http://www.omim.org), GeneReviews or UniProt, variants submitted with clinical information to dbSNP, voluntary submissions from clinical testing laboratories, researchers, LSDBs, expert panels, and groups establishing professional guidelines. Submissions to ClinVar are categorized according to associated data as the type of submission (clinical testing, results part of research project, data extracted from the literature), the number of submitters, evidence that supports interpretation (genetic testing, family studies, comparison of tumor/normal tissue, animal models, etc.). ClinVar does not curate interpretations of clinical significance or arbitrate conflicts in interpretation. They invite the clinical genetics community to form expert panels, which should perform high-level curation for variant interpretations. ClinVar contains to date 173,216 records among which 85,642 have assertion criteria. Novel variants submitted to ClinVar are in turn submitted to dbSNP or dbVar.
Some teams also chose to evaluate pathogenicity of SVs according to in silico analyses. The most used and reliable prediction tools are: UMD-Predictor [Salgado et al., 2016], MutationTaster 2 [Schwarz et al., 2014], CADD [Kircher et al., 2014], Polyphen 2.2.2 [Adzhubei et al., 2013], SIFT 5.1.1 [Sim et al., 2012], Provean 1.1.3 [Choi et al., 2012], Mutation Assessor 2 [Reva et al., 2011], and CONDEL 1.5 [González-Pérez and López-Bigas, 2011] for missense variations and HSF [Desmet et al., 2009], ESE Finder [Smith et al., 2006], MaxEntScan [Yeo and Burge, 2004], and NNsplice [Reese et al., 1997] for variations potentially impacting splicing. Salgado et al. (2016) discusses these tools in this issue.
Can the Number of Individuals with Expected Actionable SVs Be Estimated?
This question is especially challenging as each identified variation is not linked to a specific sample for evident patient confidentiality. Various attempts have been made to evaluate the number of patients with SVs. These estimates are mainly based on variants already reported in HGMD Professional release followed by manual curation by specialists using PubMed and/or pathogenicity evaluation with different in silico tools that select only highly penetrant pathogenic mutations. If we restrict these different analyses to the ACMG recommended list of 56 genes, SVs have been found in a range from 1% to 5.6% of the participants (6/179 individuals [3.35%] [Xue et al., 2012], 19/1,000 participants [1.90%] [Dorschner et al., 2013], 12/1,092 participants [1.10%] [Olfson et al., 2015], 92/6,503 participants [1.41%] [Amendola et al., 2015], 623/11,068 participants [5.6%] [Gambin et al., 2015], 2/149 participants [2%] [Yavarna et al., 2015]).
Is the Number of Expected Actionable SVs Underestimated?
The reported range of 1%–5.6% of studied samples with SVs can be discussed. First, a high discordance among reviewers has been noticed by Amendola et al. (2015). Reviewers are likely to be inconsistent in their categorization and reports biased toward more pathogenic categories. Second, even if population minor allele frequency (MAF) is a useful factor for variant classification, data are also limited by population diversity and by the number of tested alleles. Some populations are poorly or not represented such as South Asian (Afghanistan, Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan, and Sri Lanka) and Latino individuals, or the Middle East population (Egypt, Iran, Turkey, Iraq, Saudi Arabia, Yemen, Syria, United Arab Emirates, Israel, Jordan, Palestine, Lebanon, Oman, Kuwait, Qatar, Bahrain, and Cyprus). Databases might benefit from including a broader sampling of human diversity. Third, a possible overestimation of some SVs frequencies could be observed due to our inability to assess the MAF calculation. A bias could be introduced if MAF is based on a population enriched for pathogenic or likely pathogenic variants in specific ACMG genes. Several cohorts were, for example, enriched for lipid disorders, vascular disease, or chronic obstructive lung disease and are not a random sampling of the population. Finally, selection of variants according to their description in HGMD also introduces a bias as HGMD is not a comprehensive database. The impossibility nowadays to publish mutations in already known causing gene leads diagnostic laboratories to gather and store their variants in in-house databases. These data are submitted to core databases (as ClinVar) or to LSDBs only in the rare best case scenario.
In This Context, What About LSDBs for ACMG Genes?
LSDBs are a highly organized recording of variation data for specific genes. Lists of some LSDBs are available at the Human Genome Variation Society Website (HGVS, http://www.hgvs.org/locus-specific-mutation-databases), the Universal Mutation Database (UMD, http:www.umd.be), the Leiden Open Variation Database Website (LOVD, http://grenada.lumc.nl/LSDB_list/lsdbs), or the Gen2Phen Knowledge center (G2P, http://gen2phen.org/data/lsdbs). The majority of LSDBs presently available have been constructed with a small number of database management systems (DBMSs) among which the Leiden Open variation Database (LOVD, http:www.lovd.nl) [Fokkema et al., 2005] and the Universal Mutation Database (UMD, http:www.umd.be) [Béroud et al., 2005] that offer generic tools to build LSDBs. As they include data from a single gene, they collect all mutations and VUSs and often include unpublished data.
Numerous LSDBs are available for ACMG actionable genes (Supp. Table S2). All genes are represented in almost three different LSDBs, from which many involve the same DBMS (LOVD) but in different location (the number of variants was different). However, there are several LSDBs (18) that, although installed, have no variant documented.
LSDBs show a large heterogeneity in their contents and quality. Curation process varies largely among them. Highest quality in LSDB mutation collection process is provided by manual annotation of variants. This is a tedious but critical step since up to 10% of articles contain errors concerning mutation nomenclature: errors in type or position of mutations [Soussi et al., 2006] or use of a control sequence different from the current recognized reference.
Data commonly found in LSDBs are nucleotide position according to the reference sequence, exon number, description of the variation and nomenclature at the nucleotide (cDNA and genomic), and protein levels according to HGVS recommendations (http://www.hgvs.org/mutnomen/), reference of description (literature, diagnostic laboratories, etc.). For example:
(FBN1: sample IDXX c.3761G>A p.Cys1254Tyr g.48776092C>T [Stheneur et al., 2009] PMID19293843).
In some LSDBs, other data can be recorded such as associated disease, gender, transmission type (de novo, familial), geographic origin, specific location of the mutation at the protein level, consequences at the mRNA level, or experimental associated data. In silico analyses can also be available in some of them.
A wide heterogeneity is found for phenotypic data depending on the DBMS used. Phenotype description is usually reduced to single words in the great majority. Conversely, the UMD DBMS was developed notably to facilitate the collection of detailed phenotypes in view of performing genotype/phenotype correlation studies. Overall, the time spent to collect data and submit them according to LSDB needs is generally extensive and often restrains the involvement of large numbers of submitters and thus restrains their dissemination in the community.
Finally, LSDBs play a key role in the interpretation and classification of variants. It is widely accepted that classification of variation in genes is best performed by experts in those genes and/or pathology. Classification can be performed by individual curator(s) or an expert panel working with the curator and representing different areas of expertise (clinical, diagnostic, molecular, and computational). They display conclusion related to pathogenicity if a consensus has been reached. Pathogenicity was mainly based on familial segregation, evidence that supports a conclusion of pathogenicity, in silico prediction and frequency reported in core databases. For this, all these associated data have to be collected. Nevertheless, numerous LSDBs still do not provide manual annotation or classification of variants.
Use Case
In order to face a real situation, we searched for lists of variations identified by exome sequencing in the 56 ACMG genes before any filtration by HGMD Pro. We based our analysis on lists published by Johnston et al. (2012). They performed exome sequencing on 572 participants selected for a range of atherosclerosis phenotypes, but not for personal or family histories of cancer. They analyzed nonsense, frameshift, splice-site, and nonsynonymous variants in 37 genes involved in cancer susceptibility syndromes among which 23 are part of the ACMG gene list. They provided a list of 451 variants among which 318 are carried by genes of the ACMG list. Reports and classification of each of these 318 variations were searched for by homemade Perl scripts. We queried “core” LOVD databases as LOVD share (http://databases.lovd.nl/shared/genes) and LOVD Beacon (http://mcupak.github.io/beacon-of-beacons/queries.html). UMD databases were queried online at http://umd.be. All other databases listed in Supp. Table S2 were also manually queried. Frequencies from ESP, ExAC, 1000G, and dbSNP (build 144), as well as in silico predictions with the UMD-Predictor tool [Salgado et al., 2016] were merged into one table (Supp. Table S1).
We first looked for the presence of variants in HGMD Pro (03/15/2016), LOVD Share, LOVD Beacon, and UMD databases. Results from all other databases listed in Supp. Table S2 (250 queried databases) were merged into a single category named “Other databases.” Time to colligate all these data was estimated to be 16 hr. Of the 318 variations reported by Johnston et al. [2012], 138 (43.4%) were found in HGMD Pro (03/15/2016) (Table 1) and 227 (71.4%) in LSDBs. Representation in other databases was wide and only seven variations reported in HGMD (5%) were not found in LSDBs (Table 1). For the 180 variations not found in HGMD, 96 (53.3%) were at least reported in one LSDB and 84 (46.7%) were never reported, highlighting the added value of LSDB data (Table 2).
Database | Number of variations found |
---|---|
LSDBs | 227 |
LOVD Beacon | 43 |
LOVD Share | 96 |
UMD databases | 123 |
HGMD Pro | 138 |
Reported in HGMD and absent in LSDBs | 7 |
Database | Number of variations found |
---|---|
Not reported in LSDBs | 84 |
Reported at least once in LSDBs | 96 |
LOVD Beacon | 23 |
LOVD Share | 28 |
UMD databases | 39 |
The most common of the cancer susceptibility syndromes analyzed was hereditary breast and ovarian cancer linked to BRCA1/2 gene mutations with a combined frequency of ∼1/500. Consequently, as Johnston et al. (2012), we considered that a variant with a MAF of >1.5.10−2 was unlikely to cause a highly penetrant, rare, dominant disorder. Using ExAC allele frequencies (Supp. Table S1), a subset of 30/318 variations (9.4%) could be excluded with this criterion.
- - DMs were matched with class 5 (causal);
- - the annotation DM? corresponds to (1) variants initially classified as damaging in publications but with a degree of uncertainty, (2) variants reported by HGMD curators as having limited evidence for pathogenicity, and (3) variant for which pathogenicity was reconsidered after new evidence was provided. These variants were matched with class 3 (VUS);
- - DPs are variants with evidence for a significant association with a disease/clinical phenotype along with additional evidence that the polymorphism is itself likely to be of functional relevance, although there may be no direct evidence of a functional effect. These variants were matched with class 1 (neutral);
- - FPs correspond to variations that exert a direct functional effect but with no disease association reported as yet. These variants were matched with class 2 (likely neutral).
- - DFPs correspond to variations that exert a direct functional effect, with no disease association reported as yet and displaying evidence of being of direct functional relevance. These variants were matched with class 2 (likely neutral).
When classification was conflicting between different LSDBs, class 3 (VUS) is assigned to variants.
In order to estimate the added value of LSDBs without involving another curation process, classification of variations not reported in databases were not evaluated.
Classifications of variants were compared between databases in order to identify the respective numbers of variants to be reported as SVs (Table 3). HGMD Pro (03/15/2016) reported 33 damaging variants (63 in 2012). Johnston et al. (2012) reported eight mutations after curation. Eleven variations in UMD databases and other LSDBs are described in class 5 (Table 4). In these 11 causal variants, five are not reported in HGMD Pro (Table 4), and three were classified as VUS by Johnston et al. (2012) (another one was not evaluated as described with poor quality, “class 0”). Three variations described as causal by Johnston et al. (2012) were not reported in databases (for two) or described as VUS (for one) (Table 4). In the 33 variants annotated as damaging by HGMD Pro 2016 (Table 4), 23 have been classified as nonpathogenic by LSDBs (class 1 to 3), three as “not reported,” and six as causal (Table 5).
Database | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Total |
---|---|---|---|---|---|---|
HGMD Pro (03/15/2016) | 15 | 9 | 81 | 0 | 33 | 138 |
HGMD Pro (2012) | 11 | 7 | 51 | 0 | 63 | 132 |
Johnston classification (2012) | 69 | 12 | 168 | 2 | 8 | 258 |
UMD classification | 69 | 15 | 34 | 0 | 5 | 123 |
Other LSDBs classification | 15 | 16 | 152 | 3 | 6 | 192 |
Gene | RefSeq AAChange | EVS6500 | 1000G_2015Aug | Exac03 | UMD-predictor score | UMD-predictor prediction | UMD databases classification | Other LSDB classification | Number of reports in “Other databases” | HGMD Pro Classification (2012) | HGMD Pro Classification (03/15/2016) | Johnston's classification (2012) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BRCA1 | NM_007294.3 c.68_69del p.Glu23Valfs*17 | NA | NA | 0,0002000 | NA | NA | Not reported | 5 - causal | 1 | 5 - causal | Not reported | 5 - causal |
BRCA1 | NM_007294.3 c.547+2T>A | NA | NA | NA | NA | NA* | Not reported | 5 - causal | 1 | 5 - causal | 5 - causal | 5 - causal |
BRCA1 | NM_007294.3 c.688G>T p.Glu230* | NA | NA | NA | 100 | Pathogenic | Not reported | 5 - causal | 1 | Not reported | Not reported | 0 |
BRCA2 | NM_000059.3 c.5946delT p.Ser1982ArgfsX22 | 0.0002 | NA | 0,0003000 | NA | NA | 5 - causal | 5 - causal | 1 | 5 - causal | 5 - causal | 5 - causal |
BRCA2 | NM_000059.3 c.8297delC p.Thr2766AsnfsX11 | NA | NA | NA | NA | NA | 5 - causal | 5 - causal | 1 | 5 - causal | 5 - causal | 5 - causal |
MUTYH | NM_001048171.1 c.494A>G p.Tyr165Cys | 0.0022 | 0.000199681 | 0,0016000 | 90 | Pathogenic | 5 - causal | Not reported | 0 | 1-neutral | Not reported | 5 - causal |
MUTYH | NM_001048171.1 c.779G>A p.Arg260Gln | 0.0003 | NA | 0,0003000 | 84 | Pathogenic | 5 - causal | Not reported | 0 | 5 - causal | Not reported | 3-VUS |
MUTYH | NM_001048171.1 c.1145G>A p.Gly382Asp | 0.0038 | 0.00239617 | 0,0028000 | 90 | Pathogenic | 5 - causal | Not reported | 0 | 1-neutral | Not reported | 5 - causal |
PTEN | NM_000314.4 c.235G>A p.Ala79Thr | NA | NA | 0,0001000 | 87 | Pathogenic | Not reported | 4- probably causal | 1 | 5 - causal | 5 - causal | 3-VUS |
RET | NM_020630.4 c.874G>A p.Val292Met | NA | 0.00379393 | 0,0006000 | 48 | Polymorphism | Not reported | 5 - causal | 3 | Not reported | 5 - causal | 3-VUS |
SDHC | NM_003001.3 c.43C>T p.Arg15* | NA | NA | 0,0000083 | 100 | Pathogenic | Not reported | 4- probably causal | 2 | 5 - causal | 5 - causal | 5 - causal |
Described as pathogenic only in Johnston et al. [2012] | ||||||||||||
MUTYH | NM_001048171.1 c.691C>T p.Arg231Cys | NA | 0.000199681 | 0,0000837 | 96 | Pathogenic | 3-VUS | Not reported | 0 | 5 - causal | Not reported | 4- probably causal |
MUTYH | NM_001048171.1 c.892-2A>G | NA | 0.00299521 | 0,0010000 | NA | NA | Not reported | Not reported | 0 | 5 - causal | Not reported | 4- probably causal |
BRCA2 | NM_000059.3 c.5482_5486del p.Lys1828Valfs*4 | NA | NA | NA | NA | NA | Not reported | Not reported | 0 | Not reported | Not reported | 5 - causal |
- HSF prediction [Desmet et al., 2009]: alteration of the WT donor site, most probably affecting splicing.
- Nucleotide numbering uses +1 as the A of the ATG translation initiation codon in the reference sequence, with the initiation codon as codon 1.
Gene | RefSeq AAChange | EVS6500 | 1000G_2015Aug | Exac03 | UMD-Predictor Score | UMD-Predictor Prediction | UMD database classification | Other LSDB classification | Number of reports in Other databases | Johnston classification | HGMD Pro 2012 | HGMD Pro 03/15/2016 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
APC | NM_000038.4 c.607C>G p.Gln203Glu | 0.0003 | NA | 0,0005000 | 47 | Polymorphism | 3 - VUS | 3 - VUS | 3 | 3 - VUS | Not Reported | 5 - Causal |
APC | NM_000038.4 c.3479C>A p.Thr1160Lys | 0.0002 | NA | 0,0000995 | 93 | Pathogenic | Not Reported | Not Reported | 0 | 3 - VUS | 5 - Causal | 5 - Causal |
APC | NM_000038.4 c.6821C>T p.Ala2274Val | 0.0011 | 0.000199681 | 0,0010000 | 54 | Probable polymorphism | 3 - VUS | 3 - VUS | 1 | 3 - VUS | 5 - Causal | 5 - Causal |
APC | NM_000038.5 c.7717A>G p.Ile2573Val | 0.0003 | NA | 0,0001000 | 66 | Probably pathogenic | Not Reported | Not Reported | 0 | 3 - VUS | 5 - Causal | 5 - Causal |
BRCA1 | NM_007294.3 c.547+2T>A | NA | NA | NA | NA | NA | Not Reported | 5 - Causal | 1 | 5 - Causal | 5 - Causal | 5 - Causal |
BRCA2 | NM_000059.3 c.964A>C p.Lys322Gln | NA | 0.000599042 | 0,0000582 | 5 | Polymorphism | 3 - VUS | Not Reported | 1 | 3 - VUS | 5 - Causal | 5 - Causal |
BRCA2 | NM_000059.3 c.5946delT p.Ser1982ArgfsX22 | 0.0002 | NA | 0,0003000 | NA | NA | 5 - Causal | 5 - Causal | 1 | 5 - Causal | 5 - Causal | 5 - Causal |
BRCA2 | NM_000059.3 c.7504C>T p.Arg2502Cys | 0.0012 | 0.000399361 | 0,0003000 | 47 | Polymorphism | 3 - VUS | Not Reported | 9 | 3 - VUS | 5 - Causal | 5 - Causal |
BRCA2 | NM_000059.3 c.8297delC p.Thr2766AsnfsX11 | NA | NA | NA | NA | NA | 5 - Causal | 5 - Causal | 1 | 5 - Causal | 5 - Causal | 5 - Causal |
MLH1 | NM_000249.3 c.1742C>T p.Pro581Leu | NA | 0.00119808 | 0,0001000 | 84 | Pathogenic | Not Reported | 3 - VUS | 16 | 3 - VUS | 5 - Causal | 5 - Causal |
MLH1 | NM_000249.3 c.1963A>G p.Ile655Val | 0.0032 | 0.00259585 | 0,0010000 | 29 | Polymorphism | 1 - Neutral | 3 - VUS | 26 | 3 - VUS | 5 - Causal | 5 - Causal |
MLH1 | NM_000249.3 c.1964T>C p.Ile655Thr | 0.0002 | NA | 0,0000989 | 87 | Pathogenic | 1 - Neutral | 3 - VUS | 13 | 3 - VUS | 5 - Causal | 5 - Causal |
MSH2 | NM_000251.1 c.4G>A p.Ala2Thr | NA | NA | 0,0004000 | 84 | Pathogenic | 1 - Neutral | 1 - Neutral | 2 | 3 - VUS | 5 - Causal | 5 - Causal |
MSH2 | NM_000251.1 c.815C>T p.Ala272Val | 0.0004 | NA | 0,0002000 | 75 | Pathogenic | 1 - Neutral | 3 - VUS | 37 | 3 - VUS | 5 - Causal | 5 - Causal |
MSH2 | NM_000251.1 c.944G>T p.Gly315Val | NA | NA | 0,0002000 | 96 | Pathogenic | Not Reported | 3 - VUS | 2 | 3 - VUS | 5 - Causal | 5 - Causal |
MSH2 | NM_000251.1 c.1748A>G p.Asn583Ser | 0.0002 | NA | 0,0000997 | 81 | Pathogenic | 3 - VUS | 3 - VUS | 6 | 3 - VUS | 5 - Causal | 5 - Causal |
MSH2 | NM_000251.1 c.1787A>G p.Asn596Ser | 0.0002 | NA | 0,0003000 | 87 | Pathogenic | 3 - VUS | 3 - VUS | 24 | 3 - VUS | 5 - Causal | 5 - Causal |
MSH2 | NM_000251.1 c.2425G>A p.Glu809Lys | NA | 0.000399361 | 0,0003000 | 54 | Probable polymorphism | 3 - VUS | 3 - VUS | 2 | 3 - VUS | Not Reported | 5 - Causal |
MSH6 | NM_000179.2 c.1526T>C p.Val509Ala | 0.0008 | NA | 0,0007000 | 50 | Probable polymorphism | 2 - Likely neutral | 3 - VUS | 7 | 3 - VUS | 5 - Causal | 5 - Causal |
MUTYH | NM_001048171.1 c.74G>A p.Gly25Asp | NA | 0.00179712 | 0,0011000 | 48 | Polymorphism | 2 - Likely neutral | 3 - VUS | 14 | 3 - VUS | 5 - Causal | 5 - Causal |
MUTYH | NM_001048171.1 c.53C>T p.Pro18Leu | NA | 0.00179712 | 0,0011000 | 60 | Probable polymorphism | 2 - Likely neutral | 3 - VUS | 14 | 3 - VUS | 5 - Causal | 5 - Causal |
PTEN | NM_000314.4 c.235G>A p.Ala79Thr | NA | NA | 0,0001000 | 87 | Pathogenic | Not Reported | 4- Probably causal | 1 | 3 - VUS | 5 - Causal | 5 - Causal |
RB1 | NM_000321.2 c.411A>T p.Glu137Asp | 0.0007 | NA | 0,0004000 | 63 | Probable polymorphism | Not Reported | 3 - VUS | 6 | 3 - VUS | 5 - Causal | 5 - Causal |
RB1 | NM_000321.2 c.1966C>T p.Arg656Trp | 0.0005 | NA | 0,0006000 | 93 | Pathogenic | Not Reported | 3 - VUS | 3 | 3 - VUS | 5 - Causal | 5 - Causal |
RET | NM_020630.4 c.785T>C p.Val262Ala | 0.0002 | 0.000199681 | 0,0002000 | 63 | Probable polymorphism | Not Reported | Not Reported | 0 | 3 - VUS | 5 - Causal | 5 - Causal |
RET | NM_020630.4 c.833C>A p.Thr278Asn | NA | 0.00399361 | 0,0021000 | 57 | Probable polymorphism | Not Reported | 3 - VUS | 1 | 3 - VUS | 5 - Causal | 5 - Causal |
RET | NM_020630.4 c.874G>A p.Val292Met | NA | 0.00379393 | 0,0006000 | 48 | Polymorphism | Not Reported | 5 - Causal | 3 | 3 - VUS | Not Reported | 5 - Causal |
RET | NM_020630.4 c.1942G>A p.Val648Ile | 7.7e-05 | NA | 0,0000908 | 48 | Polymorphism | Not Reported | 3 - VUS | 4 | 2 - Likely neutral | 5 - Causal | 5 - Causal |
SDHC | NM_003001.3 c.43C>T p.Arg15* | NA | NA | 0,0000083 | 100 | Pathogenic | Not Reported | 4- Probably causal | 2 | 5 - Causal | 5 - Causal | 5 - Causal |
TSC1 | NM_000368.4 c.1960C>G p.GLN654Glu | NA | 0.00199681 | 0,0008000 | 69 | Probably pathogenic | Not Reported | 3 - VUS | 11 | 2 - Likely neutral | 5 - Causal | 5 - Causal |
TSC2 | NM_000548.3 c.1939G>A p.Asp647Asn | 7.7e-05 | 0.000399361 | 0,0004000 | 48 | Polymorphism | Not Reported | 3 - VUS | 4 | 3 - VUS | 5 - Causal | 5 - Causal |
TSC2 | NM_000548.3 c.3430G>A p.Val1144Met | 0.0002 | NA | 0,0002000 | 30 | Polymorphism | Not Reported | 2 - Likely neutral | 2 | 3 - VUS | 5 - Causal | 5 - Causal |
TSC2 | NM_000548.3 c.5383C>T p.Arg1795Cys | 0.0013 | 0.000798722 | 0,0012000 | 71 | Probably pathogenic | Not Reported | 3 - VUS | 7 | 2 - Likely neutral | 5 - Causal | 5 - Causal |
- Nucleotide numbering uses +1 as the A of the ATG translation initiation codon in the reference sequence, with the initiation codon as codon 1.
The proportion of SVs to report for cancer susceptibility syndromes in 572 exomes varies largely with 5.77% in HGMD Pro, 1.40% in Johnston's study, and 1.92% in LSDBs.
These results demonstrate the constant evolution of our knowledge leading to reannotation of variants in HGMD and in LSDBs over the years. Nevertheless, they also show that LSDBs give access to more information and help in classifying variants identified thanks to NGS.
Conclusion: How Can We Work Together?
LSDBs have evolved to serve many purposes to address the changing needs of the genetics community in evaluating and interpreting human genetic variation [Dalgleish, 2016]. There is no perfect generic design for LSDBs because of the heterogeneity of genetic diseases, associated phenotypes, and goals. Nevertheless, some recommendations have recently been published [Vihinen et al., 2016]. The more they offer phenotypic information, the less they are easy to maintain since quality of submitted data varies from center to center and over time. Another key challenge is to make the LSDB both easy to use and useful.
LSDBs are an ideal tool for integration and dissemination of data to the medical community. As expected, LSDBs contain more mutations than HGMD as they include up to 50% of unpublished variations (depending on the genes), often with phenotypic descriptions. Consequently, LSDBs are extremely useful tools, contributing to the identification of causative mutations, providing information about phenotypic patterns associated with a specific mutation, enabling researchers to define an optimal strategy for mutation detection, and helping in the characterization of SVs. LSDBs data could indeed significantly advance the interpretation of missense variants by facilitating estimates of the frequency of rare variants in patients presenting a given phenotype, of rare events co-occurring with pathogenic/nonpathogenic variants, of allele frequencies in specific populations and the association of variants with clinical or pathological features.
Today, data are still fragmented and various attempts have been made to develop unified databases. Nevertheless, they have mainly been unsuccessful, not only because of a lack of funding to create such databases. Indeed, first, LSDB have different goals reflected by different contents, infrastructures, and quality making them somehow hard to merge. Second, global efforts to gather genetic information from different databases and registries into a common global database have arisen [Bean and Hegde, 2016]. Such initiative must strongly benefit from LSDBS but this could be achieved only if they do not replace them, otherwise data sharing and expert curation will be compromised, especially as LSDBs are facing sustainability issues to offer accurate and updated data.
- 1.
to receive complete information about patient's clinical presentation when a diagnostic test is ordered. As such data are usually not available from the diagnostic laboratory itself, it is important to involve clinicians in this data sharing. Indeed, clinical descriptions are often scarce because of the lack of time in the course of the medical consultation. Clinicians are also insufficiently informed about the diagnostic laboratory possibility to transfer, in agreement with patient consent, accurate phenotypic data associated with mutations into databases.
- 2.
to justify the time spent on collecting data to their trustees. This aspect is a real concern for diagnostic laboratories but also for clinicians. As previously mentioned, once a gene is described as disease-causing, most of the subsequent mutation identifications take place in a clinical setting. These data often present a low interest from journals because of the “lack of novelty.” A large amount of curated sequence data therefore lies within the clinical laboratories for their own activity, waiting to be shared with the medical and research communities.
Attempts have been made to stimulate the sharing of those data by various mechanisms as microattribution, which unfortunately never expanded because of the lack of recognition by funding agencies or by trustees as a positive effort made by the investigators. However, although essential for optimum delivery of genetic healthcare and for medical research, the main difficulty for LSDBs is obtaining funding for the collection of such data. Only win-win approaches will be sustainable. The future may lie in public–private partnerships as illustrated by the successful BRCA-Share™ initiative [Béroud et al, 2016], to improve the detection of inherited risk of breast and ovarian cancers.
Acknowledgments
A.P. is supported by a PhD studentship from AFSMA (Association Française du Syndrome de Marfan et Apparentés). M.M. is supported by a PhD studentship from MENESR (Ministère de l'Education Nationale, de l'Enseignement Supérieur et de la Recherche).