Deciphering exome sequencing data: Bringing mitochondrial DNA variants to light
Abstract
The expanding use of exome sequencing (ES) in diagnosis generates a huge amount of data, including untargeted mitochondrial DNA (mtDNA) sequences. We developed a strategy to deeply study ES data, focusing on the mtDNA genome on a large unspecific cohort to increase diagnostic yield. A targeted bioinformatics pipeline assembled mitochondrial genome from ES data to detect pathogenic mtDNA variants in parallel with the “in-house” nuclear exome pipeline. mtDNA data coming from off-target sequences (indirect sequencing) were extracted from the BAM files in 928 individuals with developmental and/or neurological anomalies. The mtDNA variants were filtered out based on database information, cohort frequencies, haplogroups and protein consequences. Two homoplasmic pathogenic variants (m.9035T>C and m.11778G>A) were identified in 2 out of 928 unrelated individuals (0.2%): the m.9035T>C (MT-ATP6) variant in a female with ataxia and the m.11778G>A (MT-ND4) variant in a male with a complex mosaic disorder and a severe ophthalmological phenotype, uncovering undiagnosed Leber's hereditary optic neuropathy (LHON). Seven secondary findings were also found, predisposing to deafness or LHON, in 7 out of 928 individuals (0.75%). This study demonstrates the usefulness of including a targeted strategy in ES pipeline to detect mtDNA variants, improving results in diagnosis and research, without resampling patients and performing targeted mtDNA strategies.
1 INTRODUCTION
Mitochondrial disorders make up a vast group of diverse pathologies affecting 1 out of 5,000 live births (Bannwarth et al., 2013). Since mitochondria are present in nearly every human cell, all organs can be affected by mitochondrial dysfunction. The clinical spectrum is highly heterogeneous, varying from one to several symptoms including deafness, retinopathy, diabetes, myopathy, epilepsy, and renal, cardiac, or hepatic dysfunction. Metabolic investigation can reveal increased lactate in blood and/or cerebrospinal fluid (CSF), and muscle biopsy can show respiratory chain anomalies and/or ragged-red fibers. Mitochondrial disorders are particular in that they can be transmitted by either Mendelian or mitochondrial inheritance, since mitochondria are complex cellular organelles composed of multiple proteins encoded by either nuclear or mitochondrial DNA (mtDNA) genes (Dinwiddie et al., 2013).
To date, more than 300 nuclear genes have been implicated in mitochondrial disorders (Gorman et al., 2016), mainly with the autosomal recessive mode of inheritance (Chinnery, 2002). For the maternally inherited mtDNA, 37 genes are known to be involved in human diseases with large clinical and genetic variability (Dinwiddie et al., 2013). Indeed, different phenotypes are linked to the same mtDNA variant and the same phenotype to different mtDNA variants. For example, the GenBank NC_012920.1:m.3243A>G variant (MT-TL1; MIM# 590050.0001; dbSNP Build 152: rs199474657 [Sherry et al., (2001)]) causes different phenotypes including mitochondrial encephalomyopathy, lactic acidosis, and stroke-like episodes (MELAS; MIM# 540000), chronic progressive external ophthalmoplegia (CPEO; MIM# 530000), and maternally inherited diabetes-deafness syndrome (MIDD; MIM# 520000). Conversely, MELAS syndrome may be the consequence of other pathogenic variants in MT-TL1 (e.g., NC_012920.1:m.3256C>T, NC_012920.1:m.3260A>G, NC_012920.1:m.3271T>C, and NC_012920.1:m.3291T>C; MIM# 590050.0003, 590050.0007, 590050.0002, and 590050; dbSNP Build 152: rs199474659, rs199474663, rs199474658, and rs869312463) or in other transfer ribonucleic acid (tRNA) genes (e.g., NC_012920.1:m.583G>A in MT-TF; MIM# 590070.0001; dbSNP Build 152: rs118203885 or NC_012920.1:m.12147G>A in MT-TH; MIM# 590040.0003; dbSNP Build 152: rs121434474). Individual clinical phenotypes also depend on mitochondrial specificities named heteroplasmy/homoplasmy (Bai & Wong, 2005). In a single individual, mtDNA sequence can be variable with some mtDNA variants present in mixed proportions with wild-type mtDNA, within a cell or a tissue: this condition is called heteroplasmy. Homoplasmy is when the mtDNA variant is fully present in all the cells.
In a diagnostic clinical setting, the expanding implementation of next-generation sequencing (NGS) has considerably improved the diagnosis of mitochondrial diseases over the last few years. Targeted whole-mtDNA sequencing initially allowed researchers to detect mtDNA variants easily (Y.He et al., 2010; Vasta, Ng, Turner, Shendure, & Hahn, 2009). Targeted strategies are also routinely used to detect mtDNA variants when a mitochondrial disorder has been clinically suspected, focusing on expected single-nucleotide variants (SNVs) or structural variants. Older technologies such as polymerase chain reaction-restriction fragment length polymorphism (PCR-RFLP) currently detect single mtDNA variants (Holt, Harding, Petty, & Morgan-Hughes, 1990), which is useful for confirming low heteroplasmic rate variants (Vasta et al., 2009). For deletions and duplications, Southern Blot, real-time PCR, and long-range PCR are still commonly used (Colter-Mackie, Applegarth, Toone, & Gagnier, 1998; L.He et al., 2002).
Enriched exome sequencing (ES) capture was also proposed as a way to efficiently and specifically target all nuclear-encoded mitochondrial genes and mtDNA genes (Falk et al., 2012). The exome capture kits mainly used in diagnostic settings do not include mtDNA. It is possible to design exome capture kits for simultaneous nuclear exome and mitochondrial sequencing, but they are rarely available in catalogs. However, mtDNA can also be sequenced indirectly: the mtDNA sequence can be extracted and reassembled from ES data (Samuels et al., 2013). This indirect mtDNA sequencing was initially used in cancer studies (Guo, Li, Li, Shyr, & Samuels, 2013; Zhang et al., 2016), with a level of efficiency similar to direct mtDNA sequencing in five breast cancer cell lines (Zhang et al., 2016). Single-nucleotide polymorphisms and mutant loads were detected and quantified by indirect mtDNA sequencing, and the results obtained were almost the same as with the direct approach. In mitochondrial disorders, indirect mtDNA sequencing has only been reported in a few studies. For instance, Dinwiddie et al. (2013) reported three affected individuals highly suspected of having a mitochondrial disorder and identified mtDNA variants in two of them: the pathogenic homoplasmic (100%) NC_012920.1:m.8993T>C (p.L156P; MT-ATP6; MIM# 516060.0002; dbSNP Build 152: rs199476133) variant involved in Leigh syndrome, and a novel heteroplasmic (25%) NC_012920.1:m.3754C>A (MT-ND1; MIM# 516000) variant of uncertain significance (VUS). The increasing use of NGS requires the implementation of bioinformatics tools to process and analyze the huge amount of generated data.
While the pipeline for nuclear variants identification from ES or genome sequencing (GS) data follows the known scheme described and recommended in GATK Best Practices (Van der Auwera et al., 2013), the indirect mtDNA sequencing and analysis based on ES data was not mentioned and presents some bioinformatics and biological limits. mtDNA can be studied with the same nuclear pipeline (Dinwiddie et al., 2013) from quality assessment to variant calling steps, but specific databases and processes are required for variant annotation. Moreover, over the course of human evolution, mtDNA acquired specific sets of associated variants defining mitochondrial haplotypes (or haplogroups) depending on geographical origins. These haplogroups are hierarchically organized in a tree comprising haplogroups and sub-haplogroup branches. The effects of mtDNA variants may vary according to the mitochondrial genetic background (Wallace, Fan, & Procaccio, 2010). Thus, mitochondrial haplogroups need to be determined to properly interpret mitochondrial variants that notably differ from current nuclear variant interpretation. Haplogroup polymorphic variants would then be filtered out from the candidate variant list. Moreover, after mitochondrial endosymbiosis, mtDNA sequences have colonized the nuclear DNA (Calabrese, Simone, & Attimonelli, 2012). In primates, this can occur in several copies along the nuclear genome: these regions are nuclear mitochondrial DNA sequences (NUMTs) and their variants need to be filtered out to avoid false-positive mtDNA variant detection.
Different tools such as MitoSeek (Guo et al., 2013), mit-o-matic (Vellarikkal et al., 2015), or MtoolBox (Calabrese et al., 2014) have already been developed. While MitoSeek works only on remaining reads not aligned after nuclear genome analysis and does not identify haplogroups, mit-o-matic, and MtoolBox assign haplogroups. However, mit-o-matic aligns reads on revised Cambridge Reference Sequence (rCRS; GenBank NC_012920.1) human mtDNA reference only, and MtoolBox aligns sequences on the mitochondrial and nuclear genome separately. These tools have previously been used for indirect mtDNA sequencing data analysis, also with filtering of haplogroup variants, but only on a limited number of individuals suspected of autism (Patowary, Nesbitt, Archer, Bernier, & Brkanac, 2017).
Here, we present the creation and use of a bioinformatics pipeline built on the Burrows-Wheeler Aligner (BWA; Li & Durbin, 2009), SAMtools (Li et al., 2009), and the Genome Analysis Toolkit (GATK; McKenna et al., 2010), combined with haplogroup determination, to analyze indirect mtDNA sequencing results obtained from ES data in a large unspecific cohort of individuals with developmental anomalies (DA) and/or neurological disorders. Our aim was to determine whether mtDNA variants related to mitochondrial disorders can be found in individuals without identified nuclear variants.
2 INDIVIDUALS AND METHODS
2.1 Individuals
We gathered a cohort of 928 unrelated individuals (536 males and 392 females) with DA (81%) or a primary neurological disorder (19%), mainly Caucasian and with ages ranging from unborn to 89 years (median 11 years), for whom ES was performed in a diagnostic or research setting. ES was performed as previously described (Thevenon et al., 2016). The nuclear ES analysis yielded a positive result in 249/928 individuals (26.8%). The mtDNA analysis was performed on the ES data of all 928 probands, whatever the results of the nuclear ES analysis. When available, DNA from family members was used for segregation analysis. Informed written consent was obtained from individuals or parents for ES analysis.
2.2 Positive controls
To validate this new mitochondrial bioinformatics pipeline, we analyzed ES data from four positive controls issued from DNA samples with a well-known mitochondrial variant previously identified by targeted sequencing (Table S1). For these samples, ES was performed with the Agilent CRE V2 or Agilent V5 enrichment kits, which are two of the different kits used in our cohort.
2.3 Indirect mtDNA sequencing from ES data analysis
2.3.1 Sequences alignment, file conversion, and mitochondrial genome extraction
We developed a bioinformatics pipeline devoted to mtDNA analysis of ES data. It was designed using the guidelines outlined in GATK Best Practices (Van der Auwera et al., 2013) and based on the “in-house” pipeline already employed for nuclear ES in our laboratory (Thevenon et al., 2016; Figure 1a). After quality control, pairs of fastq files were aligned on the GRCh38 reference because it contains the currently used mtDNA reference GenBank NC_012920.1 (Andrews et al., 1999), unlike the hg19 genome. Using the BWA-MEM (v.0.7.15; Li & Durbin, 2009), we chose the most recommended approach (Ye, Samuels, Clark, & Guo, 2014): complete alignment on GRCh38 reference.

mtDNA analysis. (a) Schematic overview of the mtDNA pipeline. BAM files and raw VCF files generation scripts were based on our nDNA pipeline and the steps were completely automated (Thevenon et al., 2016). Merged VCF file generation required manual intervention due to the online HaploGrep 2.0 step. Only one raw merged file was needed. Variant frequency data were extracted and calculated from the Genbank sequence depository of Mitomap. In-house scripts allowed for a rapid filter of variants. Tools are noted in italics. (b) mtDNA rare variant analysis strategies. Variants identified as synonymous, present in more than four patients and considered as haplogroup markers were eliminated. “Confirmed” or “reported” variants in Mitomap or in literature and variants reported within the same region involved in a similar phenotype as the proband were analyzed. BWA, Burrows–Wheeler Aligner; GATK, Genome Analysis Toolkit; mtDNA, mitochondrial DNA; VCF, Variant Call Format
After the alignment step, we used SAMtools (v.1.2; Li et al., 2009) to convert SAM into a sorted BAM, extracted the mitochondrial genome data and indexed it. Picard tools (v.2.4.1; Broad Institute, n.d) were then used to clean and mark optical duplicate reads on the extracted mtDNA data. Thereafter, base quality score recalibration was performed by the GATK (v.3.7; McKenna et al., 2010).
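The alignment, conversion, and extraction steps described above can be sketched as a short command builder. This is only an illustrative sketch, not the published pipeline: the function name, file names, and the `chrM` contig label are assumptions, and the SAMtools syntax shown follows recent releases, which may differ from the v.1.2 used in the study. The Picard and GATK steps are omitted.

```python
from typing import List


def mtdna_extraction_commands(sample: str, reference: str, fastq1: str,
                              fastq2: str, mt_contig: str = "chrM") -> List[str]:
    """Build shell commands for alignment, sorting, and mtDNA extraction.

    File names are hypothetical; mt_contig depends on how the GRCh38
    reference names the mitochondrial contig (assumed "chrM" here).
    """
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    mt_bam = f"{sample}.mt.bam"
    return [
        # Complete alignment of read pairs on the whole GRCh38 reference
        f"bwa mem {reference} {fastq1} {fastq2} > {sam}",
        # Convert SAM into a coordinate-sorted BAM and index it
        f"samtools sort -o {sorted_bam} {sam}",
        f"samtools index {sorted_bam}",
        # Extract reads aligned to the mitochondrial contig, then index
        f"samtools view -b -o {mt_bam} {sorted_bam} {mt_contig}",
        f"samtools index {mt_bam}",
    ]


for command in mtdna_extraction_commands(
        "s1", "GRCh38.fa", "s1_R1.fastq.gz", "s1_R2.fastq.gz"):
    print(command)
```

Aligning against the full reference before extracting the mitochondrial contig lets reads originating from NUMTs map to their nuclear location instead of being forced onto the mtDNA.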
2.3.2 Variants’ calling, raw file generation, and filtering
Raw VCF files containing mtDNA variants were obtained for each individual with one GATK variant caller, HaplotypeCaller, used with default parameters, which can simultaneously detect SNVs and small indels. All raw VCF files were then merged into a global variant matrix file that grouped together the data from all individuals. The first filter applied to the variant matrix kept only rare variants, defined by a population frequency <1% (1000 Genomes Project Consortium et al., 2012) in the Mitomap database entries (Lott et al., 2013). As some Leber's hereditary optic neuropathy (LHON) variants are detected with a 0.36% population frequency, we chose the arbitrary 1% value to be conservative. A supplementary filter removed synonymous variants (Figures 1b and 2a).
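The two first filters can be illustrated with a minimal sketch. The `Variant` record, its field names, and the example entries are hypothetical and are not taken from the pipeline; only the <1% threshold and the exclusion of synonymous variants come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    name: str               # hypothetical label
    population_freq: float  # fraction, e.g. 0.0036 for 0.36%
    consequence: str        # e.g. "synonymous", "missense", "tRNA"


def keep_rare_nonsynonymous(variants: List[Variant],
                            max_freq: float = 0.01) -> List[Variant]:
    """Keep variants with population frequency < 1% that are not synonymous."""
    return [v for v in variants
            if v.population_freq < max_freq and v.consequence != "synonymous"]


# Made-up example entries, for illustration only
example = [
    Variant("rare_missense", 0.0036, "missense"),
    Variant("common_missense", 0.05, "missense"),
    Variant("rare_synonymous", 0.001, "synonymous"),
]
print([v.name for v in keep_rare_nonsynonymous(example)])
```

With a 1% cutoff, a variant seen at 0.36% in the population, as reported for some LHON variants, passes the frequency filter.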

mtDNA variants of interest identification. (a) mtDNA variant filtering. Frequent variants and synonymous variants were filtered out at the beginning. Well represented variants in our database (present in more than four individuals) and variants defining haplogroups were removed. With these first four filters, 73.5% (n = 1,786) of variants were filtered out from the raw variant list, without considering the proband’s phenotype. Looking for Mitomap status and citations in literature was not an automatic step as articles had to be analyzed and phenotypes compared. At the end of the process, 99.71% (n = 2,422) of variants were filtered out. (b) Mitomap variant filtering. Mitomap status was identified for each of the 31 variants (39 individuals). Each variant was then filtered by Mitomap frequency. Then, phenotype and/or heteroplasmy concordances were analyzed, except for “possibly synergistic” variants. These latter were filtered if no primary variant was present in involved individual mtDNA or if there was no other causal evidence. After this filtering step, only seven variants were detailed. (c) Identification of pathogenic mtDNA variants. The mtDNA is represented in linear form. Then the control region goes from 16024 to 576 (pink). Identified pathogenic variants are in blue and secondary findings in purple. All these variants are classified as “confirmed” or “reported—possibly synergistic” in Mitomap. MT-RNR1: 12S ribosomal RNA (MIM# 561000). MT-ATP6 (MIM# 516060), MT-ND4 (MIM# 516003), and MT-ND6 (MIM# 516006) are protein-coding genes. The only regions in which pathogenic variants were identified are specified. mtDNA: mitochondrial DNA
The median number of carrier individuals per variant across the 2,429 raw variants was 2.0. Moreover, all well-known and “confirmed” mtDNA variants in mitochondrial disorders were checked. None was present in more than three individuals, so we chose to keep variants present in fewer than four individuals.
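The cohort recurrence filter amounts to counting carriers per variant in the merged matrix. The sketch below assumes the matrix has been loaded as a mapping from individual to the set of variants called in that individual; this data layout and the variant labels are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Set


def carrier_counts(calls: Dict[str, Set[str]]) -> Counter:
    """Count, for each variant, the number of individuals carrying it."""
    counts: Counter = Counter()
    for variants in calls.values():
        counts.update(variants)  # each individual counts once per variant
    return counts


def low_recurrence_variants(calls: Dict[str, Set[str]],
                            max_carriers: int = 3) -> List[str]:
    """Keep variants present in fewer than four individuals (<= 3 carriers)."""
    counts = carrier_counts(calls)
    return sorted(v for v, n in counts.items() if n <= max_carriers)


# Made-up calls: "varA" is carried by four individuals, "varB" by one
calls = {
    "ind1": {"varA", "varB"},
    "ind2": {"varA"},
    "ind3": {"varA"},
    "ind4": {"varA"},
}
print(low_recurrence_variants(calls))
```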
2.3.3 Haplogroup determination
In parallel, the mitochondrial haplogroup was determined from raw VCF files using HaploGrep 2.0 (Weissensteiner et al., 2016) online software. After individuals’ haplogroups were determined, each variant involved in one or several haplogroups (Van Oven, 2015) was also annotated. When all individuals carrying the same variant had the same haplogroup or subhaplogroup, the variant was considered a polymorphism and filtered out. If the individuals did not have the same haplogroup, the considered variant was kept and further analyzed.
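One possible reading of this rule is sketched below. It assumes that a variant is filtered out only when it is annotated as haplogroup-defining and every carrier was assigned exactly the same haplogroup label; how sub-haplogroups of a common branch were compared is not specified in the text, so exact string equality is an assumption, as are the function and parameter names.

```python
from typing import Iterable


def is_haplogroup_polymorphism(carrier_haplogroups: Iterable[str],
                               defines_haplogroup: bool) -> bool:
    """Return True when a variant should be filtered out as a polymorphism.

    Assumption: the variant must be annotated as haplogroup-defining and
    all carriers must share one haplogroup label (exact match).
    """
    groups = set(carrier_haplogroups)
    return defines_haplogroup and len(groups) == 1


# Carriers sharing one haplogroup: filtered out
print(is_haplogroup_polymorphism(["K1a", "K1a"], defines_haplogroup=True))
# Carriers from different haplogroups: kept for further analysis
print(is_haplogroup_polymorphism(["K1a", "H1c"], defines_haplogroup=True))
```

Requiring the haplogroup annotation prevents a variant seen in a single individual from being discarded merely because one carrier trivially "shares" a haplogroup.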
2.3.4 Variant interpretation
The next analysis on the remaining variants consisted of checking the variant status in Mitomap (Lott et al., 2013; Figures 1b and 2b). We first retained the variants considered as “confirmed,” “reported,” or “unclear” in Mitomap. For the variants that were absent or present but not labeled, we explored other public databases (e.g., Clinvar [Landrum et al., 2018], OMIM [OMIM—Online Mendelian Inheritance in Man., 1996]) and scientific literature. Individuals’ phenotypes and those described in Mitomap or in scientific articles were compared. When the status was “confirmed” or there was a probable concordance between the described phenotype and heteroplasmic rates, the variant was considered as a candidate variant and confirmed by PCR-RFLP or NGS. In cases of nonconcordance or absence in the databases, nearby regions were carefully examined for the presences of reported variants said to be involved in similar phenotypes.
Then, for tRNA variants with phenotype concordance, the Mitochondrial tRNA Informatics Predictors (MitoTIP) score was studied. The MitoTIP (Sonney et al., 2017) score provided by Mitomap is used to predict the probability that unknown or rare variants are pathogenic (score from 1%–99%). A tRNA variant is predicted as likely pathogenic or possibly pathogenic when its percentile score is more than 50%. On the contrary, with a score value less than 50%, the variant is predicted as possibly or likely benign.
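The 50% decision rule described above can be written as a small helper. This is a sketch of the thresholding only; the function name is hypothetical, and the handling of a score of exactly 50% is an assumption because the text does not cover that case.

```python
def mitotip_prediction(score_percent: float) -> str:
    """Classify a tRNA variant from its MitoTIP percentile score."""
    if not 0 <= score_percent <= 100:
        raise ValueError("score must be a percentage between 0 and 100")
    if score_percent > 50:
        return "likely or possibly pathogenic"
    if score_percent < 50:
        return "possibly or likely benign"
    return "undetermined"  # exactly 50%: not specified in the text


print(mitotip_prediction(32.70))
```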
To ensure that identified mtDNA possible causal variants were not due to NUMTs, nuclear equivalent regions were verified on mtDNA. The nuclear coordinates of NUMTs that were overlapping studied mtDNA sequences were extracted (Calabrese et al., 2012), and nuclear equivalent positions were manually checked.
2.3.5 Depth analysis
Each position depth was determined by GATK DepthOfCoverage (Figure 4a). In parallel, NUMTs targeted during ES were obtained by intersecting exome capture kit target lists and the NUMT list (Calabrese et al., 2012) using BEDTools (Quinlan & Hall, 2010). Common nuclear regions were converted into mitochondrial coordinates and sorted by their chromosomal origins. Four Agilent SureSelect kits were used for ES and thus studied: Clinical Research Exome (CRE), CRE_v2, XT Human All Exon v4 and v5.
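The intersection of capture targets with the NUMT list can be illustrated in plain Python. The sketch below mimics the basic overlap reporting of an interval intersection on BED-style half-open intervals; it is not the BEDTools implementation, and the coordinates are made up.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def intersect(targets: List[Interval], numts: List[Interval]) -> List[Interval]:
    """Return the overlapping portion of every target/NUMT pair."""
    overlaps: List[Interval] = []
    for chrom_t, start_t, end_t in targets:
        for chrom_n, start_n, end_n in numts:
            if chrom_t != chrom_n:
                continue
            start, end = max(start_t, start_n), min(end_t, end_n)
            if start < end:  # non-empty overlap
                overlaps.append((chrom_t, start, end))
    return overlaps


# Made-up coordinates for illustration
targets = [("chr1", 100, 200)]
numts = [("chr1", 150, 250), ("chr2", 100, 200)]
print(intersect(targets, numts))
```

The nested loop is quadratic and is only meant to show the logic; sorted or indexed intervals would be preferable for genome-scale lists.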
2.3.6 Variant validation methods
A second molecular method (PCR, Sanger sequencing, or PCR-RFLP) was used to validate candidate variants (PCR and Sanger sequencing conditions are available in Supplementary Materials). The quantification of heteroplasmy was performed for each mtDNA variant using the fluorescent PCR-RFLP method. The region of interest was PCR amplified, the amplicons were digested, and the fragments produced were analyzed by capillary electrophoresis (Applied 3130XL). The results were analyzed with Peak Scanner Analysis Software (Thermo Fisher Scientific) and heteroplasmy was expressed as the percentage of mutant load.
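Expressing heteroplasmy as a percentage of mutant load is a simple ratio of mutant signal to total signal. The sketch below assumes that the two quantities are either electrophoresis peak areas or read counts (alternate reads over total depth, as in Table 1); the function name is hypothetical.

```python
def mutant_load_percent(mutant: float, total: float) -> float:
    """Heteroplasmy as the percentage of mutant signal over total signal."""
    if total <= 0 or mutant < 0 or mutant > total:
        raise ValueError("expected 0 <= mutant <= total and total > 0")
    return 100.0 * mutant / total


# Read counts reported for individual 4 (18 alternate reads out of 27)
print(round(mutant_load_percent(18, 27), 2))
```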
2.3.7 MToolBox tests
We also tested MToolBox on all positive cases and on our four positive controls to compare our method to an existing method.
The pipeline can be downloaded from http://gitlab.gad-bioinfo.org/gad-public/pipelinemito.
3 RESULTS
All pathogenic mitochondrial SNV were identified in positive controls (Table S1), confirming the reliability of our method.
Among the 928 individuals, a total of 2,429 mtDNA variants were obtained (Figure 2a). After frequency filtering (<1%), 2,019 of the 2,429 variants were kept. More than half of them (1,200/2,019) were nonsynonymous and underwent the rest of the process. Only 984/1,200 variants were present in fewer than four individuals. After removing the variants defining haplogroups, 643/984 variants were analyzed. After checking Mitomap and completing a thorough literature review, 31/643 variants in 39/928 individuals (4.2%) were considered candidate variants because they had “confirmed,” “reported,” or “unclear” status (Table S4; Figure 2a).
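The filtering cascade can be checked arithmetically from the counts reported in the text and in the Figure 2 caption. The sketch below only recomputes cumulative percentages; the step labels are paraphrased and the final count of seven variants is taken from the Figure 2b caption.

```python
from typing import List, Tuple

# Variant counts remaining after each step, as reported in the text
steps: List[Tuple[str, int]] = [
    ("raw variants", 2429),
    ("population frequency < 1%", 2019),
    ("nonsynonymous", 1200),
    ("fewer than four carriers", 984),
    ("not haplogroup-defining", 643),
    ("Mitomap/literature candidates", 31),
    ("detailed after interpretation", 7),
]


def cumulative_removed(steps: List[Tuple[str, int]]) -> List[Tuple[str, int, float]]:
    """Return (label, remaining, cumulative % removed from the raw list)."""
    total = steps[0][1]
    return [(label, n, round(100 * (total - n) / total, 2)) for label, n in steps]


for label, remaining, pct in cumulative_removed(steps):
    print(f"{label}: {remaining} remaining, {pct}% removed")
```

The first four filters remove 1,786 of the 2,429 raw variants, and the full process removes 2,422, which matches the 73.5% and 99.71% figures given in the Figure 2a caption.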
The five “confirmed” Mitomap variants, NC_012920.1:m.1494C>T (MT-RNR1; MIM# 561000.0004; dbSNP Build 152: rs267606619), NC_012920.1:m.1555A>G (MT-RNR1; MIM# 561000.0001; dbSNP Build 152: rs267606617), NC_012920.1:m.9035T>C (YP_003024031.1:p.L170P; MT-ATP6; MIM# 516060), NC_012920.1:m.11778G>A (YP_003024035.1:p.R340H; MT-ND4; MIM# 516003.0001; dbSNP Build 152: rs199476112) and NC_012920.1:m.14484T>C (YP_003024037.1:p.M64V; MT-ND6; MIM# 516006.0001; dbSNP Build 152: rs199476104), were considered pathogenic in 6/39 individuals, either contributing to or responsible for individual features (2/6 individuals) or as a secondary finding (4/6 individuals). The almost homoplasmic m.9035T>C variant was confirmed by Sanger sequencing as heteroplasmic in blood and homoplasmic in urine in a 30-year-old female who had presented with learning disabilities, ataxia, and axonal neuropathy for 10 years (Table 1 individual 1; Figure 3a), leading to clumsiness, fatigue, and balance problems. This variant was found in the same configuration in the individual’s 59-year-old mother, who had been diagnosed 6 years previously with isolated axonal neuropathy after undergoing family testing and had been experiencing pain and lower limb impairment (Figure 3a). The m.11778G>A variant was confirmed by PCR-RFLP in blood (heteroplasmic status) and fibroblasts (homoplasmic status) in a 30-year-old male who presented a newly described mosaic neuroectodermal dysplasia, with dental, pigmentary, acral, and cerebral anomalies, eye malformation, and vision loss (Table 1 individual 2; Figure 3b). This mosaic condition had previously been explained by a causal postzygotic variant in a nuclear gene (Vabres et al., 2019).
However, the proband’s severe ophthalmological impairment was atypical vis-à-vis the other individuals harboring pathogenic variants in the same gene; he was indeed the only individual in this cohort to carry the m.11778G>A variant (unpublished data) which was suspected of contributing to the ophthalmological severity. This variant was also found in the blood (homoplasmic) of his unaffected 59-year-old mother.
Individual (sex) | Age (years) | Haplogroup determined (HaploGrep 2.0) | mtDNA variant (protein change) | Mitomap status | Base depth | Described heteroplasmy rate | Individual heteroplasmy rate (alt/tot) | Variant-associated diseases | Individual phenotype |
---|---|---|---|---|---|---|---|---|---|
1 (Female) | 30 | U5a1i1 | m.9035T>C (p.L170P) | Confirmed | 17 | 100%; 90–96% | 100% (17/17) | Ataxia syndromes (Pfeffer et al., 2012) | Progressive ataxia |
2 (Male) | 30 | H6a1b3 | m.11778G>A (p.R340H) | Confirmed | 432 | 100% or less | 100% (432/432) | LHON (Wallace et al., 1988) | Ocular impairments + mosaic neuroectodermal dysplasia |
3 (Male) | Fetus | H23 | m.1494C>T | Confirmed | 17 | 100% | 100% (17/17) | Aminoglycoside-induced non-syndromic deafness (Guan, 2004; H.Zhao et al., 2004) | Developmental anomalies |
4 (Male) | 19 | H1c | m.1555A>G | Confirmed | 27 | 100% or less | 66.67% (18/27) | Aminoglycoside-induced non-syndromic deafness (Casano et al., 1999; Fischel-Ghodsian, Prezant, Bu, & Öztas, 1993; Matsunaga et al., 2004) | Ataxia + nystagmus |
5 (Male) | 15 | H1be | m.1555A>G | Confirmed | 15 | 100% or less | 100% (15/15) | Aminoglycoside-induced non-syndromic deafness (Casano et al., 1999; Fischel-Ghodsian et al., 1993; Matsunaga et al., 2004) | Polymalformative syndrome |
6 (Male) | 34 | X1'2'3 | m.14484T>C (p.M64V) | Confirmed | 5 | 100% or less | 100% (5/5) | LHON (Wallace & Lott, 2017) | Muscular dystrophy |
7 (Female) | 20 | L2a1a2 | m.14502T>C (p.I58V) | Reported - possibly synergistic | 10 | 100% | 100% (10/10) | LHON (F.Zhao et al., 2009) | Rubinstein-Taybi syndrome |
8 (Female) | 5 | M10a1a1b1 | m.14502T>C (p.I58V) | Reported - possibly synergistic | 49 | 100% | 97.95% (48/49) | LHON (F.Zhao et al., 2009) | GLUT1 deficiency syndrome 1 |
9 (Female) | 28 | HV0a1 | m.14502T>C (p.I58V) | Reported - possibly synergistic | 3 | 100% | 100% (3/3) | LHON (F.Zhao et al., 2009) | Intellectual disabilities + developmental anomalies |
- Abbreviations: LHON, Leber's Hereditary Optic Neuropathy.
- Note: The reference sequence used was GenBank NC_012920.1. These variants were submitted to ClinVar (SUB5620966): https://www.ncbi.nlm.nih.gov/clinvar/.
- Haplogroup assignment was performed with HaploGrep 2.0 online software.

Molecular validation of the candidate variants of individuals 1 and 2. (a) The m.9035T>C was confirmed by Sanger sequencing in blood and urine for both individual 1 and her mother. The mutant load was close to 100%. (b) The m.11778G>A was confirmed by PCR-RFLP in the blood and fibroblast cells of proband 2 and the blood of his mother. The heteroplasmic rate was almost 100% for proband 2, and the variant was homoplasmic in the blood of the mother. PCR, polymerase chain reaction; RFLP, restriction fragment length polymorphism
The three other variants “confirmed” in Mitomap were considered secondary findings, as they were identified in individuals without a matching phenotype (Table 1; Individuals 3–6). The m.1494C>T and m.1555A>G variants involved in aminoglycoside-induced deafness were found in a fetus with DA (cleft palate, oligodactyly, and lower limb hypoplasia; homoplasmic m.1494C>T variant), in one 19-year-old patient with ataxia (heteroplasmic m.1555A>G variant), and in one 15-year-old patient with polymalformative syndrome (homoplasmic m.1555A>G variant). No progressive deafness was recorded in their family histories. The homoplasmic m.14484T>C variant responsible for LHON (MIM# 535000) was identified in a 34-year-old individual suffering from muscular dystrophy but exhibiting no ophthalmological features. These variants were submitted to ClinVar, #SUB5620966 (https://www.ncbi.nlm.nih.gov/clinvar/).
Seventeen variants described as “reported” in Mitomap were identified in 23 individuals (Figure 2b). Sixteen were filtered out because the individuals’ phenotype or heteroplasmic rates were not concordant with the literature. The homoplasmic NC_012920.1:m.5567T>C (MT-TW; MIM# 590095) variant was identified in two unrelated individuals (55 and 14 years old), both affected with cerebellar atrophy and ataxia and sharing the same haplogroup (K1a). The variant harbored a low MitoTIP score (32.70%), was predicted to be likely benign and identified in 41 GenBank sequences. The common mitochondrial background was also not in favor of its pathogenicity but could highlight a common evolving story. This variant was not kept for further analysis.
For the 11 remaining individuals, variants were filtered out because of nonconcordance between individual phenotypes or heteroplasmic rates and the literature. First, three variants found in three individuals were described as “unclear” in Mitomap or had conflicting reports. Two individuals with two of these variants were not considered because their phenotypes were not consistent with those previously described. The last individual, presenting sensorineural hearing loss due to the dilation of vestibules and aqueducts, had the homoplasmic NC_012920.1:m.8348A>G variant (MT-TK; MIM# 590060; dbSNP Build 152: rs1556423430). The phenotypes described for this variant diverged considerably, and its pathogenicity remained unclear: it was finally not considered pathogenic. Second, two variants, NC_012920.1:m.11696G>A (YP_003024035.1:p.V312I; MT-ND4; MIM# 516003; dbSNP Build 152: rs200873900) and NC_012920.1:m.14502T>C (YP_003024037.1:p.I58V; MT-ND6; MIM# 516006; dbSNP Build 152: rs201327354), found in four individuals, were described as possibly synergistic when associated with the primary pathogenic m.1555A>G (deafness), m.11778G>A (LHON), and m.14484T>C (LHON) variants. Synergistic variants are described as associated with primary pathogenic variants involved in known pathologies and are thought to modulate the clinical phenotypes. There is some evidence that homoplasmic m.14502T>C could also be directly involved in LHON (F.Zhao et al., 2009), with lower penetrance than when it is associated with primary pathogenic variants. It was found in three individuals affected with Rubinstein–Taybi syndrome, GLUT1 deficiency syndrome, and intellectual disability associated with DA. This homoplasmic m.14502T>C variant was considered to be a secondary finding and was confirmed with a second molecular method when DNA was still available. This variant was submitted to ClinVar, #SUB5620966 (https://www.ncbi.nlm.nih.gov/clinvar/).
In the absence of primary pathogenic variants in their mtDNA and a nonconcordant phenotype, m.11696G>A variant was removed from the candidate variant list. The last four variants, found in five individuals, were already seen variants but not classified in Mitomap. These variants were not retained because the phenotypes were not concordant when compared to the literature data.
Most NUMTs appeared intergenic or intronic and therefore not covered by ES (unpublished data; Figure 4a). For possible causal variants, the study of nuclear NUMTs revealed that none of the variants was found in these regions, so the variants detected in our cohort were mtDNA variants as confirmed by molecular validation. In parallel, each position depth was studied: all the regions were covered by off-target reads but mean depth was variable. The NUMT analysis confirmed that the whole mtDNA could be captured with Agilent enrichment kits.

Off-target mitochondrial reads depth. (a) Smooth depth average is represented. All mitochondrial regions are covered with off-target reads. NUMTs are represented depending on their nuclear localization and their equivalent mitochondrial coordinates. (b) Mean depth distribution is presented. Threshold (×5) is represented by a blue vertical line. Less than 1% of the samples have a mean depth lower than ×5 (Griffin et al., 2014). NUMTs, nuclear mitochondrial DNA sequences
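The mean depth threshold shown in panel (b) can be illustrated as follows. The sketch assumes that per-position depths are available for each sample as a list of integers (for example, parsed from a per-base coverage report); the sample names and depth values are made up.

```python
from statistics import mean
from typing import Dict, List


def low_coverage_samples(depths: Dict[str, List[int]],
                         threshold: float = 5.0) -> List[str]:
    """Return samples whose mean mtDNA depth is below the threshold (x5)."""
    return sorted(sample for sample, per_base in depths.items()
                  if mean(per_base) < threshold)


# Made-up per-position depths for two samples
depths = {
    "sampleA": [12, 20, 17, 15],  # mean 16
    "sampleB": [3, 4, 6, 2],      # mean 3.75
}
print(low_coverage_samples(depths))
```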
MToolBox was tested on 12 samples: our four positive controls, individuals 1–3, 5–7, and 9, and the mother of individual 2. This tool failed to identify one variant (m.14502T>C) in individual 9, which was confirmed by mtDNA NGS. The correlation coefficient was thus 0.478 between MToolBox and mtDNA-specific methods, whereas it was 0.916 between our method and mtDNA-specific methods.
4 DISCUSSION
This study of indirect mtDNA sequencing on ES data in 928 individuals with DA and/or neurological disorders led to the identification of two different pathogenic variants (m.9035T>C and m.11778G>A) in 2/928 unrelated individuals (Table 1), responsible for or contributing to the phenotype (0.13% of the DA cohort and 0.6% of the neurological cohort), as well as secondary findings (m.1494C>T, m.1555A>G, m.14502T>C, and m.14484T>C) in 7/928 unrelated individuals.
Certain bioinformatics and biological challenges were faced during the course of the study. The bioinformatics pipelines required substantial modifications to properly manage indirect mtDNA sequencing data (specific reference during alignment, specific databases, and specific steps not existing in nuclear pipelines: mitochondrial chromosome extraction and haplogroup assignment). Indeed, although nuclear and mtDNA molecules were simultaneously extracted and sequenced (Samuels et al., 2013), nuclear and mitochondrial ES data could not be treated in the same way, since DNA references, databases, and variant filters were different. The first improvement concerned the reference genome. While most teams continued to align ES data on GRCh37/hg19, which is still mostly used in public databases, only GRCh38 contained the current mtDNA reference (Ye et al., 2014). Complete alignment on GRCh38 has been described as the best approach to decrease bias in heteroplasmic rate determination due to the overalignment of NUMTs. Since this study was based on off-target reads, it was essential to verify that detected mtDNA variants were not linked to NUMTs.
One of the major points of this study was the automatic haplogroup determination to improve variant interpretation and prioritization in a large unspecific cohort of patients not suspected of mitochondrial disorders. Haplogroup identification made it possible to verify the quality of mitochondrial genome reconstruction (Diroma et al., 2014). Filtering of haplogroup-defining variants for prioritization has already been described (Santorsola et al., 2016) in studies of tumor cells (F. M. Calabrese et al., 2016) or in a cohort of individuals suspected of mitochondrial diseases (Patowary et al., 2017).
In oncology, comparison of germline and somatic mtDNA variants highlighted specific mtDNA variants requiring prioritization filters such as population frequency or haplogroup assignment (F. M. Calabrese et al., 2016). In a cohort with suspected mitochondrial disorders, filtering out haplogroup-defining variants made it possible to prioritize mtDNA variants, to confirm the causality of well-known variants for LHON (Santorsola et al., 2016) during performance evaluation of the prioritization criteria, or to highlight variants of interest in autism (Patowary et al., 2017), a disease that may be linked to mitochondrial dysfunction. In our large cohort, which was not suspected of mitochondrial disorders, haplogroup determination removed common polymorphisms that had not been filtered upstream.
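The prioritization logic described above (removing haplogroup-defining variants and variants that are frequent in the cohort) can be sketched as follows. This is a simplified illustration under stated assumptions, not the filter implemented in the study: the function name, the input formats, and the 1% cohort frequency cutoff are hypothetical, and the example inputs are invented for demonstration.

```python
def prioritize_variants(variants, haplogroup_defining, cohort_counts,
                        cohort_size, max_cohort_freq=0.01):
    """Keep variants that are neither haplogroup-defining nor frequent in the cohort.

    max_cohort_freq is a hypothetical cutoff, not a value from the study.
    """
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    kept = []
    for variant in variants:
        if variant in haplogroup_defining:
            continue  # expected polymorphism for this sample's haplogroup
        freq = cohort_counts.get(variant, 0) / cohort_size
        if freq > max_cohort_freq:
            continue  # too common in the cohort to be prioritized
        kept.append(variant)
    return kept

# Hypothetical inputs for illustration only.
sample_variants = ["m.73A>G", "m.263A>G", "m.11778G>A"]
defining = {"m.73A>G"}                          # assumed haplogroup-defining for this sample
counts = {"m.263A>G": 900, "m.11778G>A": 1}     # assumed carrier counts in the cohort
print(prioritize_variants(sample_variants, defining, counts, cohort_size=928))
# prints ['m.11778G>A']
```

In this toy example, one variant is dropped because it defines the sample's haplogroup and another because it is carried by most of the cohort, leaving a single candidate for interpretation.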
Variant interpretation also presented “biological” challenges and limits. The first group of five “confirmed” variants in Mitomap was straightforward because the variants were well described and their pathogenicity demonstrated. When phenotype and mutant load were concordant with the literature, the variants were considered causal after molecular validation (Individuals 1 and 2). When phenotypes were not concordant with those previously described in the literature (Individuals 3–9), variants were considered secondary findings, given that incomplete penetrance has been described for mtDNA variants. Individual 1 and her mother carried the homoplasmic m.9035T>C variant, which was responsible for ataxia and a milder maternal phenotype (Figure 3a). This variant has previously been reported in ataxic syndromes (Pfeffer et al., 2012). Individual 2 and his unaffected mother carried the homoplasmic m.11778G>A variant (Figure 3b), associated with low-penetrance LHON (Wallace et al., 1988). This molecular double hit allowed us to specify the phenotypical spectrum of the neuroectodermal disorder, providing important data for genetic counseling.
Other variants in Mitomap were classified as “reported,” “synergistic,” or “unclear,” or were unclassified. Their pathogenicity was not clearly established, and they were excluded because of the absence of primary variants, a low pathogenicity prediction score, or phenotype nonconcordance.
One of the well-known limits in the exploration of mtDNA is the determination of the mutant load, since the heteroplasmic state depends heavily on the choice of tissue. As mtDNA data were derived from off-target ES sequences, mtDNA regions were not homogeneously covered (Figure 4a). The mean depth was lower for mtDNA (×50 mean coverage) than for specifically targeted nuclear genes (×100 mean coverage). The differences may be due to the capture kits chosen for ES, since each kit has a distinct design and different targeted sequences. Thus, off-target region sequencing differed, resulting in nonuniform mitochondrial coverage. Some variants or heteroplasmic rates could not be detected because the depth and coverage obtained by this method are not optimal for the study of mitochondria and are significantly below those of an mtDNA NGS study (~×500; Figure 4b). Specific mtDNA analysis therefore remains indicated when our method gives negative results. Nevertheless, as shown by the analysis of positive controls, our method can determine both homoplasmic/heteroplasmic status and heteroplasmic rate even with low depth. Heteroplasmic rates were comparable between the initial method and our determination (correlation coefficient 0.943). Moreover, a good correlation between mtDNA NGS data and our sequencing data can be observed, leading to an increase in diagnostic yield, as new cases have been solved. As GS can use PCR-free technology, adapting our method to a GS pipeline would yield better coverage and depth. Indeed, mtDNA molecules are 10 to 100 times more abundant in cells than nDNA molecules (Dinwiddie et al., 2013), which improves the identification of variants and heteroplasmic rates. A lower depth made the interpretation of variants difficult, considering the presence of a variant calling threshold.
The determination of the mutant load also depends on the number of aligned reads (Griffin et al., 2014). For this study, ES data were obtained from only one type of tissue, mostly blood, in which a candidate variant could be absent or present at a low or even undetectable level. As coverage remains heterogeneous, NGS analysis of the whole mtDNA remains indicated in individuals strongly suspected of being affected by a mitochondrial disorder.
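To illustrate why the number of aligned reads matters for the mutant load, the following Python sketch estimates a heteroplasmic rate from read counts and attaches a Wilson score interval to it. This is a minimal illustration under stated assumptions, not the computation used in the study: the function names are hypothetical, and applying the ×5 value as a per-position minimum depth is an assumption (in the figure above it is a threshold on mean depth).

```python
import math

def heteroplasmy_rate(alt_reads, depth, min_depth=5):
    """Fraction of reads carrying the variant, or None if depth is too low.

    min_depth=5 reuses the x5 value as a per-position cutoff (assumption).
    """
    if alt_reads < 0 or depth < 0 or alt_reads > depth:
        raise ValueError("require 0 <= alt_reads <= depth")
    if depth < min_depth:
        return None  # not enough reads to report a rate
    return alt_reads / depth

def wilson_interval(alt_reads, depth, z=1.96):
    """Wilson score interval (about 95% for z=1.96) for the variant fraction."""
    if depth <= 0 or alt_reads < 0 or alt_reads > depth:
        raise ValueError("require depth > 0 and 0 <= alt_reads <= depth")
    p = alt_reads / depth
    denom = 1 + z * z / depth
    center = (p + z * z / (2 * depth)) / denom
    half = z * math.sqrt(p * (1 - p) / depth + z * z / (4 * depth * depth)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

For the same observed rate, the interval is much wider at a depth of ×10 than at ×500, which is one way to express why a low-depth heteroplasmic call is harder to interpret than one obtained by dedicated mtDNA NGS.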
Incomplete penetrance is another challenge for variant interpretation. Indeed, even if family segregation can help, it can remain difficult to diagnose a mitochondrial disorder because of incomplete penetrance or unusual phenotypical variability.
We did not expect the rate of secondary findings to be higher than the rate of positive diagnoses. These molecular findings are responsible for well-defined phenotypes with incomplete penetrance, estimated at 0.53% of patients for LHON in this cohort, more than the expected rate in the general population of 0.29% (1/350), and at 0.3% of patients for aminoglycoside-induced deafness, similar to the expected rate in the general population (0.28%). It is worth discussing whether these results should be returned to patients, considering their potential use for health prevention counseling.
To date, only one study has reported the development of analytical tools for evaluating mtDNA in whole-exome data for the diagnosis of rare diseases (Patowary et al., 2017). Patowary et al. (2017) studied ten multiplex families with autism to provide further support for the role of mitochondria in autism spectrum disorders. They confirmed that whole-exome sequencing may be combined with mtDNA analysis. They highlighted the challenges of analyzing mtDNA variants with ES, including phenotype heterogeneity, age of onset, incomplete penetrance, and the heteroplasmic rate. They chose MToolBox to detect mitochondrial variants from ES data. This tool uses a different method to analyze the mtDNA sequence. ES data are first aligned on the mitochondrial reference and then on the hg19 reference. Reads uniquely mapped on the mtDNA reference are kept. They are then used to generate a VCF file, to determine the heteroplasmy rate, and to reconstruct a complete mitochondrial genome. Based on the latter, an individual haplogroup is assigned by macro-haplogroup-specific consensus sequence alignment. Our method does not need a reconstructed mitochondrial genome to assign the haplogroup, thanks to the HaploGrep2 tool. In addition, allele frequencies were extracted from 1000 Genomes samples rather than from Mitomap, a specific mtDNA database also containing more entries.
The correlation coefficient between our method and mtDNA-specific methods is higher than the correlation coefficient between MToolBox and mtDNA-specific methods. Our pipeline can thus be considered more sensitive.
In conclusion, we developed a bioinformatics pipeline to prospectively identify mtDNA variants by indirect mtDNA sequencing from ES data, in parallel with nuclear exome analysis. After testing the approach on a series of patients carrying pathogenic mtDNA variants, we confirmed the value of implementing this approach systematically in routine ES bioinformatics pipelines in large cohorts. This technique provides a significant opportunity to reduce the diagnostic odyssey (Thevenon et al., 2016) for certain patients. However, the question of how to manage secondary findings from mtDNA remains complex and warrants further discussion.
ACKNOWLEDGMENT
We thank the probands and their families for their participation; and the Center De Calcul (CCuB) at the University of Burgundy for providing technical support and management of the informatics core facility. This work was supported by grants from the Regional Council of Burgundy (to C.T.-R.), the FEDER 2017, PARI 2017, and CIFRE (ANRT) between Laboratoire Cerba and Regional Council of Burgundy for the doctoral work at Laboratoire Cerba and GAD.
CONFLICT OF INTERESTS
The authors declare that there are no conflicts of interest.