Estimating the proportion of pathogenic variants from breast cancer case–control data: Application to calibration of ACMG/AMP variant classification criteria
Abstract
For genes with reliable estimates of disease risk associated with loss-of-function variants, case–control data can be used to estimate the proportion of variants of typical risk effect for defined groups of variants, of relevance for variant classification. A calculation was derived for a maximum likelihood estimate of the proportion of pathogenic variants of typical effect from case–control data and applied to rare variant counts for ATM, BARD1, BRCA1, BRCA2, CHEK2, PALB2, RAD51C, and RAD51D from published breast cancer studies: BEACCON (5770 familial cases and 5741 controls) and breast cancer risk after diagnostic sequencing (60,466 familial and population-based cases and 53,461 controls). There was significant evidence of pathogenic variants among rare noncoding variants, in particular deeper intronic variants, for BRCA1 (13%, p = 8.3 × 10^−7), BRCA2 (6%, p = 0.016) and PALB2 (13%, p = 0.001). The estimated proportion of pathogenic missense variants varied markedly between genes, generally with enrichment in familial cases, for example, 9% for BRCA2 versus 60%–90% for CHEK2. Stratifying missense variants by position indicated that, for most genes, location within a functional domain significantly predicted pathogenicity, whereas location outside domains provided robust evidence against pathogenicity. Our approach provides novel insights into the spectrum of pathogenic variants of specific breast cancer genes and has wider application to inform gene-focused specifications of American College of Medical Genetics and Genomics (ACMG)/Association for Molecular Pathology (AMP) codes for variant curation.
1 INTRODUCTION
Genetic case–control studies are typically designed to provide an estimate of the risk associated with germline pathogenic variants (PVs) in disease-related genes. However, if the disease risk associated with PVs in a particular gene has already been reliably established, case–control data can instead be used to estimate the proportion of PVs of typical risk effect within a defined group of variants; for example, missense variants overall or missense variants located in a specific protein domain. This approach can provide interesting biological insights and also contribute to the calibration and specification of variant classification criteria. Here we demonstrate the utility of a simple maximum likelihood calculation for estimating the proportion of PVs within specific subsets of variants for several breast cancer genes and consider its relevance for clinical calibration of American College of Medical Genetics and Genomics (ACMG)/Association for Molecular Pathology (AMP) gene-specific classification criteria.
2 METHODS
2.1 Maximum likelihood estimate (MLE) of the proportion of PVs
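The estimator can be written compactly as a two-component mixture. The notation below is an assumption chosen to match the symbols $V_{case}$, $V_{total}$, and $P(case)$ used in this section: $N_{case}$ and $N_{control}$ denote the numbers of sequenced cases and controls, $OR$ the odds ratio of a typical PV, and $\pi$ the proportion of PVs among the variant observations. The authors' own derivation is given in the Supporting Methods; the form here is a reading that reproduces the point estimates in Table 1 (for example, 0.13 for BRCA1 deeper intronic variants).

$$
P(case \mid BV) = \frac{N_{case}}{N_{case} + N_{control}}, \qquad
P(case \mid PV) = \frac{OR \cdot N_{case}}{OR \cdot N_{case} + N_{control}}
$$

$$
P(case) = \pi \, P(case \mid PV) + (1 - \pi) \, P(case \mid BV)
$$

$$
\hat{\pi} = \min\left\{1,\ \max\left\{0,\ \frac{V_{case}/V_{total} - P(case \mid BV)}{P(case \mid PV) - P(case \mid BV)}\right\}\right\}
$$

Because $V_{case}$ is binomial with success probability $P(case)$, and $P(case)$ is linear in $\pi$, the maximum of the likelihood is obtained by inverting the observed fraction $V_{case}/V_{total}$ and truncating the result to the interval from 0 to 1.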
The 95% confidence interval for the estimated proportion of PVs is determined by the values of that proportion for which the binomial distribution B(Vcase; Vtotal, P(case)) has a cumulative probability between 0.025 and 0.975 (Supporting Methods), where Vcase is the number of variant observations in cases, Vtotal is the total number of variant observations in cases and controls, and P(case) is the probability that a variant observation occurs in a case. The accuracy of the estimate depends on the number of variants observed in the group being examined, not on the total number of individuals in the study (Figure S2). The calculation is based on the counts of variants observed rather than the number of unique sequence variants, and can be applied to case–control summary statistics. Without information on the individual sequence variants, it is not possible to determine the relative contribution of any recurrently observed PV, such as founder variants.
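The calculation described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Supporting Calculator: it assumes that benign variants are equally frequent in cases and controls, that PVs are enriched in cases by the gene's odds ratio, and that the interval is found by scanning a grid of candidate proportions. All function and parameter names are hypothetical.

```python
from math import exp, lgamma, log


def p_case_given(odds_ratio, n_cases, n_controls):
    """Probability that a carrier of a variant with this odds ratio is a case."""
    return odds_ratio * n_cases / (odds_ratio * n_cases + n_controls)


def binom_cdf(k, n, q):
    """P(X <= k) for X ~ Binomial(n, q), summed in log space to avoid overflow."""
    log_q, log_1q = log(q), log(1.0 - q)
    log_n_fact = lgamma(n + 1)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (log_n_fact - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log_q + (n - i) * log_1q)
        total += exp(log_pmf)
    return min(total, 1.0)


def pv_proportion(v_case, v_control, n_cases, n_controls, odds_ratio, step=0.002):
    """Return (MLE, (ci_low, ci_high), one-sided p) for the proportion of PVs."""
    v_total = v_case + v_control
    p_bv = p_case_given(1.0, n_cases, n_controls)         # benign: OR = 1
    p_pv = p_case_given(odds_ratio, n_cases, n_controls)  # typical PV
    # Invert the observed case fraction and truncate to [0, 1].
    raw = (v_case / v_total - p_bv) / (p_pv - p_bv)
    mle = min(max(raw, 0.0), 1.0)
    # Keep candidate proportions whose binomial CDF at v_case lies in (0.025, 0.975).
    inside = []
    for j in range(int(round(1.0 / step)) + 1):
        p = j * step
        q = p * p_pv + (1.0 - p) * p_bv
        if 0.025 < binom_cdf(v_case, v_total, q) < 0.975:
            inside.append(p)
    ci = (min(inside), max(inside)) if inside else (None, None)
    # One-sided test of "no variant in the group is pathogenic".
    p_value = 1.0 - binom_cdf(v_case - 1, v_total, p_bv)
    return mle, ci, p_value


# BRCA1 deeper intronic variants from Table 1 (BEACCON), OR 10.57 from Table 2.
mle, ci, p_value = pv_proportion(1090, 874, 5770, 5741, 10.57)
print(round(mle, 2), ci, p_value)
```

With the BRCA1 deeper intronic counts, the point estimate rounds to 0.13, in line with Table 1. The grid scan trades some precision for simplicity; a root finder on the same cumulative probability would give sharper interval limits.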
2.2 Assigning the strength of evidence associated with a set of variants for classification
If a set of variants is consistently enriched for either benign variants or PVs, then belonging to that set can be used as evidence for classification as either benign or pathogenic, with a likelihood ratio, LRpathogenic = P(pv)/P(bv), where P(pv) and P(bv) are the estimated proportions of pathogenic and benign variants in the set. The MLE of the LR can be converted to a criterion weight for use in ACMG/AMP classification (Richards et al., 2015) using the intervals derived in Tavtigian et al. (2018; Table S1). To ensure that it is valid to apply this code strength, a hypothesis test is applied to determine that the 95% confidence interval of the MLE for the proportion of PVs, derived from the binomial distribution, does not overlap the interval associated with no evidence, that is, LR = 1. A simple calculator was developed to estimate the LR and ACMG/AMP code strength applicable for gene-specific evidence criteria, based on case–control variant counts for any predefined category (see Supporting Calculator).
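As an illustration of this conversion, the sketch below maps an estimated PV proportion to a likelihood ratio and an evidence strength. It rests on two assumptions that should be checked against Table S1 and the Supporting Calculator: the cut-offs are the odds-of-pathogenicity values commonly quoted from Tavtigian et al. (2018) (2.08, 4.33, 18.7, 350), and the validity check is simplified to requiring that the confidence interval for the proportion excludes 0.5, the value at which LR = 1. The function names are hypothetical.

```python
# Assumed cut-offs (Tavtigian et al., 2018); Table S1 may differ in detail.
STRENGTH_CUTOFFS = (
    ("very strong", 350.0),
    ("strong", 18.7),
    ("moderate", 4.33),
    ("supporting", 2.08),
)


def likelihood_ratio(pv_proportion):
    """LR towards pathogenicity, P(pv) / P(bv), for a proportion strictly in (0, 1)."""
    if not 0.0 < pv_proportion < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return pv_proportion / (1.0 - pv_proportion)


def evidence(pv_proportion, ci_low, ci_high):
    """Return (direction, strength, LR in that direction) for a variant set."""
    # Simplified validity check: the interval must exclude p = 0.5 (LR = 1).
    if ci_low <= 0.5 <= ci_high:
        return ("none", "not significant", 1.0)
    lr_path = likelihood_ratio(pv_proportion)
    direction = "pathogenic" if lr_path > 1.0 else "benign"
    lr = lr_path if lr_path > 1.0 else 1.0 / lr_path
    for label, cutoff in STRENGTH_CUTOFFS:
        if lr >= cutoff:
            return (direction, label, lr)
    return (direction, "below supporting", lr)


# A set in which 24% of variants are estimated to be pathogenic (95% CI 0.15-0.34).
print(evidence(0.24, 0.15, 0.34))
```

A proportion below 0.5 yields evidence towards benign classification, so the function reports the reciprocal LR in that case.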
2.3 Breast cancer case–control data sets
To illustrate the type of results that can be generated by this approach, we examined specific variant subgroups of interest from two case–control studies, which assessed the frequency of variants in breast cancer predisposition genes in sizable cohorts.
The hereditary breast case-control (BEACCON) study (Li et al., 2021) is a gene discovery study with elements of “extreme phenotype” design. Female breast cancer cases (n = 5770) were recruited through Australian Familial Cancer Clinics and are enriched for early-onset cancer and strong family history. Control subjects (n = 5741) were older, cancer-free women recruited via a national breast-screening program. All participants had previously undergone sequencing of BRCA1 and BRCA2, and women with pathogenic BRCA1 or BRCA2 variants were excluded. Genetic data were derived from targeted capture and sequencing of the coding region and splice junctions for up to 1330 known and candidate genes, as well as sequencing of the full gene locus of BRCA1, BRCA2, and PALB2 covering all introns and the promoter regions. As such, subgroup analysis on noncoding variation for the BEACCON data set was limited to BRCA1, BRCA2, and PALB2. To allow comparison with the breast cancer risk after diagnostic sequencing (BRIDGES) study, analysis considered only variants with global minor allele frequency < 0.001.
The BRIDGES collaborative study (Dorling et al., 2021) combined sequence data from female breast cancer cases and controls from 44 studies internationally, to estimate risks associated with variations in 32 known/proposed breast cancer genes. Cases (n = 60,466) and controls (n = 53,461) were sequenced using a targeted panel of all exons and splice boundaries for the genes of interest. For reporting, contributing studies were divided into population-based cohorts and familial studies with ascertainment enriched for family history. BRIDGES analysis considered only variants with global minor allele frequency < 0.001, which passed study quality filters. The MLE subgroup analysis of the BRIDGES data set was necessarily limited to the analysis of exonic variation in genes for which appropriate summary counts were available. Summary counts of missense variants inside/outside domains were provided for ATM, BARD1, BRCA1, BRCA2, CHEK2, PALB2, RAD51C, and RAD51D (Table S13), and missense variant counts for specific domains within genes were provided for ATM, BRCA1, BRCA2, CHEK2, and PALB2 (Tables S14–S18), where table numbers refer to those from the parent publication (Dorling et al., 2021).
Data were sourced from public material or provided under approval of the relevant Human Research Ethics Committees (Supporting Material). Variants were annotated against the canonical transcripts in both studies, as defined by the Matched Annotation from the National Center for Biotechnology Information and European Bioinformatics Institute (MANE) project. The per-gene odds ratios reported for “loss of function” variants from population-based cohorts in the BRIDGES study (Dorling et al., 2021) were used as the best estimate of the expected breast cancer risk associated with PVs (see Table 2).
3 RESULTS
3.1 Noncoding variants by region
In the clinical setting, noncoding variation outside the essential splice site region is rarely sequenced or interpreted, although deeper intronic or promoter PVs have been described for hereditary breast cancer genes (Canson et al., 2020; Evans et al., 2018). Further, synonymous exonic (silent) substitutions may be filtered or ignored in the context of reporting. To estimate the proportion of PVs in this variant category, we examined the BEACCON study for rare noncoding and synonymous variants in BRCA1, BRCA2, and PALB2. Results provided evidence that a subset of rare noncoding variants are disease-associated variants of typical effect (Table 1). Considering variants subdivided based on gene location (promoter, untranslated region, exon, intronic splice region, and deeper intronic), evidence for enrichment of PVs was significant among deeper intronic variants for all three genes (p = 0.02 to 8.3 × 10^−7), with borderline evidence of pathogenicity for a subset of promoter region variants for BRCA2 (p = 0.04).
Gene / region | Case | Control | PV MLE | 95% CI | p^a
---|---|---|---|---|---
BRCA1 | | | | |
Promoter | 95 | 93 | 0.01 | 0–0.19 | 0.43
5′, 3′-UTRs | 32 | 36 | 0 | 0–0.23 | 0.65
Intronic splice region^b | 18 | 22 | na | |
Deeper intronic | 1090 | 874 | 0.13 | 0.08–0.18 | 8.3 × 10^−7
Synonymous exonic | 25 | 37 | 0 | 0–0.08 | 0.92
BRCA2 | | | | |
Promoter | 76 | 56 | 0.21 | 0–0.45 | 0.04
5′, 3′-UTRs | 11 | 17 | 0 | 0–0.26 | 0.83
Intronic splice region^b | 13 | 14 | na | |
Deeper intronic | 1176 | 1070 | 0.06 | 0.01–0.12 | 0.016
Synonymous exonic | 63 | 80 | 0 | 0–0.07 | 0.91
PALB2^b | | | | |
Promoter | 119 | 111 | 0.06 | 0–0.26 | 0.24
5′, 3′-UTRs | 7 | 8 | 0 | 0–0.71 | 0.49
Intronic splice region^b | 7 | 2 | 0.84 | 0.06–Inf | 0.02
Deeper intronic | 628 | 533 | 0.13 | 0.05–0.22 | 0.001
Synonymous exonic | 28 | 22 | 0.19 | 0–0.61 | 0.15
- Note: Number of individuals with sequence data for the full gene locus was as follows: BRCA1 and BRCA2, 5770 cases and 5741 controls; PALB2, 3780 cases and 3839 controls. As noted in the methods, BEACCON cases excluded individuals with clinically detected pathogenic variants in BRCA1 or BRCA2.
- Abbreviations: CI, confidence interval; Inf, infinite; MLE, maximum likelihood estimate; na, not applicable (the proportion of PVs in the canonical splice sites of BRCA1 and BRCA2 was not calculated, as these changes would be expected to have been detected by the previous clinical testing in this group and consequently to be absent by design); PV MLE, maximum likelihood estimate of the proportion of pathogenic variants of typical effect; UTRs, untranslated regions.
- a One-sided p that corresponds to the probability that no variants in the group are pathogenic.
- b Intronic splice region was defined as +1 to +7 at the donor site and −1 to −21 at the acceptor site, with remaining intronic sequence labeled “deeper intronic.”
The data also indicated that the large majority of rare noncoding variants are benign. Considered as a classification criterion, for all three genes the observation of a rare variant in a noncoding region corresponds to moderate-level evidence for classification as benign. Based on the sum of all variants per gene in Table 1, the LRs against pathogenicity were as follows: BRCA1 8.09 (5.29–15.39), BRCA2 14.15 (7.20–89.91), and PALB2 7.06 (3.95–19.83). Application of the hypothesis test found it was valid to apply these LRs for all three genes. To check that the calculation was not influenced by multiple observations of individual sequence variants, the calculation was repeated after excluding all recurrent variants, with no change to the overall conclusions (Table S2).
3.2 Rare missense variation
The degree to which missense variants contribute to the spectrum of germline PVs varies between genes. We estimated the proportion of PVs in the group of rare missense variants in eight genes for which there is sufficient support for an involvement of protein termination codon variants in breast cancer predisposition (Dorling et al., 2021), using data from BRIDGES (including cases grouped as either population-based or familial; Dorling et al., 2021) and BEACCON (familial cases; Li et al., 2021). A clear ascertainment effect was observed (Table 2): few or no PVs were predicted in this missense category for most genes from population-based data, but the estimated proportion was greater in the data sets with familial case recruitment. Overall, the proportion of missense variants estimated to be pathogenic was low for BRCA1, BRCA2, and PALB2, but was greater for other genes. In particular, even in the population studies, a third of all the rare CHEK2 missense variants were estimated to have a pathogenic effect.
Gene | OR (typical PV) | Study | Subgroup | Case^a | Control^a | PV MLE | 95% CI
---|---|---|---|---|---|---|---
BRCA1 | 10.57 | BRIDGES | Population | 1393 | 1300 | 0.06 | 0.02–0.11
BRCA1 | 10.57 | BRIDGES | Familial | 276 | 1099 | 0.04 | 0.01–0.09
BRCA1 | 10.57 | BRIDGES | In domain | 278 | 167 | 0.24 | 0.15–0.34
BRCA1 | 10.57 | BRIDGES | Out of domain | 1395 | 1213 | 0.01 | 0–0.05
BRCA2 | 5.85 | BRIDGES | Population | 2831 | 3038 | 0 | 0–0.01
BRCA2 | 5.85 | BRIDGES | Familial | 704 | 2616 | 0.09 | 0.05–0.13
BRCA2 | 5.85 | BRIDGES | In domain | 965 | 895 | 0 | 0–0.02
BRCA2 | 5.85 | BRIDGES | Out of domain | 2580 | 2359 | 0 | 0–0.01
PALB2 | 5.02 | BEACCON | All (familial) | 92 | 83 | 0.07 | 0–0.3
PALB2 | 5.02 | BRIDGES | Population | 805 | 892 | 0 | 0–0.02
PALB2 | 5.02 | BRIDGES | Familial | 237 | 806 | 0.14 | 0.07–0.22
PALB2 | 5.02 | BRIDGES | In domain | 805 | 742 | 0 | 0–0.03
PALB2 | 5.02 | BRIDGES | Out of domain | 247 | 244 | 0 | 0–0.03
CHEK2 | 2.54 | BEACCON | All (familial) | 122 | 71 | 0.60 | 0.3–0.92
CHEK2 | 2.54 | BRIDGES | Population | 895 | 697 | 0.32 | 0.22–0.44
CHEK2 | 2.54 | BRIDGES | Familial | 307 | 602 | 0.90 | 0.74–1
CHEK2 | 2.54 | BRIDGES | In domain | 852 | 502 | 0.46 | 0.36–0.57
CHEK2 | 2.54 | BRIDGES | Out of domain | 354 | 246 | 0.28 | 0.13–0.44
ATM | 2.10 | BEACCON | All (familial) | 319 | 219 | 0.52 | 0.29–0.75
ATM | 2.10 | BRIDGES | Population | 2411 | 2471 | 0.02 | 0–0.1
ATM | 2.10 | BRIDGES | Familial | 691 | 2139 | 0.49 | 0.38–0.61
ATM | 2.10 | BRIDGES | In domain | 1040 | 783 | 0.23 | 0.12–0.34
ATM | 2.10 | BRIDGES | Out of domain | 2064 | 1808 | 0.01 | 0–0.09
BARD1 | 2.09 | BEACCON | All (familial) | 77 | 55 | 0.47 | 0.01–0.95
BARD1 | 2.09 | BRIDGES | Population | 591 | 616 | 0 | 0–0.16
BARD1 | 2.09 | BRIDGES | Familial | 162 | 525 | 0.43 | 0.21–0.68
RAD51C | 1.93 | BEACCON | All (familial) | 28 | 24 | 0.24 | Undefined
RAD51C | 1.93 | BRIDGES | Population | 196 | 206 | 0 | 0–0.29
RAD51C | 1.93 | BRIDGES | Familial | 45 | 182 | 0.17 | 0–0.67
RAD51D | 1.8 | BEACCON | All (familial) | 56 | 30 | 1 | 0.36–1
RAD51D | 1.8 | BRIDGES | Population | 224 | 212 | 0.16 | 0–0.49
RAD51D | 1.8 | BRIDGES | Familial | 52 | 173 | 0.52 | 0.04–1
- Abbreviations: BRIDGES, breast cancer risk after diagnostic sequencing; CI, confidence interval; OR, odds ratio assigned as the typical effect of pathogenic variants (Dorling et al., 2021); PV MLE, maximum likelihood estimate of the proportion of pathogenic variants of typical effect.
- a Number of individuals with sequence data for each subgroup was as follows: BEACCON study 5770 cases and 5741 controls; BRIDGES study: 60,466 cases and 53,461 controls; Population based 48,826 cases and 50,703 controls; Familial studies 9408 cases and 43,451 controls. Domain analysis used counts for population and familial cohorts combined, as drawn from N Engl J Med (2021) Feb 4;384(5):428–439, PMID: 33471991, Supplementary Appendix (Tables S14–S18); slight differences in total counts (population + familial vs. in domain + out-of-domain) were not accounted for in this original report.
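The ascertainment effect in Table 2 can be explored by applying the same estimate to each cohort with its own case and control denominators. The sketch below assumes the mixture form in which benign variants are equally frequent in cases and controls and PVs are enriched in cases by the gene's odds ratio; the helper name `pv_mle` and the dictionary layout are hypothetical, and the counts are taken from Table 2 and its footnote.

```python
def pv_mle(v_case, v_control, n_cases, n_controls, odds_ratio):
    """Truncated mixture estimate of the PV proportion (assumed form, see lead-in)."""
    p_bv = n_cases / (n_cases + n_controls)
    p_pv = odds_ratio * n_cases / (odds_ratio * n_cases + n_controls)
    observed = v_case / (v_case + v_control)
    return min(1.0, max(0.0, (observed - p_bv) / (p_pv - p_bv)))


# Individuals sequenced per BRIDGES subgroup (cases, controls), Table 2 footnote a.
COHORTS = {"population": (48826, 50703), "familial": (9408, 43451)}

# Rare CHEK2 missense variant counts (cases, controls) per subgroup, Table 2.
CHEK2_COUNTS = {"population": (895, 697), "familial": (307, 602)}

estimates = {
    name: pv_mle(*CHEK2_COUNTS[name], *COHORTS[name], odds_ratio=2.54)
    for name in COHORTS
}
for name, value in estimates.items():
    print(f"CHEK2 {name}: {value:.2f}")
```

Because the familial subgroup contains far fewer cases than controls, the expected case fraction for a benign variant is much lower there than in the population subgroup, which is why the raw case and control counts cannot be compared across subgroups directly.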
3.3 Missense variants within or outside a known functional domain
The BRIDGES study (Dorling et al., 2021) reported risk associations for rare BRCA1, BRCA2, PALB2, CHEK2, and ATM missense variants according to their location in known functional domains, for population-based and familial cohorts combined (Data Supplement from Dorling et al., 2021). We used these data to estimate the proportion of missense PVs according to their location relative to known domains, information that can be used in variant classification through the application of ACMG/AMP criterion PM1 (Table 2). For BRCA2 and PALB2, according to the BRIDGES study data, there was no evidence for any pathogenic missense variants either within the known functional domains collectively or outside a known functional domain. For BRCA1 and ATM, location within a functional domain was associated with enrichment for PVs. Nevertheless, because the proportion of rare missense variants estimated to have a typical pathogenic effect was low overall, the LR for in-domain variants still amounted to only supporting evidence for benign effect: BRCA1 3.17 (1.82–6.87); ATM 3.35 (1.77–9.00). Outside a known domain, only ~1% of rare BRCA1 and ATM missense variants were estimated to have pathogenic effect, equivalent to strong evidence in favor of benign effect: BRCA1 LR 89.91 (15.95–undefined); ATM LR 75.92 (8.62–undefined). These LRs can be interpreted as combining both the overall low chance that a rare missense variant will be pathogenic in these genes and the relative depletion of PVs in regions outside the annotated functional domains, which is a ~24-fold difference in the case of both BRCA1 and ATM. For CHEK2, 46% of missense variants within a functional domain were estimated to be pathogenic (LR 1.14, not significant) versus 28% outside these regions (benign LR 2.56 [1.14–9.31]), suggesting that location with respect to a described functional domain provides little information to assist with classification of CHEK2 missense variants.
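The arithmetic behind these LRs can be made explicit. Writing $\hat{\pi}$ for the estimated proportion of PVs in a set (a symbol introduced here for illustration), the LR towards benign classification is the ratio of the benign to the pathogenic proportion. The values below use the rounded BRCA1 estimates from Table 2; the out-of-domain proportion of about 0.011 is inferred from the reported LR rather than stated in the table.

$$
\mathrm{LR}_{benign} = \frac{1 - \hat{\pi}}{\hat{\pi}}
$$

$$
\text{BRCA1 in domain: } \frac{1 - 0.24}{0.24} \approx 3.17, \qquad
\text{BRCA1 out of domain: } \frac{1 - 0.011}{0.011} \approx 89.91
$$

$$
\text{Relative depletion outside domains: } \frac{0.24}{0.01} = 24
$$

The same ratio for ATM, using the rounded proportions 0.23 and 0.01, is 23, consistent with the approximately 24-fold difference quoted for both genes.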
When this analysis was extended to consider variants located within the different individual functional domains for a given protein (Table S3), the effective enrichment of pathogenic missense variants was found to be greater for some domains compared to others. Examples of strongly enriched domains included the RING domain of BRCA1, the phosphatidylinositol 3-kinase/phosphatidylinositol 4-kinase domain of ATM and the FHA domain of CHEK2. Despite the negative result overall, there was some evidence of pathogenic missense variants in the BRCA2 PALB2-binding domain (PV MLE 8%). However, the low number of variants at the level of individual domains means that these results should be interpreted cautiously.
4 DISCUSSION
Odds ratios derived from large case–control studies are used to estimate the magnitude of the risk of disease associated with variants in a gene. In most studies, the calculation uses the combined frequency of all presumed PVs in a gene, with the assumption that the magnitude of risk is essentially the same for the whole group. This is a reasonable assumption in the case of predicted protein termination codon and canonical splice-site variants that are predominantly expected to produce transcripts subject to nonsense-mediated decay and be effectively null. For most other types of variants, an odds ratio derived from case–control data is likely to reflect a mixture of different proportions of truly pathogenic and benign variants.
Here we describe a simple approach for directly estimating from case–control data the proportion of PVs of typical risk effect that are present in any subset of variant observations in a given category. This approach depends on two assumptions: (i) that the risk associated with a typical PV is accurately characterized for a particular gene and (ii) that true PVs in the gene are all associated with essentially the same risk of the condition, independent of variant type or location. For genes where loss of function is an established mechanism of disease, predicted protein termination codon variants provide a standard to fulfill the first criterion. The second assumption, a binary distinction between fully pathogenic and completely benign variants, is one embedded in approaches to variant classification generally and is a feature of many of the current ACMG/AMP criteria.
We used the MLE approach to address several important areas of uncertainty in the interpretation of the clinical significance of variants in breast cancer predisposition genes and report results with important implications for germline variant classification using gene-specific ACMG/AMP criteria. First, estimates indicate that ~10% of rare noncoding variants identified in a familial cohort ascertained based on BRCA-like features have a pathogenic effect, across the genes BRCA1, BRCA2, and PALB2, principally located in intronic regions outside the expected acceptor and donor splice sites. The results point to an important component of unexplained familial breast cancer risk arising from regions of the genes that are not currently included in standard diagnostic testing. Notwithstanding these estimates, it remains true that the majority of the rare noncoding variants were predicted to be benign, with LRs equivalent to a moderate level of evidence against pathogenicity. Given that the estimates from this highly ascertained cohort represent a plausible upper limit, these findings justify expansion of the original BP7 criterion (silent variant with no prediction to alter splicing) to capture other noncoding variant types. There remains the potential to explore the use of additional bioinformatic analyses to allow for further subdivision of this large group of variants and refinement of the criterion weight.
Second, estimates of the proportion of pathogenic missense variants, overall or by functional domain location, emphasized the degree to which this differs for individual genes. A large proportion of missense changes in CHEK2 are estimated to be associated with a similar level of risk as the average protein termination codon variant (0.32–0.90, dependent on cohort and ascertainment), but enrichment within functional domains is insufficient to allow use of this information as evidence towards pathogenicity. Around 50% of ATM rare missense variants were estimated to be pathogenic in familial cases and location in a functional domain significantly predicted pathogenicity in familial and population-based cases combined. The analysis indicated a much lower proportion of pathogenic missense variants overall in BRCA1, BRCA2, and PALB2, but location in a domain remained a predictor towards variant pathogenicity in BRCA1. Missense variants also appear to be a feature of the spectrum of PVs in BARD1, RAD51D, and possibly RAD51C, although the lower number of variants resulted in wide confidence intervals.
Importantly, the variation in the findings for individual genes, all implicated in breast and/or ovarian cancer risk, suggests that classification using known functional domains is not a stable feature that can be applied at the same strength in all instances. This highlights the need to calibrate the PM1 code for individual genes, consistent with the general recommendations of Harrison et al. (2019) on approaches for informing PM1 (and PP2) code weight.
Further, as shown by results for BRCA1, BRCA2, PALB2, and ATM, such analyses can provide rationale to apply a weighted form of BP1 (missense variant in a gene for which truncating variants are known to cause disease) for missense or in-frame protein alterations, having stipulated that these are located outside a known functional domain and are not predicted or known to impact splicing.
The key limitation of the MLE approach remains one of statistical power. This is illustrated by the loss of precision for the estimate of pathogenic missense variants in rarely implicated genes such as RAD51C, despite the unique scale of the BRIDGES data set. If data on the individual sequence variants are not available, the calculation cannot determine the relative contribution of any recurrently observed PVs to the overall estimate. It is also important to be aware that this approach can perpetuate any limitations that are already present in the case–control data being analyzed, and this should be considered before applying the MLE approach. Any differences in the frequency of variants in cases and controls that are due to poor matching between the groups will result in a spurious estimate of the proportion of PVs. This may be due to technical differences (e.g., when applying control data from public databases rather than sequencing cases and controls together) or due to subtle differences in demographics or population stratification between groups. The impact of technical or demographic differences between studies can be minimized if the “typical” risk is determined from the same data set. Of note, differences that genuinely affect the proportion of PVs in cases versus controls, such as ascertainment criteria, will be accurately reflected in the estimate of PVs, which is a potential strength of this approach. Our analysis illustrated the major effect that ascertainment has on the genetic composition of cohorts, with the estimated proportion of PVs derived from studies with familial ascertainment severalfold higher than that from population-based studies. This is relevant when considering estimates of pathogenic noncoding variants derived from the BEACCON study with strong familial ascertainment.
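The power limitation can be illustrated with a rough approximation of the interval width. The sketch below uses a normal approximation to the binomial rather than the exact binomial interval described in the Methods, and it assumes the mixture form in which the expected case fraction is linear in the PV proportion; the function name is hypothetical and the counts come from the familial rows of Table 2.

```python
from math import sqrt


def approx_ci_half_width(v_case, v_control, n_cases, n_controls, odds_ratio, z=1.96):
    """Approximate 95% half-width of the PV proportion (normal approximation)."""
    p_bv = n_cases / (n_cases + n_controls)
    p_pv = odds_ratio * n_cases / (odds_ratio * n_cases + n_controls)
    v_total = v_case + v_control
    fraction = v_case / v_total
    # Standard error of the observed case fraction among variant carriers.
    se_fraction = sqrt(fraction * (1.0 - fraction) / v_total)
    # Rescale by the gap between the pathogenic and benign case probabilities.
    return z * se_fraction / (p_pv - p_bv)


# BRIDGES familial subgroup: 9408 cases and 43,451 controls (Table 2 footnote a).
rad51c = approx_ci_half_width(45, 182, 9408, 43451, odds_ratio=1.93)
brca1 = approx_ci_half_width(276, 1099, 9408, 43451, odds_ratio=10.57)
print(f"RAD51C half-width ~ {rad51c:.2f}, BRCA1 half-width ~ {brca1:.2f}")
```

Two factors widen the interval for RAD51C relative to BRCA1: fewer variant observations, which inflates the standard error of the case fraction, and a smaller odds ratio, which narrows the gap between the case probabilities of pathogenic and benign variants.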
In summary, analysis of case–control data using an uncomplicated MLE approach has utility to identify subgroups of variants worthy of further investigation as contributors to disease and to inform specifications of ACMG/AMP codes for different genes.
ACKNOWLEDGMENTS
The authors gratefully acknowledge support from the National Breast Cancer Foundation (IF-15-004, Paul A. James and Ian G. Campbell), Cancer Australia/National Breast Cancer Foundation (PdCCRS_1107870, Paul A. James and Ian G. Campbell), and the National Health and Medical Research Council of Australia (GNT1023698, Paul A. James; GNT1071779 and GNT1177524, Amanda B. Spurdle; GNT116589, Cristina Fortuno; GNT1041975, Ian G. Campbell).
CONFLICTS OF INTEREST
The authors declare no conflicts of interest.
Open Research
DATA AVAILABILITY STATEMENT
Further description of the BEACCON study data can be found in the data record: https://doi.org/10.6084/m9.figshare.1443945548. The full sequencing data are available through the European Genotype-phenotype Archive under the following accession: https://identifiers.org/ega.dataset:EGAD00001007025 (study ID: EGAS00001005043). Summary statistics for the BRIDGES study used in this analysis are available in the original publication, https://doi.org/10.1056/NEJMoa1913948, or through the BCAC Data Access Co-ordinating Committee. All data analyzed in this manuscript are publicly available as described in the text.