INTEGRAL-ILCCO cohort data analysis revealed the association of clonal haematopoiesis with an increased risk of lung cancer
Abstract
To investigate the association between clonal haematopoiesis (CH) and lung cancer risk, we identified CH mutations in 1 059 lung cancer cases and 899 controls using the blood whole-exome sequencing data generated from the Integrative Analysis of Lung Cancer Etiology and Risk project of the International Lung Cancer Consortium (INTEGRAL-ILCCO). Based on the variant allele frequency (VAF) of these mutations, we stratified CH carriers into two groups: low VAF (1%–10%) and high VAF (≥10%). We observed a significant association between the presence of CH mutations and the risk of lung cancer after adjusting for known risk factors (odds ratio, OR = 1.37; 95% confidence interval, CI = 1.02–1.85). This association was largely driven by high-VAF CH mutations: the ORs for high-VAF CH and low-VAF CH were 2.54 (1.38–4.93) and 1.14 (0.82–1.6), respectively. Trend analysis indicated a significant dose–response relationship (P trend = 0.004). The association between high-VAF CH and lung cancer risk remained consistent when subjects were stratified by risk factors or lung cancer histological subtypes. Combining the results from the INTEGRAL-ILCCO, UKBB, and MGBB cohorts yielded a meta-analysed OR of 1.36 (95% CI = 1.14–1.62) for all CH carriers and 1.76 (95% CI = 1.34–2.31) for high-VAF CH carriers. In conclusion, our analysis revealed a significant association between CH and increased risk of lung cancer, supported by three independent cohorts.
1 INTRODUCTION
Clonal haematopoiesis (CH) is the asymptomatic expansion of blood cells descended from a single mutated haematopoietic stem cell.1 As CH is an abnormality of blood cells, its association with an increased risk of haematologic cancer is well established.2 Interestingly, a link between CH and other cancers has also been observed.3 Recently, Tian et al. analysed the whole exome sequencing (WES) data from the UK Biobank (UKBB)4 and the Mass General Brigham Biobank (MGBB) cohorts5 and reported a significant association between CH mutations and lung cancer risk.5 The association was primarily driven by the expansion of clones harbouring CH mutations with high variant allele frequency (VAF). In the UKBB data, after adjusting for age, sex, race and smoking status, subjects with high-VAF CH mutations (VAF ≥ 10%) had an odds ratio (OR) of 1.77, compared to an OR of 1.28 for subjects with low-VAF mutations (2%–10%). A similar result was also observed in the MGBB data.
In this study, we identified all CH mutations using the WES data generated by INTEGRAL-ILCCO, the largest cohort for lung cancer genetic studies. We stratified CH carriers based on the VAF of their mutations and performed risk association analysis. We observed a significant association between the presence of CH mutations and the risk of lung cancer after adjusting for other known risk factors. This association was largely driven by high-VAF CH mutations, with an odds ratio (OR) of 2.54. The association between high-VAF CH and lung cancer risk remained consistent when subjects were stratified by risk factors or lung cancer histological subtypes. Our study provides independent evidence supporting the link between CH and increased lung cancer risk.
2 METHODS
2.1 Human subjects and data processing
INTEGRAL-ILCCO is a large cohort for studying lung cancer genetics, which has generated different types of genetic and genomic data.6, 7 This cohort includes samples from both lung cancer cases and controls, which were selected by matching geographic distribution and age groups. In this study, we selected from this cohort 1 059 lung cancer cases and 899 controls with WES data (Table 1). None of these subjects had previously been diagnosed with haematologic disorders. For these subjects, the cohort has collected detailed clinical information, including age at lung cancer diagnosis, sex, race, smoking history, and family history of lung cancer (FHLC). Compared with controls, lung cancer cases were more likely to be smokers and to have FHLC. The genotyping data for these subjects were generated using the high-density Human610-Quad BeadChip platform.8 Using these data, we calculated the first three principal components (PCs) of the subjects’ genetic ancestry.7
Characteristics | Cases (n = 1059) | Controls (n = 899) | P-value |
---|---|---|---|
Age, years, mean ± SD | 62.2 ± 12.3 | 60.8 ± 11.8 | n.s. |
Male, No. (%) | 436 (41.2) | 378 (42.0) | n.s. |
Race, No. (%) | | | |
White | 922 (87.1) | 843 (93.8) | |
Others | 53 (5.0) | 31 (3.4) | n.s. |
Smoking status, No. (%) | | | |
Never | 125 (11.8) | 303 (33.7) | |
Past | 429 (40.5) | 380 (42.3) | |
Current | 487 (46.0) | 199 (22.1) | <0.001 |
Family history of lung cancer, No. (%) | 505 (47.7) | 71 (7.9) | <0.001 |
Lung cancer subtype, No. (%) | | | |
Adenocarcinoma | 460 (43.4) | | |
Squamous cell carcinoma | 339 (32.0) | | |
Others | 247 (23.3) | | |
CH status | | | |
CH carriers, No. (%) | 153 (14.4) | 93 (10.3) | 0.008 |
1 CH mutation | 146 (13.8) | 88 (9.8) | |
≥ 2 CH mutations | 7 (0.7) | 5 (0.6) | n.s. |
Low-VAF (1%–10%) | 111 (10.5) | 76 (8.5) | |
High-VAF (≥ 10%) | 42 (4.0) | 17 (1.9) | n.s. |
- Note: Compared with controls, lung cancer cases were more likely to be smokers and more likely to have FHLC (p < 0.01, Fisher's exact test).
- Abbreviations: CH, clonal haematopoiesis; VAF, variant allele frequency.
The WES data were generated from blood samples collected at the time of lung cancer diagnosis.6 All blood samples were collected before anti-cancer treatment, which excludes the potential confounding effects of cancer treatment on CH.9 DNA extracted from peripheral white blood cells was captured with the Agilent SureSelect v5 kit. Paired-end sequencing with a read length of 125 bp was performed on the DNA library, with an average coverage of 97x.
2.2 Identification of clonal haematopoiesis mutations
To identify CH mutations, we focussed on 34 established CH genes according to previous studies,9, 10 including ASXL1, CBL, DNMT3A, GNAS, JAK2, NRAS, SF3B1, TP53, U2AF1, BCOR, PPM1D, TET2, IDH1, IDH2, SRSF2, RUNX1, SH2B3, ZRSR2, STAT3, KRAS, MYD88, ATM, CALR, CEBPA, ETV6, EZH2, FLT3, KIT, MPL, NPM1, STAG2, WT1, SETD2 and CREBBP. Sequence reads were mapped to the human reference GRCh37/hg19 with the Burrows–Wheeler Aligner using default parameters.11 Mutations were called with the SAMtools mpileup program.12 Reads with a base quality score ≤ 30 at the mutation site were excluded. To further improve detection accuracy, we applied a binomial error model to identify and exclude sequencing errors.10 We excluded synonymous single nucleotide variations (SNVs) and considered the following mutation types: nonsynonymous SNV, frameshift deletion/insertion, non-frameshift deletion/insertion, and stop gain/loss.
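The binomial error model is not specified further in the text, so the following is only a minimal sketch of the general idea: a candidate is kept when its alternate-read count is unlikely to arise from sequencing error alone. The per-base error rate, the significance cutoff, and both function names are illustrative assumptions, not values or code from the study.

```python
import math


def binomial_upper_tail(alt_reads: int, depth: int, error_rate: float) -> float:
    """Return P(X >= alt_reads) for X ~ Binomial(depth, error_rate)."""
    if alt_reads <= 0:
        return 1.0
    log_err = math.log(error_rate)
    log_ok = math.log1p(-error_rate)
    total = 0.0
    for k in range(alt_reads, depth + 1):
        # Work in log space so large depths do not overflow.
        log_pmf = (
            math.lgamma(depth + 1)
            - math.lgamma(k + 1)
            - math.lgamma(depth - k + 1)
            + k * log_err
            + (depth - k) * log_ok
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def unlikely_sequencing_error(
    alt_reads: int,
    depth: int,
    error_rate: float = 0.001,  # assumed per-base error rate (hypothetical)
    alpha: float = 1e-3,        # assumed significance cutoff (hypothetical)
) -> bool:
    """True when the alternate reads are hard to explain by error alone."""
    return binomial_upper_tail(alt_reads, depth, error_rate) < alpha


# Five alternate reads out of 100 are very unlikely at a 0.1% error rate.
print(unlikely_sequencing_error(alt_reads=5, depth=100))
```

In practice the error rate would be estimated from the data, for example per site or per substitution type, rather than fixed as it is here.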
After identifying all candidate SNVs, we used the following criteria to select CH mutations in these 34 genes: (1) covered by at least 20 uniquely mapped reads, (2) not listed as a known genetic variant in dbSNP v151,13 (3) reported as a cancer mutation in the Catalogue of Somatic Mutations in Cancer (COSMIC) version 92,14 and (4) non-synonymous with a VAF between 1% and 35%. With these criteria, we obtained a list of candidate CH mutations. Furthermore, we referred to a list of CH mutations curated from nine independent studies.2, 9, 10, 15-21 Specifically, we selected as final CH mutations those that were identified in at least three studies (hot spot mutations), or in at least one study and supported by ≥ 3 mutated reads.
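The selection criteria above can be sketched as two predicates. The `Candidate` record, its field names, and the choice of inclusive VAF bounds are assumptions made for illustration; in a real pipeline the dbSNP and COSMIC flags would come from variant annotation.

```python
from dataclasses import dataclass

# The 34 CH genes listed in Section 2.2.
CH_GENES = frozenset({
    "ASXL1", "CBL", "DNMT3A", "GNAS", "JAK2", "NRAS", "SF3B1", "TP53",
    "U2AF1", "BCOR", "PPM1D", "TET2", "IDH1", "IDH2", "SRSF2", "RUNX1",
    "SH2B3", "ZRSR2", "STAT3", "KRAS", "MYD88", "ATM", "CALR", "CEBPA",
    "ETV6", "EZH2", "FLT3", "KIT", "MPL", "NPM1", "STAG2", "WT1",
    "SETD2", "CREBBP",
})


@dataclass
class Candidate:
    """Hypothetical record for one called variant."""
    gene: str
    depth: int                  # uniquely mapped reads covering the site
    alt_reads: int              # reads supporting the mutation
    effect: str                 # e.g. "nonsynonymous SNV", "synonymous SNV"
    in_dbsnp: bool              # listed in dbSNP v151
    in_cosmic: bool             # reported in COSMIC v92
    n_supporting_studies: int   # curated studies reporting this mutation

    @property
    def vaf(self) -> float:
        return self.alt_reads / self.depth if self.depth else 0.0


def is_candidate_ch_mutation(c: Candidate) -> bool:
    """Criteria (1)-(4); inclusive VAF bounds are an assumption."""
    return (
        c.gene in CH_GENES
        and c.depth >= 20
        and not c.in_dbsnp
        and c.in_cosmic
        and c.effect != "synonymous SNV"
        and 0.01 <= c.vaf <= 0.35
    )


def is_final_ch_mutation(c: Candidate) -> bool:
    """Hot spot (>= 3 studies), or >= 1 study with >= 3 mutated reads."""
    if not is_candidate_ch_mutation(c):
        return False
    hot_spot = c.n_supporting_studies >= 3
    supported = c.n_supporting_studies >= 1 and c.alt_reads >= 3
    return hot_spot or supported


example = Candidate("DNMT3A", 100, 12, "nonsynonymous SNV", False, True, 1)
print(is_final_ch_mutation(example))
```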
2.3 Statistical analysis
Based on the presence of CH mutations, we divided all subjects (cases and controls) into CH carriers (n = 246) and non-CH carriers (n = 1 712). The CH carriers were further stratified into a low-VAF group (VAF = 1%–10%, n = 187) and a high-VAF group (VAF ≥ 10%, n = 59). Multivariable logistic regression was used to investigate the association between the presence of CH mutations and lung cancer risk while adjusting for known confounding risk variables, including age, sex, race, smoking status, FHLC and the first three PCs of genetic ancestry. Stratified analyses were performed by lung cancer subtype, age, sex and smoking status. The P-values for interactions between CH status and other risk factors were estimated using the Wald test. The P-values for trend were calculated by treating CH status as an ordinal variable with levels non-CH < low-VAF < high-VAF in the logistic model. The P-value for heterogeneity between lung cancer subtypes was estimated using polytomous logistic regression.
A meta-analysis was performed to combine the results from our study (the INTEGRAL-ILCCO cohort) and the results reported by Tian et al. (the UKBB and the MGBB cohorts). Specifically, the generic inverse variance method was applied to estimate a meta-analysed OR and 95% CI using a fixed-effects model.22 For the INTEGRAL-ILCCO cohort, we used the results adjusted by age, sex, race and smoking status. Meta-analysis was implemented by using the metagen function in the R package “meta.” The P-value for heterogeneity was estimated using Cochran's Q-test. All the statistical analyses and tests were implemented in the R platform (version 3.6.3).
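As a rough illustration of the generic inverse variance method under a fixed-effects model, the sketch below recovers each study's log OR and standard error from its published OR and 95% CI, then pools them with weights of 1/SE². It is a hand-written stand-in for what `metagen` computes, not the study's code. Only the two high-VAF estimates quoted in this text are used (INTEGRAL-ILCCO and UKBB); the MGBB estimate is not given here, so the pooled value will not reproduce the three-cohort result.

```python
import math

Z_95 = 1.96  # normal quantile for a 95% confidence interval


def log_or_and_se(odds_ratio, ci_low, ci_high):
    """Recover log(OR) and its standard error from an OR and 95% CI."""
    log_or = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    return log_or, se


def fixed_effect_meta(studies):
    """Pool (OR, ci_low, ci_high) tuples by inverse-variance weighting."""
    pairs = [log_or_and_se(*s) for s in studies]
    weights = [1.0 / se ** 2 for _, se in pairs]
    total_weight = sum(weights)
    pooled = sum(w * y for w, (y, _) in zip(weights, pairs)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Cochran's Q statistic for heterogeneity (df = number of studies - 1).
    q = sum(w * (y - pooled) ** 2 for w, (y, _) in zip(weights, pairs))
    return {
        "or": math.exp(pooled),
        "ci_low": math.exp(pooled - Z_95 * pooled_se),
        "ci_high": math.exp(pooled + Z_95 * pooled_se),
        "cochran_q": q,
    }


# High-VAF estimates quoted in this text; MGBB is omitted (not given here).
high_vaf_studies = [
    (2.54, 1.38, 4.93),  # INTEGRAL-ILCCO, adjusted for age, sex, race, smoking
    (1.77, 1.27, 2.46),  # UKBB, as reported by Tian et al.
]
result = fixed_effect_meta(high_vaf_studies)
print({key: round(value, 2) for key, value in result.items()})
```

Because the weights are inverse variances, the pooled estimate sits closer to the study with the narrower confidence interval.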
3 RESULTS
Out of all 1 958 subjects, we identified a total of 258 CH mutations from 246 (13%) subjects in the 34 leukaemia/lymphoma-related genes. Of the 246 CH carriers, only 12 (5%) had multiple CH mutations, while the others had a single CH mutation. The numbers of CH mutations detected in the 34 genes varied dramatically, with over 40% of mutations identified in the top three genes, DNMT3A (n = 46), TET2 (n = 37) and ATM (n = 21) (Figure S1). Tian et al. reported that the association between CH and lung cancer risk was largely driven by mutations with high VAF. Thus, we stratified CH carriers into a high-VAF and a low-VAF group with 59 and 187 subjects, respectively. The high-VAF carriers were defined as those harbouring at least one CH mutation with a VAF ≥ 10%. In contrast, all the CH mutations detected in low-VAF carriers had a VAF between 1% and 10%. The remaining 1 712 subjects without detected CH mutations were denoted as CH-negative subjects.
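The grouping rule described above can be written as a small helper. This is a sketch only: the function name, the group labels, and the integer coding used for the trend test are illustrative choices, and VAFs are assumed to be expressed as fractions.

```python
def classify_carrier(vafs, high_cutoff=0.10):
    """Assign a subject to a CH group from the VAFs of their CH mutations.

    A subject is high-VAF if at least one mutation has VAF >= 10%,
    low-VAF if all detected mutations are below that cutoff, and
    CH-negative if no CH mutation was detected.
    """
    if not vafs:
        return "CH-negative"
    return "high-VAF" if max(vafs) >= high_cutoff else "low-VAF"


# Ordinal coding (CH-negative < low-VAF < high-VAF) as used for a trend test.
ORDINAL_LEVEL = {"CH-negative": 0, "low-VAF": 1, "high-VAF": 2}

subjects = {"S1": [], "S2": [0.03], "S3": [0.02, 0.15]}
groups = {sid: classify_carrier(v) for sid, v in subjects.items()}
print(groups)
```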
Without distinguishing high- versus low-VAF CH mutations, we observed a significant but modest association between CH status and the risk of lung cancer (OR = 1.46, CI = 1.11–1.93, Table 2). Consistent with the report from Tian et al., our analysis also indicated that the association is mainly driven by the high-VAF mutations. The presence of high-VAF mutations was associated with a 2.2-fold increase in lung cancer risk (OR = 2.20, CI = 1.26–3.99) relative to CH-negative subjects, while the association for the low-VAF carriers was not significant (OR = 1.30, CI = 0.96–1.77). The association between high-VAF CH and lung cancer risk remained significant after adjusting for age, sex, race, and smoking status (OR = 2.54, CI = 1.38–4.93), and after further adjusting for FHLC and the top three PCs of genetic ancestry (OR = 4.05, CI = 1.43–14.7). Despite the dominant effect of high-VAF mutations, the low-VAF carriers also tended to have a higher lung cancer risk than CH-negative subjects. Indeed, trend analysis indicated a dose–response relationship, with CH mutations of higher VAF being associated with a greater increase in lung cancer risk (Table 2).
Model and statistic | CH carriers, all CH mutations | CH carriers, low-VAF CH mutations | CH carriers, high-VAF CH mutations | P trendᵃ |
---|---|---|---|---|
All participants | | | | |
No adjustment (cases/controls = 1059/899) | | | | |
No. of CH samples in cases (%)/controls (%) | 153 (14.45%)/93 (10.34%) | 111 (10.48%)/76 (8.45%) | 42 (3.97%)/17 (1.89%) | |
OR (95% CI)ᵇ | 1.46 (1.11–1.93) | 1.30 (0.96–1.77) | 2.20 (1.26–3.99) | 0.007 |
P value | 0.0066 | 0.094 | 0.0069 | |
Adjusted for age, sex, race and smoking status (cases/controls = 957/857) | | | | |
No. of CH samples in cases (%)/controls (%) | 131 (13.7%)/88 (10.3%) | 93 (9.7%)/73 (8.5%) | 38 (4%)/15 (1.8%) | |
Adjusted OR (95% CI) | 1.37 (1.02–1.85) | 1.14 (0.82–1.6) | 2.54 (1.38–4.93) | 0.004 |
P value | 0.041 | 0.44 | 0.0038 | |
Additionally adjusted for FHLC and the first 3 PCs of genetic ancestry (cases/controls = 778/352)ᶜ | | | | |
No. of CH samples in cases (%)/controls (%) | 116 (14.91%)/39 (11.08%) | 82 (10.54%)/35 (9.94%) | 34 (4.37%)/4 (1.14%) | |
Adjusted OR (95% CI) | 1.38 (0.9–2.16) | 1.11 (0.7–1.79) | 4.05 (1.43–14.7) | 0.018 |
P value | 0.14 | 0.67 | 0.016 | |
- Abbreviations: CH, clonal haematopoiesis; CI, confidence interval; INTEGRAL-ILCCO, Integrative Analysis of Lung Cancer Etiology and Risk project of the International Lung Cancer Consortium; OR, odds ratio; VAF, variant allele frequency.
- a Linear trend test of ordinal logistic regression.
- b Odds ratio and 95% confidence interval (lower 2.5%–upper 97.5%).
- c FHLC: family history of lung cancer. PC: principal component.
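The unadjusted odds ratios in Table 2 can be checked directly from the carrier counts, taking CH-negative subjects (906 cases, 806 controls) as the reference group. The sketch below uses a Wald confidence interval, which is an assumption made for simplicity; it gives intervals close to, but not necessarily identical to, the published ones. The function name is illustrative.

```python
import math


def odds_ratio_wald(case_exposed, case_ref, ctrl_exposed, ctrl_ref, z=1.96):
    """Odds ratio with a Wald 95% CI from a 2x2 table of counts."""
    odds_ratio = (case_exposed * ctrl_ref) / (case_ref * ctrl_exposed)
    se = math.sqrt(
        1 / case_exposed + 1 / case_ref + 1 / ctrl_exposed + 1 / ctrl_ref
    )
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


# CH-negative reference counts: 1059 - 153 cases and 899 - 93 controls.
CASE_REF, CTRL_REF = 1059 - 153, 899 - 93

or_all = odds_ratio_wald(153, CASE_REF, 93, CTRL_REF)
or_low = odds_ratio_wald(111, CASE_REF, 76, CTRL_REF)
or_high = odds_ratio_wald(42, CASE_REF, 17, CTRL_REF)

for label, (est, low, high) in [
    ("all CH", or_all), ("low-VAF", or_low), ("high-VAF", or_high)
]:
    print(f"{label}: OR = {est:.2f} (Wald 95% CI {low:.2f}-{high:.2f})")
```

The point estimates round to the unadjusted ORs in the table (1.46, 1.30 and 2.20); the adjusted ORs require fitting the multivariable logistic model and cannot be recovered from counts alone.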
Additionally, we stratified samples by clinical factors (sex, age, smoking status) and lung cancer subtypes to test whether the association between high-VAF CH and lung cancer risk held within each stratum (Figure 1 and Table S1). As shown in Figure 1, after adjusting for other risk factors, a significant association between the presence of high-VAF mutations and increased lung cancer risk was detected in older subjects (age ≥ 65, OR = 2.85, CI = 1.30–6.82) but not in younger subjects (age < 65, OR = 2.20, CI = 0.83–6.57). A significant association was detected in female subjects (OR = 3.33, CI = 1.26–10.49). Male subjects showed a similar but weaker association (OR = 2.18, CI = 1.00–5.13). As for smoking status, a significant association was detected in ever-smokers (OR = 3.40, CI = 1.59–8.43) but not in never-smokers (OR = 1.53, CI = 0.39–5.11). These variations are likely due mostly to differences in statistical power, since no significant interactions were detected between CH and these risk factors (all P interaction > 0.05, Figure 1).

In addition, we examined the association of high-VAF CH with lung cancer risk in different lung cancer histological subtypes. As shown in Figure 1, a significant association was observed for lung adenocarcinoma (OR = 2.55, CI = 1.22–5.39) and other lung cancers (OR = 3.30, CI = 1.37–7.67). For squamous cell carcinoma, the estimate was similar in magnitude but did not reach significance (OR = 2.39, CI = 0.97–5.84).
Finally, we performed a meta-analysis to combine the results from the INTEGRAL-ILCCO cohort with those from the UKBB and MGBB cohorts reported by Tian et al.4 Under a fixed-effects model, the meta-analysed OR was 1.36 (CI = 1.14–1.62) for all CH and 1.76 (CI = 1.34–2.31) for high-VAF CH (Figure 2 and Figure S2). No heterogeneity between the three datasets was identified (all P heterogeneity > 0.05). In addition, we performed meta-analyses separately for never-smokers and ever-smokers (current and past smokers). For ever-smokers, the meta-analysis revealed significant associations with increased lung cancer risk for all CH carriers (OR = 1.36, CI = 1.11–1.67) and for high-VAF carriers (OR = 1.65, CI = 1.18–2.30) (Figure S3). In contrast, for never-smokers, a significant association was observed only for high-VAF carriers (OR = 2.06, CI = 1.26–3.36).

4 DISCUSSION
Owing to study design and other reasons, the cohort selected in this study from INTEGRAL-ILCCO differs from the two cohorts investigated by Tian et al. in the following ways. First, in our cohort, the blood samples used for WES were collected at the time of lung cancer diagnosis, whereas Tian et al. selected subjects from UKBB whose blood samples were collected at least three years before cancer diagnosis. As such, the UKBB cohort provides evidence of a causal relationship between CH and increased lung cancer risk. Second, the cases in our cohort have a high rate of family history of lung cancer (58.4%), compared to 20.7% in the UKBB and 11.3% in the MGBB. Third, although we have a similar number of lung cancer cases, the number of controls in our sub-cohort is smaller than that in the UKBB cohort. In addition, our study used a panel of well-defined CH genes to identify CH mutations, while Tian et al. focussed on a list of cancer driver mutations in 178 genes.5 Despite these differences, our analysis in the INTEGRAL-ILCCO cohort confirmed the primary finding by Tian et al.: the presence of CH mutations is associated with an increased risk of lung cancer, and this association is mainly driven by high-VAF mutations. The association between CH status and lung cancer risk is independent of other risk factors and disease subtypes.
Our analysis in the INTEGRAL-ILCCO cohort revealed a higher OR for high-VAF CH than the OR reported by Tian et al. in UKBB (2.38 [1.35–4.39] vs. 1.77 [1.27–2.46]). This difference might be explained by the facts that (1) our cohort had a higher rate of family history of lung cancer than UKBB (58.4% vs. 20.7%), and/or (2) the blood samples were collected at the time of lung cancer diagnosis in our cohort but at least 3 years before cancer diagnosis for subjects selected from UKBB.
In summary, consistent with the previous report by Tian et al., our analysis using the INTEGRAL-ILCCO cohort data indicated that the presence of CH mutations, especially those with high VAF, was associated with a significantly increased risk of lung cancer (Figure 3).

AUTHOR CONTRIBUTIONS
Chao Cheng and Christopher I. Amos designed and supervised the study. Chao Cheng and Wei Hong performed data analyses and interpreted the results. Chao Cheng, Christopher I. Amos and Wei Hong contributed to writing the manuscript.
ACKNOWLEDGMENTS
We thank the Dan L Duncan Comprehensive Cancer Center at Baylor College of Medicine for High Performance Computing (HPC) support. We thank the Cheng Lab members for helpful discussions and suggestions. This study was supported by the Cancer Prevention Research Institute of Texas (CPRIT) (RR180061 to CC and RR170048 to CA) and the National Cancer Institute of the National Institutes of Health (1R01CA269764 to CC). Chao Cheng and Christopher I Amos are CPRIT Scholars in Cancer Research.
CONFLICTS OF INTEREST STATEMENT
The authors declare no conflicts of interest.
ETHICAL APPROVAL
Not applicable.
Open Research
DATA AVAILABILITY STATEMENT
Whole Exome Sequencing data of all subjects in this study are available from the database of Genotypes and Phenotypes (dbGaP) with Accession ID phs000876.v2.p1. Confidential access to the data can be requested at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000876.v2.p1.