Volume 24, Issue 5 pp. 1056-1065
ORIGINAL ARTICLE

Using DNA methylation to validate an electronic medical record phenotype for smoking

Kathleen A. McGinnis (Corresponding Author)
Veterans Affairs Connecticut Healthcare System, West Haven, CT, USA
Correspondence to: Kathleen A McGinnis, DrPH, MS, VA Connecticut Healthcare System, 950 Campbell Avenue, Building 35a 2 Floor (11-ACSLG), West Haven, CT 06516, USA. E-mail: [email protected]

Amy C. Justice
Veterans Affairs Connecticut Healthcare System, West Haven, CT, USA
Yale School of Medicine, New Haven, CT, USA
Yale School of Public Health, New Haven, CT, USA

Janet P. Tate
Veterans Affairs Connecticut Healthcare System, West Haven, CT, USA
Yale School of Medicine, New Haven, CT, USA

Henry R. Kranzler
VISN 4 MIRECC, Crescenz VAMC, Philadelphia, PA, USA
University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA

Hilary A. Tindle
Vanderbilt University Medical Center, Nashville, TN, USA
Geriatric Research Education and Clinical Centers (GRECC), Veterans Affairs Tennessee Valley Healthcare System, Nashville, TN, USA

William C. Becker
Veterans Affairs Connecticut Healthcare System, West Haven, CT, USA
Yale School of Medicine, New Haven, CT, USA

John Concato
Veterans Affairs Connecticut Healthcare System, West Haven, CT, USA
Yale School of Medicine, New Haven, CT, USA

Joel Gelernter
Veterans Affairs Connecticut Healthcare System, West Haven, CT, USA
Yale School of Medicine, New Haven, CT, USA

Boyang Li
Yale School of Medicine, New Haven, CT, USA

Xinyu Zhang
Yale School of Medicine, New Haven, CT, USA

Hongyu Zhao
Yale School of Medicine, New Haven, CT, USA
Yale School of Public Health, New Haven, CT, USA

Kristina Crothers
University of Washington, Seattle, WA, USA

Ke Xu
Veterans Affairs Connecticut Healthcare System, West Haven, CT, USA

For the VACS Project Group
First published: 04 October 2018

Abstract

A validated, scalable approach to characterizing (phenotyping) smoking status is needed to facilitate genetic discovery. Using established DNA methylation sites from blood samples as a criterion standard for smoking behavior, we compare three candidate electronic medical record (EMR) smoking metrics based on longitudinal EMR text notes. With data from the Veterans Aging Cohort Study (VACS), we employed a validated algorithm to translate each smoking-related text note into current, past or never categories. We compared three alternative summary characterizations of smoking: most recent, modal and trajectories using descriptive statistics and Spearman's correlation coefficients. Logistic regression and area under the curve analyses were used to compare the associations of these phenotypes with the DNA methylation sites, cg05575921 and cg03636183, which are known to have strong associations with current smoking. DNA methylation data were available from the VACS Biomarker Cohort (VACS-BC), a sub-study of VACS. We also considered whether the associations differed by the certainty of trajectory group assignment (<0.80/≥0.80). Among 142 152 VACS participants, EMR summary smoking phenotypes varied in frequency by the metric chosen: current from 33 to 53 percent; past from 16 to 24 percent and never from 24 to 33 percent. The association between the EMR smoking pairs was highest for modal and trajectories (rho = 0.89). Among 728 individuals in the VACS-BC, both DNA methylation sites were associated with all three EMR summary metrics (p < 0.001), but the strongest association with both methylation sites was observed for trajectories (p < 0.001). Longitudinal EMR smoking data support using a summary phenotype, the validity of which is enhanced when data are integrated into statistical trajectories.

Introduction

Twin and family-based studies show that approximately 50 percent of the risk of tobacco dependence is heritable (Goldman et al. 2005). However, the numerous genetic variants that have been linked to smoking behavior explain only a small proportion of the phenotypic variation. A major challenge for gene discovery is phenotypic ambiguity, especially that stemming from self-reported health behaviors such as smoking in which social desirability bias and/or lack of documentation in the health record may obscure actual behavior (Kinsinger et al. 2017). Additional inaccuracy can result from cross-sectional rather than longitudinal approaches to characterizing behavioral phenotypes. Both factors can contribute to inadequate statistical power to detect small individual genetic effects that characterize complex disorders like smoking.

Several biological markers could be employed to validate self-reported smoking data. Cotinine, a major metabolite of nicotine in tobacco smoke, is the most commonly used biomarker. However, cotinine is elevated from smoking for only days to weeks (Jarvis et al. 1998). Expired carbon monoxide, although sensitive to very recent smoking, remains elevated for only up to 8 hours (Sandberg et al. 2011).

DNA methylation is an epigenetic process in which a methyl group is added to a DNA molecule. Environmental factors, such as exposure to cigarette smoke, can result in epigenetic changes (Bollati and Baccarelli 2010), and these changes can serve as biomarkers for long-term environmental exposure (Ladd-Acosta 2015). Epigenome-wide association studies have identified hundreds of DNA methylation sites, referred to as CpG or cg sites, associated with smoking (Breitling et al. 2011; Harlid et al. 2014; Gao et al. 2015; Ambatipudi et al. 2016; Joehanes et al. 2016). The two sites most frequently associated with smoking are cg05575921 in the AHRR gene and cg03636183 in the F2RL3 gene (Gao et al. 2015; Ambatipudi et al. 2016; Lee et al. 2016; Bojesen et al. 2017), and these two CpG sites have been proposed as candidate biomarkers to determine smoking status (Philibert et al. 2013). These sites have been consistently linked to smoking status in multiple population groups (Elliott et al. 2014; Guida et al. 2015; Beach et al. 2017), among both sexes (Dogan et al. 2015), across a wide range of ages (Beach et al. 2015) and different tissues (Stueve et al. 2017) and across different methylation platforms (Breitling et al. 2011; Joehanes et al. 2016).

In a small study, cg05575921 predicted smoking with an area under the curve of 0.99 (Philibert et al. 2015). A recent study in two independent samples showed that cg05575921 is a robust indicator for smoking, as confirmed by serum cotinine concentrations (Andersen et al. 2017). Methylation at cg05575921 and cg03636183 along with other CpG sites has also been shown to predict lung cancer risk (Zhang et al. 2015; Baglietto et al. 2017) and lung cancer incidence with an area under the curve of approximately 0.8 (Zhang et al. 2016).

Smoking-associated methylation persists after smoking cessation (Joehanes et al. 2016). Methylation at cg05575921 takes approximately 10 years (Fasanelli et al. 2015), and methylation at cg03636183 takes approximately 20 years (Zhang et al. 2014) after quitting smoking to return to a level similar to that of never smokers. Among Framingham Heart Study participants, cg03636183 was one of the top 36 most statistically significant methylation sites that did not return to never-smoker levels, even 30 years after smoking cessation (Joehanes et al. 2016). The recovery of methylation at cg05575921 has been suggested as a quantitative marker for smoking cessation (Philibert et al. 2016) and cigarette consumption (Philibert et al. 2018). These findings suggest that DNA methylation changes at cg05575921 and cg03636183 can be reliably detected and have long-term stability as measures of smoking exposure.

The Veterans Health Administration (VHA) benefits from one of the most highly developed health information systems in the world (Corrigan et al. 2002; McQueen et al. 2004). Smoking data from the VHA electronic medical record (EMR) date back to 2000 and have been previously validated against two sources of survey data (McGinnis et al. 2011). An emerging source of large genome-wide association studies data is the Million Veteran Program (MVP) of the Department of Veterans Affairs, which is in the process of obtaining phenotypic and genomic data. One goal is to identify genetic variation contributing to nicotine dependence risk. Thus, the phenotypic refinement effort described here has important implications for achieving this aim. Our goal was to use EMR data to develop and validate a longitudinal phenotype for smoking behavior that could be used to identify novel genetic variants in genome-wide association studies. More specifically, we aimed to determine which EMR summary smoking metric (most recent, modal or trajectory) was most strongly associated with DNA methylation markers (cg05575921 and cg03636183).

Materials and Methods

Subjects

The Veterans Aging Cohort Study (VACS) is a large observational cohort study consisting of data from the national VA EMR that includes all HIV-infected (HIV+) patients (over 53 000) in VA care from October 1996 to September 2015 and more than 111 000 uninfected patients matched on region, age, race/ethnicity, and gender. VACS is described in detail elsewhere (Fultz et al. 2006; Justice et al. 2006). The VACS Biomarker Cohort (VACS-BC) is a subsample of VACS that includes 1525 HIV+ and 843 uninfected individuals who provided a blood sample from 2005 to 2007. The VACS-BC is described in detail elsewhere (Armah et al. 2012; Freiberg et al. 2016; Justice et al. 2012). We analyzed genomic DNA data on a subset of 728 of the VACS-BC participants.

Measures

EMR smoking data/phenotype determination

Smoking data were obtained from the VHA Corporate Data Warehouse. Details on the extraction methods are provided elsewhere (McGinnis et al. 2011). In brief, EMR smoking data are collected from patients nationally and on a yearly basis using the clinical reminder process. EMR smoking data consist of text values that represent responses to specific smoking-related questions asked of patients. These questions can vary by site and over time. Mapping strategies have been created to classify these responses into ‘never,’ ‘past’ and ‘current’ smoking and can be found on www.vacohort.org (McGinnis et al. 2011). For these analyses, we coded never as 0, past as 1 and current as 2. Using all available EMR smoking observations from 2000 to 2015, we created the following smoking phenotypes: (1) the most recent time point available (never, past, current); (2) the modal value, that is, the most common value of all data available (never, past, current); and (3) the longitudinal smoking trajectory. When there was more than one smoking entry for any year of age, we used the highest smoking value for that year, considering current as highest and never as lowest.

Trajectory modeling sorts each participant's smoking values (never, past, current) into ‘clusters’ and estimates distinct trajectories (Marshall et al. 2015). We used age as the time scale to account for possible decreases in smoking with age. The procedure calculates each individual's probability of belonging to each trajectory and assigns the individual to the trajectory with the highest probability of membership. We used a censored normal model (Nagin 2005; Jones & Nagin 2013) and evaluated three-, four- and five-group models with first-, second- and third-order terms. For maximum precision, trajectories were developed in the full VACS sample. Smoking trajectory fit, as measured by the Bayesian information criterion, improved substantially when increasing from three to four and from four to five groups. In the five-group model, however, one of the groups had a mean probability of group membership of only 0.69. In contrast, in the four-group model with second-order terms, the mean probabilities of group membership for the four trajectory groups ranged from 0.81 to 0.99, and the smallest trajectory group contained 23.2 percent of the sample. The four-group model using third-order terms performed similarly in terms of probabilities of group membership and Bayesian information criterion, but the groups were less equally distributed; therefore, the four-group trajectory model with second-order terms was used (Appendix). The smoking trajectory groups were designated as mostly never (0), mix (1), current and past (2) and mostly current (3).
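The two simpler summary phenotypes (most recent and modal) and the rule of keeping the highest value per year of age can be illustrated with a minimal Python sketch. The function names, the tuple layout and the tie-breaking rule for the modal value are assumptions made for illustration; the text does not say how modal ties were resolved, and the trajectory model itself is not reproduced here.

```python
from collections import Counter

# Coding from the text: never = 0, past = 1, current = 2.
NEVER, PAST, CURRENT = 0, 1, 2


def most_recent(observations):
    """Return the smoking code at the latest time point.

    `observations` is a list of (age_in_years, smoking_code) tuples.
    """
    return max(observations, key=lambda obs: obs[0])[1]


def modal(observations):
    """Return the most common smoking code.

    Ties go to the higher code; this tie rule is an assumption,
    because the text does not state one.
    """
    counts = Counter(code for _, code in observations)
    return max(counts, key=lambda code: (counts[code], code))


def yearly_maximum(observations):
    """Collapse to one value per year of age, keeping the highest code."""
    by_year = {}
    for age, code in observations:
        year = int(age)
        by_year[year] = max(code, by_year.get(year, NEVER))
    return sorted(by_year.items())


# Hypothetical participant, not study data: (age at note, smoking code).
example = [
    (50.1, CURRENT),
    (50.7, PAST),
    (51.3, CURRENT),
    (52.0, CURRENT),
    (53.4, PAST),
]

print(most_recent(example))     # code of the note at age 53.4
print(modal(example))           # most frequent code across all notes
print(yearly_maximum(example))  # one (year, code) pair per year of age
```

In this hypothetical record the most recent value is past while the modal value is current, which is the kind of disagreement that a longitudinal summary is meant to resolve.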

DNA methylation

Genomic DNA was extracted from whole blood, and DNA methylation was profiled using an Infinium Illumina HumanMethylation 450 K Beadchip (Illumina, San Diego, CA, USA) at the Yale Center of Genomic Analysis. Array data are deposited in GSE100264. After data quality control and normalization, β values for two CpG sites, cg03636183 and cg05575921, were obtained for each sample and were applied in the subsequent analyses; β values ranged continuously from 0 to 1, representing the level of methylation at each site. Based on prior research, participants were classified as currently smoking when their methylation β value fell below a cutoff. Two cutoffs were examined for each site: <0.80 and <0.70 for cg05575921 (Philibert et al. 2015) and <0.68 and <0.59 for cg03636183. Two cutoffs were used for each analysis to determine whether the results depended on a particular cutoff. These cutoffs were chosen based on the average methylation levels in smoking versus non-smoking groups in other published studies (Fasanelli et al. 2015; Gao et al. 2015; Philibert et al. 2015; Philibert et al. 2018; Zhang et al. 2016).
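As a minimal sketch of the cutoff rule, the Python below flags methylation-defined current smoking for a single β value at each of the two cutoffs per site. The cutoff values come from the text; the dictionary and function names are hypothetical.

```python
# Cutoffs quoted in the text; a beta value below the cutoff is read as
# methylation-defined current smoking.
CUTOFFS = {
    "cg05575921": (0.80, 0.70),
    "cg03636183": (0.68, 0.59),
}


def methylation_current_smoking(site, beta):
    """Return {cutoff: bool} for one beta value at one CpG site."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta values lie between 0 and 1")
    return {cutoff: beta < cutoff for cutoff in CUTOFFS[site]}


# Hypothetical beta value, not study data.
print(methylation_current_smoking("cg05575921", 0.74))
```

A β value between the two cutoffs is classified as current smoking under the looser cutoff only, which is why the analyses were repeated at both thresholds.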

Analyses

Demographic characteristics were summarized for patients in VACS and those in the VACS-BC methylation subset. We compared the following EMR self-reported smoking metric pairs in the VACS data: most recent versus modal, most recent versus trajectory and modal versus trajectory, using cross-tabulations and Spearman's correlation coefficients.
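For ordinal categories, Spearman's correlation can be computed directly from a cross-tabulation by giving every observation in a category the average rank of that category. The sketch below is an illustration in plain Python, not the authors' Stata code; the function names are hypothetical, and the counts are the modal-by-most-recent cells published in Table 2.

```python
from math import sqrt


def midranks(marginal_counts):
    """Average rank shared by all tied observations in each category."""
    ranks, seen = [], 0
    for count in marginal_counts:
        ranks.append(seen + (count + 1) / 2)
        seen += count
    return ranks


def spearman_from_crosstab(table):
    """Spearman's rho for two ordinal variables given as a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    row_rank, col_rank = midranks(row_totals), midranks(col_totals)
    mean_rank = (n + 1) / 2  # mean of ranks 1..n, ties included
    cov = sum(
        count * (row_rank[i] - mean_rank) * (col_rank[j] - mean_rank)
        for i, row in enumerate(table)
        for j, count in enumerate(row)
    )
    var_row = sum(t * (r - mean_rank) ** 2 for t, r in zip(row_totals, row_rank))
    var_col = sum(t * (r - mean_rank) ** 2 for t, r in zip(col_totals, col_rank))
    return cov / sqrt(var_row * var_col)


# Rows: most common (never, past, current); columns: most recent.
modal_by_recent = [
    [37098, 2824, 1222],
    [4690, 19648, 1036],
    [5407, 11067, 59160],
]
rho = spearman_from_crosstab(modal_by_recent)
print(round(rho, 2))
```

The result can be compared with the coefficient reported for this pair in Table 2.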

Among individuals in the VACS-BC with methylation marker data, we identified those with low methylation, which is indicative of current smoking (<0.80 or <0.70 for cg05575921 and <0.68 or <0.59 for cg03636183, as reported above) for each EMR self-reported smoking metric. Chi-square tests were used to determine whether the methylation markers were statistically significantly associated with each self-reported EMR smoking metric. Analyses for cg03636183 were stratified by ancestry (African or European), because methylation of cg03636183 has been shown to vary by population (Dogan et al. 2015; Zhang et al. 2014).
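The chi-square test compares observed cell counts with the counts expected if the EMR metric and the methylation classification were independent. The sketch below computes the Pearson statistic and its degrees of freedom in plain Python; the function name and the example counts are hypothetical, and the p-value lookup is left out.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts, not study data. Rows: EMR never, past, current;
# columns: methylation-defined not current, current.
counts = [[40, 10], [30, 20], [10, 40]]
statistic, dof = chi_square(counts)
print(statistic, dof)
```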

To determine which EMR smoking metric was most strongly associated with the methylation markers, we generated logistic regression models and calculated the corresponding area under the receiver operating characteristic curve (AUROC) and concordance statistics (C-statistics). Linear predictors were generated from each model to use with the roccomp command in Stata to test whether there were statistically significant differences between the AUROCs.
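The C-statistic is the probability that a randomly chosen participant with methylation-defined current smoking has a higher predicted score than a randomly chosen participant without it, with ties counted as one half. The sketch below computes it by direct pairwise comparison in plain Python. It is an illustration only: it does not fit the logistic model and does not reproduce the roccomp test for differences between AUROCs. It assumes the score increases with the likelihood of current smoking, as the ordinal EMR codes do; the function name and data are hypothetical.

```python
def c_statistic(scores, outcomes):
    """Probability that a random case scores higher than a random non-case.

    Ties count one half. `outcomes` holds 1 for methylation-defined current
    smoking and 0 otherwise.
    """
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical data, not study data: EMR codes (0 never, 1 past, 2 current)
# and methylation-defined current smoking (1 yes, 0 no).
emr_codes = [0, 0, 1, 2, 2, 2]
methylation_current = [0, 0, 0, 1, 1, 0]
print(c_statistic(emr_codes, methylation_current))
```

Because the comparison is pairwise, this version is quadratic in sample size, which is acceptable for a cohort of a few hundred participants.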

In a sensitivity analysis, we reran the analysis after limiting the sample to patients who had a probability of ≥0.80 for their assigned smoking trajectory category. A high probability reflects a higher percentage of smoking observations falling into their assigned smoking trajectory category, whereas a low probability indicates a lower percentage of smoking observations falling into their designated smoking trajectory category. We also ran the VACS-BC analysis including the smoking value that was closest in time to collection of the blood sample from which the methylation data were obtained. Because methylation of the AHRR gene reflects cannabis smoking, we re-ran the analysis of cg05575921 including only participants who on the confidential VACS survey reported using marijuana or hashish less than once a month or never. Analyses were run in Stata 14.2.
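The assignment rule and the certainty filter used in the sensitivity analysis can be sketched as follows. The group labels come from the text; the function name, the dictionary layout and the example probabilities are hypothetical.

```python
def assign_trajectory(posterior, threshold=0.80):
    """Pick the most probable group and flag whether assignment is certain.

    `posterior` maps trajectory label -> probability of membership.
    Returns (label, is_certain), where is_certain means the highest
    probability is at least `threshold`.
    """
    group = max(posterior, key=posterior.get)
    return group, posterior[group] >= threshold


# Hypothetical membership probabilities for one participant.
participant = {
    "mostly never": 0.02,
    "mix": 0.25,
    "current and past": 0.69,
    "mostly current": 0.04,
}
print(assign_trajectory(participant))
```

A participant like this one would be assigned to a trajectory in the main analysis but excluded from the sensitivity analysis restricted to probabilities of at least 0.80.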

Results

Of the 142 152 patients in VACS with available smoking data, the mean age was 47 years at enrollment, 31 percent were HIV+, 97 percent were male, 48 percent were African American, 40 percent were white and 12 percent were Hispanic or other. Based on modal smoking, 53 percent were currently smoking, 18 percent smoked in the past and 29 percent never smoked. Of the 728 participants in VACS-BC with methylation data, the mean age was 53 years at the time of the blood draw, 84 percent were HIV+, 97 percent were male, 81 percent were African American, 15 percent were white and 4 percent were Hispanic or other. Based on modal smoking, 68 percent were currently smoking, 12 percent smoked in the past and 20 percent never smoked (Table 1).

Table 1. Characteristics of VACS and VACS-BC with methylation data.

Characteristic                                  VACS (n = 142 152)   VACS-BC (n = 728)
Mean age (SD)                                   47.2 (10.9)          52.8 (7.9)
HIV+                                            31%                  84%
Male                                            97%                  97%
Race/Ethnicity
  African American                              48%                  81%
  White                                         40%                  15%
  Hispanic/Other                                12%                  4%
Marijuana or hashish use
  Never                                         -                    26%
  Not in past year                              -                    45%
  <Once per month                               -                    9%
  ≥Once per month                               -                    14%
  Unknown                                       -                    6%
Smoking variables
  Median number of smoking observations (IQR)   6 (3–9)              15 (8–24)
Self-report from EMR: most recent
  Never (0)                                     33%                  23%
  Past (1)                                      24%                  24%
  Current (2)                                   43%                  53%
Self-report from EMR: most common (modal)
  Never (0)                                     29%                  20%
  Past (1)                                      18%                  12%
  Current (2)                                   53%                  68%
Self-report from EMR: trajectory
  Mostly never (0)                              24%                  14%
  Mix (1)                                       26%                  22%
  Current and past (2)                          16%                  26%
  Mostly current (3)                            33%                  37%
Current smoking based on methylation markers
  cg05575921 < 0.80                             -                    69%
  cg05575921 < 0.70                             -                    54%
  cg03636183 < 0.68                             -                    82%
  cg03636183 < 0.59                             -                    52%

EMR = electronic medical record; VACS = Veterans Aging Cohort Study; VACS-BC = Veterans Aging Cohort Study Biomarker Cohort substudy.

There were differences in how VACS participants were classified using the three self-reported EMR smoking metrics and the agreement differed between each pair (Table 2). The correlations among the three EMR smoking phenotypes were high: 0.80 for most recent and modal, 0.85 for most recent and trajectory and 0.89 for modal and trajectory.

Table 2. Comparison of self-reported EMR smoking metrics in VACS (n = 142 152).

Most common (rows) by most recent (columns). Correlation: 0.80
                        Never (0)   Past (1)   Current (2)
Never (0)               37 098      2824       1222
Past (1)                4690        19 648     1036
Current (2)             5407        11 067     59 160

Trajectories (rows) by most recent (columns). Correlation: 0.85
                        Never (0)   Past (1)   Current (2)
Mostly never (0)        33 133      1428       203
Mix (1)                 10 917      21 959     4228
Current and past (2)    2579        8955       11 335
Mostly current (3)      586         1294       45 651

Trajectories (rows) by most common (columns). Correlation: 0.89
                        Never (0)   Past (1)   Current (2)
Mostly never (0)        34 678      65         1
Mix (1)                 6348        23 845     6911
Current and past (2)    91          1416       21 262
Mostly current (3)      27          45         47 459

EMR = electronic medical record.

In VACS-BC, all EMR smoking metrics were associated with both smoking methylation markers. The percentage of patients identified as currently smoking based on both methylation markers increased monotonically from never to past to current smoking for each self-reported EMR smoking metric (Fig. 1). The gradient was steepest for the trajectory smoking metric for cg05575921. When we included only patients whose probability of trajectory membership was ≥0.80, we found similar patterns (Fig. 1), with only a modest improvement in the association between the EMR smoking measures and methylation markers at the two sites.

Figure 1. (a) Percent with current smoking based on cg05575921 for three self-reported smoking phenotypes of all participants (n = 728) and those with trajectory probability ≥0.80 (n = 583). (b) Percent with current smoking based on cg03636183 for three self-reported smoking phenotypes of those with African ancestry (n = 551) and limited to those with trajectory probability ≥0.80 (n = 440). (c) Percent with current smoking based on cg03636183 for three self-reported smoking phenotypes of those with European ancestry (n = 84) and limited to those with trajectory probability ≥0.80 (n = 64).

Agreement (reflected in higher C-statistics) was greater for cg05575921 than for cg03636183 for all EMR smoking metrics (Fig. 2). Limiting the sample to patients with ≥0.80 probability of trajectory assignment resulted in higher C-statistics for all comparisons, but did not alter the relative comparisons among metrics. In all comparisons, the smoking trajectory metric had the highest C-statistics (from 0.67 to 0.89), followed by modal smoking (from 0.67 to 0.86), most recent smoking (0.64 to 0.80) and closest smoking (0.63 to 0.81). For cg05575921, the C-statistics were significantly higher for the trajectories than most recent (p < .001), closest (p < .001) and modal (p = .03) metrics and for modal compared to most recent (p = .002) and closest (p = .01). When we limited the analyses to patients with ≥0.80 probability of trajectory group membership, the differences with cg05575921 were significantly greater for trajectories than for most recent and closest (both p < .001), and for modal than for most recent (p < .001) and closest (p = .02). For cg03636183 among participants of African ancestry limited to those with ≥0.80 probability of trajectory group membership, C-statistics were significantly higher for trajectory than for modal and closest (p = .026 and .049). For cg03636183 among participants of European ancestry, C-statistics were significantly higher for trajectory than for most recent (p = .012) and modal (p = .05). The sample size was too small for comparisons of C-statistics among participants of European ancestry and with ≥0.80 probability of trajectory group membership. Including only the 567 participants who reported using marijuana or hashish less than monthly or never, cg05575921 results were similar to when all 728 were included.

Figure 2. (a) Comparison of AUROC C-statistics from logistic regression models using three phenotypes of EMR self-reported smoking to predict current smoking based on cg05575921 among all participants (n = 728) and those with trajectory probability ≥0.80 (n = 583). (b) Comparison of AUROC C-statistics from logistic regression models using three phenotypes of EMR self-reported smoking to predict current smoking based on cg03636183 among all those of African ancestry (n = 551) and limited to those with trajectory probability ≥0.80 (n = 440). (c) Comparison of AUROC C-statistics from logistic regression models using three phenotypes of EMR self-reported smoking to predict current smoking based on cg03636183 among all those of European ancestry (n = 84) and limited to those with trajectory probability ≥0.80 (n = 64).

Discussion

Although methylation-defined smoking status was statistically significantly associated with the single, recent, self-reported smoking status in EMR, there was a greater association of methylation-defined smoking status with longitudinal measures (either modal or trajectory). Using methylation as a criterion standard, regardless of methylation cutoff, we found that longitudinal EMR data provided a valid phenotype for smoking status and the validity was enhanced by the integration of longitudinal data into statistical trajectories compared to modal or most proximal phenotypes. In all cases, associations were strong for the methylation site cg05575921 and for cg03636183 for participants of European ancestry. In addition, associations were stronger for cg03636183 for participants of European ancestry than for those of African ancestry, as expected based on previous studies. These results were robust whether we considered only patients for whom the probability of trajectory group assignment was ≥0.80 or irrespective of the certainty of assignment.

This study both confirms and extends prior work on smoking and DNA methylation levels. This work is in line with prior studies demonstrating that cg05575921 and cg03636183 are associated with smoking status (Gao et al. 2015; Joehanes et al. 2016), and both have been applied as biomarkers to define smoking status (current, past and never) (Shenker et al. 2013), smoking intensity (quantity and duration) (Joehanes et al. 2016; Wilson et al. 2017) and smoking cessation (Philibert et al. 2016). To our knowledge, we are the first to use methylation data as a criterion standard for developing a longitudinal phenotype of smoking behavior for genetic discovery. Methylation data are amenable to this application for several reasons. First, they are not subject to the social desirability bias that can confound self-reported smoking status. Second, changes in methylation are only reversed following a long period of smoking abstinence (i.e. decades), so that associations with these biomarkers would be expected to demonstrate an increasing dose–response association between never, past and current smoking, which we observed in our analyses. Third, the fact that we found similar associations for two separate methylation sites supports the validity of the results.

We demonstrated substantially stronger associations between the criterion standard (cg05575921 and cg03636183) and summary metrics of longitudinal, repeated self-report measures of smoking than with a single report (i.e. most recent assessment of smoking status). As we found in prior analyses of repeated longitudinal self-reported measures of alcohol consumption (Justice et al. 2017; Justice et al. 2018), summary measures of repeated self-reported smoking also demonstrate stronger associations than single, cross-sectional reports that are often employed as phenotypes in large-scale genetic studies (Sanchez-Roige et al. 2017). Like drinking behaviors, smoking behaviors among middle-aged individuals are typically stable with some decrement with advancing age. In this context, multiple observations, summarized over time and adjusted for age, can reduce the degree of ‘noise’ in the measurement and provide improved phenotypic ‘signal’. In our prior analyses of self-reported alcohol consumption, an age-adjusted mean AUDIT-C score provided the best association with our criterion standard (the minor allele frequency for an ADH1B polymorphism that has been previously shown to be protective for alcohol use disorder) (Justice et al. 2018).

In the current analysis, statistical trajectories using age as the time scale were superior to the modal response. These findings suggest that it is important both to employ multiple measures of self-reported health behaviors over time and to adjust these for the age of the individual at the time of the report. Otherwise, an older individual's drinking or smoking behaviors might be misclassified based upon a lower level of use at an advanced age, obscuring higher levels of use at younger ages.

It is also important to note that our findings were highly consistent whether or not the analyses were restricted to patients for whom the statistical trajectory assignments were most certain (trajectory probability ≥0.80 versus <0.80). Patients for whom the trajectory assignments were most certain had more consistent self-reported smoking behavior (concordance = 52 percent vs 48 percent for all reports, p <.001) and/or had more reports (mean = 13.6 versus 7.0, p <.001). The findings were also similar when we limited the sample to participants who reported using marijuana or hashish less than monthly or never. These findings underscore that our measure is robust to missing data and to multisubstance use.

There are limitations to this study. First, there are relatively few women in the study, potentially reducing generalizability, although sex has not been identified as a moderator of DNA methylation in smokers. In addition, these data lack granular assessments of the dose of cigarette exposure, such as smoking pack years, which have been shown to predict methylation (Joehanes et al. 2016). Similarly, knowing whether a smoker was daily or non-daily would also be important in the assessment of total dose, as non-daily smokers tend to have lower overall exposure (although exposure can still be substantial) (Shiffman et al. 2012). HIV infection results in the depletion of a particular T-cell subset and a shift of overall cell frequencies. To assess whether the reduction of T-cells in HIV+ individuals affects DNA methylation of cg05575921 and cg03636183 in our sample, we compared DNA methylation of these two loci between HIV+ and uninfected samples and found that they did not differ significantly (cg05575921: t = −1.69, p = 0.09 and cg03636183: t = 0.14, p = 0.88), suggesting that DNA methylation of these loci is associated with smoking but not with HIV status. Finally, the VACS-BC does not contain biomarkers of nicotine metabolism, which is primarily driven by genetic variation in hepatic cytochrome P450 enzymes (Dempsey et al. 2004), and which has been found to influence DNA methylation (Loukola et al. 2015). Although biomarkers may be the most accurate way to identify smoking status, they are impractical on a large scale. We previously developed an algorithm to identify the smoking status of patients in VA care using text fields from EMR smoking data validated against self-reported confidentially collected research survey data (McGinnis et al. 2011). We are currently collecting cotinine concentrations in a separate study of patients and will be able to compare EMR smoking data to cotinine concentrations in future research.

As smoking status is often reported in text fields in non-VA systems as well, these methods can be used outside the VA, so that the findings are generalizable to other health systems. In addition, the VA is rapidly developing a cohort of a million veterans with genetic data in the MVP (Gaziano et al. 2016), and this study will be an important resource as a potential method to identify smoking status in the MVP.

In conclusion, we have demonstrated that longitudinal trajectories of smoking status based upon repeated assessments of smoking that are recorded in the EMR are strongly associated with two DNA methylation sites, which are sensitive biomarkers of tobacco use. We anticipate that these trajectories will prove to be effective, efficient phenotypes for large-scale genetic and other ‘omics’ discovery efforts.

Acknowledgements

The use of DNA methylation status to determine smoking status is protected by US Patents 8,637,652 and 9,273,358 as well as by pending intellectual property claims owned by Behavioral Diagnostics. This study was funded by grants from the National Institute on Alcohol Abuse and Alcoholism (U24-AA020794, U01-AA020790 and U10 AA013566 [completed]) and by VHA i01 BX003341. Views presented in the manuscript are those of the authors and do not reflect those of the Department of Veterans Affairs or the United States Government.

Conflict of Interest

Dr Kranzler has been an advisory board member, consultant or CME speaker for Alkermes, Indivior and Lundbeck. He is also a member of the American Society of Clinical Psychopharmacology's Alcohol Clinical Trials Initiative, which was supported in the last 3 years by AbbVie, Alkermes, Ethypharm, Indivior, Lilly, Lundbeck, Otsuka, Pfizer, Arbor and Amygdala Neurosciences. Drs Kranzler and Gelernter are named as inventors on PCT patent application #15/878,640 entitled: ‘Genotype-guided dosing of opioid agonists,’ filed January 24, 2018. No other authors have conflicts of interest to declare.

Authors Contribution

ACJ, HK, KX and KAM were responsible for the study concept and design. ACJ contributed to the acquisition of the data. ACJ, KAM, JPT, HT and KX assisted with data analysis and interpretation of findings. XZ performed DNA methylation data processing and quality control. BL performed methylation analysis for smoking. KAM, ACJ, KX and HK drafted the manuscript. HT, XK, WB, JC, JG and KC provided critical revisions of the manuscript for important intellectual content. All authors critically reviewed content and approved final version for publication.

    Appendix A: Group-based smoking trajectories among HIV+ and uninfected patients in VACS (2000–2015)
