Targeted learning with daily EHR data
Corresponding Author
Oleg Sofrygin
Division of Research, Kaiser Permanente, Northern California, Oakland, California
Division of Biostatistics, University of California, Berkeley, California
Oleg Sofrygin, Division of Research, Kaiser Permanente, Northern California, Oakland, CA 94612.
Email: [email protected]
Zheng Zhu
Division of Research, Kaiser Permanente, Northern California, Oakland, California
Julie A. Schmittdiel
Division of Research, Kaiser Permanente, Northern California, Oakland, California
Alyce S. Adams
Division of Research, Kaiser Permanente, Northern California, Oakland, California
Richard W. Grant
Division of Research, Kaiser Permanente, Northern California, Oakland, California
Mark J. van der Laan
Division of Biostatistics, University of California, Berkeley, California
Romain Neugebauer
Division of Research, Kaiser Permanente, Northern California, Oakland, California
Abstract
Electronic health records (EHR) data provide a cost- and time-effective opportunity to conduct cohort studies of the effects of multiple time-point interventions in the diverse patient population found in real-world clinical settings. Because the computational cost of analyzing EHR data at daily (or more granular) scale can be quite high, a pragmatic approach has been to partition the follow-up into coarser intervals of pre-specified length (eg, quarterly or monthly intervals). The feasibility and practical impact of analyzing EHR data at a granular scale has not been previously evaluated. We start filling these gaps by leveraging large-scale EHR data from a diabetes study to develop a scalable targeted learning approach that allows analyses with small intervals. We then study the practical effects of selecting different coarsening intervals on inferences by reanalyzing data from the same large-scale pool of patients. Specifically, we map daily EHR data into four analytic datasets using 90-, 30-, 15-, and 5-day intervals. We apply a semiparametric and doubly robust estimation approach, the longitudinal Targeted Minimum Loss-Based Estimation (TMLE), to estimate the causal effects of four dynamic treatment rules with each dataset, and compare the resulting inferences. To overcome the computational challenges presented by the size of these data, we propose a novel TMLE implementation, the “long-format TMLE,” and rely on the latest advances in scalable data-adaptive machine-learning software, xgboost and h2o, for estimation of the TMLE nuisance parameters.
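The coarsening step described above, mapping daily records into analytic datasets with 90-, 30-, 15-, and 5-day intervals, can be illustrated with a minimal sketch. The abstract does not state the aggregation rules, so the ones used here are assumptions made only for illustration: day `d` falls in interval `d // interval_days`, the treatment indicator takes the last daily value observed in the interval, and the event indicator is 1 if any day in the interval had an event. The function name `coarsen` and the toy data are hypothetical and are not taken from the paper or from the `stremr` package.

```python
from collections import defaultdict


def coarsen(daily_rows, interval_days):
    """Map daily rows (patient_id, day, treated, event) to one row per interval.

    Assumed rules (illustrative only, not the paper's): day d belongs to
    interval d // interval_days; `treated` is the last daily value in the
    interval; `event` is 1 if any day in the interval had an event.
    """
    bins = defaultdict(list)
    for patient_id, day, treated, event in daily_rows:
        bins[(patient_id, day // interval_days)].append((day, treated, event))

    coarse = []
    for (patient_id, t), days in sorted(bins.items()):
        days.sort()  # order the days within the interval
        treated = days[-1][1]  # last observed treatment status
        event = int(any(e for _, _, e in days))  # any event in the interval
        coarse.append((patient_id, t, treated, event))
    return coarse


# Toy daily data: one patient followed for 180 days,
# treatment starts on day 40 and an event occurs on day 100.
daily = [("p1", d, int(d >= 40), int(d == 100)) for d in range(180)]

for length in (90, 30, 15, 5):
    rows = coarsen(daily, length)
    print(length, len(rows))
```

On this toy record, the 90-day dataset already marks the patient as treated in the first interval even though treatment began on day 40, whereas the 5-day dataset places the start of treatment in the interval containing day 40. This loss of timing information is the kind of consequence of coarsening that the comparison across interval lengths is meant to examine.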
Supporting Information
Filename | Description |
---|---|
SIM_8164-Supp-0001-websuppl_stremr_paper_v2_0.pdf (PDF document, 320.3 KB) | SIM_8164-Supp-0001-websuppl_stremr_paper_v2_0.pdf |
Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.