Volume 38, Issue 16, pp. 3073-3090
RESEARCH ARTICLE

Targeted learning with daily EHR data

Oleg Sofrygin (Corresponding Author)

Division of Research, Kaiser Permanente, Northern California, Oakland, California

Division of Biostatistics, University of California, Berkeley, California

Correspondence: Oleg Sofrygin, Division of Research, Kaiser Permanente, Northern California, Oakland, CA 94612. Email: [email protected]

Zheng Zhu

Division of Research, Kaiser Permanente, Northern California, Oakland, California

Julie A. Schmittdiel

Division of Research, Kaiser Permanente, Northern California, Oakland, California

Alyce S. Adams

Division of Research, Kaiser Permanente, Northern California, Oakland, California

Richard W. Grant

Division of Research, Kaiser Permanente, Northern California, Oakland, California

Mark J. van der Laan

Division of Biostatistics, University of California, Berkeley, California

Romain Neugebauer

Division of Research, Kaiser Permanente, Northern California, Oakland, California
First published: 25 April 2019
Citations: 24

Abstract

Electronic health records (EHR) data provide a cost- and time-effective opportunity to conduct cohort studies of the effects of multiple time-point interventions in the diverse patient population found in real-world clinical settings. Because the computational cost of analyzing EHR data at a daily (or more granular) scale can be quite high, a pragmatic approach has been to partition the follow-up into coarser intervals of pre-specified length (eg, quarterly or monthly intervals). The feasibility and the practical impact of analyzing EHR data at a granular scale have not been previously evaluated. We begin to fill these gaps by leveraging large-scale EHR data from a diabetes study to develop a scalable targeted learning approach that allows analyses with small intervals. We then study the practical effects of selecting different coarsening intervals on inferences by reanalyzing data from the same large-scale pool of patients. Specifically, we map daily EHR data into four analytic datasets using 90-, 30-, 15-, and 5-day intervals. We apply a semiparametric and doubly robust estimation approach, longitudinal targeted minimum loss-based estimation (TMLE), to estimate the causal effects of four dynamic treatment rules with each dataset and compare the resulting inferences. To overcome the computational challenges presented by the size of these data, we propose a novel TMLE implementation, the "long-format TMLE," and rely on the latest advances in scalable data-adaptive machine-learning software, xgboost and h2o, for estimation of the TMLE nuisance parameters.
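To make the coarsening step concrete, the following Python/pandas sketch maps one patient's daily records into fixed-width analytic intervals at the four scales named in the abstract. The column names, aggregation rules, and the coarsen helper are illustrative assumptions for exposition only, not the authors' implementation.

    # Hypothetical sketch: coarsening daily EHR records into fixed-width
    # analytic intervals (90, 30, 15, or 5 days). Column names and the
    # within-interval summary rules are assumptions, not the paper's code.
    import pandas as pd

    def coarsen(daily: pd.DataFrame, width: int) -> pd.DataFrame:
        """Map one patient's daily records into `width`-day intervals."""
        daily = daily.copy()
        # Interval index: days 0..width-1 -> 0, width..2*width-1 -> 1, ...
        daily["interval"] = daily["day"] // width
        return (
            daily.groupby("interval")
            .agg(
                treated=("treated", "max"),  # exposed on any day in interval
                a1c=("a1c", "last"),         # carry most recent lab value
                event=("event", "max"),      # outcome indicator for interval
            )
            .reset_index()
        )

    # Example: one patient followed for 360 days, with treatment initiated
    # on day 100 and the outcome event occurring on the last day.
    daily = pd.DataFrame({
        "day": range(360),
        "treated": [int(d >= 100) for d in range(360)],
        "a1c": [7.5] * 360,
        "event": [0] * 359 + [1],
    })
    for width in (90, 30, 15, 5):
        print(width, "-day scale:", len(coarsen(daily, width)), "intervals")

Under these assumptions, each coarser scale summarizes treatment, covariate, and outcome information within its intervals, trading temporal granularity for a shorter analytic dataset and lower computational cost; the paper's comparison across the four scales quantifies the impact of that trade-off on inference.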
