Volume 24, Issue 10 pp. 1282-1293
ORIGINAL ARTICLE

Handling missing data in a rheumatoid arthritis registry using random forest approach

Ahmad Alsaber

Corresponding Author

Department of Mathematics and Statistics, University of Strathclyde, Glasgow, UK

Correspondence

Ahmad Alsaber, Department of Mathematics and Statistics, University of Strathclyde, Glasgow, G1 1XH, UK.

Email: [email protected]

Adeeba Al-Herz

Department of Rheumatology, Al-Amiri Hospital, Kuwait City, Kuwait

Jiazhu Pan

Department of Mathematics and Statistics, University of Strathclyde, Glasgow, UK

Ahmad T. AL-Sultan

Department of Community Medicine and Behavioral Sciences, Kuwait University, Kuwait City, Kuwait

Divya Mishra

Department of Plant Pathology, Kansas State University, Manhattan, KS, USA

KRRD Group

Department of Rheumatology, Al-Amiri Hospital, Kuwait City, Kuwait

First published: 12 August 2021
Citations: 7

Ahmad Alsaber and Jiazhu Pan contributed equally to this study.

Funding information

This is an investigator study from the KRRD registry. The KRRD registry is supported by unrestricted grants from Pfizer.

Abstract

Missing data in clinical epidemiological research violate the intention-to-treat principle, reduce statistical power, and can introduce bias if the cause of missingness is related to a patient's response to treatment. Multiple imputation offers a way to predict the missing values. The main objective of this study was to estimate and impute missing values in patient records, using data from the Kuwait Registry for Rheumatic Diseases (KRRD). Several imputation methods were implemented, and the best method was taken to be the one with the lowest root mean square error (RMSE). Among 1735 rheumatoid arthritis patients, the proportion of missing values ranged from 5% to 65.5% of the total observations. The results show that the sequential random forest method can estimate these missing values with a high level of accuracy. The RMSE varied between 2.5 and 5.0. missForest had the lowest imputation error for both continuous and categorical variables at each missing data rate (10%, 20%, and 30%) and had the smallest prediction error difference when the models used the imputed laboratory values.
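
The selection rule described in the abstract (hide a known fraction of values, impute them, and keep the method with the lowest RMSE) can be illustrated with a minimal sketch. It is not the authors' code and does not implement missForest: mean and median imputation are used as simple stand-ins, the data are synthetic, and every function and variable name below is hypothetical.

```python
import math
import random
import statistics


def rmse(true_values, imputed_values):
    """Root mean square error between held-out true values and their imputations."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))


def mask_at_random(values, rate, rng):
    """Hide a fraction `rate` of the entries; return the masked copy and hidden indices."""
    n_hidden = max(1, round(rate * len(values)))
    hidden = sorted(rng.sample(range(len(values)), n_hidden))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def impute_constant(masked, summary):
    """Fill every missing entry with a summary (e.g. the mean) of the observed entries."""
    observed = [v for v in masked if v is not None]
    fill = summary(observed)
    return [fill if v is None else v for v in masked]


def compare_imputers(values, rates, imputers, seed=0):
    """Return {(rate, method_name): RMSE on the artificially hidden entries}."""
    rng = random.Random(seed)
    results = {}
    for rate in rates:
        masked, hidden = mask_at_random(values, rate, rng)
        truth = [values[i] for i in hidden]
        for name, summary in imputers.items():
            completed = impute_constant(masked, summary)
            results[(rate, name)] = rmse(truth, [completed[i] for i in hidden])
    return results


# Synthetic stand-in for one continuous laboratory variable (assumption, not registry data).
data_rng = random.Random(42)
lab_values = [data_rng.gauss(30.0, 8.0) for _ in range(200)]

imputers = {"mean": statistics.mean, "median": statistics.median}
rates = [0.10, 0.20, 0.30]  # missing data rates named in the abstract
scores = compare_imputers(lab_values, rates, imputers)

for rate in rates:
    best = min(imputers, key=lambda name: scores[(rate, name)])
    print(f"missing rate {rate:.0%}: lowest RMSE from '{best}' imputation")
```

A random-forest imputer would slot into the same comparison as one more entry in `imputers`, provided it is adapted to use the other variables in each record rather than a single summary value.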

CONFLICT OF INTEREST

The authors have declared that no competing interests exist.
