Prognosis of COVID-19 patients using lab tests: A data mining approach
Abstract
Background
The rapid spread of coronavirus disease 2019 (COVID-19) has caused a worldwide pandemic and affected the lives of millions. The potential fatality of the disease has raised global public health concerns. Apart from clinical practice, artificial intelligence (AI) offers a new paradigm for the early diagnosis and prediction of disease based on machine learning (ML) algorithms. In this study, we aimed to develop a prediction model for the prognosis of COVID-19 patients using data mining techniques.
Methods
In this study, a data set was obtained from the intelligent management system repository of 19 hospitals at Shahid Beheshti University of Medical Sciences in Iran. All included patients had positive polymerase chain reaction (PCR) test results and were hospitalized between February 19 and May 12, 2020. The extracted data set has 8621 instances, comprising demographic information and the results of 16 laboratory tests. In the first stage, the data were preprocessed. Then, four of the 15 laboratory tests were selected. Models were built with seven data mining algorithms, and their performances were compared with each other.
Results
Based on our results, the Random Forest (RF) and Gradient Boosted Trees models were the most efficient, with accuracies of 86.45% and 84.80%, respectively. In contrast, the Decision Tree exhibited the lowest accuracy (75.43%) among the seven models.
Conclusion
Data mining methods have the potential to predict the outcomes of COVID-19 patients from laboratory tests and demographic features. Once validated, these methods could be implemented in clinical decision support systems to improve the management and care of patients with severe COVID-19.
1 INTRODUCTION
The present coronavirus disease 2019 (COVID-19) epidemic is an important public health issue on a global scale.1 COVID-19 is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and the resulting illness is potentially fatal and a major global public health concern. The first case of the disease was identified in Wuhan, China, in December 2019.2, 3 The World Health Organization (WHO) declared the illness a pandemic in March 2020, owing to its spread to other nations. COVID-19 spreads from person to person through respiratory transmission.4 Cough, fever, and shortness of breath are common COVID-19 symptoms.5
Functional screening and diagnostic tools were imperative to control the COVID-19 outbreak and to support isolation, precautions, and clinical management of patients. Aside from clinical procedures and treatments, artificial intelligence (AI) promises a new platform for healthcare. Many AI tools built on machine learning (ML) algorithms are used to analyze data and support decision-making processes.1 Medical practice has been gradually transformed by AI, and AI applications are expanding into fields that were previously assumed to require human researchers, owing to recent advances in digitized data acquisition, computing infrastructure, and ML.2 Using technologies such as AI and deep learning, it is even possible to accurately distinguish between viral types of pneumonia, making such systems effective screening tools.3 The ability to extract knowledge from large data repositories has made data mining an essential component in different fields of human life, and one of its most important foundations is AI.4
Data mining, as a subfield of AI, can be defined as finding interesting patterns in data. The aim is to identify new, reliable, useful, and comprehensible correlations and patterns in available data, and to use the discovered patterns to explain current behavior or foresee future outcomes.5, 6 Data mining techniques are extensively used in various healthcare applications such as patient outcome prediction, modeling of health outcomes, assessment of treatment efficacy, infection control, and ranking of hospitals.7
Research indicates that investigators should build on the insights they have gained, focusing on solving the problems posed by COVID-19 and developing new advances. With the increasing focus on data mining and ML in healthcare, a suitable environment can be provided for further progress.8
Using data mining techniques on COVID-19 problems could help overcome these challenges. For example, accurately anticipating and diagnosing such viral diseases requires prediction systems, which can be challenging to build. In addition, identifying epidemiologic risks in advance through AI-driven techniques can improve the prediction, prevention, and detection of future global health threats.9 Historical data can be a valuable source of information for predictive models. With historical data, data mining can also forecast which patients will develop acute respiratory distress syndrome (ARDS), an acute and serious complication of COVID-19.10 In this regard, Pan et al. at Vulcan Hill Hospital in Wuhan, China, extracted data from the COVID-19 intensive care unit (ICU) to develop, evaluate, and validate various ML models for predicting the prognosis of COVID-19 patients.11 The combination of imaging data sets and ML has also helped in the diagnosis of COVID-19. In the study of Muhammad et al., ML models applied to patients' chest X-ray images were able to extract the imaging features of COVID-19.12, 13 In another study, Gumaei et al. used time-series data on the number of people with COVID-19 worldwide and tried to predict case counts with different ML models.14 Research has also been done on drug development in this area. Using ML techniques, Jamshidi et al. presented a framework based on deep learning (DL) methods to illustrate how AI can accelerate the drug development process. This framework includes eight layers, which are responsible for identifying, analyzing, and predicting a drug's performance at different stages.15, 16 The use of ML models has advanced to the point that they are even used to measure the impact of human interventions on the disease outbreak. For example, Delen et al. used the Gradient Boosted Trees model as a data mining technique to analyze the effectiveness of social distancing during the COVID-19 outbreak.17 Several similar studies have been conducted around the world. For example, Gong et al. constructed an effective model to identify and classify high-risk cases. In their study, 372 patients with COVID-19 were monitored for more than 15 days after admission, and models were built from the baseline data of two groups, severe and nonsevere. They used a nomogram to predict severe illness and assessed its performance. The results in the training and validation cohorts were (area under the curve [AUC]: 0.912 [95% confidence interval (CI): 0.846−0.978], sensitivity 85.71%, specificity 87.58%) and (AUC: 0.853 [0.790−0.916], sensitivity 77.5%, specificity 78.4%), respectively.18 Based on 14 clinical variables, six prediction models for COVID-19 diagnosis were developed using six distinct data mining approaches (BayesNet, Logistic, IBk, CR, PART, and J48). That study investigated 114 past cases from the Taizhou hospital in China's Zhejiang Province, and its findings demonstrated that the CR meta-classifier, with an accuracy of 84.21%, was the most reliable classifier for predicting positive and negative COVID-19 instances.19 During the COVID-19 outbreak, one of the greatest challenges for humanity was the proper management of and response to this disease.20 According to the literature, ML can be useful in COVID-19 research, diagnosis, and prediction.21 Therefore, for quick and effective prediction, a two-stage approach, feature selection followed by COVID-19 diagnosis, can be used.19 Success in combating such epidemics depends heavily on building an arsenal of platforms, methods, approaches, and tools that converge to achieve the desired goals and make life more satisfying.22
Although, as mentioned, many studies have been performed worldwide using different ML models, there is still a need to develop and evaluate these models on other data sets.17 The motivation and contribution of this study are to develop predictive models that determine the outcome of COVID-19 patients using ML methods with a novel feature selection algorithm. These models were trained on data from Iranian hospitals and could help clinicians assess the prognosis of COVID-19 patients from laboratory tests.
Thus, this study aimed to propose a prediction model for the prognosis of COVID-19 patients (i.e., whether the patient will survive or die) using data mining techniques on an Iranian data set of COVID-19 patients. Accordingly, the research was organized into several steps. Overall, the Methods section covers data set collection, preprocessing, feature selection, and modeling and evaluation. In Section 3, the main findings of this study are presented, including the results of feature selection and the evaluation of the data mining models, together with comparative indicators and diagrams of the built models. In Section 4, our findings are compared with those of similar studies. Finally, after stating the study's limitations, general conclusions are drawn along with suggestions for future research.
2 METHODS
The methods used in this research are consistent with the related guidelines. The steps for conducting this research are represented in Figure 1. Overall, the method includes data set collection, data set preprocessing, feature selection, and modeling and evaluation, which are described in the following sections.

2.1 Data set collection
The data set was obtained from the Hospital Intelligent Management (HIM) system repository, a comprehensive system containing patient data from 19 hospitals at Shahid Beheshti University of Medical Sciences in Iran.
All COVID-19 patients with a positive PCR test result who were hospitalized between February 19 and May 12, 2020, were studied. The extracted data set has 8621 instances. The data include demographic features and the results of 16 laboratory tests for each patient. Demographic information includes gender, age, underlying disease, nation, and inpatient department. The laboratory tests include aspartate aminotransferase (AST), lactate dehydrogenase (LDH), lymphocytes, eosinophils (EOS), erythrocyte sedimentation rate (ESR), platelet count (PLT), alanine aminotransferase (ALT), hemoglobin, albumin calcium, magnesium, thyroid-stimulating hormone (TSH), thyroglobulin (T.G.), fasting blood sugar (FBS), thyroxine (T4), triiodothyronine (T3), and procalcitonin (PCT). The data types of the features are tabulated in Table 1.
Feature name | Datatype |
---|---|
Gender | Binominal |
Age | Numerical |
Underlying disease | Nominal |
Nation | Nominal |
Inpatient department | Nominal |
Aspartate aminotransferase (AST) | Numerical |
Alanine aminotransferase (ALT) | Numerical |
Lactate dehydrogenase (LDH) | Numerical |
Lymphocytes | Numerical |
Eosinophil (EOS) | Numerical |
Erythrocyte sedimentation rate (ESR) | Numerical |
Platelet count (PLT) | Numerical |
Hemoglobin | Numerical |
Albumin calcium | Numerical |
Magnesium | Numerical |
Thyroid-stimulating hormone (TSH) | Numerical |
Thyroglobulin (T.G.) | Numerical |
Fasting blood sugar (FBS) | Numerical |
Thyroxine (T4) | Numerical |
Triiodothyronine (T3) | Numerical |
Procalcitonin (PCT) | Numerical |
Outcome (Label) | Binominal |
2.2 Data set preprocessing
Preprocessing is a necessary step for producing an efficient classification model and affects the performance of ML methods.21 In the first step of preprocessing, duplicate records were identified and removed based on the patients' national identification codes. The label (survived or dead) was then transformed into binomial (two-valued) form in the data set.
In the next step, the data set containing the patients' lab results (Table 2) was pivoted into a wide data set with one patient per row and the test results and discharge status as columns.
Admission code | Test group | Test name | Result | Unit |
---|---|---|---|---|
3547430 | ESR | ESR | 35 | mm/hr |
3547430 | AST | AST | 25 | U/L |
3547430 | AST | AST | 24 | U/L |
3547430 | ALT | ALT | 23 | U/L |
3547430 | ALT | ALT | 20 | U/L |
3547440 | PLT | PLT | 439 | ×10³/µL |
3547440 | PLT | PLT | 467 | ×10³/µL |
3547440 | PLT | PLT | 511 | ×10³/µL |
3547440 | PLT | PLT | 444 | ×10³/µL |
3547440 | PLT | PLT | 355 | ×10³/µL |
- Abbreviations: ALT, alanine aminotransferase; AST, aspartate aminotransferase; ESR, erythrocyte sedimentation rate; PLT, platelet count.
Due to the differences in the lab tests for each patient, the resulting data set was sparse.
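As an illustration only (the study's own preprocessing was not published as code), the pivot from the long format of Table 2 to one row per patient could be sketched in Python as follows; the column names and the mean aggregation of repeated tests are assumptions, not the authors' documented choices.

```python
# Illustrative sketch: pivot long-format lab results into one row per admission.
# Column names ("admission_code", "test_name", "result", "outcome") and the mean
# aggregation of repeated tests are hypothetical.
import pandas as pd

labs = pd.DataFrame({
    "admission_code": [3547430, 3547430, 3547430, 3547440, 3547440],
    "test_name":      ["ESR", "AST", "AST", "PLT", "PLT"],
    "result":         [35, 25, 24, 439, 467],
})
outcomes = pd.DataFrame({
    "admission_code": [3547430, 3547440],
    "outcome":        ["survived", "dead"],   # hypothetical labels
})

wide = (
    labs.pivot_table(index="admission_code", columns="test_name",
                     values="result", aggfunc="mean")
        .reset_index()
        .merge(outcomes, on="admission_code")
)
print(wide)  # tests a patient never had appear as NaN, hence the sparsity
```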
2.3 Feature selection
The feature selection process is shown in Figure 2. The process includes an independent t test, feature subset calculation, and feature subset selection, which are described below.

2.3.1 Independent t test
To identify the factors most influential on survival, an independent t test was performed for each lab test, comparing the survived and dead patient groups, and the p value was calculated. For each test, records with a missing value for that lab test were removed. Features with a p value of less than 0.05 were considered significant and passed to the next step.
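A minimal sketch of this screening step is shown below, assuming a wide-format DataFrame named `wide` with one row per patient, numeric lab-test columns, and an `outcome` column holding "survived"/"dead" (hypothetical names; the original analysis was not performed in Python).

```python
# Screen lab tests with an independent t test between outcome groups.
from scipy import stats

def significant_tests(wide, lab_columns, alpha=0.05):
    """Return {lab test: p value} for tests whose group means differ significantly."""
    selected = {}
    for col in lab_columns:
        sub = wide[[col, "outcome"]].dropna()                 # drop records missing this test
        survived = sub.loc[sub["outcome"] == "survived", col]
        dead = sub.loc[sub["outcome"] == "dead", col]
        _, p = stats.ttest_ind(survived, dead)                # independent two-sample t test
        if p < alpha:
            selected[col] = p
    return selected
```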
2.3.2 Feature subset calculation
After this step, to overcome the sparsity issue, we tried to find a subset of features that has as many features as possible while keeping missing values low. For this purpose, we first listed all 32,767 possible nonempty subsets of the feature list (15 features) and assigned a score to each subset. The score of a subset is the number of records that have a value for every feature in that subset. For example, in Figure 3, the number of records having values for the ALT, PLT, and AST features is 1381.

2.3.3 Feature subset selection
The list of feature subsets and their scores is sorted, and the subset with the most features among those with a score of more than 1000 is chosen for the final data set (i.e., the largest subset for which more than 1000 records have no missing value for any of its features; Figure 3).
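The subset scoring and selection described above can be sketched as follows; `wide` and `features` are hypothetical names, and the exhaustive enumeration mirrors the 32,767 subsets mentioned in Section 2.3.2.

```python
# Score every nonempty feature subset by the number of complete records,
# then keep the largest subset whose score exceeds 1000.
from itertools import combinations

def choose_feature_subset(wide, features, min_records=1000):
    scored = []
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):              # 2**15 - 1 = 32,767 subsets for 15 features
            score = len(wide[list(subset)].dropna())           # records with no missing value in the subset
            scored.append((subset, score))
    eligible = [(s, sc) for s, sc in scored if sc > min_records]
    # prefer subsets with more features; break ties by higher record count
    return max(eligible, key=lambda item: (len(item[0]), item[1]))
```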
2.4 Modeling and evaluation
Logistic Regression, Gradient Boosted Trees, Naive Bayes, Decision Tree, Support Vector Machine, Generalized Linear Model, and Random Forest models were generated and evaluated using 10-fold cross-validation in RapidMiner Studio. The process of creating and assessing these models consists of feature selection and optimization of the model parameters on the training data, followed by evaluation of the model on the test data, in the training and testing phases, respectively. The synthetic minority oversampling technique (SMOTE)23 was applied to balance the training data. Cross-validation divides the data set into 10 nonoverlapping folds; each fold is used once for testing while the remaining folds are used collectively for training. A total of 10 models are fit and evaluated on the 10 held-out test sets, and the mean performance is reported.
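For readers who prefer code to prose, the evaluation setup can be approximated in Python as below; the study itself used RapidMiner Studio, so the libraries, parameters, and the synthetic stand-in data are illustrative assumptions, shown for one of the seven classifiers.

```python
# 10-fold cross-validation with SMOTE applied only inside the training folds.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Stand-in for the real feature matrix and outcome labels of the final data set.
X, y = make_classification(n_samples=1461, n_features=12,
                           weights=[0.75], random_state=42)

pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),                 # balances only the training part of each fold
    ("model", RandomForestClassifier(random_state=42)),
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(pipeline, X, y, cv=cv,
                        scoring=("accuracy", "recall", "roc_auc"))
print(scores["test_accuracy"].mean())                  # mean performance over the 10 held-out folds
```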
2.4.1 Logistic regression (LR)
LR is a form of statistical regression analysis used to predict the outcome of a categorical dependent variable from a set of predictor (independent) variables.24 LR can analyze the relationship between a categorical dependent variable and independent factors of any type, whether categorical, continuous, or binary.25 When the dependent variable takes only two values (0/1 or yes/no), the method is referred to as binary logistic regression.26
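For reference, binary LR models the probability of the positive class by applying the logistic (sigmoid) function to a linear combination of the predictors:

```latex
P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```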
2.4.2 Naive Bayes (NB)
NB belongs to Bayesian decision theory and is called naive because its formulation makes some naïve assumptions, yet it can classify documents astoundingly well.27 NB is one of the simplest probabilistic classifiers.28, 29 The classifier simplifies the learning process by assuming that features are independent given the class.28 The resulting classifier is remarkably successful in practice, often competing with more sophisticated techniques.26
NB is efficient in several practical applications. Text classification and medical diagnosis are examples of such applications.27
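Formally, under the conditional independence assumption, NB predicts the class c that maximizes the product of the class prior and the per-feature likelihoods:

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```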
2.4.3 Support vector machine (SVM)
An SVM is used for analyzing data, discovering patterns for classification, and performing regression analysis.24 As a powerful tool for data classification, this model separates two categories by assigning points to one of two disjoint half-spaces, either in the original input space for linear classifiers or in a higher-dimensional feature space for nonlinear classifiers.26 The larger the margin between the two classes, the better the model will be. In addition, SVM works well on data sets with many attributes.24
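In its standard soft-margin form (a textbook formulation, not necessarily the exact variant implemented in the modeling software), the SVM finds the separating hyperplane by solving:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\!\left(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0
```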
2.4.4 Gradient boosted trees
In the Gradient Boosted Trees method, additional trees are added strategically, each correcting the mistakes of the previous models, which tends to increase predictive accuracy. Gradient boosting of regression trees produces competitive, robust, and interpretable procedures for both regression and classification, and is especially suitable for mining less-than-clean data.30 Boosting algorithms are relatively easy to implement and allow experimentation with various model designs. Gradient boosting machines (GBMs) have shown considerable success in practical data mining and ML applications and challenges.27
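Conceptually, gradient boosting builds the ensemble additively: each new tree h_m is fit to approximate the negative gradient of the loss L of the current model and is added with a learning rate ν (a standard formulation of the method):

```latex
F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \nu\, h_m(\mathbf{x}), \qquad
h_m \approx -\left[\frac{\partial L\big(y, F(\mathbf{x})\big)}{\partial F(\mathbf{x})}\right]_{F = F_{m-1}}
```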
2.4.5 Decision tree (DT)
A DT is a promising tool that predicts a response through classification or regression and is one of the primary data mining methods. If the target variable is categorical, classification is used; if it is continuous, regression is used. A DT consists of a root node, leaf nodes, and branches, and a record is evaluated by following the path from the root node to a leaf node.24 A tree is built in two phases: tree-growing (building) and tree-pruning. In the first phase, the algorithm begins with the entire data set at the root node and repeatedly splits the data into subsets until each subset becomes sufficiently small. In the second phase (tree-pruning), the tree is cut back to avoid over-fitting and improve its accuracy.31
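As an illustration, a commonly used splitting criterion for growing such trees is the Gini impurity of a node t (the specific criterion used by the software in this study is not stated):

```latex
\mathrm{Gini}(t) = 1 - \sum_{c} p(c \mid t)^{2}
```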
2.4.6 Generalized linear model (GLM)
The GLM provides a comprehensive and widely favored framework for statistical analysis. In particular, its predictive ability is valuable for assessing the practical importance of predictors and for comparing competing GLMs.32 These models are easy to interpret, and the underlying methods are theoretically well understood and explained.33 The GLM extends the concept of the linear regression model.34 The term GLM was introduced by Nelder and Wedderburn35 and McCullagh and Nelder.36 They showed that if the distribution of the dependent variable Y is a member of the exponential family, a GLM can be specified by two components: the distribution of Y and the link function.37
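In general form, a GLM relates the expected value of the dependent variable Y to a linear combination of predictors through a link function g:

```latex
g\big(\mathbb{E}[Y \mid \mathbf{x}]\big) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
```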
2.4.7 Random forest (RF)
RF, or random decision forest, is an ensemble learning method used for tasks such as classification and regression. In this model, many decision trees are constructed at training time, and the output is the class (for classification) or the mean prediction (for regression) of the individual trees.38 The random forest algorithm was first proposed by L. Breiman in 2001 as a general-purpose classification and regression method. It performs well on problems where the number of samples is small relative to the number of factors. It builds a collection of randomized decision trees and classifies the data by combining their results. Furthermore, it scales easily to large problems, adapts to different ad hoc learning tasks, and returns variable importance measures.39 Although RF generally performs better than a single decision tree, its accuracy is often lower than that of gradient boosted trees; that said, data characteristics can affect their relative performance.40
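For classification, the forest's output is typically the majority vote over its B trees, each grown on a bootstrap sample with a random subset of features considered at each split (a standard formulation, not specific to this study):

```latex
\hat{y}(\mathbf{x}) = \operatorname{mode}\big\{T_1(\mathbf{x}),\, T_2(\mathbf{x}),\, \ldots,\, T_B(\mathbf{x})\big\}
```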
2.4.8 Model evaluation
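For reference, the performance indicators reported in Section 3 are the standard confusion-matrix measures, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives; the AUC is the area under the receiver operating characteristic curve:

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}
```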
3 RESULTS
3.1 Feature selection
The Independent t test p values of lab tests are shown in Table 3.
Lab test | p value |
---|---|
ALT | 0.0 |
AST | 0.0 |
Albumin(Alb) | 0.00044 |
ESR | 0.0 |
Eos | 0.00016 |
FBS | 1e−05 |
Interleukin 6 | 0.01216 |
LDH | 0.0 |
Lymphocytes | 0.0 |
PLT | 0.03728 |
Procalcitonin | 1e−05 |
SGPT ALT | 0.0 |
T4 | 0.00154 |
TSH | 0.01758 |
Total Protein | 0.00391 |
- Abbreviations: ALT, alanine aminotransferase; AST, aspartate aminotransferase; ESR, erythrocyte sedimentation rate; FBS, fasting blood sugar; LDH, lactate dehydrogenase; PLT, platelet count; TSH, thyroid-stimulating hormone.
Of the lab tests shown in Table 3, the Lymphocytes, ESR, PLT, and AST features were selected based on the subset scores (the score of this combination of features was 1298). Finally, these four lab tests, along with eight demographic features, were selected, yielding 1461 records for model creation.
3.2 Modeling and evaluation
Seven data mining techniques were applied to the data set created in the previous step. The evaluated models include the Logistic Regression, Gradient Boosted Trees, Naive Bayes, Decision Tree, Support Vector Machine, Generalized Linear Model, and Random Forest algorithms. The Accuracy, Sensitivity, Specificity, and AUC indicators of the built models are presented in Table 4.
Model | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC |
---|---|---|---|---|
Gradient Boosted Trees | 84.80 | 73.87 | 88.95 | 0.894 |
Random Forest | 86.45 | 60.70 | 96.22 | 0.893 |
Support Vector Machine | 84.74 | 51.24 | 97.45 | 0.891 |
Naïve Bayes | 85.21 | 55.72 | 96.41 | 0.880 |
Logistic Regression | 85.56 | 65.18 | 93.29 | 0.877 |
Generalized Linear Model | 83.09 | 43.29 | 98.21 | 0.865 |
Decision Tree | 75.43 | 28.20 | 93.39 | 0.608 |
- Abbreviation: AUC, area under the curve.
The Random Forest and Gradient Boosted Trees models were the most efficient, with accuracies of 86.45% and 84.80%, respectively. In contrast, the Decision Tree had the lowest accuracy (75.43%) among the seven models. A comparison of the receiver operating characteristic curves of the models is shown in Figure 4.

4 DISCUSSION
This study examined the performance of seven classification models for predicting COVID-19 mortality.
In the first step of this study, we found that 4 of the 15 laboratory tests, namely Lymphocytes, ESR, PLT, and AST, were associated with the survival of COVID-19 patients. In the next step, data mining models were developed to predict COVID-19 outcomes using the epidemiological data set of COVID-19 patients from the hospitals of Shahid Beheshti University of Medical Sciences. The Logistic Regression, Gradient Boosted Trees, Naive Bayes, Decision Tree, Support Vector Machine, Generalized Linear Model, and Random Forest algorithms were applied directly to the data set using RapidMiner Studio. Based on laboratory tests and demographic features, our findings showed that the Random Forest and Gradient Boosted Trees models had higher predictive accuracy than the other models.
The growing importance of data mining in various sciences has led to the introduction of new models for this type of problem, and many studies in the medical sciences have used data mining models to predict the course of diseases. A 2016 study by Che et al.42 on a pediatric ICU data set for acute lung injury (ALI), using a method called interpretable mimic learning built on the GBT model, showed that GBT could recognize important markers in mortality and ventilator-free-day prediction tasks. The 2020 study by Pan et al. was conducted on a database of COVID-19 patients in the ICU to compare conventional logistic regression methods with four ML algorithms; ultimately, the eXtreme Gradient Boosting (XGBoost) model correctly predicted the risk of death with eight markers.11 The study by Gumaei et al., which used daily confirmed COVID-19 data collected in 2020, found that the gradient boosting regression (GBR) model had the best performance among all of the models.14
Owing to the excellent performance of GBT, Delen et al. used this method to analyze the effectiveness of social distancing during the COVID-19 outbreak. Their analysis showed that about 47% of the variation in the disease transmission rate could be explained by changes in mobility patterns resulting from the implementation of social distancing policies in the studied countries.17
The GBT model has even been used for early prediction of COVID-19 with AI methods. Using a new Internet of Medical Things (IoMT) framework together with a GBT model, Yildirim et al. showed that the GBT classifier had the best performance, with an area under the curve (AUC) of 0.970 for the diagnosis of COVID-19, indicating the high performance of the GBT model in that study.1 For some data, such as genomic data sets, GBT is not necessarily the best model; in the study conducted by Akbulut et al. on genomic biomarkers from metagenomic next-generation sequencing data, the multilayer perceptron (MLP) model performed better than the GBT model.2, 42, 43
Based on this study and previous studies, we believe that, for COVID-19, the Gradient Boosted Trees model appears to offer the best performance in predicting patient mortality from laboratory tests such as ALT, although other approaches, such as deep learning models, have also shown acceptable performance.44 Finally, given the selection of the most effective tests for COVID-19 mortality and the evaluation of the Gradient Boosted Trees model, we expect this model to perform best in predicting mortality.
5 LIMITATION
One of the most critical limitations of this study is that the data came only from hospitals under the supervision of Shahid Beheshti University of Medical Sciences, so the sample was limited to that statistical population, which may introduce bias. Nevertheless, the 19 hospitals together form a large, treatment-focused network representing a broad range of patients with COVID-19. Another limitation was the unequal distribution of laboratory tests among patients, which affected the data analysis.
In future work, based on what we have learned in this study, we will try to address the shortcomings of the models and ML methods, and we will use a broader statistical sample that better represents the community.
6 CONCLUSION
Scientists and medical professionals have been working hard to find new ways to fight the infectious disease COVID-19 and its various strains. Recently, ML and data mining methods have been shown to work successfully in healthcare for various purposes. Data mining methods, especially the Gradient Boosted Trees and Random Forest models, can predict the outcomes of COVID-19 patients using lab tests and demographic features. In fact, by using these models and selecting four lab tests from among 16 laboratory tests, we were able to identify patients at risk. These models should be evaluated in larger populations and different settings. Once validated, these methods can be used in the future for the early identification of critically ill patients and to help prevent increases in mortality in such pandemics.
AUTHOR CONTRIBUTIONS
Fariba Khounraz: Conceptualization; data curation; writing – original draft. Mahmood Khodadoost: Conceptualization. Saeid Gholamzadeh: Conceptualization. Rashed Pourhamidi: Writing – original draft. Tayebeh Baniasadi: Supervision; writing – review & editing. Aida Jafarbigloo: Writing – original draft. Gohar Mohammadi: Conceptualization. Mahnaz Ahmadi: Writing – review & editing. Seyed M. Ayyoubzadeh: Conceptualization; data curation; formal analysis; writing – original draft.
CONFLICT OF INTEREST
The authors declare no conflict of interest.
TRANSPARENCY STATEMENT
The lead author Seyed Mohammad Ayyoubzadeh affirms that this manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.
ETHICS STATEMENT
This study design was reviewed and approved by the Ethics Committee of Shahid Beheshti University of Medical Sciences (IR.SBMU.RETECH.REC.1399.487).
DATA AVAILABILITY STATEMENT
Data available on request from the authors.