Novel Machine and Deep Learning–Based Ensemble Techniques for Automatic Lung Cancer Detection
Abstract
Lung cancer (LC) remains one of the most challenging malignancies to diagnose accurately due to the complexity of its pathology and the limitations of traditional diagnostic protocols. This study presents a novel deep learning–based ensemble technique aimed at enhancing the automatic detection of LC, integrating machine learning (ML) methodologies to significantly improve diagnostic accuracy and reliability. The publicly available Kaggle LC survey dataset, comprising 309 instances with 15 clinical attributes and one target variable, including age, smoking history, and symptoms, was used. The synthetic minority oversampling technique (SMOTE) was employed for data balancing, complemented by a fivefold cross-validation strategy to assess the performance of various ML models. This approach utilizes convolutional neural networks (CNNs) integrated with ensemble learning to develop two novel models, ConvXGB and ConvCatBoost, which synergistically combine CNN-based feature extraction with the powerful classification capabilities of XGBoost and CatBoost, respectively, for binary classification in LC detection. ConvXGB achieved improved results, including an accuracy of 98.26%, precision of 98.72%, recall of 98.72%, and F1-score of 98.72%. These performance metrics underscore ConvXGB’s potential to minimize false positives and negatives, enhancing outcomes in LC diagnostics. Additionally, the complexities surrounding LC diagnosis were explored through a comprehensive comparative analysis, establishing benchmarks for evaluating diagnostic precision, accuracy, recall, and F1-score. Our findings highlight the transformative impact of ML in LC research, paving the way for diagnostic frameworks that promise significant advancements in oncological practice.
1. Introduction
Lung cancer (LC) is a pervasive global malignancy that remains a leading cause of cancer-induced fatalities, posing significant challenges due to its heightened intratumor heterogeneity and the intricate characteristics of cancer cells [1]. With an annual diagnosis rate of 2.20 million individuals, a staggering 75% succumb within 5 years [2]. Overcoming these challenges requires a comprehensive understanding of LC patterns, diagnoses, and treatment responses. When a breath is taken, air travels through the body before it reaches the lungs. It first enters through the nose, then travels down the pharynx, through the larynx, and into the trachea [3]. From there, it moves into the bronchi, which split into smaller branches, eventually reaching tiny air sacs called alveoli. These alveoli are surrounded by tiny blood vessels called capillaries, where gas exchange occurs: oxygen from the inhaled air enters the blood, and carbon dioxide from the blood is transferred to the alveoli to be exhaled [4]. This exchange is crucial for survival, as the body relies on a constant supply of oxygen obtained through breathing [5]. Understanding this process is essential when studying lung function, especially in the context of early LC detection [6].
Lung diseases rank among the leading causes of death in many developed nations. Lung tissue suffers severe damage over time because of exposure to environmental toxins, smoking, and persistent inflammation [7, 8]. The lungs have natural defenses, with mucus acting as a barrier to trap and remove harmful substances. However, these protective mechanisms can become overwhelmed, especially when someone smokes [9]. Constant exposure to smoke and toxins can damage the lungs’ ability to clear out these harmful particles, making it harder for the body to protect itself. This is one of the reasons why smoking can lead to serious lung conditions over time [10]. Multiple factors, including environmental exposures, genetic characteristics, and hereditary medical conditions, contribute to declining lung health and result in different respiratory ailments [11]. Chronic obstructive pulmonary disease (COPD) encompasses chronic bronchitis and emphysema [12]; the disease develops when chronic bronchitis combines with emphysema into an advanced and progressively worsening illness. Smoking is the primary cause of its development [13].
Chronic bronchitis causes persistent inflammation and damage to the bronchial lining, the structure that links the trachea to the lungs [14]. Patients with chronic bronchitis present a prolonged cough with abundant mucus, along with breathing problems. Emphysema, by contrast, primarily damages the air sacs of the lungs. The condition produces symptoms that include continuous coughing, breathing difficulties, reduced physical tolerance, and diminished capacity to perform routine tasks without fatigue [15].
LC is the leading cancer-related killer of men and women in the United States [16]. Deaths from LC exceed the combined total of deaths from colon, cervical, and breast cancer [17]. A persistent cough is the main sign that people with LC experience, and healthcare providers must evaluate this symptom closely. People who already have COPD from smoking experience recurrent coughing because of their smoking history [18]. The most critical warning signal related to LC is a noteworthy change in cough behavior that becomes both more frequent and more intense and may include mucus production or the presence of blood. The symptoms of LC can manifest as phlegm in coughed-up material, chest pain, difficulty breathing, appetite reduction, unintended weight loss, fever, and, in rare cases, hemoptysis [19].
LC stands as a leading cause of cancer-related fatalities worldwide, accounting for 11.4% of new diagnoses and contributing to 18% of all cancer deaths, with approximately 85% attributed to non–small cell lung cancer (NSCLC) [20]. The diagnosis of NSCLC typically entails methodologies such as radiofrequency and stereotactic body radiotherapy (SBRT), unlike small cell lung carcinoma (SCLC), which necessitates specific therapeutic strategies and disseminates more quickly [21]. NSCLC advances slowly and presents diverse subtypes such as squamous cell carcinoma, large cell carcinoma, and adenocarcinoma [22]. Although radon exposure, smoking, and air pollution are prevalent triggers, NSCLC can affect nonsmoking individuals as well. Indications may encompass a persistent cough, breathing difficulties, chest discomfort, and coughing up blood [23]. SCLC, constituting 10%–15% of LC instances, manifests vigorous proliferation and uniform cellular morphology under microscopic examination [6]. Mainly associated with smoking, it can affect nonsmokers as well and frequently undergoes early metastasis, resulting in advanced stages upon detection [24, 25]. The diagnostic process incorporates imaging evaluations such as CT scans and tissue sampling [26]. Treatment typically integrates chemotherapy and radiation, occasionally complemented by surgical procedures if the malignancy is confined [27]. The prognosis is generally unfavorable, with a median survival of approximately 1 year postdiagnosis. Timely identification and intervention are paramount for enhancing outcomes in the case of LC. The distinct association of SCLC with smoking mandates unique therapeutic and diagnostic methodologies compared to NSCLC. The early detection of NSCLC substantially elevates survival rates, varying from 35% to 85% based on the stage and tumor category, although the overall 5-year survival rate remains low at 16% due to delayed identification [25]. Chemotherapy, efficacious in 60% of NSCLC instances, is a fundamental modality for SCLC treatment. Smoking, responsible for about 90% of LC cases, alongside elements like air pollution, radon, asbestos, and infections, contributes to the onset of LC. Both genetic predisposition and acquired mechanisms influence susceptibility. Therapeutic alternatives encompass radiation, surgical interventions, targeted therapies, and chemotherapy [27].
In the dynamic landscape of oncological research, collaborative initiatives and technological advancements have led to the creation of extensive clinical, imaging, and sequencing databases spanning several decades. These repositories empower researchers to delve into diverse datasets and unravel intricate patterns across LC diagnosis and treatment [28]. Based on this, there has been a notable enhancement of 15%–20% in the precision of cancer prognosis in recent years. This improvement can be attributed to the utilization of ML methodologies [29]. Advanced molecular analyses, spanning genomics, proteomics, transcriptomics, and metabolomics, have significantly enhanced our research capabilities [30]. In the context of cancer studies, the fusion of various data types and extensive datasets is progressively prevalent, highlighting the crucial role of ML in this area of research [31]. Specializing in prognostication and discerning patterns within data through mathematical algorithms, ML plays a pivotal role in early detection [32], classification, feature extraction, tumor microenvironment deconvolution, drug response and prognosis prediction, and evaluation in cancer research [33]. ML models assist physicians in navigating the complexity of cancer-associated databases, facilitating informed decision-making [34].
ML [35, 36], a subset of AI, establishes a connection between the challenge of extracting knowledge from data samples and the overarching concept of inference [37–39]. The learning process proceeds in two separate stages: first, the estimation of hidden dependencies within a system using a given dataset, and second, the utilization of these estimated dependencies to make predictions about new outcomes of the system [38]. ML has gained substantial momentum in biomedical research, illustrating its flexibility in multiple applications [39]. Consider a scenario where medical data [40] regarding LC is gathered to predict the nature of a tumor, whether it is malignant or benign, based solely on its size. In this context of ML, the goal would be to ascertain the probability of the tumor being malignant, denoted as 1 for yes and 0 for no [41]. To ensure good generalization, it is important to explore a wide range of biological samples and use different methods and tools [42]. Supervised and unsupervised learning stand as the twin pillars of ML methodologies. Supervised learning harnesses labelled training data to deduce or map input data to desired outputs, whereas unsupervised learning operates in a label-free environment devoid of predefined outcomes at the outset. Bridging the strengths of both, semisupervised learning crafts precise models by leveraging both labelled and unlabeled data, particularly beneficial when unlabeled datasets outnumber labelled ones. At the heart of ML lie data samples, each characterized by numerous features with diverse value types. A myriad of methods exist for data preparation, tailored to optimize data alignment with specific ML paradigms [43]. Paramount among these strategies are feature extraction, feature selection, and dimensionality reduction. The ultimate aim of ML methodologies is to furnish models proficient in executing tasks like classification, prediction, or estimation. Among these tasks, classification emerges as the foremost and most pervasive pursuit within the learning process [44]. Regression and clustering are two other significant ML tasks. In regression problems, a learning function maps data onto a real-valued variable, enabling the estimation of the predictive variable’s value for each new sample. Clustering, as a typical unsupervised assignment, involves the identification of categories or clusters that describe the data items.
In our field of study, deep learning (DL) is a subset of ML [45] that utilizes convolutional neural networks (CNNs) to render accurate decisions [46, 47]. Its effectiveness is most pronounced when addressing complex challenges, notably classification. Within our study, the performance of DL models across publicly accessible datasets is scrutinized. For example, given the continuous nature of LC data, CNN models emerge as powerful instruments for analyzing population-level patterns in LC [48]. Our manuscript systematically guides readers through a review of related studies, detailed materials and methods, comprehensive results and discussion, and concluding insights into future directions.
1.1. Motivation of the Study
LC mortality remains among the highest of all cancers because medical professionals find it challenging to detect early symptoms without sophisticated diagnostic systems. Traditional CT scan diagnosis methods, which are time-consuming, labor-intensive, and prone to human error, often produce numerous false positives (FPs) and false negatives (FNs). The application of ML and DL for automated diagnosis continues to encounter three key limitations: class imbalance within datasets, data noise, and the difficulty of extracting crucial clinical features from complex image data. CNNs offer advanced hierarchical feature extraction for our dataset, yet they perform inadequately on structured clinical data formats. Conversely, ensemble models such as eXtreme gradient boosting (XGBoost) and CatBoost excel at processing structured data but operate poorly on raw imaging data. The complementary strengths of these techniques inspired the development of a hybrid framework that integrates their advantages. ConvXGB and ConvCatBoost, which integrate CNN-based deep feature extraction with gradient boosting classifiers to create an accurate, reliable, and interpretable early LC detection system, are proposed. The combination of these methods produces an improved diagnostic procedure that overcomes the limitations of each individual framework: it lowers misdiagnosis errors caused by imbalanced datasets and enhances performance across various healthcare scenarios through its scalable deployment capabilities. The aim of this study is to establish a system that connects deep feature extraction with structured information analysis to deliver prompt, doctor-driven medical knowledge. This research advances intelligent diagnostic systems that detect LC at an earlier stage with higher precision and improved operational efficiency, with potential life-saving applications that transform clinical oncology decisions.
1.2. Contribution of the Study
- •
Two novel ensemble models, ConvXGB and ConvCatBoost, which combine CNNs with XGBoost and CatBoost, are introduced. These models significantly outperform traditional ML classifiers in LC diagnosis.
- •
An evaluation process with fivefold cross-validation on unbalanced and balanced datasets was implemented, ensuring a fair and comprehensive assessment of model performance.
- •
The models of this study achieved improved diagnostic results, with ConvXGB showcasing excellent accuracy, high recall, and a strong F1-score, setting a new standard for LC detection.
- •
A fully reproducible experimental framework is created, detailing everything from clinical datasets and preprocessing to model architecture and algorithm workflows. This framework makes it easy for other researchers to replicate and build upon our work.
- •
Finally, this study highlighted how integrating deep feature learning with ensemble boosting techniques can significantly improve diagnostic accuracy and reduce misclassification in clinical datasets [49].
2. Related Works
This literature review explores the complex relationship between statistical analysis, ML techniques, and clinical applications [50, 51] concerning LC research. LC continues to be a major global health concern, accounting for a substantial share of cancer-related deaths around the globe. The heterogeneous nature and diverse manifestations of LC make early diagnosis challenging, despite advances in diagnostic techniques and treatments. This review presents a wide range of research that improves LC diagnosis, prognosis, and treatment through statistical and ML methods. Our goal is to summarize the current state of LC research, identify current trends, and highlight promising directions for further investigation by combining the knowledge from these studies.
Faisal et al. [52] assessed the discriminative power of predictors to improve the efficiency of symptom-based LC detection. Various classifiers, including support vector machine (SVM), decision tree, multilayer perceptron, and neural networks, were evaluated using a benchmark dataset from the UCI repository. Their performance was compared with renowned ensemble methods such as random forest (RF) and majority voting. The gradient-boosted tree emerged as the top performer, achieving 90% accuracy and surpassing individual classifiers and ensembles. Nam and Shin [53] underscore the significance of active research on clinical decision support systems (CDSS) employing ML to address diagnostic challenges. Their study delves into ML’s capacity for LC detection using Azure ML (with three algorithms: two-class decision jungle, multiclass decision jungle, and two-class SVM). Despite constrained big data availability, collaboration is supported for refined screening models, achieving 94% accuracy and a high true positive (TP) rate. Patra [54] highlighted LC as a formidable respiratory obstacle stemming from unregulated lung cell proliferation. Smoking, active or passive, notably escalates mortality rates, especially among young individuals. Despite advancements in medical care, mortality persists, underscoring the urgency for early detection. ML presents promising prospects in healthcare, facilitating timely disease prognosis. Their analysis of diverse classifiers on UCI-LC data identified the radial basis function (RBF) classifier as a robust predictor, attaining 81.25% accuracy. Vieira et al. [19] constructed a data mining model using CRISP-DM and RapidMiner with the ANN algorithm to predict LC based on symptoms. Various models and scenarios were evaluated, with the worst-performing scenario still achieving 93% accuracy, 96% sensitivity, 90% specificity, and 91% precision using an artificial neural network. Xie et al. [55] stress the vital role of early detection in improving LC patient survival. Their study focuses on identifying plasma metabolites as diagnostic biomarkers for LC in Chinese patients. Utilizing an interdisciplinary approach merging metabolomics and ML, they analyzed data from 110 LC patients and 43 healthy individuals. Six metabolic biomarkers effectively differentiated Stage I LC patients from healthy subjects, exhibiting strong results. Binson et al. [18] aimed to extract biomarkers from easily obtainable breath samples to assist in diagnostics. Using metabolomics tools, they established unique breath profiles for timely LC detection. The study examined breath samples from 218 individuals, including patients with LC, COPD, and asthma, as well as healthy individuals. Eight ML models were developed to differentiate patients from healthy controls. The KPCA-XGBoost model yielded promising results, with high accuracy, sensitivity, and specificity for predicting LC. Dritsas and Trigka [4] employed ML techniques for the identification of LC susceptibility, demonstrating the efficacy of the rotation forest (RotF) algorithm. The performance of the model was assessed through various evaluation measures. Different algorithms, namely, SVM, naive Bayes (NB), RF, and RotF, were evaluated, with RotF emerging as the top performer. By employing SMOTE in conjunction with 10-fold cross-validation, the RotF algorithm demonstrated exceptional accuracy, precision, recall, and F-measure outcomes alongside a notable area under the curve (AUC) of 99.3%, underscoring its potential for precise anticipation of LC risk and timely intervention.
Through their empirical study, Omar and Nassif [56] implemented an array of ML algorithms, such as K-nearest neighbor (KNN), SVM, MLP, and the RBF classifier, for predicting LC by assessing symptomatology. Utilizing a textual dataset encompassing 15 discrete features, the research meticulously assessed the efficacy of each model. To facilitate a comparison of predictive accuracy, a variety of feature selection and extraction methodologies were employed, elucidating the impact of these preprocessing techniques on model performance. Results showed that MLP reliably exceeded the performance of alternative models, reaching the highest accuracy regardless of the feature selection technique applied. This level of performance underscores the robustness of MLP in the domain of symptom-driven LC prediction, positioning it as a promising candidate for dependable, symptom-oriented diagnostic applications. Dirik [15] aimed to improve early illness detection by efficiently assessing numerous instances with various algorithms on existing datasets, targeting expert-level results, and focused on creating an automated system for early-stage LC detection using ML techniques. Using different ML algorithms, the system’s effectiveness was evaluated through accuracy, sensitivity, and precision metrics derived from the confusion matrix, showing a 91% accuracy rate. Rao [57] emphasized the urgent necessity for early cancer detection to enhance treatment effectiveness and decrease mortality rates. Their study delved into diverse detection techniques and algorithms for lung and breast cancer, using hybrid approaches with varied image types and datasets. Despite the challenge posed by the small size of cancerous cells, promising avenues include RF, Bayesian methods, SVM, logic-based classifiers, and CNN for classifying and segmenting cancer cells. Previous studies showed favorable outcomes with traditional ML and DL approaches in LC detection, yet important limitations remain. Current methods treat images or tabular information separately, which creates issues for feature extraction and population-level generalization. The effective management of imbalanced classes and reliable performance on diverse small datasets also represent significant barriers [58, 59]. Moreover, existing methods fail to exploit the combined learning capabilities of multiple architectures because they rely on individual architectures as their sole solution approach [60].
Our research presents a new ensemble framework that unites the deep feature extraction potential of CNNs with the classification strength of boosting algorithms (XGBoost and CatBoost). The ConvXGB and ConvCatBoost models establish a single framework that enhances structured clinical data processing, improves diagnostic outcomes, increases robustness to class imbalance, and enhances model interpretability [61]. The combined ensemble model provides greater capability than traditional ML approaches and standalone DL, offering scalable solutions for early LC detection [1].
3. Materials and Methods
This section delineates the details of the publicly accessible dataset, including the data-gathering procedure and a thorough overview of the dataset. Moreover, it highlights the techniques utilized in developing and refining the proposed LC prediction model. It offers an in-depth explanation of each model and ends by examining their performance metrics. Furthermore, it presents the comprehensive structure of the proposed framework for the detection of LC, as illustrated in Figure 1.

3.1. Experimental Setup
The experiments were conducted on a high-performance computing workstation equipped with an Intel Core i7-11800H processor running at 2.30 GHz, 32 GB DDR4 RAM, and an NVIDIA GeForce RTX 3060 GPU with 6 GB of dedicated memory, operating on Windows 11 Professional 64-bit. All experiments were implemented using Python 3.10 as the primary programming environment. DL models, including the lightweight 1D CNN used for feature extraction, were developed using the TensorFlow 2.11 and Keras libraries. Traditional ML models such as logistic regression (LR), KNN, SVM, RF, XGBoost, and CatBoost were implemented using the Scikit-learn, XGBoost, and CatBoost libraries. Data preprocessing, augmentation, and visualization were performed using the Pandas, NumPy, Matplotlib, and Seaborn libraries [62]. The setup leveraged the computing power of the workstation to handle the complex tasks of training DL models and evaluating performance metrics.
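For reference, a minimal import stack matching this environment might look as follows (versions per the text; the exact module selection is an assumption, not a listing from the original code):

```python
# Indicative import stack for the environment described above
# (Python 3.10, TensorFlow 2.11); module selection is an assumption.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf                               # TensorFlow 2.11 + Keras
from sklearn.linear_model import LogisticRegression   # LR
from sklearn.neighbors import KNeighborsClassifier    # KNN
from sklearn.svm import SVC                           # SVM
from sklearn.ensemble import RandomForestClassifier   # RF
from xgboost import XGBClassifier                     # XGBoost
from catboost import CatBoostClassifier               # CatBoost
```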
3.1.1. Handling of Missing Values
Before building the predictive models, we thoroughly examined the dataset for any missing entries. While key features such as age, smoking status, and fatigue were complete, a few other attributes contained missing values in the datasets. To preserve the dataset and avoid discarding potentially useful samples, we applied straightforward imputation techniques for any missing values in the datasets. For numerical features, missing values were replaced with the mean of the respective column. For categorical features, the most frequent value (mode) was used. This approach maintained the consistency and size of the dataset without introducing bias from row deletion. All data preprocessing steps, including missing value handling, were performed using Python libraries such as Pandas and Scikit-learn.
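A minimal sketch of this imputation strategy, assuming the survey data are loaded into a Pandas DataFrame `df` (a hypothetical variable name), could be:

```python
import pandas as pd

# Assumes `df` holds the survey data; columns with gaps get mean/mode imputation.
for col in df.columns:
    if df[col].isna().any():
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())      # numerical: column mean
        else:
            df[col] = df[col].fillna(df[col].mode()[0])   # categorical: most frequent value
```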
3.2. Dataset
A publicly available dataset (https://www.kaggle.com/datasets/mysarahmadbhat/lung-cancer) was utilized for this study [63]. The dataset stems from a survey conducted among individuals with and without LC, aiming to investigate various characteristics that could be correlated with the illness. It comprises 309 instances with 15 clinical attributes (gender, age, smoking, yellow fingers, anxiety, peer pressure, chronic disease, fatigue, allergy, wheezing, alcohol consumption, coughing, shortness of breath, swallowing difficulty, and chest pain) and one target variable (LC) [4]. To address the imbalanced class distribution between LC and non-LC instances, SMOTE is applied [64]. Non-null counts indicate the number of available values per column, with most features stored as integers; gender and LC are categorical (textual) variables. This description encapsulates the preprocessed patient information passed on for further processing within the ML pipeline.
3.3. ML Classifiers
In the diagnosis of LC, ML models play a pivotal role. This study employs a range of classifiers, including LR, KNN, SVM, RF, XGBoost, and CatBoost. Additionally, this work develops ensemble models called ConvXGB and ConvCatBoost, which integrate a CNN with the XGBoost and CatBoost algorithms. All algorithms were trained carefully to leverage their unique strengths, aiming to deliver precise LC prognosis. To enhance model robustness, k-fold cross-validation with k = 5 was utilized, ensuring reliable performance assessment and minimizing overfitting [65]. The individual methods are described in the following subsections.
3.3.1. LR
3.3.2. KNN
3.3.3. SVM
In the context of LC detection using SVM, the mathematical equation revolves around finding the optimal boundary that separates patients with LC from those without based on their features. Here is how the equation translates into this context:
If wᵀx + b > 0, the patient is classified as having LC.
If wᵀx + b < 0, the patient is classified as not having LC.
During training, SVM aims to find the optimal w and b values by maximizing the margin between the classes while minimizing misclassifications. This optimization is performed by solving a convex optimization problem. Additionally, SVM can handle nonlinear decision boundaries by using kernel functions, which map the input features into a higher dimensional space where linear separation is possible [72].
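As an illustrative sketch of this decision rule in scikit-learn (the RBF kernel choice and the variable names `X_train`, `y_train`, and `X_val` are assumptions, not specified in the text):

```python
from sklearn.svm import SVC

# Kernel SVM: the kernel implicitly maps features into a higher-dimensional space.
svm = SVC(kernel="rbf", probability=True)
svm.fit(X_train, y_train)                 # X_train: clinical features, y_train: 0/1 labels

scores = svm.decision_function(X_val)     # signed distance, i.e., w.x + b in kernel space
labels = (scores > 0).astype(int)         # > 0 -> LC, < 0 -> no LC
```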
3.3.4. XGBoost
3.3.5. RF
The RF classifier serves as a potent tool for detecting and predicting the presence of LC. It aggregates the predictions of each decision tree in the forest and selects the class label with the highest number of votes as the final prediction for the presence or absence of LC in a given patient. The RF classifier demonstrates notable advantages in LC analysis, including its ability to manage noisy and imbalanced data, represent data without reduction, analyze large datasets efficiently, achieve high accuracy, and effectively address overfitting; these attributes establish it as a reliable and resilient algorithm for LC detection and prediction [54]. For the mathematically inclined, the RF classifier aggregates multiple decision trees, each of which independently predicts the likelihood of a patient having LC based on a set of features or risk factors.
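A brief sketch of this voting behavior (note that scikit-learn averages per-tree class probabilities, which for fully grown trees coincides with the vote fraction; the parameter settings here are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)  # assumed settings
rf.fit(X_train, y_train)

votes = rf.predict_proba(X_val)[:, 1]   # averaged per-tree probability of "LC"
labels = rf.predict(X_val)              # class with the most support wins
```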
3.3.6. CatBoost
CatBoost minimizes the binary logistic (log) loss, L(y, p(x)) = −(1/N) Σᵢ [yᵢ log p(xᵢ) + (1 − yᵢ) log(1 − p(xᵢ))]. In this equation, L(y, p(x)) represents the loss function, where y is the true label (one for the presence of LC, zero for its absence) and p(x) is the model’s predicted probability that a patient has LC based on their features x. N is the total number of patients in the dataset, and log denotes the natural logarithm. The loss function penalizes the model more heavily when its predicted probability diverges from the actual label, encouraging the model to make more accurate predictions [77]. This equation captures the essence of how CatBoost optimizes the model parameters to achieve accurate predictions for binary classification tasks such as LC detection.
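As a minimal sketch, assuming the same train/validation arrays as before, training CatBoost under this loss might look like:

```python
from catboost import CatBoostClassifier

# CatBoost minimizing the binary logistic loss L(y, p(x)) defined above;
# depth and learning rate are left at library defaults, as in the text.
cat = CatBoostClassifier(loss_function="Logloss", verbose=0)
cat.fit(X_train, y_train)
p = cat.predict_proba(X_val)[:, 1]   # p(x): predicted probability of LC
```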
3.4. CNN Architecture
A CNN [78] is a computational model that integrates convolution, pooling, and fully connected layers. The convolution and max-pooling layers extract features crucial for analysis, while the fully connected layers map these features to the final output. Our investigation implemented a 1D CNN architecture for experimental analysis, consisting of a sequence of three Conv1D layers interleaved with two max-pooling layers for downsampling. This CNN architecture is customized to effectively manage one-dimensional data, rendering it suitable for a variety of classification tasks. Table 1 summarizes the CNN architecture [78] used in our work.
Input | Layer | Feature maps | Output size | Kernel size | Activation |
---|---|---|---|---|---|
(15, 1) | Convolutional | 30 | (13, 30) | (3, 1) | ReLU |
(13, 30) | Max pooling | 30 | (6, 30) | (2, 1) | — |
(6, 30) | Convolutional | 60 | (6, 60) | (3, 1) | ReLU |
(6, 60) | Max pooling | 60 | (3, 60) | (2, 1) | — |
(3, 60) | Convolutional | 90 | (3, 90) | (3, 1) | ReLU |
(3, 90) | Flatten | 1 | 270 | — | — |
270 | Dropout (0.5) | — | — | — | — |
270 | Dense | 128 | — | — | ReLU |
128 | Dense | 50 | — | — | ReLU |
50 | Dense | 1 | — | — | Sigmoid |
Our work introduced a CNN model with an input shape of (15, 1), representing 15 time steps and a single feature (channel). It employs three convolutional layers with 30, 60, and 90 filters, respectively. Each filter operates with a kernel size of (3, 1), extracting features from the input data. To introduce nonlinearity into the network, ReLU activation functions were employed throughout all layers. Following the convolutional layers, two max-pooling layers of size (2, 1) were applied, forming new feature maps with reduced spatial dimensions by retaining only the highest value from each region. A flatten layer then maps the multidimensional feature maps to a one-dimensional vector to fit the subsequent fully connected layers. To avoid overfitting during training, a dropout layer with a 0.5 dropout rate was added; this technique randomly deactivates half of the neurons in each training step, which encourages better generalization. The network then moves to two dense layers with 128 and 50 neurons, both followed by ReLU activation to enhance learning capacity. Finally, the output layer is a single neuron with a sigmoid activation function whose output represents a probability score for the binary classification of whether the patient has LC. The CNN model was trained with binary cross-entropy loss, well suited to two-class problems, using the Adam optimizer, and model performance was assessed with accuracy as the primary metric. The CNN architecture was designed to handle 1D sequential data, such as time series and structured clinical records, making it suitable for healthcare prediction tasks.
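As a minimal sketch, the architecture in Table 1 translates to Keras as follows; the "valid"/"same" padding modes are inferred from the reported feature-map sizes and are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(15, 1)):
    # Layer sizes follow Table 1; padding is inferred from the reported
    # output shapes (15 -> 13 with "valid", then 6 -> 6 and 3 -> 3 with "same").
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(30, kernel_size=3, padding="valid", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(60, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(90, kernel_size=3, padding="same", activation="relu"),
        layers.Flatten(),                       # 3 x 90 -> 270
        layers.Dropout(0.5),
        layers.Dense(128, activation="relu"),
        layers.Dense(50, activation="relu"),    # penultimate layer: deep features
        layers.Dense(1, activation="sigmoid"),  # probability of LC
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```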
3.4.1. Parameters Used in the Proposed Work
In the development of the proposed ConvXGB and ConvCatBoost models, careful selection of parameters was crucial to optimize performance while maintaining computational efficiency. A lightweight 1D CNN architecture was designed with three convolutional layers, having 30, 60, and 90 filters, respectively. A small kernel size of (3, 1) was chosen to capture fine-grained feature patterns across the clinical attributes while minimizing overfitting. ReLU activation functions were applied to introduce nonlinearity, enhancing the model’s ability to learn complex relationships. Pooling layers with a size of (2, 1) were incorporated after the convolution layers to progressively reduce the feature map dimensions and computational load without significant loss of important information. A dropout layer with a dropout rate of 0.5 was utilized to prevent overfitting by randomly deactivating neurons during training. Fully connected dense layers of size 128 and 50 were included before the output layer to perform feature consolidation and high-level reasoning. Training of the CNN was performed over 150 epochs with a batch size of 160, based on empirical testing to balance learning efficiency and convergence speed. The Adam optimizer was selected for its adaptive learning rate properties, allowing stable and faster convergence [79].
Additionally, a binary cross-entropy loss function was employed, appropriate for the binary classification nature of LC diagnosis (LC-Yes or LC-No). For the XGBoost and CatBoost classifiers, default learning rates and depth parameters were initially adopted, followed by fine-tuning based on validation performance to maximize predictive accuracy without overfitting. The choice of SMOTE for class balancing was made after observing severe imbalance in the original dataset, which could otherwise bias the model toward the majority class [80]. Finally, fivefold cross-validation was implemented to ensure a reliable and unbiased estimate of model generalization capability. The parameter choices were guided by previous best practices in similar research and by extensive experimental validation, ensuring the final models achieved optimal performance while maintaining generalizability and computational efficiency [81].
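Under these settings, a training sketch might look like the following (it reuses the hypothetical `build_cnn` helper from the previous snippet; `X`, `y`, and the random seeds are assumptions):

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold

# Balance classes first, then evaluate with fivefold cross-validation.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, val_idx in skf.split(X_bal, y_bal):
    cnn = build_cnn()
    cnn.fit(X_bal[train_idx][..., None], y_bal[train_idx],   # add channel dim -> (n, 15, 1)
            validation_data=(X_bal[val_idx][..., None], y_bal[val_idx]),
            epochs=150, batch_size=160, verbose=0)
```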
3.5. ConvCatBoost
ConvCatBoost is a novel approach that combines CNN with the CatBoost algorithm to enhance classification performance, particularly in LC detection. This fusion leverages CNN’s ability to extract intricate features from input data, while CatBoost refines predictions through gradient-boosted decision trees [82]. By integrating these two methodologies, ConvCatBoost is well-equipped to manage numerical and categorical data effectively. The resulting model exhibits improved accuracy and robustness, making it a valuable tool for early LC detection.
3.6. ConvXGB
An ensemble-learning-based algorithm, ConvXGB, is proposed to effectively address classification challenges in LC detection. ConvXGB combines the functionalities of CNN [83] and XGBoost [73], thereby leveraging the strengths inherent in both methodologies. It encompasses two fundamental components, aiming to synergistically merge the capabilities of CNN [83] and XGBoost [84]. Employing the CNN for feature extraction effectively captures important spatial patterns essential for identifying LC. Pretrained CNN models are employed to extract features, which are fed into an ensemble component for classification. After feature extraction, ConvXGB leverages the XGBoost component to analyze the data, iteratively refining predictions. By combining these refined predictions, ConvXGB provides accurate diagnoses of LC, reflecting the probability of its presence based on both the input data and the extracted features. The detailed steps of this process are outlined in Algorithm 1, which presents the pseudocode for the ConvXGB model.
Algorithm 1: Pseudocode of the ConvXGB.

Input: Experimental dataset D
Parameters: Number of epochs N, feature vector dimension k, learning rate, optimizer settings

1. Initialize CNN model C
2. Initialize XGBoost classifier XGB
3. Apply SMOTE to dataset D to balance the class distribution
4. Split D into a training set D_train (90%) and a validation set D_val (10%)
5. For epoch = 1 to N do
6.   Train CNN model C
       Input: X_train
       Output: deep feature vectors z ∈ ℝ^k from the penultimate dense layer
       Loss: binary cross-entropy; Optimizer: Adam
7.   Validate CNN model C on D_val
8. End for
9. Extract deep features:
10.  Z_train = C(X_train), Z_val = C(X_val)
11. Train XGBoost model XGB
12.  Input: Z_train; Labels: Y_train
13.  Objective: Obj = Σᵢ L(yᵢ, ŷᵢ) + Σⱼ Ω(fⱼ), where L is the binary logistic loss and Ω is the regularization term
14. Predict on validation features Z_val; Output: ŷ_val
15. Evaluate the model on D_val:
       Accuracy = (TP + TN)/(TP + TN + FP + FN)
       Precision = TP/(TP + FP)
       Recall = TP/(TP + FN)
       F1-score = 2 × (Precision × Recall)/(Precision + Recall)
       AUC = area under the ROC curve
16. Return predicted LC classes (LC-Yes/LC-No)

Output: Predicted lung cancer classes (LC-Yes/LC-No)
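In code, steps 9–14 of Algorithm 1 might look like the following sketch (assuming the trained `cnn` from Section 3.4 and the train/validation arrays; the helper names are illustrative):

```python
from tensorflow.keras import Model
from xgboost import XGBClassifier

# Stage 1: reuse the trained CNN as a deep feature extractor, taking the
# activations of the penultimate Dense(50) layer (the layer before the output).
feature_extractor = Model(inputs=cnn.input, outputs=cnn.layers[-2].output)
Z_train = feature_extractor.predict(X_train[..., None])   # shape (n_train, 50)
Z_val = feature_extractor.predict(X_val[..., None])

# Stage 2: train XGBoost on the deep features with the binary logistic objective.
xgb = XGBClassifier(objective="binary:logistic", eval_metric="logloss")
xgb.fit(Z_train, y_train)
y_hat = xgb.predict(Z_val)                                 # 1 = LC-Yes, 0 = LC-No
```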
3.6.1. Rationalization for Employing Ensemble Models
The CNN [85, 86] and the ensemble models ConvCatBoost and ConvXGB are applied together because they combine deep feature learning with robust tabular data classification. As DL models, CNNs are well suited to detecting nonlinear patterns within medical data derived from structured clinical records [87]. However, CNNs have disadvantages regarding interpretability, along with overfitting problems when working with small or imbalanced medical datasets. XGBoost and CatBoost demonstrate high effectiveness in structured data processing and deliver good generalization through iterative weak learner aggregation and enhanced handling of imbalanced datasets [88]. The ensemble method connects the XGBoost and CatBoost classifiers to CNN-based deep feature extraction to gain the hierarchical learning advantages of CNNs together with the strong predictive performance and regularization properties of ensemble boosting [89]. The proposed combination addresses the weaknesses of the separate models, producing better diagnostic accuracy, mitigating overfitting, and achieving superior performance on imbalanced datasets, which is fundamental for LC detection [90]. The empirical analysis shows that ConvXGB and ConvCatBoost perform better than traditional ML algorithms and standalone CNN models on all important performance indicators, supporting the successful implementation of the proposed ensemble approach [61].
3.6.2. Integration of CNN Features With CatBoost and XGBoost
The decision to develop ensemble models ConvXGB and ConvCatBoost was driven by the need to integrate the strengths of both DL and gradient boosting approaches for robust LC prediction. CNNs are well-suited for capturing hierarchical and nonlinear patterns within structured data, but they are prone to overfitting when trained on small or imbalanced clinical datasets and often lack interpretability. In contrast, XGBoost and CatBoost are gradient boosting algorithms that excel in processing tabular data, offer superior regularization capabilities, and are highly effective in handling class imbalance—a common issue in medical datasets. By first extracting deep features using a lightweight 1D CNN and then passing those high-level representations into XGBoost and CatBoost classifiers, the ensemble models harness both local feature learning and strong predictive generalization. This architecture was specifically selected after comparative evaluations demonstrated that these combinations consistently outperformed standalone CNN, XGBoost, CatBoost, and traditional ML models across all key performance metrics. Thus, the ensemble strategy not only enhances classification performance but also ensures robustness, scalability, and clinical interpretability—all critical factors for real-world LC diagnostic applications.
The method uses CNN features within a structured two-stage ensemble model that incorporates CatBoost along with XGBoost. A preprocessing step normalizes the clinical tabular features through min–max normalization and one-hot encoding of categorical data; this is applied to the dataset, which contains 309 samples and 15 features. Class imbalance is addressed through SMOTE, which produces a balanced dataset for training. In the second stage, a 1D CNN operates as a lightweight extractor of high-level features. The designed CNN uses three Conv1D layers with ReLU activation functions, two MaxPooling1D layers, and two dense layers with 128 and 50 neurons, respectively. The CNN is trained for 150 epochs with a batch size of 160 using the binary cross-entropy loss function and the Adam optimizer [91]. The trained deep feature vectors consist of 50 dimensions extracted from the second-to-last dense layer, producing a feature matrix with a shape of (309, 50) [92]. The extracted features provide input to independently train two boosting classifiers: CatBoost (yielding ConvCatBoost) and XGBoost (yielding ConvXGB). CatBoost adopts a logistic loss for its optimization process. The CNN is trained separately from both boosting models, which execute the final stage of binary cancer classification [93]. The analysis uses a fivefold cross-validation technique throughout CNN feature extraction and the training of the two boosting classifiers (CatBoost and XGBoost) to achieve a consistent, robust, and unbiased assessment. The ConvCatBoost and ConvXGB models generate two-class predictions, LC-Yes or LC-No, for which performance evaluation metrics are computed. This two-stage ensemble methodology lets the CNN extract complicated patterns from the data before the gradient boosting algorithms use them to improve clinical decisions, forming a practical framework for detecting LC.
- •
Data preprocessing: The raw clinical dataset (309 samples × 15 features) is normalized using min–max scaling, and categorical variables (e.g., gender and LC status) are encoded using one-hot encoding.
- •
Balancing classes: SMOTE is applied to handle class imbalance, resulting in a balanced dataset (550 samples).
- •
CNN-based feature extraction:
- •
A lightweight 1D CNN is trained on the structured tabular data. The architecture includes three convolutional layers followed by max-pooling layers, dropout, and dense layers.
- •
Deep features are extracted from the penultimate dense layer (128 or 50 dimensions depending on configuration), resulting in a feature matrix of shape (N, D), where N is the number of samples and D is the feature vector size (e.g., 50).
- •
Ensemble learning stage:
- ➢
The extracted features are fed into two separate classifiers:
- ▪
ConvXGB: uses XGBoost for final classification.
- ▪
ConvCatBoost: uses CatBoost for final classification.
- ➢
Both classifiers are trained on the CNN-generated feature vectors using binary logistic loss.
- •
Model evaluation:
- •
Fivefold cross-validation is applied.
- •
Metrics such as accuracy, precision, recall, F1-score, and AUC are computed.
- •
This modular pipeline ensures that the CNN handles deep pattern extraction, while the boosting classifiers focus on robust, interpretable decision boundaries. A visual overview is provided in Figures 2 and 3; Figure 3 shows data preprocessing, class balancing using SMOTE, CNN-based feature extraction, and classification using the gradient boosting models (XGBoost and CatBoost). A compact code sketch of this two-stage pipeline is given below.
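A compact sketch under the same assumptions as the earlier snippets (the hypothetical `feature_extractor` and the balanced arrays `X_bal`, `y_bal`); the fivefold evaluation loop is omitted for brevity:

```python
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

Z = feature_extractor.predict(X_bal[..., None])    # (N, 50) deep feature matrix
convxgb = XGBClassifier(objective="binary:logistic").fit(Z, y_bal)
convcat = CatBoostClassifier(loss_function="Logloss", verbose=0).fit(Z, y_bal)
pred_xgb = convxgb.predict(Z)                      # LC-Yes / LC-No
pred_cat = convcat.predict(Z)
```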


The ensemble models ConvXGB and ConvCatBoost were chosen to exploit the complementary strengths of CNNs and gradient boosting techniques (XGBoost and CatBoost). CNNs excel at hierarchical feature extraction, especially in handling nonlinear patterns from structured or spatial data. However, they may struggle with small or imbalanced datasets and lack interpretability. On the other hand, XGBoost and CatBoost are highly efficient with structured clinical/tabular data, offer strong regularization, and are particularly effective in managing class imbalance, which is common in medical datasets. By combining CNN for deep feature extraction with CatBoost and XGBoost for robust classification, we aim to create an ensemble architecture that mitigates individual weaknesses (like CNN overfitting or limited tabular performance) and boosts classification accuracy, precision, and recall, especially in early-stage LC detection. These specific ensemble models were selected after a comparative evaluation of multiple classifiers, showing consistent improvement across performance metrics over both standalone CNN and traditional ML models.
3.7. Performance Evaluations Method
Model performance was assessed using the confusion matrix, which comprises TP, true negative (TN), FP, and FN counts.
In our study, model performance is assessed using standard classification metrics including accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC). The confusion matrices presented in figures are based on results from individual validation folds or single 90:10 train-validation splits and reflect model performance at a fixed decision threshold of 0.5 [96]. In contrast, the AUC values reported in summary tables are calculated from ROC curves that consider all possible classification thresholds, providing a threshold-independent evaluation of model performance. Furthermore, AUC results are averaged over fivefold cross-validation to ensure robustness and generalizability [97]. This distinction explains the differences observed between confusion matrix–based estimations and AUC scores in the results.
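For concreteness, a sketch of these two evaluation views, fixed-threshold confusion-matrix metrics versus threshold-free AUC, with `y_val` the true labels and `p` the predicted probabilities (both hypothetical names):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_hat = (p >= 0.5).astype(int)                  # fixed 0.5 decision threshold
tn, fp, fn, tp = confusion_matrix(y_val, y_hat).ravel()
acc = accuracy_score(y_val, y_hat)
prec = precision_score(y_val, y_hat)
rec = recall_score(y_val, y_hat)
f1 = f1_score(y_val, y_hat)
auc = roc_auc_score(y_val, p)                   # threshold-independent, uses scores
```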
4. Results and Discussion
Our results outline robust preprocessing, diverse feature analysis, and rigorous evaluation of ML models for LC detection. Overall, XGBoost excelled in accuracy and AUC, with SMOTE enhancing model performance. Confusion matrices and ROC curves illustrated model efficacy before and after dataset balancing. The CNN and ensemble models used in the detection of LC demonstrated promise, underscoring the necessity of mitigating class imbalance in medical datasets to ensure accurate predictions and diagnosis, thus advancing patient care and disease management in healthcare analytics.
4.1. Data Preprocessing and Dataset Preparation
The initial steps in detecting LC involve preprocessing the dataset to address outliers and remove unnecessary information, ensuring data quality and reliability. Dealing with missing values is crucial, and various techniques are employed to address this issue and enhance dataset quality. Training aims to optimize model parameters for accurate LC detection, while testing validates the model’s ability to generalize to new data instances.
Prior to the analysis, the dataset was preprocessed to ensure proper scaling and encoding of the variables. Continuous features, such as age, were scaled using min–max normalization to transform the values to a common range between 0 and 1. Categorical variables, such as gender and LC status, were then encoded using one-hot encoding, which transforms each categorical variable into a set of binary values, where each column represents a unique category. The main steps are summarized below, followed by a minimal code sketch.
- •
Normalization: Continuous variables were rescaled using min–max normalization to a [0,1] range.
- •
Encoding: Categorical variables were converted into numerical format through one-hot encoding.
- •
Class balancing: SMOTE is applied to synthetically augment the minority class samples, addressing the unbalance between LC-positive and LC-negative cases.
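A minimal sketch of these three steps, assuming the Kaggle survey file and its column names (the file name and the GENDER/LUNG_CANCER identifiers are assumptions based on the public dataset):

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("survey lung cancer.csv")                    # assumed Kaggle file name
df = pd.get_dummies(df, columns=["GENDER"], drop_first=True)  # one-hot encoding
y = (df.pop("LUNG_CANCER") == "YES").astype(int)              # target: LC yes/no -> 1/0
X = MinMaxScaler().fit_transform(df)                          # rescale to [0, 1]
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)      # balance the LC classes
```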
A detailed overview of the dataset partitioning, both before and after applying SMOTE, is summarized in Table 2. The dataset was split into training and validation sets using a 90:10 ratio.
Dataset version | Training samples | Validation samples | Total samples |
---|---|---|---|
Original dataset | 278 | 31 | 309 |
SMOTE-balanced dataset | 500 | 50 | 550 |
4.2. Feature Extraction
Feature analysis in the study utilized an extensive array of variables to encompass the complex aspects of LC. These variables encompassed factors like smoking history, cough presence, alcohol consumption, and yellow finger appearance. The correlation matrix and heat map visualizations were utilized to identify trends and associations among the distinct characteristics in the dataset. This thorough feature engineering process aimed to capture the most relevant and informative predictors of LC.
The representation of features in each class is shown in Table 3. Regarding gender, both males and females demonstrate comparable probabilities of receiving an LC diagnosis. Our examination indicates that each scrutinized feature is detectable in LC patients to varying degrees, spanning from 25% to 36%. Moreover, a noteworthy portion of individuals displays these symptoms even in the absence of an LC diagnosis. Regular clinical assessments and monitoring of risk factors may assist in averting or alleviating the detrimental consequences of LC, notwithstanding the absence of the disease.
Features | LC | | Features | LC | |
---|---|---|---|---|---|
Gender | No | Yes | Allergy | No | Yes
Female | 7.485 | 42.515 | No | 12.410 | 37.590 |
Male | 5.245 | 44.755 | Yes | 1.455 | 48.545 |
Smoking | No | Yes | Wheezing | No | Yes |
No | 7.405 | 42.595 | No | 10.950 | 39.050 |
Yes | 5.460 | 44.540 | Yes | 2.615 | 47.385 |
Yellow fingers | No | Yes | Alcohol consuming | No | Yes |
No | 9.775 | 40.225 | No | 11.680 | 38.320 |
Yes | 3.695 | 46.305 | Yes | 2.035 | 47.965 |
Anxiety | No | Yes | Coughing | No | Yes |
No | 8.710 | 41.290 | No | 11.155 | 38.845 |
Yes | 3.895 | 46.105 | Yes | 2.795 | 47.205 |
Peer pressure | No | Yes | Shortness of breath | No | Yes |
No | 9.415 | 40.585 | No | 7.660 | 42.340 |
Yes | 3.225 | 46.775 | Yes | 5.555 | 44.445 |
Chronic disease | No | Yes | Swallowing difficulty | No | Yes |
No | 8.170 | 41.830 | No | 10.365 | 39.635 |
Yes | 4.485 | 45.515 | Yes | 1.725 | 48.275 |
Fatigue | No | Yes | Chest pain | No | Yes |
No | 9.900 | 40.100 | No | 9.855 | 40.145 |
Yes | 4.565 | 45.435 | Yes | 3.490 | 46.510 |
Figure 4 illustrates the distribution of participants across different age groups, revealing a higher prevalence of LC among individuals aged 50–79. Among these age groups, the 60–64 age bracket exhibits the highest frequency of LC cases. The relationships between features based on the heat map mentioned in Figure 5 were designed.


4.3. Performance Evaluation of LC Prediction Based on Unbalanced Data Using ML
To predict LC, ML algorithms such as LR, KNN, SVM, XGBoost, RF, and CatBoost were used with fivefold cross-validation (folds 1–5; Table 4). Additionally, SMOTE was employed to balance the data. On the unbalanced dataset, the RF model showed the highest mean performance in terms of accuracy, recall, F1-score, and precision, at 0.7201, 0.9748, 0.9484, and 0.9198, respectively, with a mean AUC of 0.9389. Additionally, the KNN model showed the lowest mean performance in terms of accuracy, AUC, F1-score, and precision, at 0.6648, 0.8532, 0.9347, and 0.9074, respectively, with a mean recall of 0.9680.
Model | Fold | Accuracy | Recall | AUC | F1-score | Precision |
---|---|---|---|---|---|---|
LR | 1 | 0.7416 | 0.9833 | 0.9416 | 0.9833 | 0.9833 |
2 | 0.7683 | 0.9811 | 0.9601 | 0.9541 | 0.9285 | |
3 | 0.6870 | 0.9454 | 0.9194 | 0.9369 | 0.9285 | |
4 | 0.6000 | 1.0000 | 0.9815 | 0.8867 | 0.7966 | |
5 | 0.7227 | 0.9454 | 0.9181 | 0.9454 | 0.9454 | |
Mean | 0.7039 | 0.9636 | 0.9442 | 0.9413 | 0.9165 | |
KNN | 1 | 0.725 | 0.950 | 0.7000 | 0.9661 | 0.9827 |
2 | 0.7683 | 0.9811 | 0.9098 | 0.9541 | 0.9285 | |
3 | 0.6246 | 0.9636 | 0.8467 | 0.9380 | 0.9137 | |
4 | 0.5666 | 1.0000 | 0.9730 | 0.8785 | 0.7833 | |
5 | 0.6393 | 0.9454 | 0.8363 | 0.9369 | 0.9285 | |
Mean | 0.6648 | 0.9680 | 0.8532 | 0.9347 | 0.9074 | |
SVM | 1 | 0.7333 | 0.9666 | 0.9750 | 0.9747 | 0.9830 |
2 | 0.7127 | 0.9811 | 0.9329 | 0.9454 | 0.9122 | |
3 | 0.7675 | 0.9636 | 0.8987 | 0.9549 | 0.9464 | |
4 | 0.5666 | 1.000 | 0.9673 | 0.8785 | 0.7833 | |
5 | 0.8060 | 0.9454 | 0.8969 | 0.9541 | 0.9629 | |
Mean | 0.7172 | 0.9713 | 0.9341 | 0.9415 | 0.9176 | |
XGBoost | 1 | 0.7500 | 1.0000 | 0.9500 | 0.9915 | 0.9836 |
2 | 0.5922 | 0.9622 | 0.8322 | 0.9189 | 0.8793 | |
3 | 0.7493 | 0.9272 | 0.8207 | 0.9357 | 0.9444 | |
4 | 0.6333 | 1.0000 | 0.9801 | 0.8952 | 0.8103 | |
5 | 0.7878 | 0.9090 | 0.9424 | 0.9345 | 0.9615 | |
Mean | 0.7025 | 0.9597 | 0.9051 | 0.9352 | 0.9158 | |
RF | 1 | 0.7416 | 0.9833 | 0.9333 | 0.9833 | 0.9833 |
2 | 0.7777 | 1.0000 | 0.9643 | 0.9724 | 0.9285 | |
3 | 0.7675 | 0.9636 | 0.8792 | 0.9549 | 0.9464 | |
4 | 0.6000 | 1.0000 | 0.9574 | 0.8867 | 0.7966 | |
5 | 0.7136 | 0.9272 | 0.9606 | 0.9444 | 0.9444 | |
Mean | 0.7201 | 0.9748 | 0.9389 | 0.9484 | 0.9198 | |
CatBoost | 1 | 0.7416 | 0.9833 | 0.9000 | 0.9833 | 0.9833 |
2 | 0.6667 | 1.0000 | 0.9433 | 0.9464 | 0.8983 | |
3 | 0.7675 | 0.9636 | 0.8831 | 0.9549 | 0.9464 | |
4 | 0.6000 | 1.0000 | 0.9730 | 0.8867 | 0.7966 | |
5 | 0.7136 | 0.9272 | 0.9424 | 0.9357 | 0.9444 | |
Mean | 0.6979 | 0.9748 | 0.9283 | 0.9414 | 0.9138 |
- Note: Bold data means maximum value.
4.4. Performance Evaluation of LC Prediction Based on Balanced Data Using ML
The LR, KNN, SVM, XGBoost, RF, and CatBoost models with fivefold cross-validation were used to predict LC based on the balanced data (Table 5). The low precision (0.100) for the CNN model may be due to a high number of FPs in the balanced dataset, potentially caused by insufficient feature discrimination in the standalone CNN architecture. To mitigate the overfitting observed in the CNN model, a dropout rate of 0.5 was applied, and early stopping techniques were explored, which showed potential for improving generalization on the balanced dataset. The CatBoost model achieved the highest mean accuracy on the balanced dataset (0.9550), with mean recall, AUC, F1-score, and precision of 0.9398, 0.9895, 0.9541, and 0.9708, respectively. Additionally, the KNN model showed the lowest mean performance in terms of accuracy, recall, AUC, and F1-score, at 0.9143, 0.8286, 0.9635, and 0.9056, respectively, while the lowest precision, 0.9346, belonged to the LR model.
Model | Fold | Accuracy | Recall | AUC | F1-score | Precision |
---|---|---|---|---|---|---|
LR | 1 | 0.8993 | 0.8776 | 0.9810 | 0.9009 | 0.9259
LR | 2 | 0.9168 | 0.9107 | 0.9890 | 0.9189 | 0.9272
LR | 3 | 0.8880 | 0.8653 | 0.9361 | 0.8832 | 0.9000
LR | 4 | 0.9440 | 0.9245 | 0.9735 | 0.9423 | 0.9607
LR | 5 | 0.9340 | 0.9038 | 0.9478 | 0.9306 | 0.9591
LR | Mean | 0.9164 | 0.8963 | 0.9863 | 0.9150 | 0.9346
KNN | 1 | 0.9473 | 0.8947 | 0.9836 | 0.9444 | 0.9807
KNN | 2 | 0.9017 | 0.8035 | 0.9656 | 0.8910 | 1.0000
KNN | 3 | 0.8846 | 0.7692 | 0.9416 | 0.8695 | 1.0000
KNN | 4 | 0.9339 | 0.8679 | 0.9636 | 0.9292 | 1.0000
KNN | 5 | 0.9038 | 0.8076 | 0.9635 | 0.8936 | 0.9767
KNN | Mean | 0.9143 | 0.8286 | 0.9635 | 0.9056 | 0.9915
SVM | 1 | 0.9649 | 0.9298 | 0.9858 | 0.9636 | 1.0000
SVM | 2 | 0.9361 | 0.9107 | 0.9934 | 0.9357 | 0.9622
SVM | 3 | 0.9038 | 0.8076 | 0.9845 | 0.8936 | 1.0000
SVM | 4 | 0.9535 | 0.9433 | 0.9910 | 0.9523 | 0.9615
SVM | 5 | 0.9615 | 0.9230 | 0.9663 | 0.9600 | 1.0000
SVM | Mean | 0.9439 | 0.9029 | 0.9843 | 0.9410 | 0.9847
XGBoost | 1 | 0.9705 | 1.0000 | 0.9962 | 0.9743 | 0.9500
XGBoost | 2 | 0.9532 | 0.9642 | 0.9842 | 0.9557 | 0.9473
XGBoost | 3 | 0.9237 | 0.8653 | 0.9649 | 0.9183 | 0.9782
XGBoost | 4 | 0.9538 | 0.9694 | 0.9828 | 0.9532 | 0.9444
XGBoost | 5 | 0.9711 | 0.9828 | 0.9752 | 0.9702 | 1.0000
XGBoost | Mean | 0.9545 | 0.9468 | 0.9817 | 0.9544 | 0.9640
RF | 1 | 0.9520 | 0.9824 | 0.9951 | 0.9739 | 0.9655
RF | 2 | 0.9622 | 0.9821 | 0.9953 | 0.9565 | 0.9482
RF | 3 | 0.9423 | 0.8846 | 0.9908 | 0.9292 | 1.0000
RF | 4 | 0.9447 | 0.9622 | 0.9902 | 0.9444 | 0.9272
RF | 5 | 0.9711 | 0.9423 | 0.9934 | 0.9607 | 1.0000
RF | Mean | 0.9544 | 0.9507 | 0.9930 | 0.9529 | 0.9682
CatBoost | 1 | 0.9628 | 0.9649 | 0.9979 | 0.9649 | 0.9649
CatBoost | 2 | 0.9725 | 0.9642 | 0.9965 | 0.9729 | 0.9818
CatBoost | 3 | 0.9326 | 0.8653 | 0.9910 | 0.9278 | 1.0000
CatBoost | 4 | 0.9447 | 0.9622 | 0.9790 | 0.9444 | 0.9272
CatBoost | 5 | 0.9622 | 0.9423 | 0.9831 | 0.9602 | 0.9800
CatBoost | Mean | 0.9550 | 0.9398 | 0.9895 | 0.9541 | 0.9708
- Note: Bold values indicate the maximum in each column.
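For concreteness, the following minimal Python sketch illustrates the evaluation pipeline described above: stratified fivefold cross-validation with SMOTE applied within each training fold. The model settings, the lambda-based model factory, and the choice to oversample inside the folds (the standard leak-avoiding practice) are illustrative assumptions rather than the authors' exact configuration.

```python
# A minimal sketch, assuming NumPy-array inputs and default model settings,
# of fivefold cross-validation with in-fold SMOTE oversampling.
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

MODELS = {
    "LR": lambda: LogisticRegression(max_iter=1000),
    "KNN": lambda: KNeighborsClassifier(),
    "SVM": lambda: SVC(probability=True),
    "XGBoost": lambda: XGBClassifier(eval_metric="logloss"),
    "RF": lambda: RandomForestClassifier(random_state=42),
    "CatBoost": lambda: CatBoostClassifier(verbose=0),
}

def cross_validate(X, y, n_splits=5, seed=42):
    """Evaluate each model with stratified fivefold CV; X, y are NumPy arrays."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {name: [] for name in MODELS}
    for train_idx, test_idx in skf.split(X, y):
        # Oversample only the training fold so the held-out fold stays untouched.
        X_tr, y_tr = SMOTE(random_state=seed).fit_resample(X[train_idx], y[train_idx])
        X_te, y_te = X[test_idx], y[test_idx]
        for name, make_model in MODELS.items():
            model = make_model().fit(X_tr, y_tr)
            pred = model.predict(X_te)
            proba = model.predict_proba(X_te)[:, 1]
            scores[name].append({
                "accuracy": accuracy_score(y_te, pred),
                "recall": recall_score(y_te, pred),
                "auc": roc_auc_score(y_te, proba),
                "f1": f1_score(y_te, pred),
                "precision": precision_score(y_te, pred),
            })
    return scores
```

A fresh model is constructed for each fold so that no state leaks between folds; per-fold results can then be averaged to reproduce the "Mean" rows of Tables 4 and 5.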
4.5. Performance Evaluation of LC Prediction Based on Unbalanced Data Using CNN and Ensemble DL Models
In this section, the performance of the proposed CNN and ensemble approach is evaluated on the unbalanced dataset. To train the CNN, the unbalanced dataset is split 90:10 into training and validation sets. In this phase, the proposed CNN architecture facilitates simultaneous training and validation: with 150 epochs and a batch size of 160, the model iterates through the dataset, progressively refining the features it learns. The validation loss and accuracy curves of the CNN across epochs on the unbalanced dataset are shown in Figure 6. A minimal sketch of this training setup appears below.
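The following Keras-style sketch mirrors the training setup just described; the 90:10 split, 150 epochs, batch size of 160, and the 0.5 dropout rate follow the text, while the layer sizes and the use of one-dimensional convolutions over the clinical attributes are illustrative assumptions, not the authors' exact architecture.

```python
# A minimal sketch, assuming a 1D CNN over tabular clinical features.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(n_features):
    """Illustrative CNN for binary LC classification (architecture assumed)."""
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),                    # dropout rate reported in Section 4.4
        layers.Dense(1, activation="sigmoid"),  # binary LC-Yes / LC-No output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# cnn = build_cnn(X.shape[1])
# history = cnn.fit(
#     X[..., None], y,                # add a channel axis for Conv1D
#     validation_split=0.10,          # the 90:10 split described above
#     epochs=150, batch_size=160,
#     callbacks=[tf.keras.callbacks.EarlyStopping(patience=10,
#                                                 restore_best_weights=True)],
# )
```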

Overall, the proposed CNN model learned with low bias, although the curves suggest some overfitting; learning was slow until roughly epoch 25, after which the model improved markedly. In this work, an ensemble-based approach is proposed that combines the CNN with the XGBoost and CatBoost classifiers, selected on the basis of their demonstrated performance (Tables 4 and 6). Deep features extracted by the CNN are fed to the ensemble learning models, namely, ConvCatBoost and ConvXGB, for the prediction of LC, as sketched below.
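A hedged sketch of this pipeline follows: the trained CNN is truncated at its penultimate dense layer to act as a feature extractor, and the resulting deep features are passed to XGBoost and CatBoost heads. The layer indexing and classifier settings are assumptions tied to the illustrative architecture sketched above, not the authors' exact implementation.

```python
# A minimal sketch of the ConvXGB / ConvCatBoost idea under the assumed
# architecture: CNN features in, boosted-tree classification out.
import tensorflow as tf
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

def deep_features(cnn, X):
    """Truncate the CNN before its sigmoid output and return feature vectors."""
    # layers[-3] is the 64-unit dense layer in the illustrative architecture above.
    extractor = tf.keras.Model(inputs=cnn.input, outputs=cnn.layers[-3].output)
    return extractor.predict(X[..., None], verbose=0)

# feats_train = deep_features(cnn, X_train)
# feats_test = deep_features(cnn, X_test)
# conv_xgb = XGBClassifier(eval_metric="logloss").fit(feats_train, y_train)
# conv_cat = CatBoostClassifier(verbose=0).fit(feats_train, y_train)
# y_pred = conv_xgb.predict(feats_test)
```

Table 6 reports the performance obtained with this design on the unbalanced dataset.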
Metric | CNN | ConvCatBoost | ConvXGB |
---|---|---|---|
Accuracy | 0.9565 | 0.9736 | 0.9739 |
Precision | 0.9740 | 0.9870 | 0.9747 |
Recall | 0.9615 | 0.9744 | 0.9872 |
F1-score | 0.9677 | 0.9806 | 0.9803 |
The performance of the CNN models of this study on the unbalanced dataset is evaluated using confusion matrices and ROC curves, shown in Figures 7 and 8, respectively. The confusion matrix provides overall performance metrics for the CNN and the deep ensemble models ConvCatBoost and ConvXGB by comparing predicted and true labels for LC prediction. The CNN model's metrics, accuracy (0.9565), precision (0.9740), recall (0.9615), and F1-score (0.9677), highlight its effectiveness in binary classification; the short sketch below shows how such metrics follow from the confusion matrix.
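As a reminder of how the reported figures are derived, the following sketch computes the four metrics from predicted and true binary labels; the helper name is ours, not the authors'.

```python
# A small sketch deriving accuracy, precision, recall, and F1 from the
# binary confusion matrix, as visualized in Figure 7.
from sklearn.metrics import confusion_matrix

def metrics_from_labels(y_true, y_pred):
    """Derive the four reported metrics from the binary confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                        # TPR / sensitivity
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```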
Overall, Figure 7 indicates that ConvXGB outperformed CNN and ConvCatBoost, particularly in its FP and TN counts for predicting LC. The ROC curves in Figure 8 for the CNN and the ensemble models ConvCatBoost and ConvXGB on the unbalanced dataset illustrate binary classification performance across threshold settings: each curve plots the TPR against the FPR, offering insight into the trade-off between sensitivity and specificity. The ensemble models outperformed the baseline CNN model in terms of the ROC curve; a brief plotting sketch follows.
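The sketch below shows one way such ROC curves can be produced from each model's predicted probabilities; the variable names are illustrative.

```python
# A brief sketch of the ROC comparison in Figure 8; scores are predicted
# positive-class probabilities from each model.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc(y_true, scores_by_model):
    """Overlay ROC curves for several models on one axis."""
    for name, scores in scores_by_model.items():
        fpr, tpr, _ = roc_curve(y_true, scores)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
    plt.plot([0, 1], [0, 1], "k--", label="Chance")   # random-guess baseline
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend(loc="lower right")
    plt.show()

# plot_roc(y_test, {"CNN": cnn_proba, "ConvCatBoost": cat_proba,
#                   "ConvXGB": xgb_proba})
```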
The performance of the CNN and the ensemble-based models ConvXGB and ConvCatBoost was evaluated for LC prediction on the unbalanced dataset, with accuracy, precision, recall, and F1-score derived from the confusion matrix for all models. The results (Table 6) show accuracies of 0.9565 for CNN, 0.9736 for ConvCatBoost, and 0.9739 for ConvXGB; precisions of 0.9740, 0.9870, and 0.9747; recalls of 0.9615, 0.9744, and 0.9872; and F1-scores of 0.9677, 0.9806, and 0.9803, respectively. Both ensemble models exhibit superior accuracy to the CNN, implying better classification despite the data imbalance. ConvCatBoost achieved the highest precision, with ConvXGB and CNN close behind, while ConvXGB achieved the highest recall. ConvCatBoost tops the F1-score, followed closely by ConvXGB and then CNN, indicating robust performance in LC detection despite the challenges of the unbalanced dataset.
4.6. Performance Evaluation of LC Prediction Based on Balanced Data Using CNN and Ensemble DL Models
In this section, LC prediction using the CNN and the ensemble approach was evaluated on the balanced dataset. The CNN training phase comprised 150 epochs with a batch size of 160, averaging 39 ms/step. Following training, the CNN outperformed the conventional ML techniques, achieving a remarkable accuracy of 95.13%. The learning process of the CNN model on the balanced dataset is illustrated in Figure 9.

ConvCatBoost and ConvXGB, both deep ensemble models integrating a CNN with boosting algorithms, demonstrated remarkable performance in binary classification for LC prediction. ConvCatBoost achieved a precision of 0.9870 and recall of 0.9744, with an accuracy of 0.9736 and a notable F1-score of 0.9806, indicating accurate categorization. ConvXGB achieved a recall of 0.9872 and an equally high precision of 0.9872, yielding an accuracy of 0.9826 and an F1-score of 0.9872. Overall, ConvXGB stood out with exceptional precision, recall, and F1-score, highlighting its efficacy in precise classification tasks. The confusion matrices for the balanced data with the CNN, ConvCatBoost, and ConvXGB models, shown in Figure 10, describe the classification results for LC prediction with predicted and true labels. Figure 11 presents the ROC curves for the CNN (with sigmoid output), ConvCatBoost, and ConvXGB, each with a reported AUC of 1.00, indicating perfect discrimination between positive and negative instances on this validation split.
For the balanced data, Table 7 presents the performance measures, accuracy, precision, recall, and F1-score, for the DL architectures of this study: the CNN with sigmoid activation, ConvCatBoost, and ConvXGB. ConvXGB attains the highest accuracy (0.9826) and precision (0.9872), matching its recall (0.9872) and F1-score (0.9872) and demonstrating robust overall performance. In comparison, ConvCatBoost performs slightly below ConvXGB, with accuracy, precision, recall, and F1-score of 0.9736, 0.9870, 0.9744, and 0.9806, respectively. The CNN model employing sigmoid activation achieves a recall of 0.9024, implying accurate positive forecasts, albeit with lower precision (0.100) and F1-score (0.9487) relative to ConvXGB. In summary, ConvXGB emerges as the most effective model across all criteria, displaying superior accuracy, precision, recall, and F1-score.
Metric | CNN | ConvCatBoost | ConvXGB |
---|---|---|---|
Accuracy | 0.9652 | 0.9736 | 0.9826 |
Precision | 0.100 | 0.9870 | 0.9872 |
Recall | 0.9024 | 0.9744 | 0.9872 |
F1-score | 0.9487 | 0.9806 | 0.9872 |
4.7. Comparison Results Among Various LC Prediction Models
This research presents a breakdown of the models' performance metrics, including accuracy, precision, recall, and F1-score (Table 8). An extensive assessment was conducted for a range of classifiers, encompassing conventional approaches (LR, KNN, SVM, XGBoost, RF, and CatBoost), the DL-based CNN model, and ensemble methods (ConvXGB and ConvCatBoost). Overall, ConvXGB emerged as the most effective performer across the evaluation metrics, with outstanding accuracy (98.26%), recall (98.72%), and F1-score (98.72%); for precision, in contrast, KNN outperformed all models (99.15%). This highlights the potency of combining the representational capacity of a CNN with the boosting capabilities of XGBoost for robust binary classification. The CNN and ConvCatBoost also produced noteworthy outcomes, with ConvCatBoost in particular excelling in recall and F1-score and thereby precisely recognizing positive instances. The ensemble methods ConvXGB and ConvCatBoost attained elevated accuracy while maintaining balanced precision–recall trade-offs, and the traditional methods delivered respectable results, albeit with slight discrepancies in certain metrics. These findings offer practical guidance to researchers and practitioners in selecting classifiers suited to particular demands and circumstances, thereby supporting innovation and progress in ML and AI research.
Metric | LR | KNN | SVM | XGBoost | RF | CatBoost | CNN | ConvCatBoost | ConvXGB |
---|---|---|---|---|---|---|---|---|---|
Accuracy | 0.9146 | 0.9143 | 0.9439 | 0.9545 | 0.9544 | 0.9550 | 0.9652 | 0.9736 | 0.9826 |
Precision | 0.9346 | 0.9915 | 0.9847 | 0.9640 | 0.9682 | 0.9708 | 0.100 | 0.9870 | 0.9872 |
Recall | 0.8963 | 0.8286 | 0.9029 | 0.9468 | 0.9507 | 0.9398 | 0.9024 | 0.9744 | 0.9872 |
F1-score | 0.9150 | 0.9065 | 0.9410 | 0.9544 | 0.9529 | 0.9541 | 0.9487 | 0.9806 | 0.9872 |
- Note: Bold values indicate the maximum in each column.
4.8. Comparison of the Proposed and Existing Methods
Comparing the proposed ensemble DL models of this study with the baseline methodologies analyzed here (Table 9) demonstrates a clear contrast with previously documented research: the proposed ensemble DL model shows a significant improvement in accuracy and efficiency over those earlier methodologies. Past research utilizing models such as ANN, WDELM, CNN, and RotF has reported commendable accuracy levels; however, the proposed approach stands out due to its ensemble design, which integrates the strengths of CNN and XGBoost. This fusion enables the model to achieve impressive accuracy, outperforming previously documented models.
This high level of performance emphasizes the effectiveness of the framework of this study in accurately classifying cases into LC-Yes and LC-No categories. By combining the feature-learning capability of CNNs with the powerful classification strength of XGBoost, the ensemble model captures both spatial and structured data patterns with high precision. Moreover, the inclusion of SMOTE addresses class imbalance, enhancing the model's ability to generalize across diverse clinical scenarios.
LC detection remains a critical area in medical AI due to its high mortality rate and the need for early, accurate diagnosis. Over the years, researchers have explored various ML and DL techniques to improve prediction outcomes. ANN models, for example, have been widely used due to their ability to handle complex nonlinear data. One such model achieved 90.91% accuracy on a multimodal clinical dataset [99], while another study reported 93% using public medical data [19]. Despite these results, ANN performance is often highly dependent on data quality and can struggle with large, high-dimensional datasets.
Ensemble models such as WDELM have shown moderate success by combining multiple learners, achieving 92% accuracy on LC datasets [100]. While ensemble methods offer robustness, they sometimes rely on hand-crafted features or shallow models, which may not fully capture intricate patterns in medical imaging. DL architectures like CNNs have addressed this limitation by automatically learning hierarchical representations from image data, with one CNN-based model reaching 95% accuracy [101]. Additionally, RotF, an ensemble technique, achieved 97.1% accuracy on demographic data [4], highlighting the strength of model diversification.
Building on these advances, the proposed ConvXGB model of this study introduces an ensemble architecture: first, deep features are extracted using a pretrained CNN, which captures spatial dependencies in the input; then, XGBoost is applied for high-precision classification. This combination leverages the advantages of both DL and gradient-boosted decision trees, allowing the model to refine predictions and improve classification performance iteratively.
The experimental results of this study confirm the superiority of ConvXGB, which not only achieves the highest accuracy (98.26%) among the compared models but also excels across other key evaluation metrics such as precision, recall, F1-score, and AUC. These findings underscore the model's robustness, reliability, and potential for real-world clinical implementation. By harnessing the complementary strengths of CNN and XGBoost, ConvXGB offers a scalable and effective solution for LC prediction, one that holds promise for enhancing diagnostic accuracy and supporting early intervention in clinical settings.
4.9. Applications of the Study
This study endeavors to significantly improve the precision and efficacy of LC diagnosis by applying advanced ML methods. It assesses the effectiveness of ensemble learning, exemplified by ConvXGB, for detection, focusing specifically on the complexities of tumor composition. By harnessing the power of ensemble learning and ConvXGB, the study aims to overcome the constraints of traditional diagnostic approaches: diagnostic precision is enhanced through the classification accuracy of ensemble learning and the feature extraction capabilities of ConvXGB. The research aims to identify the most effective model for classifying LC, providing valuable insights into the efficiency of ensemble learning and ConvXGB in interpreting clinical data. The expected outcomes point to improved ML algorithms for precise cancer identification, a development with the potential to transform diagnosis and treatment, leading to better patient outcomes and more informed clinical decision-making, and setting the stage for improved LC management and reduced mortality rates.
4.10. Limitations of the Study
The investigation presents encouraging results yet recognizes several constraints. First, the size and variety of the dataset might limit the model's generalizability, necessitating larger and more diverse datasets for thorough validation. Moreover, the uneven distribution of classes within the data may skew model performance, underscoring the importance of addressing class imbalance. Furthermore, depending solely on predefined clinical features could overlook significant biomarkers; integrating sophisticated feature selection techniques or utilizing DL for automated feature extraction could boost the models' capacity to detect subtle patterns and enhance diagnostic precision [102]. Additionally, the choice of validation methods can significantly influence the credibility of the study; accordingly, employing a range of validation approaches and performing external validation on distinct datasets are essential to ensure the reliability of the results. Translating the models into clinical practice presents further challenges, including the need for validation in real-world scenarios and regulatory authorization. Ensuring the interpretability of models, especially DL models known for their opaque nature, is crucial for clinical acceptance. Furthermore, ethical considerations such as patient confidentiality, data protection, and algorithmic bias must be carefully managed to safeguard patient rights and foster trust in AI-driven healthcare solutions [103]. Transparent disclosure of model performance and potential biases is critical for ethical and accountable deployment. Addressing these constraints in future research can advance the domain of LC diagnosis toward more dependable, clinically relevant, and ethically sound ML-based methodologies [104].
4.11. Future Scope of the Study
While this research demonstrates the value of combining DL with ensemble methods for detecting LC, several avenues remain for future development. The framework requires more diverse clinical records from multiple institutions to ensure effective model generalization across demographics and medical conditions. The characteristic features of early-stage LC could become more detectable to the ensemble models through multimodal data fusion that merges CT scan images with genomic profiles and laboratory test results [105]. XAI techniques, specifically SHAP and LIME, will be a top priority because they make model predictions more understandable to healthcare professionals, building trust between providers and the system [106]; a hedged sketch of this direction appears below. AutoML solutions will be evaluated for automatically designing model architectures and discovering optimal hyperparameters, streamlining development and improving model quality [107]. Future work also aims to convert the proposed models into functional CDSSs, providing user-friendly, real-time assistance to physicians for early LC detection and improved patient outcomes [108].
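As an indication of how the explainability direction might look in practice, the sketch below applies SHAP's tree explainer to the boosted-tree head of ConvXGB; the variable names (conv_xgb, feats_test) refer to the illustrative pipeline sketched in Section 4.5 and are assumptions, not the authors' implementation.

```python
# A hedged sketch of SHAP-based explanation for the XGBoost head of ConvXGB.
import shap

def explain_convxgb(conv_xgb, feats_test):
    """Global SHAP summary for the boosted-tree head of ConvXGB."""
    explainer = shap.TreeExplainer(conv_xgb)         # exact SHAP values for tree models
    shap_values = explainer.shap_values(feats_test)  # one attribution per deep feature
    shap.summary_plot(shap_values, feats_test)       # ranks deep features by influence
    return shap_values
```

Note that these attributions are over the learned deep features; mapping them back to the original clinical attributes would additionally require explaining the CNN itself, for example with LIME or gradient-based methods.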
5. Conclusions
This study contributes significantly to the enhancement of LC diagnostic protocols through the integration of sophisticated ML methodologies. By employing strategies such as SMOTE for data balancing, fivefold cross-validation for performance analysis, and CNNs for binary classification in LC detection, the accuracy and dependability of diagnostic processes were elevated. Moreover, the rigorous comparative analysis of this study establishes benchmarks for evaluating the efficacy of ML models in LC diagnosis, providing a robust framework for future evaluations in both unbalanced and balanced data scenarios. Furthermore, the exploration of ensemble learning techniques, including the ensemble CNN, ConvCatBoost, and ConvXGB, addresses the intricacies of LC diagnosis, leading to more comprehensive and resilient diagnostic frameworks. The exceptional performance of ConvXGB underscores its potential to transform LC diagnosis, offering healthcare professionals a reliable and effective tool for timely interventions and improved patient outcomes. This research highlights the groundbreaking influence of ML in LC research, opening avenues for continued progress in oncological diagnostics and treatment strategies.
Nomenclature
- LC: lung cancer
- ML: machine learning
- DL: deep learning
- NSCLC: non–small cell lung cancer
- SCLC: small cell lung carcinoma
- SBRT: stereotactic body radiotherapy
- CNN: convolutional neural network
- RF: random forest
- SVM: support vector machine
- KNN: K-nearest neighbor
- LR: logistic regression
- XGBoost: eXtreme gradient boosting
- CatBoost: categorical boosting
- SMOTE: synthetic minority oversampling technique
- CDSS: clinical decision support system
- ROC: receiver operating characteristic
- AUC: area under the curve
- ANN: artificial neural network
- MLP: multilayer perceptron
- WDELM: weighted differential extreme learning machine
- RotF: rotation forest
- ConvXGB: convolutional neural network with XGBoost
- ConvCatBoost: convolutional neural network with CatBoost
- ReLU: rectified linear unit
Conflicts of Interest
The authors declare no conflicts of interest.
Author Contributions
M.M.A. and S.K.: conceptualization, methodology, formal analysis, software, validation, and writing—original draft. M.B.B.H.: data curation, methodology, investigation, visualization, and writing—original draft. F.A.: conceptualization, formal analysis, funding acquisition, resources, supervision, project administration, and writing—review and editing. C.C.: data curation, methodology, investigation, visualization, and writing—review and editing. M.A.B.H.: data curation, methodology, validation, investigation, and writing—review and editing. E.S. and S.P.: data curation, investigation, formal analysis, and writing—original draft. R.A. and D.P.: data curation, investigation, formal analysis, and writing—review and editing. All the authors have read and agreed to the publication.
Funding
No funding was received for this manuscript.
Acknowledgments
We would like to thank Prof. Naseem, Prof. Chandel, Prof. Ansari, and Prof. Huang for their valuable support. The authors also acknowledge Integral University, Lucknow, India, for providing the manuscript communication number (IU/R&D/2024-MCN0002892) for this study.
Open Research
Data Availability Statement
All data utilized in this study are included in this article.