International Journal of Intelligent Systems

Volume 2025, Issue 1 2751767

Research Article

Open Access

Interpretable Deep Learning for Classifying Skin Lesions

Mojeed Opeyemi Oyedeji,

Corresponding Author

Mojeed Opeyemi Oyedeji

[email protected]

orcid.org/0000-0001-8653-5572

SDAIA-KFUPM Joint Research Center for Artificial Intelligence , King Fahd University of Petroleum & Minerals , Dhahran , 31261 , Saudi Arabia , kfupm.edu.sa

Search for more papers by this author

Emmanuel Okafor,

Emmanuel Okafor

orcid.org/0000-0001-6929-6880

SDAIA-KFUPM Joint Research Center for Artificial Intelligence , King Fahd University of Petroleum & Minerals , Dhahran , 31261 , Saudi Arabia , kfupm.edu.sa

Search for more papers by this author

Hussein Samma,

Hussein Samma

orcid.org/0000-0002-3562-2788

SDAIA-KFUPM Joint Research Center for Artificial Intelligence , King Fahd University of Petroleum & Minerals , Dhahran , 31261 , Saudi Arabia , kfupm.edu.sa

Search for more papers by this author

Motaz Alfarraj,

Motaz Alfarraj

orcid.org/0000-0002-6052-7221

SDAIA-KFUPM Joint Research Center for Artificial Intelligence , King Fahd University of Petroleum & Minerals , Dhahran , 31261 , Saudi Arabia , kfupm.edu.sa

Electrical Engineering Department , King Fahd University of Petroleum & Minerals , Dhahran , 31261 , Saudi Arabia , kfupm.edu.sa

Information and Computer Science Department , King Fahd University of Petroleum & Minerals , Dhahran , 31261 , Saudi Arabia , kfupm.edu.sa

Search for more papers by this author

Mojeed Opeyemi Oyedeji,

Corresponding Author

Mojeed Opeyemi Oyedeji

[email protected]

orcid.org/0000-0001-8653-5572

SDAIA-KFUPM Joint Research Center for Artificial Intelligence , King Fahd University of Petroleum & Minerals , Dhahran , 31261 , Saudi Arabia , kfupm.edu.sa

Search for more papers by this author

Emmanuel Okafor,

Emmanuel Okafor

orcid.org/0000-0001-6929-6880

SDAIA-KFUPM Joint Research Center for Artificial Intelligence , King Fahd University of Petroleum & Minerals , Dhahran , 31261 , Saudi Arabia , kfupm.edu.sa

Search for more papers by this author

Hussein Samma,

Hussein Samma

orcid.org/0000-0002-3562-2788

SDAIA-KFUPM Joint Research Center for Artificial Intelligence , King Fahd University of Petroleum & Minerals , Dhahran , 31261 , Saudi Arabia , kfupm.edu.sa

Search for more papers by this author

Motaz Alfarraj,

Motaz Alfarraj

orcid.org/0000-0002-6052-7221

SDAIA-KFUPM Joint Research Center for Artificial Intelligence , King Fahd University of Petroleum & Minerals , Dhahran , 31261 , Saudi Arabia , kfupm.edu.sa

Electrical Engineering Department , King Fahd University of Petroleum & Minerals , Dhahran , 31261 , Saudi Arabia , kfupm.edu.sa

Information and Computer Science Department , King Fahd University of Petroleum & Minerals , Dhahran , 31261 , Saudi Arabia , kfupm.edu.sa

Search for more papers by this author

First published: 29 April 2025

https://doi.org/10.1155/int/2751767

Academic Editor: Alexander Hošovský

Share a link

Email
Wechat
Bluesky

Abstract

The global prevalence of skin cancer necessitates the development of AI-assisted technologies for accurate and interpretable diagnosis of skin lesions. This study presents a novel deep learning framework for enhancing the interpretability and reliability of skin lesion predictions from clinical images, which are more inclusive, accessible, and representative of real-world conditions than dermoscopic images. We comprehensively analyzed 13 deep learning models from four main convolutional neural network architecture classes: DenseNet, ResNet, MobileNet, and EfficientNet. Different data augmentation strategies and model optimization algorithms were explored to access the performances of the deep learning models in binary and multiclass classification scenarios. In binary classification, the DenseNet-161 model, initialized with random weights, obtained a top accuracy of 79.40%, while the EfficientNet-B7 model, initialized with pretrained weights from ImageNet, reached an accuracy of 85.80%. Furthermore, in the multiclass classification experiments, DenseNet121, initialized with random weights and trained with AdamW, obtained the best accuracy of 65.1%. Likewise, when initialized with pretrained weights, the DenseNet121 model attained a top accuracy of 75.07% in multiclass classification. Detailed interpretability analyses were carried out leveraging the SHAP and CAM algorithms to provide insights into the decision rationale of the investigated models. The SHAP algorithm was beneficial in understanding the feature attributions by visualizing how specific regions of the input image influenced the model predictions. Our study emphasizes using clinical images for developing AI algorithms for skin lesion diagnosis, highlighting the practicality and relevance in real-world applications, especially where dermoscopic tools are not readily accessible. Beyond accessibility, these developments also ensure that AI-assisted diagnostic tools are deployed in diverse clinical settings, thus promoting inclusiveness and ultimately improving early detection and treatment of skin cancers.

1. Introduction

Skin cancer is among the most fatal forms of cancer, with an annual report of approximately 5.4M cases in USA. Melanoma is a type of skin cancer that represents only 5% of the cases in the USA; it contributes to around 75% of fatalities, resulting in roughly 10,000 cases each year [1]. Early skin cancer detection and treatment protocols are the most recommended approach in the medical healthcare sector for curbing the fatality rate. Many cancer detection systems use deep learning (DL) algorithms for skin lesion diagnosis by proffering predictive decisions on skin cancer categories. However, some factors limit the universal acceptability of DL algorithms in some medical applications. These factors include health data poverty, lack of fairness due to model bias, and lack of DL model interpretability (black-box models). Significant progress has been recorded in advancing the utility of DL techniques for proffering skin lesion diagnosis on dermoscopic images, thereby creating effective predictive models. Although the DL models analyzed on dermoscopic images reported by the authors [2–4] yielded good model performance, a potential barrier exists in deploying these model instances, factoring dermoscopic context into mobile or smart electronic gadgets due to spatial variations in the images. An alternative approach to curb the model limitations is to embrace data collection using smartphone cameras [5]. Hence, a broader context of this article investigates interpretable DL models for predicting and localizing skin lesions, which were curated from smartphones and have diverse variations in image quality.

Previous research works have concentrated on creating a skin lesion diagnostic model while considering the dermoscopic lesion image dataset [6–9]. Some of the famous well-known dermoscopic skin lesion datasets include PH2 [10], BCN20000 [6], HAM10000 [7], and the ISIC collections [8, 9]. These datasets have played an essential role in the development of DL models for solving diverse computer vision challenges such as image classification [11–16] and segmentation [17, 18].

The research study by Cassidy et al. [19] reported a comparative analysis of DL techniques with varying convolutional neural network (CNN) configurations while assessing and analyzing articles published within the past four years. It is vital to highlight that the authors investigated undersampling scenario by removing image duplication while retaining one unique image instance and creating a balanced dataset. Their study compared their proposed models with 19 existing DL methods and demonstrated promising classification prowess in an attempt to classify melanoma. A growing interest has recently been in deploying machine learning models trained on clinical images curated from smartphones and then deployed on mobile devices to foster skin lesion diagnosis. This interest has led to the curation of the PADUFES-20 dataset [5], which investigated the benefits of transfer learning strategy when the DL architectures (GoogLeNet, VGG, residual network (ResNet), and MobileNetV3) analyze the fusion of image and clinical metadata features to create an improved skin lesion diagnostic model with an efficient multiclassification capability.

Furthermore, data augmentation (DA) schemes have been pertinent in promoting enhanced computer vision models capable of classifying or localizing clinical images. A typical DA scheme employs an algorithm for transforming an image’s spatial content; examples include rotation, illumination, color casting, cropping, and rotation matrix [20] for enhancing DL model performances. It is a known fact that dermoscopic and clinical image skin lesion datasets may experience class imbalance challenges. Hence, previous studies have experimented on the benefits of diverse DA schemes [5, 21–25] to mitigate class imbalance problems. A comprehensive study was conducted in the research article by Perez et al. [26]; their study examined the impact of 13 different augmentation schemes when attempting to assess the performance of three DL model architectures trained on dermoscopic lesion images for binary classification.

Limited or no study has explored DL model architectures, factoring in the impact of different DA schemes compared with original images in the clinical image dataset. A research study has attempted to gain intrinsic insight into model interpretability rationality. For instance, the researchers explore the use of explainable algorithms for comprehending the DL model decision-making rationality when analyzed on dermoscopic skin lesions [19].

1.1. Related Studies

Developing DL models from clinical images for skin lesion diagnosis is rapidly gaining traction. This is primarily motivated by the fact that real-time deployment of these artificial intelligence (AI)–powered systems for skin lesion diagnosis will be done on smartphone devices. Recent datasets [5, 27] collected from these smartphone devices are beginning to stimulate new research interests in the applicability of models developed from smartphone imagery for accurate and effective diagnosis of skin lesions. Researchers in [5] provided the first publicly available skin lesion dataset sourced from smartphone cameras. This effort was motivated to promote research investigations in developing computer-aided skin lesion diagnostic tools from smartphone images.

DA strategies have been reported in the literature to mitigate overfitting problems and improve the robustness of DL models by applying spatial and color distortions to images during the training phase. The effect of different DA schemes in skin lesion classification has been studied extensively by Perez et al. [26] using dermoscopic images from the ISIC 2017 archives. The work of Perez et al. [26] investigated 13 different augmentation schemes using three model architectures, namely, Inception-V4, ResNet-152, and DenseNet-161. The results from the study suggested that the optimal augmentation scenario (determined by the best AUC of 0.854, 0.882, and 0.879 for Inception-v4, ResNet, and DenseNet, respectively) for the studied dataset is a combination of random crop, affine, flip, and color jitter. Different color and geometric DA schemes were explored in [28] for the ISBI-2016 melanoma classification challenge dataset. The reported results suggested improved model performance with a combination of geometric, color, and specialist-guided dataset augmentation. The authors in [21] explored the fusion of clinical, CNN, and handcrafted features for skin lesion binary classification tasks involving identifying benign versus malignant lesions and melanoma versus nonmelanoma lesions. The study employed an online DA scheme involving small perturbations on color transformations like brightness and contrast. Geometric transformations such as vertical and horizontal flipping and random rotations between −7.2 and 7.2° were also applied to the dataset. A combined dataset was proposed in [22] to develop a mobile-based skin lesion diagnosis system by including monkeypox images in the PADUFES-20 dataset. The study evaluated the performance of small-sized pretrained networks, namely, EfficientNet-B0, EfficientNet-B7, ResNet18, GoogLeNet, MobileNet-V2, ShuffleNet, and NasnetMobile. DA schemes, including reflection, rotation, scale, and translations, were applied to the dataset simultaneously. An iterative magnitude pruning technique was proposed by Medhat et al. [24] to arrive at lightweight models based on the AlexNet architecture. This study investigated three datasets: the PADUFES-20, MED-NODE, and PH2. The DA schemes applied simultaneously on the datasets included random rotation, reflection, shear, scale, and translation.

The original curators of the PADUFES-20 dataset [5] investigated the impact of the fusion of image and clinical features to obtain improved skin lesion diagnosis in a multiclass classification task using pretrained CNN architectures, namely, GoogLeNet, VGG, ResNet, and MobileNet. The DA scheme employed included horizontal and vertical rotations, translation, rescale, shear, Gaussian noise, and blur. Color augmentations were also applied to the dataset, including brightness, contrast, saturation, and hue. However, this study did not independently investigate the effect of each of the mentioned schemes on the performance of the models. Instead, all the augmentations were applied simultaneously on the dataset. A multimodal fusion scheme was proposed in [23] for image and metadata fusion of the PADUFES-20 dataset involving intra-modality self-attention and inter-modality cross-attention. The image and clinical features were encoded independently before being passed into the intra and inter-modality attention blocks. DA schemes were applied simultaneously to the datasets, including horizontal and vertical flipping, color jittering, random contrast, and Gaussian noise.

The opaque nature of conventional DL algorithms poses a significant challenge to integrating AI solutions in medical diagnostics. These black-box models, while powerful, often lack transparency, making it difficult for clinicians to understand and trust their outputs. This lack of interpretability is a significant barrier to widespread adoption in clinical settings, where the rationale behind diagnostic decisions is as important as the decisions themselves. Enhancing the confidence and reliability of DL models in clinical workflows requires providing clear, understandable explanations for their diagnostic results. To address this challenge, explainable AI (XAI) is rapidly evolving. Researchers are actively developing methods that offer explainable and interpretable mechanisms to support the decision rationale of DL algorithms. Techniques such as SHapley Additive exPlanations (SHAP) [29–32], Local Interpretable Model-Agnostic Explanations (LIME) [33], Deep Learning Important FeaTures (DeepLIFT) [34], and Class Activation Mapping (CAM) [35–37] are among the leading approaches being explored. These methods aim to demystify how DL models arrive at their conclusions by highlighting which features or parts of the data most influence the model’s predictions. This transparency is crucial for validating and verifying AI-driven diagnostic tools.

Recent studies in medical image analysis [38–42] incorporate some interpretable schemes in their analysis to provide some intuitive information about the model’s performance. The pointing game scores [43] were used in [39] to provide quantitative interpretable measures for saliency XAI methods, including Gradient-Based Class Activation Map (GradCAM), Gradient-Based Class Activation Map PlusPlus (GradCAM++), and EigenCAM in the evaluation of DL algorithms for mammography analysis. These quantitative measures were supported by ground-truth bounding boxes, which measured the localization capabilities of these saliency methods. Explainability analysis was explored in [38] with the LIME algorithm for DL models developed from dermoscopic imagery. Due to the absence of ground-truth bounding boxes, the explainability assessment was qualitative only through heatmap visualizations. Visual explanations were discussed in [40] for DL models trained for multiclass classification on the HAM10000 [7] dataset, leveraging saliency XAI methods and layerwise feature map visualizations. The visual explanations obtained by GradCAM and LIME algorithms in [41] show alignments in interpretable assessments of a DL-based monkeypox classifier.

A comprehensive study by Cassidy et al. [19] provided benchmark results for binary and multiclass classification on dermoscopic lesion images. The performance of 19 DL models was discussed and qualitative visual explanations were provided using the GradCAM algorithm. The research investigation in this study focused particularly on the inherent issues within the datasets, especially the problem of image duplicates, and proposed a duplicate removal strategy. This study serves as a critical reference for benchmarking the performance of DL architectures on the largest archive for dermoscopic lesion images.

Optimization algorithms are essential to the learning process of DL algorithms. Popular optimization algorithms for DL models include stochastic gradient descent (SGD), Adaptive Moment Estimation (Adam), and root mean square propagation (RMSprop). The study in [44] investigated the effect of different optimization algorithms on the performance of DL models using the ISIC2018 [7] and COVIDx [45] datasets. Likewise, a comprehensive evaluation of the performance of different optimization algorithms for DL models was carried out in [46] for Indian sign language recognition. The impact of DL optimizers and their hyperparameters was also investigated in [47] for machine bearing fault diagnosis.

1.2. Motivation

Currently, to the best of our knowledge, we find that no study provides benchmark results for the performance of DL models on clinical images. Furthermore, no prior study has investigated the effect of different DA schemes on the performance of skin lesion models developed from clinical data. Likewise, we find that no study has investigated the performance of different DL model optimizers with DA schemes on the performance of skin lesion diagnosis models from clinical datasets. Furthermore, we observed limited research has been done in exploring interpretable DL models for classifying skin lesion from clinical datasets.

1.3. Contribution

This article attempts to bridge the research gap in the reviewed literature while highlighting the following main contributions.

1.
We present a new interpretable DL platform that has the prowess to predict skin lesions and localize the portion in the clinical image containing lesions (as shown in Figure 1) through an easily accessible web-based application.
2.
We investigated 13 DL models derived from four main neural network architectures (DenseNet, EfficientNet, MobileNetV3, and ResNet), each considering a fixed optimization scheme analyzed on binary and multiclass image datasets (without DA). We extended our analysis to improve the model performance by considering the four best model architectures; each was analyzed using five DA schemes to increase the training data size while inspecting the impact of three variants of the Adam optimizer (AdamW, Adam, and Rectified Adam (RAdam)).
3.
Furthermore, we investigate the benefit of using three CAM algorithms (GradCAM, GradCAM++, and Ablation Class Activation Map (AblationCAM)) and SHAP algorithm to analyze the trained DL model and identify the attention region, which influences the decisions made by the DL methods.
4.
Our results reveal that DenseNet architecture with an appropriate optimization scheme and selection of suitable DA protocol yielded a performance that surpassed other techniques in binary and multiclass challenges. We show that the explainable algorithms demystify the rationality behind the decisions made by the DL methods by showing the attention regions containing lesions.
5.
Our proposed interpretable DL system prediction and localization of skin lesions will proffer decision support to medical experts in determining the severity of lesions existing as benign or malignant or gaining insights into the subcategorization of the lesions.

Paper outline: Section 2 presents the used datasets and the data preprocessing by exploring DA techniques for increasing the number of images in training data. Section 3 describes the different DL methods investigated, explains the interpretable algorithms, briefly narrates the performance metrics for evaluating the DL models and model deployment, and highlights the experimental setups. Section 4 presents and analyzes the results obtained after training the DL models. Section 6 summarizes the lessons learned from the study and proffers research recommendations.

Details are in the caption following the image — **Figure 1**
Open in figure viewer PowerPoint

Our proposed explainable deep learning system pipeline for predicting and localizing skin lesion in a clinician image.

2. Datasets and DA

2.1. Dataset Description

The smartphone-acquired skin lesion dataset, PADUFES-20, considered in this study [5] represents the first publicly available skin lesion data repository of dermal clinical images collected using smartphones, including patient demographics. The dataset consists of 2298 samples of six skin lesions from 1373 patients, including three malignant and three benign lesions. The malignant lesions in the dataset are basal cell carcinoma (BCC), malignant melanoma (MEL), and squamous cell carcinoma (SCC), while the benign lesions featured in the dataset are actinic keratosis (ACK), melanocytic nevus (NEV), and seborrheic keratosis (SEK). Tables 1 and 2 summarize the distributions of the skin lesions when treated as a multiclass or binary classification problem. Figure 2 [5] depicts each dataset’s skin lesion type. We investigated two kinds of clinician image datasets: binary class and multiclass. In the binary classification problem, we considered using the DL models to classify lesions as benign or malignant. Conversely, the multiclass classification problem involved using DL models to classify a lesion instance in a clinician image into one of the six classes featured in the dataset.

Table 1. PADUFES-20 dataset is treated in a multiclass fashion.

Lesion type	Samples	%
Actinic keratosis	730	31.77
Basal cell carcinoma	845	36.77
Malignant melanoma	52	2.26
Melanocytic nevus	244	10.62
Squamous cell carcinoma	192	8.36
Seborrheic keratosis	235	10.23
Total	2298	100

Table 2. PADUFES-20 dataset when the multiclass is regrouped into a binary class.

Lesion type	Samples	Benign	Malignant	%
Actinic keratosis	730	✔	✗	31.77
Basal cell carcinoma	845	✗	✔	36.77
Malignant melanoma	52	✗	✔	2.26
Melanocytic nevus	244	✔	✗	10.62
Squamous cell carcinoma	192	✗	✔	8.36
Seborrheic keratosis	235	✔	✗	10.23
Total	2298	1209	1089	100

2.2. DA

In our preliminary analysis, we observed that the used dataset was small and suffered from a class imbalance within the data distribution. Hence, we investigated five different offline DA schemes to operate on the images of the used datasets to increase the number of training examples and curb the limited training data. Note that each DA scheme factors an algorithm for generating transformed images combined with the original image to create a minimum of twice (DA-1, DA-2, DA-3, and DA-4) or quadruple (DA-5) of the original training data. Table 3 summarizes the different augmentation scenarios independently investigated for each model configuration in this study, and Figure 3 [5] depicts the examples of the augmentation scenarios.

Table 3. Summary of the different data augmentation scenarios considered in this study.

DA scheme	Name	Description
DA-1	Random resized crop	Randomly crop the images and resize the cropped image to 224 × 224 pixels
DA-2	Color jitter	Adjusting the contrast, saturation, and brightness which lies within the uniform distribution range [0, 1] and the hue was adjusted in the range [0, 0.5]
DA-3	Random horizontal flip	Randomly flip the images horizontally
DA-4	Random rotation	Randomly rotate the images between (−25, +25) degrees
DA-5	Random horizontal flip, rotation, and resized crop	Apply schemes DA-3, DA-4, and DA-1 in that order

3. Methodology

This section briefly explains DL methods, explores different optimization schemes and experimental setups, discusses foundational concepts about interpretable algorithms, and outlines design steps for model deployment.

3.1. DL Methods

We discuss the DL approaches, which are categorized into four primary techniques.

3.1.1. ResNet

The ResNet [48] represents a DL approach derived from a CNN architecture comprising varying depths of layers. A typical ResNet can curb vanishing gradients encountered in deep neural networks (DNNs). To combat this, the pioneers of this technique introduced the concept of residual blocks that use the skip connection paradigm. The described characteristics empower the ResNet model to efficiently learn identity mapping and discern disparities between input and output layers. Our research investigated the following depths of neural networks: 18, 34, 50, and 101. This means we considered four ResNet architecture variants (ResNet18, ResNet34, ResNet50, and ResNet101). A ResNet model paradigm can be expressed as

()

In this context, y signifies the output of the residual block, whereas H(.) function encompasses the stacking of layers within the residual function. The input to the block is denoted by x, while θ_k represents the network weights for given layers within the residual block, and θ_s corresponds to the skip connection network weights. To determine the successive input value x are obtained by applying the ReLU activation function to the sum of the weighted outputs from the previous layers. This application of ReLU usually introduces nonlinearity into the computation process.

3.1.2. DenseNet

The inherent characteristic of the DenseNet [49] architectures relies on feature reuse and the feedforward propagation prowess that allows the learning of robust features by leveraging information from previous layers, thus resulting in a compact network with fewer network parameters. The dense connections improve the backpropagation of gradients, resulting in improved speed of convergence and training stability. The creators of the architecture proposed four distinct versions, each sharing the same fundamental principles and consisting of 4 dense blocks having variations in the connectivity levels. The intrinsic portions of the initial two blocks were kept constant across all the model variants, with the first block comprising six dense layers and the second incorporating 12 dense layers. The dense layers in all versions employ two convolutions with kernel filter of sizes 3 × 3 and 1 × 1. Conversely, the arrangement of the final two dense blocks varies across different variants of this deep learning architecture. Our research investigation concentrated mainly on the four variations of the DenseNet architectures: DenseNet-201, DenseNet-169, DenseNet-161, and DenseNet-121.

3.1.3. EfficientNet

The research study by Tan and Le [50] introduced an innovative technique for concurrently optimizing the adjustment of the network width, image resolution, and network depth. The pioneer of this technique utilized the compound scaling paradigm while introducing a distinctive set of DL architectures, denoted as EfficientNet-B0 through B7. EfficientNet is constructed from foundational models that utilize Mobile Inverted Bottleneck (MBConv) blocks. Our present investigation examined three iterations of EfficientNet (EfficientNet-B7, EfficientNet-B4, and EfficientNet-B0).

3.1.4. MobileNetV3

This DL method relies on the utility of an automated network search program that is capable of producing an accurate and efficient neural network model. This technique was initially invented by the research team from Google [51]; this DL method uses a neural architecture search platform that factors the mobile search space while factoring a squeeze and excitation optimization scheme. This DL method relies on depthwise convolution and squeeze and excitation optimization protocol within its layers. We considered two variants of this architecture:

1.
MobileNetV3-L: This variant underscores the principle of maximizing accuracy while considering the limitations of resources on mobile devices. This method means MobileNetV3-Large.
2.
MobileNetV3-S: This variant was initially designed for devices with inadequate computational resources and employs minimal processing power and memory utility. This method refers to MobileNetV3-Small.

3.2. Model Optimizers

3.2.1. Adam

The Adam optimizer is a method for optimizing weight parameters, which involves the utilization of the cross-entropy loss function 〈·(W_t) [52]. The formulation of the Adam optimizer is employed to compute the optimal weights as per the following formula:

()

At time step t, approximates the gradients relative to the given loss function. 〈·(W_t−1) denotes the cross-entropy loss function, contingent upon the weight parameters W_t. The numerator segment of the fraction in this equation estimates the bias-corrected first moment β₁, while the denominator segment approximates the bias-corrected second raw moments β₂. Notably, the exponential decay rates for these moment estimates are raised to the power variable t, resulting in and , respectively. Within an Adam optimization algorithm, updates are made to the squared gradient (V_t) and the moving mean of the gradient (m_t). In our preliminary experiments, we optimized the learnable weights on all the architectures (variants of ResNet, DenseNet, MobileNet, and EfficientNet) previously described in Section 3.1.

3.2.2. RAdam

An adaptation of the Adam stochastic optimizer (RAdam) incorporates a mechanism to rectify the variance of the adaptive learning rate [53]. The weight update rule for RAdam can be formulated as follows:

()

In equation (3), the term inside the square root represents the variance rectification component. It is important to note that when the variance becomes tractable, ρ_t exceeds 4. Additionally, η_t represents an adaptive learning rate, which can be determined using the expression . Here, V_t+1 denotes an update for the estimation of the first bias moment and α accounts for the weight decay.

3.2.3. AdamW

AdamW is a stochastic optimization technique that adjusts the conventional approach to weight decay in Adam [54] by separating weight decay from the gradient update process. In Adam l₂ regularization is commonly integrated using the following modification, where W_t represents the weight decay rate at time t “with the following” The weight update in AdamW is formulated as: where W_t denotes the weights at step t, η is the learning rate, m_t and v_t are the bias-corrected first and second moment estimates, ϵ is a small constant for numerical stability, and λ is the weight decay coefficient. This decoupling strategy improves regularization and enhances generalization performance across various deep learning tasks.

3.3. Experimental Setups

In baseline experiments for binary and multiclass classification tasks, 13 DL architectures were trained without DA. The best-performing architecture in each family was selected and retrained independently with five different augmentation schemes and three model optimizers, resulting in 60 training experiments per classification task (binary and multiclass). We trained models from scratch (using random weights) and pretrained weights (from ImageNet) for 100 epochs and 50 epochs, respectively. We utilized a train-validation-test split ratio of 70:15:15 in all the experiments conducted and we selected the model at the best epoch on the validation set in each experiment. All the images were resized to a standard size of 224 by 224 pixels. A fixed learning rate of 0.001 was adopted in all the experiments. Model training utilized a workstation equipped with an AMD Ryzen Threadripper PRO 3975WX 32-core processor (128 MB cache), three NVIDIA RTX 3090 graphics cards, 4.7 TB of storage, and 256 GB of memory.

3.4. XAI Algorithms

We explored XAI algorithms based on SHAP and three visual interpretable algorithms that rely on the principle of CAM [55]. We investigated these algorithms by assessing the trained DL models. Our main objective is to examine the prowess of the interpretable algorithms to analyze the DL techniques and pinpoint the key regions indicative of skin lesions (localization map) within the clinical medical images when attempting to predict a target class.

3.4.1. CAM

CAMs are useful for visualizing the decision rationale of DL models by producing heatmaps showing the model’s attention region for a given input image. This provides an intuition into the model’s decisions and enables researchers and end users to understand which regions of the input contribute to the classification result. CAMs are used for tasks such as object detection, image localization, and model debugging, helping to ensure the transparency and reliability of CNN-based models. Three CAM methods are considered in our study: GradCAM, GradCAM++, and AblationCAM.

3.4.1.1. GradCAM

GradCAM is an interpretable algorithm used in computer vision to visualize an image’s most important regions that influence the decision made by the DL method to predict a desired class. GradCAM is a method for discriminative localization of specific classes and exhibits the prowess to generate visual comprehension from the DL model. GradCAM uses the principle of gradient information flow from the apex CNN layer to interpret the relevant network weights when attempting to make a specific decision (performing a classification task) [35]. We described a GradCAM class activation localization map as a rectification of the sum-weighted product between the weight of partial linearization

and downstream neural network feature map (a^j) as described in the following equation:

()

The weight parameter computes the gradient using the formula , y^c for class c represents the activation score, and B accounts for the constant (quantity of the pixels) in the feature activation map, and the activation function RELU can be computed using the formula f(x) = max(0, x), given an input feature representation (x).

3.4.1.2. GradCAM++

This approach represents an enhanced form of GradCAM, offering visual interpretations of the DL model predictions and visual insight when localizing objects of interest. It additionally provides intuition into the presence of the significant object of interest within an image [36]. In a standard GradCAM++, the methodology involves leveraging a weighted integration of positive partial derivatives derived from the final convolutional layer’s feature maps to a specific class score. These derivatives are employed as weights to generate a visual explanation for a designated class label. Thus, the GradCAM++ method can compute a class discriminative localization map with the aid of the following equation:

()

The weight variable can be calculated using the formula: . It is important to highlight that both GradCAM and GradCAM++ acquire their weights by linearly combining various feature maps through the backpropagation of class relevance scores.

3.4.1.3. AblationCAM

It is an interpretable algorithm that employs a gradient-free saliency map to generate a visual representation of the DL techniques. The AblationCAM interpretability approach circumvents the usual gradient computations and facilitates the generation of class-specific saliency maps [37]. We calculated the ablation saliency map using the following expression:

()

The weight variable can be determined through the formula .

3.4.2. SHAP

SHAP is an interpretability algorithm based on cooperative game theory utilizing the concept of Shapley values assigning a value for each player’s contributions in a game. The SHAP algorithm evaluates the feature contributions to the model’s output by considering all possible feature combinations and averaging the contributions to the output. In the field of XAI, the SHAP algorithm is popular due to the ability to provide global insights to model behavior. The SHAP algorithm is model agnostic and belongs to a class of additive feature attribution methods having these properties [29]: local accuracy, missingness, and consistency. The model-agnostic attribute makes it suitable for any machine learning or DL model. The SHAP algorithm is employed in our study for understanding the feature attributions of the developed models. We utilize the partition explainer function in the official SHAP API [29]. In all the explanations generated by the partition explainer, we utilized a total of 500 evaluations.

3.5. Evaluation Performance Metrics

The DL classification performance evaluations were measured using the following metrics.

1.
Accuracy: This metric assesses the effectiveness of a supervised learning algorithm to correctly predicting a specific output labels relative to all possible predictions.
()
where α_TP denotes true positives, α_TN accounts for true negative, α_FN represent false negative, and α_FP denotes false positive.
2.
Balanced accuracy (BACC): BACC is a metric that evaluates classifier performance in imbalanced class scenarios by considering the accuracy of each class individually and averaging them, thus preventing bias toward the majority class, calculated as the average recall of each class, particularly valuable for datasets with significant class imbalances.
()
3.
Precision: It is a measure that evaluates how accurately a supervised learning algorithm predicts a positive output label (class).
()
4.
Recall (sensitivity): It is a metric that assesses how accurately a supervised learning algorithm attempts to predict true positive output labels among all possible positive samples in given data.
()
5.
F1-Score: This metric calculates harmonic mean between precision and recall.

()

Statistical tests were conducted using the paired t-test and Welch test to compare model performances between instances with DA and without DA across the optimization algorithms. The paired t-test (12) compares the mean difference in performance between original and DA schemes for the same optimizer. In contrast, Welch’s t-test (15) compares the mean performances assuming unequal variances between original and DA schemes. In both cases, the resulting p values were used to determine the significance of the observed differences. In equations (13) and (16),

and

represent the measured performance metric without and with DA, respectively, for the i-th optimization algorithm. Here i = 1, 2, and, 3 represent Adam, RAdam, and AdamW.

()

3.6. Model Deployment Procedure

This section outlines the stages for deploying a functional AI-driven interpretable DL system for predicting skin lesions, motivated by the research works of Hroub et al. [56].

3.6.1. Design Stage

This stage entails conceptualizing and designing the frontend and backend of the AI-based logic for our proposed interpretable DL platform capable of predicting skin lesions.

The design specifications of our proposed system are enumerated below.

1.
We anticipate that a user will utilize the uploading feature to upload an image.
2.
We considered a situation in which, following the successful upload of a clinical image depicting a skin lesion, the user of the application chooses the desired predictive model from the available range of DL models and subsequently clicks the Predict button.
3.
Afterward, upon clicking the Predict button, the backend component of our application interface, which comprises the DL models and program logic, is activated to predict the target class. The results are expected to be displayed on the screen showing the level of predictive certainty along with relevant medical advice.
4.
Moreover, the user has the option to choose an interpretability method. Depending upon the selection of the DL model, the predicted target class, and the interpretable algorithm selected, the interactive application can generate the saliency map. This map highlights the areas of the skin lesion image that significantly impact the DL model’s decision justification.

3.6.2. Implementation Stage

In this stage, we implemented the user interface for our proposed interpretable DL system web application. The demo web application was built on a Python web-based library, Gradio, which supports modularized coding philosophy and facilitates the speedy deployment of AI models. The DL models and explainable algorithms were developed in Python to create a software platform facilitating man–machine interaction within the AI-enabled dermatology decision support platform. Additionally, Cascading Style Sheets (CSS) were employed to design the user interface of our application. Figure 4 shows the implementation steps for the proposed system.

3.6.3. Deployment Stage

We integrated three chosen DL models (DenseNet, ResNet, and MobileNetV3) into our application program backend in this stage. The visual interface of the AI system we created is depicted in Figure 5. We deployed the developed application on a local host.

4. Results

This section extensively analyzes the DL model performances and provides interpretability insight from our newly developed explainable DL platform.

4.1. Binary Classification

4.1.1. Model Training With Random Weights

Table 4 summarizes the binary classification performance of the 13 different state-of-the-art architectures initially investigated in this study. All versions of DenseNet trained on original images surpass all other DL methodologies when assessed on the testing set (when attempting to solve the binary classification). They exhibit a performance ranging from 74.4% to 78.5% across various classification metrics such as accuracy, precision, recall, and F1-Score. The second best method concerning DL architecture is ResNet101, demonstrating notably superior performance to all instances of MobileNetV3 and EfficientNet variants. Conversely, MobileNetV3 variants exhibit the lowest performance levels. Additionally, we conducted a detailed analysis of the performance relative to each DL variant to identify the top-performing model within each architecture. Consequently, DenseNet161, ResNet101, EfficientNetB7, and MobileNetV3-L emerged as the top performers across their respective DL architectures. Therefore, we utilized the chosen DL techniques incorporating variations of the Adam optimizer (such as RAdam, Adam, and AdamW) to carry out supplementary experiments on the training set containing either original or data augmented images. We conducted additional experiments to assess whether there is potential for enhancing DL model performance factoring DA compared to DL methods, which are trained solely on original images.

Table 4. The binary classification performance of the deep learning models whose learnable weights are randomly initialized (training from scratch) and were trained on the training set without DA was assessed on the testing set from the PADUFES-20 dataset.

Model	Accuracy	Precision	Recall	F1
DenseNet121	76.70	76.70	76.60	76.60
DenseNet161	78.40	78.50	78.40	78.40
DenseNet169	74.50	74.70	74.40	74.40
DenseNet201	77.10	77.20	77.20	77.10
ResNet-18	72.60	72.90	72.50	72.50
ResNet-34	71.30	71.30	71.30	71.30
ResNet-50	71.30	71.30	71.40	71.20
ResNet-101	73.30	73.40	73.40	73.30
MobileNetV3-S	65.80	65.80	65.70	65.70
MobileNetV3-L	68.10	68.10	68.10	68.10
EfficientNet-B0	68.40	68.40	68.30	68.20
EfficientNet-B4	71.20	71.30	71.10	71.10
EfficientNet-B7	72.00	72.50	71.90	71.80

Note: Values shown in bold represent the highest-performing results within each model class.

The research outcomes were consolidated to gain insights into the binary classification problem, considering the optimization schemes previously described, and we subsequently trained the DL methods on the original and DA images. The results of the performance metrics (accuracy, F1-Score, and AUC) obtained after training the DL techniques and then evaluating the trained models on the testing set are summarized in Tables 5, 6, and 7. From these tables, we observed that DenseNet161 factoring at least one of the optimization schemes, considering the appropriate choice of the DA scheme, obtained the highest performance, outperforming other DL methods across the evaluated performance metrics. The DenseNet-161 architecture utilizing DA-1 and the Adam optimizer yielded the highest classification accuracy and F1-Score (79.4%), which outperformed the DenseNet161 model derived after training on original images (baseline) by a margin of 1%. To further understand the impact of the examined DA schemes, we inspect the 12 experimental scenarios: each scenario accounts (per row) for all possible experiments of a particular DL instance, factoring in one of the optimization schemes, and was trained on either original images or each of the DA schemes. Consequently, the models obtained after training were used to evaluate the testing set.

Table 5. The binary classification accuracy of the deep learning models whose learnable weights are randomly initialized (training from scratch) and were trained on the training set without DA/with DA was assessed on the testing and validation sets from the PADUFES-20 dataset.

Model	Optimizer	Original	DA-1	DA-2	DA-3	DA-4	DA-5
Testing set
DenseNet161	Adam	78.40	79.40	67.20	78.80	76.10	78.40
	RAdam	77.60	78.70	69.30	78.30	77.10	77.80
	AdamW	78.00	76.10	68.00	79.10	75.80	77.40
ResNet101	Adam	73.30	75.20	65.10	74.10	75.50	75.20
	RAdam	71.70	74.80	66.20	75.90	75.50	75.40
	AdamW	74.30	74.10	64.30	76.20	74.30	75.10
MobileNetV3-L	Adam	68.10	73.30	62.90	73.90	78.70	74.60
	RAdam	63.90	73.50	57.20	71.00	76.10	73.80
	AdamW	66.80	74.80	63.00	73.90	75.90	74.90
EfficientNet-B7	Adam	72.00	75.50	62.30	73.50	76.10	75.20
	RAdam	71.70	76.80	57.20	76.40	77.50	77.20
	AdamW	72.30	75.90	63.00	72.30	76.50	76.20

Validation set
DenseNet161	Adam	80.87	80.58	68.70	80.00	80.00	79.71
	RAdam	80.00	81.45	71.01	81.44	80.57	82.03
	AdamW	81.16	80.00	70.14	80.87	79.71	80.58
ResNet101	Adam	77.10	78.20	68.12	78.84	80.00	77.39
	RAdam	76.52	77.68	69.56	77.97	78.26	78.55
	AdamW	79.13	77.39	67.83	79.42	77.68	76.81
MobileNetV3-L	Adam	71.59	77.97	66.95	78.26	79.71	76.52
	RAdam	69.86	77.68	64.93	74.20	78.55	76.23
	AdamW	79.13	77.68	66.09	77.68	78.55	77.97
EfficientNet-B7	Adam	74.49	77.97	63.19	78.26	78.84	78.55
	RAdam	76.23	79.71	62.32	78.55	80.57	79.71
	AdamW	74.35	80.57	66.66	77.97	80.87	78.84

Note: Bolded values indicate the highest-performing results across the applied DA schemes.

Table 6. The binary classification F1-Score of the deep learning models whose learnable weights are randomly initialized (training from scratch) and were trained on the training set without DA/with DA was assessed on the testing and validation sets from the PADUFES-20 dataset.

Model	Optimizer	Original	DA-1	DA-2	DA-3	DA-4	DA-5
Testing set
DenseNet161	Adam	78.40	79.40	67.20	78.80	76.10	78.40
	RAdam	77.70	78.70	69.20	78.30	77.60	77.80
	AdamW	77.90	76.10	68.00	79.10	75.80	77.40
ResNet101	Adam	73.30	75.20	64.90	74.10	75.50	75.20
	RAdam	71.70	74.80	66.20	75.90	75.50	75.40
	AdamW	74.30	74.00	64.00	76.20	74.30	75.10
MobilenetV3-L	Adam	68.10	73.30	62.70	73.80	78.60	74.20
	RAdam	63.00	73.40	57.20	71.00	76.00	73.80
	AdamW	66.80	74.60	63.00	73.90	75.90	74.90
EfficientNet-B7	Adam	71.80	75.40	62.30	73.50	75.70	75.20
	RAdam	71.70	76.80	57.20	76.40	77.40	77.20
	AdamW	71.10	75.90	63.00	72.00	76.20	76.20

Validation set
DenseNet161	Adam	80.87	80.56	68.58	79.98	79.97	79.70
	RAdam	80.00	81.38	70.75	81.44	80.49	82.00
	AdamW	81.02	79.93	70.08	80.86	79.70	80.58
ResNet101	Adam	77.09	78.26	67.74	78.81	79.90	77.34
	RAdam	76.39	77.65	69.56	77.81	78.26	78.53
	AdamW	79.00	77.36	67.64	79.36	77.61	76.78
MobileNetV3-L	Adam	71.57	77.84	66.72	78.12	79.57	76.26
	RAdam	69.84	77.63	64.62	74.15	78.52	76.22
	AdamW	71.57	77.44	65.89	77.67	78.53	77.85
EfficientNet-B7	Adam	74.26	77.81	63.17	78.16	78.42	78.51
	RAdam	76.23	79.70	62.32	78.50	80.41	79.62
	AdamW	74.31	80.51	66.63	77.74	80.51	78.83

Note: Bolded values indicate the highest-performing results across the applied DA schemes.

Table 7. The binary classification AUC of the deep learning models whose learnable weights are randomly initialized (training from scratch) and were trained on the training set without DA/with DA was assessed on the testing and validation sets from the PADUFES-20 dataset.

Model	Optimizer	Original	DA-1	DA-2	DA-3	DA-4	DA-5
Testing set
DenseNet161	Adam	85.40	84.00	71.40	84.90	85.00	83.60
	RAdam	83.80	84.80	74.40	84.90	83.30	84.60
	AdamW	83.70	82.60	73.60	85.50	82.90	83.90
ResNet101	Adam	79.20	79.90	69.60	80.50	80.90	82.10
	RAdam	77.40	81.80	69.20	81.50	81.60	81.30
	AdamW	79.80	80.00	69.70	80.70	81.50	79.30
MobileNetV3-L	Adam	73.00	79.40	66.60	81.70	83.30	82.60
	RAdam	66.60	79.90	64.30	77.00	81.90	80.00
	AdamW	70.90	80.60	63.70	79.30	82.40	80.50
EfficientNet-B7	Adam	79.60	82.20	65.30	78.90	82.10	82.00
	RAdam	77.30	81.20	60.70	82.50	83.70	82.10
	AdamW	76.70	82.70	63.80	78.50	82.90	81.70

Validation set
DenseNet161	Adam	88.30	85.80	71.88	85.73	87.47	86.02
	RAdam	85.05	86.42	76.12	86.21	85.61	87.27
	AdamW	86.42	85.32	75.74	86.52	85.23	86.57
ResNet101	Adam	82.13	82.03	73.87	83.99	84.27	84.84
	RAdam	82.58	83.58	74.06	83.60	84.58	84.58
	AdamW	82.85	82.94	75.19	82.09	85.55	81.47
MobileNetV3-L	Adam	77.17	82.17	70.49	84.82	85.11	86.31
	RAdam	72.35	83.81	66.16	78.56	85.45	82.69
	AdamW	74.23	82.72	67.03	82.36	85.17	84.25
EfficientNet-B7	Adam	81.18	85.27	66.60	81.13	84.13	84.17
	RAdam	83.10	83.03	63.33	84.75	86.03	84.52
	AdamW	78.71	84.23	66.30	82.26	84.91	85.35

Note: Bolded values indicate the highest-performing results across the applied DA schemes.

We report that the DA-1 scheme aided the DenseNet161 in obtaining the highest binary classification accuracy and F1-Score. Additionally, we inspected the impact of the DA across the 12 scenarios; the results show that DA-4 and DA-3 are the most influential in enhancing the DL model performance across the four main neural network architectures under study. The second competitive method is MobileNetV3-L architecture; the most favorable performance was observed with the DA-4 approach across all three optimizer types compared to all other DA schemes, while the Adam optimizer surpassed the AdamW and RAdam. Additionally, MobileNetV3-L, when trained on DA-4, significantly outperforms MobileNetV3-L when trained on the original image by a margin ≥10.3% across accuracy, F1-Score, and AUC metrics. Likewise, the best model accuracy was recorded for the EfficientNet-B7 architecture with DA-4 compared with all other model optimizers: RAdam performed slightly better than the other optimization schemes. The best models from the EfficientNet-B7 architecture outperform most variants of the ResNet101 in most of the performance evaluation metrics. After conducting experiments for the different instances of the ResNet-101 models, while considering different DA schemes and optimizers, the highest model accuracy of 76.2% was achieved with DA-3 when the ResNet101 model weights are optimized using the AdamW optimizer. In general, the ResNet-101 factoring DA-3 and DA-4 schemes yielded the best performance across the three main experimental scenarios for this architecture. However, the best-performing ResNet-101 model trained on data-augmented images still yielded poorer performance compared to most of the top-performing architectures, including DenseNet161, MobileNetV3-L, and EfficientNet-B7.

The results presented show that spatial or geometrical transformation–based DA schemes based on random rotation (DA-4), horizontal flipping (DA-3), and cropping (DA-1) can improve the binary classification performance of DL architectures that are trained on clinical skin lesion images. In most instances, the performance of these DL architectures deteriorates when color jitter DA (DA2) schemes are considered. It is vital to highlight that the random rotation (DA-4) yielded the best performances in most experiments, especially with the MobileNetV3-L and EfficientNet-B7. We underscore that some instances of DA-5, which factors DA-1, DA-3, and DA-4, often yielded decent high classification scores across the evaluated metrics. At the same time, the top-performing architecture across all the schemes and model optimizers was the DenseNet-161 architecture. These results indicate that the DA schemes considered in this study have demonstrated the potential to improve DL model performances. To better understand how the evaluated DL models predict clinical skin lesions, we illustrate the correlation between the actual output labels (ground truth) and the predicted outputs from the top-performing models using a confusion matrix, depicted in Figure 6. The figure demonstrates that DenseNet-161 exhibited the highest accuracy, correctly predicting around 548 images while making only 141 false predictions during the testing phase evaluation. In contrast, the other methods showed lower accuracy in correct predictions.

4.1.2. Training With Pretrained Weights

Additional experiments were carried out with the top-performing DL models (DenseNet161, ResNet101, MobileNetV3-L, and EfficientNetB7) which were trained using pretrained weights. Tables 8, 9, and 10 provide a summary of the testing and validation performance in terms of accuracy, F1-Score, and AUC, respectively. The stated DL models were trained using pretrained weights (from ImageNet) while considering different DA strategies and model optimizers for binary classification of skin lesions. The results demonstrate that models adopting DA-4 and DA-5 yielded superior performance, indicating that these augmentation strategies provided better variability and robustness during training, thereby allowing the models to generalize adequately to unseen data. The EfficientNet-B7 architecture (trained with DA-5 augmentation) attained a top accuracy and F1-Score of 85.80% in testing set evaluations. Likewise, the top AUC score was obtained with the EfficientNet-B7 architecture under similar training conditions in validation set evaluations. Experiments with DA-4 strategies also reveal superior classification performance for DenseNet161, ResNet101, and MobileNet-V3, with accuracy scores reaching 84.64%, 84.93%, and 84.93%, respectively, in testing set evaluations. Similar performance was recorded in the validation set with DenseNet161, ResNet101, MobileNet-V3, and EfficientNetB7 attaining accuracy scores of 86.38%, 86.09%, 86.38%, and 87.54%, respectively.

Table 8. Binary classification accuracy of deep learning models with pretrained weights, trained on the training set with and without data augmentation (DA), evaluated on the testing and validation sets from the PADUFES-20 dataset.

Model	Optimizer	Original	DA-1	DA-2	DA-3	DA-4	DA-5
Testing set
DenseNet161	Adam	81.74	83.19	81.16	83.48	83.48	82.90
	RAdam	82.61	83.77	80.29	80.58	81.74	80.58
	AdamW	82.61	84.06	76.81	83.19	84.64	84.35
ResNet101	Adam	80.58	82.32	80.58	80.00	84.93	81.74
	RAdam	81.16	81.16	80.29	82.90	82.03	81.16
	AdamW	83.48	79.71	80.87	82.90	84.35	82.32
MobileNetV3-L	Adam	80.58	82.61	84.35	82.61	84.93	80.29
	RAdam	79.13	82.90	80.29	83.48	80.58	81.45
	AdamW	82.03	83.48	82.32	82.61	83.19	84.06
EfficientNet-B7	Adam	81.16	82.61	84.35	82.03	82.90	85.80
	RAdam	81.45	82.32	82.03	82.61	81.45	83.77
	AdamW	82.90	83.48	81.16	82.32	83.19	85.80

Validation set
DenseNet161	Adam	84.35	86.09	83.48	83.48	86.38	86.38
	RAdam	85.22	84.35	83.48	80.58	86.38	85.22
	AdamW	84.06	83.77	85.22	83.19	86.09	85.22
ResNet101	Adam	85.80	83.77	84.06	80.00	85.22	84.93
	RAdam	83.77	82.90	83.19	82.90	86.09	85.51
	AdamW	85.22	84.06	81.45	82.90	85.22	84.64
MobileNetV3-L	Adam	83.19	83.77	83.77	82.61	86.38	85.22
	RAdam	84.06	84.35	82.90	83.48	85.51	84.64
	AdamW	84.64	84.35	84.93	82.61	84.64	84.35
EfficientNet-B7	Adam	85.22	84.64	84.64	82.03	87.54	86.96
	RAdam	84.35	85.51	83.77	82.61	86.38	86.96
	AdamW	84.35	84.64	86.09	82.32	87.25	86.96

Note: Bolded values indicate the highest-performing results across the applied DA schemes.

Table 9. Binary classification F1-score of deep learning models with pretrained weights, trained on the training set with and without data augmentation (DA), evaluated on the testing and validation sets from the PADUFES-20 dataset.

Model	Optimizer	Original	DA-1	DA-2	DA-3	DA-4	DA-5
Testing set
DenseNet161	Adam	81.73	83.19	81.16	83.47	83.47	82.86
	RAdam	82.60	83.73	80.28	80.56	81.74	80.57
	AdamW	82.61	84.04	76.81	83.19	84.63	84.34
ResNet101	Adam	80.58	82.31	80.58	80.00	84.92	81.73
	RAdam	81.08	81.15	80.25	82.89	82.03	81.16
	AdamW	83.48	79.71	80.87	82.89	84.35	82.31
MobileNetV3-L	Adam	80.58	82.58	84.34	82.61	84.91	80.28
	RAdam	79.13	82.90	80.29	83.47	80.58	81.43
	AdamW	81.97	83.47	82.32	82.60	83.19	84.05
EfficientNet-B7	Adam	81.15	82.57	84.34	82.03	82.90	85.80
	RAdam	81.45	82.29	82.02	82.61	81.45	83.75
	AdamW	82.89	83.48	81.14	82.27	83.18	85.80

Validation set
DenseNet161	Adam	84.28	86.04	83.41	83.47	86.32	86.37
	RAdam	85.05	84.33	83.44	80.56	86.33	85.19
	AdamW	84.02	83.71	85.04	83.19	86.04	85.20
ResNet101	Adam	85.77	83.74	84.00	80.00	85.18	84.92
	RAdam	83.56	82.80	83.08	82.89	86.05	85.44
	AdamW	85.13	84.00	81.43	82.89	85.18	84.60
MobileNetV3-L	Adam	83.17	83.74	83.65	82.61	86.30	85.20
	RAdam	84.00	84.27	82.83	83.47	85.45	84.61
	AdamW	84.52	84.32	84.82	82.60	84.60	84.31
EfficientNet-B7	Adam	85.13	84.62	84.59	82.03	87.48	86.84
	RAdam	84.32	85.49	83.70	82.61	86.32	86.88
	AdamW	84.31	84.52	85.97	82.27	87.13	86.81

Note: Bolded values indicate the highest-performing results across the applied DA schemes.

Table 10. Binary classification AUC of deep learning models with pretrained weights, trained on the training set with and without data augmentation (DA), evaluated on the testing and validation sets from the PADUFES-20 dataset.

Model	Optimizer	Original	DA-1	DA-2	DA-3	DA-4	DA-5
Testing set
DenseNet161	Adam	89.87	89.48	89.62	90.12	90.17	89.81
	RAdam	89.70	90.21	86.46	90.31	89.73	88.09
	AdamW	90.39	88.71	87.96	91.02	90.08	89.03
ResNet101	Adam	88.24	89.02	88.71	89.22	89.80	88.94
	RAdam	88.73	86.61	88.37	89.65	87.18	89.14
	AdamW	88.02	87.52	89.34	90.12	91.31	88.92
MobileNetV3-L	Adam	86.58	87.67	87.75	88.49	90.07	87.73
	RAdam	86.93	87.46	87.61	88.36	89.23	88.39
	AdamW	88.92	88.69	86.90	88.17	88.06	90.05
EfficientNet-B7	Adam	89.93	90.37	89.20	89.54	90.69	90.93
	RAdam	86.94	88.90	89.28	88.70	89.55	90.30
	AdamW	90.40	88.61	88.51	89.51	89.24	90.67

Validation set
DenseNet161	Adam	90.05	90.29	89.45	90.12	90.99	92.84
	RAdam	89.63	89.30	88.84	90.31	91.59	91.41
	AdamW	91.70	89.74	90.37	91.02	91.93	91.50
ResNet101	Adam	90.40	89.76	91.46	89.22	92.70	91.02
	RAdam	90.94	88.32	88.27	89.65	91.58	91.01
	AdamW	89.72	90.21	90.04	90.12	90.73	91.06
MobileNetV3-L	Adam	89.74	88.38	90.36	88.49	92.06	91.70
	RAdam	90.40	89.35	89.32	88.36	88.36	91.00
	AdamW	89.42	90.24	90.55	88.17	91.13	90.67
EfficientNet-B7	Adam	89.35	90.29	90.33	89.54	92.34	92.02
	RAdam	90.29	89.64	90.75	88.70	92.08	92.23
	AdamW	88.71	91.21	92.14	89.51	92.10	91.84

Note: Bolded values indicate the highest-performing results across the applied DA schemes.

The choice of optimizer also significantly impacted the performances of the DL models in binary classification experiments conducted with pretrained weights. In the validation set evaluations, the EfficientNet-B7 model, trained with the Adam optimizer, attained the best AUC score of 92.34% when DA-4 strategy was considered. Conversely, in the testing set evaluations, the best AUC score of 91.02% was recorded with the DenseNet161 architecture (trained with the DA-3 strategy). While RAdam and AdamW optimizers yielded competitive performance for the models investigated, we find that when using pretrained weights, the Adam optimizer provided the most consistent results when combined with effective DA strategies.

The EfficientNet-B7 model architecture outperformed other architectures in these sets of experiments considering pretrained weights attaining binary accuracy, F1-Score, and AUC of 85.80%, 85.80%, and 90.93%, respectively, in the testing set evaluations. This architecture also demonstrates compatibility with the DA-5 augmentation strategy and the Adam optimizer. Superior performance was also recorded with the DenseNet161 and MobileNet-V3L model architectures, obtaining top accuracy of 84.64% and 84.93%, respectively, in testing set evaluations. Overall, the results obtained with pretrained models show that the combination of the Adam optimizer and DA-4/DA-5 strategies is particularly effective for the binary classification of skin lesions using clinical lesion images.

4.1.3. Statistical Analyses

A statistical comparison between the DA-1 scheme and the original dataset for DenseNet161 yielded p values of 0.95 for both the paired t-test and Welch’s t-test. We observe the statistical difference between the original (without any DA) and other DA schemes in each model architecture over different model optimizer instances. The best accuracy (79.40) was obtained with the DenseNet161 model with Adam optimizer for models trained from scratch without pretrained weights. Statistical tests were conducted using the paired t-test (12) and the Welch t-test (15) to compare the performance of the investigated models on the testing set. Conversely, with the DenseNet161 operating with the DA-3 scheme, the resulting models outperformed the case without augmentation with p values of 0.07 (paired t-test) and 0.09 (Welch t-test). The best overall accuracy with the ResNet101 model architecture was recorded with the DA-3 scheme and AdamW optimizer. Statistical comparisons of models operating DA-3 with the instance without augmentation reveal p values of 0.14 and 0.08, respectively, with the paired t-test and Welch t-test. The overall best accuracy for MobileNet-based architectures was recorded with the DA-4 scheme. Statistical comparisons here show p values of 0.007 and 0.003, respectively, in paired t-tests and Welch tests. Likewise, DA-3 (paired t-test: 0.004, Welch t-test: 0.015) and DA-5 (paired t-test: 0.014, Welch t-test: 0.017) outperformed the instance without DA for the MobileNet architecture. The DA-4 scheme produced the overall best accuracy for the Efficient-Net-B7 architecture across all the optimization algorithms. Statistical comparisons in this scenario with the instance without DA gave p values of 0.013 and 0.003 for paired and Welch t-tests, respectively. Similarly, DA-1 and DA-5 resulted in improved accuracy for EfficientNet compared with instances without data augmentation, with DA-1 yielding p values of 0.015 (paired t-test) and 0.003 (Welch’s t-test), and DA-5 yielding p values of 0.025 (paired t-test) and 0.012 (Welch’s t-test).

The DenseNet161 with pretrained weights gave an overall best accuracy of 84.64% with the DA-4 scheme and AdamW optimizer. Statistical tests measuring the performance of the pretrained DenseNet-161 across all optimizers in the DA-4 scheme against the instance without any DA gave p values of 0.404 and 0.372 for paired and Welch t-tests. Conversely, in the DA-1 scheme, which also outperformed the instance without any DA for DenseNet161, the p values obtained were 0.005 and 0.02, respectively, for paired and Welch t-tests showing significant statistical differences. The overall best accuracy (=84.93%) for the pretrained ResNet101 was obtained with the DA-4 scheme and Adam optimizer. Statistical tests comparing the ResNet101 with the DA-4 scheme and instances without DA revealed p values of 0.22 and 0.18, respectively. The MobileNetV3 architecture obtained an overall best accuracy of 84.93 with the DA-4 scheme and Adam optimizer. The statistical tests comparing the DA-4 scheme for MobileNet-V3 with the case without any DA gave p values of 0.150 and 0.211, respectively, for paired and Welch t-tests. Conversely, the MobileNet-v3 architecture also obtained accuracy values in DA1, surpassing the instance without DA. The measured p values in this instance were 0.07 and 0.09, respectively, demonstrating some statistical difference. The best overall accuracy for the pretrained EfficientNet-B7 was obtained with the DA-5 scheme. Statistical comparisons of this instance with the case without DA gave p values of 0.04 and 0.02, respectively, for paired and Welch t-tests.

4.2. Multiclass Classification

4.2.1. Training With Random Weights

Table 11 summarizes the performance of the 13 different DL architectures trained on clinical lesion images to perform multiclassification tasks. All versions of DenseNet trained on original images surpass all other DL methodologies when assessed on the testing set (when attempting to solve the binary classification). They exhibit a performance ranging from 55.4% to 57.8% across the performance classification measures (accuracy, recall, precision, and F1-Score). DenseNet models achieved BACC scores ranging from 43.1% to 46.1%. Overall, DenseNet121 was the top-ranked model in three out of five performance measures, while DenseNet201 ranked second in two out of five measures. The second best performing DL architecture is ResNet50, demonstrating notably superior performance to all instances of MobileNetV3 and EfficientNet variants. Conversely, MobileNetV3 variants exhibit the worst performance levels across all the evaluation metrics. Additionally, we conducted a detailed analysis of the performance relative to each DL variant to identify the top-performing model within each architecture. Consequently, DenseNet121, ResNet50, EfficientNetB0, and MobileNetV3-S emerged as the top performers across their respective DL architectures. Therefore, we utilized the chosen DL techniques incorporating variations of the Adam optimizer (such as RAdam, Adam, and AdamW) to carry out supplementary experiments on the training set containing either original or data-augmented images. We conducted additional experiments to assess whether there is potential for enhancing DL model performance factoring DA compared to DL methods, which are trained solely on original images.

Table 11. The multiclassification performance of the deep learning models whose learnable weights are randomly initialized (training from scratch) and were trained on the training set without DA was assessed on the testing set from the PADUFES-20 dataset.

Model	Accuracy	Precision	Recall	F1	BACC
DenseNet121	57.70	57.40	57.70	57.40	46.10
DenseNet161	57.10	55.30	57.10	55.40	43.10
DenseNet169	56.70	56.70	56.70	56.10	45.20
DenseNet201	57.80	57.10	57.80	57.10	45.10
ResNet-18	53.90	50.80	53.90	51.20	36.20
ResNet-34	51.30	48.30	51.30	49.30	33.80
ResNet-50	53.20	52.00	53.20	52.10	42.00
ResNet-101	52.20	53.50	52.20	52.30	40.60
EfficientNet-B0	45.80	46.70	45.80	46.00	35.70
EfficientNet-B4	47.50	44.90	47.50	46.10	32.90
EfficientNet-B7	45.70	44.10	45.80	44.30	30.70
MobileNetV3-S	43.90	41.30	43.90	41.60	29.10
MobileNetV3-L	43.30	40.40	43.30	41.10	21.90

Note: Values shown in bold represent the highest-performing results within each model class.

The research outcomes were consolidated to gain insights into the multiclass classification problem, considering the optimization schemes previously described, and we subsequently trained the DL methods the original and the DA images. The results of the performance metrics (accuracy, F1-Score, and BACC) obtained after training the DL techniques and then evaluating the trained models on the testing set are summarized in Tables 12, 13, and 14. From these tables, we observed that DenseNet121 factoring at least one of the optimization schemes, considering the appropriate choice of the DA scheme, obtained the highest performance, outperforming other DL methods across the observed performance measures. The DenseNet-121 architecture factoring in AdamW optimizer and trained on DA-5 yielded the highest classification accuracy and F1-Score of 65.1% and 64.3%, respectively, which outperformed the DenseNet121 model derived after training on original images (baseline) by a significant margin of 7.4%−8.4% depending on the optimization schemes (Adam or AdamW) used in the training phase. Again, we observe that The DenseNet-121 architecture factoring in AdamW optimizer and trained on DA-5 yielded the highest BACC of (55.1%) in the testing phase. To further understand the impact of the examined DA schemes, we inspect the 12 experimental scenarios: each scenario accounts (per row) for all possible experiments of a particular DL instance, factoring in one of the optimization schemes, and was trained on either original images or each of the DA schemes. Consequently, the models obtained after training were used to evaluate the testing set.

Table 12. The multiclassification accuracy of the deep learning models whose learnable weights are randomly initialized (training from scratch) and were trained on the training set without DA/with DA was assessed on the testing and validation sets from the PADUFES-20 dataset.

Model	Optimizer	Original	DA-1	DA-2	DA-3	DA-4	DA-5
Testing set
DenseNet-121	Adam	57.70	63.00	59.90	62.20	60.90	60.30
	RAdam	56.50	62.00	61.50	63.50	63.30	63.60
	AdamW	56.70	63.60	59.80	63.20	61.40	65.10
ResNet-50	Adam	53.20	59.00	53.30	56.40	59.90	58.00
	RAdam	52.50	60.00	56.40	58.00	59.90	59.10
	AdamW	55.20	59.90	55.20	57.10	61.70	62.00
EfficientNet-B0	Adam	45.80	57.40	54.20	55.50	60.30	54.30
	RAdam	46.70	56.70	49.10	54.20	58.30	53.50
	AdamW	45.60	57.80	51.40	55.90	55.20	56.50
MobileNetV3-S	Adam	43.90	51.10	47.20	50.30	56.40	47.00
	RAdam	45.50	49.00	47.70	48.80	53.30	46.50
	AdamW	43.70	50.90	48.30	49.10	56.10	48.30

Validation set
DenseNet121	Adam	62.32	67.83	63.19	66.09	65.80	64.93
	RAdam	60.29	66.96	63.77	68.70	67.25	67.83
	AdamW	59.13	66.96	63.19	66.67	67.54	66.67
ResNet50	Adam	59.42	63.48	57.10	62.03	64.35	62.61
	RAdam	56.23	62.90	58.84	64.64	64.35	60.29
	AdamW	60.00	63.48	57.10	64.06	66.38	62.61
MobileNetV3-S	Adam	48.41	55.65	51.59	55.36	59.71	51.59
	RAdam	47.54	54.20	51.59	52.75	55.36	51.88
	AdamW	47.54	55.65	53.04	53.04	60.29	52.17
EfficientNet-B0	Adam	51.30	60.29	57.97	62.32	65.51	58.26
	RAdam	51.01	60.87	55.36	59.13	62.90	59.42
	AdamW	51.30	62.61	57.39	59.42	62.03	59.42

Note: Bolded values indicate the highest-performing results across the applied DA schemes.

Table 13. The multiclassification F1-Score of the deep learning models whose learnable weights are randomly initialized (training from scratch) and were trained on the training set without DA/with DA was assessed on the testing and validation sets from the PADUFES-20 dataset.

Model	Optimizer	Original	DA-1	DA-2	DA-3	DA-4	DA-5
Testing set
DenseNet121	Adam	57.40	61.30	57.20	60.70	59.40	60.20
	RAdam	54.50	60.50	58.70	61.70	60.90	61.90
	AdamW	55.10	62.70	57.20	61.60	59.90	64.30
ResNet-50	Adam	52.10	58.50	52.90	54.90	56.00	57.70
	RAdam	51.50	59.20	53.60	56.20	57.80	59.30
	AdamW	54.60	57.90	52.70	56.00	59.30	61.70
EfficientNet-B0	Adam	46.00	57.10	51.80	54.60	58.80	53.70
	RAdam	45.40	56.40	47.30	52.50	57.60	54.40
	AdamW	45.30	57.30	50.50	54.20	53.90	56.80
MobileNetV3-S	Adam	41.60	50.90	44.50	46.60	55.40	48.60
	RAdam	44.00	49.40	45.40	45.70	53.30	47.40
	AdamW	41.70	51.30	46.30	46.80	55.10	48.90

Validation set
DenseNet121	Adam	45.91	50.05	43.59	47.76	52.46	50.48
	RAdam	43.19	48.54	44.71	51.00	48.21	50.51
	AdamW	43.23	49.67	43.59	53.14	50.62	54.37
ResNet50	Adam	46.37	47.22	45.48	47.40	47.20	51.71
	RAdam	39.42	47.27	42.24	45.02	46.60	46.10
	AdamW	45.47	46.47	37.68	47.83	46.41	50.64
MobileNetV3-S	Adam	31.02	43.79	36.34	41.80	41.56	42.51
	RAdam	38.38	41.26	37.50	34.28	42.41	38.71
	AdamW	34.29	42.89	39.80	34.60	43.40	40.85
EfficientNet-B0	Adam	40.60	48.61	42.06	47.29	48.33	41.78
	RAdam	33.34	48.54	36.33	40.89	46.30	47.40
	AdamW	36.30	47.70	42.42	40.24	45.96	49.27

Note: Bolded values indicate the highest-performing results across the applied DA schemes.

Table 14. The multiclassification balanced accuracy (BACC) of the deep learning models whose learnable weights are randomly initialized (training from scratch) and were trained on the training set without DA/with DA was assessed on the testing and validation sets from the PADUFES-20 dataset.

Model	Optimizer	Original	DA-1	DA-2	DA-3	DA-4	DA-5
Testing set
DenseNet-121	Adam	46.10	49.30	42.30	45.30	48.00	48.10
	RAdam	42.30	47.60	43.70	47.90	46.50	48.90
	AdamW	41.80	49.50	42.30	49.80	44.80	55.10
ResNet-50	Adam	42.00	48.20	40.50	40.70	43.30	49.60
	RAdam	39.60	48.50	40.40	41.60	43.50	49.10
	AdamW	43.10	46.10	38.50	41.70	44.50	52.00
EfficientNet-B0	Adam	35.70	46.90	38.40	41.70	46.50	43.20
	RAdam	31.40	46.50	32.10	37.30	44.60	46.80
	AdamW	31.10	45.80	38.50	40.30	41.60	47.90
MobilenetV3-S	Adam	29.10	40.30	32.90	33.70	42.30	40.80
	RAdam	33.90	39.70	33.90	31.30	42.20	38.60
	AdamW	33.10	42.60	33.30	32.60	43.10	40.60

Validation set
DenseNet121	Adam	46.93	48.71	43.93	47.33	51.73	49.95
	RAdam	43.46	49.82	45.02	50.67	46.60	50.48
	AdamW	42.45	49.54	43.93	51.31	48.76	53.45
ResNet50	Adam	46.69	49.21	46.40	46.93	46.11	52.94
	RAdam	39.73	48.28	42.40	45.83	46.26	46.43
	AdamW	45.45	46.51	38.55	47.59	46.81	50.55
MobileNetV3-S	Adam	31.97	44.95	35.98	39.19	42.08	44.93
	RAdam	36.25	42.31	36.89	34.45	43.82	40.68
	AdamW	34.93	44.64	39.08	34.33	43.15	42.74
EfficientNet-B0	Adam	41.69	48.94	39.77	46.41	49.66	43.39
	RAdam	34.89	49.82	36.33	40.00	46.65	49.04
	AdamW	37.49	48.52	43.87	41.14	45.60	50.32

Note: Bolded values indicate the highest-performing results across the applied DA schemes.

We report that the DA-5 scheme aided the DenseNet161 when trained using the AdamW optimizer and yielded the highest multiclassification metrics (accuracy, F1-Score, and BACC). Additionally, we inspected the impact of the DA across the 12 scenarios; the results show that DA-4 and DA-5 are the most influential in enhancing the DL model performance across the four main neural network architectures under study. The second competitive method is ResNet-50 architecture; when trained on DA-5 using the AdamW optimizer, it was the most favorable performance when compared with an instance of ResNet-50 trained with the other two optimization schemes (Adam and RAdam). Additionally, ResNet-50, when trained on DA-5, significantly outperforms ResNet-50 when trained on the original image by a margin ≥6.8% across accuracy, F1-Score, and AUC metrics when considering the AdamW optimizer. Likewise, the best model accuracy was recorded for the EfficientNet-B0 architecture with DA-4 compared with either Adam or RAdam optimizer: Adam performed better than the other optimization schemes. The best models from the EfficientNet-B0 architecture outperform all variants of the MobileNetV3-S in most of the performance evaluation metrics. After conducting experiments for the different instances of the MobileNetV3-S models while considering different DA schemes and optimizers, the highest model accuracy of 56.4% was achieved with DA-4 when the MobileNetV3-S model weights were optimized using the Adam optimizer. In general, the MobileNetV3-S factoring DA-4 and DA-1 schemes yielded the best performance across the three main experimental scenarios for this architecture; however, the best MobileNetV3-S trained on data-augmented images yielded the worst performance when compared with most of the best DL architectures (DenseNet121, ResNet-50, and EfficientNet-B0).

The findings demonstrate that data augmentation schemes based on spatial or geometrical transformations including random rotation (DA-4), a combination of cropping, horizontal flipping, and rotation (DA-5), horizontal flipping (DA-3), and cropping (DA-1) contribute to improved multiclass classification performance in deep learning models trained on clinical skin lesion datasets. In most instances, the performance of these DL architectures deteriorates when color jitter DA (DA2) schemes are considered. It is vital to highlight that the random rotation (DA-4) yielded the best performances in most experiments, especially with the MobileNetV3-S and EfficientNet-B0. We underscore that DenseNet121 and ResNet-50 trained with AdamW optimizer on DA-5, which factors DA-1, DA-3, and DA-4, often yielded the highest classification scores across the evaluated metrics. At the same time, the top-performing architecture across all the schemes and model optimizers was the DenseNet-121 architecture. These results indicate that the DA schemes considered in this study have demonstrated the potential to improve DL model performances. To better understand how the evaluated DL models predict clinical skin lesions, we illustrate the correlation between the actual output labels (ground truth) and the predicted outputs from the top-performing models using a confusion matrix, depicted in Figure 7. The figure demonstrates that DenseNet-121 exhibited the highest accuracy, correctly predicting around 443 images while making only 246 false predictions during the testing phase evaluation. In contrast, the other methods showed lower accuracy in correct predictions.

4.2.2. Training With Pretrained Weights

Additional experiments were carried out with the top-performing DL models (DenseNet121, ResNet50, MobileNetV3-S, and EfficientNetB0), which were trained using pretrained weights (from ImageNet) while considering different DA strategies and optimization techniques in an attempt to perform multiclassification of skin lesions. Tables 15, 16, and 17 provide a summary of the testing and validation performance in terms of accuracy, F1-Score, and AUC, respectively. The results obtained demonstrate the efficacy of models trained with DA-3, DA-4, and DA-5 strategies for improving the test and validation performances. The best multiclass classification accuracy of 75.07% was recorded with the DenseNet121 architecture utilizing the RAdam optimizer and DA-3 strategy. The EfficientNet-B0 architecture attained the highest F1-Score and BACC of 69.47% and 68.54% with the Adam optimizer and DA-4 strategy. Similar performance was observed for the EfficientNet-B0 models in the validation set with F1-Score and BACC of 72.83% and 69.83%, respectively, considering Adam optimizer and DA-4 strategy, thereby demonstrating the consistency of these models in testing and validation sets.

Table 15. Multiclass classification accuracy of deep learning models with pretrained weights, trained on the training set with and without data augmentation (DA), evaluated on the testing and validation sets from the PADUFES-20 dataset.

Model	Optimizer	Original	DA-1	DA-2	DA-3	DA-4	DA-5
Testing set
DenseNet121	Adam	71.59	70.43	68.12	74.20	73.33	70.14
	RAdam	66.67	72.17	71.01	75.07	72.17	71.30
	AdamW	69.86	68.70	68.41	71.30	73.04	70.14
ResNet50	Adam	66.09	68.41	66.67	68.70	71.59	70.43
	RAdam	65.80	67.83	69.57	71.30	69.86	72.46
	AdamW	65.80	66.09	68.41	68.12	70.72	67.20
MobileNetV3-S	Adam	61.74	64.93	64.06	67.54	72.17	66.09
	RAdam	61.45	66.96	60.29	65.51	68.99	68.12
	AdamW	62.61	66.38	65.22	67.25	71.01	66.96
EfficientNet-B0	Adam	71.01	68.70	69.57	70.14	73.04	73.04
	RAdam	69.28	71.01	68.99	69.28	71.30	69.86
	AdamW	70.72	70.43	67.83	71.59	72.46	73.04

Validation set
DenseNet121	Adam	72.46	75.36	74.78	77.39	77.10	74.78
	RAdam	72.17	75.65	75.07	76.23	77.10	75.07
	AdamW	74.20	76.23	75.94	76.23	77.68	76.23
ResNet50	Adam	71.88	74.49	71.01	76.23	76.52	71.88
	RAdam	71.59	72.46	73.91	73.04	76.52	72.75
	AdamW	72.46	73.04	74.49	74.78	74.78	71.88
MobileNetV3-S	Adam	67.83	70.72	72.46	71.59	71.88	66.67
	RAdam	66.09	71.01	70.14	70.43	71.01	67.54
	AdamW	67.54	69.86	72.17	71.30	72.17	68.41
EfficientNet-B0	Adam	73.62	74.49	76.23	76.52	76.81	74.78
	RAdam	73.04	74.20	76.81	75.36	77.68	74.78
	AdamW	73.04	74.20	76.52	77.39	75.94	75.65

Note: Bolded values indicate the highest-performing results across the applied DA schemes.

Table 16. Multiclass classification F1-Score of deep learning models with pretrained weights, trained on the training set with and without data augmentation (DA), evaluated on the testing and validation sets from the PADUFES-20 dataset.

Model	Optimizer	Original	DA-1	DA-2	DA-3	DA-4	DA-5
Testing set
DenseNet121	Adam	62.74	60.33	63.88	66.49	66.89	65.29
	RAdam	58.60	64.18	60.86	65.85	67.03	64.48
	AdamW	59.48	60.15	57.99	60.61	65.22	65.93
ResNet50	Adam	58.01	61.92	59.43	58.74	65.28	65.61
	RAdam	58.75	61.45	60.81	66.91	61.20	66.49
	AdamW	62.87	55.25	57.73	60.43	62.54	58.23
MobileNetV3-S	Adam	56.10	57.00	53.54	59.75	63.31	62.73
	RAdam	56.42	63.04	51.91	58.53	63.12	64.99
	AdamW	56.81	61.42	54.62	56.76	65.40	62.93
EfficientNet-B0	Adam	64.80	62.77	61.20	61.43	69.47	66.45
	RAdam	57.12	58.95	57.91	62.10	63.31	60.01
	AdamW	64.38	61.49	63.76	64.52	64.48	65.78

Validation set
DenseNet121	Adam	64.51	70.91	68.25	70.46	66.52	66.58
	RAdam	65.86	61.98	69.79	63.66	69.00	64.40
	AdamW	68.82	65.39	62.28	65.78	67.80	68.72
ResNet50	Adam	63.64	62.62	65.12	65.59	70.78	63.16
	RAdam	63.85	63.13	63.85	61.52	66.36	62.86
	AdamW	66.05	60.19	63.86	61.19	66.84	61.77
MobileNetV3-S	Adam	58.51	63.63	59.55	60.14	62.78	61.88
	RAdam	57.83	61.74	55.67	59.83	61.81	60.36
	AdamW	50.23	60.63	57.66	62.06	64.06	60.99
EfficientNet-B0	Adam	69.67	67.30	64.68	67.53	72.38	69.52
	RAdam	64.48	65.58	69.13	68.83	72.30	71.61
	AdamW	68.05	68.83	68.55	68.29	66.02	69.39

Note: Bolded values indicate the highest-performing results across the applied DA schemes.

Table 17. Multiclass classification AUC of deep learning models with pretrained weights, trained on the training set with and without data augmentation (DA), evaluated on the testing and validation sets from the PADUFES-20 dataset.

Model	Optimizer	Original	DA-1	DA-2	DA-3	DA-4	DA-5
Testing set
DenseNet121	Adam	62.19	59.76	63.86	64.52	65.74	65.22
	RAdam	61.30	63.29	59.27	63.59	65.85	63.02
	AdamW	59.26	60.38	57.74	60.47	62.54	65.09
ResNet50	Adam	57.72	61.84	56.14	57.89	64.23	64.50
	RAdam	58.21	61.08	59.44	64.08	59.36	68.40
	AdamW	63.72	53.80	57.11	59.76	60.94	60.24
MobileNetV3-S	Adam	55.34	57.30	52.87	60.59	62.61	62.10
	RAdam	58.18	62.62	52.65	57.84	62.23	65.15
	AdamW	57.15	59.81	54.94	57.84	65.09	61.96
EfficientNet-B0	Adam	66.15	61.78	61.05	59.91	68.54	66.11
	RAdam	55.39	59.31	55.79	62.51	63.81	58.93
	AdamW	65.70	61.44	62.60	62.98	62.46	66.81

Validation set
DenseNet121	Adam	62.17	67.87	66.71	68.64	63.88	66.38
	RAdam	66.71	60.50	66.95	61.07	67.64	62.70
	AdamW	66.68	63.97	60.55	64.17	63.42	68.86
ResNet50	Adam	61.33	61.36	62.55	63.50	68.73	60.11
	RAdam	62.35	61.63	61.44	59.23	63.09	62.15
	AdamW	64.42	59.10	60.49	59.26	64.02	60.51
MobileNetV3-S	Adam	57.09	62.64	56.23	58.32	63.19	64.00
	RAdam	56.23	62.12	55.03	56.40	59.95	59.22
	AdamW	51.14	59.86	54.77	60.23	64.48	59.56
EfficientNet-B0	Adam	67.34	64.62	61.77	66.43	69.83	68.70
	RAdam	60.69	63.50	65.51	65.24	70.81	69.11
	AdamW	66.59	69.25	66.10	64.78	64.49	67.95

Note: Bolded values indicate the highest-performing results across the applied DA schemes.

The choice of model optimization algorithms also affected the performances of the investigated architectures when considering pretrained weights. The Adam optimizers yielded consistent performance across different metrics with DA-4 and DA-5 strategies providing superior performance for the EfficientNet architecture in F1-Score and BACC metrics. Highly competitive performance was also observed for RAdam and AdamW in these pretrained settings. The MobileNetV3-S showed performance improvements for the BACC metric reaching about 65.15% in testing set evaluations. The DenseNet121 and EfficientNetB0 architectures outperformed others in the multiclass classification experiments involving pretrained weights. DenseNet121 attained the highest accuracy in the testing set evaluations under DA-3 and DA-5 strategies with the RAdam and Adam optimizers indicating the suitability of this architecture for multiclass classification on clinical lesion images. The deep and densely connected layers of the DenseNet model provided better feature extraction and representation in the multiclass settings.

4.2.3. Statistical Analyses

Statistical tests were conducted using the paired t-test (Eqn) and the Welch t-test (Eqn) to compare the performance of the investigated models on the testing dataset. We observe the statistical difference between the original (without any DA) and other DA schemes in each model architecture over different model optimizer instances. The best overall accuracy for the DenseNet-121 (=63.50%) trained from scratch was obtained in the DA-3 scheme with the RAdam optimizer. Statistical tests comparing the performance of the DenseNet121 in DA-3 with the instance without DA revealed p values of 0.015 and 0.0003, respectively, for paired and Welch t-tests. The ResNet-50 model trained without pretrained weights gave an overall best accuracy of 62.00% with the DA-5 scheme and AdamW optimizer. Statistical comparison of the results obtained for the ResNet-50 in DA-5 schemes with the instance without DA gave p values of 0.01 and 0.02, respectively, for paired and Welch t-tests, demonstrating high levels of statistical difference. The EfficientNet-b0 trained without pretrained weights gave an overall best accuracy of 60.3% with the DA-4 scheme and Adam optimizer. The measured p values on paired t-test and Welch tests were 0.013 and 0.012, respectively, indicating high levels of statistical difference between DA-4 and instances without DA for EfficientNet-b0. MobileNet-V3S, trained without pretrained weights, gave an overall best accuracy of 56.40 with the DA-4 scheme and Adam optimizer. The statistical tests measuring the difference between the DA-4 scheme and instances without DA for MobileNet-V3S gave p values of 0.02 and 0.0018, respectively, for paired and Welch t-tests.

The pretrained DenseNet121 model obtained an overall best accuracy of 75.07% with the DA-3 scheme and RAdam optimizer. Statistical comparison of the DA-3 scheme scenario for DenseNet121 with the instance without DA gave p values of 0.19 and 0.09, respectively, for paired and Welch t-tests. The overall best accuracy (=72.46%) obtained for the pretrained ResNet-50 architecture was obtained with the DA-5 scheme and the RAdam optimizer. Statistical comparison of the DA-5 scenario with the instance without DA for ResNet-50 gave p values of 0.11 for both paired and Welch t-tests. DA-4 also produced better performance for the pretrained ResNet50 architecture and gave p values of 0.007 and 0.008 compared to the instances without DA, indicating a high statistical difference. MobileNet-V3S trained using pretrained weights recorded the best accuracy values across all optimizers with the DA-4 scheme. Statistical comparisons of MobileNet-V3S on DA-4 with the instance without DA gave p values of 0.009 and 0.005, respectively, showing high statistical differences. The pretrained EfficientNet-b0 obtained an overall best accuracy of 73.04 in DA-4 and DA-5 schemes. Statistical comparison for EfficientNet-b0 models in DA-4 scheme with the instance without DA gave p values of 0.002 and 0.05, respectively, for paired and Welch t-tests, indicating significant levels of statistical difference.

5. Discussion

5.1. Interpretability

One of our research goals is to understand the rationality behind how the selected DL methods arrive at a particular decision. We conducted a comparative analysis of three interpretable algorithms. These algorithms analyze DL models and produce saliency maps on a clinician’s skin images by identifying regions indicating skin lesion conditions. In the subsequent subsections, we offer a concise overview of the operational framework of our newly developed interpretable DL platform, which predicts and localizes the presence of skin lesions in clinical images for both binary and multiclassification scenarios. In the web platform, only the CAM-based methods were made available because the SHAP algorithm requires lots of computational resources and hence takes more time to produce the explainable maps.

5.1.1. SHAP Assessment on Binary and Multiclass Tasks

In this section, we provide detailed explainability discussions using feature attributions from the SHAP algorithm. The discussion in this section is premised on the DenseNet161 model trained with the DA-4 strategy for binary classification and DenseNet121 model trained with the DA-5 strategy for multiclass classification. The SHAP plots are produced for these models and trained from scratch using pretrained weights with Adam, AdamW, and RAdam optimizers. The models are analyzed using instances where correct and incorrect predictions were obtained. Figures 8, 9, and 10 provide the binary classification SHAP visualizations of the feature attributions for the DenseNet161 in pretrained and random weight settings using Adam, AdamW, and RAdam optimizers, respectively. In Figure 8, for the correct predictions, Adam-optimized models focus on the irregular borders and pigmentation, as highlighted by the red regions, for malignant predictions. For benign predictions, the model highlights the smoother areas characterized by uniform texture and coloration. Wrong predictions were obtained when the model incorrectly highlighted noncritical regions of images and areas, particularly displaying malignant features. The AdamW-optimized model highlights lesion border areas and texture irregularities as critical features for malignancy as shown in Figure 9. The lack of color variations and uniform textures was highlighted for correct benign predictions in the AdamW-optimized model. For incorrect predictions, the model attributes less significant regions and minor artifacts, failing to give significant weights to malignant features. Furthermore, it appears the model identifies elevation as a predominant feature for obtaining malignant predictions, leading to misclassifications in Figures 9(b) and 9(d). In Figure 10, the RAdam-optimized model highlights the lesion edges and heterogeneity for distinguishing malignant and benign lesions. Although the model highlights the lesion areas in the incorrect predictions, it fails to interpret the features correctly to obtain either malignant or benign predictions.

Figures 11, 12, and 13 provide the multiclass classification SHAP visualizations of the feature attributions for the DenseNet161 in pretrained and random weight settings using Adam, AdamW, and RAdam optimizers, respectively. In Figure 11, the Adam-optimized DenseNet-121 with pretrained weights focuses on the darker and irregular regions for correct BCC predictions. Uniform pigmentation and well-defined borders were highlighted as attributions for NEV prediction, while melanoma attributes were due to uneven colors and irregular borders. Incorrect predictions of SCC were due to the attribution of surface irregularities and color variations shared with BCC and ACK conditions. The AdamW-optimized model in Figure 12 highlights malignant features showing irregular pigmentation and border irregularities to obtain accurate predictions for MEL conditions. Characteristic features showing raised edges and pearly colors were attributed to BCC predictions. Incorrect predictions in this case were primarily due to the model’s feature attributions that were not definitive for the correct class. Accurate predictions for MEL conditions in Figure 13 for the RAdam-optimized model were due to features distinctive of asymmetry and color variations. Likewise, the model focuses on distinctive features with irregular textures and pigmentations for ACK and BCC. Incorrect predictions, especially in the case of NEV misclassified as MEL, were due to emphasis on features that were not indicative of malignant lesions.

5.1.2. CAM Interpretability Assessment on a Multiclassification Task

The operational functionality of our developed system is showcased in Figure 14. This explainable DL system demonstrates the capability to predict skin lesions, suggest the specific type of skin lesion (malignant or benign), and pinpoint the precise region of interest containing the lesion (tumor). Suppose a user of the developed interpretable DL platform chooses the multiclass skin lesion predictive system menu tab; upon uploading the skin image and selecting DenseNet121 and then clicking on the Predict button, the user activates the DL engine to classify the skin image to determine the target class from the six possible class categories (displaying the extent of certainty). The result demonstrates the system effectively predicted the provided skin image containing BCC, a type of malignant tumor, revealing a certainty score of 100%. Conversely, the confidence scores for other classes (SEK, NEV, ACK, SCC, and MEL) were 0%. Consequently, we chose the interpretable scheme (GradCAM) from the available options and then clicked on the interpret button. The interpretability analysis showed that the BCC can be found in the middle portion of the clinical skin image. We conducted another assessment of DenseNet121 using the other interpretable algorithms (AblationCAM and GradCAM++). Figure 15 demonstrates that all interpretable algorithms concurred and identified nearly identical attention regions (localized portions) in the skin containing BCC. This comprehensive analysis is crucial for medical experts to establish confidence before proffering treatment protocols.

5.1.3. CAM Interpretability Assessment on a Binary Classification Task

We conducted a detailed examination of the operational features within our developed system’s binary skin lesion predictive system menu, as depicted in Figure 16. Suppose a user of the developed interpretable DL platform chooses the binary skin lesion predictive system menu tab; upon uploading the skin image and selecting MobileNet-Large (MobileNetV3-L) and then clicking on the Predict button, the user activates the DL engine to classify the skin image to determine the target class from the two possible class categories (displaying the extent of certainty). The result demonstrates the system accurately predicted the provided skin image containing a benign tumor, achieving a confidence score of 100%. Conversely, the confidence score for a malignant tumor is 0%. Furthermore, the explainability analysis suggests that the benign tumor is located at the upper left and lower middle regions of the clinical skin image. We replicated the assessment using the other interpretable algorithms and present the resulting observations in Figure 17. The figure shows that all the interpretable algorithms agreed and identified nearly identical attention regions (localized portions) in the skin containing benign tumors. Consequently, this attention region influenced MobileNetV3-L’s decision to classify the uploaded skin image as depicting a benign tumor. Such thorough analysis is essential for medical experts to devise treatment protocols confidently.

5.2. Limitations

In our study, we carried out a detailed and comprehensive analysis of diverse DL models for binary and multiclass classification. We investigated different DA schemes and model optimizers for the selected DL models. Furthermore, we provided a detailed explainability analysis with CAM and SHAP algorithms. While our study provides some novel insights into the performance of these DL models with different DA and optimization algorithms, some limitations need to be addressed.

•
Dataset: While the investigated dataset, PADUFES-20 [5], produced some promising results in binary classification, the limited performance in the multiclass classification is due to the limited samples present in the dataset. Although DA strategies have been proven in our study to improve performance on this dataset, the MEL class in the PADUFES-20 dataset has only 52 samples in total. This prevents the models from learning adequate features for classifying MEL in multiclass classification. Likewise, the SCC class has only 192 samples in the dataset, and this limitation also affected the model’s performance in correctly identifying lesions from this class in the testing set. Our study has shown the potential of improving the performance of DL models in lesion classification from clinical images when the right augmentation strategies are applied. Research advocacy for building larger databases of clinical lesion images, similar to the large dermoscopy lesion archives, is needed to further drive research in this area and improve model performance.
•
DA: In our study, we investigate five DA strategies, including only one color DA scheme. Further studies can be conducted using more sophisticated color augmentation strategies for simulating performance under different lighting conditions. Additionally, color augmentations can be carefully designed to build models immune to skin tone variations and bias.
•
Model architectures: Our study focuses on four families of CNN architectures, namely, DenseNet, ResNet, MobileNet, and EfficientNet. It may be worth investigating other CNN architectures on this dataset. Furthermore, this study can be replicated with different vision transformer variants.
•
Optimization schemes: Due to limited computational resources, we investigated and reported on only three model optimization algorithms: Adam, RAdam, and AdamW. Further investigations with other optimization algorithms might be useful to the medical dermatology research community.
•
Explainability/interpretability: Our interpretability analyses were limited to qualitative discussions due to the absence of ground truth bounding boxes in the lesion dataset. We intend to conduct further quantitative interpretability analysis after engaging qualified experts in the medical community for ground truth labeling on the PADUFES-20 dataset.

6. Conclusion

This article has demonstrated the development and deployment of an affordable, interpretable DL system for binary or multiclassification of lesions from clinical skin images. The interpretable DL system employed XAI algorithms, which analyzed the DL model inferences by understanding the different pixel-level feature representations and subsequently mapping these learned representations to predict target classes for clinician skin images from a dataset—treated as either binary or multiclass of the skin lesion dataset. Additionally, the study evaluates DL models trained with three variants of the Adam optimizer (RAdam, AdamW, and Adam) with randomly initialized weights. We explore the impact of various DA techniques on model performance evaluation and trained 13 DL models on original data.

Afterward, we selected the 4 best-performing model instances from the 13 DL architectures and considered five DA schemes to enhance model performance. To gain insights into the decision-making processes of these DL methods, we utilize interpretable algorithms that leverage the CAM and SHAP principles to identify critical weights influencing the decision made by the DL models, thereby producing localization maps of skin lesion regions (heatmaps of lesion or areas containing tumor in the medical images). We employed DL models and interpretable algorithms to create an interactive web application interface deployed via the Gradio framework. This pilot-scale application can be a decision support tool for dermatologists and clinicians.

The results from our experimental analysis demonstrate that the instance of DenseNet when trained with appropriate DA schemes and optimized with suitable optimizer outperforms most other DL approaches across various evaluation metrics in binary and multiclass challenges. We illustrate how explainable algorithms elucidate the rationale behind DL model decisions by highlighting attention regions containing lesions. This study indicates that offline DA can significantly enhance DL techniques, improving classification performance compared to models trained solely on original clinician skin images. This indicates that employing DA techniques, factoring in spatial transformations or combinatorial effects of different DA schemes, aided the DL model in effectively generalizing to new testing examples. Our proposed interactive interpretable DL system supports three primary functions: predictive capability, the ability to perform recommendations, and saliency map generation.

Future research could explore the potential of using the vision transformer model to predict skin lesions on a larger dataset of clinician skin lesions, covering a broader range of class categories.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This research was funded by the Saudi Data & AI Authority (SDAIA) and King Fahd University of Petroleum & Minerals (KFUPM) under SDAIA-KFUPM Joint Research Center for Artificial Intelligence (JRCAI) Grant JRCAI-UCG-04.

Acknowledgments

The authors would like to acknowledge the support provided by SDAIA-KFUPM Joint Research Center for Artificial Intelligence at King Fahd University of Petroleum & Minerals.

Open Research

Data Availability Statement

Data are available on request.

References

1 Esteva A., Kuprel B., Novoa R. A. et al., Dermatologist-Level Classification of Skin Cancer With Deep Neural Networks, Nature. (2017) 542, no. 7639, 115–118, https://doi.org/10.1038/nature21056, 2-s2.0-85016143105.
10.1038/nature21056
CAS PubMed Web of Science® Google Scholar
2 Le D. N. T., Le H. X., Ngo L. T., and Ngo H. T., Transfer Learning With Class-Weighted and Focal Loss Function for Automatic Skin Cancer Classification, 2020.
Google Scholar
3 Ha Q., Liu B., and Liu F., Identifying Melanoma Images Using Efficientnet Ensemble: Winning Solution to the Siim-Isic Melanoma Classification Challenge, http://arxiv.org/abs/2010.05351.
Google Scholar
4 Brinker T. J., Hekler A., Enk A. H. et al., Deep Neural Networks are Superior to Dermatologists in Melanoma Image Classification, European Journal of Cancer. (2019) 119, 11–17, https://doi.org/10.1016/j.ejca.2019.05.023, 2-s2.0-85069046206.
10.1016/j.ejca.2019.05.023
PubMed Web of Science® Google Scholar
5 Pacheco A. G. and Krohling R. A., The Impact of Patient Clinical Information on Automated Skin Cancer Detection, Computers in Biology and Medicine. (2020) 116, no. 1, https://doi.org/10.1016/j.compbiomed.2019.103545.
10.1016/j.compbiomed.2019.103545
PubMed Google Scholar
6 Hernández-Pérez C., Combalia M., Podlipnik S. et al., BCN20000: Dermoscopic Lesions in the Wild, Scientific Data. (2024) 11, no. 1, https://doi.org/10.1038/s41597-024-03387-w.
10.1038/s41597-024-03387-w
PubMed Google Scholar
7 Tschandl P., Rosendahl C., and Kittler H., Data Descriptor: The Ham10000 Dataset, a Large Collection of Multi-Source Dermatoscopic Images of Common Pigmented Skin Lesions, Scientific Data. (2018) 5, no. 1, https://doi.org/10.1038/sdata.2018.161, 2-s2.0-85051787357.
10.1038/sdata.2018.161
PubMed Google Scholar
8 Codella N. C. F., Gutman D., Celebi M. E. et al., Skin Lesion Analysis toward Melanoma Detection: A Challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), Hosted by the International Skin Imaging Collaboration (Isic), 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), 2018, 168–172, https://doi.org/10.1109/ISBI.2018.8363547, 2-s2.0-85048103617.
10.1109/ISBI.2018.8363547
Google Scholar
9 International Skin Imaging Collaboration, SIIM-ISIC 2020 Challenge Dataset, 2020, International Skin Imaging Collaboration.
Google Scholar
10 Mendonca T., Ferreira P. M., Marques J. S., Marcal A. R. S., and Rozeira J., PH²-A Dermoscopic Image Database for Research and Benchmarking, Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2013, 5437–5440, https://doi.org/10.1109/embc.2013.6610779, 2-s2.0-84886463000.
10.1109/embc.2013.6610779
Google Scholar
11 Jojoa Acosta M. F., Caballero Tovar L. Y., Garcia-Zapirain M. B., and Percybrooks W. S., Melanoma Diagnosis Using Deep Learning Techniques on Dermatoscopic Images, BMC Medical Imaging. (2021) 21, no. 1, https://doi.org/10.1186/s12880-020-00534-8.
10.1186/s12880-020-00534-8
PubMed Web of Science® Google Scholar
12 Chatterjee S., Dey D., and Munshi S., Integration of Morphological Preprocessing and Fractal Based Feature Extraction With Recursive Feature Elimination for Skin Lesion Types Classification, Computer Methods and Programs in Biomedicine. (2019) 178, 201–218, https://doi.org/10.1016/j.cmpb.2019.06.018, 2-s2.0-85068269371.
10.1016/j.cmpb.2019.06.018
PubMed Google Scholar
13 Mahbod A., Tschandl P., Langs G., Ecker R., and Ellinger I., The Effects of Skin Lesion Segmentation on the Performance of Dermatoscopic Image Classification, Computer Methods and Programs in Biomedicine. (2020) 197, https://doi.org/10.1016/j.cmpb.2020.105725.
10.1016/j.cmpb.2020.105725
PubMed Google Scholar
14 Carcagni P., Leo M., Cuna A. et al., Classification of Skin Lesions by Combining Multilevel Learnings in a Densenet Architecture, Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2019, 11751, Springer Verlag, 335–344, https://doi.org/10.1007/978-3-030-30642-7_30, 2-s2.0-85072977411.
10.1007/978-3-030-30642-7_30
Google Scholar
15 Batista L. G., Bugatti P. H., and Saito P. T., Classification of Skin Lesion Through Active Learning Strategies, Computer Methods and Programs in Biomedicine. (2022) 226, https://doi.org/10.1016/j.cmpb.2022.107122.
10.1016/j.cmpb.2022.107122
PubMed Google Scholar
16 Hekler A., Kather J. N., Krieghoff-Henning E. et al., Effects of Label Noise on Deep Learning-Based Skin Cancer Classification, Frontiers of Medicine. (2020) 7, no. 5, https://doi.org/10.3389/fmed.2020.00177.
10.3389/fmed.2020.00177
Google Scholar
17 Hasan M. K., Dahal L., Samarakoon P. N., Tushar F. I., and Martí R., DSNeT: Automatic Dermoscopic Skin Lesion Segmentation, Computers in Biology and Medicine. (2020) 120, no. 5, https://doi.org/10.1016/j.compbiomed.2020.103738.
10.1016/j.compbiomed.2020.103738
PubMed Google Scholar
18 Lee Y. C., Jung S.-H., and Won H.-H., WonDerM: Skin Lesion Classification With Fine-Tuned Neural Networks, 2018.
Google Scholar
19 Cassidy B., Kendrick C., Brodzicki A., Jaworek-Korjakowska J., and Yap M. H., Analysis of the ISIC Image Datasets: Usage, Benchmarks and Recommendations, Medical Image Analysis. (2022) 75, no. 1, https://doi.org/10.1016/j.media.2021.102305.
10.1016/j.media.2021.102305
PubMed Google Scholar
20 Okafor E., Schomaker L., and Wiering M. A., An Analysis of Rotation Matrix and Colour Constancy Data Augmentation in Classifying Images of Animals, Journal of Information and Telecommunication. (2018) 2, no. 4, 465–491, https://doi.org/10.1080/24751839.2018.1479932.
10.1080/24751839.2018.1479932
Google Scholar
21 Mendes C. F. S. d. F. and Krohling R. A., Deep and Handcrafted Features From Clinical Images Combined With Patient Information for Skin Cancer Diagnosis, Chaos, Solitons & Fractals. (2022) 162, https://doi.org/10.1016/j.chaos.2022.112445.
10.1016/j.chaos.2022.112445
Google Scholar
22 Oztel I., Yolcu Oztel G., and Sahin V. H., Deep Learning-Based Skin Diseases Classification Using Smartphones, Advanced Intelligent Systems. (2023) 5, no. 12, https://doi.org/10.1002/aisy.202300211.
10.1002/aisy.202300211
Google Scholar
23 Ou C., Zhou S., Yang R. et al., A Deep Learning Based Multimodal Fusion Model for Skin Lesion Diagnosis Using Smartphone Collected Clinical Images and Metadata, Frontiers in Surgery. (2022) 9, https://doi.org/10.3389/fsurg.2022.1029991.
10.3389/fsurg.2022.1029991
Google Scholar
24 Medhat S., Abdel-Galil H., Aboutabl A. E., and Saleh H., Iterative Magnitude Pruning-Based Light-Version of Alexnet for Skin Cancer Classification, Neural Computing & Applications. (2023) 36, no. 3, 1413–1428, https://doi.org/10.1007/s00521-023-09111-w.
10.1007/s00521-023-09111-w
Google Scholar
25 Qin Z., Liu Z., Zhu P., and Xue Y., A GAN-based Image Synthesis Method for Skin Lesion Classification, Computer Methods and Programs in Biomedicine. (2020) 195, https://doi.org/10.1016/j.cmpb.2020.105568.
10.1016/j.cmpb.2020.105568
PubMed Google Scholar
26 Perez F., Vasconcelos C., Avila S., and Valle E., Data Augmentation for Skin Lesion Analysis, 2018, Springer International Publishing.
10.1007/978-3-030-01201-4_33
Google Scholar
27 Groh M., Harris C., Soenksen L. et al., Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology With the Fitzpatrick 17k Dataset, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021, IEEE, https://doi.org/10.1109/cvprw53098.2021.00201.
10.1109/cvprw53098.2021.00201
Google Scholar
28 Vasconcelos C. N. and Vasconcelos B. N., Experiments Using Deep Learning for Dermoscopy Image Analysis, Pattern Recognition Letters. (2020) 139, 95–103, https://doi.org/10.1016/j.patrec.2017.11.005, 2-s2.0-85035799257.
10.1016/j.patrec.2017.11.005
Google Scholar
29 Lundberg S. M. and Lee S.-I., A Unified Approach to Interpreting Model Predictions, Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 2017, Red Hook, NY, Curran Associates Inc, 4768–4777.
Google Scholar
30 Chen H., Lundberg S. M., and Lee S.-I., Explaining a Series of Models by Propagating Shapley Values, Nature Communications. (2022) 13, no. 1, https://doi.org/10.1038/s41467-022-31384-3.
10.1038/s41467-022-31384-3
Google Scholar
31 Lundberg S. M., Erion G., Chen H. et al., From Local Explanations to Global Understanding With Explainable AI for Trees, Nature Machine Intelligence. (2020) 2, no. 1, 56–67, https://doi.org/10.1038/s42256-019-0138-9.
10.1038/s42256-019-0138-9
PubMed Web of Science® Google Scholar
32 Covert I. and Lee S.-I., A. Banerjee and K. Fukumizu, Improving Kernelshap: Practical Shapley Value Estimation Using Linear Regression, AISTATS, Vol. 130 of Proceedings of Machine Learning Research, PMLR, 2021, 3457–3465.
Google Scholar
33 Ribeiro M. T., Singh S., and Guestrin C., Why Should I Trust You?: Explaining the Predictions of Any Classifier, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’16, 2016, ACM, https://doi.org/10.1145/2939672.2939778, 2-s2.0-84984985889.
10.1145/2939672.2939778
Google Scholar
34 Shrikumar A., Greenside P., and Kundaje A., Learning Important Features Through Propagating Activation Differences, Proceedings of the 34th International Conference on Machine Learning-Volume 70, ICML’17, 2017, 3145–3153.
Google Scholar
35 Selvaraju R. R., Cogswell M., Das A., Vedantam R., Parikh D., and Batra D., Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization, 2017 IEEE International Conference on Computer Vision (ICCV), 2017, 618–626, https://doi.org/10.1109/iccv.2017.74, 2-s2.0-85041910265.
10.1109/iccv.2017.74
Google Scholar
36 Chattopadhay A., Sarkar A., Howlader P., and Balasubramanian V. N., Grad-Cam++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, IEEE, 839–847.
Google Scholar
37 Ramaswamy H. G., Ablation-CAM: Visual Explanations for Deep Convolutional Network via Gradient-Free Localization, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, 983–991.
Google Scholar
38 Attallah O., Skin-CAD: Explainable Deep Learning Classification of Skin Cancer From Dermoscopic Images by Feature Selection of Dual High-Level CNNs Features and Transfer Learning, Computers in Biology and Medicine. (2024) 178, https://doi.org/10.1016/j.compbiomed.2024.108798.
10.1016/j.compbiomed.2024.108798
PubMed Google Scholar
39 Cerekci E., Alis D., Denizoglu N. et al., Quantitative Evaluation of Saliency-Based Explainable Artificial Intelligence (Xai) Methods in Deep Learning-Based Mammogram Analysis, European Journal of Radiology. (2024) 173, https://doi.org/10.1016/j.ejrad.2024.111356.
10.1016/j.ejrad.2024.111356
PubMed Google Scholar
40 Hosny K. M., Said W., Elmezain M., and Kassem M. A., Explainable Deep Inherent Learning for Multi-Classes Skin Lesion Classification, Applied Soft Computing. (2024) 159, https://doi.org/10.1016/j.asoc.2024.111624.
10.1016/j.asoc.2024.111624
Google Scholar
41 Nayak T., Chadaga K., Sampathila N. et al., Deep Learning Based Detection of Monkeypox Virus Using Skin Lesion Images, Medicine in Novel Technology and Devices. (2023) 18, https://doi.org/10.1016/j.medntd.2023.100243.
10.1016/j.medntd.2023.100243
PubMed Google Scholar
42 Sobahi N., Atila O., Deniz E., Sengur A., and Acharya U. R., Explainable COVID-19 Detection Using Fractal Dimension and Vision Transformer With Grad-CAM on Cough Sounds, Biocybernetics and Biomedical Engineering. (2022) 42, no. 3, 1066–1080, https://doi.org/10.1016/j.bbe.2022.08.005.
10.1016/j.bbe.2022.08.005
PubMed Web of Science® Google Scholar
43 Ozer C., Guler A., Cansever A. T., and Oksuz I., Explainable Image Quality Assessment for Medical Imaging, 2023, https://doi.org/10.48550/ARXIV.2303.14479.
10.48550/ARXIV.2303.14479
Google Scholar
44 Hassan E., Shams M. Y., Hikal N. A., and Elmougy S., The Effect of Choosing Optimizer Algorithms to Improve Computer Vision Tasks: A Comparative Study, Multimedia Tools and Applications. (2022) 82, no. 11, 16591–16633, https://doi.org/10.1007/s11042-022-13820-0.
10.1007/s11042-022-13820-0
PubMed Google Scholar
45 Wang L., Lin Z. Q., and Wong A., COVID-net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases From Chest X-Ray Images, Scientific Reports. (2020) 10, no. 1, https://doi.org/10.1038/s41598-020-76550-z.
10.1038/s41598-020-76550-z
Web of Science® Google Scholar
46 Sharma P. and Anand R. S., A Comprehensive Evaluation of Deep Models and Optimizers for Indian Sign Language Recognition, Graphics and Visual Computing. (2021) 5, https://doi.org/10.1016/j.gvc.2021.200032.
10.1016/j.gvc.2021.200032
Google Scholar
47 Lee S. and Kim T., Impact of Deep Learning Optimizers and Hyperparameter Tuning on the Performance of Bearing Fault Diagnosis, IEEE Access. (2023) 11, 55046–55070, https://doi.org/10.1109/access.2023.3281910.
10.1109/ACCESS.2023.3281910
Google Scholar
48 He K., Zhang X., Ren S., and Sun J., Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, IEEE, https://doi.org/10.1109/cvpr.2016.90, 2-s2.0-84986274465.
10.1109/cvpr.2016.90
Google Scholar
49 Huang G., Liu Z., Van Der Maaten L., and Weinberger K. Q., Densely Connected Convolutional Networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, IEEE, https://doi.org/10.1109/cvpr.2017.243, 2-s2.0-85035343801.
10.1109/cvpr.2017.243
Google Scholar
50 Tan M. and Le Q., EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, International Conference on Machine Learning, 2019, PMLR, 6105–6114.
Google Scholar
51 Howard A. G., Zhu M., Chen B. et al., MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, 2017, https://doi.org/10.48550/ARXIV.1704.04861.
10.48550/ARXIV.1704.04861
Google Scholar
52 Kinga D. and Adam J. B., A Method for Stochastic Optimization, 5, International Conference on Learning Representations (ICLR), 2015, San Diego, California.
Google Scholar
53 Liu L., Jiang H., He P. et al., On the Variance of the Adaptive Learning Rate and Beyond, 2019.
Google Scholar
54 Loshchilov I. and Hutter F., Decoupled Weight Decay Regularization, International Conference on Learning Representations, 2017.
Google Scholar
55 Zhou B., Khosla A., Lapedriza A., Oliva A., and Torralba A., Learning Deep Features for Discriminative Localization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, 2921–2929.
Google Scholar
56 Hroub N. A., Alsannaa A. N., Alowaifeer M., Alfarraj M., and Okafor E., Explainable Deep Learning Diagnostic System for Prediction of Lung Disease From Medical Images, Computers in Biology and Medicine. (2024) 170, https://doi.org/10.1016/j.compbiomed.2024.108012.
10.1016/j.compbiomed.2024.108012
PubMed Google Scholar

All articles