International Journal of Intelligent Systems

Volume 2025, Issue 1 6463038

Research Article

Open Access

Software Defect Prediction Based on Fuzzy Cost Broad Learning System

Heling Cao

orcid.org/0000-0002-0610-9061

Key Laboratory of Grain Information Processing and Control, Henan University of Technology, Ministry of Education, Zhengzhou, China

Henan Key Laboratory of Grain Photoelectric Detection and Control, Henan University of Technology, Zhengzhou, China, haut.edu.cn

College of Information Science and Engineering, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Center for Complexity Science, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Zhiying Cui

orcid.org/0009-0007-1551-9736

Key Laboratory of Grain Information Processing and Control, Henan University of Technology, Ministry of Education, Zhengzhou, China

Henan Key Laboratory of Grain Photoelectric Detection and Control, Henan University of Technology, Zhengzhou, China, haut.edu.cn

College of Information Science and Engineering, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Center for Complexity Science, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Yonghe Chu

orcid.org/0000-0002-5756-8356

Key Laboratory of Grain Information Processing and Control, Henan University of Technology, Ministry of Education, Zhengzhou, China

Henan Key Laboratory of Grain Photoelectric Detection and Control, Henan University of Technology, Zhengzhou, China, haut.edu.cn

College of Information Science and Engineering, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Center for Complexity Science, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Lina Gong

orcid.org/0000-0002-5272-6706

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China, nuaa.edu.cn

Guangen Liu

orcid.org/0009-0005-0428-0569

Key Laboratory of Grain Information Processing and Control, Henan University of Technology, Ministry of Education, Zhengzhou, China

Henan Key Laboratory of Grain Photoelectric Detection and Control, Henan University of Technology, Zhengzhou, China, haut.edu.cn

College of Information Science and Engineering, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Center for Complexity Science, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Yun Wang

orcid.org/0000-0001-6681-4187

Key Laboratory of Grain Information Processing and Control, Henan University of Technology, Ministry of Education, Zhengzhou, China

Henan Key Laboratory of Grain Photoelectric Detection and Control, Henan University of Technology, Zhengzhou, China, haut.edu.cn

College of Information Science and Engineering, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Center for Complexity Science, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Fangchao Tian

orcid.org/0009-0003-1125-0099

Key Laboratory of Grain Information Processing and Control, Henan University of Technology, Ministry of Education, Zhengzhou, China

Henan Key Laboratory of Grain Photoelectric Detection and Control, Henan University of Technology, Zhengzhou, China, haut.edu.cn

College of Information Science and Engineering, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Center for Complexity Science, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Corresponding Author

Peng Li

orcid.org/0009-0001-8437-7408

Center for Complexity Science, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Haoyang Ge

orcid.org/0009-0000-7888-860X

Key Laboratory of Grain Information Processing and Control, Henan University of Technology, Ministry of Education, Zhengzhou, China

Henan Key Laboratory of Grain Photoelectric Detection and Control, Henan University of Technology, Zhengzhou, China, haut.edu.cn

College of Information Science and Engineering, Henan University of Technology, Zhengzhou, China, haut.edu.cn

Center for Complexity Science, Henan University of Technology, Zhengzhou, China, haut.edu.cn

First published: 11 March 2025

https://doi.org/10.1155/int/6463038

Academic Editor: Vasudevan Rajamohan

Abstract

Software defect prediction (SDP) is an effective approach to ensure software reliability. Machine learning models have been widely employed in SDP, but they often ignore the impact of class imbalance, noise and outliers on prediction performance. This study proposes a fuzzy cost broad learning system (FC-BLS). FC-BLS not only handles class imbalance but also considers the specific sample distribution to address noise and outliers in software defect datasets. Our approach introduces a cost matrix and fuzzy membership functions into BLS. The cost matrix prioritises the training errors on the minority samples, so that the classification hyperplane is positioned more reasonably. The fuzzy membership functions calculate the membership degree of each sample in the feature mapping space to reduce the prediction error caused by noise and outlier samples. The optimisation problem is then constructed on the idea that minority-class and normal instances carry relatively high costs, whereas majority-class, noise and outlier instances carry relatively small costs. We conducted experiments on nine NASA SDP datasets, and the experimental findings demonstrated the effectiveness of the proposed methodology on most datasets.

1. Introduction

As software engineering continues to develop, research on software defects has gradually become a prominent topic in the field of software reliability [1], making software defect prediction (SDP) one of the fastest-growing and most significant technologies. The SDP technology is designed to predict modules in software systems that are susceptible to defects, thereby allocating testing resources more effectively [2]. Consequently, an effective SDP can lead to cost savings in testing, enhance software quality [3] and positively affect human beings and society [4].

In our previous studies [5, 6], we investigated the phenomenon of class overlap, where defective and nondefective samples exhibited similarities in metric values, and assessed its influence on SDP performance. Additionally, we addressed class imbalance in both within-project and cross-project scenarios by proposing a method called STr-NN, with experimental results demonstrating its effectiveness. Moreover, studying the influence of classification techniques on SDP performance is crucial. Therefore, a novel classification technique suitable for SDP is required.

Currently, machine learning techniques, such as nearest neighbour [7], support vector machine (SVM) [8], neural network [9], logistic regression [10], ensemble methods [11, 12] and extreme learning machine (ELM) [13], are the primary classification technology in traditional SDP [14, 15]. The primary processes include data collection, data processing, model training and model evaluation. Moreover, many variants of machine learning models have been used to improve the accuracy of SDP. For example, Zada et al. [16] combined the war strategy optimisation (WSO) algorithm and kernel extreme learning machine (KELM) for SDP to optimise classifier hyperparameters and consequently elevate defect detection efficiency within software components. However, because SDP data are often not linearly separable, traditional classification algorithms struggle to adequately represent high-dimensional and sparse software defect data, leading to underfitting. Therefore, many deep learning methods have been utilised in SDP. Unlike conventional machine learning, deep learning can automatically learn and extract discriminative features from data, leading to a more accurate SDP model. For instance, Šikić et al. [17] utilised a graph convolutional neural network (GCNN) as the underlying framework to obtain characteristics from an abstract syntax tree and improve the classification performance. Chakraborty and Chakraborty [18] presented a deep feedforward neural network with an internal hierarchy, namely the Hellinger net, which is trained with a stochastic gradient descent backpropagation algorithm.

Despite the excellent capability of deep learning methods to represent data features, most methods exhibit a significant number of parameters and complex structures. The training process of these models usually entails layer-wise unsupervised pre-training and overall supervised fine-tuning, making the model optimisation process cumbersome. Moreover, deep learning models involve solving highly nonconvex optimisation problems, making it challenging to conduct a theoretical analysis of deep architectures. Currently, most efforts are focused on tuning parameters or adding more layers to enhance accuracy. To alleviate the challenges above, Chen and Liu [19] proposed a broad learning system (BLS) that extends neurones in a broad manner, rather than being confined to deep stacked layers. The output weights are then computed using a pseudoinverse. Compared to popular deep learning models, BLS can handle large-scale data and can be applied to incremental learning models without retraining the entire model when adding new nodes.

For these reasons, we adopt BLS as the fundamental classifier for SDP owing to its structural simplicity, few training parameters and rapid training. Nevertheless, conventional BLS cannot address the effects of class imbalance, outliers and noise that may be present in datasets. Class imbalance, which is widespread in SDP data, can cause BLS to overlook the potentially more valuable minority class. Noise and outliers are also present in software defect data: certain training points are corrupted by noise, and BLS may overfit the outliers, thereby affecting the classification results for normal data. Conventional BLS relies primarily on the distinguishability of different sample classes and tends to overlook variations in sample quantities across classes and the specific sample distribution in the feature space.

We propose a fuzzy cost broad learning system for SDP (FC-BLS) to overcome these shortcomings. Our research contributions can be summarised as follows.

1. We propose a new variant of the BLS, namely, FC-BLS. This model improves the loss function of BLS by employing cost matrix strategies and fuzzy membership functions to address class imbalance and simultaneously mitigate outliers and noise.
2. The cost matrix is leveraged to address class imbalance. The cost matrix prioritises the training errors on the minority samples, making the classification hyperplane position more reasonable. Higher costs are assigned to defective modules to emphasise the importance of the minority class.
3. A fuzzy membership function addresses noise and outliers. The fuzzy membership function considers the specific sample distribution in the feature space. It calculates the class centre of all samples in the same class, the radius of all samples and the distance from each sample to its class centre using clustering and k nearest neighbours. A fuzzy membership function is thus designed in which a higher degree of membership indicates a stronger influence of that data point on the model.

The remainder of this paper is organised as follows. Section 2 provides prior knowledge relevant to SDP work. Section 3 presents a comprehensive description of the proposed method. The experimental setup is outlined in Section 4. Section 5 presents the experimental results and analyses. Finally, Section 6 concludes the study.

2. Related Work

2.1. Class Imbalanced Learning for Defect Prediction

Various approaches have been proposed at the data and algorithm levels to tackle the class imbalance issue in SDP [20]. Data-level methods employ re-sampling techniques, which are preprocessing techniques that adjust the data distribution of imbalanced datasets, including oversampling [21, 22] and undersampling [23]. Stradowski [15] analysed state-of-the-art machine learning approaches to SDP and identified new trends in this field. In oversampling, instances from the minority class are duplicated. However, distinguishing whether the replicated instances are beneficial or redundant is challenging, which can render oversampling results unreliable. Oversampling may also lead to multiple duplicate samples in the dataset, which may cause overfitting [24]. Chawla et al. [22] proposed SMOTE, which generates synthetic instances for the minority class to preprocess imbalanced data. Many researchers have proposed variants of SMOTE to address class imbalance problems. Torgo et al. [21] proposed an oversampling technique called SmoteR for solving regression problems. Douzas, Bacao and Last [25] presented K-means SMOTE to balance samples. Improved oversampling techniques were also proposed. Yedida and Menzies [26] introduced a fuzzy oversampling method to enhance the efficacy of deep learning on imbalanced datasets. Arun and Lakshmi [27] introduced a multicluster-based oversampling method to address class imbalance and small disjuncts effectively. Wang, Liu and Bai [28] introduced an enhanced defect prediction approach that incorporates oversampling techniques to mitigate class imbalance, class overlap and noise issues. These sophisticated oversampling techniques, while enhancing the diversity and representativeness of synthetic samples, continue to face challenges in determining the optimal number of synthetic instances and ensuring their alignment with the underlying data distribution, potentially limiting their applicability in SDP. Unlike oversampling approaches, undersampling techniques attain balance by reducing the number of instances in the majority class. However, it can be challenging to determine whether the deleted instances are redundant or beneficial, which results in the loss of instance information [29]. Khoshgoftaar et al. [30, 31] employed a random undersampling (RUS) technique for imbalanced data. Furthermore, some researchers have suggested methods that combine undersampling and oversampling to mitigate imbalanced sample issues [32, 33].
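
The interpolation idea behind SMOTE [22] can be illustrated with a minimal NumPy sketch: each synthetic instance is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbours. The function name `smote_like`, its parameters and the toy data are assumptions made for illustration; this is not the reference implementation and it omits the refinements of the variants cited above. It needs at least two minority samples.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, rng=None):
    """Create n_new synthetic minority samples by neighbour interpolation.

    Hypothetical helper for illustration; requires len(X_min) >= 2.
    """
    rng = np.random.default_rng() if rng is None else rng
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # exclude self-matches
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]   # k nearest per sample
    base = rng.integers(0, len(X_min), size=n_new) # seed samples
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                   # position on the segment
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
S = smote_like(X_min, n_new=20, k=2, rng=np.random.default_rng(1))
print(S.shape)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies inside the bounding box of the minority class, which is also why the choice of the number of synthetic instances matters more than their location.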

At the algorithmic level, ensemble learning and cost-sensitive (CS) learning are frequently employed to tackle class imbalance. Bagging and boosting in ensemble learning have proven effective [34]. Wang and Yao [35] analysed their proposed ensemble method, DNC (Dynamic AdaBoost.NC), and found that it outperformed ROS, RUS and SMOTE. Tang et al. [36] employed bagging ensemble learning in conjunction with the adaptive variable sparrow search algorithm ELM to mitigate dataset imbalance; however, the parameter selection process of this algorithm increased the time overhead. CS learning allocates a significant penalty for misclassifying samples from the minority class while assigning a comparatively lower cost to misclassifying samples from the majority class. Khoshgoftaar et al. [37] introduced a cost-boosting approach incorporating CS learning into SDP. Zheng [38] studied the efficacy of CS-boosting algorithms for augmenting neural networks. Arar and Ayan [39] conducted research on CS neural networks. Jin [40] considered intersample relationships and subsequently utilised CS learning to enhance distance metric learning and achieve sample balance. In summary, CS algorithms set cost weights for different categories to adjust the model’s attention to each category, which makes them more flexible and improves generalisation performance.

2.2. BLS

The BLS can overcome the challenge of prolonged training processes in deep learning by expanding the width of the neural network, which also reduces the time required to construct a network model. The input data are first mapped into mapped nodes, which are then transformed into enhancement nodes. All the nodes are jointly fed into the hidden layer, and the pseudoinverse is used to calculate the weights that connect the hidden and output layers. When the nodes are insufficient, the BLS does not need to restart the learning process from scratch; it simply adjusts the weights related to the added nodes to facilitate rapid retraining. Since its proposal, BLS has found extensive applications in diverse fields, including human activity recognition [41], feature extraction using the K-means clustering algorithm [42], multimodal information fusion [43], image classification [44], artificial intelligence [45] and imbalanced data classification [46]. This demonstrates that BLS exhibits robustness, generalisation ability and scalability.

3. Proposed Model

In this section, we describe FC-BLS, which introduces the cost matrix and the fuzzy membership function into the BLS to reduce the influence of class imbalance, noise and outliers on the prediction accuracy. The procedure for prediction using FC-BLS is depicted in Figure 1.

Figure 1. The prediction process of the FC-BLS. (a) SDP dataset: red represents the defective modules, and blue represents the nondefective modules. (b) Cost matrix: the cost is calculated according to the proportion of majority and minority classes; the minority class is assigned larger costs and the majority class smaller costs, forming the cost matrix C. (c) Fuzzy membership function: k is the number of neighbours of the sample point x_i. The class centre φ^m of all samples in the mth class, the radius R of all samples in the mth class and the distance d_i from each sample to the class centre are calculated; the mean distance separating the sample point from its neighbouring points is then obtained, and the membership degree t_i is calculated, forming the membership degree matrix T. (d) FC-BLS model: the cost matrix C and the membership degree matrix T are embedded in the BLS optimisation objective function. (e) Prediction labels.

FC-BLS mainly consists of three parts: (1) BLS model: the fundamental classifier of FC-BLS, which keeps the prediction process efficient thanks to its simple structure and few training parameters; (2) cost matrix: assigns a larger cost to the minority class and is added to the BLS, thereby increasing the sensitivity of the classifier towards the minority class and enabling effective classification of imbalanced samples; (3) fuzzy membership function: considers the distribution of the samples in the feature space, as well as the influence of noise and outliers, calculates the membership degree of each sample in the feature mapping space and is added to the BLS to reduce the prediction error caused by noise and outlier samples, which improves the classification effectiveness and generalisation ability of BLS.

3.1. BLS Model

BLS uses n groups of mapped nodes and m groups of enhancement nodes as inputs to the hidden layer. Each group of mapped nodes contains N neural nodes, each group of enhancement nodes contains M neural nodes, and b is the dimensionality of the output data. The BLS model structure is depicted in Figure 2.

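Given the input dataset $X \in \mathbb{R}^{L \times D}$, the mapped nodes can be established by the following:

$$Z_i = \phi_i\left(X W_{zi} + \beta_{zi}\right), \quad i = 1, 2, \ldots, n, \tag{1}$$

where L represents the sample size, whereas D denotes the sample dimension; φ_i is the feature mapping function; W_zi and β_zi are the weight and bias matrices, respectively, which are produced in a random manner; and all mapped groups Z_i are stitched together to build a mapped layer Zⁿ.

$$Z^n = [Z_1, Z_2, \ldots, Z_n]. \tag{2}$$

The feature mapping layer Zⁿ is then nonlinearly transformed to form the enhancement nodes, which can be established by the following:

$$H_j = \xi_j\left(Z^n W_{hj} + \beta_{hj}\right), \quad j = 1, 2, \ldots, m, \tag{3}$$

where ξ_j is a nonlinear activation function, W_hj and β_hj are produced in a random manner, and all enhancement nodes H_j are stitched together to build an enhancement layer H^m, as follows:

$$H^m = [H_1, H_2, \ldots, H_m]. \tag{4}$$

Ultimately, the mapped layer Zⁿ and the enhancement layer H^m are directly connected to the output of the BLS. Letting A = [Zⁿ|H^m], the final output of the BLS is

$$Y = [Z^n \mid H^m]\,W = AW, \tag{5}$$

where Y is the predicted value matching with X and W is the output weight matrix.

The output weight matrix W is obtained by solving the following optimisation objective:

$$\arg\min_{W} \; \|AW - Y\|_2^2 + \lambda \|W\|_2^2. \tag{6}$$

Here, λ is the regularisation parameter.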
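
The construction described in this subsection can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mapped nodes use an identity activation, the enhancement nodes use tanh and are generated as a single group, all random weights are standard normal, and W is obtained from a ridge-regularised least-squares problem. The function names and the toy data are hypothetical.

```python
import numpy as np

def bls_hidden_layer(X, n_groups, nodes_per_group, n_enhance, rng):
    """Build A = [Z^n | H^m] with random, untrained weights.

    Assumptions (not from the paper): identity activation for mapped
    nodes, tanh for enhancement nodes, standard-normal random weights,
    and one enhancement group instead of m separate groups.
    """
    Z_groups = []
    for _ in range(n_groups):
        W_z = rng.standard_normal((X.shape[1], nodes_per_group))
        beta_z = rng.standard_normal(nodes_per_group)
        Z_groups.append(X @ W_z + beta_z)        # mapped group Z_i
    Z = np.hstack(Z_groups)                      # mapped layer Z^n
    W_h = rng.standard_normal((Z.shape[1], n_enhance))
    beta_h = rng.standard_normal(n_enhance)
    H = np.tanh(Z @ W_h + beta_h)                # enhancement layer H^m
    return np.hstack([Z, H])                     # A = [Z^n | H^m]

def ridge_output_weights(A, Y, lam):
    """Closed-form minimiser of ||A W - Y||^2 + lam * ||W||^2."""
    k = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))                 # L = 40 samples, D = 5
Y = np.eye(2)[rng.integers(0, 2, size=40)]       # one-hot labels, b = 2
A = bls_hidden_layer(X, n_groups=3, nodes_per_group=4, n_enhance=6, rng=rng)
W = ridge_output_weights(A, Y, lam=2.0 ** -3)
print(A.shape, W.shape)   # (40, 18) (18, 2)
```

A practical model would store the random weights so that the same mapping can be applied to test data; the sketch regenerates them on every call and is therefore only suitable for showing the shapes and the closed-form solve.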

3.2. Design of Cost Matrix

In the SDP datasets, the defective class is the minority class. When a sample x_i belongs to the defective class, the cost matrix C = diag{c_ii}, i = 1, …, N, assigns it a higher cost c_ii. The cost c_ii for each module x_i is defined as follows:

()

where ndm represents the overall count of nondefective samples, dm represents the overall count of defective samples, and N is the overall count of training samples.
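
As a hedged illustration of a proportion-based cost, the sketch below gives each sample the share of the opposite class, so that defective samples receive ndm/N and nondefective samples receive dm/N. This specific form is an assumption chosen because it uses only the three quantities named above and gives the minority class the larger cost; it is not confirmed to be the definition used in this paper. The function name `proportional_costs` is hypothetical.

```python
import numpy as np

def proportional_costs(y):
    """Per-sample costs c_ii from class proportions (assumed form).

    y: array of 0/1 labels, 1 = defective. Each sample receives the
    share of the *other* class, so the rarer class gets the larger cost.
    """
    y = np.asarray(y)
    n_total = y.size
    dm = int(y.sum())            # number of defective samples
    ndm = n_total - dm           # number of nondefective samples
    return np.where(y == 1, ndm / n_total, dm / n_total)

y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 0])    # 2 defective out of 10
c = proportional_costs(y)
C = np.diag(c)                                   # cost matrix C = diag(c_ii)
print(c[0], c[1])   # 0.8 0.2
```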

3.3. Design of Fuzzy Membership Functions

For a given dataset X = (X₁, X₂, ⋯, X_m, ⋯, X_P), X_m is the sample matrix consisting of all samples of class m, P is the overall count of classes in the sample and x_i^m is the ith sample of the mth class. The count of samples in class m is n_m. The overall count of samples is L. The dataset X = (X₁, X₂, ⋯, X_m, ⋯, X_P) is mapped from the input space to the hidden layer and can be written as φ(X) = (φ(X₁), φ(X₂), ⋯, φ(X_m), ⋯, φ(X_P)). The membership degree is defined by clustering and k nearest neighbours. The definitions for φ^m, which represents the class centre of all samples in the mth class, R, which denotes the radius of all samples in the mth class, and d_i, which denotes the distance from samples to the class centre, are as follows:

()

Equation (11) gives the mean distance separating the sample point x_i from its neighbouring points, where k is the number of neighbours of x_i. The membership degree t_i is defined by equations (9)–(11):

()

In equations (12) and (13), we set σ > 0 and δ > 0 to avoid s_i = 0 and η_i = 0.
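
The quantities described in this subsection (class centre, class radius, distance to the centre and mean distance to the k nearest neighbours) can be combined into a membership degree in several ways. The sketch below shows one possibility and should be read as an assumption-laden illustration rather than the paper's equations: s_i = 1 − d_i/(R + σ), η_i = 1 − d̄_i/(max d̄ + δ) and t_i = s_i·η_i are hypothetical forms, neighbours are searched within the same class, and `Phi` stands for samples already mapped into the feature space.

```python
import numpy as np

def fuzzy_membership(Phi, y, k=3, sigma=1e-3, delta=1e-3):
    """Illustrative membership degrees t_i in a feature space.

    Assumed forms (NOT taken from the paper's equations):
      s_i   = 1 - d_i / (R + sigma)             distance-to-centre term
      eta_i = 1 - dbar_i / (max dbar + delta)   neighbourhood term
      t_i   = s_i * eta_i
    """
    Phi = np.asarray(Phi, dtype=float)
    y = np.asarray(y)
    t = np.empty(len(y))
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        P = Phi[idx]
        centre = P.mean(axis=0)                      # class centre
        d = np.linalg.norm(P - centre, axis=1)       # distance to centre
        R = d.max()                                  # class radius
        # Pairwise distances inside the class; column 0 after sorting
        # is the zero self-distance and is skipped.
        pair = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
        kk = min(k, len(idx) - 1)
        if kk > 0:
            dbar = np.sort(pair, axis=1)[:, 1:kk + 1].mean(axis=1)
        else:
            dbar = np.zeros(len(idx))
        s = 1.0 - d / (R + sigma)
        eta = 1.0 - dbar / (dbar.max() + delta)
        t[idx] = s * eta
    return t

# Toy data: class 0 has a tight cluster plus one isolated point (3, 3).
Phi = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1])
t = fuzzy_membership(Phi, y)
print(int(t.argmin()))   # 4, the isolated point
```

With positive σ and δ, both factors stay strictly above zero, which mirrors the stated purpose of these two constants; the isolated point receives the smallest membership and would therefore contribute least to the weighted training error.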

3.4. FC-BLS Model

The cost matrix and the fuzzy membership function are introduced into the BLS optimisation objective function of equation (6) as follows:

$$\arg\min_{W} \; \left\| Q^{1/2} (AW - Y) \right\|_2^2 + \lambda \|W\|_2^2. \tag{15}$$

Here, Q = TC, the membership degree matrix T = diag(t_i), i = 1, 2, …, L and the cost matrix C = diag(c_ii), i = 1, 2, …, L.

Solving the optimisation problem of equation (15) yields the final output weight matrix of the FC-BLS as follows:

$$W = \left(A^{T} Q A + \lambda I\right)^{-1} A^{T} Q Y, \tag{16}$$

where I is the identity matrix and λ is the regularisation parameter.
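
A weighted regularised least-squares solve of this kind can be sketched in NumPy as below. The sketch assumes the objective is a sum of per-sample squared errors weighted by the diagonal of Q = TC plus λ‖W‖², which leads to W = (AᵀQA + λI)⁻¹AᵀQY; the function name `weighted_ridge` and the random toy inputs are hypothetical.

```python
import numpy as np

def weighted_ridge(A, Y, q, lam):
    """Solve min_W sum_i q_i * ||a_i W - y_i||^2 + lam * ||W||^2.

    q holds the diagonal of Q = T C, so the L x L matrix Q is never
    formed explicitly.
    """
    AtQ = A.T * q                                # equals A.T @ np.diag(q)
    k = A.shape[1]
    return np.linalg.solve(AtQ @ A + lam * np.eye(k), AtQ @ Y)

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 6))                 # hidden-layer output
Y = rng.standard_normal((30, 2))                 # targets
t = rng.uniform(0.1, 1.0, size=30)               # membership degrees (toy)
c = rng.uniform(0.1, 1.0, size=30)               # costs (toy)
q = t * c                                        # diagonal of Q = T C
W = weighted_ridge(A, Y, q, lam=0.5)
print(W.shape)   # (6, 2)
```

Storing Q as a vector and relying on broadcasting keeps memory linear in the number of samples, and `np.linalg.solve` avoids computing an explicit matrix inverse.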

3.5. Computational Complexity Analysis

In this subsection, we analyse the computational complexity of FC-BLS. From equation (16), the computational complexity of FC-BLS mainly consists of computing A, A^TQA and A^TQY. Let D be the dimension of the input data, L the number of samples, N the number of feature nodes, n the number of feature node groups, M the number of enhancement nodes and b the dimension of the weight vector of the output node. The complexity of A is O(D(Nn + M)L), and the complexity of the identity matrix I is O(L²). The complexity of A^TQA is O((Nn + M)²L³), and the complexity of A^TQY is O(b(Nn + M)L³). The complexity of inverting the matrix A^TQA is O((Nn + M)³). Therefore, the computational complexity of FC-BLS is

()

4. Experimental Setup

This section introduces the software defect datasets and performance evaluation metrics employed in this experiment. Finally, the experimental setup and procedures are described.

4.1. Software Defect Dataset

All experimental datasets were drawn from the NASA datasets. Table 1 provides the characteristics of the datasets. Notably, the proportion of defective modules ranges from 2.1% to 35.2%, so every dataset exhibits a class imbalance problem.

Table 1. Characteristics of the NASA datasets.

Datasets	Number of instances	Number of metrics	Defective (%)
CM1	327	37	12.8
KC3	194	39	18.6
MC2	125	39	35.2
MW1	253	37	10.7
PC1	705	37	8.7
PC2	745	36	2.1
PC3	1077	37	12.4
PC4	1287	37	13.8
PC5	1711	38	27.5

4.2. Performance Indicators

For imbalanced SDP datasets, we utilised the F-measure, AUC and Recall as performance indicators. These indicators also assess whether the proposed method effectively balances the majority and minority classes. Table 2 shows the confusion matrix and the Recall, Precision and F-measure defined using the confusion matrix. The Recall is the proportion of samples that are correctly predicted to be defective among all defective samples. The F-measure is the weighted harmonic average of the Precision and Recall. The AUC is often utilised to gauge the model’s ability to balance performance between two classes [47]. The higher the values of the F-measure, AUC and Recall, the better the classification effect of the FC-BLS.

Table 2. Basic indicators for defect prediction.

	Predicted as defective	Predicted as nondefective
Actual defective	TP	FN
Actual nondefective	FP	TN
Recall	TP/(TP + FN)
Precision	TP/(TP + FP)
F-measure	2 × Recall × Precision/(Recall + Precision)
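
The three indicators in Table 2 can be computed directly from the confusion-matrix counts, as in the following plain-Python sketch. The function name `defect_metrics`, the zero-division guards and the example counts are illustrative assumptions.

```python
def defect_metrics(tp, fn, fp, tn):
    """Return (recall, precision, f_measure) from confusion-matrix counts.

    tn is accepted for completeness but is not used by these three metrics.
    """
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = recall + precision
    f_measure = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f_measure

# Hypothetical counts: 40 defective modules, 30 of them detected.
recall, precision, f_measure = defect_metrics(tp=30, fn=10, fp=20, tn=140)
print(recall, precision, round(f_measure, 4))   # 0.75 0.6 0.6667
```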

4.3. Experimental Design

The experimental design is described in this section. We conducted experiments on a system equipped with 64 GB of RAM, an NVIDIA GPU (model 8server), CUDA Version 11.4, driver Version 470.182 and NVIDIA-SMI Version 470.182. We utilised Anaconda Version 3 and TensorFlow as the backend for the Keras library. Additionally, we used Numpy as a linear algebra library; Pandas and Sklearn for data interpretation; Imblearn, Scipy and Seaborn for sampling methods; and Matplotlib for data visualisation. We conducted the SDP experiments on nine NASA datasets. First, each NASA dataset was randomly divided in a 75:25 ratio, with the training dataset accounting for 75% and the test dataset for 25%. Then, we performed 10-fold cross-validation to choose the optimal parameter configuration for the FC-BLS model: the numbers of mapped and enhancement nodes were selected from {15, 20, …, 35}, and the regularisation parameter λ from {2⁻⁵, 2⁻⁴, …, 2⁴}. Finally, we compared FC-BLS with other cutting-edge methods using the performance indicators.
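
The search space described above can be enumerated with a few lines of Python. The sketch assumes that the numbers of mapped and enhancement nodes are varied independently, which the text does not state explicitly; the dictionary keys and the `select_best` helper are hypothetical, and `score_fn` stands for whatever cross-validated score is used to rank configurations.

```python
from itertools import product

node_counts = list(range(15, 36, 5))             # {15, 20, ..., 35}
lambdas = [2.0 ** e for e in range(-5, 5)]       # {2^-5, 2^-4, ..., 2^4}

# One dictionary per candidate configuration.
grid = [
    {"mapped_nodes": n_map, "enhancement_nodes": n_enh, "lam": lam}
    for n_map, n_enh, lam in product(node_counts, node_counts, lambdas)
]

def select_best(grid, score_fn):
    """Return the configuration with the highest score.

    score_fn is a hypothetical callable, e.g. the mean 10-fold
    cross-validation F-measure of a model trained with that configuration.
    """
    return max(grid, key=score_fn)

print(len(grid))   # 250
```

Under this assumption the grid contains 5 × 5 × 10 = 250 configurations, each of which would be evaluated with 10-fold cross-validation on the training portion only, so that the 25% test split stays untouched until the final comparison.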

5. Experimental Results and Analysis

In the comparative experiments, we compared FC-BLS with KNN [7], SVM [8], CNN [9], ELM [13], BLS [19] and Hellinger net [18]; the proposed method achieved the best average value on all three evaluation metrics. We used the average of 10 experiments as the final experimental result. Section 5.1 analyses the experimental results using the evaluation metrics, and Section 5.2 presents a parameter analysis of FC-BLS, investigating the relationship between different parameters and model performance. In addition, we present the analytical and experimental data in the form of charts for a more intuitive representation.

5.1. Experiment Analysis

The Recall, AUC and F-measure for the different classifiers over the nine NASA datasets are listed in Table 3. The most favourable outcomes are emphasised in bold.

Table 3. Recall, AUC and F-measure for different classifiers over nine SDP datasets.

Dataset	Measure	CNN	KNN	ELM	SVM	Hellinger net	BLS	FC-BLS (proposed)
CM1	F-measure	0.818	0.867	0.777	0.826	0.865	0.857	0.894
	AUC	0.672	0.389	0.381	0.473	0.706	0.591	0.713
	Recall	0.800	0.848	0.731	0.872	0.660	0.858	0.876

KC3	F-measure	0.772	0.794	0.760	0.833	0.832	0.763	0.858
	AUC	0.574	0.464	0.514	0.500	0.744	0.610	0.798
	Recall	0.796	0.816	0.786	0.714	0.600	0.761	0.853

MC2	F-measure	0.672	0.658	0.672	0.877	0.730	0.760	0.897
	AUC	0.730	0.350	0.365	0.500	0.673	0.745	0.884
	Recall	0.656	0.656	0.672	0.781	0.540	0.759	0.897

MW1	F-measure	0.772	0.813	0.780	0.906	0.902	0.792	0.876
	AUC	0.500	0.592	0.533	0.500	0.790	0.609	0.811
	Recall	0.844	0.766	0.805	0.828	0.650	0.777	0.867

PC1	F-measure	0.860	0.876	0.857	0.902	0.945	0.899	0.925
	AUC	0.561	0.479	0.336	0.402	0.854	0.707	0.787
	Recall	0.823	0.881	0.827	0.924	0.870	0.889	0.921

PC2	F-measure	0.988	0.971	0.976	0.989	0.980	0.983	0.991
	AUC	0.500	0.434	0.508	0.500	0.660	0.537	0.665
	Recall	0.992	0.973	0.974	0.979	0.200	0.986	0.992

PC3	F-measure	0.824	0.823	0.806	0.786	0.907	0.843	0.889
	AUC	0.500	0.424	0.510	0.488	0.835	0.650	0.769
	Recall	0.880	0.807	0.855	0.851	0.700	0.842	0.881

PC4	F-measure	0.866	0.813	0.852	0.787	0.900	0.874	0.903
	AUC	0.632	0.451	0.300	0.491	0.842	0.739	0.852
	Recall	0.859	0.830	0.848	0.852	0.820	0.870	0.895

PC5	F-measure	0.621	0.692	0.636	0.624	0.825	0.670	0.829
	AUC	0.500	0.433	0.475	0.483	0.700	0.620	0.707
	Recall	0.734	0.734	0.667	0.729	0.500	0.658	0.809

Average	F-measure	0.799	0.812	0.791	0.837	0.876	0.827	0.896
	AUC	0.517	0.402	0.392	0.434	0.680	0.581	0.699
	Recall	0.738	0.731	0.716	0.753	0.554	0.740	0.799

Note: Bold values indicate the best-performing model values.

FC-BLS employs linear feature mapping, which makes it comparable to SVM and ELM models that utilise linear kernel functions. From Table 3 and Figures 3, 4 and 5, it is evident that FC-BLS outperformed SVM and ELM on most of the nine datasets. Compared with SVM, FC-BLS exhibited a higher average F-measure, AUC and Recall, with increases of 7.08%, 61.06% and 6.12%, respectively. Compared with ELM, FC-BLS demonstrated a significantly higher average F-measure, AUC and Recall, with increases of 13.31%, 78.11% and 11.55%, respectively. This is because these two classification methods use only the original data features and cannot address the class imbalance, noise and outliers in the dataset. SVM, however, aims to maximise the classification margin and determines the decision boundary by selecting support vectors. When the data are close to being separable by a hyperplane in the feature space, SVM performs well, which is consistent with its better F-measure on MW1 and its better Recall on PC1.

FC-BLS outperformed KNN on all datasets, with a significantly higher average F-measure, AUC and Recall; based on the averages in Table 3, the increases are approximately 10.3%, 73.9% and 9.3%, respectively. KNN classifies samples by measuring the distances between their feature vectors. However, the average AUC of KNN is only 0.402, indicating that KNN cannot effectively address the issue of class imbalance. By contrast, FC-BLS yielded significantly superior results.

Compared with CNN, FC-BLS exhibited superior performance across the nine datasets, with notably higher average F-measure, AUC and Recall, achieving increases of 12.11%, 35.18% and 8.23%, respectively. CNN adopts a complex deep network structure and obtains its classification model only after a long period of iterative training. Although CNN obtained the same Recall value as FC-BLS on the PC2 dataset, it was insensitive to the minority class and therefore did not mitigate the effect of class imbalance.

The prediction performance of the FC-BLS model surpassed that of the BLS model. FC-BLS exhibited a higher average F-measure, AUC and Recall than BLS; based on the averages in Table 3, the increases are approximately 8.3%, 20.3% and 8.0%, respectively. Therefore, combining the cost matrix and fuzzy membership function is beneficial for improving the classification performance in SDP.

Compared with Hellinger net, FC-BLS achieved better results on all three metrics on six of the nine datasets (CM1, KC3, MC2, PC2, PC4 and PC5), with higher average F-measure, AUC and Recall, achieving increases of 2.25%, 2.68% and 44.25%, respectively. The Hellinger net [18] is a deep feedforward neural network with a built-in hierarchy that leverages the robustness of the Hellinger distance, a skew-insensitive distance measure, to address class imbalance effectively. Although Hellinger net mitigates the issue of class imbalance to some degree, it does not address the challenges posed by noise and outliers in software defect data. FC-BLS, by contrast, not only uses the cost matrix to classify imbalanced samples effectively but also considers the specific sample distribution in the feature space and employs the fuzzy membership function to tackle noise and outliers, which improves the classification performance in SDP.

In summary, FC-BLS achieved the highest average F-measure, AUC and Recall among the compared methods on the nine datasets.
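
The relative increases quoted in this section appear consistent with the usual definition (proposed − baseline)/baseline applied to the Average row of Table 3. The sketch below recomputes them from the rounded table values; the dictionary layout and function names are assumptions made for illustration, and because the table is rounded to three decimals, the recomputed values need not match the quoted percentages exactly.

```python
# Average row of Table 3, copied as printed (three-decimal rounding).
AVERAGES = {
    "CNN":           {"F-measure": 0.799, "AUC": 0.517, "Recall": 0.738},
    "KNN":           {"F-measure": 0.812, "AUC": 0.402, "Recall": 0.731},
    "ELM":           {"F-measure": 0.791, "AUC": 0.392, "Recall": 0.716},
    "SVM":           {"F-measure": 0.837, "AUC": 0.434, "Recall": 0.753},
    "Hellinger net": {"F-measure": 0.876, "AUC": 0.680, "Recall": 0.554},
    "BLS":           {"F-measure": 0.827, "AUC": 0.581, "Recall": 0.740},
    "FC-BLS":        {"F-measure": 0.896, "AUC": 0.699, "Recall": 0.799},
}


def relative_increase(proposed, baseline):
    """Percentage increase of `proposed` over `baseline`."""
    return 100.0 * (proposed - baseline) / baseline


def increases_over(baseline_name, proposed_name="FC-BLS"):
    """Relative increase of the proposed method over one baseline, per metric."""
    base = AVERAGES[baseline_name]
    prop = AVERAGES[proposed_name]
    return {metric: relative_increase(prop[metric], base[metric]) for metric in base}


for name in AVERAGES:
    if name != "FC-BLS":
        gains = increases_over(name)
        print(name, {metric: round(value, 2) for metric, value in gains.items()})
```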

5.2. Parametric Analysis

A portion of the weights in FC-BLS is randomly generated during training. These include the random weights that map the input data to the mapped features and those that map the mapped features to the enhancement nodes. In constructing the optimal model, the experimental outcomes are significantly influenced by the numbers of mapped and enhancement nodes and by the regularisation parameter λ. The following experiments take three datasets, CM1, PC1 and PC2, as examples and use Recall as the evaluation metric to analyse the influence of these parameters on the prediction performance of FC-BLS.

5.2.1. Impact of Mapped Nodes and Enhancement Nodes on Recall

The numbers of mapped and enhancement nodes affect the performance of the FC-BLS model. More mapped nodes allow a larger number of mapped features to be extracted, which is beneficial for improving Recall. In addition, more enhancement nodes introduce more feature representations and complexity to FC-BLS, which can enhance the performance and generalisation ability of the model. However, this also requires more computation. Consequently, it is important to examine the impact of the numbers of mapped and enhancement nodes on FC-BLS.

The numbers of mapped and enhancement nodes take values in {15, 20, …, 35}, and the regularisation parameter is fixed at λ = 0.001. The average Recall was obtained over 10 experiments, and the effect of the mapped and enhancement nodes on Recall is depicted in Figure 6. Recall fluctuates with the number of nodes; however, the range between the highest and lowest Recall is small, indicating the stability of FC-BLS. On the CM1 dataset, Recall reaches its maximum at MapTimes = 25 and EnhanceTimes = 25. On the PC1 dataset, Recall rises relatively smoothly for MapTimes > 20 and EnhanceTimes > 30; because the computation of FC-BLS grows with the number of mapped nodes, the optimal values are taken as MapTimes = 20 and EnhanceTimes = 30. On the PC2 dataset, Recall fluctuates somewhat for MapTimes < 30 and EnhanceTimes < 15 but shows an overall upward trend; for MapTimes > 30 and EnhanceTimes > 15, Recall gradually stabilises, and the optimal values of MapTimes and EnhanceTimes are 30 and 15, respectively.
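
One way to summarise such a sweep is sketched below: Recall values from repeated runs are averaged per (MapTimes, EnhanceTimes) pair, the best pair is selected, and the spread between the highest and lowest averaged Recall is reported as a simple stability indicator. The function name and the toy numbers are hypothetical and serve only to illustrate the bookkeeping; they are not results from the paper.

```python
from statistics import mean


def summarise_sweep(recalls_by_config):
    """Average repeated-run Recall values and summarise the sweep.

    recalls_by_config maps (map_nodes, enhance_nodes) to a list of Recall
    values, one per repeated experiment.
    Returns (averaged Recall per configuration, best configuration, spread).
    """
    averaged = {cfg: mean(values) for cfg, values in recalls_by_config.items()}
    best_cfg = max(averaged, key=averaged.get)
    spread = max(averaged.values()) - min(averaged.values())
    return averaged, best_cfg, spread


# Hypothetical Recall values for three configurations, two runs each.
toy_runs = {
    (15, 15): [0.80, 0.82],
    (20, 20): [0.85, 0.87],
    (25, 25): [0.84, 0.84],
}
averaged, best_cfg, spread = summarise_sweep(toy_runs)
print(best_cfg, round(spread, 3))
```

A small spread across the grid suggests that the choice of node counts is not critical, in which case the cheaper configuration is a reasonable default.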

5.2.2. Impact of the Regularisation Parameter λ on Recall

The regularisation parameter is an important BLS hyperparameter that controls model complexity and helps to avoid both overfitting and underfitting, thus improving the generalisation ability of the model. If λ is excessively small, the model is susceptible to overfitting, whereas an overly large λ can lead to underfitting. Hence, it is imperative to determine an appropriate value for λ. On the three datasets, λ takes values in {2⁻⁵, 2⁻⁴, …, 2⁴}, and MapTimes = EnhanceTimes = 20. The average Recall is obtained over 10 experiments, and the effect of the regularisation parameter on Recall is depicted in Figure 7. As observed, Recall varies with λ, allowing us to select an optimal value for λ based on these variations.
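
In the standard BLS [19], the output weights are obtained by ridge regression, which is where λ enters. As a rough illustration of the shrinkage effect described above, the sketch below solves a single-feature ridge problem in closed form, w(λ) = Σxᵢyᵢ / (Σxᵢ² + λ), over the grid {2⁻⁵, …, 2⁴}. This is a simplified stand-in under stated assumptions (one feature, no intercept, toy data); it does not reproduce the FC-BLS loss with its cost matrix and fuzzy membership terms.

```python
def ridge_weight_1d(xs, ys, lam):
    """Minimiser of sum((y - w*x)^2) + lam * w^2 for a single feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


LAMBDA_GRID = [2.0 ** e for e in range(-5, 5)]  # 2^-5, ..., 2^4

# Toy data generated from y = 2x (hypothetical, for illustration only).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lam in LAMBDA_GRID:
    # The weight moves from close to 2 towards 0 as lam grows.
    print(lam, ridge_weight_1d(xs, ys, lam))
```

With a very small λ the fitted weight stays close to the unregularised solution and can follow noise in the training data, while a large λ shrinks it towards zero, which corresponds to the overfitting and underfitting regimes mentioned above.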

6. Discussion

Current SDP methods face challenges such as class imbalance, noise and outliers, which reduce predictive performance. Traditional models often fail to prioritise minority class samples and are sensitive to noise, leading to distorted decision boundaries. To address these issues, this paper proposes FC-BLS, which uses a cost matrix to emphasise minority class samples and fuzzy membership functions to reduce the impact of noise and outliers. This approach retains the simplicity and fast training of BLS while improving robustness in complex scenarios. Although FC-BLS outperforms traditional methods on multiple NASA datasets, it has limitations. (1) It has only been validated on nine specific datasets, which may limit its generalisability. (2) Its performance is sensitive to the numbers of mapped and enhancement nodes as well as the regularisation parameter λ, and parameter tuning currently relies heavily on empirical methods, which increases training complexity and hinders rapid application to diverse datasets. (3) Although FC-BLS is efficient with large-scale data, its computational complexity may become a bottleneck for extremely large datasets, and it does not effectively handle dynamic or streaming data scenarios, limiting its applicability in real-world software development processes. Future work will focus on expanding the datasets, automating parameter tuning, enabling real-time processing and exploring applications in domains such as medical diagnosis and financial fraud detection to improve its practical value and contribution to software reliability engineering.

7. Conclusion

We proposed a fuzzy cost broad learning system (FC-BLS) for SDP. We improved the loss function of BLS by constructing a cost matrix and fuzzy membership functions. The cost matrix alleviates class imbalance, and the fuzzy membership functions consider the specific sample distribution in the feature space to address noise and outliers in software defect data. FC-BLS retains the characteristics of BLS, such as its simple structure and short training time. The incorporation of fuzzy membership functions further enhances the robustness of the model in complex software defect scenarios. We evaluated FC-BLS on nine NASA SDP datasets using three key performance indicators, and it showed superior performance compared with six other methods on most datasets. This study focuses primarily on developing an imbalanced classifier to enhance SDP on NASA datasets. In future studies, we may extend this work to other SDP scenarios, such as cross-project and semi-supervised settings. We will also attempt to apply the FC-BLS model to data imbalance problems in other fields to examine the generalisation ability of the proposed method.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This work was partially supported by grants from the Natural Science Foundation of China (Nos. 62206087, 62276091, 62202223, 61602154), Center for Complexity Science, Henan University of Technology (No. CSKFJJ-2024-7), the Innovative Funds Plan of Henan University of Technology (No. 2022ZKCJ14), Key Laboratory of Grain Information Processing and Control (Henan University of Technology), Ministry of Education (No. KFJJ2024016), Cultivation Programme for Young Backbone Teachers in Henan University of Technology (No. 21420158), General Program of Henan Provincial Natural Science Foundation (No. 242300420279) and National Natural Science Foundation of China Cultivation Project in Henan University of Technology (No. 31490139).

Open Research

Data Availability Statement

The data used to support the findings of this study are included within the article.

References

1 Lee D. G. and Seo Y. S., Improving Bug Report Triage Performance Using Artificial Intelligence Based Document Generation Model, Human-Centric Computing and Information Sciences. (2020) 10, no. 1, https://doi.org/10.1186/s13673-020-00229-7.
2 Feng S., Keung J., Yu X., Xiao Y., and Zhang M., Investigation on the Stability of SMOTE-Based Oversampling Techniques in Software Defect Prediction, Information and Software Technology. (2021) 139, https://doi.org/10.1016/j.infsof.2021.106662.
3 Yang X., Tang K., and Yao X., A Learning-To-Rank Approach to Software Defect Prediction, IEEE Transactions on Reliability. (2015) 64, no. 1, 234–246, https://doi.org/10.1109/tr.2014.2370891, 2-s2.0-85027924052.
4 Zada I., Shahzad S., Alatawi M. N., Ali S., and Khan J. A., LAGSSE: An Integrated Framework for the Realization of Sustainable Software Engineering, The Journal of Supercomputing. (2023) 4.
5 Gong L., Zhang H., Zhang J., Wei M., and Huang Z., A Comprehensive Investigation of the Impact of Class Overlap on Software Defect Prediction, IEEE Transactions on Software Engineering. (2023) 49, no. 4, 2440–2458, https://doi.org/10.1109/tse.2022.3220740.
6 Gong L., Jiang S., Bo L., Jiang L., and Qian J., A Novel Class-Imbalance Learning Approach for Both Within-Project and Cross-Project Defect Prediction, IEEE Transactions on Reliability. (2020) 69, no. 1, 40–54, https://doi.org/10.1109/tr.2019.2895462, 2-s2.0-85062157657.
7 Jing X., Wu F., Dong X., Qi F., and Xu B., Heterogeneous Cross-Company Defect Prediction by Unified Metric Representation and CCA-Based Transfer Learning, Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, August 2015, Bergamo, Italy, 496–507, https://doi.org/10.1145/2786805.2786813, 2-s2.0-84960419541.
8 Elish K. O. and Elish M. O., Predicting Defect-Prone Software Modules Using Support Vector Machines, Journal of Systems and Software. (2008) 81, no. 5, 649–660, https://doi.org/10.1016/j.jss.2007.07.040, 2-s2.0-40749135790.
9 Thwin M. M. T. and Quah T. S., Application of Neural Networks for Software Quality Prediction Using Object-Oriented Metrics, Journal of Systems and Software. (2005) 76, no. 2, 147–156, https://doi.org/10.1016/j.jss.2004.05.001, 2-s2.0-11144262148.
10 Panichella A., Oliveto R., and De Lucia A., Cross-Project Defect Prediction Models: L’union Fait la Force, 2014 Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering, February 2014, Antwerp, Belgium, CSMR-WCRE, 164–173.
11 Shanthin A. and Chandrasekaran R. M., Analyzing the Effect of Bagged Ensemble Approach for Software Fault Prediction in Class Level and Package Level Metrics, International Conference on Information Communication and Embedded Systems (ICICES2014), February 2014, Chennai, India, 1–5.
12 Xia X., Lo D., McIntosh S., Shihab E., and Hassan A. E., Cross-project Build Co-Change Prediction, 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), March 2015, Montreal, Canada, 311–320, https://doi.org/10.1109/saner.2015.7081841, 2-s2.0-84928670632.
13 Huang G. B., Zhu Q. Y., and Siew C. K., Extreme Learning Machine: Theory and Applications, Neurocomputing. (2006) 70, no. 1-3, 489–501, https://doi.org/10.1016/j.neucom.2005.12.126, 2-s2.0-33745903481.
14 Okutan A. and Yıldız O. T., Software Defect Prediction Using Bayesian Networks, Empirical Software Engineering. (2014) 19, no. 1, 154–181, https://doi.org/10.1007/s10664-012-9218-8, 2-s2.0-84893780047.
15 Stradowski S. and Madeyski L., Machine Learning in Software Defect Prediction: A Business-Driven Systematic Mapping Study, Information and Software Technology. (2023) 155, https://doi.org/10.1016/j.infsof.2022.107128.
16 Zada I., Alshammari A., Mazhar A. A. et al., Enhancing IOT Based Software Defect Prediction in Analytical Data Management Using War Strategy Optimization and Kernel ELM, Wireless Networks. (2023) 30, no. 9, 7207–7225, https://doi.org/10.1007/s11276-023-03591-3.
17 Šikić L., Kurdija A. S., Vladimir K., and Šilić M., Graph Neural Network for Source Code Defect Prediction, IEEE Access. (2022) 10, 10402–10415, https://doi.org/10.1109/access.2022.3144598.
18 Chakraborty T. and Chakraborty A. K., Hellinger Net: A Hybrid Imbalance Learning Model to Improve Software Defect Prediction, IEEE Transactions on Reliability. (2021) 70, no. 2, 481–494, https://doi.org/10.1109/tr.2020.3020238.
19 Chen C. L. P. and Liu Z., Broad Learning System: An Effective and Efficient Incremental Learning System Without the Need for Deep Architecture, IEEE Transactions on Neural Networks and Learning Systems. (2018) 29, no. 1, 10–24, https://doi.org/10.1109/tnnls.2017.2716952, 2-s2.0-85028939310.
20 Laradji I. H., Alshayeb M., and Ghouti L., Software Defect Prediction Using Ensemble Learning on Selected Features, Information and Software Technology. (2015) 58, 388–402, https://doi.org/10.1016/j.infsof.2014.07.005, 2-s2.0-84914106409.
21 Torgo L., Ribeiro R. P., Pfahringer B., and Branco P., Portuguese Conference on Artificial Intelligence, Portuguese Conference on Artificial Intelligence, May 2013, Berlin, Germany, 378–389.
22 Chawla N. V., Bowyer K. W., Hall L. O., and Kegelmeyer W. P., SMOTE: Synthetic Minority Over-Sampling Technique, Journal of Artificial Intelligence Research. (2002) 16, 321–357, https://doi.org/10.1613/jair.953.
23 Goyal S., Handling Class-Imbalance with KNN (Neighbourhood) Under-Sampling for Software Defect Prediction, Artificial Intelligence Review. (2022) 55, no. 3, 2023–2064, https://doi.org/10.1007/s10462-021-10044-w.
24 López V., Fernández A., García S., Palade V., and Herrera F., An Insight Into Classification With Imbalanced Data: Empirical Results and Current Trends on Using Data Intrinsic Characteristics, Information Sciences. (2013) 250, 113–141, https://doi.org/10.1016/j.ins.2013.07.007, 2-s2.0-84883447718.
25 Douzas G., Bacao F., and Last F., Improving Imbalanced Learning Through a Heuristic Oversampling Method Based on K-Means and SMOTE, Information Sciences. (2018) 465, 1–20, https://doi.org/10.1016/j.ins.2018.06.056, 2-s2.0-85049450664.
26 Yedida R. and Menzies T., On the Value of Oversampling for Deep Learning in Software Defect Prediction, IEEE Transactions on Software Engineering. (2022) 48, no. 8, 3103–3116, https://doi.org/10.1109/tse.2021.3079841.
27 Arun C. and Lakshmi C., Diversity Based Multi-Cluster Over Sampling Approach to Alleviate the Class Imbalance Problem in Software Defect Prediction, International Journal of System Assurance Engineering and Management. (2023) 18, 1–13, https://doi.org/10.1007/s13198-023-02031-x.
28 Wang R., Liu F., and Bai Y., A Software Defect Prediction Method that Simultaneously Addresses Class Overlap and Noise Issues After Oversampling, Electronics. (2024) 13, no. 20, https://doi.org/10.3390/electronics13203976.
29 Rodriguez D., Herraiz I., Harrison R., Dolado J., and Riquelme J. C., Preliminary Comparison of Techniques for Dealing With Imbalance in Software Defect Prediction, Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, June 2014, London, UK, 1–10.
30 Khoshgoftaar T. M. and Gao K., Feature Selection With Imbalanced Data for Software Defect Prediction, 2009 International Conference on Machine Learning and Applications, May 2009, Miami Beach, FL, 235–240, https://doi.org/10.1109/icmla.2009.18, 2-s2.0-77950789610.
31 Khoshgoftaar T. M., Gao K., and Seliya N., Attribute Selection and Imbalanced Data: Problems in Software Defect Prediction, 2010 22nd IEEE International Conference on Tools with Artificial Intelligence, October 2010, Arras, France, 137–144, https://doi.org/10.1109/ictai.2010.27, 2-s2.0-78751556705.
32 Shi H., Ai J., Liu J., and Xu J., Improving Software Defect Prediction in Noisy Imbalanced Datasets, Applied Sciences. (2023) 13, no. 18, https://doi.org/10.3390/app131810466.
33 Abdullah M. and Setiawan F. A., The Effectiveness of Resampling Method for Handling Class Imbalances in Software Defect Prediction, 2023 International Conference on Information Technology Research and Innovation (ICITRI), August 2023, Jakarta, Indonesia, 22–27.
34 Galar M., Fernandez A., Barrenechea E., Bustince H., and Herrera F., A Review on Ensembles for the Class Imbalance Problem: Bagging-Boosting-And Hybrid-Based Approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews). (2012) 42, no. 4, 463–484, https://doi.org/10.1109/tsmcc.2011.2161285, 2-s2.0-84862515469.
35 Wang S. and Yao X., Using Class Imbalance Learning for Software Defect Prediction, IEEE Transactions on Reliability. (2013) 62, no. 2, 434–443, https://doi.org/10.1109/tr.2013.2259203, 2-s2.0-84878691303.
36 Tang Y., Dai Q., Yang M., Du T., and Chen L., Software Defect Prediction Ensemble Learning Algorithm Based on Adaptive Variable Sparrow Search Algorithm, International Journal of Machine Learning and Cybernetics. (2023) 14, no. 6, 1967–1987, https://doi.org/10.1007/s13042-022-01740-2.
37 Khoshgoftaar T. M., Geleyn E., Nguyen L., and Bullard L., Cost-Sensitive Boosting in Software Quality Modeling, 7th IEEE International Symposium on High Assurance Systems Engineering, October 2002, Tokyo, Japan, 51–60, https://doi.org/10.1109/hase.2002.1173102, 2-s2.0-34548267415.
38 Zheng J., Cost-sensitive Boosting Neural Networks for Software Defect Prediction, Expert Systems with Applications. (2010) 37, no. 6, 4537–4543, https://doi.org/10.1016/j.eswa.2009.12.056, 2-s2.0-77249108251.
39 Arar Ö. F. and Ayan K., Software Defect Prediction Using Cost-Sensitive Neural Network, Applied Soft Computing. (2015) 33, 263–277, https://doi.org/10.1016/j.asoc.2015.04.045, 2-s2.0-84929145814.
40 Jin C., Software Defect Prediction Model Based on Distance Metric Learning, Soft Computing. (2021) 25, no. 1, 447–461, https://doi.org/10.1007/s00500-020-05159-1.
41 Yang A. Q., Yu X. H., Su T. L., Jin X. B., and Kong J. L., Broad Learning System for Human Activity Recognition Using Sensor Data, International Journal of Computer Applications in Technology. (2019) 61, no. 4, 259–264, https://doi.org/10.1504/ijcat.2019.103297, 2-s2.0-85074198978.
42 Liu Z., Zhou J., and Chen C. L. P., Broad Learning System: Feature Extraction Based on K-Means Clustering Algorithm, 2017 4th International Conference on Information, Cybernetics and Computational Social Systems (ICCSS), July 2017, Dalian, China, 683–687, https://doi.org/10.1109/iccss.2017.8091501, 2-s2.0-85040587609.
43 Han J., Xie L., Liu J., and Li X., Personalized Broad Learning System for Facial Expression, Multimedia Tools and Applications. (2020) 79, no. 23-24, 16627–16644, https://doi.org/10.1007/s11042-019-07979-2, 2-s2.0-85069671367.
44 Jin J., Qin Z., Yu D., Li Y., Liang J., and Chen C. P., Regularized Discriminative Broad Learning System for Image Classification, Knowledge-Based Systems. (2022) 251, https://doi.org/10.1016/j.knosys.2022.109306.
45 Chu F., Wang J., Cao Y., and Li S., Efficient and Effective Ensemble Broad Learning System Based on Structural Diversity, Applied Soft Computing. (2024) 167, https://doi.org/10.1016/j.asoc.2024.112412.
46 Li Y., Gao Y., Jin J. et al., Adaptive Weights-Based Relaxed Broad Learning System for Imbalanced Classification, Digital Signal Processing. (2025) 156, https://doi.org/10.1016/j.dsp.2024.104869.
47 Ling C. X., Huang J., and Zhang H., AUC: A Better Measure Than Accuracy in Comparing Learning Algorithms, Advances in Artificial Intelligence: 16th Conference of the Canadian Society for Computational Studies of Intelligence, June 2003, Halifax, Canada, 329–341, https://doi.org/10.1007/3-540-44886-1_25, 2-s2.0-7044227562.
