Volume 2025, Issue 1 5595809
Research Article
Open Access

LBN-YOLO: A Lightweight Road Damage Detection Model Based on Multiscale Contextual Feature Extraction and Fusion

Guizhen Niu
School of Mechanical, Electrical and Information Engineering, Shandong University, Weihai 264209, China

Guangming Li (Corresponding Author)
School of Mechanical, Electrical and Information Engineering, Shandong University, Weihai 264209, China
Shandong University–Weihai Research Institute of Industrial Technology, Weihai 264209, China

Chengyou Wang
School of Mechanical, Electrical and Information Engineering, Shandong University, Weihai 264209, China
Shandong University–Weihai Research Institute of Industrial Technology, Weihai 264209, China

Kaixuan Hui
School of Mechanical, Electrical and Information Engineering, Shandong University, Weihai 264209, China
First published: 26 July 2025
Academic Editor: Tzu-Kang Lin

Abstract

Detecting and classifying road damage are crucial for road maintenance. To address the limitations of existing road damage detection methods, including insufficient fine-grained contextual feature extraction and model complexity unsuitable for deployment, this paper proposes a lightweight backbone-and-neck road damage detection model named LBN-YOLO. First, the backbone and neck of the original model are made lightweight, and the C2f-dilation-wise residual (C2f-DWR) module is integrated into the backbone to extract multiscale contextual information. Second, a simplified bidirectional feature pyramid network is employed in the neck to optimize the feature fusion network, reducing the number of parameters and the model complexity. Finally, a dynamic head with self-attention is introduced to enhance the sensing capability of the detection head, improving the precision of detecting occluded small objects. The proposed model's detection ability is evaluated on a custom road damage dataset. The experimental results demonstrate that the proposed LBN-YOLO model achieves superior performance compared with the YOLOv8n model, with an increase of 4.3% in mAP@0.5 and a 5.2% enhancement in precision, outperforming other detection models. In addition, the model is evaluated on two public datasets, showing improved detection performance compared with the original model and demonstrating strong generalization capability. Code and dataset are available at https://github.com/gzNiuadc/Road-crack-dataset.

1. Introduction

Roads are an important part of the transportation system. As usage duration and traffic loading increase, various kinds of damage gradually appear, with road cracks being the most common [1]. Serious road damage poses a potential threat to the safety of pedestrians and vehicles, so monitoring road damage and its severity in a timely manner is particularly important. Achieving high-precision, automated road damage detection with high efficiency and low cost remains a major challenge.

Over the past two decades, conventional image processing techniques such as edge detection [2], morphological operations [3], and threshold segmentation have been widely employed for road damage detection. However, these methods are sensitive to illumination variation and noise, which lowers the precision of detection results. To overcome these limitations, researchers have applied deep learning approaches to road damage detection. Compared with traditional image processing methods, deep learning methods are faster and more robust [4].

1.1. Related Works

As a classical deep learning task, object detection can locate and classify road damage. Object detection models are divided into one-stage and two-stage detectors [5]. One-stage models include SSD [6], RetinaNet [7], and the YOLO series [8, 9]. Typical two-stage models are Fast R-CNN [10] and Faster R-CNN [11]. One-stage detectors predict detection boxes directly, eliminating the candidate-box generation step and thus achieving faster detection. The YOLO series is a classical one-stage approach that balances detection speed and precision.

To further improve the detection precision of YOLO series models, Liu et al. [12] optimized the YOLOv3 model by adding a four-scale detection layer and replacing the intersection over union (IoU) loss with the EIoU loss, effectively enhancing the detection of small and hidden pavement cracks. Xing et al. [13] incorporated an additional Swin-Transformer prediction head (SPH) into the YOLOv5 model to enhance feature extraction in complex environments; the improved YOLOv5 model increased road crack detection accuracy by 15.4%, with the ability to detect cracks as small as 1.2 mm. Xiong et al. [14] proposed an improved model, YOLOv8-GAM-wise-IoU, based on YOLOv8, which significantly improves the accuracy and efficiency of bridge crack detection by introducing a global attention module and the wise-IoU loss function. Dong et al. [15] integrated visual attention networks (VANs), a large convolutional attention (LCA) module, and a large separable kernel attention (LSKA) module into the YOLOv8 model to enhance the extraction of damage shapes and localized features on concrete surfaces, achieving a 15% improvement in mAP@0.5.

A lightweight model facilitates deployment across various applications. To this end, Yu et al. [16] proposed YOLOv4-FPM, based on YOLOv4, for real-time bridge crack detection; a pruning algorithm was applied to simplify the network structure, increasing detection speed by 20 times. Diao et al. [17] replaced the backbone network of YOLOv5 with the lightweight MobileNetV3 and substituted the squeeze-and-excitation (SE) channel attention mechanism in MobileNetV3 with the shuffle attention (SA) mechanism; this adaptation not only improved information exchange between channel and spatial dimensions but also reduced the model's parameter count, providing an important reference for lightweight road damage detection algorithms. Ning et al. [18] proposed an improved YOLOv7-RDD model, which integrates a lightweight aggregation network, an enhanced spatial feature pyramid structure, and a similarity-based attention mechanism (SimAM), achieving 145 FPS on the CQURDD dataset and demonstrating high detection efficiency. Zhao et al. [19] introduced a lightweight real-time road damage detection model by adopting MobileNetV3 as the backbone and incorporating the efficient channel attention (ECA) module, reducing computational complexity by 46.2% while maintaining real-time performance.

Although numerous advancements have been made in improving either the speed or the accuracy of road damage detection models, research addressing both aspects simultaneously remains limited. Xing et al. [20] enhanced the YOLOv5 model by introducing an efficient decoupled head and replacing traditional convolution with the GCC3 module for global context modeling, improving road damage detection; on the RDD2022 dataset, accuracy increased by 1.5%, but parameters and computational cost doubled compared with the original model. Wang and Chen [21] proposed a transformer-based detector incorporating receptive field attention blocks and a feature assignment mechanism to enhance crack detection accuracy; however, the 11% and 15% increases in GFLOPs and parameters, respectively, reduced detection speed. Wan et al. [22] improved the BR-DETR model for bridge damage detection by introducing deformable Conv2d and employing copy–paste augmentation, boosting accuracy, but the high complexity of the transformer limits real-time detection. Meng et al. [23] developed a lightweight pavement crack detection method based on YOLOv8 and a knowledge distillation model with multiple teacher–assistants (KDMTA), achieving a 79.6% improvement in image processing speed at the cost of a 2.14% drop in detection accuracy and a 4.64% drop in mean average precision (mAP). He et al. [24] replaced standard convolution in YOLOv7 with ghost convolution and applied depthwise separable convolution to reduce parameters; by combining channel width and depth optimization with knowledge distillation, the model achieves a lightweight design with only a slight degradation in detection accuracy.

1.2. Motivations

In summary, road damage detection requires a balanced emphasis on both detection precision and real-time capability. However, current detection models are often complex and inefficient [20–23]. The intricate morphology of different road damages, coupled with complex background interference and fluctuating lighting conditions, presents significant challenges for accurate road damage recognition. Therefore, this study aims to develop a robust road damage detection model capable of accurately identifying and categorizing various road damages while minimizing training cost and detection time.

1.3. Contributions

This study introduces the road damage detection model LBN-YOLO, which is based on YOLOv8n and achieves a balance between precision and speed. The primary contributions are as follows.
1. A dataset of road damage in real scenes has been constructed, covering seven types of road damage under various lighting conditions and diverse backgrounds. This dataset enriches the data available for road damage detection.

2. The proposed LBN-YOLO model is characterized by its lightweight design. It integrates the dilation-wise residual (DWR) structure into the C2f module of the feature extraction network, forming the C2f-DWR module, which enhances the backbone network's ability to extract multiscale contextual and global information.

3. The neck of the detection model is enhanced by eliminating nodes with minimal impact on feature fusion to reduce feature redundancy, yielding a simplified bidirectional feature pyramid network (SBiFPN) based on the feature fusion concept of the bidirectional feature pyramid network (BiFPN). In addition, cross-level connections are introduced to fuse feature information at multiple scales, improving the detection of objects at various scales.

4. A dynamic head (Dyhead) is employed as the detection head, applying scale awareness across feature pyramid levels, spatial awareness across spatial positions, and task awareness across output channels. This significantly enhances the classification and localization precision for small objects affected by shadows and occlusions.

The rest of the paper is organized as follows. Section 2 presents the details of the road damage detection model proposed in this study. Section 3 elaborates on the dataset used, experimental configurations, and outcomes. Sections 4 and 5 summarize the paper’s findings and outline potential directions for future research.

2. Algorithms for Road Damage Detection

This section introduces the improved design of the LBN-YOLO model, including the C2f-DWR module in the backbone network, the SBiFPN feature fusion network, and the Dyhead detection head incorporating multiple self-attention mechanisms, aiming to enhance the accuracy and efficiency of road damage detection.

2.1. LBN-YOLO Algorithm

YOLOv8 is an enhanced version of YOLOv5 and has recently demonstrated excellent performance in road damage detection [25, 26]. YOLOv8 is released in five pretrained versions of different model sizes: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x [19]. Among them, YOLOv8n has the fewest parameters, saving computational resources while maintaining high detection precision, which is consistent with our research goal; this paper therefore chooses YOLOv8n as the baseline model. YOLOv8 introduces three significant improvements over YOLOv5. First, it replaces the C3 module in the backbone with the C2f module, which carries rich gradient-flow information while reducing computational complexity. Second, it eliminates the 1 × 1 convolution preceding the upsampling layer in the neck. Lastly, the head uses a decoupled structure that separates the classification task from the regression task, and the anchor-based mechanism is replaced with an anchor-free one [27, 28].
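For readers unfamiliar with the C2f design, the following is a minimal PyTorch sketch of its split–transform–concatenate pattern, which is what gives the module multiple gradient-flow paths. It is a simplified illustration under assumed channel handling, not the Ultralytics implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # Residual block: two 3x3 convs with an identity shortcut.
    def __init__(self, c):
        super().__init__()
        self.cv1 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU())

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C2f(nn.Module):
    # Split the features, pass one half through n bottlenecks, and
    # concatenate every intermediate output so that gradients flow
    # through multiple parallel paths.
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1)
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for m in self.m:
            y.append(m(y[-1]))
        return self.cv2(torch.cat(y, dim=1))
```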

The road damage images we collected contain complex backgrounds, such as water stains, leaves, and shadows, which interfere with detection and make the YOLOv8n model prone to misdetections and omissions. To address these challenges, this study proposes an improved model named LBN-YOLO; Figure 1 illustrates its architecture. In the backbone, two C2f-DWR modules replace the final C2f modules, enhancing the extraction of contextual information and multiscale object features. In the neck, SBiFPN is introduced for bidirectional fusion and interaction of multiscale feature information while simultaneously achieving a lightweight design. To further enhance the precision of detecting occluded small objects, the dynamic detection head Dyhead, with its self-attention mechanisms, is employed.

Figure 1: Overall structure of LBN-YOLO.

2.2. C2f-DWR Module

The C2f module in YOLOv8 acquires rich gradient-flow information through residual concatenation; its construction is illustrated in Figure 1. The objects in our collected road damage images appear at widely varying scales, but the standard convolution in the C2f module limits the network's receptive field and lacks the ability to extract multiscale contextual information, potentially leading to false detections of damaged objects. To resolve this limitation, this paper proposes the C2f-DWR module, which efficiently captures multiscale contextual information based on C2f. C2f-DWR contains multiple DWR modules, as illustrated in Figure 2.

Figure 2: Structure of the C2f-DWR and DWR modules. (a) C2f-DWR and (b) DWR.

The DWR module has a residual structure, featuring three dilated convolution branches with dilation rates of 1, 3, and 5 [29], alongside two standard convolution modules with 1 × 1 and 3 × 3 kernels, as shown in Figure 2(b). It extracts the contextual information of multiscale features in two steps. First, initial feature extraction is carried out with a 3 × 3 convolution. Subsequently, the resulting feature maps are fed into the three dilated convolution branches with different dilation rates to extract features rich in contextual semantic information, and the outputs of the three branches are concatenated along the channel dimension. Finally, the integrated feature map is added to the original feature map through the residual connection, producing a feature map that contains both the original feature information and multiscale contextual information.
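The two-step design described above can be sketched in PyTorch as follows; the per-branch channel widths and the 1 × 1 fusion convolution are illustrative assumptions rather than the exact configuration in [29].

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=3, d=1):
    # Convolution + BN + activation; padding keeps the spatial size.
    p = d * (k - 1) // 2
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=p, dilation=d, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU())

class DWR(nn.Module):
    # Step 1: initial 3x3 feature extraction.
    # Step 2: three parallel dilated 3x3 convs (rates 1, 3, 5) capture
    # multiscale context; their outputs are concatenated, fused by a
    # 1x1 conv, and added back to the input as a residual.
    def __init__(self, c):
        super().__init__()
        self.pre = conv_bn_act(c, c, 3)
        self.branches = nn.ModuleList(conv_bn_act(c, c, 3, d=r) for r in (1, 3, 5))
        self.fuse = conv_bn_act(3 * c, c, 1)

    def forward(self, x):
        y = self.pre(x)
        y = torch.cat([b(y) for b in self.branches], dim=1)
        return x + self.fuse(y)
```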

The scales and shapes of the captured road damage vary significantly. The DWR module expands the receptive field without sacrificing feature map resolution, thereby enhancing multiscale feature extraction. Consequently, integrating the C2f-DWR module into the backbone network enables more comprehensive feature extraction, allowing local fine-grained features, captured when the receptive field is small, to be fused and interact with global coarse-grained features, captured when the receptive field is large.

2.3. Improved Feature Fusion Network SBiFPN

The feature fusion network of YOLOv8 combines a feature pyramid network (FPN) and a path aggregation network (PAN), as shown in Figure 3(a). PANet [30] achieves bidirectional feature fusion through upsampling and downsampling, which may cause feature information loss or redundancy. The BiFPN [31] network performs more elaborate bidirectional cross-scale fusion on top of PAN while eliminating single-input nodes that contribute little to feature fusion. In Figure 3(b), the first node on the right side of P7 exemplifies this kind of node: because its contribution to feature fusion is minimal, its removal has negligible impact on the overall network. However, the BiFPN network remains structurally complex.

Figure 3: Comparison of different feature fusion networks. (a) PANet, (b) BiFPN, and (c) SBiFPN.
To better fuse the multiscale road damage features, overcome the information loss problem, and simplify the network structure, this study further optimizes the feature fusion network. The enhanced feature fusion network first aligns the number of channels through three layers of convolution for subsequent feature fusion. Then, bottom-up and top-down bidirectional feature fusion flows are constructed, and additional lateral connections are added to the bidirectional cross-scale information flow to weight the original feature information against cross-level feature information. Taking P6 from Figure 3(c) as an example, equation (1) gives the intermediate (top-down) feature and equation (2) gives the output feature:

$$P_6^{\mathrm{td}} = \mathrm{Conv}\left(\frac{w_1 \cdot P_6^{\mathrm{in}} + w_2 \cdot \mathrm{Resize}(P_7^{\mathrm{in}})}{w_1 + w_2 + \varepsilon}\right) \quad (1)$$

$$P_6^{\mathrm{out}} = \mathrm{Conv}\left(\frac{w_1' \cdot P_6^{\mathrm{in}} + w_2' \cdot P_6^{\mathrm{td}} + w_3' \cdot \mathrm{Resize}(P_5^{\mathrm{out}})}{w_1' + w_2' + w_3' + \varepsilon}\right) \quad (2)$$

where the w terms are learned fusion weights, ε is a small constant for numerical stability, and "Resize" denotes upsampling or downsampling to match feature map sizes.
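A minimal PyTorch sketch of the weighted fusion node underlying equations (1) and (2) is given below, assuming the fast normalized fusion form; tensor shapes and channel counts are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    # Fast normalized fusion: learn one non-negative weight per input
    # and normalize so the weights sum to one (plus eps for stability).
    def __init__(self, n_inputs, channels):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.eps = 1e-4

    def forward(self, feats):
        # feats: list of maps already resized to a common resolution.
        w = F.relu(self.w)
        w = w / (w.sum() + self.eps)
        fused = sum(wi * f for wi, f in zip(w, feats))
        return self.conv(fused)

# Example: the P6 output node fusing its input, the intermediate
# (top-down) feature, and the resized lower-level output, as in eq. (2).
fuse_p6 = WeightedFusion(n_inputs=3, channels=64)
p6_in, p6_td = torch.randn(1, 64, 20, 20), torch.randn(1, 64, 20, 20)
p5_out = torch.randn(1, 64, 40, 40)
p6_out = fuse_p6([p6_in, p6_td, F.interpolate(p5_out, size=(20, 20))])
```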

Since feature fusion primarily occurs at the P4 and P6 levels and large-scale objects are relatively rare, the P3-level output node shown in Figure 3(c) is removed to enhance small object detection performance, simplify the network, and further reduce computational overhead. Road cracks are typically small, irregularly shaped, and defined by subtle features, relying heavily on detailed information, such as edges and textures, embedded in low-level features. In small object detection, low-level features carry abundant positional information critical for accurate localization and boundary delineation. Directly connecting the low-level P3 features to the high-level output allows detailed information from the lower layers to propagate efficiently to the final output, preserving the fine-grained features critical for effective detection.

The optimized feature fusion network fuses multiscale positional and semantic information with different weights, removes redundant feature information, significantly improves the detection precision of small objects, and overcomes the defects of the traditional feature pyramid network.

2.4. Dyhead

Images captured in real scenes contain small objects under uneven illumination and shadow occlusion, which are difficult to detect. To improve detection precision, the Dyhead [32] detection head, based on self-attention mechanisms, is introduced. Dyhead combines a scale-aware attention module, a spatial-aware attention module, and a task-aware attention module. Incorporating these three self-attention modules enhances the feature representation ability of the detection head and improves the detection precision for small objects.

The three attention modules in Dyhead are decoupled, each focusing on a single dimension, and are applied sequentially [33]. The attention applied to the detection features is given by equation (3):

$$W(F) = \pi_C\left(\pi_S\left(\pi_L(F) \cdot F\right) \cdot F\right) \cdot F \quad (3)$$

where F is an L × S × C 3D tensor obtained from the backbone network and input to the three attention modules, π_L is the scale-aware attention function, π_S is the spatial-aware attention function, and π_C is the task-aware attention function.
The scale-aware attention module dynamically fuses feature information of different scales along the L dimension; π_L is defined in equation (4):

$$\pi_L(F) \cdot F = \sigma\left(f\left(\frac{1}{SC}\sum_{S,C} F\right)\right) \cdot F \quad (4)$$

where S and C represent the spatial and task (channel) dimensions, respectively, σ denotes the sigmoid activation function, and f(·) is a linear function approximated by a 1 × 1 convolution.
The spatial-aware attention module is applied after the scale-aware attention module; it adds sparsity to attention learning through deformable convolution to aggregate information from different levels at the same spatial location, as shown in equation (5):

$$\pi_S(F) \cdot F = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} w_{l,k} \cdot F\left(l;\ p_k + \Delta p_k;\ c\right) \cdot \Delta m_k \quad (5)$$

where p_k + Δp_k is the shifted position obtained by applying the self-learned spatial offset Δp_k to position p_k, Δm_k denotes the self-learned importance factor at that position, and K is the number of sparsely sampled positions.
The task-aware attention module dynamically controls the opening and closing of feature channels for different tasks, as shown in equation (6):

$$\pi_C(F) \cdot F = \max\left(\alpha^1(F) \cdot F_c + \beta^1(F),\ \alpha^2(F) \cdot F_c + \beta^2(F)\right) \quad (6)$$

where F_c denotes the feature slice of the c-th channel and the parameters [α^1, β^1, α^2, β^2] are learned functions that control the activation thresholds of the channels.

The structure of Dyhead is shown in Figure 4. A single Dyhead block consists of the three attention modules in series, and multiple Dyhead blocks can be stacked.

Figure 4: Structure of the Dyhead module.
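As an illustration, the following is a simplified PyTorch sketch of the scale-aware attention in equation (4); treating f(·) as a linear layer over the level dimension is an assumption made for compactness.

```python
import torch
import torch.nn as nn

class ScaleAwareAttention(nn.Module):
    # pi_L: average the feature over space and channels, map the
    # per-level statistic through a linear layer (standing in for the
    # 1x1-conv approximation of f), and gate each pyramid level with
    # a sigmoid weight.
    def __init__(self, num_levels):
        super().__init__()
        self.f = nn.Linear(num_levels, num_levels)
        self.sigma = nn.Sigmoid()

    def forward(self, x):
        # x: (B, L, S, C) tensor; L pyramid levels, S flattened positions.
        stat = x.mean(dim=(2, 3))        # (B, L): mean over S and C
        w = self.sigma(self.f(stat))     # (B, L): one weight per level
        return x * w[:, :, None, None]

attn = ScaleAwareAttention(num_levels=3)
out = attn(torch.randn(2, 3, 400, 64))  # e.g., 20x20 positions, 64 channels
```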

3. Experiment and Analysis

This section provides an overview of the experimental setup, evaluation methods, and analysis for model training, including data preprocessing, evaluation metrics, ablation experiments for individual modules, performance comparison with other algorithms, and validation results on different datasets.

3.1. Dataset and Preprocessing

Since publicly available road damage datasets containing complex environmental disturbances are scarce, this paper captured 1300 road damage images of 3000 × 4000 pixels using a smartphone. The images cover damage on various road surfaces, including asphalt concrete, cement concrete, and masonry pavement. The acquired data were then filtered to eliminate excessively blurred, poorly lit, and overexposed images, ensuring high image quality. The filtered road damage dataset covers seven types of road damage: transverse crack (TC), longitudinal crack (LC), oblique crack (OC), alligator crack (AC), block crack (BC), pothole (Po), and repair (Re); examples are shown in Figure 5. Since repair damage is relatively uncommon in real road scenes and the amount of collected image data is limited, some high-quality repair-type images from the public dataset RDD2020 [34] were selected to augment the self-constructed dataset. The merged dataset is divided into training, validation, and test sets in a 7:2:1 ratio.

Figure 5: Examples of road damage in the dataset.
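For illustration, a 7:2:1 split can be implemented as in the following sketch; the directory paths and random seed are assumptions, and the per-class proportion consistency described below would additionally require a stratified split.

```python
import random
from pathlib import Path

# Shuffle the merged dataset and split it 7:2:1 (paths are illustrative).
images = sorted(Path("dataset/images").glob("*.jpg"))
random.seed(0)
random.shuffle(images)
n = len(images)
n_train, n_val = int(0.7 * n), int(0.2 * n)
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}
# Write one file list per split, as commonly consumed by YOLO tooling.
for name, files in splits.items():
    Path(f"dataset/{name}.txt").write_text("\n".join(map(str, files)))
```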

The richness of a dataset is closely related to detection effectiveness, and training on datasets containing a variety of complex environments yields better results and stronger robustness [35], effectively preventing the model from overfitting. Therefore, this paper employs both offline and online data augmentation. Online augmentation uses Mosaic augmentation, while offline augmentation includes common methods such as random flipping and color perturbation. Augmentation enriches the positional and color information of the damage images, and the augmented dataset contains about 3000 road damage images in total. The sample counts for the various road damage types are shown in Table 1 (a sketch of the offline pipeline follows the table). The proportion of each damage type in the training, validation, and test sets remains consistent with the overall dataset distribution.

Table 1. Number of images of various road damage types in the dataset.
Type of road damage Total number Training set Validation set Test set
Transverse crack 1221 917 225 79
Longitudinal crack 1608 1108 335 165
Oblique crack 2059 1538 329 192
Block crack 377 243 80 54
Alligator crack 447 333 77 37
Pothole 341 251 62 28
Repair 602 438 126 38
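The offline augmentation mentioned above might look like the following torchvision sketch; the exact transforms and parameter values used in this work are not specified, so these are assumptions. Note that geometric flips also require flipping the bounding box labels, which is omitted here for brevity.

```python
from pathlib import Path
from PIL import Image
from torchvision import transforms

# Offline augmentation: random flips and color perturbation, saved back
# to disk to enlarge the dataset before training (paths illustrative).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2),
])

src, dst = Path("dataset/images/raw"), Path("dataset/images/augmented")
dst.mkdir(parents=True, exist_ok=True)
for img_path in src.glob("*.jpg"):
    img = Image.open(img_path).convert("RGB")
    augment(img).save(dst / f"aug_{img_path.name}")
```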

Training on 3000 × 4000 pixel images is time consuming and computationally demanding, while directly cropping damaged images may discard critical information. Therefore, determining an input image size that balances detection performance and training efficiency is crucial. To this end, a comparative experiment was conducted with different input sizes (320 × 320, 640 × 640, and 1280 × 1280) while keeping the other hyperparameters constant; the results are shown in Table 2. The results indicate that 640 × 640 provides the best balance between detection accuracy and computational efficiency. The smaller input size (320 × 320) reduces detection accuracy, whereas the larger size (1280 × 1280) does not improve accuracy in our experiments while significantly increasing memory consumption: GPU limits force the batch size down from 32 to 8, prolonging training. Furthermore, previous research [36] indicates that increasing image resolution does not substantially enhance detection accuracy. Therefore, images are resized to 640 × 640 for training to optimize performance and efficiency.

Table 2. Comparison of detection results for different image sizes on the LBN-YOLO model.
Image size mAP@0.5 (%) mAP@0.5:0.95 (%) GPU memory usage (GB) Training time (h)
320 × 320 63.4 44.9 2.5 1.4
640 × 640 74.0 51.9 9.2 3.4
1280 × 1280 64.5 38.4 9.2 9.2

To ensure the accuracy and reliability of the annotations, the collected road damage image dataset is manually annotated by experts using LabelImg. Detailed annotation guidelines are established based on international standards, with a strict verification procedure. Randomly selected annotated samples are validated to enhance dataset consistency and credibility.

3.2. Experimental Parameter Setting and Model Training

The model training and testing were conducted on an Intel Core i5-12400F CPU and an NVIDIA GeForce RTX 4060 Ti GPU. The experimental framework is PyTorch 1.13.1 with CUDA 11.3, and the algorithms are implemented in Python 3.9.18.

The hyperparameter settings used during training are shown in Table 3. Considering the available computational resources, the batch size is set to 32 and the model is trained for 400 epochs with the stochastic gradient descent (SGD) optimizer; an early stopping strategy is used to prevent overfitting. A sketch of this configuration follows Table 3.

Table 3. Training parameter settings.
Hyperparameters Value
Batch size 32
Epoch 400
Initial learning rate 0.01
Warmup 3
Optimizer SGD
Momentum 0.9
Weight decay 0.0005
Training mechanism EarlyStopping
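Under the Ultralytics training API, the Table 3 configuration corresponds roughly to the following call; the dataset configuration file name and the early stopping patience value are assumptions not given in the paper.

```python
from ultralytics import YOLO

# Baseline training run reproducing the Table 3 settings.
model = YOLO("yolov8n.yaml")   # or a custom LBN-YOLO model definition
model.train(
    data="road_damage.yaml",   # dataset config (assumed name)
    epochs=400,
    batch=32,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,                  # initial learning rate
    momentum=0.9,
    weight_decay=0.0005,
    warmup_epochs=3,
    patience=50,               # early stopping patience (value assumed)
)
```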

3.3. Evaluation Indicators

This paper uses evaluation metrics common in object detection, namely precision (P), recall (R), mAP@0.5, mAP@0.5:0.95, parameter count (Para), frames per second (FPS), and F1-score (F1), to evaluate the road damage detection model. In object detection, predictions are classified as positive or negative samples. A positive sample is a correctly detected object, where the IoU between the predicted bounding box and the ground-truth box exceeds a predefined threshold; a negative sample is one where no object is detected or the IoU falls below the threshold.
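The IoU criterion can be computed as in the following sketch:

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2); returns intersection over union.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# A prediction counts as a true positive when IoU >= 0.5, the threshold
# used for mAP@0.5 in this paper.
print(iou((0, 0, 100, 60), (20, 10, 110, 70)))  # ~0.54
```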

Precision denotes the proportion of samples predicted as positive that are truly positive, also known as the check precision rate, as shown in equation (7):

$$P = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FP}}} \quad (7)$$

where N_TP denotes the number of samples correctly predicted as positive, N_TN the number correctly predicted as negative, N_FP the number incorrectly predicted as positive, and N_FN the number incorrectly predicted as negative.
Recall denotes the proportion of true positive samples that are correctly predicted as positive, as shown in equation (8):

$$R = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FN}}} \quad (8)$$
The F1-score is an evaluation index that jointly considers precision and recall, as shown in equation (9):

$$F_1 = \frac{2 \times P \times R}{P + R} \quad (9)$$
When 0.5 is set as the IoU threshold, the average precision (AP) is computed over different recall values; AP measures the detection quality for a single category, and mAP averages the AP values over all categories, as shown in equation (10):

$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i \quad (10)$$

where N denotes the total number of categories and AP_i is the AP value of category i.
The evaluation of a detection model covers both detection precision and detection speed; FPS is used to evaluate detection speed and is calculated as shown in equation (11):

$$\mathrm{FPS} = \frac{N_{\mathrm{image}}}{T_{\mathrm{total}}} \quad (11)$$

where T_total is the sum of inference time, preprocessing time, and nonmaximum suppression (NMS) time, and N_image is the total number of images.
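A compact sketch of these metric computations (equations (7)–(11)) follows; the model_fn passed to measure_fps is assumed to wrap preprocessing, inference, and NMS.

```python
import time

def detection_metrics(n_tp, n_fp, n_fn):
    # Precision, recall, and F1 from true/false positive and false
    # negative counts, following equations (7)-(9).
    p = n_tp / (n_tp + n_fp + 1e-9)
    r = n_tp / (n_tp + n_fn + 1e-9)
    f1 = 2 * p * r / (p + r + 1e-9)
    return p, r, f1

def mean_ap(ap_per_class):
    # mAP: the average of per-class AP values, equation (10).
    return sum(ap_per_class) / len(ap_per_class)

def measure_fps(model_fn, images):
    # FPS = N_image / T_total, where T_total covers preprocessing,
    # inference, and NMS inside model_fn, equation (11).
    start = time.perf_counter()
    for img in images:
        model_fn(img)
    return len(images) / (time.perf_counter() - start)
```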

3.4. Model Validation and Results

The model was trained on the self-built dataset, and the detection results for the seven road damage types are shown in Table 4, where "Total" denotes the overall results across all categories. Table 4 shows that with the proposed LBN-YOLO, the total mAP@0.5 across all types increased by 4.3%.

Table 4. Comparison of detection results of seven road damage types.
Class P (%) R (%) F1 (%) mAP@0.5 (%) mAP@0.5:0.95 (%)
YOLOv8n:
Total 69.9 63.7 66.7 69.7 43.7
Alligator crack 84.4 76.7 80.4 83.9 49.0
Block crack 31.4 100.0 47.8 99.5 84.5
Longitudinal crack 68.2 53.2 59.8 59.1 31.6
Oblique crack 67.1 47.9 55.9 53.1 31.9
Pothole 91.7 41.7 57.3 48.2 15.7
Repair 71.5 81.1 76.0 77.1 59.3
Transverse crack 75.3 59.1 66.2 68.0 34.2
  
LBN-YOLO:
Total 75.1 65.9 70.6 74.0 51.9
Alligator crack 89.8 73.3 80.7 87.4 55.2
Block crack 35.9 100.0 52.8 99.5 94.5
Longitudinal crack 72.0 51.6 60.1 60.0 37.3
Oblique crack 69.8 45.6 55.2 55.5 32.7
Pothole 92.9 41.7 57.6 62.5 32.1
Repair 82.0 86.2 84.1 83.8 71.9
Transverse crack 83.4 59.1 69.2 69.5 39.9

When the original YOLOv8n model is used for road damage detection, the results for the pothole type are notably low, with an mAP@0.5 of only 48.2%, owing to the scarcity of pothole images and their widely varying sizes and shapes. The LBN-YOLO model incorporates the C2f-DWR module and SBiFPN, enabling more effective extraction and fusion of the irregular features of potholes across multiple scales. With LBN-YOLO, the mAP@0.5 for potholes increases by 14.3% and the mAP@0.5:0.95 by 16.4%, the largest improvements among the damage categories. This indicates the strong performance of the proposed model on the more challenging damage categories, reducing both missed detections and false alarms.

The LBN-YOLO model demonstrates clear improvements in P, R, and F1 over the original YOLOv8n model, as shown by the curves in Figure 6 and the confusion matrices in Figure 7. It is evident from Figures 6(a) and 6(d) that the enhanced LBN-YOLO model surpasses the original model trained for 400 epochs. Figure 7 compares the confusion matrices of the two models; the horizontal axis represents the actual damage categories, and the vertical axis the predicted categories. The classification precision in Figure 7(b) surpasses that in Figure 7(a), confirming the superior performance of LBN-YOLO over YOLOv8n in detecting road damage.

Figure 6: Comparison curves before and after improvement of the YOLOv8n model. (a) Precision curve, (b) recall curve, (c) mAP@0.5 curve, and (d) F1-score curve.
Figure 7: Comparison of confusion matrices before and after YOLOv8n model improvement. (a) YOLOv8n confusion matrix (normalized) and (b) LBN-YOLO confusion matrix (normalized).

3.5. Ablation Experiment

To verify the effectiveness of the modules proposed in this paper, ablation experiments are designed in which the proposed C2f-DWR, SBiFPN, and Dyhead are added to the model individually and in combination. The experiments are conducted on the self-constructed dataset presented in Section 3.1, and the results are shown in Table 5, where a check mark indicates that the corresponding module is enabled.

Table 5. Results of ablation experiments.
YOLOv8n C2f-DWR SBiFPN Dyhead F1 (%) mAP@0.5 (%) mAP@0.5:0.95 (%) Para (M)
✓ — — — 67.7 69.7 43.7 3.0
✓ ✓ — — 69.4 71.7 46.1 2.9
✓ — ✓ — 65.7 69.0 43.1 2.0
✓ — — ✓ 70.0 71.2 48.9 3.4
✓ ✓ ✓ — 69.7 72.8 51.1 1.9
✓ ✓ — ✓ 69.2 72.1 51.0 3.4
✓ — ✓ ✓ 68.9 71.9 49.5 2.5
✓ ✓ ✓ ✓ 70.6 74.0 51.9 2.4

In the first row of Table 5, the detection results of the original YOLOv8n model serve as the baseline. Adding the C2f-DWR module alone increases mAP@0.5 by 2%. Adding SBiFPN alone makes the model lighter, reducing the parameter count by a substantial 34% while maintaining precision comparable to YOLOv8n. Combining two modules surpasses using each module individually; notably, the combination of C2f-DWR and SBiFPN raises mAP@0.5 from 69.7% to 72.8%, a 3.1% improvement, while also reducing computational cost. Finally, incorporating all three modules yields the best performance: compared with the original model, F1 increases by 2.9%, mAP@0.5 by 4.3%, and mAP@0.5:0.95 by 8.2%, with a 20% reduction in parameter count. These results demonstrate the clear advantages of the proposed LBN-YOLO model in detection precision and parameter count, with each module making a positive contribution.

To further verify the effectiveness of the improved SBiFPN and evaluate its impact on road damage detection, this study compares the original BiFPN and SBiFPN, as shown in Table 6. The results indicate that incorporating the original BiFPN into YOLOv8n improves all performance metrics, primarily owing to BiFPN's strong multiscale feature fusion capability. Integrating SBiFPN into YOLOv8n reduces the number of fusion nodes, simplifies the feature fusion network, and retains efficient feature fusion while reducing parameters by approximately 34%, with only a slight decline in the evaluation metrics. Compared with BiFPN, the integration of SBiFPN with the C2f-DWR and Dyhead modules yields significant mAP improvements, with mAP@0.5 increasing by 1.5% and mAP@0.5:0.95 by 4.7%. SBiFPN thus reduces model complexity and parameter count while achieving better detection performance.

Table 6. Comparison of detection performance between BiFPN and SBiFPN.
Models F1 (%) mAP@0.5 (%) mAP@0.5:0.95 (%) Para (M)
YOLOv8n 66.7 69.7 43.7 3.0
YOLOv8n + BiFPN 67.4 70.1 44.3 3.1
YOLOv8n + SBiFPN 65.7 69.0 43.1 2.0
YOLOv8n + C2f-DWR + Dyhead + BiFPN 70.2 72.5 47.2 3.4
LBN-YOLO (ours) 70.6 74.0 51.9 2.4

To visualize the image regions that the detection model attends to when making predictions, this study uses heatmap visualization, generating heatmaps from selected layers of the model. Figure 8 highlights the image regions on which the model relies when identifying cracks. As the improvement modules are added one by one, the model's attention becomes progressively focused on the region where the cracks are located. In particular, in Figure 8(d), after all improvement modules have been added, the model performs best, focusing its attention almost exclusively on the crack region.

Figure 8: Heatmap visualization. (a) YOLOv8n, (b) YOLOv8n + SBiFPN, (c) YOLOv8n + SBiFPN + Dyhead, and (d) YOLOv8n + C2f-DWR + SBiFPN + Dyhead.
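The paper does not name its exact heatmap method; the following is a minimal activation-based sketch in PyTorch that hooks a selected layer, averages its activation over channels, and upsamples the result to image size.

```python
import torch
import torch.nn.functional as F

activations = {}

def hook(_module, _inp, out):
    # Store the selected layer's feature map during the forward pass.
    activations["feat"] = out.detach()

def activation_heatmap(model, layer, image):
    # image: (1, 3, H, W) tensor; layer: a module inside the model.
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(image)
    handle.remove()
    heat = activations["feat"].mean(dim=1, keepdim=True)  # channel mean
    heat = F.relu(heat)
    heat = F.interpolate(heat, size=image.shape[-2:], mode="bilinear")
    # Normalize to [0, 1] for overlaying on the input image.
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-9)
```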

3.6. Comparing Different Object-Detection Algorithms

According to previous studies on road damage detection, models such as Faster R-CNN [11], SSD-MobileNet [6], YOLOv5 [37], and YOLOv7 [38] have been used for this task. To evaluate the detection performance of the improved LBN-YOLO model, comparative experiments were conducted against these models. The results are shown in Table 7 and Figure 9, with all inference times measured on the same RTX 4060 Ti.

Table 7. Comparison of LBN-YOLO with other state-of-the-art algorithms.
Models F1 (%) mAP@0.5 (%) mAP@0.5:0.95 (%) Para (M) FPS (f·s⁻¹)
Faster R-CNN 62.0 65.7 39.7 28.3 16.2
SSD 51.6 61.7 31.1 4.3 83.8
YOLOv5s 66.1 66.8 41.1 7.0 85.1
YOLOv7-tiny 67.1 67.7 42.5 6.0 61.7
YOLOv8n 66.7 69.7 43.7 3.0 74.5
LBN-YOLO (ours) 70.6 74.0 51.9 2.4 87.7
Figure 9: Comparison of various metrics between LBN-YOLO and other advanced algorithms.

Our model outperforms YOLOv5s, YOLOv7-tiny, YOLOv8n, SSD, and Faster R-CNN in detecting pavement damage. The mAP@0.5 of LBN-YOLO is 74.0%, a 6.3% improvement over YOLOv7-tiny, the best-performing of the other state-of-the-art models, and a notable 4.3% improvement over YOLOv8n. Faster R-CNN, as a two-stage detector, runs more slowly and uses far more parameters. In contrast, the proposed model has 20% fewer parameters than YOLOv8n, thanks to the adoption of SBiFPN as the feature fusion network, achieving a lightweight design. Moreover, its FPS reaches 87.7 frames/s, surpassing all other models. These results validate that the proposed model performs exceptionally well in both detection speed and precision, offering strong overall cost-effectiveness.

3.7. Visualization of Test Results

After training, the detection results are visualized to verify detection performance and to compare the strengths and weaknesses of the models more intuitively. A comparison of the visualization results is shown in Figure 10.

Figure 10: Visualization of the comparison experiment results. (a) Faster R-CNN, (b) YOLOv5s, (c) YOLOv7-tiny, and (d) LBN-YOLO.

In the first image set of Figure 10, YOLOv7-tiny misidentifies leaves as pothole damage, whereas LBN-YOLO makes no misidentification. Road damage typically exhibits spatial continuity, making it susceptible to misdetection or duplicate detections when a model cannot adequately extract global information. In the second set, Faster R-CNN, YOLOv5s, and YOLOv7-tiny fail to detect the damage, while LBN-YOLO successfully detects all the fine cracks, indicating stronger extraction of multiscale contextual information. In the third set, YOLOv5s misses detections due to shadow interference, and Faster R-CNN and YOLOv7-tiny generate multiple detection boxes; LBN-YOLO, aided by Dyhead's self-attention mechanism, detects the occluded cracks more reliably. In the fourth set, alligator cracks, a rare and difficult damage type, are missed entirely by Faster R-CNN, while YOLOv5s and YOLOv7-tiny detect the cracked areas but produce redundant detection boxes, indicating insufficient global feature extraction; in contrast, LBN-YOLO's detection boxes are accurate with high confidence. In the fifth set, Faster R-CNN misidentifies road markings as transverse cracks, while YOLOv5s and YOLOv7-tiny miss more detections; LBN-YOLO detects all road damage in the images with excellent performance.

Based on the experimental results, the LBN-YOLO model exhibits superior performance in road damage detection. It effectively detects road damage across various scales and under complex conditions while maintaining high detection precision. Moreover, the model has few parameters and high detection speed, making it well suited for road damage detection applications.

3.8. Performances on China_MotorBike Dataset and NEU-DET

To verify the generality of LBN-YOLO, the China_MotorBike subset of the publicly available road damage dataset RDD2022 [39] and the steel surface defect dataset NEU-DET [40] are selected for model validation, with training parameters kept the same as in Section 3.2.

The China_MotorBike dataset is a road damage dataset containing five label types. Detection results are presented in Table 8. Experimental results show that the proposed LBN-YOLO model surpasses the baseline YOLOv8n model in detecting all five damage types while reducing the miss rate. Figure 11 shows the confusion matrix, clearly illustrating the improved detection performance for each class.

Table 8. Comparison of detailed detection results for YOLOv8n and its improved model on the China_MotorBike dataset.
Class P (%) R (%) F1 (%) mAP@0.5 (%) mAP@0.5:0.95 (%)
YOLOv8n:
Total 87.6 84.5 86.0 91.3 62.8
D00 83.2 74.5 78.6 84.9 52.7
D10 84.7 62.2 71.7 80.1 46.7
D20 85.8 88.1 86.9 93.5 59.9
D40 89.9 97.5 93.5 98.6 60.8
Repair 94.5 100 97.2 99.5 93.9
  
LBN-YOLO:
Total 90.9 88.2 89.5 92.9 66.5
D00 86.8 80.0 83.3 87.1 60.1
D10 87.0 71.1 78.3 83.9 49.3
D20 88.9 89.8 89.3 94.1 66.3
D40 92.8 100 96.3 99.5 63.1
Repair 95.5 100 97.7 99.7 94.2
Figure 11: Comparison of confusion matrices for YOLOv8n and its improved model on the China_MotorBike dataset. (a) YOLOv8n confusion matrix (normalized) and (b) LBN-YOLO confusion matrix (normalized).

The NEU-DET dataset, designed for steel surface defect detection, contains 1800 images with six label types. As shown in Table 9, the experimental comparison between YOLOv8n and LBN-YOLO demonstrates that the proposed LBN-YOLO model outperforms YOLOv8n in mAP@0.5 and mAP@0.5:0.95 for most defect types. The precision for Patches and Pitted_surface decreases slightly, owing to their significant shape differences from road damage; nevertheless, the LBN-YOLO model achieves a lower miss rate for these categories. Figure 12 presents the confusion matrices for the improved and original models.

Table 9. Comparison of detailed detection results for YOLOv8n and its improved model on the NEU-DET dataset.
Class P (%) R (%) F1 (%) mAP@0.5 (%) mAP@0.5:0.95 (%)
YOLOv8n:
Total 75.5 67.4 71.2 74.9 39.8
Crazing 51.5 26.2 34.7 32.7 11.6
Inclusion 80.6 81.0 80.8 84.8 44.6
Patches 82.7 78.1 80.3 85.2 52.7
Pitted_surface 81.0 74.4 77.6 81.2 45.6
Rolled_in_scale 77.4 63.5 69.8 77.1 36.2
Scratches 79.7 81.4 80.5 88.5 48.4
  
LBN-YOLO:
Total 73.2 73.9 73.5 77.7 44.6
Crazing 51.6 35.5 42.1 40.6 13.8
Inclusion 81.3 83.4 82.4 84.9 45.5
Patches 80.8 79.1 79.9 87.7 58.8
Pitted_surface 77.1 74.6 75.8 83.4 52.6
Rolled_in_scale 72.8 80.8 76.6 81.2 41.9
Scratches 77.4 89.8 83.1 88.7 55.2
Figure 12: Comparison of confusion matrices for YOLOv8n and its improved model on the NEU-DET dataset. (a) YOLOv8n confusion matrix (normalized) and (b) LBN-YOLO confusion matrix (normalized).

The experimental results above indicate that LBN-YOLO improves detection performance on both public datasets while reducing parameters by 20%. Thus, the proposed model exhibits strong generalization ability and is suitable for most datasets.

4. Discussion

Existing deep learning-based road damage detection models generally suffer from low detection precision and struggle with multiscale and small-object road damage. This paper therefore focuses on overcoming these difficulties and proposes the improved model LBN-YOLO. According to the experimental results above, the model can accurately locate and classify a variety of road damage.

The experimental results demonstrate that integrating DWR modules into the last two C2f modules of the backbone network significantly enhances the model’s ability to extract global context information and obtain a more comprehensive feature representation. Utilizing SBiFPN as the feature fusion network further reduces redundant features and minimizes the number of model parameters. In addition, incorporating cross-level horizontal connections effectively fuses multiscale features with original features. Adopting Dyhead as the detection head, which combines scale-aware, spatial-aware, and task-aware attention mechanisms, enhances the model’s ability to detect occluded small objects. The proposed LBN-YOLO effectively balances precision and speed in road damage detection, reducing model parameters and computational load without compromising precision, thereby facilitating the model’s deployment in practical applications.

Although this study achieves high detection precision in road damage detection, the dataset remains limited, especially in images captured under complex conditions such as foggy weather. Future work will focus on collecting more diverse road damage images under complex conditions and on integration with road damage segmentation models such as U-Net, FCN, and PSPNet, aiming to further improve detection precision by extracting shape features and refining the road damage detection system.

5. Conclusions

This paper proposes a road damage detection model named LBN-YOLO, aiming to improve both detection precision and speed. To this end, the following work was done. (1) To address the challenges of a small original dataset and class imbalance, which make the model prone to overfitting, data augmentation is carried out using both offline and online methods. (2) To address feature information loss caused by downsampling, DWR modules are introduced into the last two C2f modules of the backbone network. (3) To improve feature fusion efficiency and reduce redundancy, SBiFPN is used as the feature fusion network, achieving cross-level weighted fusion. (4) Dyhead is adopted as the detection head to enhance scale-, spatial-, and task-awareness for small objects, improving the classification and localization of occluded small objects.

This study employs LBN-YOLO to detect seven types of road damage. Experimental results show an mAP@0.5 of 74.0% at 87.7 FPS, a 4.3% improvement over YOLOv8n. Furthermore, compared with other advanced object detection models, namely Faster R-CNN, SSD, YOLOv5s, and YOLOv7-tiny, the proposed method achieves mAP@0.5 gains of 8.3%, 12.3%, 7.2%, and 6.3%, respectively. Moreover, the model has been validated on the publicly available China_MotorBike road damage dataset and the steel surface defect dataset NEU-DET, obtaining better detection results than the original model and indicating good generalizability. Ablation experiments and visual validations further confirm the effectiveness of the adopted modules. In future work, the model might be deployed on edge devices, such as UAVs, to achieve real-time automatic detection of road damage.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This work was supported in part by Weihai Sunshine Engineering Technology Co., Ltd. (Project No. 1010025055); and in part by the Science and Technology Development Plan Project of Weihai Municipality under Grant 2022DXGJ13.

Acknowledgments

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.
