Volume 2025, Issue 1, Article ID 1677778
Research Article
Open Access

Vision-Aided Damage Detection With Convolutional Multihead Self-Attention Neural Network: A Novel Framework for Damage Information Extraction and Fusion

Yiming Zhang (1), Zili Xu (1, corresponding author), Guang Li (1), and Jun Wang (2)

(1) State Key Laboratory for Strength and Vibration of Mechanical Structures, Xi'an Jiaotong University, Xi'an 710049, China
(2) Science and Technology on Liquid Rocket Engine Laboratory, Xi'an Aerospace Propulsion Institute, Xi'an 710100, China
First published: 08 April 2025
Academic Editor: Ka-Veng Yuen

Abstract

The current application of vibration-based damage detection is constrained by the low spatial resolution of signals obtained from contact sensors and by an overreliance on hand-engineered damage indices. In this paper, we propose a novel vision-aided framework featuring a convolutional multihead self-attention neural network (CMSNN) for damage detection tasks. To meet the requirement of spatially dense measurements, a computer vision algorithm called optical flow estimation is employed to provide sufficiently informative mode shapes. As a downstream process, a CMSNN model is designed to autonomously learn high-level damage representations from noisy mode shapes without any manual feature design. In contrast to the conventional approach of solely stacking convolutional layers, the model combines a convolutional neural network (CNN)-based multiscale information extraction module with an attention-based information fusion module. During training, various scenarios are considered, including measurement noise, missing data, multiple damages, and undamaged samples. Moreover, a parameter transfer strategy is introduced to broaden the framework's applicability. The performance of the proposed framework is extensively verified on datasets based on numerical simulations and two laboratory measurements. The results demonstrate that the proposed framework provides reliable damage detection even when the input data are corrupted by noise or incomplete.

1. Introduction

Engineering structures are susceptible to various levels of damage due to harsh environments and complex loads [1, 2]. To guarantee the integrity and safety of structures, studies on structural health monitoring (SHM) have been pursued. In the field of SHM, vibration signals are frequently used to assess structural conditions [3]. Several vibration-based methods have been developed to detect damage, including those based on the frequency response function [4–6], modal damping [7–9], characteristic frequency [10–12], and mode shape [13–16]. Among these, mode shape-based methods are well suited to damage detection, as they leverage comprehensive spatial dynamic information.

Accurate mode shape measurement is vital for mode shape-based damage detection, as it directly influences the precision of damage localization. Currently, mode shape acquisition methods fall into two broad categories: contact and noncontact techniques. Contact methods based on contact sensors are the most widely adopted [17–19]. However, contact methods inevitably induce mass-loading effects and offer only sparse, discrete monitoring points, resulting in low spatial measurement resolution [20, 21], which is typically insufficient for mode shape-based damage detection. Noncontact methods, such as scanning laser vibrometer (SLV)-based and vision-based techniques, can collect vibration signals without requiring sensors to be physically mounted on structures. Yang et al. placed nineteen measurement points on an aluminum beam and utilized SLV to capture its mode shapes [22]. Pan et al. measured the mode shapes of carbon-epoxy curved plates under free conditions using SLV [23]. Xin et al. utilized high-speed videos processed with a phase-based computer vision algorithm to measure the mode shapes of beams [24]. Chen et al. applied complex-valued steerable pyramid filter banks to analyze digital videos of structural motion and extract mode shapes of a pipe [25]. Despite providing denser measurement points, the use of SLV is costly and time-consuming. In contrast, vision-based methods, which offer lower costs, higher efficiency, and high-spatial-resolution measurements, are gaining increasing attention [26–28].

On the other hand, there is currently no universally applicable mode shape-based index for revealing damage. Roy presented detailed mathematical derivations showing that the maximum difference in mode shape slopes occurs at the damage location [29]. Pooya et al. introduced the difference between mode shape curvature and its estimation as an indicator of damage location [30]. Cao et al. proposed a damage index combining the wavelet transform technique and mode shape curvature to detect multiple cracks in beam structures [31]. Xiang et al. adopted the modal curvature utility information entropy index (MCUIE) to capture damage-induced discontinuities in mode shapes [32]. Cui et al. identified fatigue cracks by calculating the spatially distributed wavelet entropy of mode shapes [33]. Although various damage indices have been presented and verified, these hand-engineered indices face many challenges. Formulating a hand-engineered damage index requires analyzing dynamic characteristics both before and after damage occurs, a process that relies heavily on domain-specific expert knowledge. Moreover, a specific damage index cannot be applied universally in realistic, complex, noisy environments, even for identical structures. Therefore, a model that requires no manual intervention, is robust to noise, and can handle extreme cases of partially missing data is desirable.

In recent years, data-driven methods based on deep neural networks (DNNs) have revolutionized numerous scientific domains [34]. One significant advantage of DNNs is their capacity to autonomously learn high-level feature representations from massive samples without manual feature engineering, enabling end-to-end prediction [35]. Scholars have made efforts to utilize DNN-based methods for damage detection. Oh et al. used a convolutional neural network (CNN) to model the interrelation of dynamic displacement response between healthy and damaged states [36]. Lei et al. proposed a CNN model to identify structural damage from transmission functions of vibration response [37]. Tang et al. developed a CNN-based data anomaly detection method imitating human vision and decision-making [38]. He et al. combined a CNN with the fast Fourier transform (FFT) to identify damage conditions [39]. Guo et al. utilized a model composed of stacked CNN modules to extract damage features from raw mode shapes [40]. Nevertheless, most DNN-based methods rely solely on CNNs for their network architecture, which limits model improvement to stacking network layers to increase trainable parameters rather than enhancing feature extraction capabilities.

To overcome the above deficiencies, in this paper, we propose a novel vision-aided framework with a convolutional multihead self-attention neural network (CMSNN) for damage detection tasks. A computer vision algorithm named optical flow estimation is first employed to conduct high-spatial-resolution vibration measurements. Then, the CMSNN model, composed of two distinct types of modules, is designed to perform multiscale damage information (DI) extraction and fusion autonomously. To meet the requirement of generating massive labeled samples for model training, a numerical simulation strategy is adopted to construct datasets of damaged mode shapes, accounting for measurement noise, multiple damages, and undamaged samples. The results of numerical simulations and experiments show that the proposed framework can accurately detect structural damage from raw data with strong robustness and remains effective across various scenarios.

The rest of the paper is organized as follows. Section 2 presents the theory of optical flow estimation and the architecture design of the CMSNN model. Section 3 describes the strategy for the CMSNN model training. The performance of the CMSNN model is numerically evaluated in Section 4, and the proposed framework is experimentally verified in Section 5. Section 6 concludes the paper with a summary of findings and suggests potential future research directions.

2. Methodology

In this section, the principles of the proposed damage detection framework are described in detail. Figure 1 provides a visual representation of the specific process of the framework.

Figure 1: Flowchart of the proposed framework for structural damage detection.

2.1. High-Spatial-Resolution Mode Shapes via Vision

It is well accepted that the video recording process is a projection of structural motion onto the image plane. The aim of optical flow estimation is to compute reliable estimates of the motion field from time-varying image intensity. The initial step in estimating optical flow is to assume that pixel intensities are consistently translated from one frame to the next:
$$I(x, y, t) = I(x + dx,\; y + dy,\; t + dt) \tag{1}$$
where I(x, y, t) is the image intensity at spatial location (x, y) and time t, dx and dy are the pixel shifts in the x and y directions, respectively, and dt is the interframe time.
Considering the high frame rate of video sampling and the typically small vibration amplitude, the pixel motion from one frame to the next is sufficiently minor. Thus, we can expand the right-hand side of equation (1) by applying the first-order Taylor series approximation:
$$I(x + dx,\; y + dy,\; t + dt) \approx I(x, y, t) + \frac{\partial I}{\partial x}\,dx + \frac{\partial I}{\partial y}\,dy + \frac{\partial I}{\partial t}\,dt \tag{2}$$
By substituting equation (2) into equation (1), the following equation can be obtained:
$$I_x u + I_y v + I_t = 0 \tag{3}$$
where I_x and I_y are the horizontal and vertical image intensity gradients, respectively, I_t is the temporal difference of the image intensities between the two frames, and u = dx/dt and v = dy/dt are the optical flow components, embedded with the vibration signals, in the horizontal and vertical directions.
Equation (3) is known as the optical flow equation. It is underdetermined, which implies that the two variables (u, v) cannot be recovered uniquely from a single gradient constraint. To address this issue, it is reasonable to assume that neighboring pixels in a small region around the measured pixel share the same optical flow. Let the number of neighboring pixels be k. Since these pixels are subject to the same motion, the following equation system holds:
$$\begin{cases} I_x(p_1)\,u + I_y(p_1)\,v = -I_t(p_1) \\ \quad\vdots \\ I_x(p_k)\,u + I_y(p_k)\,v = -I_t(p_k) \end{cases} \tag{4}$$
where p_1, p_2, …, p_k are the neighboring pixels of the measured pixel.
For simplicity, we can express equation (4) as a matrix multiplication
$$A\mathbf{w} = \mathbf{b}, \qquad A = \begin{bmatrix} I_x(p_1) & I_y(p_1) \\ \vdots & \vdots \\ I_x(p_k) & I_y(p_k) \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} u \\ v \end{bmatrix}, \quad \mathbf{b} = -\begin{bmatrix} I_t(p_1) \\ \vdots \\ I_t(p_k) \end{bmatrix} \tag{5}$$
Equation (5) is overdetermined since the value of k used is generally much greater than 2. To identify the solution that minimizes the constraint errors, the least-squares optimization method is employed to provide the most accurate optical flow estimation:
$$\mathbf{w} = \left(A^{\mathsf{T}} A\right)^{-1} A^{\mathsf{T}} \mathbf{b} \tag{6}$$

In practice, optical flow vectors are calculated to track the locations of pixels across video frame sequences, thereby providing the vibration signals associated with those pixels. Notably, each tracked pixel can serve as a measuring point, which allows for high-spatial-resolution vibration measurements. In this paper, the blind source separation (BSS) technique is adopted to extract high-spatial-resolution mode shapes from the obtained vibration signals; comprehensive explanations and derivations of BSS are available in [41].
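To make the tracking step concrete, the following Python sketch uses OpenCV's pyramidal Lucas–Kanade routine to track a line of virtual measuring points against a reference frame. The video path, point coordinates, and tracker parameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: sparse Lucas-Kanade optical flow tracking of virtual measuring
# points along a beam. The video file, point layout, and parameters are hypothetical.
import cv2
import numpy as np

cap = cv2.VideoCapture("beam_vibration.avi")   # hypothetical recording
ok, first = cap.read()
ref_gray = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)

# 100 virtual sensors uniformly spaced along the beam axis (y = 200 is illustrative)
xs = np.linspace(50, 2250, 100, dtype=np.float32)
points = np.stack([xs, np.full_like(xs, 200.0)], axis=1).reshape(-1, 1, 2)

lk = dict(winSize=(21, 21), maxLevel=3,
          criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

displacements = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Track each frame against the reference frame; valid for small vibration amplitudes
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(ref_gray, gray, points, None, **lk)
    displacements.append((new_pts - points).reshape(-1, 2)[:, 1])  # vertical motion v

signals = np.asarray(displacements).T   # (100 points, n_frames) vibration signals
```

The resulting per-pixel signals would then be passed to the BSS step to extract mode shapes.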

2.2. CMSNN Model for DI Extraction and Fusion

In this paper, a novel CMSNN model is proposed to autonomously learn high-level feature representations from obtained noisy mode shapes, obviating the need for explicit manual feature design. In particular, the architecture of the CMSNN model mainly consists of two principal functional modules: the CNN-based multiscale information extraction module and the attention mechanism–based information fusion module.

2.2.1. Multiscale Information Extraction Module

The architecture design of the multiscale information extraction module is depicted in Figure 2. In this module, convolutional layers are used to extract multiscale damage features from the raw mode shape data. The input mode shape vector comprises 100 components; that is, the mode shapes obtained via simulations and vision-aided experiments both contain 100 sample points.

Figure 2: Architecture of the multiscale information extraction module.
Unlike conventional CNNs, which simply stack multiple convolutional layers on top of each other to deepen the network, we introduce four sets of cascaded convolutional filters with varying kernel sizes (specifically, 1 × 3, 1 × 5, 1 × 7, and 1 × 9) to extract damage-related semantic information in parallel. Smaller kernels capture finer details of damage, whereas larger kernels provide greater noise robustness through their larger receptive fields. This approach yields a more lightweight module and allows DI to be extracted at multiple scales. The process can be described mathematically as follows:
$$\mathrm{DI}_{i,j} = \sigma\left(\psi * F_{i,j} + b_{i,j}\right), \quad j = 1, 2, \ldots, nc_i \tag{7}$$
where DI_{i,j} denotes the jth DI vector at scale i, ψ denotes the input mode shape vector, * denotes the convolution operation, F_{i,j} denotes the jth convolutional kernel at scale i, b_{i,j} denotes the jth bias parameter at scale i, nc_i denotes the number of convolution kernels at scale i, and σ denotes the activation function.
Inspired by the earlier work in [42], the parametric rectified linear unit (PReLU) is adopted here as the activation function for the convolutional layer, which can be written as follows:
$$\sigma(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \tag{8}$$
where α is a trainable parameter that controls the slope of the function on the negative half-axis.
It is worth noting that a pooling layer has not been included in order to retain DI to the maximum extent possible. At the end of the module, channel concatenation is performed on the DI across all scales to generate a highly integrated damage information map (DIM) for further information fusion, which can be described as follows:
$$\mathrm{DIM} = \mathrm{Concat}\left(\mathrm{DI}_{1,1}, \ldots, \mathrm{DI}_{1,nc_1}, \ldots, \mathrm{DI}_{4,1}, \ldots, \mathrm{DI}_{4,nc_4}\right) \tag{9}$$
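As a concrete illustration of equations (7)–(9), the following PyTorch sketch builds the four parallel branches with the kernel sizes and "same" paddings of Table 1 and concatenates their outputs channel-wise into the DIM. Class and variable names are ours, not the authors' implementation.

```python
# A minimal PyTorch sketch of the multiscale extraction idea: four parallel
# Conv1d branches (kernels 3/5/7/9, 32 kernels each, stride 1) with PReLU,
# concatenated channel-wise into the damage information map (DIM).
import torch
import torch.nn as nn

class MultiscaleExtraction(nn.Module):
    def __init__(self, n_kernels: int = 32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, n_kernels, kernel_size=k, padding=k // 2),
                          nn.PReLU())
            for k in (3, 5, 7, 9)   # paddings 1/2/3/4 keep the length at 100
        ])

    def forward(self, psi: torch.Tensor) -> torch.Tensor:
        # psi: (batch, 1, 100) mode shape -> DIM: (batch, 128, 100), eq. (9)
        return torch.cat([branch(psi) for branch in self.branches], dim=1)

dim_map = MultiscaleExtraction()(torch.randn(8, 1, 100))  # torch.Size([8, 128, 100])
```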

2.2.2. Information Fusion Module

The attention mechanism is fundamentally analogous to human vision: it focuses on key features and attenuates the impact of noise interference during training [43], making it particularly well suited to the damage detection task. The primary objective of the information fusion module is to achieve an efficient fusion of multiscale DI using an architecture based on multihead self-attention. Figure 3 provides a visual representation of this architecture.

Figure 3: Architecture of the information fusion module.
In order to capture the dependencies between each sequence and other sequences of the input matrix, the multihead self-attention mechanism establishes mapping relationships between queries, keys, values, and outputs in different embedding spaces. In this study, the DIM extracted by the multiscale information extraction module is used as the input matrix, and the DI across all scales in the DIM are the sequences to be analyzed. The input DIM is first mapped into h groups of matrices through linear projection, with each group comprising three distinct matrices:
$$Q_i = \mathrm{DIM}\,W_i^Q, \quad K_i = \mathrm{DIM}\,W_i^K, \quad V_i = \mathrm{DIM}\,W_i^V, \quad i = 1, 2, \ldots, h \tag{10}$$
where Q_i, K_i, and V_i are the query matrix, key matrix, and value matrix belonging to the ith head, W_i^Q, W_i^K, and W_i^V are the trainable linear projection matrices of the ith head, h is the number of heads, and d_e = 100/h is the dimension of the embedding space.
For each group of matrices, the computation of scaled dot-product attention is then performed according to equation (11):
$$\mathrm{head}_i = \mathrm{SoftMax}\left(\frac{Q_i K_i^{\mathsf{T}}}{\sqrt{d_e}}\right) V_i \tag{11}$$
where SoftMax is the normalized exponential function. Its detailed description can be found in [44].
Finally, the multihead attention of DI is acquired by horizontally concatenating the h scaled dot-product attention:
$$\mathrm{MultiHead} = \mathrm{Concat}\left(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h\right) \tag{12}$$

The resulting multihead attention establishes adaptive correlations of DI across different scales in multiple embedding spaces. Notably, this fusion process is efficient and robust, and it is not limited by the spacing of the scales.
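The fusion step of equations (10)–(12) can be sketched with PyTorch's built-in multihead attention, reading the 128 DI vectors of the DIM as the sequence and their 100 samples as the embedding (d_e = 100/h per head). This is an assumed realization, not the authors' code.

```python
# Sketch of attention-based information fusion over the DIM using
# nn.MultiheadAttention; queries, keys, and values all come from the DIM.
import torch
import torch.nn as nn

class InformationFusion(nn.Module):
    def __init__(self, embed_dim: int = 100, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.act = nn.PReLU()

    def forward(self, dim_map: torch.Tensor) -> torch.Tensor:
        # Self-attention: Q = K = V = DIM, eq. (10)-(12)
        fused, _ = self.attn(dim_map, dim_map, dim_map)
        return self.act(fused)   # (batch, 128, 100), as in Table 1

x = torch.randn(8, 128, 100)
x = InformationFusion(n_heads=4)(x)   # information fusion module 1 (h = 4)
x = InformationFusion(n_heads=2)(x)   # information fusion module 2 (h = 2)
```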

2.2.3. CMSNN Model Construction

During the verification stage, a series of experiments were conducted to assess the performance of different module combinations and hyperparameter settings. Figure 4 illustrates the overall architectural configuration that yielded the optimal results, and Table 1 lists the detailed configuration of each module.

Figure 4: Overall architecture of the proposed CMSNN model.
Table 1. Detailed configuration of the proposed CMSNN model.

| Module type | Kernel size | Kernel number | Stride | Padding | Head number | Activation function | Output size |
|---|---|---|---|---|---|---|---|
| Multiscale information extraction module | 1 × 3 | 32 | 1 | 1 | / | PReLU | 32 × 100 |
| | 1 × 5 | 32 | 1 | 2 | / | PReLU | 32 × 100 |
| | 1 × 7 | 32 | 1 | 3 | / | PReLU | 32 × 100 |
| | 1 × 9 | 32 | 1 | 4 | / | PReLU | 32 × 100 |
| Information fusion module 1 | / | / | / | / | 4 | PReLU | 128 × 100 |
| Information fusion module 2 | / | / | / | / | 2 | PReLU | 128 × 100 |
| Full connection 1 | / | / | / | / | / | PReLU | 1 × 1000 |
| Full connection 2 | / | / | / | / | / | Sigmoid | 1 × 100 |

The mode shape fed into the CMSNN model is initially processed by a dropout layer to randomly freeze input nodes. This simulates data-missing scenarios while simultaneously avoiding overfitting. Subsequently, the damage semantic information is extracted from various scales through the multiscale information extraction module to construct the DIM. Following this, two information fusion modules are stacked to perform attention computation on the DIM in multiple embedding spaces, achieving efficient and robust information fusion. Finally, the high-level attention feature is flattened and fed into two stacked fully connected layers to provide the damage probability distribution.
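Putting the pieces together, a compact sketch of the pipeline just described (input dropout, extraction, two fusion modules, and two fully connected layers per Table 1) might look as follows. It reuses the sketches above, and the dropout rate is an assumption since the paper does not state it.

```python
# Assembled sketch of the CMSNN pipeline of Figure 4 / Table 1. The dropout
# rate is assumed; `MultiscaleExtraction` and `InformationFusion` are the
# illustrative classes defined in the earlier sketches.
import torch
import torch.nn as nn

class CMSNNSketch(nn.Module):
    def __init__(self, p_drop: float = 0.1):  # dropout rate not given in the paper
        super().__init__()
        self.drop = nn.Dropout(p_drop)           # simulates data missing, curbs overfitting
        self.extract = MultiscaleExtraction()    # CNN-based multiscale DI extraction
        self.fuse1 = InformationFusion(n_heads=4)
        self.fuse2 = InformationFusion(n_heads=2)
        self.head = nn.Sequential(
            nn.Flatten(),                        # (batch, 128 * 100)
            nn.Linear(128 * 100, 1000), nn.PReLU(),
            nn.Linear(1000, 100), nn.Sigmoid(),  # per-element damage probabilities
        )

    def forward(self, psi: torch.Tensor) -> torch.Tensor:
        x = self.drop(psi)                       # psi: (batch, 1, 100)
        return self.head(self.fuse2(self.fuse1(self.extract(x))))
```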

3. Training Strategy

3.1. Dataset Generation

Constructing datasets through video stream acquisition and optical flow estimation would incur significant computational overhead and storage cost. Accordingly, we adopt a numerical simulation strategy to generate a large number of labeled samples for model training. The strategy comprises two steps.

The first step involves generating undamaged samples using the theoretical formula for the mode shape of beam-like structures. Equations (13) and (14) list the theoretical formulas of mode shapes under three different boundary conditions:
$$\phi(x) = C\left[\cosh \beta x - \cos \beta x - \frac{\cosh \beta l + \cos \beta l}{\sinh \beta l + \sin \beta l}\left(\sinh \beta x - \sin \beta x\right)\right] \quad \text{(C-F)} \tag{13}$$
$$\phi(x) = C\left[\cosh \beta x - \cos \beta x - \frac{\cosh \beta l - \cos \beta l}{\sinh \beta l - \sin \beta l}\left(\sinh \beta x - \sin \beta x\right)\right] \quad \text{(C-C and C-P)} \tag{14}$$
where l denotes the length of the beam, C denotes a constant, and the specific values of βl for mode shapes of different orders are listed in Table 2.
Table 2. Specific values of βl for mode shapes of different orders.

| Boundary condition | First mode shape | Second mode shape | Third mode shape |
|---|---|---|---|
| Clamped–free (C–F) | 1.875104 | 4.694091 | 7.854757 |
| Clamped–clamped (C–C) | 4.730041 | 7.853205 | 10.995608 |
| Clamped–pinned (C–P) | 3.926602 | 7.068583 | 10.210176 |
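As a worked example, the following sketch evaluates the first three clamped-free (C-F) mode shapes on 100 sample points using the βl roots of Table 2 and the standard Euler–Bernoulli closed form assumed in equation (13), with C = 1.

```python
# Worked numeric example: cantilever (C-F) mode shapes from the Table 2 roots.
import numpy as np

def cantilever_mode(x: np.ndarray, beta_l: float, l: float = 1.0) -> np.ndarray:
    b = beta_l / l
    # sigma follows from the free-end boundary conditions (eq. 13)
    sigma = (np.cosh(beta_l) + np.cos(beta_l)) / (np.sinh(beta_l) + np.sin(beta_l))
    return (np.cosh(b * x) - np.cos(b * x)
            - sigma * (np.sinh(b * x) - np.sin(b * x)))

x = np.linspace(0.0, 1.0, 100)   # 100 sample points, matching the dataset format
modes = [cantilever_mode(x, bl) for bl in (1.875104, 4.694091, 7.854757)]
```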
Second, the finite element method (FEM) is employed to model Euler–Bernoulli beams with single or multiple damages to generate damaged samples. Considering that localized damage to a structure can lead to a reduction in the material’s bearing capacity, structural damage is simulated here using the stiffness degradation method, which can be described by the following formula:
$$E_i^d = (1 - d_i)\,E_i^u \tag{15}$$
where E_i^d and E_i^u are the elastic moduli of the ith element in the damaged and undamaged beam, respectively, and d_i is the degree of damage of the ith element.
The label for each sample is designed as a vector with n components (n should be equal to the dimension of the mode shape and n = 100 in this work). The value of each component represents the damage probability of the corresponding beam element, which can be calculated as
$$P_i = \begin{cases} d_i, & \text{if the } i\text{th element is damaged} \\ 0, & \text{otherwise} \end{cases} \tag{16}$$
where P_i is the value of the ith component of the label.
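An illustrative sketch of the sample/label generation logic of equations (15) and (16) is given below. The element count, modulus value, and piecewise label rule follow our reading of the text, and the FEM modal solve itself is omitted.

```python
# Sketch of damaged-sample labelling: randomly chosen elements receive a
# stiffness reduction E_d = (1 - d) * E_u, and the label carries the damage
# degree at those elements. Function names and modulus are hypothetical.
import numpy as np

rng = np.random.default_rng()

def make_damage_case(n_elem: int = 100):
    n_dmg = rng.integers(1, 4)                          # 1-3 damages (Table 3)
    idx = rng.choice(n_elem, size=n_dmg, replace=False)
    degrees = rng.uniform(0.0, 1.0, size=n_dmg)         # damage degrees d_i in (0, 1)
    E = np.full(n_elem, 70e9)                           # undamaged modulus E_u (Pa)
    E[idx] *= 1.0 - degrees                             # stiffness degradation, eq. (15)
    label = np.zeros(n_elem)
    label[idx] = degrees                                # P_i = d_i at damaged elements
    return E, label
```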

In the process of dataset generation, various factors are taken into account, including boundary conditions, measurement noise, number of damages, and degree of damages, to allow the CMSNN model to learn a more intrinsic representation of the DI. The range of possible choices at random for each of these factors is shown in Table 3, and the detailed allocation of samples for the dataset is shown in Table 4.

Table 3. The range of possible choices for various factors.

| Boundary condition | Number of damages | Degree of damages | Signal-to-noise ratio (dB) |
|---|---|---|---|
| {C-F, C-C, C-P} | {1, 2, 3} | (0, 1) | (60, 120) |
Table 4. Allocation of samples for the dataset.

| Sample use | Damaged samples | Undamaged samples | Total samples |
|---|---|---|---|
| Train | 10,000 | 2000 | 12,000 |
| Validate | 2500 | 500 | 3000 |
| Test | 1000 | 200 | 1200 |

3.2. Loss Function

Since the CMSNN model predicts a continuous damage probability distribution rather than a discrete judgment among finite structural conditions, a regression model is more appropriate than a classification model for our damage detection task. In this study, the mean square error (MSE, also known as L2 loss) is chosen as the loss function, which can be written as
$$L_{\mathrm{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\left(P_i - \hat{P}_i\right)^2 \tag{17}$$
where P and $\hat{P}$ are the prediction result and the true label, respectively, and n is the dimension of the label vector.

To minimize the loss value, we use Adam [45] as the optimization method to optimize the network weights, as it can adaptively change the learning rate according to the current gradient. The two momentum parameters for Adam are set to β1 = 0.9 and β2 = 0.999.
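A minimal training-loop sketch consistent with this setup (MSE loss, Adam with β1 = 0.9 and β2 = 0.999, 200 epochs, batch size 128 per Section 3.3) follows. The learning rate and `train_loader` are assumptions.

```python
# Training-loop sketch; `CMSNNSketch` is the hypothetical model assembled in
# Section 2.2.3 and `train_loader` an assumed DataLoader of (mode shape, label).
import torch

model = CMSNNSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
criterion = torch.nn.MSELoss()

for epoch in range(200):
    for psi, label in train_loader:
        optimizer.zero_grad()
        pred = model(psi)                 # (batch, 100) damage probabilities
        loss = criterion(pred, label)     # MSE loss, eq. (17)
        loss.backward()
        optimizer.step()
```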

3.3. Training Results

The CMSNN model is implemented on the PyTorch (Version 1.11.0) platform. All modules are initialized from scratch with random weights. The training and testing processes are conducted on the same hardware (CPU: Intel Xeon Platinum 8375C, GPU: NVIDIA GeForce RTX 3090, RAM: 128 GB). We trained the proposed network on the generated dataset for 200 epochs with a batch size of 128, recording loss and accuracy values in detail. The convergence history of the CMSNN model is plotted in Figure 5. The loss and accuracy curves exhibit an inflection point around the 30th epoch, signifying that the model converges quickly within the first 30 epochs to meet the task's predictive demands. After 200 epochs of training, the accuracy on both the training and testing sets reaches 0.9, demonstrating the CMSNN model's excellent predictive ability.

Figure 5: Convergence history of the CMSNN model: (a) loss evolution and (b) accuracy evolution.

4. Model Evaluation

4.1. Noise Floor Evaluation

The noise floor is an important indicator of a model’s performance. In the context of our damage detection task, the noise floor level is reflected in the predicted damage probability distribution for undamaged samples. We first randomly chose three cases to evaluate the noise floor level of the CMSNN model. The specific settings of these cases are presented in Table 5.

Table 5. Settings of three cases for noise floor evaluation.

| Case | Mode shape | Boundary condition | Signal-to-noise ratio (dB) |
|---|---|---|---|
| 1 | First order | C-C | 105 |
| 2 | Second order | C-F | 69 |
| 3 | Third order | C-P | 82 |

Feeding the mode shapes into the CMSNN model yields the predicted damage probability distributions for the three undamaged cases, shown in Figure 6. The predicted distributions lack localized peaks, indicating the absence of damage. Despite increased fluctuations of the damage probability curve in the noisier cases (i.e., at lower signal-to-noise ratios), the damage probability remains at a low level. This suggests that the CMSNN model handles undamaged states well and possesses a low noise floor.

Figure 6: Detection results for undamaged cases: (a) Case 1, (b) Case 2, and (c) Case 3.

Moreover, it is important to recognize that this good performance benefits from the incorporation of a proportion of undamaged samples within our dataset. To verify this inference, we removed the undamaged samples from the dataset and retrained the network. On a batch of 512 undamaged samples, the model trained with damaged samples only is compared against the model trained with both damaged and undamaged samples. Figure 7 shows the frequency distribution of damage probability predicted by the two models over all samples, where the column height denotes the mean frequency and the error bar denotes the standard deviation of the frequencies. As the plots demonstrate, the model trained with both damaged and undamaged samples yields a lower and more stable predicted damage probability for the undamaged case, indicating a lower noise floor.

Figure 7: Frequency distribution pattern of damage probability predicted by the two models for undamaged samples.

4.2. Noise Immunity Test

To perform the noise immunity test, the Monte Carlo method is employed to evaluate the detection capability of the CMSNN model at various noise levels. Each noise level comprises 512 samples with randomized boundary conditions, measurement noise, numbers of damages, and degrees of damage. To evaluate the performance of the CMSNN model, the detection accuracy is introduced, defined as
$$\mathrm{Accuracy} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}} \times 100\% \tag{18}$$
where N_correct and N_total are the numbers of correctly detected damage locations and total damage locations, respectively.
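The per-level Monte Carlo check can be sketched as follows: Gaussian noise is injected at a target signal-to-noise ratio, and detected peaks are scored against the true damage locations per equation (18). The peak-matching tolerance is our assumption, and peak detection itself is left abstract.

```python
# Sketch of one Monte Carlo noise level: SNR-controlled noise injection plus
# location scoring. Tolerance `tol` (in elements) is a hypothetical choice.
import numpy as np

def add_noise(psi: np.ndarray, snr_db: float) -> np.ndarray:
    p_signal = np.mean(psi ** 2)
    p_noise = p_signal / 10 ** (snr_db / 10)
    return psi + np.random.normal(0.0, np.sqrt(p_noise), psi.shape)

def accuracy(pred_locs, true_locs, tol: int = 2) -> float:
    correct = sum(any(abs(p - t) <= tol for p in pred_locs) for t in true_locs)
    return correct / len(true_locs)       # N_correct / N_total, eq. (18)
```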

Figure 8 shows the results of the accuracy assessments. Higher order mode shapes demonstrate greater robustness to noise than lower order ones. Although the damage features are inevitably blurred by noise, the proposed model maintains satisfactory detection accuracy: over 90% when the signal-to-noise ratio exceeds 80 dB and over 80% when it is above 60 dB. When the signal-to-noise ratio falls below 60 dB, the detection accuracy drops more rapidly, as the model has not learned from samples at these noise levels. Nevertheless, the accuracy still exceeds 60% at a signal-to-noise ratio of 40 dB, suggesting that the model has learned a fairly fundamental representation of the DI.

Figure 8: Accuracy of damage detection at different noise levels.

4.3. Data Missing Test

Considering the limitations of measurement and data storage, missing data sometimes arise in practice. To evaluate the capability of the CMSNN model in data-missing scenarios, three cases are randomly selected. In each case, we introduce 20% and 30% stiffness reductions at relative lengths of 0.2 and 0.65, respectively. Missing data are simulated by replacing the original values at the missing locations with zero, as sketched below. The settings of these cases are given in Table 6.
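A one-function sketch of this data-missing simulation, zeroing a random fraction of the 100 samples, is shown here.

```python
# Sketch: simulate data missing by zeroing a random subset of the samples.
import numpy as np

def drop_samples(psi: np.ndarray, missing_ratio: float) -> np.ndarray:
    out = psi.copy()
    n_miss = int(round(missing_ratio * psi.size))
    out[np.random.choice(psi.size, n_miss, replace=False)] = 0.0
    return out

psi_missing = drop_samples(np.random.randn(100), 0.08)   # e.g., Case 2: 8% missing
```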

Table 6. Settings of three cases for data missing test.

| Case | Mode shape | Boundary condition | Missing ratio (%) |
|---|---|---|---|
| 1 | First order | C-F | 5 |
| 2 | Second order | C-C | 8 |
| 3 | Third order | C-P | 10 |

The detection results of the three cases are shown in Figure 9. The damage locations are correctly detected even when incomplete data are provided as input, indicating that the proposed model can extract effective DI in data-missing scenarios.

Figure 9: Detection results for data missing cases: (a) Case 1, (b) Case 2, and (c) Case 3.

4.4. Comparison With Other Methods

To showcase the efficacy of the proposed CMSNN model, a comparative analysis is conducted with two alternative damage detection methods: the MCUIE-based method [32] and the stacked CNN-based method [40]. For this comparison, a cantilever beam sample is selected, where a 20% stiffness reduction is introduced at a relative length of 0.4. As a representative case, we performed the comparison on the second mode shape in a noisy environment with a signal-to-noise ratio of 60 dB. To compare the detection effectiveness of different methods, the degree of differentiation is introduced here, which is defined as
$$D = \frac{A_{\mathrm{damage}}}{A_{\mathrm{noise}}} \tag{19}$$
where A_damage and A_noise are the amplitude at the damage location and the noise threshold, respectively.

The detection results of the three methods are illustrated in Figure 10. Although both the MCUIE-based and stacked CNN-based methods are capable of damage detection, each exhibits certain limitations. The MCUIE-based method, while sensitive to damage, is prone to noise interference. Its sensitivity to local mutations in the MCUIE can lead to false positives, where noise is mistakenly identified as damage features, and it achieves a degree of differentiation of only 1.28. On the other hand, the stacked CNN-based method is more robust against noise, but it still exhibits significant fluctuations in nondamaged regions, with a low degree of differentiation of 2.34. This can make it challenging to set an accurate noise threshold for damage detection and may introduce ambiguity in localizing the damage. In contrast, the proposed CMSNN model demonstrates a marked improvement, effectively detecting damage with a high degree of differentiation reaching 4.22. This superior performance highlights the CMSNN model’s ability to discern damage features with greater precision, even in the presence of noise, and to provide a clearer distinction between damaged and nondamaged regions.

Figure 10: Detection results of different methods: (a) MCUIE-based, (b) stacked CNN-based, and (c) CMSNN-based.

Furthermore, to quantitatively assess the detection accuracy of the various methods under statistical conditions, we conducted Monte Carlo experiments. The outcomes of the Monte Carlo simulations are presented in Figure 11. The proposed CMSNN model exhibits a superior level of noise immunity relative to the other two methods under consideration, highlighting its reliability and stability in detecting damage even in the presence of significant noise.

Figure 11: Accuracy comparison with other methods at different noise levels: (a) first mode, (b) second mode, and (c) third mode.

5. Experimental Verifications

This section details the experiments conducted to further validate the proposed damage detection framework with real-world data; it comprises two parts. The first part performs damage detection on a beam with multiple damages to validate the method's effectiveness in practical situations. In the second part, a practical strategy called parameter transfer is introduced and applied for further damage detection in a laminated plate.

5.1. Damage Detection for the Through-Hole Beam

The schematic diagram of the vision-aided experimental system is displayed in Figure 12(a). The experiment is carried out on an aluminum alloy 6061 beam specimen with three through-holes, introduced by a drilling machine with internal diameters of 4, 4, and 5 mm, respectively. Figure 12(b) illustrates the dimensions of the specimen and the locations of the damages. The beam's vibration video is collected using a high-speed camera (Revealer 5F04, full resolution: 2320 × 1718 pixels, pixel size: 7 × 7 μm, maximum frame rate: 52,800 fps, responsivity: ISO 6400). Based on finite element analysis, the first three modal frequencies of the beam are calculated as 19.73, 123.71, and 346.39 Hz. According to the Nyquist sampling theorem, the sampling frequency must exceed twice the frequency of the signal of interest so that the discrete signal can reconstruct the original continuous signal without aliasing. For mode shape identification, which is crucial for damage detection, a more conservative sampling frequency is preferable, since a higher sampling rate enhances the fidelity of the vibration data; the camera frame rate is therefore set to 3000 frames per second (fps).

Figure 12: The experimental setup of the damaged beam: (a) the vision-aided experimental system and (b) the beam with multiple through-holes.

A total of 100 pixel points are uniformly selected along the length direction of the target beam and tracked by optical flow estimation, acting as 100 virtual sensors free of mass-loading effects. Based on this high-spatial-resolution measurement, the first three mode shapes are identified, as shown in Figure 13. The detection results are shown in Figure 14. In actual scenarios, although vision-aided technology alleviates the limitation of measurement resolution, measurement noise is still present in the obtained mode shapes due to factors such as lighting conditions and image resolution. Benefiting from the efficient CNN-based DI extraction and robust attention-based information fusion of the CMSNN model, the exact damage locations are all indicated at the peaks of the predicted damage probability distribution.

Figure 13: Measured mode shapes of the through-hole beam: (a) first mode, (b) second mode, and (c) third mode.
Figure 14: Detection results for the through-hole beam: (a) first mode, (b) second mode, and (c) third mode.

5.2. Application of Parameter Transfer Strategy

In the field of SHM, it is a common issue that the length of the input signal (or the number of sampling points) varies due to the differences in the measuring approaches and target objects. Conventional methods suggest resolving this issue by interpolating the input signal to meet the signal length requirement or by retraining the network model. However, data interpolation can result in loss of information from the source signal, and retraining the network from scratch can be both time-consuming and computationally intensive.

Inspired by the domain of transfer learning, a practical strategy named parameter transfer is introduced here. Once trained, the CNN-based multiscale information extraction module can extract essential DI from input mode shape signals without strict requirements on input length, making it a generic DI extractor that needs no retraining. That is, in various damage detection scenarios, the proposed model can be adapted by fixing the parameters of the generic multiscale information extraction module and retraining only the information fusion module, which does impose explicit constraints on the signal length.
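In PyTorch terms, the strategy amounts to freezing the extraction module's parameters and optimizing only the remaining layers, as sketched below with the hypothetical `CMSNNSketch` from Section 2.2.3. The checkpoint path and learning rate are assumptions.

```python
# Parameter transfer sketch: freeze the multiscale extraction weights and
# retrain only the fusion and output layers on the new measurement setup.
import torch

model = CMSNNSketch()
model.load_state_dict(torch.load("cmsnn_beam.pt"))       # hypothetical pretrained weights
for param in model.extract.parameters():
    param.requires_grad = False                          # generic DI extractor stays fixed

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),  # fusion + FC layers only
    lr=1e-4, betas=(0.9, 0.999))
```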

To verify the detection capability of the proposed framework with parameter transfer, an experimental investigation is performed based on the publicly available Damage Assessment Benchmark, as detailed in reference [46]. The schematic diagram of the experimental system is shown in Figure 15(a). This benchmark includes data measured from vibration tests of composite structures. The study focuses on the case of a damaged laminated plate specimen: a square plate made of a 12-layered glass fiber-reinforced epoxy-based laminated composite, with a length of 300 mm, a thickness of 2.64 mm, and all four sides fixed. The spatial surface damage is machined with a milling machine to a depth of 0.5 mm. Figure 15(b) shows the dimensions and location of the damage. The first four mode shapes are obtained using an SLV (Polytec PSV-400) with a resolution of 64 × 64 sampling points, as demonstrated in Figure 16.

Figure 15: The experimental setup of the damaged plate: (a) the scanning laser vibrometer-based experimental system and (b) the laminated plate with spatial surface damage.
Figure 16: Measured mode shapes of the laminated plate: (a) first mode, (b) second mode, (c) third mode, and (d) fourth mode.

In the experimental process, the weights of the multiscale information extraction module in the CMSNN model are frozen, and only the weights of the information fusion module are retrained. To analyze the two-dimensional plate with the proposed model, the mode shape data of the plate are fed in row by row and column by column, yielding predicted damage probability spatial distributions in two directions. The final damage probability spatial distribution is calculated as a weighted average of these two distributions. To enhance the intuitiveness of the results, positions with a damage probability of less than 0.3 are set to zero. The detection results, presented in Figure 17, clearly delineate the damaged area.
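A sketch of this two-direction scan and fusion is given below, assuming a model retrained (via parameter transfer) to accept the plate's 64-sample rows and columns. The equal weighting of the two maps and the function names are our assumptions.

```python
# Two-direction scan of the 64 x 64 plate mode shape, with weighted-average
# fusion and the 0.3 probability cutoff described in the text.
import numpy as np
import torch

def detect_plate(mode_shape: np.ndarray, model, threshold: float = 0.3) -> np.ndarray:
    def scan(rows: np.ndarray) -> np.ndarray:
        x = torch.tensor(rows, dtype=torch.float32).unsqueeze(1)  # (64, 1, 64)
        with torch.no_grad():
            return model(x).numpy()                               # (64, 64) probabilities
    row_map = scan(mode_shape)             # feed the plate row by row
    col_map = scan(mode_shape.T).T         # feed the plate column by column
    fused = 0.5 * row_map + 0.5 * col_map  # weighted average (equal weights assumed)
    fused[fused < threshold] = 0.0         # zero out probabilities below 0.3
    return fused
```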

Figure 17: Detection results for the laminated plate: (a) first mode, (b) second mode, (c) third mode, and (d) fourth mode.

6. Conclusions

This study proposes a novel vision-aided framework with CMSNN to deal with damage detection tasks. The performance of the proposed framework is evaluated through numerical and experimental analysis across a variety of scenarios. The principal conclusions of this paper are summarized as follows:
1. The optical flow estimation algorithm enables the acquisition of mode shapes of target structures at high spatial resolution without mass-loading effects, which is suitable and sufficiently informative for damage detection tasks.

2. A novel CMSNN model is designed to autonomously learn high-level feature representations from noisy mode shapes, eliminating the need for explicit manual feature design. The model combines a CNN-based multiscale information extraction module with an attention-based information fusion module, thereby enhancing its capabilities.

3. During the design and training process, the CMSNN model considers a range of scenarios, including measurement noise, missing data, and undamaged samples. This ensures that the proposed framework provides reliable detection results even when the input data are noisy or incomplete.

4. The experimental results demonstrate that the proposed framework accurately detects damage in actual scenarios. Furthermore, the parameter transfer strategy minimizes retraining effort while preserving detection capability, increasing the framework's versatility.

Despite these laboratory successes, some potential challenges remain. Excessive compression artifacts and lighting variations in experimental videos captured under adverse conditions may affect the accuracy of optical flow estimation and alter the noise distribution relative to the training dataset, impacting the overall performance of the CMSNN model. In future work, real-world data will be combined with the simulation-generated dataset for training to improve generalization and address these challenges. Moreover, the interpretability of the proposed model will be studied using visualization techniques to guide the understanding of damage features. Combining data-driven models with physics-based methods promises a more comprehensive approach to damage detection.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This work was supported by the Basic Research Project Group (No. 514010106-302).

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.
