FLDATN: Black-Box Attack for Face Liveness Detection Based on Adversarial Transformation Network
Abstract
Aiming at the shortcomings of current face liveness detection attack methods, namely the low generation speed of adversarial examples and the restriction to white-box attacks, a novel black-box attack method for face liveness detection named FLDATN is proposed based on the adversarial transformation network (ATN). In FLDATN, a convolutional block attention module (CBAM) is used to improve the generalization ability of adversarial examples, and a misclassification loss function based on feature similarity is defined. Experiments and analysis on the Oulu-NPU dataset show that the adversarial examples generated by FLDATN have a good black-box attack effect on the task of face liveness detection and achieve better generalization performance than traditional methods. In addition, since FLDATN does not need to perform multiple gradient calculations for each image, it can significantly improve the generation speed of adversarial examples.
1. Introduction
Face recognition technology has been widely used in mobile payment, access control, and smartphone unlocking. However, with the development of face recognition technology, spoofing attacks against face recognition systems are also increasing. Common face spoofing attacks include photo presentation attacks, video replay attacks, and 3D mask attacks. To ensure the security of the face recognition system, face liveness detection (face antispoofing) has received extensive attention from academia and industry, and it has been developed rapidly in recent years [1, 2].
Recently, adversarial attacks in the physical domain have brought serious challenges to existing face liveness detection models [3–5]. In such attacks, the input image is mostly a living face, with only a small area decorated with adversarial patterns such as earrings and glasses; therefore, the attack can not only bypass the face liveness detection model but also directly launch an adversarial attack on the face recognition system. Different from adversarial attacks that must remain imperceptible to the human eye, the attacker here only needs the adversarial example to pass face detection, face liveness detection, and face recognition. However, since adversarial examples for face liveness detection generated in the digital domain may introduce discriminative traces again during physical domain acquisition, a successful attack is hard to achieve. Zhang, Tondi, and Barni [6] proposed an adversarial attack method for face liveness detection. It first generates adversarial examples from face images in the replayed video. After reshooting, the adversarial examples are input into the liveness detection model to cheat it. However, the method requires averaging gradients over 2000 augmented images, which results in a slow generation speed. Furthermore, it is a white-box attack, and the generalization of its adversarial examples needs further exploration.
The main contributions of this paper are summarized as follows:
- An ATN for face liveness detection black-box attacks is designed. The designed ATN does not need to perform multiple gradient calculations for each image after training, which can effectively improve the generation speed of adversarial examples.
- Feature-level adversarial loss functions based on cosine similarity (CS) and mean square error (MSE) are proposed. With these loss functions, the face liveness detection model can be successfully attacked without knowing the optimizer and loss function used for training the attacked model.
- The experimental results and analysis show that the adversarial examples generated by FLDATN have a good black-box attack effect on the binary classification task of face liveness detection and better generalization ability than existing attack methods.
The remainder of this paper is organized as follows. The related work is summarized in Section 2. Some preliminaries are presented in Section 3. The proposed FLDATN is detailed in Section 4. Experiments and analysis are presented in Section 5. Some discussions are given in Section 6. Finally, conclusions are drawn in Section 7.
2. Related Work
As the main target of this paper is to generate adversarial examples that attack the face liveness detection module, face liveness detection and adversarial attacks are reviewed separately.
2.1. Face Liveness Detection
In the early stage, artificially designed descriptors were created for face liveness detection; typical descriptors include the local binary pattern (LBP) [7, 8], histogram of oriented gradients (HOG) [9], and scale-invariant feature transform (SIFT) [10]. In addition, some early face liveness detection methods relied on various kinds of auxiliary information and extra hardware devices. Typical examples include human heart rate [11], rPPG [12], 3D scanners [13], and light field cameras [14], which further improve the detection performance but increase the cost.
Practically, artificially extracted features cannot distinguish living faces from prosthetic faces well, while the features extracted by convolutional neural networks (CNNs) are more discriminative; thus, they are gradually replacing hand-crafted features. Xu, Li, and Deng [15] developed an end-to-end network for face liveness detection. It fuses a CNN and long short-term memory (LSTM) into a CNN-LSTM architecture, and the experiments demonstrate the effectiveness of image background information. Atoum et al. [16] considered the face depth map as a discriminative feature between living and prosthetic faces. The fundamental reason is that printed photos and screen replays are 2D images that lack depth information, while a living face is 3D and has rich depth information. Liu, Jourabloo, and Liu [17] pointed out that many previous approaches treat face liveness detection as a simple binary classification problem. They introduced auxiliary supervision, using the depth map as spatial supervision and the rPPG signal as temporal supervision, and comprehensively considered both to distinguish living and prosthetic faces. The above methods utilize various kinds of auxiliary information to improve the performance of face liveness detection, but the accuracy is still limited by the imperfect CNN backbone network. Yu et al. [1] implemented central difference convolution to build central difference convolutional networks (CDCNs). The network outputs a depth map, and the difference between the output depth map and the ground-truth depth map in the dataset acts as the evaluation metric. The central difference convolution operation greatly improves the detection accuracy, and CDCN is one of the best available face liveness detection models. Wang et al. designed PatchNet to mine local information and proposed a classification loss and a self-supervised similarity loss based on asymmetric margins to regularize the embedding space [18]. Deep learning networks can extract more discriminative features and exhibit excellent performance in various visual tasks, which significantly improves the accuracy of face liveness detection models. In [19], universal liveness features were extracted from different modal data, and a flexible modal transformer network was constructed based on the transformer architecture. It is trained with multimodal data, and only one modality is needed for liveness detection during testing. However, it requires additional multispectral imaging equipment and larger-scale deep learning network models. Recently, Liu et al. proposed a class-free prompt learning paradigm for domain generalization-based face antispoofing [20]. Two lightweight transformers are utilized to learn different semantic prompts conditioned on content and style features, respectively, by using a set of learnable query vectors, and the method achieves good performance on several cross-domain datasets.
2.2. Adversarial Attacks
2.2.1. Gradient-Based Adversarial Attack Methods
The most common adversarial attacks are gradient-based. They back-propagate the gradient from the loss function to the input layer, where an adversarial perturbation is added according to the gradient, and the magnitude of this perturbation is limited to a certain range. Goodfellow, Shlens, and Szegedy [21] proposed FGSM and found that the linear nature of high-dimensional neural networks makes them vulnerable to adversarial attacks. FGSM performs only one gradient update along the gradient sign direction on each pixel of the image. It is fast, but the success rate of white-box attacks is low. Subsequently, based on FGSM, BIM was proposed by Kurakin et al. [22]. It needs multiple gradient back propagations while limiting the change of each pixel. Compared with FGSM, it achieves smaller distortion and a higher attack success rate at a slower speed. After that, PGD was put forward by Madry et al. [23]. It adds random noise before performing multiple gradient back propagations, and the generated adversarial examples obtain good transferability. The C&W attack was designed against the defensive distillation defense [24]. It uses the L0, L2, and L∞ distance metrics to constrain the image quality, so the generated adversarial examples can launch successful attacks with minimal image perturbation. However, due to the large number of iterations, its speed is slower than that of the above methods [21–23]. Recently, Wan, Huang, and Zhao designed an average gradient-based adversarial attack [25]. It optimizes the added perturbations through a dynamic set of adversarial examples, and the size of the dynamic set increases with the number of iterations. It possesses good extensibility and can be integrated into most existing gradient-based attacks.
2.2.2. General Adversarial Attacks
Gradient-based adversarial attacks require back-propagating gradients for each image, so they are slow when a large number of adversarial examples need to be generated. Moosavi-Dezfooli et al. proposed a universal adversarial attack [26] in which the same generated perturbation can be added to various kinds of images: adversarial examples carrying the same perturbation can mislead the classifier and transfer well to other neural networks with similar structures. The ATN was first contributed by Baluja and Fischer [27]; a trained ATN can quickly generate adversarial examples from input images, but the generated adversarial examples have low generalization ability. After that, several GAN-based methods emerged to generate adversarial examples, of which AdvGAN [28] is a typical one. It consists of a generator, a discriminator, and a classifier. Different from the ATN, AdvGAN adds a discriminator module to determine whether the input data are generated by the generator. Based on [28], Jandial et al. improved the model's input layer [29]: features are first extracted from the input image, and then random noise is added to the extracted features as the input of the generator, so the improved generator can produce more diverse adversarial examples. Liu and Hsieh proposed using PGD-generated adversarial examples for adversarial training to obtain better generators [30], and the performance of the discriminator is also improved. In summary, generator-based adversarial attack methods can achieve relatively good generalization on some specific tasks. Yang et al. proposed adversarial example generation with the AdaBelief optimizer and crop invariance [31]; by adopting an adaptive learning rate in the iterative attacks, the convergence process is optimized and more transferable adversarial examples are obtained.
2.2.3. Adversarial Attacks against Facial Recognition Systems
The initial attacks on face recognition systems were attacks in the digital domain. Bose and Aarabi suggested adding tiny perturbations to the input image through a generator [32], which leads to wrong judgments by Faster R-CNN [33]. AdvFaces [34] inherits the framework of AdvGAN. In this network, the identity information of the face is utilized to train the generator to obtain the mask of the adversarial perturbation, which is added to the source image to obtain the adversarial example. However, since it is difficult for ordinary people to directly access the inside of a face recognition system in reality, attacks in the digital domain have less impact, whereas attacks in the physical domain cause more serious consequences. Sharif et al. [3] attached the generated adversarial examples to glasses to launch attacks, so that the identity of the person wearing the glasses is recognized as that of another person. Their method reduces the geometric distortion of adversarial examples in the physical domain, improves their generalization, and first validated the effectiveness of adversarial examples in the real world. Wenger et al. built a poisoning attack in the physical domain [4]. It first puts a small number of poisonous samples into the training data and then adds triggers such as tattoos, handkerchiefs, and glasses during testing, which make the face identity discriminator misjudge the target's identity. Yin et al. simulated adversarial examples in the digital domain by applying makeup to human faces [5]. To obtain a more natural generated image, a mixed-policy makeup is used to ensure the style consistency between the generated image and the original image; in addition, fine-grained meta-learning is exploited to improve the transferability of the adversarial examples to black-box models. These physical domain adversarial examples can bypass the face liveness detection module and directly attack the face authentication module. Zhang, Tondi, and Barni proposed an adversarial attack for face liveness detection [6]. It uses expectation over transformation (EOT) to generate adversarial examples that remain robust under various transformations, so the generated adversarial examples are not easily affected by reshooting. However, the generation speed of the adversarial examples is slow, and the method lacks black-box attack capability. Recently, Patnaik et al. proposed an automated generative adversarial network to simulate print and replay attacks and generate adversarial images that can fool face presentation attack detection in the physical domain [35]. Nevertheless, its robustness is still limited.
Gradient-based attack methods are fast when generating a small number of adversarial examples, but they become relatively slow when a large number of adversarial examples must be generated. Generator-based methods, in turn, often require designing special generators for specific tasks. At present, most physical domain attacks on face recognition systems try to bypass the face liveness detection submodule and then directly attack the face recognition submodule; adversarial attacks against the face liveness detection submodule itself are far from mature. Therefore, to quickly generate a large number of adversarial examples with good generalization, a network model named FLDATN is proposed in this paper.
3. Preliminaries
3.1. ATN Network Definition
3.1.1. Training
3.1.2. Loss Function
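As a reference for the total loss in equation (2), the generic ATN training objective from [27], instantiated here with the notation used in Section 4, can be written as follows. This is a hedged reconstruction: $g_{\theta}$ denotes the generator, $f$ the attacked classifier, and $\beta$ a weighting factor; the exact form in the original equations may differ.

$$\min_{\theta} \; \sum_{i} \beta\, L_{x}\big(g_{\theta}(x_i),\, x_i\big) + L_{y}\big(f(g_{\theta}(x_i)),\, f(x_i)\big),$$

where $L_{x}$ penalizes the visual distortion between the input image and the generated adversarial example, and $L_{y}$ drives the misclassification of the attacked model.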
3.2. Threat Model
This paper aims to attack an end-to-end face recognition system, which contains three submodules. The first is the face detection module, whose function is to mark the position of the face in the image. The second is the face liveness detection module, which determines whether the input face image is a living face. The last is the face identity authentication module, which verifies whether the face identity matches a face identity in the database.
Here, it is assumed that the attacker has a good understanding of the three submodules of the face recognition system but has no access to the inside of the system. This means that the attacker has the same permissions as ordinary people, and the only system interface that can be accessed is the camera. Therefore, the adversarial examples generated in this paper need to meet the following requirements: (1) the adversarial example on the screen must be captured by the face detection module, (2) the captured face image must be misjudged as a living face by the face liveness detection module, and (3) when face identity authentication is performed, the identity of the adversarial example must be recognized as consistent with that of the source image. A successful attack process is shown in Figure 1: step (a) generates adversarial examples, and step (b) passes them through each module of the face recognition system one by one. The most critical step is to successfully deceive the face liveness detection module. For the face detection module and the face identity authentication module, it is only required that the distortion of the adversarial example does not affect the normal judgment of the module. Therefore, the pass rates of the face detection module and the face identity authentication module can also be used to judge the image quality of the adversarial examples.
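The three requirements above can be summarized as a simple evaluation routine. The following is a minimal, illustrative sketch; the function names (detect_face, liveness_score, embed_face) and the thresholds are hypothetical placeholders for MTCNN, the liveness model, and FaceNet used later in the experiments.

```python
# Illustrative sketch of the attack evaluation pipeline described above.
import numpy as np

def attack_succeeds(adv_image, source_embedding, detect_face, liveness_score,
                    embed_face, live_threshold=0.5, match_threshold=1.1):
    # (1) The reshot adversarial example must still be detected as a face.
    box = detect_face(adv_image)          # (x1, y1, x2, y2) or None
    if box is None:
        return False
    face = adv_image[box[1]:box[3], box[0]:box[2]]
    # (2) The liveness module must misjudge the spoof face as living.
    if liveness_score(face) < live_threshold:
        return False
    # (3) The identity must still match the enrolled (source) identity.
    dist = np.linalg.norm(embed_face(face) - source_embedding)
    return dist < match_threshold
```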


4. The Proposed FLDATN
Based on the ATN architecture in [27], an ATN is designed for face liveness detection in this paper, and the architecture for training adversarial examples is shown in Figure 2. Lx and Ly correspond to equation (2). The network consists of a generator and a classifier, where the generator is FLDATN and the classifier is CDCN++∗; the two CDCN++∗ classifiers in Figure 2 share exactly the same parameter settings. After the image x is input into the FLDATN network, it is converted into an adversarial example x′. The Lx loss function measures the distortion between the image x and the image x′, and then both x and x′ are input into CDCN++∗. After that, the feature vectors of x and x′ are obtained, and the Ly loss function measures the similarity between the two feature vectors. When the input image x is a living face, the distance between the feature vectors of x and x′ should be as small as possible. If the input image x is a prosthetic face, the distance between the feature vectors of x and x′ should be as large as possible.
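The following is a minimal PyTorch-style sketch of one training step of the architecture described above, under the stated assumptions: `generator` denotes the FLDATN network, `classifier` a frozen CDCN++∗ used as the feature extractor, `beta` a weighting factor, and MSE stands in for the image loss (SSIM can be substituted). It is a sketch, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def training_step(generator, classifier, x, is_live, optimizer, beta=1.0):
    classifier.eval()                       # CDCN++* stays frozen; only the generator is optimized
    x_adv = generator(x)                    # adversarial example x'
    # L_x: image-level distortion between x and x' (SSIM could replace MSE here)
    loss_x = F.mse_loss(x_adv, x)
    # L_y: feature-level similarity computed on the frozen classifier
    with torch.no_grad():
        feat_x = classifier(x)
    feat_adv = classifier(x_adv)
    dist = F.mse_loss(feat_adv, feat_x)
    # Living faces: keep features close; prosthetic faces: push features apart.
    loss_y = dist if is_live else -dist
    loss = beta * loss_x + loss_y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```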

4.1. The Network Structure of FLDATN
The structure of the proposed FLDATN is shown in Figure 3, which is a U-shaped network. The first half of FLDATN is an encoder that downsamples the image, and each downsampling reduces the image area by a factor of 4. The second half of FLDATN is a decoder that upsamples the image. The encoder consists of convolutional layers and max-pooling layers, and the decoder consists of a combination of convolutional layers, deconvolutional layers, and nearest-neighbor upsampling. In addition, the CBAM module [37] is exploited to improve generalization, and it is used in both downsampling stages. Of the two upsampling stages, the first uses a deconvolution layer and the second uses nearest-neighbor upsampling. The adversarial example obtained after the two upsampling stages has the same size as the original image. Since generating adversarial examples in the prediction stage is much faster than in the training stage, adversarial examples are generated only after the FLDATN network has been trained.
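A structural sketch of the U-shaped generator under the above description is given below. The channel widths, activation functions, and output range are illustrative assumptions, and the CBAM block is written in its common form from [37]; neither is necessarily the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in [37] (common defaults)."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: conv over channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class FLDATN(nn.Module):
    """U-shaped generator: two CBAM-augmented downsampling stages,
    one transposed-conv upsampling stage, one nearest-neighbor upsampling stage."""
    def __init__(self, in_ch=3, base_ch=64):
        super().__init__()
        # Encoder: each stage halves spatial resolution (area / 4) via max-pooling.
        self.down1 = nn.Sequential(
            nn.Conv2d(in_ch, base_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2), CBAM(base_ch))
        self.down2 = nn.Sequential(
            nn.Conv2d(base_ch, base_ch * 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2), CBAM(base_ch * 2))
        # Decoder: deconvolution, then nearest-neighbor upsampling, back to 224 x 224.
        self.up1 = nn.Sequential(
            nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2),
            nn.ReLU(inplace=True))
        self.up2 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(base_ch, in_ch, 3, padding=1),
            nn.Tanh())   # assumes inputs normalized to [-1, 1]

    def forward(self, x):
        return self.up2(self.up1(self.down2(self.down1(x))))
```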

4.2. Loss Function
The total loss function to be optimized is defined in equation (2), where the Lx loss function uses SSIM and MSE, and the Ly loss function uses adjusted CS and MSE to adapt them to FLDATN network training.
4.2.1. MSE (L2 Norm)
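A standard MSE (L2 norm) formulation consistent with the description of Ly in Section 4 can be written as a sketch, where $f(\cdot)$ denotes the feature vector output by CDCN++∗ and $n$ its dimension; the exact form of equation (9) may differ:

$$L_{y}^{\mathrm{MSE}}(x, x') = \begin{cases} \dfrac{1}{n}\,\lVert f(x') - f(x)\rVert_2^{2}, & x \text{ is a living face},\\[6pt] -\dfrac{1}{n}\,\lVert f(x') - f(x)\rVert_2^{2}, & x \text{ is a prosthetic face}, \end{cases}$$

so that the features of living faces are kept close while those of prosthetic faces are pushed apart.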
4.2.2. CS
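Similarly, a CS-based misclassification loss consistent with Section 4 can be sketched as follows (again a reconstruction, not necessarily identical to equation (13)):

$$\cos\big(f(x), f(x')\big) = \frac{f(x)\cdot f(x')}{\lVert f(x)\rVert_2\,\lVert f(x')\rVert_2}, \qquad L_{y}^{\mathrm{CS}}(x, x') = \begin{cases} 1 - \cos\big(f(x), f(x')\big), & x \text{ is a living face},\\ \cos\big(f(x), f(x')\big), & x \text{ is a prosthetic face}. \end{cases}$$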
5. Experimental Results and Analysis
5.1. Experimental Setting
In the experiment, the face detection and face identity authentication methods used in this paper are existing, well-known methods. Considering efficiency, robustness, accuracy, and ease of deployment, MTCNN [38] is used for face detection and FaceNet [39] is chosen for face authentication. The attacked face liveness detection model is CDCN++ [1]. It should be mentioned that this paper does not use the original CDCN++ model during training; some changes are made to the final output layer. Because the original CDCN++ outputs a depth map, this paper changes the output layer to a Softmax function for the classification task. The changed CDCN++ model, denoted CDCN++∗, is shown in Figure 4, and it acts as the white-box model for generating adversarial examples. The camera used is an Honor Play4T Pro (48 megapixels). The display parameters are 15.6 inches, 1920 × 1080 resolution, and 144 Hz. The attacked model is shown in Figure 2. The reason for choosing this model is that it won the multimodal track championship and the single-modal track runner-up in the 2020 ChaLearn Face Anti-spoofing Attack Detection Challenge. The dataset used in the experiment is the Oulu-NPU dataset [40], which consists of 4950 living and prosthetic face videos captured with the cameras of six different mobile devices (Samsung Galaxy S6 edge, HTC Desire EYE, MEIZU X5, ASUS Zenfone Selfie, Sony XPERIA C5 Ultra Dual, and OPPO N3). The attack types in the Oulu-NPU dataset are printing and video playback, and the original presentation attack videos of Oulu-NPU are not used in this experiment. Considering that different shooting equipment will affect the experiment, our own equipment is used to reshoot the faces of the Oulu-NPU dataset. To make the prosthetic faces in the training dataset consistent with the collection method of the adversarial examples, the shooting equipment and the experimental environment are kept the same. The process of shooting adversarial examples with a mobile phone is shown in Figure 5. It can be seen that the face detection of the mobile phone can successfully detect the adversarial examples.


The input and output image sizes of the FLDATN network are both 224 × 224, the batch size is set to 16, the number of epochs is set to 10, and the learning rate is set to 0.0002. For the selection of the yi label in equations (9) and (13), instead of using the labels in the original dataset, the value predicted by the face liveness detection model is used as the label. The training input of the face liveness detection module is 224 × 224, the batch size is set to 32, the number of epochs is set to 10, and the learning rate is set to 0.0002. The experiment generates 400 adversarial examples of 20 different faces to test the attack success rate. The dataset is processed by extracting face images every 5 frames from the videos in Oulu-NPU and our own reshot videos. The training dataset has 8000 prosthetic faces and 7974 living faces; the test dataset has 1965 prosthetic faces and 1862 living face images, where the faces in the test dataset and the faces in the training dataset belong to different identities. In addition, since the ratio between the width and height of the faces in the Oulu-NPU dataset is not consistent, each image is scaled according to its original aspect ratio when it is reduced to 224 × 224. The image transformation in [6] is exploited to augment the images in the experiments.
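The data preparation described above can be sketched as follows; the zero-padding used to fill the remaining area after the aspect-ratio-preserving resize is an assumption, as the paper only states that the original ratio is kept when reducing images to 224 × 224.

```python
import cv2
import numpy as np

def sample_frames(video_path, step=5):
    """Extract one frame every `step` frames from a video."""
    cap = cv2.VideoCapture(video_path)
    idx, frames = 0, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def resize_keep_ratio(face, size=224):
    """Resize so the longer side fits `size`, then center on a zero canvas (assumed padding)."""
    h, w = face.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(face, (int(w * scale), int(h * scale)))
    canvas = np.zeros((size, size, 3), dtype=face.dtype)
    nh, nw = resized.shape[:2]
    canvas[(size - nh) // 2:(size - nh) // 2 + nh,
           (size - nw) // 2:(size - nw) // 2 + nw] = resized
    return canvas
```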
5.2. Evaluation Metrics
The attack methods for comparison are FGSM [21], BIM [22], PGD [23], and C&W [24], and their parameter settings are listed in Table 1. Furthermore, some pretrained neural network models are chosen for black-box testing of liveness detection: ResNet [41], MobileNet [42], DenseNet [43], VGG [44], EfficientNet [45], and Vision Transformer [46]. These pretrained models are available from [47]. They are first pretrained on ImageNet, then fine-tuned on the training set of the Oulu-NPU dataset, and afterward used for the face liveness detection task.
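Assuming the pretrained black-box models in [47] come from the timm model zoo (the model names below mirror those in Table 2; the exact source may differ), they can be re-headed for the two-class living/prosthetic task and then fine-tuned on the Oulu-NPU training set, for example as follows.

```python
import timm

# Illustrative subset of the black-box models listed in Table 2.
MODEL_NAMES = ["densenet161", "densenet121", "mobilenetv3_large_100",
               "vgg19", "tf_efficientnet_b0", "resnet152"]

def build_blackbox_models(num_classes=2):
    models = {}
    for name in MODEL_NAMES:
        # ImageNet-pretrained backbone with a fresh two-way classification head.
        models[name] = timm.create_model(name, pretrained=True,
                                         num_classes=num_classes)
    return models
```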
5.3. Experiments and Performance Analysis
5.3.1. Black-Box Testing
The current digital domain adversarial attack methods have already achieved high attack success rates, so they are not discussed here; our focus is mainly on attacks in the physical domain. Unlike [6], the single image cropped during the attack does not need to be stitched back into the original image; the adversarial example is shot directly. As seen from Figure 5, it can be recognized by face detection even if it is not stitched into the source image. In the following, the proposed attack is denoted in the form of a visual loss function plus a misclassification loss function, for example, "SSIM + MSE," where SSIM is the visual loss function and MSE is the misclassification loss function. The experimental results of adversarial examples generated on the CDCN++∗ model in the physical domain are listed in Table 2.
Method | CDCN++∗ | DenseNet-161 | DenseNet-121 | MobileNetv3-large-100 | Vgg-19 | TF-EfficientNet-b0 | ResNet-152 | Vit-Base-ResNet-50 |
---|---|---|---|---|---|---|---|---|
FGSM [21] | 15.22 | 2.30 | 1.88 | 11.92 | 0.46 | 1.92 | 3.88 | 0.75
BIM [22] | 78.71 | 0.67 | 2.16 | 5.80 | 0.00 | 0.04 | 2.84 | 1.00
PGD [23] | 75.74 | 1.05 | 1.77 | 6.73 | 0.00 | 0.13 | 0.80 | 0.92
C&W [24] | 70.08 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 1.29 | 0.25
SSIM + MSE | 41.84 | 49.76 | 3.27 | 6.59 | 6.11 | 38.40 | 7.66 | 10.42 |
SSIM + cosine | 43.44 | 27.27 | 0.21 | 2.53 | 0.51 | 31.10 | 0.97 | 4.05 |
SSIM + CE | 25.78 | 54.83 | 0.67 | 6.44 | 1.96 | 31.42 | 11.98 | 4.55 |
MSE + MSE | 53.30 | 72.93 | 0.62 | 1.36 | 3.60 | 50.97 | 3.78 | 8.63 |
MSE + cosine | 48.06 | 61.30 | 0.58 | 0.54 | 2.24 | 42.20 | 0.49 | 1.79 |
MSE + CE | 19.18 | 7.68 | 0.0 | 0.33 | 0.12 | 13.16 | 2.04 | 7.78 |
- Note: The bold values represent the best results.
It can be found that FGSM achieves the best performance of 11.92% on MobileNetv3-large-100 and BIM achieves the best performance of 78.71% on CDCN++∗, while the proposed scheme achieves the best performance on the remaining models with different loss combinations. On DenseNet-161, TF-EfficientNet-b0, and Vit-Base-ResNet-50, the attack success rates of the proposed combinations are clearly higher than those of the gradient-based methods. At the same time, it can be seen that using the CS of the feature vectors instead of their Euclidean distance as the misclassification loss function is equally effective. Comparing SSIM with MSE, the attack effect of the "SSIM + CE" loss combination is obviously better than that of the "MSE + CE" combination. The generated adversarial examples are shown in Figure 6. It can be found that the adversarial perturbation generated by the proposed attack is relatively concentrated: the perturbations generated with CS and MSE are located on the right cheek, while the perturbations of the gradient-based methods are scattered over the whole image. The results also show that the effects of equations (9) and (13) are indeed similar.

The training status of each face liveness detection model is listed in Table 3. From Table 3, the CDCN++∗ model is not as good as the other models when it is changed to a binary classification task. Therefore, DenseNet-161 is used as the threat model to test the attack success rate of the generated adversarial examples, and the results are listed in Table 4.
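The metrics reported in Table 3 below follow the usual presentation attack detection definitions, stated here for reference (the paper does not restate them):

$$\mathrm{APCER} = \frac{\#\{\text{attack presentations classified as bona fide}\}}{\#\{\text{attack presentations}\}}, \qquad \mathrm{BPCER} = \frac{\#\{\text{bona fide presentations classified as attack}\}}{\#\{\text{bona fide presentations}\}}, \qquad \mathrm{ACER} = \frac{\mathrm{APCER} + \mathrm{BPCER}}{2}.$$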
None | CDCN++∗ | DenseNet-161 | DenseNet-121 | MobileNetv3-large-100 | Vgg-19 | TF-EfficientNet-b0 | ResNet-152 | Vit-Base-ResNet-50 |
---|---|---|---|---|---|---|---|---|
APCER | 6.02 | 0.16 | 0.16 | 0.11 | 0.16 | 0.00 | 0.16 | 0.16 |
BPCER | 3.92 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 |
ACER | 4.97 | 0.08 | 0.08 | 0.06 | 0.08 | 0.00 | 0.11 | 0.08 |
Method | CDCN++∗ | DenseNet-161 | DenseNet-121 | MobileNetv3-large-100 | Vgg-19 | TF-EfficientNet-b0 | ResNet-152 | Vit-Base-ResNet-50 |
---|---|---|---|---|---|---|---|---|
FGSM [21] | 17.32 | 46.43 | 39.63 | 34.42 | 0.78 | 6.44 | 14.97 | 5.70
BIM [22] | 16.54 | 98.02 | 84.01 | 26.79 | 0.04 | 16.51 | 18.27 | 6.32
PGD [23] | 15.63 | 97.51 | 82.71 | 37.10 | 0.13 | 19.20 | 15.54 | 12.4
C&W [24] | 18.65 | 76.37 | 17.35 | 15.71 | 0.13 | 4.93 | 5.97 | 2.25
SSIM + MSE (high) | 22.00 | 100.0 | 0.17 | 4.21 | 11.61 | 22.68 | 22.26 | 26.88 |
SSIM + MSE (low) | 21.75 | 76.18 | 0.04 | 4.43 | 0.53 | 15.03 | 11.99 | 7.18 |
SSIM + CE | 26.89 | 93.54 | 0.67 | 1.15 | 3.21 | 10.46 | 11.30 | 7.21 |
- Note: The bold values represent the best results.
From the results, the attack success rate of all methods is improved. For different face liveness detection models, the proposed attack and the gradient-based methods have their own advantages. It is worth noting that the training of the SSIM + MSE combination has large fluctuations, and the results of the best and worst cases differ considerably. The reason may be that the image enhancement operation adds randomness and stochastic gradient descent gets stuck in a local optimum. Together with the results in Table 2, the attack success rate on the CDCN++∗ model is lower than that on DenseNet-161, which indicates the robustness of CDCN++∗.
5.3.2. Image Quality Analysis
The actual attack success rate also depends on the pass rates of the face detection module and the face authentication module. The final attack success rate of the adversarial examples generated on the CDCN++∗ model is listed in Table 5, and that of the adversarial examples generated on DenseNet-161 is listed in Table 6, where Face Det represents the pass rate of the face detection module and Face Rec represents the pass rate of the face recognition module. It can be seen from the tables that the pass rate of face detection is basically above 90%. However, on the face authentication module, our proposed method has a low pass rate, especially when MSE is used as the image quality loss function, which results in larger image distortion. According to the experimental results, an attack method with a lower PSNR does not necessarily have a lower pass rate on face detection and face identity authentication. Comparing Tables 5 and 6, the adversarial examples generated on DenseNet-161 have significantly lower pass rates on Face Det and Face Rec even though the PSNR is not much different, which means that the image quality of the adversarial examples generated on DenseNet-161 is worse. The general trend is that the better the image quality, the higher the total ASR. Nevertheless, the generation of adversarial examples inevitably introduces perturbation, so a balance must be struck between image quality and attack performance.
Method | PSNR | Face Det | Spoof Det | Face Rec | Total ASR |
---|---|---|---|---|---|
FGSM [21] | 20.82 | 99.87 | 15.22 | 95.86 | 14.85
BIM [22] | 20.55 | 99.46 | 78.71 | 94.24 | 73.77
PGD [23] | 19.88 | 99.16 | 75.74 | 94.40 | 70.89
C&W [24] | 26.75 | 99.95 | 70.08 | 96.56 | 67.63
SSIM + MSE | 21.09 | 97.15 | 41.84 | 87.91 | 35.73 |
SSIM + cosine | 20.72 | 98.54 | 43.44 | 90.29 | 38.64 |
SSIM + CE | 20.72 | 99.66 | 25.78 | 89.19 | 22.91 |
MSE + MSE | 19.45 | 94.24 | 53.30 | 79.52 | 39.94 |
MSE + cosine | 18.98 | 93.32 | 48.06 | 77.12 | 34.58 |
MSE + CE | 18.88 | 99.54 | 19.96 | 82.64 | 16.41 |
- Note: The bold values represent the best results.
Method | PSNR | Face Det | Spoof Det | Face Rec | Total ASR |
---|---|---|---|---|---|
FGSM [21] | 20.84 | 95.55 | 46.43 | 88.32 | 39.18
BIM [22] | 21.19 | 95.03 | 98.02 | 82.11 | 76.48
PGD [23] | 20.70 | 97.51 | 95.49 | 85.81 | 79.89
C&W [24] | 23.66 | 96.25 | 76.37 | 89.60 | 65.86
SSIM + MSE (high) | 19.88 | 99.87 | 100.0 | 78.86 | 78.75 |
SSIM + MSE (low) | 19.58 | 98.70 | 76.18 | 81.97 | 61.63 |
SSIM + CE | 21.24 | 99.58 | 93.54 | 89.67 | 83.52 |
- Note: The bold values represent the best results.
5.3.3. Adversarial Sample Generation Speed
To explore the adversarial example generation speed of various methods, each attack generates 400 adversarial examples, and the average generation time over the 400 adversarial examples is reported as the final result. The experimental results are listed in Table 7. From the results, the proposed adversarial attack has the fastest generation speed. FGSM is faster than the other gradient-based methods because only one gradient back propagation is performed. As the proposed attack does not need gradients and directly obtains adversarial examples through the FLDATN network, its generation speed is much faster than that of FGSM.
5.3.4. Ablation Experiment
To explore the effectiveness of the CBAM module in generating adversarial examples, ablation experiments are conducted, and the results are listed in Table 8. During training, even without changing the training parameters, the distortion of the adversarial examples generated in the digital domain will differ, and after reshooting the same adversarial example, the attack success rate will fluctuate slightly. Both stochastic gradient descent training and the reshooting of adversarial examples can interfere with the quality of the adversarial examples. The experiments without CBAM show that the attack success rate on TF-EfficientNet-b0 and Vit-Base-ResNet-50 decreases, while the attack success rate on CDCN++∗ increases. The results show that the CBAM module improves the generalization of the adversarial examples, although the success rate of the white-box attack decreases. In the experiments without the CBAM module, adversarial examples with poorer image quality achieved better generalization.
CBAM | PSNR | Face Det | CDCN++∗ | DenseNet-161 | DenseNet-121 | MobileNetv3-large-100 | Vgg-19 | TF-EfficientNet-b0 | ResNet-152 | Vit-Base-ResNet-50 |
---|---|---|---|---|---|---|---|---|---|---|
× | 20.34 | 96.72 | 60.07 | 46.86 | 0.39 | 2.62 | 2.62 | 14.63 | 2.32 | 0.90
√ | 22.10 | 99.70 | 43.19 | 48.65 | 0.67 | 3.96 | 3.96 | 35.90 | 9.53 | 7.83 |
- Note: The bold values represent the best results.
6. Discussion
FLDATN is suitable for situations with a large amount of training data. However, when a dataset has only a single image, FLDATN cannot be trained, and in this case it is more efficient to use a gradient-based method. To discuss the difference in the attack success rate of adversarial examples generated from faces of different identities, while also considering the generation speed, a method combining EOT and FGSM is used to generate adversarial examples for a single face image. In the experiment, the image transformation distribution settings are consistent with [6], and the number of transformed images is also 2000. Experiments are carried out with 12 different identities in the Oulu-NPU dataset (6 living faces and 6 prosthetic faces), and the attacked model is CDCN++∗. The results are shown in Figure 7. It can be found that the attack difficulty differs across identities. Combined with Table 2, it can be seen that the extreme values of the FGSM attack success rate are far from its average value. From the results, the attack success rate of the adversarial examples of some faces is high, while that of other faces is very low. In actual attacks, often only one face image is selected, so the attack success rate of the adversarial examples may be either high or low.

7. Conclusion
A black-box attack for face liveness detection based on ATNs is proposed in this paper. FLDATN can quickly generate adversarial examples after training on a large amount of data, and the CBAM module can effectively improve the generalization performance of the adversarial examples as well as better localize the adversarial perturbations. Experiments show that a reasonable use of feature vector similarity can achieve a good attack success rate and good generalization in the physical domain without knowing the loss function and optimizer of the threat model. Compared with MSE, SSIM can better control the image distortion. Compared with other adversarial example generation methods, the proposed method achieves better generalization and a faster generation speed. However, the quality of the generated adversarial examples is still far from satisfactory, and the attack success rate still has room for improvement. Furthermore, many face liveness detection models output a depth map rather than solving a simple binary classification task. Our future work will concentrate on attacking such models.
Conflicts of Interest
The authors declare no conflicts of interest.
Funding
This research was supported by the National Natural Science Foundation of China (Grant No. 62072055).
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 62072055). The authors would like to thank the authors of the previous works for sharing their codes.
Open Research
Data Availability Statement
The data that support the findings of this study are the Oulu-NPU dataset, and they are available from https://sites.google.com/site/oulunpudatabase.