Volume 2025, Issue 1, 7945646
Research Article
Open Access

Exposing the Forgery Clues of DeepFakes via Exploring the Inconsistent Expression Cues

Jiatong Liu, Lina Wang (Corresponding Author), Run Wang, Jianpeng Ke, Xi Ye, and Yadi Wu

Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China

First published: 13 January 2025
Citations: 1
Academic Editor: Yu-an Tan

Abstract

The pervasive spread of DeepFakes poses a profound threat to individual privacy and the stability of society. Mistaking synthetic videos of a celebrity for genuine footage and passing off impersonated forgery videos as authentic are just two of the consequences DeepFakes can produce. We observe that current detectors which blindly deploy deep learning techniques are not effective in capturing the subtle clues of forgery when generative models produce remarkably realistic faces. Inspired by the fact that synthetic operations inevitably modify the eyes and mouth regions to match the target face with the identity or expression of the source face, we conjecture that the continuity of the facial movement patterns representing expressions in genuine faces will be disrupted or completely broken in synthetic faces, making it a potentially strong indicator for DeepFake detection. To verify this conjecture, we utilize a dual-branch network to capture the inconsistent patterns of facial movements within the eyes and mouth regions separately. Extensive experiments on the popular FaceForensics++, Celeb-DF-v1, Celeb-DF-v2, and DFDC-Preview datasets demonstrate not only the effectiveness but also the robustness of our method, which outperforms state-of-the-art baselines. Moreover, this work exhibits greater robustness against adversarial attacks, with attack success rates (ASRs) of 54.8% under the I-FGSM attack and 43.1% under the PGD attack on the DeepFakes subset of FaceForensics++, respectively.

1. Introduction

In recent years, remarkable advancements in the computer vision domain, notably in generative adversarial networks (GANs) [1, 2], have made it possible to generate realistic images and videos, rendering them exceedingly difficult for human eyes to distinguish from genuine faces. In particular, DeepFakes can manipulate faces in diverse ways, such as identity swap, facial expression manipulation, or attribute editing, to generate photorealistic results that accurately mimic celebrities, posing a distinct threat to the authenticity of information in the digital landscape [3].

In the wake of DeepFakes’ proliferation, the wide availability of generative models has brought significant benefits to film production and entertainment. However, this accessibility has also led to a flood of misinformation, posing threats to both individual privacy and societal stability. To combat the issues posed by DeepFakes, researchers are actively proposing various countermeasures aimed at determining the authenticity of images and videos. These efforts primarily focus on passively capturing subtle differences between real and synthetic faces, e.g., investigating spatial clues at the pixel level [4], examining artifacts in the invisible frequency domain [5], and identifying differences in biological signals [6]. Unfortunately, with the rapid advancement of generative models, the clues of forgery are gradually being eliminated, rendering the majority of detectors ineffective.

When generating synthetic faces using various manipulations, it becomes evident that DeepFake techniques can disrupt the smooth and continuous muscular movements, as depicted in Figure 1. These irregularities serve as strong indicators for detectors to distinguish between natural and manipulated faces. In this work, we propose a novel idea by exploring the high-level inconsistent expressions to expose the subtle differences between genuine and fabricated faces, thereby offering valuable insights for DeepFake detection. We envision that such semantically high-level inconsistent expressions could provide enhanced robustness against various perturbations and adversarial attacks.

Figure 1: Illustration of our design motivation. An example of real faces and corresponding fake faces synthesized by various manipulations (e.g., DeepFakes, Face2Face, FaceSwap, and NeuralTextures in the FaceForensics++ dataset). The forgery operations can easily diminish the continuity of the optical flow representing the expression movements. The original facial images used in this figure are sourced from the publicly available DeepFake dataset FaceForensics++ [7].

Several researchers have delved into expression-based DeepFake detectors, but prior studies emphasize the entire face and ignore the distinctive movement patterns present in different facial regions. In addition, they tend to overlook the importance of temporal consistency across frames, as they primarily consider isolated images. To effectively address the challenge of exploring inconsistent expressions in synthetic faces, it is imperative to develop a methodology that encompasses both spatial continuity and temporal coherence. Such an approach enables a more comprehensive understanding of the intricate dynamics within facial expressions and paves the way for improved DeepFake detection.

Our aim is to develop a DeepFake detector that focuses on the inconsistent motion patterns of the eyes and mouth regions by exploring expression clues. Traditional methods for extracting expression features often employ optical flow [8], which is commonly used in current expression recognition approaches [9]. However, such techniques may overlook the crucial texture information present in faces, which is highly relevant to DeepFake detection. Furthermore, existing expression-based detectors ignore the different patterns of facial movements in various regions. To overcome these limitations, we integrate spatial continuity and temporal coherence to accentuate the representations of expression motion patterns. In addition, our approach delves into the inconsistent patterns of facial movements to enhance the effectiveness of DeepFake detection.

While current detectors offer effective strategies, most of them are unfortunately not practical for real-world scenarios. These detectors exhibit significant performance drops when encountering unseen manipulations, which is mainly attributed to overfitting of the training data. Moreover, they prove vulnerable to various perturbations (e.g., noise, compression, and blurring), because they heavily rely on specific image cues that are easily disrupted by simple image transformations. Furthermore, they are susceptible to adversarial attacks [10] (e.g., I-FGSM [11] and PGD [12]), where crafted perturbations are introduced to deceive the classifiers. In summary, existing DeepFake detectors suffer from the following two key challenges:
  • 1.

    Poor generalization to unseen manipulations: Many existing DeepFake detectors primarily aim to enhance performance on datasets with known synthetic techniques. However, when these models are transferred to datasets of unseen manipulations, they often experience a significant performance decline. Because different datasets are generated with specific algorithms, previously trained detectors become overly tailored to the limited training data, which limits their generalization. Most previous detectors focus on regions of interest tied to particular data, resulting in well-trained models that fail to generalize across datasets. Some detectors have focused on improving generalization, e.g., by searching for common artifacts [4] or concentrating on blending boundaries [13]. However, such common artifacts are gradually removed by advanced generative models, and blending boundaries are vulnerable to low-level image transformation operations.

  • 2.

    Not robust to various perturbations. In real-world scenarios, DeepFakes are susceptible to various degradations. These perturbations can be categorized into two main types: common image transformations (e.g., noise, compression, and blur) [4] and imperceptible adversarial attacks (e.g., PGD and I-FGSM) [14]. For simple image transformations, most current detectors heavily rely on specific image textures, pixel distributions, or structural information to distinguish between authentic and manipulated faces. However, these clues can be easily disrupted or eliminated after undergoing common image transformations. In the case of adversarial attacks, existing studies often employ deep neural networks (DNNs) to design detectors, and attackers can exploit this vulnerability by introducing specific perturbations to a fake video, which can mislead the detector into classifying it as real.

To tackle the two fundamental challenges faced by DeepFake detectors, we first craft the movement spatial–temporal representations (MSTRs) of sequential frames to emphasize the spatial continuity and temporal coherence of movement patterns, and then we implement a dual-branch architecture to learn distinct movement patterns of eyes and mouth regions separately to explore the high-level expression clues, which are not susceptible to various perturbations and adversarial attacks.

In this work, we find that current generative models manipulate faces with a focus on frame-level quality, which leads to poor temporal modeling, and they do not explicitly capture continuous expressions; thus, we expose DeepFake videos by exploring the inconsistent expression cues. Specifically, our method outperforms advanced DeepFake detection methods in in-dataset evaluation, e.g., it achieves 99.9% AUC on the FaceForensics++ (C23) dataset, 99.3% AUC on the Celeb-DF-v2 dataset, and 95.3% AUC on the DFDC-Preview dataset. It is also more robust to various perturbations (e.g., Gaussian noise (Noise), Gaussian blur (Blur), JPEG compression (JPEG), and video compression (Comp)) than the best-performing LipForensics [15] method by 5.9% AUC on average, and more robust to adversarial attacks than advanced detectors, with attack success rates (ASRs) that are 46.6% and 56.5% lower on average under I-FGSM and PGD attacks, respectively. Our main contributions are summarized as follows:
  • 1.

    We observe that existing DeepFakes disrupt the continuity of facial muscle movements. To capture these forgery cues, we propose a DeepFake detection method that includes extracting spatiotemporal representations of facial movements, and a dual-branch network is designed to explore inconsistent expression cues in the eyes and mouth regions, enabling effective detection of DeepFake videos.

  • 2.

    To enhance robustness against various degradations and adversarial attacks, we design spatiotemporal representations of facial movements. Image degradations and adversarial attacks typically destroy texture details within frames. However, motion patterns extracted across frames, as high-level semantic features, remain largely unaffected and strengthen the robustness of our method.

  • 3.

    To improve generalization to unknown DeepFakes, the dual-branch network is designed to independently learn expression cues from the eyes and mouth regions. By extracting motion patterns from different facial areas, the method reduces reliance on specific datasets or forgery techniques, enhancing the generalization capability of our method.

  • 4.

    Experimental results on FaceForensics++, Celeb-DF-v1, Celeb-DF-v2, and DFDC-Preview demonstrate that our method outperforms state-of-the-art approaches. Furthermore, we conduct experiments under various degradations and two typical adversarial attacks, achieving an average AUC of 92.6% and an ASR of 48.0%. The results indicate that our model exhibits superior robustness compared to other advanced detectors.

The rest of this manuscript is structured as follows. Section 2 provides a review of related works on DeepFake generation, DeepFake detection, and expression-based DeepFake detection. Section 3 defines the materials and methods of this study. Section 4 offers experimental details and comprehensive results. Section 5 discusses the limitations of current generative models in replicating human facial expressions. Lastly, Section 6 concludes the paper, with limitations and future research directions.

2. Related Works

2.1. DeepFake Generation

GANs [1, 2] have witnessed significant advancements in image generation and manipulation. DeepFakes exploit the power of GANs to create realistic multimedia content, posing potential risks to personal privacy and societal stability. Generally, there are four primary types of DeepFakes: entire face synthesis, attribute manipulation, identity swap, and face reenactment.

Entire face synthesis generates faces that do not exist in the real world, for example, PGGAN [16] and StyleGAN [17]. Attribute manipulation can modify simple facial attributes (e.g., hair color and smile) and retouch complex attributes (e.g., gender and age), for example, StarGAN [18] and STGAN [19]. Identity swap replaces the face between the target and source images, with CycleGAN [20] and FaceSwap [21] as two classical works. Similar to identity swap, face reenactment swaps the facial expression between the target and source images by deploying popular methods such as Face2Face [22].

Nevertheless, existing DeepFake techniques do not explicitly preserve the natural expression movements of the eyes and mouth because they generate faces with a focus on frame-level quality, failing to capture the continuity of expression movements or to model the temporal information across sequential frames in a video. Furthermore, current generative models encounter challenges in learning sufficiently continuous and natural expression patterns from limited data samples, inspiring us to explore indicators of inconsistent expressions to classify real and synthesized faces.

2.2. DeepFake Detection

In recent years, numerous studies have demonstrated that even the minor differences between genuine and forged faces can be captured to distinguish authenticity. Therefore, researchers are actively engaged in devising various countermeasures to detect DeepFakes, which fall into two categories, that is, frame-level and video-level approaches.

Frame-level approaches focus on the authenticity of single frames. Afchar et al. propose MesoNet, designed for analyzing the faces generated by DeepFake and Face2Face methods at a mesoscopic level [23]. Xception is a conventional CNN trained on ImageNet and adapted for DeepFake detection by replacing the fully connected layer with a pair of outputs [7]. Li et al. provide a general method that exploits the observation that most face forgery pipelines rely on a blending step, and it does not depend on any knowledge of the artifacts linked to a particular facial manipulation technique [24]. Zhao et al. propose a framework with an attention module that combines textural and semantic features to detect forgery images [25]. Wang, Sun, and Tang take the original images and their corresponding quality-degraded images as the inputs of Siamese networks [26]. Cao et al. use a reconstruction network to learn the common features of real images and utilize the reconstruction differences to classify real and fake images [27]. Shiohara and Yamasaki suggest a method that reproduces common forgery clues by blending pseudosource and target faces derived from the same pristine faces [13]. Ba et al. fuse broader semantic forgery features to improve the generalizability of DeepFake detection [28].

Video-level approaches distinguish the forgery of all video frames and add temporal information during forgery detection. Güera and Delp utilize a convolutional neural network (CNN) to extract features at the frame level and train a recurrent neural network (RNN) for DeepFake classification [29]. Haliassos et al. pretrain a spatial–temporal network to acquire high-level semantic representations associated with natural lipreading, which can be designed for general DeepFake detection [15]. Gu et al. present a network to learn the inconsistencies of local and global features to capture more general forgery clues [30].

In fact, most current methods are impractical in real-world scenarios, as they exhibit significant performance drops when encountering unseen manipulations. This is primarily due to their focus on distinct regions with unique data characteristics, and they suffer from various perturbations because they rely on low-level image features that are easily destroyed by common image transformations. Some existing works explore generalization by searching for artifacts common to multiple synthetic methods [13], but such artifacts are easily removed by advanced generative models, and the blending boundary is susceptible to low-level image transformation operations. To address these pivotal challenges, we propose a movement representation that captures high-level semantic expression features.

2.3. Expression-Based DeepFake Detection

Since expression recognition techniques have developed considerably [9], some researchers use expression clues for DeepFake detection, exploiting facial emotions learned from expression recognition models to classify real and fake faces. In addition, some studies combine multimodal features, for example, capturing the inconsistencies between the visual and voice modalities to detect DeepFakes [31, 32].

Although current expression-based DeepFake detection methods have shown some progress, they still suffer from fundamental shortcomings. These approaches typically utilize the entire face during training, overlooking the fact that different regions of the face exhibit distinct movement patterns. Furthermore, they often fail to capture the crucial temporal coherence across frames, as they primarily focus on isolated images. Moreover, existing methods predominantly rely on in-dataset evaluation, neglecting the vital challenges related to general capabilities and robustness that current detectors encounter. To address these limitations, our work delves into exploring the diverse expression patterns exhibited by different facial regions. We recognize that fake faces may still display expressions but their movement patterns are disrupted by forgery techniques, leading to discernible deviations from genuine faces.

To address the aforementioned issues, we build on the observation from previous studies that most forgery videos alter the eyes region [33] or mouth region [15] in order to align the target face with the identity or expression of the source face. In this paper, our objective is to expose inconsistent clues by exploring facial expressions. While generative models can indeed synthesize photorealistic faces, the natural movement patterns of facial expressions are driven by the coordinated actions of the associated muscles, making them difficult to replicate. This is primarily because current generative models produce videos with a focus on frame-level quality, and they do not explicitly capture the temporal relationship and the continuity of subtle movements across multiple frames. To synthesize continuous motion, generative models often resort to interpolation techniques, which introduce jitter and discontinuity into fake videos. This presents a significant challenge when attempting to create convincing and coherent expressions that flow naturally.

3. Materials and Methods

We present a novel approach aimed at effectively exposing DeepFakes by exploring the inconsistent facial expression patterns. In Section 3.1, we delve into the insights behind this work, followed by an overview of our methodology in Section 3.2. We then proceed to elaborate on the techniques employed to design spatial–temporal representations of facial movements in Section 3.3, as well as the process of learning expression inconsistency in Section 3.4. In addition, we outline the loss functions utilized in Section 3.5. The pipeline of our method is illustrated in Figure 2(a).

Figure 2: Overview. We extract expression features from the eyes and mouth regions of natural and forgery faces. We compute optical flow from three sequential frames to capture movement consistencies, combine the optical flow with the interframe as a new input, and train a dual-branch network to learn the movement patterns of the eyes and mouth separately. Then, we aggregate the authenticity probabilities of the two branches to distinguish authenticity. The original facial images used in this figure are sourced from the publicly available DeepFake dataset FaceForensics++ [7]. (a) Pipeline of our method. (b) Dual-branch network.

3.1. Insight

In practical scenarios, existing DeepFake detectors suffer from challenges in generalization and robustness. To tackle these key issues, our goal is to identify forgery clues that are independent of specific manipulations and remain resistant to common perturbations. Our work is inspired by the observation that current DeepFake techniques struggle to replicate continuous facial expressions due to their reliance on single-frame generation, and generative models encounter difficulties in capturing subtle and continuous changes in facial expressions. To capture these inconsistent expression clues, we investigate the role of optical flow in characterizing facial expressions, as it is widely used for recognizing different categories of emotions and plays a crucial role in analyzing expression content. In addition, individual action unit labels describe either eye or mouth movements, but not both, and expression categories are determined by combinations of action unit labels. Consequently, the movement patterns of the eyes and mouth exhibit differences [34].

Forgery operations have an impact on the continuous and natural movements present in authentic faces. This is because current synthetic methods primarily focus on capturing and reproducing facial expressions in single frames, lacking the ability to accurately capture complex muscular movements and model temporal information. Therefore, the presence of inconsistent expression cues can be leveraged to reveal subtle differences between real and forged videos. As depicted in Figure 1, manipulation operations can easily disrupt and diminish the continuity of optical flow, which represents the speed and direction of facial movements. Building upon this understanding, we propose a novel approach that explores expression clues for DeepFake detection. In Figure 2(a), we illustrate the pipeline of our method, which constructs MSTRs to highlight the spatial continuity and temporal coherence of sequential frames, within a dual-branch network architecture that learns the different expression patterns of the eyes and mouth regions separately.

3.2. Overview

Given a video that contains T frames, the goal is to judge whether this video is real or fake by exploring expression features. For this purpose, we first develop the MSTR in Section 3.3 for the input video, which highlights the spatial continuity and temporal coherence and outputs a combined image of optical flow and the interframe from three sequential frames, that is, x ∈ ℝ^(T×R×C), where T is the number of three sequential frames, R denotes the regions of the face we focus on, and C is the number of channels. In this stage, we formulate x with a single color channel for computing optical flow. Intuitively, x represents the temporal variation of region movements (i.e., the eyes and mouth regions in our work) in the faces, which highlights the expression patterns.

We design a dual-branch neural network that takes the MSTR feature map as input and predicts whether the input faces are real. However, different regions of a face have different motion patterns, e.g., eyebrows raise or droop, eyelids raise or tighten, and lips stretch or tighten, and these motions are related to particular muscular movements. If we simply take the whole face as the input, we may corrupt the distinct motion patterns of different face regions. To address this problem, we choose the most significant regions of the eyes and mouth as the input feature maps (the process is shown in Figure 2(a)).

Our objective is to capture the diverse motion patterns exhibited by the eyes and mouth areas. To achieve this, we present a dual-branch network that effectively learns the motion of the eyes and mouth separately through the design of the base layer and branch layer.

3.3. MSTR

Utilizing expression features for DeepFake detection could be approached with existing expression representations designed for expression recognition. However, they may ignore the texture information present in faces, which is highly relevant to DeepFake detection. To tackle this problem, we employ the optical flow technique to capture continuous expression patterns across frames and combine them with the interframe as a new input. As illustrated in Figure 3, we apply the optical flow diagram to three consecutive frames of both real and synthetic faces. We observe that real facial expressions involve natural movements caused by the contraction of relevant muscles. For instance, in the first row and second column, we can see uniform and consistent movements in the cheek region of real faces. In contrast, the movements in synthetic faces appear discontinuous and irregular. In addition, in the fourth column, we observe that manipulation operations diminish the movements in the eyes region. These observations suggest that expression motion patterns can serve as strong indicators of authenticity.

Figure 3: Analysis of optical flow. The first row shows the natural movements of real faces, while the other rows show the inconsistencies and destroyed movements of synthetic faces. The original facial images used in this figure are sourced from the publicly available DeepFake dataset FaceForensics++ [7].
The optical flow [8] measures subtle facial muscle movements by tracking pixel intensity changes between two adjacent facial frames over time. The movement of a facial pixel at location (x, y, t) with intensity I(x, y, t) is represented by Δx, Δy, and Δt. Based on the brightness consistency constraint, we have
I(x, y, t) = I(x + Δx, y + Δy, t + Δt). (1)
Assuming minimal positional changes between adjacent facial frames, the image constraint at I(x, y, t) can be expressed using a Taylor series as follows:
I(x + Δx, y + Δy, t + Δt) = I(x, y, t) + (∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt + τ, (2)
where τ represents a higher-order infinitesimal. From these formulas, we derive
(∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt = 0, (3)
which leads to the following formula:
Ixu + Iyv + It = 0, (4)
where u and v represent the x and y components of the optical flow of I(x, y, t), respectively.
To solve for u and v, we apply the more robust TV-L1 optical flow model [35] between three facial frames to obtain two optical flow images. The TV-L1 model improves the optical flow constraint by introducing a variable w for illumination changes. The TV-L1 energy can be described as
ETVL1 = ∫ (|∇u| + |∇v| + |∇w| + λ|ρ(u, v, w)|) dx, (5)
where ETVL1 is the energy function, λ weights the data term against the smoothness terms, ρ(u, v, w) = Ix(u − u0) + Iy(v − v0) + It + βw, and the parameter β serves as a weighting factor for the term related to light changes.
Next, we combine the two optical flow images and the interframe into a new input image x to better correlate the spatial and temporal representations. Then, we divide the new input x into the eyes region xeyes as
xeyes = x[0 : h/2] (6)
and the mouth region xmouth as
xmouth = x[h/2 : h], (7)
where h is the height of the combined image x.
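To make these steps concrete, the following minimal sketch builds one MSTR from three aligned face crops using OpenCV's TV-L1 implementation. The channel layout (one magnitude channel per flow field plus the grayscale interframe) and the function name build_mstr are illustrative assumptions, not the authors' exact implementation.

```python
import cv2  # requires opencv-contrib-python for the cv2.optflow module
import numpy as np

def build_mstr(frame_prev, frame_mid, frame_next):
    """Combine two TV-L1 optical flow fields with the middle (inter-) frame,
    then split the stack into eyes (upper half) and mouth (lower half) regions."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in (frame_prev, frame_mid, frame_next)]

    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()   # robust TV-L1 solver [35]
    flow_a = tvl1.calc(gray[0], gray[1], None)        # H x W x 2 (u, v)
    flow_b = tvl1.calc(gray[1], gray[2], None)

    # One channel per flow field (here: magnitude) keeps the combined input 3-channel.
    mag_a = np.linalg.norm(flow_a, axis=2)
    mag_b = np.linalg.norm(flow_b, axis=2)
    mstr = np.stack([mag_a, gray[1].astype(np.float32), mag_b], axis=0)  # 3 x H x W

    h = mstr.shape[1]
    x_eyes, x_mouth = mstr[:, : h // 2, :], mstr[:, h // 2 :, :]  # cf. Eqs. (6)-(7)
    return x_eyes, x_mouth
```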

Figure 2(a) shows examples of real and forged faces and their MSTR images. We observe that the inconsistencies are easily exposed in our MSTR images, which provides a more powerful indicator for distinguishing authenticity. In the cross-dataset evaluation stage, to help the model learn common forgery cues, we deploy a landmark augmentation method [13] on the fake interframe, as shown in Figure 4.

Figure 4: Illustration of landmark augmentation. The fake interframe is replaced by a landmark-augmented frame. The original facial images used in this figure are sourced from the publicly available DeepFake dataset FaceForensics++ [7].

3.4. Expression Inconsistency Learning

In this section, we describe the dual-branch network, which serves as the foundation of our DeepFake detector, leveraging both the MSTR and the distinct movement patterns of the eyes and mouth regions. Following the works [36, 37], the network structure is shown in Figure 2(b). In our paper, we redesign the base layer and branch layer to capture distinct features from the eyes and mouth separately. For specific details regarding the backbone architecture, refer to Table 1.

Table 1. Details of backbone architecture.
Usage Layers Operator Input size Output size
Base layer Layer 1 Conv2d 7 × 7 3 × 112 × 224 64 × 56 × 112
MaxPool2d 3 × 3 64 × 56 × 112 64 × 28 × 56
Layer 2 Res-block 3 × 3 64 × 28 × 56 64 × 28 × 56
Res-block 3 × 3 64 × 28 × 56 64 × 28 × 56
Layer 3 Res-block 3 × 3 64 × 28 × 56 128 × 14 × 28
Res-block 3 × 3 128 × 14 × 28 128 × 14 × 28
  
Branch layer Layer 4 Res-block 3 × 3 128 × 14 × 28 256 × 7 × 14
Res-block 3 × 3 256 × 7 × 14 256 × 7 × 14
Layer 5 Res-block 3 × 3 256 × 7 × 14 512 × 4 × 7
Res-block 3 × 3 512 × 4 × 7 512 × 4 × 7
Layer 6 AdaptiveAvgPool 512 × 4 × 7 512 × 1 × 1
Flatten 512 × 1 × 1 512
  • Note: This table shows the size of kernels, the resolution of the input feature map, and the input channels of each layer.

3.4.1. The Base Layer

These layers encode the input images xeyes and xmouth (x ∈ ℝ^(3×112×224)) into features of the same form. The specifics of the base layer are as follows:
  • 1.

    The convolutional layer with a kernel size of 7 × 7 is applied to extract low-level features (e.g., edges and textures), and after that, the input eyes or mouth images are transformed into a 64 × 56 × 112 feature map. After passing through the BatchNorm2d and ReLU layers, we obtain

    F1 = ReLU(BN(Wconv ∗ x + bconv)), (8)

  • where Wconv is the weight of the convolutional kernel and bconv is the bias term of the convolutional operation.

  • 2.

    The MaxPool2d layer reduces the size of the feature map while preserving the most significant features, so we have a 64 × 28 × 56 feature map represented as follows:

    F2 = MaxPool(F1). (9)

  • 3.

    A series of Res-block with a kernel size of 3 × 3 is employed to further capture subtle features and inconsistencies in forgery images, resulting in a 128 × 14 × 28 feature map. Each Res-block contains a Conv2d layer, BatchNorm2d layer, and ReLU layer. In the key Res-block, a SA layer is added, which introduces an attention mechanism, enabling the model to enhance important features, and this process can be represented as

    F3 = SA(F2 + ReLU(BN(Wres ∗ F2 + bres))). (10)

Through extensive experiments, we have determined that the four Res-blocks are the optimal choice for our base layer.
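As a reference point, here is a minimal PyTorch sketch of the base layer following the shapes in Table 1. The spatial attention (SA) layer mentioned above is omitted for brevity, and all class names are illustrative rather than the authors' released code.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Basic 3x3 residual block (the SA layer of the key Res-block is omitted here)."""
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(cout)
        self.conv2 = nn.Conv2d(cout, cout, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(cout)
        self.relu = nn.ReLU(inplace=True)
        self.down = (nn.Sequential(nn.Conv2d(cin, cout, 1, stride, bias=False),
                                   nn.BatchNorm2d(cout))
                     if stride != 1 or cin != cout else nn.Identity())

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.down(x))

class BaseLayer(nn.Module):
    """Layers 1-3 of Table 1: shared low/mid-level features for both regions."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(                       # Layer 1
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1))
        self.layer2 = nn.Sequential(ResBlock(64, 64), ResBlock(64, 64))
        self.layer3 = nn.Sequential(ResBlock(64, 128, stride=2), ResBlock(128, 128))

    def forward(self, x):                                # x: B x 3 x 112 x 224
        return self.layer3(self.layer2(self.stem(x)))    # -> B x 128 x 14 x 28
```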

3.4.2. The Branch Layer

The layers come into play after extracting the feature maps from the base layer. We have devised a dual-branch model to independently capture the higher-level expression features of eyes and mouth regions. We incorporate another four Res-blocks in this process, which enable the extraction of more abstract and semantically rich features. In summary, our model utilizes the same base layer to extract low-level and midlevel features. The branch layer is dedicated to focusing on higher-level features while retaining the feature information from all channels, leading to more effective detection capabilities.

To make a final prediction on the input images, we choose a dual-branch CNN-based network with a fully connected layer. The fully connected layer denoted as FC(x; θc) is applied to each branch, which is used to transform the feature maps of eyes or mouth into a probability distribution of specific categories separately. The output can be described as
f = softmax(FC(E(B(x; θb); θe); θc)), (11)
where x is the input data of eyes or mouth, θ is the parameter set of each layer, B(·) is the base layer, E(·) is the branch layer, and softmax(·) is the activation function of our binary task.
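Continuing the sketch above, the branch layer and the late fusion of the two branch outputs (Eq. (11) and line 14 of Algorithm 1) could look roughly as follows; ResBlock and BaseLayer refer to the previous sketch, and the fusion weight λ = 0.5 is an assumption rather than a value stated by the paper.

```python
import torch
import torch.nn as nn

class BranchLayer(nn.Module):
    """Layers 4-6 of Table 1: high-level expression features for one facial region."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            ResBlock(128, 256, stride=2), ResBlock(256, 256),   # Layer 4
            ResBlock(256, 512, stride=2), ResBlock(512, 512))   # Layer 5
        self.pool = nn.AdaptiveAvgPool2d(1)                     # Layer 6
        self.fc = nn.Linear(512, 2)

    def forward(self, m):
        z = self.pool(self.blocks(m)).flatten(1)
        return torch.softmax(self.fc(z), dim=1)                 # cf. Eq. (11)

class DualBranchDetector(nn.Module):
    """Shared base layer, separate branches, and late fusion of the two outputs."""
    def __init__(self, fusion_lambda=0.5):
        super().__init__()
        self.base = BaseLayer()
        self.branch_eyes, self.branch_mouth = BranchLayer(), BranchLayer()
        self.lam = fusion_lambda

    def forward(self, x_eyes, x_mouth):
        f_eyes = self.branch_eyes(self.base(x_eyes))
        f_mouth = self.branch_mouth(self.base(x_mouth))
        pred = self.lam * f_eyes + (1 - self.lam) * f_mouth     # Algorithm 1, line 14
        return f_eyes, f_mouth, pred
```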

In summary, we introduce a dual-branch network architecture of the base layer and branch layer. This network effectively extracts comprehensive high-level expression features from both eyes and mouth regions, endowing our model with robustness against various perturbations. The training and testing process is shown in Algorithm 1.

    Algorithm 1: Training and testing process.
  • Input: RAW video x, preprocess method p(·), base layer module B(·; θb), branch layer module E(·; θe), and classification layer C(·; θc)
  • Output: the authenticity probability pred of each input
  • 1. Preprocessed images xe, xm = p(x)
  • 2. while the number of training iterations do
  • 3.   me ← B(xe; θb), mm ← B(xm; θb)
  • 4.   ee ← E(me; θe), em ← E(mm; θe)
  • 5.   feyes ← C(ee; θc), fmouth ← C(em; θc)
  • 6.   Optimization by (14), updating weights
  • 7.   break when training achieves stability
  • 8. end while
  • 9. for detection steps do
  • 10.   xe, xm = p(x)
  • 11.   me ← B(xe; θb), mm ← B(xm; θb)
  • 12.   ee ← E(me; θe), em ← E(mm; θe)
  • 13.   feyes ← C(ee; θc), fmouth ← C(em; θc)
  • 14.   pred = λfeyes + (1 − λ)fmouth
  • 15. end for
  • 16. return pred

3.5. Classification Loss Functions

The dual-branch learning strategy in our method aims to capture and learn the distinct expression patterns of the eyes and mouth regions. The learning procedure of our proposed method involves three steps as follows: (1) extracting the same base features, (2) learning various branch features, and (3) performing binary classification. To optimize the model, we design three loss functions: the classification loss for the eyes Leyes_cls, the classification loss for the mouth Lmouth_cls, and the total loss Ltotal. These loss functions are optimized using the general cross-entropy loss to update the parameters of the network.

In this stage, both images xeyes and xmouth are fed into separate fully connected networks at the end to obtain the classification features feyes and fmouth. The classification loss Leyes_cls is calculated by cross-entropy as follows:
Leyes_cls = −(1/N) Σ_{i=1}^{N} [yi log ŷi^eyes + (1 − yi) log(1 − ŷi^eyes)], (12)
where N is the number of samples, yi is the ground-truth label of the ith sample, and ŷi^eyes is the network output for the ith sample of the eyes area.
Similarly, we compute the classification loss Lmouth_cls as follows:
Lmouth_cls = −(1/N) Σ_{i=1}^{N} [yi log ŷi^mouth + (1 − yi) log(1 − ŷi^mouth)], (13)
where ŷi^mouth is the network output for the ith sample of the mouth area.
Finally, the total classification loss Ltotal is
Ltotal = αLeyes_cls + (1 − α)Lmouth_cls, (14)
where α is a hyperparameter, set to 0.5 in our experiments.
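A compact sketch of Eqs. (12)-(14) as reconstructed above, assuming the branch outputs are softmax probabilities (as in the previous network sketch), hence the negative log-likelihood form:

```python
import torch
import torch.nn.functional as F

def total_loss(probs_eyes, probs_mouth, labels, alpha=0.5, eps=1e-8):
    """Weighted sum of the per-branch cross-entropy losses."""
    l_eyes = F.nll_loss(torch.log(probs_eyes + eps), labels)    # Eq. (12)
    l_mouth = F.nll_loss(torch.log(probs_mouth + eps), labels)  # Eq. (13)
    return alpha * l_eyes + (1 - alpha) * l_mouth               # Eq. (14)
```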

4. Results and Analysis

In order to assess the effectiveness of our proposed approach, the experimental settings are described in Sections 4.1–4.4, and we perform in-dataset experiments using training and testing sets drawn from the same data distribution, as detailed in Section 4.5. Furthermore, to evaluate the generalization capability of our method, we conduct cross-dataset evaluations that employ a known training set and an unseen testing set, as discussed in Section 4.6. We also investigate the robustness of our model to diverse perturbations in Section 4.7, and adversarial attacks are examined in Section 4.8. In addition, we conduct ablation experiments to assess the significance of the different components integrated within our model.

4.1. Datasets

Throughout the research process of DeepFakes, challenging datasets have emerged. To ensure a comprehensive and representative evaluation of our proposed method, we have carefully selected the following datasets:
  • 1.

    FaceForensics++ dataset [7] comprises 1000 genuine videos and 4000 fake videos created by face swapping techniques such as DeepFakes and FaceSwap, as well as face reenactment methods such as Face2Face and NeuralTextures. Each video is available in three versions: RAW, slightly compressed (C23), and heavily compressed (C40).

  • 2.

    Celeb-DF-v1 and Celeb-DF-v2 datasets [38] consist of 540 authentic videos and 5639 forged videos of celebrities on YouTube, generated by using advanced DeepFakes techniques to achieve high-quality forgeries.

  • 3.

    DFDC-Preview dataset [39], released in conjunction with the competition, comprises 5000 videos generated by two facial manipulation techniques. The dataset exhibits diversity across various attributes such as gender, age, and skin tone.

In our experiments, the video datasets are split into training, validation, and testing sets in an 8:1:1 ratio. Then, we use OpenFace to extract facial images from consecutive frames. Two optical flow images are calculated between three consecutive facial images and combined with the middle facial image to create a new MSTR. For generalization evaluations, we perform data augmentation on the middle facial image. These steps are crucial for standardizing the input data across different datasets and improving model performance.
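For reference, a minimal video-level 8:1:1 split could be implemented as below; the helper name and fixed seed are illustrative, and the face extraction itself is delegated to OpenFace as described above.

```python
import random

def split_videos(video_paths, seed=0):
    """Shuffle once, then split the video list 8:1:1 into train/val/test subsets."""
    paths = sorted(video_paths)
    random.Random(seed).shuffle(paths)
    n_train, n_val = int(0.8 * len(paths)), int(0.1 * len(paths))
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])
```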

4.2. Baselines

In this section, we adopt several state-of-the-art DeepFake detection methods for evaluating our proposed model. Considering that most DeepFake detection methods have not yet released their code and weights, for a fair comparison, we directly cite their results from the corresponding original papers. We compare the latest works in both in-dataset and cross-dataset evaluations. For the subsequent robustness evaluations, we select the top four methods that performed best in the cross-dataset evaluations, reproduce their released code, and compare them with our proposed method. These works show good generalization and robustness across multiple datasets, making them suitable as comparison baselines.

4.2.1. In-Dataset Evaluation

We train and test the methods on the FaceForensics++ (C23), Celeb-DF-v2, and DFDC-Preview datasets, respectively. The comparative methods include the following: (1) ResNet50 [36]: it presents a residual network to ease the training process for image classification; (2) MesoNet [23]: it detects forged faces by analyzing them at a mesoscopic level; (3) CNN + LSTM [29]: it utilizes a convolutional LSTM to generate a temporal sequence descriptor for detecting fake videos; (4) Xception [7]: it is released with the FaceForensics++ dataset; (5) SPSL [40]: it proposes a two-branch network combining original information and multiband frequencies; (6) Face X-ray [24]: it detects forgery by locating the blending boundary between the source and target images; (7) F3-Net [41]: it takes two different frequency-aware clues to deeply capture the forgery patterns; (8) 3DCNN [42]: it proposes a novel network to extract forgery features at different scales; (9) DeepFakeucl [43]: it uses contrastive learning to design an unsupervised detection method; (10) LipForensics [15]: it extracts high-level semantic features through a pretrained lipreading model; (11) Liu et al. [44]: they present a lightweight 3DCNN to detect DeepFakes; (12) FInfer [45]: it proposes a novel framework to deal with high-visual-quality synthetic videos; (13) SBIs [13]: it generates forgery images by blending pseudosource and target face images derived from real face images; and (14) ResNet34 [28]: it combines semantic-rich features to improve the accuracy (Acc) and generalizability of DeepFake detection in real-life scenarios.

4.2.2. Cross-Dataset Evaluation

The cross-dataset evaluation is more challenging than the in-dataset evaluation; we train the models on the FaceForensics++ C23 dataset and test them on the Celeb-DF-v1, Celeb-DF-v2, and DFDC-Preview datasets, respectively. Frame-level comparison methods include the following: (1) DSP-FWA [33]: captures the features introduced in the process of affine face warping; (2) Multiattention [25]: employs multiple attention heads to design a new DeepFake detection method; (3) LiSiam [26]: considers localization invariance under various image degradations to detect DeepFakes; (4) RECCE [27]: captures the common representations of real facial images to enhance generalizability; (5) MesoNet [23]; (6) Xception [7]; (7) Face X-ray [24]; (8) SBIs [13]; and (9) ResNet34 [28]. These have been discussed in Section 4.2.1. Video-level comparison methods include the following: (1) Xception [7]: the frame-level method used to compute video-level metrics; (2) HCIL [30]: adopts two-level contrastive learning to capture more general clues for detecting forgery videos; (3) RATF [46]: designs a module to generate temporal filters for various facial regions; (4) F3-Net [41]; and (5) LipForensics [15]. These have been discussed in Section 4.2.1.

The comparison of state-of-the-art DeepFake detection methods and our proposed method is shown in Table 2. We jointly leverage spatiotemporal features to explore advanced semantic expression clues in the eyes and mouth regions of the human face, which differs from state-of-the-art approaches. In addition, it demonstrates both generalization capabilities for unknown DeepFake technologies and robustness against various image degradations and adversarial attacks.

Table 2. Comparison of the existing state-of-the-art DeepFake detection methods and our proposed method.
Methods Time Input object Extracted features Capabilities
Image Video Type Methods G R
ResNet50 [36] 2016 S DNN-based
MesoNet [23] 2018 S DNN-based
CNN + RNN [29] 2018 S DNN-based
Xception [7] 2019 S DNN-based
Face X-ray [24] 2020 S Blending boundary
F3-Net [41] 2020 F Frequency-aware clues
SPSL [40] 2020 S + F DNN-based
3DCNN [42] 2021 S + T DNN-based
DeepFakeucl [43] 2021 S Image augmentation
DSP-FWA [33] 2021 S + T Eyes movements
LipForensics [15] 2021 S + T Mouth motion
Liu et al. [44] 2021 S + T DNN-based
Multiattention [25] 2021 S DNN-based
FInfer [45] 2022 S + T Frame inference–based
HCIL [30] 2022 S + T Temporal inconsistency
LiSiam [26] 2022 S Localization invariance
RATF [46] 2022 S + T Region-aware temporal inconsistency
RECCE [27] 2022 S Reconstruction-classification learning
SBIs [13] 2022 S Self-blended augmentations
ResNet34 [28] 2024 S Global and local features
Ours S + T Expression cues of eyes and mouth
  • Note: We mainly show the time, input object (e.g., image or video), type of their extracted features (e.g., spatial (S), frequency (F), or temporal (T)), and methods of their extracted features. G and R represent the capabilities of the detection methods. G denotes whether the detector has the generalization capabilities on unknown DeepFake technologies. R denotes whether the detector is robust against various degradations and attacks.

4.3. Metrics

To evaluate the effectiveness of our DeepFake detection method, we utilize the Acc and AUC as performance metrics to measure its ability to distinguish between real and synthetic face images. In addition, we report the ASR to assess the misclassification of faces when subjected to adversarial attacks.
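As a reference, the three metrics can be computed as in the sketch below, assuming scikit-learn is available; the interpretation of ASR as the fraction of adversarial fake samples misclassified as real is our reading of the text, not a formula given by the authors.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    """Acc at a fixed decision threshold plus threshold-free AUC."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return accuracy_score(y_true, y_pred), roc_auc_score(y_true, y_score)

def attack_success_rate(preds_on_adversarial_fakes):
    """Fraction of adversarial fake samples that the detector labels as real (0)."""
    preds = np.asarray(preds_on_adversarial_fakes)
    return float((preds == 0).mean())
```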

4.4. Implementation Details

In our evaluation experiments, the optimizer is SGD, the learning rate is 10−3, with a momentum of 0.9, the training epoch is 100, and the batch size is 128. All our experiments are performed on a server with 40 cores 2.20 GHz Xeon CPUs, 64 GB RAM, and two NVIDIA Tesla V100 GPUs with 32 GB memory.
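For completeness, the optimization setup quoted above translates into roughly the following PyTorch training loop; DualBranchDetector, total_loss, and train_loader refer to the earlier sketches and to a data loader with batch size 128, all of which are placeholders rather than the authors' released code.

```python
import torch

model = DualBranchDetector().cuda()                              # from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

for epoch in range(100):                                         # 100 training epochs
    for x_eyes, x_mouth, labels in train_loader:                 # batch size 128
        x_eyes, x_mouth, labels = x_eyes.cuda(), x_mouth.cuda(), labels.cuda()
        f_eyes, f_mouth, _ = model(x_eyes, x_mouth)
        loss = total_loss(f_eyes, f_mouth, labels)                # Eqs. (12)-(14)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```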

4.5. In-Dataset Evaluation

To comprehensively evaluate the effectiveness of our proposed method, we focus on in-dataset evaluation first. To ensure meaningful comparisons with existing studies, we employ widely recognized datasets, namely, FaceForensics++, Celeb-DF-v2, and DFDC-Preview. Our method is trained and tested within the same dataset to maintain consistent data distribution. Throughout the evaluation process, we conduct thorough comparisons between our proposed method and state-of-the-art approaches in the field.

The performance results on the FaceForensics++, Celeb-DF-v2, and DFDC-Preview datasets are presented in Table 3. Our proposed method demonstrates an average improvement in AUC scores of 8.9%, 5.2%, and 6.0%, respectively, across these datasets. Specifically, on the FaceForensics++ (C23) dataset, our method surpasses the state-of-the-art ResNet34 approach by 1.6% in AUC. On the DFDC-Preview dataset, our approach exceeds Liu et al.’s method by 1.3% in AUC. For the Celeb-DF-v2 dataset, although our method ranks below the most advanced ResNet34 method, it still achieves the second-highest AUC score of 99.3%.

Table 3. In-dataset evaluation results on FaceForensics++ (C23), Celeb-DF-v2, and DFDC-Preview datasets.
FaceForensics++ (C23) Celeb-DF-v2 DFDC-Preview
Methods Time AUC↑ Methods Time AUC↑ Methods Time AUC↑
SPSL [40] 2020 94.3 Xception [7] 2019 98.5 ResNet50 [36] 2016 87.7
Face X-ray [24] 2020 87.4 DeepFakeucl [43] 2021 90.5 MesoNet [23] 2018 90.6
F3-Net [41] 2020 98.1 3DCNN [42] 2021 88.8 CNN + RNN [29] 2018 91.5
3DCNN [42] 2021 72.2 SBIs [13] 2022 93.7 Xception [7] 2019 94.0
FInfer [45] 2022 95.7 FInfer [45] 2022 93.3 LipForensics [15] 2021 78.1
ResNet34 [28] 2024 98.3 ResNet34 [28] 2024 99.9 Liu et al. [44] 2021 94.0
Ours 99.9 Ours 99.3 Ours 95.3
  • Note: The best and second-best AUC scores (%) are highlighted using bold and italic formatting, respectively. The uparrow indicates that the larger the AUC, the better the experimental performance.

It is worth noting that the results on the DFDC-Preview dataset are lower compared to those on the FaceForensics++ (C23) and Celeb-DF-v2 datasets. This discrepancy is attributed to the presence of low-quality videos in the DFDC-Preview dataset, which complicates the extraction of human faces.

In order to better understand and visualize the spatial regions that our model relies on, we select the real faces and corresponding fake faces to plot the results of Grad-CAM in Figure 5. As expected, our proposed model focuses predominantly on similar areas (eyes and mouth) among real and fake faces.

Figure 5: Visualization of the regions on which our model relies. The original facial images used in this figure are sourced from the publicly available DeepFake datasets FaceForensics++ [7], Celeb-DF-v2 [38], and DFDC-Preview [39].

4.6. Cross-Dataset Evaluation

Given the rapid advancements in DeepFake techniques, it is crucial to develop a detector that can accurately distinguish forged content generated by novel manipulation methods, even when trained on known datasets. To enable our model to learn common synthetic clues, we apply a data augmentation method [13] to the interframe. Specifically, we replace the preprocessed interframe described in Section 3.3 with an augmented frame, as illustrated in Figure 4. For our experiments, we initially train our model using the FaceForensics++ dataset and subsequently evaluate its performance on other datasets, namely Celeb-DF-v1, Celeb-DF-v2, and DFDC-Preview. The experimental results are presented in Table 4.

Table 4. Cross-dataset evaluation on the Celeb-DF-v1, Celeb-DF-v2, and DFDC-Preview.
Methods Year Input object Training dataset Celeb-DF-v1 AUC↑ Celeb-DF-v2 AUC↑ DFDC-Preview AUC↑
MesoNet [23] 2018 Video FF++ 42.2 53.6 59.4
Xception [7] 2019 Video FF++ 75.0 77.8 69.8
Face X-ray [24] 2020 Video FF++ 80.6 74.2 80.9
F3-Net [41] 2020 Video FF++ 75.7
DSP-FWA [33] 2021 Video FF++ 78.5 81.4 59.5
LipForensics [15] 2021 Video FF++ 83.3 82.4 73.5
HCIL [30] 2022 Video FF++ 79.0 69.2
RATF [46] 2022 Video FF++ 76.5 69.1
LiSiam [26] 2022 Image/Video FF++ 81.1 78.2
Multiattention [25] 2021 Image FF++ 67.4
RECCE [27] 2022 Image FF++ 68.7
SBIs [13] 2022 Image FF++ 87.9 93.2 86.2
ResNet34 [28] 2024 Image FF++ 81.8 86.4 85.1
Ours Video FF++ 88.4 87.2 84.1
  • Note: The best and second-best AUC scores (%) are highlighted using bold and italic formatting, respectively. Our proposed method demonstrates superiority over state-of-the-art frame-level detectors on the Celeb-DF-v1 dataset and outperforms advanced video-level methods averagely on three datasets. The uparrow indicates that the larger the AUC, the better the experimental performance.

4.6.1. Comparison With Frame-Level Detectors

Our approach demonstrates a significant improvement over advanced frame-level methods on the Celeb-DF-v1 dataset, achieving an average increase of 13.1% in AUC scores. Although our method shows lower results on the Celeb-DF-v2 and DFDC-Preview datasets compared to the frame-level methods, this can be attributed to the inherent differences in focus. Frame-level models concentrate solely on the detailed information within individual images, ignoring interframe correlations, which enhances their generalization capabilities across various images. Despite the superior generalization ability of frame-level DeepFake detectors, video-level models, such as ours, leverage temporal sequence information, which proves advantageous for in-dataset evaluations as shown in Table 3 and specific scenarios involving interframe manipulations or motion artifacts.

4.6.2. Comparison With Video-Level Detectors

When compared to state-of-the-art video-level methods, our model outperforms the advanced LipForensics, HCIL, and RATF approaches across the three datasets. Specifically, our method achieves an average improvement in AUC scores of 5.1%, 9.7%, and 14.2% on the Celeb-DF-v1, Celeb-DF-v2, and DFDC-Preview datasets, respectively.

4.7. Robustness to Various Perturbations

DeepFake detectors should exhibit robustness against the various image transformations and adversarial attacks that images may encounter in the real world. To assess the detectors’ robustness to unseen degradations, we consider several simple perturbations at five severity levels, as defined in the DeepForensics-1.0 benchmark [47]. These perturbations include changes in color saturation (CS), color contrast (CC), blockwise operations (Block), Gaussian noise (Noise), Gaussian blur (Blur), JPEG compression (JPEG), and video compression (Comp). Our model is trained on the FaceForensics++ (RAW) dataset containing both real and fake videos, and subsequently tested on FaceForensics++ videos that have been subjected to the aforementioned perturbations. Examples of the disturbed images are depicted in Figure 6.

Figure 6: Perturbation examples. The perturbation examples are generated to evaluate the robustness of our proposed method. They consist of changing color saturation (CS), color contrast (CC), blockwise operation (Block), Gaussian noise (Noise), blurring (Blur), JPEG compression (JPEG), and video compression (Comp). The original facial images used in this figure are sourced from the publicly available DeepFake dataset FaceForensics++ [7].
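To illustrate three of these corruptions, the hedged sketch below applies Gaussian noise, Gaussian blur, and JPEG compression at five severity levels; the severity-to-parameter mapping is illustrative and does not reproduce the exact DeepForensics-1.0 definitions.

```python
import cv2
import numpy as np

def gaussian_noise(img, severity=1):
    sigma = [5, 10, 15, 25, 35][severity - 1]
    noisy = img.astype(np.float32) + np.random.normal(0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def gaussian_blur(img, severity=1):
    k = [3, 5, 7, 9, 11][severity - 1]            # kernel size must be odd
    return cv2.GaussianBlur(img, (k, k), 0)

def jpeg_compress(img, severity=1):
    quality = [90, 70, 50, 30, 10][severity - 1]
    ok, buf = cv2.imencode(".jpg", img, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```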

To ensure a fair comparison, we conduct the robustness experiments using comparison methods that have publicly released code and weights and that achieved the best performance in Table 4. The methods included are Xception [7], Face X-ray [24], LipForensics [15], and SBIs [13].

Figures 7 and 8 show how the models perform with increasing severity levels for each perturbation type, while Table 5 summarizes the average AUC scores across the five severity levels. The results indicate that our method is more resilient to degradations than state-of-the-art approaches. For high-frequency corruptions such as Blur, JPEG, and Comp, our model outperforms the advanced LipForensics [15] and SBIs [13] methods by an average of 3.0% and 20.0% in AUC, respectively. This is because these degradations primarily affect the quality of individual frames without impacting the trajectory of facial muscle movements, allowing our model to overlook the effects caused by these degradations. Although the other methods show strong generalization capabilities, they are particularly sensitive to quality degradations; for example, Face X-ray [24] is vulnerable to video compression, and LipForensics [15] is more affected by blockwise perturbations than our method.

Figure 7: Robustness to various perturbations. The AUC scores at various severity levels for unseen distortions. Our proposed model is more robust than other advanced methods to all perturbations except color contrast corruption, which affects the calculation of optical flow. (a) Color saturation. (b) Color contrast. (c) Blockwise. (d) Gaussian noise.
Figure 8: Robustness to various perturbations. Our proposed model performs better than the advanced methods under Blur, JPEG, and Comp degradations. (a) Blur. (b) JPEG compression. (c) Video compression. (d) Average.
Table 5. Robustness to various perturbations.
Methods AUC scores (%)↑
Origin CS CC Block Noise Blur JPEG Comp Avg
Xception [7] 99.8 99.3 98.6 99.7 53.8 60.2 74.2 62.1 78.3
Face X-ray [24] 99.8 97.6 88.5 99.1 49.8 63.8 88.6 55.2 77.5
LipForensics [15] 99.9 99.9 99.6 87.4 73.8 96.1 95.6 95.6 92.5
SBIs [13] 99.9 99.5 98.4 99.6 53.5 67.6 87.6 81.1 83.9
Ours 99.9 98.4 67.6 97.8 88.4 98.7 98.6 99.0 92.6
  • Note: The average AUC scores (%) of the detectors across five severity levels for seven perturbations. The best and second-best AUC scores (%) are highlighted using bold and italic formatting, respectively. “Avg” represents the average value of all severity levels for all perturbations.

In addition, our model’s performance declines under CC perturbation, which we attribute to our reliance on optical flow features, whose estimation is driven by brightness changes. Nonetheless, this degradation is not a significant real-world threat, as color-based attacks are typically too conspicuous for practical use. Importantly, our model shows resilience against Noise: it focuses on the movement patterns of the eyes and mouth regions, which are high-level semantic features, whereas noise mainly corrupts pixel-level, low-level features. As a result, the noise does not significantly interfere with the captured microexpression movements.
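To make the evaluation protocol concrete, the sketch below outlines how severity-averaged AUC scores of this kind can be computed: a perturbation is applied at severities 1 to 5, every video is scored by an already-trained detector, and the per-severity AUC values are averaged. This is a minimal sketch, not the paper’s evaluation code; the `score_video` interface and the degradation schedules are illustrative assumptions.

```python
# A minimal sketch of the severity-averaged AUC protocol (illustrative only).
# `score_video` is a hypothetical detector interface returning a fakeness score
# per video; the quality/kernel schedules below are assumptions.
import cv2
import numpy as np
from sklearn.metrics import roc_auc_score

def jpeg_compress(frame, severity):
    quality = [90, 70, 50, 30, 10][severity - 1]   # assumed JPEG quality schedule
    _, buf = cv2.imencode(".jpg", frame, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

def blur(frame, severity):
    k = [3, 5, 7, 9, 11][severity - 1]             # assumed odd kernel sizes
    return cv2.GaussianBlur(frame, (k, k), 0)

def severity_averaged_auc(videos, labels, score_video, perturb):
    """videos: list of frame lists (BGR uint8); labels: 0 = real, 1 = fake."""
    aucs = []
    for severity in range(1, 6):
        scores = [score_video([perturb(f, severity) for f in frames])
                  for frames in videos]
        aucs.append(roc_auc_score(labels, scores))
    return float(np.mean(aucs))   # per-perturbation average across the five severities
```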

We further evaluate the robustness of our approach against various levels of video compression by training each detector on the original dataset and then testing it on three compression levels of the Celeb-DF-v1 dataset. On Celeb-DF-v1, we compare our work with the best-performing advanced detectors from Table 5, LipForensics [15] and SBIs [13]. The results are summarized in Table 6 and demonstrate that our model is less susceptible to video compression. Specifically, our work outperforms the two advanced methods by an average of 4.5%, 5.3%, and 6.4% in AUC at the RAW, C23, and C40 compression levels, respectively. In addition, the performance decline from RAW to C23 and C40 is notably smaller for our method than for the advanced approaches, by an average of 0.8% and 2.0%, respectively.

Table 6. AUC scores of different compression levels on Celeb-DF-v1.
Methods             AUC scores (%)↑                                        Decline↓ (RAW→C23/RAW→C40)
                    Celeb-DF-v1 RAW   Celeb-DF-v1 C23   Celeb-DF-v1 C40
LipForensics [15]   70.3              64.2              61.9              6.1/8.4
SBIs [13]           77.0              70.1              66.9              6.9/10.1
Ours                78.1              72.4              70.8              5.7/7.3
  • Note: The best and second-best results are highlighted using bold and italic formatting, respectively. Our proposed model outperforms the other methods. The up arrow (↑) indicates that larger AUC scores are better; the down arrow (↓) indicates that a smaller decline is better.
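The decline and margin figures quoted above follow directly from the entries of Table 6. The short check below, with the AUC values copied from the table, is offered only as a convenience; the variable names are ours.

```python
# Recomputing the Table 6 decline and margin figures from the listed AUC scores.
auc = {
    "LipForensics": {"RAW": 70.3, "C23": 64.2, "C40": 61.9},
    "SBIs":         {"RAW": 77.0, "C23": 70.1, "C40": 66.9},
    "Ours":         {"RAW": 78.1, "C23": 72.4, "C40": 70.8},
}

# Decline from RAW to each compressed level (the 6.1/8.4, 6.9/10.1, 5.7/7.3 column).
for name, s in auc.items():
    print(name, f"{s['RAW'] - s['C23']:.1f}/{s['RAW'] - s['C40']:.1f}")

# Margin of our method over the mean of the two baselines at each level
# (about 4.45, 5.25, and 6.40 points, i.e., the 4.5%, 5.3%, and 6.4% quoted above).
for level in ("RAW", "C23", "C40"):
    baseline = (auc["LipForensics"][level] + auc["SBIs"][level]) / 2
    print(level, f"{auc['Ours'][level] - baseline:.2f}")
```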

4.8. Robustness to Adversarial Attacks

On social media and online platforms, DeepFakes are widely used to spread misinformation. To mitigate the security threats posed by DeepFakes to politics, economics, and national stability, effective DeepFake detectors have been developed to identify forged content. However, adversaries often employ adversarial attacks to inject subtle perturbations into DeepFake images or videos, enabling them to bypass DNN-based detectors without noticeable loss of visual quality. If DeepFake detectors lack robustness against adversarial attacks, DeepFake content can easily evade detection, leading to the further spread of misinformation. To simulate real-world scenarios where adversaries deploy such attacks, we adopt the widely used adversarial attack methods I-FGSM [11] and PGD [12] to generate adversarial samples.

We further test the robustness of this work against typical adversarial attacks. Experiments use the DeepFakes dataset of the FaceForensics++ benchmark with two typical attack models, I-FGSM [11] and PGD [12]. We set the maximum perturbation ε = 8/255 with 40 attack iterations, and the imperceptible perturbations are added to the combined inputs of our network. We evaluate the robustness of this work against advanced DeepFake detectors in a white-box setting, selecting the frame-level and best-performing methods from Table 5, namely Xception [7], Face X-ray [24], and SBIs [13].
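For reference, the sketch below shows a generic white-box iterative gradient-sign attack with ε = 8/255 and 40 iterations, matching the setting above; enabling the random start yields PGD. The step size α, the cross-entropy objective, and the single-tensor model interface are illustrative assumptions rather than the exact attack configuration used in our experiments, where the perturbation is added to the combined inputs.

```python
# Generic white-box I-FGSM / PGD sketch (PyTorch). Step size and model interface
# are assumptions; eps = 8/255 and 40 iterations match the setting in the text.
import torch
import torch.nn.functional as F

def iterative_attack(model, x, y, eps=8 / 255, steps=40, alpha=1 / 255, random_start=False):
    """x: input batch in [0, 1]; y: ground-truth labels. random_start=True gives PGD."""
    x_adv = x.clone().detach()
    if random_start:  # PGD: start from a random point inside the eps-ball
        x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                     # gradient-sign step
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)   # project onto eps-ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```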

The results of the adversarial attacks are shown in Table 7. In the white-box setting, our work is more robust to these imperceptible attacks than state-of-the-art detectors. For instance, under the I-FGSM and PGD attacks, our work achieves an ASR that is 46.64% and 56.47% lower, respectively, than that of the advanced detectors. This is because adversarial attacks typically alter low-level features, and the perturbations they apply are generally unable to disrupt the continuity across consecutive frames. In contrast, our detector captures the motion patterns of the eyes and mouth regions, which are high-level semantic features that remain continuous between consecutive frames, enabling it to effectively resist adversarial noise.

Table 7. Robustness to adversarial attacks.
Methods            Origin Acc↑   Attack Acc↑ (I-FGSM/PGD)   ASR↓ (I-FGSM/PGD)
Xception [7]       97.32         0.62/0.92                  99.36/99.05
Face X-ray [24]    99.10         0.78/0.16                  99.21/99.84
SBIs [13]          99.99         0.24/0.19                  99.76/99.81
Ours               97.82         46.17/55.66                52.80/43.10
  • Note: The best and second-best results are highlighted using bold and italic formatting, respectively. Under adversarial attacks, our work achieves a lower ASR (%) than other advanced detectors trained and tested on the DeepFakes dataset of FaceForensics++. The down arrow (↓) indicates that a smaller ASR is better.
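As a reading aid, the ASR values in Table 7 are consistent with defining the attack success rate as the fraction of originally correct predictions flipped by the attack, that is, ASR = (origin accuracy − attack accuracy)/origin accuracy × 100. The paper does not state this formula explicitly, so the snippet below merely verifies that this interpretation reproduces the table and the reductions quoted above.

```python
# Reproducing the Table 7 ASR values under the assumed (inferred) definition
# ASR = (origin_acc - attack_acc) / origin_acc * 100.
origin = {"Xception": 97.32, "Face X-ray": 99.10, "SBIs": 99.99, "Ours": 97.82}
attack_acc = {
    "Xception":   {"I-FGSM": 0.62,  "PGD": 0.92},
    "Face X-ray": {"I-FGSM": 0.78,  "PGD": 0.16},
    "SBIs":       {"I-FGSM": 0.24,  "PGD": 0.19},
    "Ours":       {"I-FGSM": 46.17, "PGD": 55.66},
}
asr = {m: {a: (origin[m] - acc) / origin[m] * 100 for a, acc in accs.items()}
       for m, accs in attack_acc.items()}

for att in ("I-FGSM", "PGD"):
    baselines = [asr[m][att] for m in ("Xception", "Face X-ray", "SBIs")]
    reduction = sum(baselines) / len(baselines) - asr["Ours"][att]
    print(att, f"ours ASR {asr['Ours'][att]:.2f}, reduction vs. baseline mean {reduction:.2f}")
# -> ours ASR 52.80 / 43.10; reductions 46.64 (I-FGSM) and 56.47 (PGD)
```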

4.9. Ablation Study

To investigate the effectiveness of exploring inconsistent facial expression movements, we perform ablation experiments on our proposed method. Current DeepFake detectors typically employ the entire face for training. To capture the distinct movement patterns of the eyes and mouth regions, we partition the face into two parts and feed them into a dual-branch network. To assess the impact of this segmentation, we train the network with the whole face, with only the eyes region, and with only the mouth region as inputs.
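As an illustration of the dual-branch design, the sketch below shows one possible organization: two backbones process the eyes and mouth inputs separately, and their features are concatenated for the real/fake decision. The ResNet-18 backbones, feature dimensions, and concatenation-based fusion are assumptions made for this example and are not the exact architecture or MSTR-map preprocessing used in our method.

```python
# Illustrative dual-branch detector (PyTorch). Backbones and fusion are assumptions;
# the actual branches in the paper operate on MSTR maps of the eyes and mouth regions.
import torch
import torch.nn as nn
from torchvision import models

class DualBranchDetector(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.eyes_branch = models.resnet18(weights=None)    # branch for the eyes region
        self.eyes_branch.fc = nn.Identity()
        self.mouth_branch = models.resnet18(weights=None)   # branch for the mouth region
        self.mouth_branch.fc = nn.Identity()
        self.classifier = nn.Linear(512 * 2, num_classes)   # fuse the two 512-d features

    def forward(self, eyes: torch.Tensor, mouth: torch.Tensor) -> torch.Tensor:
        # eyes, mouth: (B, 3, 224, 224) region inputs (cf. the input size in Table 9)
        feat = torch.cat([self.eyes_branch(eyes), self.mouth_branch(mouth)], dim=1)
        return self.classifier(feat)

# Example: logits = DualBranchDetector()(torch.rand(4, 3, 224, 224), torch.rand(4, 3, 224, 224))
```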

Evaluations of facial segmentation. The results are shown in Table 8. Training on the whole face still yields good performance, but lower than that of the segmented training scheme. For instance, the AUC decreases from 99.3% with our method to 90.1% on the Celeb-DF-v2 dataset, and from 94.6% to 89.7% on the DFDC-Preview dataset, indicating that the eyes and mouth exhibit different movement patterns and that training the two regions separately improves the effectiveness of our detector.

Table 8. Ablation study of different training data.
Methods             Celeb-DF-v2               DFDC-Preview
                    Acc (%)↑    AUC (%)↑      Acc (%)↑    AUC (%)↑
w/o Segmentation    90.8        90.1          89.7        89.7
w/o Eyes region     74.1        77.9          64.2        63.3
w/o Mouth region    80.8        80.2          85.8        83.5
Ours                96.6        99.3          94.7        95.3
  • Note: The best results are highlighted using bold formatting. Accuracy and AUC scores of all the models, which are trained and tested on Celeb-DF-v2 and DFDC-Preview. The up arrow (↑) indicates that larger Acc and AUC are better.

Evaluations of a single facial region. We also train the detector with only a single region; the results are shown in Table 8. The performance decreases significantly: when training only on the eyes region, the AUC drops by 19.1% on the Celeb-DF-v2 dataset and by 11.1% on the DFDC-Preview dataset; when training only on the mouth region, the AUC drops by 21.4% on Celeb-DF-v2 and by 31.3% on DFDC-Preview. This indicates that the expression clues obtained from a single region are limited. Furthermore, the results show that the eyes region contains more synthetic cues than the mouth region.

4.10. Computational Costs

The computational costs of our proposed dual-branch network are presented in Table 9. The network has 24.99 M parameters and requires 1.89 GFLOPs (floating point operations) per forward pass on a 3 × 224 × 224 input. On an NVIDIA Tesla V100 GPU, this lightweight design makes the model well suited for the real-time detection of DeepFake videos.

Table 9. Computational costs of our proposed model.
Train process
  Input size: 3 × 224 × 224            FLOPs: 1.89 GFLOPs
  Total parameters: 24.99 M            Trainable parameters: 24.99 M
Inference process
  Average inference time: 15 ms        Average inference performance: 66.67 fps
  • Note: FLOPs refer to floating point operations.

For comparison, the average inference time (the time for one forward pass) of PhaseForensics [48] is 25 ms, and that of LipForensics [15] is 22.2 ms. In contrast, our model delivers an inference time of 15 ms, corresponding to an average throughput of 66.67 frames per second (fps), making it well suited for real-time inference on standard 30 fps video streams.
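The throughput figure follows directly from the latency (1000 ms / 15 ms ≈ 66.67 fps). A typical way to measure such an average latency on a GPU is sketched below; the warm-up and iteration counts and the tuple-of-tensors input are illustrative assumptions rather than our exact measurement protocol.

```python
# Hypothetical GPU latency measurement sketch (PyTorch); counts are assumptions.
import time
import torch

@torch.no_grad()
def average_latency_ms(model, inputs, warmup=10, iters=100, device="cuda"):
    model = model.to(device).eval()
    inputs = tuple(t.to(device) for t in inputs)
    for _ in range(warmup):          # warm-up passes (CUDA kernel compilation, caches)
        model(*inputs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(*inputs)
    torch.cuda.synchronize()         # wait for all queued GPU work before stopping the clock
    ms = (time.perf_counter() - start) / iters * 1000.0
    return ms, 1000.0 / ms           # average latency (ms) and throughput (fps)

# e.g., a 15 ms average latency corresponds to 1000 / 15 ≈ 66.67 fps.
```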

5. Discussion

Existing DeepFake generation methods typically focus on the visual quality of a single frame, emphasizing detail optimization in local regions such as the eyes and mouth. However, they lack temporal modeling across consecutive video frames, neglecting the global coordination of facial muscle movements, which results in inconsistent facial expressions. Moreover, current DeepFake detectors suffer from limited generalizability and robustness. In contrast, our proposed method leverages the observation that DeepFakes may disrupt smooth and continuous facial muscle movements. We design a dual-branch network to explore expression cues from the eyes and mouth regions. These high-level semantic features enhance the model’s generalization capability for unknown DeepFakes data and demonstrate strong robustness against various image degradations and adversarial attacks. Evaluation and comparative experiments validate the significant contribution of inconsistent expression cues to DeepFake detection. Ablation studies consistently confirm that the effectiveness of our method lies in combining the forgery clues from the eyes and mouth regions.

More importantly, our proposed method has good cross-dataset generalization capabilities, and with further optimization, it could scale to larger datasets or different types of forgeries, such as multiple forgery regions in a video, as well as DeepFake detection across various devices and platforms. Our method shows strong robustness against various levels of postprocessing and adversarial attacks, making it adaptable to complex real-world scenarios, such as low-resolution videos or other quality degradation challenges. Finally, our method is primarily focused on detecting forgery facial videos and can be integrated with various multimedia verification systems to ensure the privacy and security of individuals’ information, particularly in applications such as facial recognition and emotion analysis.

6. Conclusions

In this paper, we introduce a novel DeepFake detector that relies on inconsistent facial expression cues. Through extensive experiments, we have demonstrated that synthetic faces generated by GANs fail to maintain spatial and temporal consistency in facial expressions. Building on this analysis, we have developed a robust and general DeepFake detector that exploits the irregular patterns of facial movements. We extract these patterns by utilizing MSTR maps of the eyes and mouth regions, and we have designed a dual-branch framework that further enhances the efficiency of our approach. Our experiments have validated the high effectiveness of our method, independent of the generator used and the quality of the input video. We believe that a significant contribution of this study lies in our observation of inconsistent expression clues between natural and synthetic faces, particularly in the eyes and mouth regions. We have presented a systematic investigation and explanation of expression motion patterns in forged faces. We hope that our method will make a valuable contribution to combating advanced DeepFake techniques.

We acknowledge that while the proposed method demonstrates outstanding performance in detecting DeepFake videos in the current experiments, it may face certain challenges when dealing with highly sophisticated forgery techniques, such as talking-face videos generated by diffusion models. In future research, we plan to explore more diverse feature extraction methods and incorporate cross-modal information, such as audio and text, to enhance robustness and detection performance against advanced DeepFakes.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This work was supported by the National Natural Science Foundation of China under Grant no. 62372334 and the National Key Research and Development Program of China under Grant no. 2023YFB3106900.

Data Availability Statement

The datasets are publicly available.
