Volume 2025, Issue 1, 7945646
Research Article
Open Access

Exposing the Forgery Clues of DeepFakes via Exploring the Inconsistent Expression Cues

Jiatong Liu, Lina Wang (Corresponding Author), Run Wang, Jianpeng Ke, Xi Ye, and Yadi Wu

Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China

First published: 13 January 2025
Citations: 1
Academic Editor: Yu-an Tan

Abstract

The pervasive spread of DeepFakes poses a profound threat to individual privacy and the stability of society. Mistaking synthetic videos of a celebrity for genuine footage and passing off impersonated forgery videos as authentic are just two of the consequences DeepFakes can produce. We observe that current detectors which blindly deploy deep learning techniques are not effective in capturing the subtle clues of forgery when generative models produce remarkably realistic faces. Inspired by the fact that synthetic operations inevitably modify the eyes and mouth regions to match the target face with the identity or expression of the source face, we conjecture that the continuity of the facial movement patterns representing expressions in genuine faces will be disrupted or completely broken in synthetic faces, making it a potentially strong indicator for DeepFake detection. To verify this conjecture, we utilize a dual-branch network to capture the inconsistent patterns of facial movements within the eyes and mouth regions separately. Extensive experiments on the popular FaceForensics++, Celeb-DF-v1, Celeb-DF-v2, and DFDC-Preview datasets demonstrate not only the effectiveness but also the robustness of our method, which outperforms state-of-the-art baselines. Moreover, this work exhibits greater robustness against adversarial attacks, with attack success rates (ASRs) of 54.8% under the I-FGSM attack and 43.1% under the PGD attack on the DeepFakes subset of FaceForensics++, respectively.

1. Introduction

In recent years, remarkable advancements in the computer vision domain, notably in generative adversarial networks (GANs) [1, 2], have made it possible to generate realistic images and videos, rendering them exceedingly difficult for human eyes to distinguish from genuine faces. In particular, DeepFakes can manipulate faces in diverse ways, such as identity swap, facial expression manipulation, or attribute editing, to generate photorealistic results that accurately mimic celebrities, posing a distinct threat to the authenticity of information in the digital landscape [3].

In the wake of DeepFakes’ proliferation, the wide availability of generative models has brought significant benefits to film production and entertainment. However, this accessibility has also led to a flood of misinformation, posing threats to both individual privacy and societal stability. To combat the issues posed by DeepFakes, researchers are actively proposing various countermeasures aimed at determining the authenticity of images and videos. These efforts primarily focus on passively capturing subtle differences between real and synthetic faces, e.g., investigating spatial clues at the pixel level [4], examining artifacts in the invisible frequency domain [5], and identifying differences in biological signals [6]. Unfortunately, with the rapid advancement of generative models, the clues of forgery are gradually being eliminated, rendering the majority of detectors ineffective.

When generating synthetic faces using various manipulations, it becomes evident that DeepFake techniques can disrupt the smooth and continuous muscular movements, as depicted in Figure 1. These irregularities serve as strong indicators for detectors to distinguish between natural and manipulated faces. In this work, we propose a novel idea by exploring the high-level inconsistent expressions to expose the subtle differences between genuine and fabricated faces, thereby offering valuable insights for DeepFake detection. We envision that such semantically high-level inconsistent expressions could provide enhanced robustness against various perturbations and adversarial attacks.

Figure 1: Illustration of our design motivation. An example of real faces and corresponding fake faces synthesized by various manipulations (e.g., DeepFakes, Face2Face, FaceSwap, and NeuralTextures in the FaceForensics++ dataset). The forgery operations can easily diminish the continuity of the optical flow representing the expression movements. The original facial images used in this figure are sourced from the publicly available DeepFake dataset FaceForensics++ [7].

Several researchers have delved into expression-based DeepFake detectors, but prior studies emphasize the entire face and ignore the distinctive movement patterns present in different facial regions. In addition, they tend to overlook the importance of temporal consistency across frames, as they primarily consider isolated images. To effectively address the challenge of exploring inconsistent expressions in synthetic faces, it is imperative to develop a methodology that encompasses both spatial continuity and temporal coherence. Such an approach enables a more comprehensive understanding of the intricate dynamics within facial expressions and paves the way for improved DeepFake detection.

Our aim is to develop a DeepFake detector that focuses on the inconsistent motion patterns of the eyes and mouth regions by exploring expression clues. Traditional methods for extracting expression features often employ optical flow [8], which is commonly used in current expression recognition approaches [9]. However, such techniques may overlook the crucial texture information present in faces, which is highly relevant to DeepFake detection. Furthermore, existing expression-based detectors ignore the different patterns of facial movements in various regions. To overcome these limitations, we integrate spatial continuity and temporal coherence to accentuate the representations of expression motion patterns. In addition, our approach delves into the inconsistent patterns of facial movements to enhance the effectiveness of DeepFake detection.

While current detectors offer effective strategies, most of them are unfortunately not practical for real-world scenarios. These detectors exhibit significant performance drops when encountering unseen manipulations, which is mainly attributed to overfitting of the training data. Moreover, they prove vulnerable to various perturbations (e.g., noise, compression, and blurring), because they heavily rely on specific image cues that are easily disrupted by simple image transformations. Furthermore, they are susceptible to adversarial attacks [10] (e.g., I-FGSM [11] and PGD [12]), where crafted perturbations are introduced to deceive the classifiers. In summary, existing DeepFake detectors suffer from the following two key challenges:
  • 1.

    Poor generalization to unseen manipulations: Many existing DeepFake detectors primarily aim to enhance performance on datasets with known synthetic techniques. However, when these models are transferred to datasets of unseen manipulations, they often experience a significant performance decline. Because different datasets are generated with specific algorithms, previously trained detectors become overly tailored to the limited training data, which limits their generalization. Most previous detectors focus on regions of interest tied to particular data, resulting in well-trained models that fail to generalize across datasets. Some detectors have focused on improving generalization, e.g., by searching for common artifacts [4] or concentrating on blending boundaries [13]. However, such common artifacts are gradually removed by advanced generative models, and blending boundaries are vulnerable to low-level image transformation operations.

  • 2.

    Not robust to various perturbations. In real-world scenarios, DeepFakes are susceptible to various degradations. These perturbations can be categorized into two main types: common image transformations (e.g., noise, compression, and blur) [4] and imperceptible adversarial attacks (e.g., PGD and I-FGSM) [14]. For simple image transformations, most current detectors heavily rely on specific image textures, pixel distributions, or structural information to distinguish between authentic and manipulated faces. However, these clues can be easily disrupted or eliminated after undergoing common image transformations. In the case of adversarial attacks, existing studies often employ deep neural networks (DNNs) to design detectors, and attackers can exploit this vulnerability by introducing specific perturbations to a fake video, which can mislead the detector into classifying it as real.

To tackle the two fundamental challenges faced by DeepFake detectors, we first craft the movement spatial–temporal representations (MSTRs) of sequential frames to emphasize the spatial continuity and temporal coherence of movement patterns, and then we implement a dual-branch architecture to learn distinct movement patterns of eyes and mouth regions separately to explore the high-level expression clues, which are not susceptible to various perturbations and adversarial attacks.

In this work, we find that current generative models manipulate faces with a focus on frame-level quality, which leads to poor temporal modeling, and they do not explicitly capture continuous expressions; thus, we expose DeepFake videos by exploring the inconsistent expression cues. Specifically, our method outperforms advanced DeepFake detection methods in in-dataset evaluation, e.g., it achieves 99.9% AUC on the FaceForensics++ (C23) dataset, 99.3% AUC on the Celeb-DF-v2 dataset, and 95.3% AUC on the DFDC-Preview dataset. It is also more robust to various perturbations (e.g., Gaussian noise (Noise), Gaussian blur (Blur), JPEG compression (JPEG), and video compression (Comp)) than the best-performing LipForensics [15] method by 5.9% AUC on average, and more robust to adversarial attacks than advanced detectors, with attack success rates (ASRs) that are 46.6% and 56.5% lower on average under I-FGSM and PGD attacks, respectively. Our main contributions are summarized as follows:
  • 1.

    We observe that existing DeepFakes disrupt the continuity of facial muscle movements. To capture these forgery cues, we propose a DeepFake detection method that includes extracting spatiotemporal representations of facial movements, and a dual-branch network is designed to explore inconsistent expression cues in the eyes and mouth regions, enabling effective detection of DeepFake videos.

  • 2.

    To enhance robustness against various degradations and adversarial attacks, we design spatiotemporal representations of facial movements. Image degradations and adversarial attacks typically destroy texture details within frames. However, motion patterns extracted across frames, as high-level semantic features, remain largely unaffected and strengthen the robustness of our method.

  • 3.

    To improve generalization to unknown DeepFakes, the dual-branch network is designed to independently learn expression cues from the eyes and mouth regions. By extracting motion patterns from different facial areas, the method reduces reliance on specific datasets or forgery techniques, enhancing the generalization capability of our method.

  • 4.

    Experimental results on FaceForensics++, Celeb-DF-v1, Celeb-DF-v2, and DFDC-Preview demonstrate that our method outperforms state-of-the-art approaches. Furthermore, we conduct experiments under various degradations and two typical adversarial attacks, achieving an average AUC of 92.6% and an ASR of 48.0%. The results indicate that our model exhibits superior robustness compared to other advanced detectors.

The rest of this manuscript is structured as follows. Section 2 provides a review of related works on DeepFake generation, DeepFake detection, and expression-based DeepFake detection. Section 3 defines the materials and methods of this study. Section 4 offers experimental details and comprehensive results. Section 5 discusses the limitations of current generative models in replicating human facial expressions. Lastly, Section 6 concludes the paper, with limitations and future research directions.

2. Related Works

2.1. DeepFake Generation

GANs [1, 2] have witnessed significant advancements in image generation and manipulation. DeepFakes exploit the power of GANs to create realistic multimedia content, posing potential risks to personal privacy and societal stability. Generally, there are four primary types of DeepFakes: entire face synthesis, attribute manipulation, identity swap, and face reenactment.

Entire face synthesis generates faces that do not exist in the real world, for example, PGGAN [16] and StyleGAN [17]. Attribute manipulation can modify simple facial attributes (e.g., hair color and smile) and retouch complex attributes (e.g., gender and age), for example, StarGAN [18] and STGAN [19]. Identity swap replaces the face between the target and source images, with CycleGAN [20] and FaceSwap [21] as two classical works. Similar to identity swap, face reenactment swaps the facial expression between the target and source images by deploying popular methods such as Face2Face [22].

Nevertheless, existing DeepFake techniques do not explicitly preserve the natural expression movements of the eyes and mouth because they generate faces with a focus on frame-level quality, failing to capture the continuity of expression movements or to model the temporal information across sequential frames in a video. Furthermore, current generative models encounter challenges in learning sufficiently continuous and natural expression patterns from limited data samples, inspiring us to explore indicators of inconsistent expressions to classify real and synthesized faces.

2.2. DeepFake Detection

In recent years, numerous studies have demonstrated that even the minor differences between genuine and forged faces can be captured to distinguish authenticity. Therefore, researchers are actively engaged in devising various countermeasures to detect DeepFakes, which fall into two categories, that is, frame-level and video-level approaches.

Frame-level approaches focus on the authenticity of single frames. Afchar et al. propose MesoNet, designed for analyzing the faces generated by DeepFake and Face2Face methods at a mesoscopic level [23]. Xception is a conventional CNN trained on ImageNet and adapted for DeepFake detection by replacing the fully connected layer with a pair of outputs [7]. Li et al. provide a general method that exploits the observation that most face forgery pipelines rely on a blending step, and it does not depend on any knowledge of the artifacts linked to a particular facial manipulation technique [24]. Zhao et al. propose a framework with an attention module that combines textural and semantic features to detect forgery images [25]. Wang, Sun, and Tang take the original images and their corresponding quality-degraded images as the inputs of Siamese networks [26]. Cao et al. use a reconstruction network to learn the common features of real images and utilize the reconstruction differences to classify real and fake images [27]. Shiohara and Yamasaki suggest a method that reproduces common forgery clues by blending pseudosource and target faces derived from the same pristine faces [13]. Ba et al. fuse broader semantic forgery features to improve the generalizability of DeepFake detection [28].

Video-level approaches distinguish the forgery of all video frames and add temporal information during forgery detection. Güera and Delp utilize a convolutional neural network (CNN) to extract features at the frame level and train a recurrent neural network (RNN) for DeepFake classification [29]. Haliassos et al. pretrain a spatial–temporal network to acquire high-level semantic representations associated with natural lipreading, which can be designed for general DeepFake detection [15]. Gu et al. present a network to learn the inconsistencies of local and global features to capture more general forgery clues [30].

In fact, most current methods are impractical in real-world scenarios, as they exhibit significant performance drops when encountering unseen manipulations. This is primarily due to their focus on distinct regions with unique data characteristics, and they suffer from various perturbations because they rely on low-level image features that are easily destroyed by common image transformations. Some existing works explore generalization by searching for artifacts common to multiple synthetic methods [13], but such artifacts are easily removed by advanced generative models, and the blending boundary is susceptible to low-level image transformation operations. To address these pivotal challenges, we propose a movement representation that captures high-level semantic expression features.

2.3. Expression-Based DeepFake Detection

Since expression recognition techniques have developed considerably [9], some researchers use expression clues for DeepFake detection, exploiting facial emotions learned from expression recognition models to classify real and fake faces. In addition, some studies combine multimodal features, for example, capturing the inconsistencies between the visual and voice modalities to detect DeepFakes [31, 32].

Although current expression-based DeepFake detection methods have shown some progress, they still suffer from fundamental shortcomings. These approaches typically utilize the entire face during training, overlooking the fact that different regions of the face exhibit distinct movement patterns. Furthermore, they often fail to capture the crucial temporal coherence across frames, as they primarily focus on isolated images. Moreover, existing methods predominantly rely on in-dataset evaluation, neglecting the vital challenges related to general capabilities and robustness that current detectors encounter. To address these limitations, our work delves into exploring the diverse expression patterns exhibited by different facial regions. We recognize that fake faces may still display expressions but their movement patterns are disrupted by forgery techniques, leading to discernible deviations from genuine faces.

To address the aforementioned issues, we build on the observation from previous studies that most forgery videos alter the eyes region [33] or mouth region [15] in order to align the target face with the identity or expression of the source face. In this paper, our objective is to expose inconsistent clues by exploring facial expressions. While generative models can indeed synthesize photorealistic faces, the natural movement patterns of facial expressions are driven by the coordinated actions of the associated muscles, making them difficult to replicate. This is primarily because current generative models produce videos with a focus on frame-level quality, and they do not explicitly capture the temporal relationship and the continuity of subtle movements across multiple frames. To synthesize continuous motion, generative models often resort to interpolation techniques, which introduce jitter and discontinuity into fake videos. This presents a significant challenge when attempting to create convincing and coherent expressions that flow naturally.

3. Materials and Methods

We present a novel approach aimed at effectively exposing DeepFakes by exploring the inconsistent facial expression patterns. In Section 3.1, we delve into the insights behind this work, followed by an overview of our methodology in Section 3.2. We then proceed to elaborate on the techniques employed to design spatial–temporal representations of facial movements in Section 3.3, as well as the process of learning expression inconsistency in Section 3.4. In addition, we outline the loss functions utilized in Section 3.5. The pipeline of our method is illustrated in Figure 2(a).

Figure 2: Overview. We extract expression features from the eyes and mouth regions of natural and forgery faces. We compute optical flow from three sequential frames to capture movement consistencies, combine the optical flow with the interframe as a new input, and train a dual-branch network to learn the movement patterns of the eyes and mouth separately. Then, we aggregate the authenticity probabilities of the two branches to distinguish authenticity. The original facial images used in this figure are sourced from the publicly available DeepFake dataset FaceForensics++ [7]. (a) Pipeline of our method. (b) Dual-branch network.

3.1. Insight

In practical scenarios, existing DeepFake detectors suffer from challenges in generalization and robustness. To tackle these key issues, our goal is to identify forgery clues that are independent of specific manipulations and remain resistant to common perturbations. Our work is inspired by the observation that current DeepFake techniques struggle to replicate continuous facial expressions due to their reliance on single-frame generation, and generative models encounter difficulties in capturing subtle and continuous changes in facial expressions. To capture these inconsistent expression clues, we investigate the role of optical flow in characterizing facial expressions, as it is widely used for recognizing different categories of emotions and plays a crucial role in analyzing expression content. In addition, individual action unit labels describe either eye or mouth movements, but not both, and expression categories are determined by combinations of action unit labels. Consequently, the movement patterns of the eyes and mouth exhibit differences [34].

Forgery operations have an impact on the continuous and natural movements present in authentic faces. This is because current synthetic methods primarily focus on capturing and reproducing facial expressions in single frames, lacking the ability to accurately capture complex muscular movements and model temporal information. Therefore, the presence of inconsistent expression cues can be leveraged to reveal subtle differences between real and forged videos. As depicted in Figure 1, manipulation operations can easily disrupt and diminish the continuity of optical flow, which represents the speed and direction of facial movements. Building upon this understanding, we propose a novel approach that explores expression clues for DeepFake detection. In Figure 2(a), we illustrate the pipeline of our method, which constructs MSTRs to highlight the spatial continuity and temporal coherence of sequential frames, within a dual-branch network architecture that learns the different expression patterns of the eyes and mouth regions separately.

3.2. Overview

Given a video that contains T frames, the goal is to judge whether this video is real or fake by exploring expression features. For this purpose, we first develop the MSTR in Section 3.3 for the input video, which highlights the spatial continuity and temporal coherence and outputs a combined image of optical flow and the interframe from three sequential frames, that is, x ∈ ℝ^(T×R×C), where T is the number of three sequential frames, R denotes the regions of the face we focus on, and C is the number of channels. In this stage, we formulate x with a single color channel for computing optical flow. Intuitively, x represents the temporal variation of region movements (i.e., the eyes and mouth regions in our work) in the faces, which highlights the expression patterns.

We design a dual-branch neural network that takes the MSTR feature map as input and predicts whether the input faces are real. However, different regions of a face have different motion patterns, e.g., eyebrows raise or droop, eyelids raise or tighten, and lips stretch or tighten, and these motions are related to particular muscular movements. If we simply take the whole face as the input, we may corrupt the distinct motion patterns of different face regions. To address this problem, we choose the most significant regions of the eyes and mouth as the input feature maps (the process is shown in Figure 2(a)).

Our objective is to capture the diverse motion patterns exhibited by the eyes and mouth areas. To achieve this, we present a dual-branch network that effectively learns the motion of the eyes and mouth separately through the design of the base layer and branch layer.

3.3. MSTR

Utilizing expression features for DeepFake detection could be approached with existing expression representations designed for expression recognition. However, they may ignore the texture information present in faces, which is highly relevant to DeepFake detection. To tackle this problem, we employ the optical flow technique to capture continuous expression patterns across frames and combine them with the interframe as a new input. As illustrated in Figure 3, we apply the optical flow diagram to three consecutive frames of both real and synthetic faces. We observe that real facial expressions involve natural movements caused by the contraction of relevant muscles. For instance, in the first row and second column, we can see uniform and consistent movements in the cheek region of real faces. In contrast, the movements in synthetic faces appear discontinuous and irregular. In addition, in the fourth column, we observe that manipulation operations diminish the movements in the eyes region. These observations suggest that expression motion patterns can serve as strong indicators of authenticity.

Figure 3: Analysis of optical flow. The first row shows the natural movements of real faces, while the other rows show the inconsistencies and destroyed movements of synthetic faces. The original facial images used in this figure are sourced from the publicly available DeepFake dataset FaceForensics++ [7].
The optical flow [8] measures subtle facial muscle movements by tracking pixel intensity changes between two adjacent facial frames over time. The movement of a facial pixel at location (x, y, t) with intensity I(x, y, t) is represented by Δx, Δy, and Δt. Based on the brightness consistency constraint, we have
I(x, y, t) = I(x + Δx, y + Δy, t + Δt). (1)
Assuming minimal positional changes between adjacent facial frames, the image constraint at I(x, y, t) can be expressed using a Taylor series as follows:
I(x + Δx, y + Δy, t + Δt) = I(x, y, t) + (∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt + τ, (2)
where τ represents a higher-order infinitesimal. From these formulas, we derive
(∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt = 0, (3)
which leads to the following formula:
Ixu + Iyv + It = 0, (4)
where u and v represent the x and y components of the optical flow of I(x, y, t), respectively.
To solve for u and v, we apply the more robust TV-L1 optical flow model [35] between three facial frames to obtain two optical flow images. The TV-L1 model improves the optical flow constraint by introducing a variable w for illumination changes. The TV-L1 energy can be described as
ETVL1 = ∫ (|∇u| + |∇v| + |∇w| + λ|ρ(u, v, w)|) dx, (5)
where ETVL1 is the energy function, λ weights the data term against the smoothness terms, ρ(u, v, w) = Ix(u − u0) + Iy(v − v0) + It + βw, and the parameter β serves as a weighting factor for the term related to light changes.
Next, we combine the two optical flow images and the interframe into a new input image x to better correlate the spatial and temporal representations. Then, we divide the new input x into the eyes region xeyes as
xeyes = x[0 : h/2] (6)
and the mouth region xmouth as
xmouth = x[h/2 : h], (7)
where h is the height of the combined image x.
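To make these steps concrete, the following minimal sketch builds one MSTR from three aligned face crops using OpenCV's TV-L1 implementation. The channel layout (one magnitude channel per flow field plus the grayscale interframe) and the function name build_mstr are illustrative assumptions, not the authors' exact implementation.

```python
import cv2  # requires opencv-contrib-python for the cv2.optflow module
import numpy as np

def build_mstr(frame_prev, frame_mid, frame_next):
    """Combine two TV-L1 optical flow fields with the middle (inter-) frame,
    then split the stack into eyes (upper half) and mouth (lower half) regions."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in (frame_prev, frame_mid, frame_next)]

    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()   # robust TV-L1 solver [35]
    flow_a = tvl1.calc(gray[0], gray[1], None)        # H x W x 2 (u, v)
    flow_b = tvl1.calc(gray[1], gray[2], None)

    # One channel per flow field (here: magnitude) keeps the combined input 3-channel.
    mag_a = np.linalg.norm(flow_a, axis=2)
    mag_b = np.linalg.norm(flow_b, axis=2)
    mstr = np.stack([mag_a, gray[1].astype(np.float32), mag_b], axis=0)  # 3 x H x W

    h = mstr.shape[1]
    x_eyes, x_mouth = mstr[:, : h // 2, :], mstr[:, h // 2 :, :]  # cf. Eqs. (6)-(7)
    return x_eyes, x_mouth
```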

Figure 2(a) shows examples of real and forged faces and their MSTR images. We observe that the inconsistencies are easily exposed in our MSTR images, which provides a more powerful indicator for distinguishing authenticity. In the cross-dataset evaluation stage, to help the model learn common forgery cues, we deploy a landmark augmentation method [13] on the fake interframe, as shown in Figure 4.

Figure 4: Illustration of landmark augmentation. The fake interframe is replaced by a landmark-augmented frame. The original facial images used in this figure are sourced from the publicly available DeepFake dataset FaceForensics++ [7].

3.4. Expression Inconsistency Learning

In this section, we describe the dual-branch network, which serves as the foundation of our DeepFake detector, leveraging both the MSTR and the distinct movement patterns of the eyes and mouth regions. Following the works [36, 37], the network structure is shown in Figure 2(b). In our paper, we redesign the base layer and branch layer to capture distinct features from the eyes and mouth separately. For specific details regarding the backbone architecture, refer to Table 1.

Table 1. Details of backbone architecture.
Usage Layers Operator Input size Output size
Base layer Layer 1 Conv2d 7 × 7 3 × 112 × 224 64 × 56 × 112
MaxPool2d 3 × 3 64 × 56 × 112 64 × 28 × 56
Layer 2 Res-block 3 × 3 64 × 28 × 56 64 × 28 × 56
Res-block 3 × 3 64 × 28 × 56 64 × 28 × 56
Layer 3 Res-block 3 × 3 64 × 28 × 56 128 × 14 × 28
Res-block 3 × 3 128 × 14 × 28 128 × 14 × 28
  
Branch layer Layer 4 Res-block 3 × 3 128 × 14 × 28 256 × 7 × 14
Res-block 3 × 3 256 × 7 × 14 256 × 7 × 14
Layer 5 Res-block 3 × 3 256 × 7 × 14 512 × 4 × 7
Res-block 3 × 3 512 × 4 × 7 512 × 4 × 7
Layer 6 AdaptiveAvgPool 512 × 4 × 7 512 × 1 × 1
Flatten 512 × 1 × 1 512
  • Note: This table shows the size of kernels, the resolution of the input feature map, and the input channels of each layer.

3.4.1. The Base Layer

These layers encode the input images xeyes and xmouth (x ∈ ℝ^(3×112×224)) into features of the same form. The specifics of the base layer are as follows:
  • 1.

    The convolutional layer with a kernel size of 7 × 7 is applied to extract low-level features (e.g., edges and textures), and after that, the input eyes or mouth images are transformed into a 64 × 56 × 112 feature map. After passing through the BatchNorm2d and ReLU layers, we obtain

    F1 = ReLU(BN(Wconv ∗ x + bconv)), (8)

  • where Wconv is the weight of the convolutional kernel and bconv is the bias term of the convolutional operation.

  • 2.

    The MaxPool2d layer reduces the size of the feature map while preserving the most significant features, so we have a 64 × 28 × 56 feature map represented as follows:

    F2 = MaxPool(F1). (9)

  • 3.

    A series of Res-block with a kernel size of 3 × 3 is employed to further capture subtle features and inconsistencies in forgery images, resulting in a 128 × 14 × 28 feature map. Each Res-block contains a Conv2d layer, BatchNorm2d layer, and ReLU layer. In the key Res-block, a SA layer is added, which introduces an attention mechanism, enabling the model to enhance important features, and this process can be represented as

    F3 = SA(F2 + ReLU(BN(Wres ∗ F2 + bres))). (10)

Through extensive experiments, we have determined that the four Res-blocks are the optimal choice for our base layer.
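As a reference point, here is a minimal PyTorch sketch of the base layer following the shapes in Table 1. The spatial attention (SA) layer mentioned above is omitted for brevity, and all class names are illustrative rather than the authors' released code.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Basic 3x3 residual block (the SA layer of the key Res-block is omitted here)."""
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(cout)
        self.conv2 = nn.Conv2d(cout, cout, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(cout)
        self.relu = nn.ReLU(inplace=True)
        self.down = (nn.Sequential(nn.Conv2d(cin, cout, 1, stride, bias=False),
                                   nn.BatchNorm2d(cout))
                     if stride != 1 or cin != cout else nn.Identity())

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.down(x))

class BaseLayer(nn.Module):
    """Layers 1-3 of Table 1: shared low/mid-level features for both regions."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(                       # Layer 1
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1))
        self.layer2 = nn.Sequential(ResBlock(64, 64), ResBlock(64, 64))
        self.layer3 = nn.Sequential(ResBlock(64, 128, stride=2), ResBlock(128, 128))

    def forward(self, x):                                # x: B x 3 x 112 x 224
        return self.layer3(self.layer2(self.stem(x)))    # -> B x 128 x 14 x 28
```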

3.4.2. The Branch Layer

The layers come into play after extracting the feature maps from the base layer. We have devised a dual-branch model to independently capture the higher-level expression features of eyes and mouth regions. We incorporate another four Res-blocks in this process, which enable the extraction of more abstract and semantically rich features. In summary, our model utilizes the same base layer to extract low-level and midlevel features. The branch layer is dedicated to focusing on higher-level features while retaining the feature information from all channels, leading to more effective detection capabilities.

To make a final prediction on the input images, we choose a dual-branch CNN-based network with a fully connected layer. The fully connected layer denoted as FC(x; θc) is applied to each branch, which is used to transform the feature maps of eyes or mouth into a probability distribution of specific categories separately. The output can be described as
f = softmax(FC(E(B(x; θb); θe); θc)), (11)
where x is the input data of eyes or mouth, θ is the parameter set of each layer, B(·) is the base layer, E(·) is the branch layer, and softmax(·) is the activation function of our binary task.
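Continuing the sketch above, the branch layer and the late fusion of the two branch outputs (Eq. (11) and line 14 of Algorithm 1) could look roughly as follows; ResBlock and BaseLayer refer to the previous sketch, and the fusion weight λ = 0.5 is an assumption rather than a value stated by the paper.

```python
import torch
import torch.nn as nn

class BranchLayer(nn.Module):
    """Layers 4-6 of Table 1: high-level expression features for one facial region."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            ResBlock(128, 256, stride=2), ResBlock(256, 256),   # Layer 4
            ResBlock(256, 512, stride=2), ResBlock(512, 512))   # Layer 5
        self.pool = nn.AdaptiveAvgPool2d(1)                     # Layer 6
        self.fc = nn.Linear(512, 2)

    def forward(self, m):
        z = self.pool(self.blocks(m)).flatten(1)
        return torch.softmax(self.fc(z), dim=1)                 # cf. Eq. (11)

class DualBranchDetector(nn.Module):
    """Shared base layer, separate branches, and late fusion of the two outputs."""
    def __init__(self, fusion_lambda=0.5):
        super().__init__()
        self.base = BaseLayer()
        self.branch_eyes, self.branch_mouth = BranchLayer(), BranchLayer()
        self.lam = fusion_lambda

    def forward(self, x_eyes, x_mouth):
        f_eyes = self.branch_eyes(self.base(x_eyes))
        f_mouth = self.branch_mouth(self.base(x_mouth))
        pred = self.lam * f_eyes + (1 - self.lam) * f_mouth     # Algorithm 1, line 14
        return f_eyes, f_mouth, pred
```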

In summary, we introduce a dual-branch network architecture of the base layer and branch layer. This network effectively extracts comprehensive high-level expression features from both eyes and mouth regions, endowing our model with robustness against various perturbations. The training and testing process is shown in Algorithm 1.

    Algorithm 1: Training and testing process.
  • Input: RAW video x, preprocess method p(·), base layer module B(·; θb), branch layer module E(·; θe), and classification layer C(·; θc)
  • Output: the authenticity probability pred of each input
  • 1. Preprocessed images xe, xm = p(x)
  • 2. while the number of training iterations do
  • 3.   me ← B(xe; θb), mm ← B(xm; θb)
  • 4.   ee ← E(me; θe), em ← E(mm; θe)
  • 5.   feyes ← C(ee; θc), fmouth ← C(em; θc)
  • 6.   Optimization by (14), updating weights
  • 7.   break when training achieves stability
  • 8. end while
  • 9. for detection steps do
  • 10.   xe, xm = p(x)
  • 11.   me ← B(xe; θb), mm ← B(xm; θb)
  • 12.   ee ← E(me; θe), em ← E(mm; θe)
  • 13.   feyes ← C(ee; θc), fmouth ← C(em; θc)
  • 14.   pred = λfeyes + (1 − λ)fmouth
  • 15. end for
  • 16. return pred

3.5. Classification Loss Functions

The dual-branch learning strategy in our method aims to capture and learn the distinct expression patterns of the eyes and mouth regions. The learning procedure of our proposed method involves three steps as follows: (1) extracting the same base features, (2) learning various branch features, and (3) performing binary classification. To optimize the model, we design three loss functions: the classification loss for the eyes Leyes_cls, the classification loss for the mouth Lmouth_cls, and the total loss Ltotal. These loss functions are optimized using the general cross-entropy loss to update the parameters of the network.

In this stage, both images xeyes and xmouth are fed into separate fully connected networks at the end to obtain the classification features feyes and fmouth. The classification loss Leyes_cls is calculated by cross-entropy as follows:
Leyes_cls = −(1/N) Σ_{i=1}^{N} [yi log ŷi^eyes + (1 − yi) log(1 − ŷi^eyes)], (12)
where N is the number of samples, yi is the ground-truth label of the ith sample, and ŷi^eyes is the network output for the ith sample of the eyes area.
Similarly, we compute the classification loss Lmouth_cls as follows:
Lmouth_cls = −(1/N) Σ_{i=1}^{N} [yi log ŷi^mouth + (1 − yi) log(1 − ŷi^mouth)], (13)
where ŷi^mouth is the network output for the ith sample of the mouth area.
Finally, the total classification loss Ltotal is
Ltotal = αLeyes_cls + (1 − α)Lmouth_cls, (14)
where α is a hyperparameter, set to 0.5 in our experiments.
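A compact sketch of Eqs. (12)-(14) as reconstructed above, assuming the branch outputs are softmax probabilities (as in the previous network sketch), hence the negative log-likelihood form:

```python
import torch
import torch.nn.functional as F

def total_loss(probs_eyes, probs_mouth, labels, alpha=0.5, eps=1e-8):
    """Weighted sum of the per-branch cross-entropy losses."""
    l_eyes = F.nll_loss(torch.log(probs_eyes + eps), labels)    # Eq. (12)
    l_mouth = F.nll_loss(torch.log(probs_mouth + eps), labels)  # Eq. (13)
    return alpha * l_eyes + (1 - alpha) * l_mouth               # Eq. (14)
```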

4. Results and Analysis

In order to assess the effectiveness of our proposed approach, the experimental settings are described in Sections 4.1–4.4, and we perform in-dataset experiments using training and testing sets drawn from the same data distribution, as detailed in Section 4.5. Furthermore, to evaluate the generalization capability of our method, we conduct cross-dataset evaluations that employ a known training set and an unseen testing set, as discussed in Section 4.6. We also investigate the robustness of our model to diverse perturbations in Section 4.7, and adversarial attacks are examined in Section 4.8. In addition, we conduct ablation experiments to assess the significance of the different components integrated within our model.

4.1. Datasets

Throughout the research process of DeepFakes, challenging datasets have emerged. To ensure a comprehensive and representative evaluation of our proposed method, we have carefully selected the following datasets:
  • 1.

    FaceForensics++ dataset [7] comprises 1000 genuine videos and 4000 fake videos created by face swapping techniques such as DeepFakes and FaceSwap, as well as face reenactment methods such as Face2Face and NeuralTextures. Each video is available in three versions: RAW, slightly compressed (C23), and heavily compressed (C40).

  • 2.

    Celeb-DF-v1 and Celeb-DF-v2 datasets [38] consist of 540 authentic videos and 5639 forged videos of celebrities on YouTube, generated by using advanced DeepFakes techniques to achieve high-quality forgeries.

  • 3.

    DFDC-Preview dataset [39], released in conjunction with the competition, comprises 5000 videos generated by two facial manipulation techniques. The dataset exhibits diversity across various attributes such as gender, age, and skin tone.

In our experiments, the video datasets are split into training, validation, and testing sets in an 8:1:1 ratio. Then, we use OpenFace to extract facial images from consecutive frames. Two optical flow images are calculated between three consecutive facial images and combined with the middle facial image to create a new MSTR. For generalization evaluations, we perform data augmentation on the middle facial image. These steps are crucial for standardizing the input data across different datasets and improving model performance.
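For reference, a minimal video-level 8:1:1 split could be implemented as below; the helper name and fixed seed are illustrative, and the face extraction itself is delegated to OpenFace as described above.

```python
import random

def split_videos(video_paths, seed=0):
    """Shuffle once, then split the video list 8:1:1 into train/val/test subsets."""
    paths = sorted(video_paths)
    random.Random(seed).shuffle(paths)
    n_train, n_val = int(0.8 * len(paths)), int(0.1 * len(paths))
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])
```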

4.2. Baselines

In this section, we adopt several state-of-the-art DeepFake detection methods for evaluating our proposed model. Considering that most DeepFake detection methods have not yet released their code and weights, for a fair comparison, we directly cite their results from the corresponding original papers. We compare the latest works in both in-dataset and cross-dataset evaluations. For the subsequent robustness evaluations, we select the top four methods that performed best in the cross-dataset evaluations, reproduce their released code, and compare them with our proposed method. These works show good generalization and robustness across multiple datasets, making them suitable as comparison baselines.

4.2.1. In-Dataset Evaluation

We train and test the methods on the FaceForensics++ (C23), Celeb-DF-v2, and DFDC-Preview datasets, respectively. The comparative methods include the following: (1) ResNet50 [36]: it presents a residual network to ease the training process for image classification; (2) MesoNet [23]: it detects forged faces by analyzing them at a mesoscopic level; (3) CNN + LSTM [29]: it utilizes a convolutional LSTM to generate a temporal sequence descriptor for detecting fake videos; (4) Xception [7]: it is released with the FaceForensics++ dataset; (5) SPSL [40]: it proposes a two-branch network combining original information and multiband frequencies; (6) Face X-ray [24]: it detects forgery by locating the blending boundary between the source and target images; (7) F3-Net [41]: it takes two different frequency-aware clues to deeply capture the forgery patterns; (8) 3DCNN [42]: it proposes a novel network to extract forgery features at different scales; (9) DeepFakeucl [43]: it uses contrastive learning to design an unsupervised detection method; (10) LipForensics [15]: it extracts high-level semantic features through a pretrained lipreading model; (11) Liu et al. [44]: they present a lightweight 3DCNN to detect DeepFakes; (12) FInfer [45]: it proposes a novel framework to deal with high-visual-quality synthetic videos; (13) SBIs [13]: it generates forgery images by blending pseudosource and target face images derived from real face images; and (14) ResNet34 [28]: it combines semantic-rich features to improve the accuracy (Acc) and generalizability of DeepFake detection in real-life scenarios.

4.2.2. Cross-Dataset Evaluation

The cross-dataset evaluation is more challenging than the in-dataset evaluation; we train the models on the FaceForensics++ C23 dataset and test them on the Celeb-DF-v1, Celeb-DF-v2, and DFDC-Preview datasets, respectively. Frame-level comparison methods include the following: (1) DSP-FWA [33]: captures the features introduced in the process of affine face warping; (2) Multiattention [25]: employs multiple attention heads to design a new DeepFake detection method; (3) LiSiam [26]: considers localization invariance under various image degradations to detect DeepFakes; (4) RECCE [27]: captures the common representations of real facial images to enhance generalizability; (5) MesoNet [23]; (6) Xception [7]; (7) Face X-ray [24]; (8) SBIs [13]; and (9) ResNet34 [28]. These have been discussed in Section 4.2.1. Video-level comparison methods include the following: (1) Xception [7]: the frame-level method used to compute video-level metrics; (2) HCIL [30]: adopts two-level contrastive learning to capture more general clues for detecting forgery videos; (3) RATF [46]: designs a module to generate temporal filters for various facial regions; (4) F3-Net [41]; and (5) LipForensics [15]. These have been discussed in Section 4.2.1.

The comparison of state-of-the-art DeepFake detection methods and our proposed method is shown in Table 2. We jointly leverage spatiotemporal features to explore advanced semantic expression clues in the eyes and mouth regions of the human face, which differs from state-of-the-art approaches. In addition, it demonstrates both generalization capabilities for unknown DeepFake technologies and robustness against various image degradations and adversarial attacks.

Table 2. Comparison of the existing state-of-the-art DeepFake detection methods and our proposed method.
Methods Time Input object Extracted features Capabilities
Image Video Type Methods G R
ResNet50 [36] 2016 S DNN-based
MesoNet [23] 2018 S DNN-based
CNN + RNN [29] 2018 S DNN-based
Xception [7] 2019 S DNN-based
Face X-ray [24] 2020 S Blending boundary
F3-Net [41] 2020 F Frequency-aware clues
SPSL [40] 2020 S + F DNN-based
3DCNN [42] 2021 S + T DNN-based
DeepFakeucl [43] 2021 S Image augmentation
DSP-FWA [33] 2021 S + T Eyes movements
LipForensics [15] 2021 S + T Mouth motion
Liu et al. [44] 2021 S + T DNN-based
Multiattention [25] 2021 S DNN-based
FInfer [45] 2022 S + T Frame inference–based
HCIL [30] 2022 S + T Temporal inconsistency
LiSiam [26] 2022 S Localization invariance
RATF [46] 2022 S + T Region-aware temporal inconsistency
RECCE [27] 2022 S Reconstruction-classification learning
SBIs [13] 2022 S Self-blended augmentations
ResNet34 [28] 2024 S Global and local features
Ours S + T Expression cues of eyes and mouth
  • Note: We mainly show the time, input object (e.g., image or video), type of their extracted features (e.g., spatial (S), frequency (F), or temporal (T)), and methods of their extracted features. G and R represent the capabilities of the detection methods. G denotes whether the detector has the generalization capabilities on unknown DeepFake technologies. R denotes whether the detector is robust against various degradations and attacks.

4.3. Metrics

To evaluate the effectiveness of our DeepFake detection method, we utilize the Acc and AUC as performance metrics to measure its ability to distinguish between real and synthetic face images. In addition, we report the ASR to assess the misclassification of faces when subjected to adversarial attacks.
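As a reference, the three metrics can be computed as in the sketch below, assuming scikit-learn is available; the interpretation of ASR as the fraction of adversarial fake samples misclassified as real is our reading of the text, not a formula given by the authors.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    """Acc at a fixed decision threshold plus threshold-free AUC."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return accuracy_score(y_true, y_pred), roc_auc_score(y_true, y_score)

def attack_success_rate(preds_on_adversarial_fakes):
    """Fraction of adversarial fake samples that the detector labels as real (0)."""
    preds = np.asarray(preds_on_adversarial_fakes)
    return float((preds == 0).mean())
```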

4.4. Implementation Details

In our evaluation experiments, the optimizer is SGD, the learning rate is 10−3, with a momentum of 0.9, the training epoch is 100, and the batch size is 128. All our experiments are performed on a server with 40 cores 2.20 GHz Xeon CPUs, 64 GB RAM, and two NVIDIA Tesla V100 GPUs with 32 GB memory.
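For completeness, the optimization setup quoted above translates into roughly the following PyTorch training loop; DualBranchDetector, total_loss, and train_loader refer to the earlier sketches and to a data loader with batch size 128, all of which are placeholders rather than the authors' released code.

```python
import torch

model = DualBranchDetector().cuda()                              # from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

for epoch in range(100):                                         # 100 training epochs
    for x_eyes, x_mouth, labels in train_loader:                 # batch size 128
        x_eyes, x_mouth, labels = x_eyes.cuda(), x_mouth.cuda(), labels.cuda()
        f_eyes, f_mouth, _ = model(x_eyes, x_mouth)
        loss = total_loss(f_eyes, f_mouth, labels)                # Eqs. (12)-(14)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```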

4.5. In-Dataset Evaluation

To comprehensively evaluate the effectiveness of our proposed method, we focus on in-dataset evaluation first. To ensure meaningful comparisons with existing studies, we employ widely recognized datasets, namely, FaceForensics++, Celeb-DF-v2, and DFDC-Preview. Our method is trained and tested within the same dataset to maintain consistent data distribution. Throughout the evaluation process, we conduct thorough comparisons between our proposed method and state-of-the-art approaches in the field.

The performance results on the FaceForensics++, Celeb-DF-v2, and DFDC-Preview datasets are presented in Table 3. Our proposed method demonstrates an average improvement in AUC scores of 8.9%, 5.2%, and 6.0%, respectively, across these datasets. Specifically, on the FaceForensics++ (C23) dataset, our method surpasses the state-of-the-art ResNet34 approach by 1.6% in AUC. On the DFDC-Preview dataset, our approach exceeds Liu et al.’s method by 1.3% in AUC. For the Celeb-DF-v2 dataset, although our method ranks below the most advanced ResNet34 method, it still achieves the second-highest AUC score of 99.3%.

Table 3. In-dataset evaluation results on FaceForensics++ (C23), Celeb-DF-v2, and DFDC-Preview datasets.
FaceForensics++ (C23) Celeb-DF-v2 DFDC-Preview
Methods Time AUC↑ Methods Time AUC↑ Methods Time AUC↑
SPSL [40] 2020 94.3 Xception [7] 2019 98.5 ResNet50 [36] 2016 87.7
Face X-ray [24] 2020 87.4 DeepFakeucl [43] 2021 90.5 MesoNet [23] 2018 90.6
F3-Net [41] 2020 98.1 3DCNN [42] 2021 88.8 CNN + RNN [29] 2018 91.5
3DCNN [42] 2021 72.2 SBIs [13] 2022 93.7 Xception [7] 2019 94.0
FInfer [45] 2022 95.7 FInfer [45] 2022 93.3 LipForensics [15] 2021 78.1
ResNet34 [28] 2024 98.3 ResNet34 [28] 2024 99.9 Liu et al. [44] 2021 94.0
Ours 99.9 Ours 99.3 Ours 95.3
  • Note: The best and second-best AUC scores (%) are highlighted using bold and italic formatting, respectively. The uparrow indicates that the larger the AUC, the better the experimental performance.

It is worth noting that the results on the DFDC-Preview dataset are lower compared to those on the FaceForensics++ (C23) and Celeb-DF-v2 datasets. This discrepancy is attributed to the presence of low-quality videos in the DFDC-Preview dataset, which complicates the extraction of human faces.

In order to better understand and visualize the spatial regions that our model relies on, we select the real faces and corresponding fake faces to plot the results of Grad-CAM in Figure 5. As expected, our proposed model focuses predominantly on similar areas (eyes and mouth) among real and fake faces.

Figure 5: Visualization of the regions on which our model relies. The original facial images used in this figure are sourced from the publicly available DeepFake datasets FaceForensics++ [7], Celeb-DF-v2 [38], and DFDC-Preview [39].

4.6. Cross-Dataset Evaluation

Given the rapid advancements in DeepFake techniques, it is crucial to develop a detector that can accurately distinguish forged content generated by novel manipulation methods, even when trained on known datasets. To enable our model to learn common synthetic clues, we apply a data augmentation method [13] to the interframe. Specifically, we replace the preprocessed interframe described in Section 3.3 with an augmented frame, as illustrated in Figure 4. For our experiments, we initially train our model using the FaceForensics++ dataset and subsequently evaluate its performance on other datasets, namely Celeb-DF-v1, Celeb-DF-v2, and DFDC-Preview. The experimental results are presented in Table 4.

Table 4. Cross-dataset evaluation on the Celeb-DF-v1, Celeb-DF-v2, and DFDC-Preview.
Methods Year Input object Training dataset Celeb-DF-v1 AUC↑ Celeb-DF-v2 AUC↑ DFDC-Preview AUC↑
MesoNet [23] 2018 Video FF++ 42.2 53.6 59.4
Xception [7] 2019 Video FF++ 75.0 77.8 69.8
Face X-ray [24] 2020 Video FF++ 80.6 74.2 80.9
F3-Net [41] 2020 Video FF++ 75.7
DSP-FWA [33] 2021 Video FF++ 78.5 81.4 59.5
LipForensics [15] 2021 Video FF++ 83.3 82.4 73.5
HCIL [30] 2022 Video FF++ 79.0 69.2
RATF [46] 2022 Video FF++ 76.5 69.1
LiSiam [26] 2022 Image/Video FF++ 81.1 78.2
Multiattention [25] 2021 Image FF++ 67.4
RECCE [27] 2022 Image FF++ 68.7
SBIs [13] 2022 Image FF++ 87.9 93.2 86.2
ResNet34 [28] 2024 Image FF++ 81.8 86.4 85.1
Ours Video FF++ 88.4 87.2 84.1
  • Note: The best and second-best AUC scores (%) are highlighted using bold and italic formatting, respectively. Our proposed method demonstrates superiority over state-of-the-art frame-level detectors on the Celeb-DF-v1 dataset and outperforms advanced video-level methods averagely on three datasets. The uparrow indicates that the larger the AUC, the better the experimental performance.

4.6.1. Comparison With Frame-Level Detectors

Our approach demonstrates a significant improvement over advanced frame-level methods on the Celeb-DF-v1 dataset, achieving an average increase of 13.1% in AUC scores. Although our method shows lower results on the Celeb-DF-v2 and DFDC-Preview datasets compared to the frame-level methods, this can be attributed to the inherent differences in focus. Frame-level models concentrate solely on the detailed information within individual images, ignoring interframe correlations, which enhances their generalization capabilities across various images. Despite the superior generalization ability of frame-level DeepFake detectors, video-level models, such as ours, leverage temporal sequence information, which proves advantageous for in-dataset evaluations as shown in Table 3 and specific scenarios involving interframe manipulations or motion artifacts.

4.6.2. Comparison With Video-Level Detectors

When compared to state-of-the-art video-level methods, our model outperforms the advanced LipForensics, HCIL, and RATF approaches across the three datasets. Specifically, our method achieves an average improvement in AUC scores of 5.1%, 9.7%, and 14.2% on the Celeb-DF-v1, Celeb-DF-v2, and DFDC-Preview datasets, respectively.

4.7. Robustness to Various Perturbations

DeepFake detectors should exhibit robustness against the various image transformations and adversarial attacks that images may encounter in the real world. To assess the detectors’ robustness to unseen degradations, we consider several simple perturbations at five severity levels, as defined in the DeepForensics-1.0 benchmark [47]. These perturbations include changes in color saturation (CS), color contrast (CC), blockwise operations (Block), Gaussian noise (Noise), Gaussian blur (Blur), JPEG compression (JPEG), and video compression (Comp). Our model is trained on the FaceForensics++ (RAW) dataset containing both real and fake videos, and subsequently tested on FaceForensics++ videos that have been subjected to the aforementioned perturbations. Examples of the disturbed images are depicted in Figure 6.

Figure 6: Perturbation examples. The perturbation examples are generated to evaluate the robustness of our proposed method. They consist of changing color saturation (CS), color contrast (CC), blockwise operation (Block), Gaussian noise (Noise), blurring (Blur), JPEG compression (JPEG), and video compression (Comp). The original facial images used in this figure are sourced from the publicly available DeepFake dataset FaceForensics++ [7].
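To illustrate three of these corruptions, the hedged sketch below applies Gaussian noise, Gaussian blur, and JPEG compression at five severity levels; the severity-to-parameter mapping is illustrative and does not reproduce the exact DeepForensics-1.0 definitions.

```python
import cv2
import numpy as np

def gaussian_noise(img, severity=1):
    sigma = [5, 10, 15, 25, 35][severity - 1]
    noisy = img.astype(np.float32) + np.random.normal(0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def gaussian_blur(img, severity=1):
    k = [3, 5, 7, 9, 11][severity - 1]            # kernel size must be odd
    return cv2.GaussianBlur(img, (k, k), 0)

def jpeg_compress(img, severity=1):
    quality = [90, 70, 50, 30, 10][severity - 1]
    ok, buf = cv2.imencode(".jpg", img, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```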

To ensure a fair comparison, we conduct the robustness experiments using comparison methods that have publicly released code and weights and that achieved the best performance in Table 4. The methods included are Xception [7], Face X-ray [24], LipForensics [15], and SBIs [13].

Figures 7 and 8 show how the models perform with increasing severity levels for each perturbation type, while Table 5 summarizes the average AUC scores across the five severity levels. The results indicate that our method is more resilient to degradations than state-of-the-art approaches. For high-frequency corruptions such as Blur, JPEG, and Comp, our model outperforms the advanced LipForensics [15] and SBIs [13] methods by an average of 3.0% and 20.0% in AUC, respectively. This is because these degradations primarily affect the quality of individual frames without impacting the trajectory of facial muscle movements, allowing our model to overlook the effects caused by these degradations. Although the other methods show strong generalization capabilities, they are particularly sensitive to quality degradations; for example, Face X-ray [24] is vulnerable to video compression, and LipForensics [15] is more affected by blockwise perturbations than our method.

Figure 7: Robustness to various perturbations. The AUC scores at various severity levels for unseen distortions. Our proposed model is more robust than other advanced methods to all perturbations except color contrast corruption, which affects the calculation of optical flow. (a) Color saturation. (b) Color contrast. (c) Blockwise. (d) Gaussian noise.
Figure 8: Robustness to various perturbations. Our proposed model performs better than the advanced methods under Blur, JPEG, and Comp degradations. (a) Blur. (b) JPEG compression. (c) Video compression. (d) Average.
Table 5. Robustness to various perturbations.
Methods AUC scores (%)↑
Origin CS CC Block Noise Blur JPEG Comp Avg
Xception [7] 99.8 99.3 98.6 99.7 53.8 60.2 74.2 62.1 78.3
Face X-ray [24] 99.8 97.6 88.5 99.1 49.8 63.8 88.6 55.2 77.5
LipForensics [15] 99.9 99.9 99.6 87.4 73.8 96.1 95.6 95.6 92.5
SBIs [13] 99.9 99.5 98.4 99.6 53.5 67.6 87.6 81.1 83.9
Ours 99.9 98.4 67.6 97.8 88.4 98.7 98.6 99.0 92.6
  • Note: The average AUC scores (%) of the detectors across five severity levels for seven perturbations. The best and second-best AUC scores (%) are highlighted using bold and italic formatting, respectively. “Avg” represents the average value of all severity levels for all perturbations.

In addition, our model’s performance declines under CC perturbation, which we attribute to our reliance on optical flow features, whose estimation is driven by brightness changes. Nonetheless, this degradation is not a significant real-world threat, as color-based attacks are typically too conspicuous for practical use. Importantly, our model shows resilience against Noise: it focuses on the movement patterns of the eyes and mouth regions, which are high-level semantic features, whereas noise mainly corrupts pixel-level, low-level features. As a result, the noise does not significantly interfere with the captured microexpression movements.
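To make the evaluation protocol concrete, the sketch below outlines how severity-averaged AUC scores of this kind can be computed: a perturbation is applied at severities 1 to 5, every video is scored by an already-trained detector, and the per-severity AUC values are averaged. This is a minimal sketch, not the paper’s evaluation code; the `score_video` interface and the degradation schedules are illustrative assumptions.

```python
# A minimal sketch of the severity-averaged AUC protocol (illustrative only).
# `score_video` is a hypothetical detector interface returning a fakeness score
# per video; the quality/kernel schedules below are assumptions.
import cv2
import numpy as np
from sklearn.metrics import roc_auc_score

def jpeg_compress(frame, severity):
    quality = [90, 70, 50, 30, 10][severity - 1]   # assumed JPEG quality schedule
    _, buf = cv2.imencode(".jpg", frame, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

def blur(frame, severity):
    k = [3, 5, 7, 9, 11][severity - 1]             # assumed odd kernel sizes
    return cv2.GaussianBlur(frame, (k, k), 0)

def severity_averaged_auc(videos, labels, score_video, perturb):
    """videos: list of frame lists (BGR uint8); labels: 0 = real, 1 = fake."""
    aucs = []
    for severity in range(1, 6):
        scores = [score_video([perturb(f, severity) for f in frames])
                  for frames in videos]
        aucs.append(roc_auc_score(labels, scores))
    return float(np.mean(aucs))   # per-perturbation average across the five severities
```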

We further evaluate the robustness of our approach against various levels of video compression by training each detector on the original dataset and then testing it on three compression levels of the Celeb-DF-v1 dataset. On Celeb-DF-v1, we compare our work with the best-performing advanced detectors from Table 5, LipForensics [15] and SBIs [13]. The results are summarized in Table 6 and demonstrate that our model is less susceptible to video compression. Specifically, our work outperforms the two advanced methods by an average of 4.5%, 5.3%, and 6.4% in AUC at the RAW, C23, and C40 compression levels, respectively. In addition, the performance decline from RAW to C23 and C40 is notably smaller for our method than for the advanced approaches, by an average of 0.8% and 2.0%, respectively.

Table 6. AUC scores of different compression levels on Celeb-DF-v1.
Methods             AUC scores (%)↑                                        Decline↓ (RAW→C23/RAW→C40)
                    Celeb-DF-v1 RAW   Celeb-DF-v1 C23   Celeb-DF-v1 C40
LipForensics [15]   70.3              64.2              61.9              6.1/8.4
SBIs [13]           77.0              70.1              66.9              6.9/10.1
Ours                78.1              72.4              70.8              5.7/7.3
  • Note: The best and second-best results are highlighted using bold and italic formatting, respectively. Our proposed model outperforms the other methods. The up arrow (↑) indicates that larger AUC scores are better; the down arrow (↓) indicates that a smaller decline is better.
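The decline and margin figures quoted above follow directly from the entries of Table 6. The short check below, with the AUC values copied from the table, is offered only as a convenience; the variable names are ours.

```python
# Recomputing the Table 6 decline and margin figures from the listed AUC scores.
auc = {
    "LipForensics": {"RAW": 70.3, "C23": 64.2, "C40": 61.9},
    "SBIs":         {"RAW": 77.0, "C23": 70.1, "C40": 66.9},
    "Ours":         {"RAW": 78.1, "C23": 72.4, "C40": 70.8},
}

# Decline from RAW to each compressed level (the 6.1/8.4, 6.9/10.1, 5.7/7.3 column).
for name, s in auc.items():
    print(name, f"{s['RAW'] - s['C23']:.1f}/{s['RAW'] - s['C40']:.1f}")

# Margin of our method over the mean of the two baselines at each level
# (about 4.45, 5.25, and 6.40 points, i.e., the 4.5%, 5.3%, and 6.4% quoted above).
for level in ("RAW", "C23", "C40"):
    baseline = (auc["LipForensics"][level] + auc["SBIs"][level]) / 2
    print(level, f"{auc['Ours'][level] - baseline:.2f}")
```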

4.8. Robustness to Adversarial Attacks

On social media and online platforms, DeepFakes are widely used to spread misinformation. To mitigate the security threats posed by DeepFakes to politics, economics, and national stability, effective DeepFake detectors have been developed to identify forged content. However, adversaries often employ adversarial attacks to inject subtle perturbations into DeepFake images or videos, enabling them to bypass DNN-based detectors without noticeable loss of visual quality. If DeepFake detectors lack robustness against adversarial attacks, DeepFake content can easily evade detection, leading to the further spread of misinformation. To simulate real-world scenarios where adversaries deploy such attacks, we adopt the widely used adversarial attack methods I-FGSM [11] and PGD [12] to generate adversarial samples.

We further test the robustness of this work against typical adversarial attacks. Experiments use the DeepFakes dataset of the FaceForensics++ benchmark with two typical attack models, I-FGSM [11] and PGD [12]. We set the maximum perturbation ε = 8/255 with 40 attack iterations, and the imperceptible perturbations are added to the combined inputs of our network. We evaluate the robustness of this work against advanced DeepFake detectors in a white-box setting, selecting the frame-level and best-performing methods from Table 5, namely Xception [7], Face X-ray [24], and SBIs [13].
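For reference, the sketch below shows a generic white-box iterative gradient-sign attack with ε = 8/255 and 40 iterations, matching the setting above; enabling the random start yields PGD. The step size α, the cross-entropy objective, and the single-tensor model interface are illustrative assumptions rather than the exact attack configuration used in our experiments, where the perturbation is added to the combined inputs.

```python
# Generic white-box I-FGSM / PGD sketch (PyTorch). Step size and model interface
# are assumptions; eps = 8/255 and 40 iterations match the setting in the text.
import torch
import torch.nn.functional as F

def iterative_attack(model, x, y, eps=8 / 255, steps=40, alpha=1 / 255, random_start=False):
    """x: input batch in [0, 1]; y: ground-truth labels. random_start=True gives PGD."""
    x_adv = x.clone().detach()
    if random_start:  # PGD: start from a random point inside the eps-ball
        x_adv = (x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                     # gradient-sign step
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)   # project onto eps-ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```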

The results of the adversarial attacks are shown in Table 7. In the white-box setting, our work is more robust to these imperceptible attacks than state-of-the-art detectors. For instance, under the I-FGSM and PGD attacks, our work achieves an ASR that is 46.64% and 56.47% lower, respectively, than that of the advanced detectors. This is because adversarial attacks typically alter low-level features, and the perturbations they apply are generally unable to disrupt the continuity across consecutive frames. In contrast, our detector captures the motion patterns of the eyes and mouth regions, which are high-level semantic features that remain continuous between consecutive frames, enabling it to effectively resist adversarial noise.

Table 7. Robustness to adversarial attacks.
Methods            Origin Acc↑   Attack Acc↑ (I-FGSM/PGD)   ASR↓ (I-FGSM/PGD)
Xception [7]       97.32         0.62/0.92                  99.36/99.05
Face X-ray [24]    99.10         0.78/0.16                  99.21/99.84
SBIs [13]          99.99         0.24/0.19                  99.76/99.81
Ours               97.82         46.17/55.66                52.80/43.10
  • Note: The best and second-best results are highlighted using bold and italic formatting, respectively. Under adversarial attacks, our work achieves a lower ASR (%) than other advanced detectors trained and tested on the DeepFakes dataset of FaceForensics++. The down arrow (↓) indicates that a smaller ASR is better.
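As a reading aid, the ASR values in Table 7 are consistent with defining the attack success rate as the fraction of originally correct predictions flipped by the attack, that is, ASR = (origin accuracy − attack accuracy)/origin accuracy × 100. The paper does not state this formula explicitly, so the snippet below merely verifies that this interpretation reproduces the table and the reductions quoted above.

```python
# Reproducing the Table 7 ASR values under the assumed (inferred) definition
# ASR = (origin_acc - attack_acc) / origin_acc * 100.
origin = {"Xception": 97.32, "Face X-ray": 99.10, "SBIs": 99.99, "Ours": 97.82}
attack_acc = {
    "Xception":   {"I-FGSM": 0.62,  "PGD": 0.92},
    "Face X-ray": {"I-FGSM": 0.78,  "PGD": 0.16},
    "SBIs":       {"I-FGSM": 0.24,  "PGD": 0.19},
    "Ours":       {"I-FGSM": 46.17, "PGD": 55.66},
}
asr = {m: {a: (origin[m] - acc) / origin[m] * 100 for a, acc in accs.items()}
       for m, accs in attack_acc.items()}

for att in ("I-FGSM", "PGD"):
    baselines = [asr[m][att] for m in ("Xception", "Face X-ray", "SBIs")]
    reduction = sum(baselines) / len(baselines) - asr["Ours"][att]
    print(att, f"ours ASR {asr['Ours'][att]:.2f}, reduction vs. baseline mean {reduction:.2f}")
# -> ours ASR 52.80 / 43.10; reductions 46.64 (I-FGSM) and 56.47 (PGD)
```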

4.9. Ablation Study

To investigate the effectiveness of exploring inconsistent facial expression movements, we perform ablation experiments on our proposed method. Current DeepFake detectors typically employ the entire face for training. To capture the distinct movement patterns of the eyes and mouth regions, we partition the face into two parts and feed them into a dual-branch network. To assess the impact of this segmentation, we train the network with the whole face, with only the eyes region, and with only the mouth region as inputs.
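As an illustration of the dual-branch design, the sketch below shows one possible organization: two backbones process the eyes and mouth inputs separately, and their features are concatenated for the real/fake decision. The ResNet-18 backbones, feature dimensions, and concatenation-based fusion are assumptions made for this example and are not the exact architecture or MSTR-map preprocessing used in our method.

```python
# Illustrative dual-branch detector (PyTorch). Backbones and fusion are assumptions;
# the actual branches in the paper operate on MSTR maps of the eyes and mouth regions.
import torch
import torch.nn as nn
from torchvision import models

class DualBranchDetector(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.eyes_branch = models.resnet18(weights=None)    # branch for the eyes region
        self.eyes_branch.fc = nn.Identity()
        self.mouth_branch = models.resnet18(weights=None)   # branch for the mouth region
        self.mouth_branch.fc = nn.Identity()
        self.classifier = nn.Linear(512 * 2, num_classes)   # fuse the two 512-d features

    def forward(self, eyes: torch.Tensor, mouth: torch.Tensor) -> torch.Tensor:
        # eyes, mouth: (B, 3, 224, 224) region inputs (cf. the input size in Table 9)
        feat = torch.cat([self.eyes_branch(eyes), self.mouth_branch(mouth)], dim=1)
        return self.classifier(feat)

# Example: logits = DualBranchDetector()(torch.rand(4, 3, 224, 224), torch.rand(4, 3, 224, 224))
```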

Evaluations of facial segmentation. The results are shown in Table 8. Training on the whole face still yields good performance, but lower than that of the segmented training scheme. For instance, the AUC decreases from 99.3% with our method to 90.1% on the Celeb-DF-v2 dataset, and from 94.6% to 89.7% on the DFDC-Preview dataset, indicating that the eyes and mouth exhibit different movement patterns and that training the two regions separately improves the effectiveness of our detector.

Table 8. Ablation study of different training data.
Methods             Celeb-DF-v2               DFDC-Preview
                    Acc (%)↑    AUC (%)↑      Acc (%)↑    AUC (%)↑
w/o Segmentation    90.8        90.1          89.7        89.7
w/o Eyes region     74.1        77.9          64.2        63.3
w/o Mouth region    80.8        80.2          85.8        83.5
Ours                96.6        99.3          94.7        95.3
  • Note: The best results are highlighted using bold formatting. Accuracy and AUC scores of all the models, which are trained and tested on Celeb-DF-v2 and DFDC-Preview. The up arrow (↑) indicates that larger Acc and AUC are better.

Evaluations of a single facial region. We also train the detector with only a single region; the results are shown in Table 8. The performance decreases significantly: when training only on the eyes region, the AUC drops by 19.1% on the Celeb-DF-v2 dataset and by 11.1% on the DFDC-Preview dataset; when training only on the mouth region, the AUC drops by 21.4% on Celeb-DF-v2 and by 31.3% on DFDC-Preview. This indicates that the expression clues obtained from a single region are limited. Furthermore, the results show that the eyes region contains more synthetic cues than the mouth region.

4.10. Computational Costs

The computational costs of our proposed dual-branch network are presented in Table 9. The network has 24.99 M parameters and requires 1.89 GFLOPs (floating point operations) per forward pass on a 3 × 224 × 224 input. On an NVIDIA Tesla V100 GPU, this lightweight design makes the model well suited for the real-time detection of DeepFake videos.

Table 9. Computational costs of our proposed model.
Train process
  Input size: 3 × 224 × 224            FLOPs: 1.89 GFLOPs
  Total parameters: 24.99 M            Trainable parameters: 24.99 M
Inference process
  Average inference time: 15 ms        Average inference performance: 66.67 fps
  • Note: FLOPs refer to floating point operations.

For comparison, the average inference time (the time for one forward pass) of PhaseForensics [48] is 25 ms, and that of LipForensics [15] is 22.2 ms. In contrast, our model delivers an inference time of 15 ms, corresponding to an average throughput of 66.67 frames per second (fps), making it well suited for real-time inference on standard 30 fps video streams.
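The throughput figure follows directly from the latency (1000 ms / 15 ms ≈ 66.67 fps). A typical way to measure such an average latency on a GPU is sketched below; the warm-up and iteration counts and the tuple-of-tensors input are illustrative assumptions rather than our exact measurement protocol.

```python
# Hypothetical GPU latency measurement sketch (PyTorch); counts are assumptions.
import time
import torch

@torch.no_grad()
def average_latency_ms(model, inputs, warmup=10, iters=100, device="cuda"):
    model = model.to(device).eval()
    inputs = tuple(t.to(device) for t in inputs)
    for _ in range(warmup):          # warm-up passes (CUDA kernel compilation, caches)
        model(*inputs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(*inputs)
    torch.cuda.synchronize()         # wait for all queued GPU work before stopping the clock
    ms = (time.perf_counter() - start) / iters * 1000.0
    return ms, 1000.0 / ms           # average latency (ms) and throughput (fps)

# e.g., a 15 ms average latency corresponds to 1000 / 15 ≈ 66.67 fps.
```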

5. Discussion

Existing DeepFake generation methods typically focus on the visual quality of a single frame, emphasizing detail optimization in local regions such as the eyes and mouth. However, they lack temporal modeling across consecutive video frames, neglecting the global coordination of facial muscle movements, which results in inconsistent facial expressions. Moreover, current DeepFake detectors suffer from limited generalizability and robustness. In contrast, our proposed method leverages the observation that DeepFakes may disrupt smooth and continuous facial muscle movements. We design a dual-branch network to explore expression cues from the eyes and mouth regions. These high-level semantic features enhance the model’s generalization capability for unknown DeepFakes data and demonstrate strong robustness against various image degradations and adversarial attacks. Evaluation and comparative experiments validate the significant contribution of inconsistent expression cues to DeepFake detection. Ablation studies consistently confirm that the effectiveness of our method lies in combining the forgery clues from the eyes and mouth regions.

More importantly, our proposed method has good cross-dataset generalization capabilities, and with further optimization, it could scale to larger datasets or different types of forgeries, such as multiple forgery regions in a video, as well as DeepFake detection across various devices and platforms. Our method shows strong robustness against various levels of postprocessing and adversarial attacks, making it adaptable to complex real-world scenarios, such as low-resolution videos or other quality degradation challenges. Finally, our method is primarily focused on detecting forgery facial videos and can be integrated with various multimedia verification systems to ensure the privacy and security of individuals’ information, particularly in applications such as facial recognition and emotion analysis.

6. Conclusions

In this paper, we introduce a novel DeepFake detector that relies on inconsistent facial expression cues. Through extensive experiments, we have demonstrated that synthetic faces generated by GANs fail to maintain spatial and temporal consistency in facial expressions. Building on this analysis, we have developed a robust and general DeepFake detector that exploits the irregular patterns of facial movements. We extract these patterns by utilizing MSTR maps of the eyes and mouth regions, and we have designed a dual-branch framework that further enhances the efficiency of our approach. Our experiments have validated the high effectiveness of our method, independent of the generator used and the quality of the input video. We believe that a significant contribution of this study lies in our observation of inconsistent expression clues between natural and synthetic faces, particularly in the eyes and mouth regions. We have presented a systematic investigation and explanation of expression motion patterns in forged faces. We hope that our method will make a valuable contribution to combating advanced DeepFake techniques.

We acknowledge that while the proposed method demonstrates outstanding performance in detecting DeepFake videos in the current experiments, it may face certain challenges when dealing with highly sophisticated forgery techniques, such as talking-face videos generated by diffusion models. In future research, we plan to explore more diverse feature extraction methods and incorporate cross-modal information, such as audio and text, to enhance robustness and detection performance against advanced DeepFakes.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This work was supported by the National Natural Science Foundation of China under Grant no. 62372334 and the National Key Research and Development Program of China under Grant no. 2023YFB3106900.

Data Availability Statement

The datasets are publicly available.
