A Scalable and Generalised Deep Learning Framework for Anomaly Detection in Surveillance Videos
Abstract
Anomaly detection in videos is challenging due to the complexity, noise, and diverse nature of activities such as violence, shoplifting, and vandalism. While deep learning (DL) has shown excellent performance in this area, existing approaches have struggled to apply DL models across different anomaly tasks without extensive retraining. This repeated retraining is time-consuming, computationally intensive, and unfair. To address this limitation, a new DL framework is introduced in this study, consisting of three key components: transfer learning to enhance feature generalization, model fusion to improve feature representation, and multitask classification to generalize the classifier across multiple tasks without training from scratch when a new task is introduced. The framework’s main advantage is its ability to generalize without requiring retraining from scratch for each new task. Empirical evaluations demonstrate the framework’s effectiveness, achieving an accuracy of 97.99% on the RLVS (violence detection), 83.59% on the UCF dataset (shoplifting detection), and 88.37% across both datasets using a single classifier without retraining. Additionally, when tested on an unseen dataset, the framework achieved an accuracy of 87.25% and 79.39% on violence and shoplifting datasets, respectively. The study also utilises two explainability tools to identify potential biases, ensuring robustness and fairness. This research represents the first successful resolution of the generalization issue in anomaly detection, marking a significant advancement in the field.
1. Introduction
- A novel framework has been introduced for incorporating a new anomaly class into an existing AD model without retraining from scratch.
- A deep feature fusion method is proposed to integrate diverse DL models for better feature representation.
- The deep feature fusion approach achieved an accuracy of 83.59% on the University of Central Florida (UCF)-Crime dataset and 97.99% on the RLVS dataset, outperforming all previous methods.
- Based on our experiments with the UCF-Crime and RLVS datasets, the multiclassification approach accurately detected and categorised two specific abnormal behaviour classes (shoplifting and violence) and normal behaviour, achieving an accuracy of 88.37%. This approach significantly enhances the ability to identify and categorise abnormal behaviours in different scenarios. To the best of our knowledge, no existing approach has achieved a similar level of performance with a single model across multiple tasks in VAD.
- The proposed framework was tested on unseen datasets, achieving accuracies of 87.25% and 79.39% on violence and shoplifting datasets, respectively.
2. Related Works
This section provides an overview of the evolving landscape of VAD research, highlighting key studies and methodologies that harness ML and DL’s potential to enhance AD systems’ accuracy and efficiency. A study [11] presented a new DL architecture for identifying violent behaviours in videos. The method leverages recurrent neural networks (RNNs) and 2D CNN to capture spatial and temporal features. Optical flow information is integrated to encode motion patterns. The proposed approach was successfully tested on multiple databases. In [12], the authors proposed an approach to assist monitoring staff in directing their attention to specific screens where the likelihood of a crime is higher. This approach involved identifying situations in video footage that may signal an impending crime. They employed a 3D CNN to examine surveillance videos and capture behavioural attributes for the identification of suspicious actions. The model was trained using carefully chosen videos from the UCF-Crime dataset. In [13], a Hybrid CNN Framework (HCF) was presented to identify distracted driver behaviours by leveraging DL and image processing techniques. The framework employed pretrained CNN models in collaboration to extract behavioural features through TL, thereby improving result accuracy. In a study [14], TL was employed to enhance the accuracy of abnormal behaviour detection by extracting human motion characteristics from RGB video frames. The authors utilised the VGGNet-19 architecture for feature extraction and subsequently applied a Support Vector Machine (SVM) classifier to identify complex motion scenarios. In [15], several DL models, such as CNN, Long Short-Term Memory (LSTM), CNN-LSTM, and Autoencoder-CNN-LSTM, were explored to identify unusual behaviours in older people. The models were trained using temporal and spatial data, which enabled them to make accurate predictions. To tackle the issue of data imbalance, the researchers oversampled minority classes, especially for the LSTM model. Overall, the study provides insights into how DL can be used to detect and prevent unusual behaviours in elderly individuals. In [16], the authors tackled the issue of shoplifting by focusing on detecting suspicious behaviours that may lead to criminal activities. Instead of identifying the crime itself, their approach aims to model and detect behaviours that precede criminal acts, providing opportunities for prevention. They used a 3D CNN to extract video features and classify segments with potential shoplifting behaviour. The authors of [17] introduced a shoplifting detection system that utilises a DNN. The system used the InceptionV3 model for feature extraction and employed LSTM networks to understand temporal sequences. This system can accurately identify individuals involved in shoplifting activities with an accuracy of up to 74.53%. The paper [18] presented a deep violence detection approach that leverages handcrafted features related to appearance, speed of movement, and representative images. These features are input into the CNN through spatial, temporal, and spatiotemporal streams. The spatial stream captured environment patterns, the temporal stream focused on motion patterns using modified optical flow, and the spatiotemporal stream introduced a novel feature to enhance interpretability. The CNN is trained on datasets containing both violent and normal behaviour frames. A study [19] used DL techniques to detect abnormal driver behaviours such as smoking, eating, drinking, and calling.
A dataset was created to train and test models comprising these behaviours and normal driving. The study evaluated DL models, including a proposed CNN-based model and pretrained models such as ResNet101, VGG-16, VGG-19, and Inception-v3. Keyframe extraction was used to optimise computation. In [20], a shoplifting detection system was introduced. This approach involved using a hybrid neural network that combined convolutional and recurrent components to extract information from video frames and analyse their temporal sequence. Specifically, it employed gated recurrent units for data processing. Data augmentation was conducted to mitigate class imbalance and enhance the dataset. A pretrained MobileNetV3Large CNN was combined with a recurrent network incorporating gated nodes for classification. In reference [21], a method for detecting violent behaviour using keyframes was proposed. This approach treats video frames as discrete events and detects instances of violence by assessing whether the count of keyframes exceeds a predefined threshold, thereby reducing hardware demands. Furthermore, the paper introduced a novel training technique that leverages pairs of background-removed and original images to improve feature extraction for DL models while avoiding introducing extra network complexity. In [22], a model for detecting crowd violence behaviour, named HD-Net, was developed. It utilised a human contour extractor to minimise background noise in violence detection by focusing on individuals in video frames. A dynamic feature encoder also extracts dynamic features from adjacent frames. The model is built on a 3D CNN framework for spatial feature extraction and LSTM for temporal feature fusion. The authors of [23] proposed a convolutional autoencoder architecture that can detect anomalies in appearance and motion patterns. The architecture used two components, the spatial and temporal autoencoder, to differentiate between spatial and temporal representations. The spatial autoencoder captures appearance features by reconstructing the initial frame. On the other hand, the temporal component models motion through RGB differences across sequential frames. To further enhance the performance of the motion autoencoder, the paper incorporated a variance-based attention module that highlights critical movement areas. A novel deep K-means clustering approach was introduced to extract concise representations. In [24], the authors presented a CNN model to detect crowd anomalies in video sequences. The model comprises two convolutional layers followed by two fully connected layers that utilise Rectified Linear Unit (ReLU) and sigmoid functions. The intermediate layers generate features that are used for abnormality detection. The model’s performance was evaluated on three scientific datasets that included normal and abnormal activities. The outcomes demonstrated that the model performed effectively when applied to random YouTube videos exhibiting abnormal behaviour. In [25], a new approach was introduced to detect abnormal behaviour of workers in manufacturing environments. The model identifies and describes unusual worker actions based on their interaction with objects using a combination of technologies such as Mask R-CNN, MediaPipe Holistic, LSTM, and a worker behaviour description algorithm. The approach involved object recognition, worker pose identification, and pattern analysis to differentiate between typical and unusual actions.
Anomalous behaviours include worker falls, slips, tool breakage, and machine failures. The article [26] presented an anomaly recognition model employing a deep CNN architecture. This model extracts deep features from surveillance video frames and directs them to a temporal convolution network (TCN) with a multihead attention module. The TCN comprises multiple layers of temporal convolutional filters with varying dilation rates, enabling the capture of diverse temporal contexts and long-range dependencies. It is trained by minimising an objective function, using cross-entropy loss, to optimise parameters for accurate classification or recognition of activities in sequential data. The research presented in the paper [27] utilised TL-InceptionV3 to improve AD in surveillance cameras. Two TL methods, pretraining and fine-tuning, were employed using InceptionV3 to classify frames as normal or abnormal behaviours. The UCF-Crime dataset was utilised for training and evaluation. The results demonstrated that the fine-tuning approach significantly outperformed the pretraining approach, indicating substantial enhancements in the model’s performance. In [28], a comprehensive benchmark dataset consisting of 900 samples was evenly divided into 450 instances of shoplifting and 450 instances of nonshoplifting, annotated across different shoplifting scenarios. This dataset was used to assess shoplifting detection techniques, including 2D CNN, 3D CNN, and a novel hybrid method that combines InceptionV3 and bidirectional LSTM. Notably, the hybrid approach outperformed the others in terms of performance. In [29], a DL method was introduced for identifying violence in animation videos. The research involved modifying a Faster R-CNN model to handle the intricate aspects of violence depicted in cartoon and animation content. The modification included replacing the model’s backbone with a customised RegNet to capture frame features, utilising a modulated deformable convolutional (MDC) layer instead of the standard inner lateral connection for flexible feature map extraction, and introducing a novel distributed attention module (DAM) within the feature pyramid network to enhance feature extraction. Additionally, the researchers incorporated a modified multiscale Region of Interest (ROI) Align to enhance violence detection across diverse scenarios. Moreover, the method integrated a classification component into the detection model to categorise different levels of violence within each frame. In [30], an innovative semisupervised hard attention mechanism was introduced. This mechanism facilitated identifying and separating crucial regions within videos from less informative data segments. The model’s accuracy improved by efficiently eliminating redundant data and highlighting valuable visual information at a higher resolution. This approach obviated the necessity for attention annotations in video violence datasets, rendering them more widely applicable. The proposed model utilised a pretrained I3D backbone to expedite and stabilise the training process. Ref. [31] proposed a novel approach to enhance the generalisation of violence detection across multiple scenarios. This approach employed three pretrained CNN models, Xception, InceptionV3, and InceptionResNetV2, to extract significant features from the RLVS and Hockey datasets. The extracted features from each dataset were then fused into a single feature pool separately.
Finally, these feature pools from the first violence scenario and the second were combined into a unified feature space, facilitating the training of an ML classifier capable of generalising across multiple scenarios. However, it is essential to note that this approach does not address generalisation across multitask AD. The article [32] explored advanced techniques to enhance aggression detection using multimodal fusion and DL in surveillance systems. The study addressed the limitations of traditional single-modality approaches by integrating audio, visual, and text-based features, along with additional meta-information such as Audio-Focus, Video-Focus, Context, History, and Semantics. Four distinct fusion methods were developed and compared: intermediate-level fusion, concatenation-based fusion, and two methods involving element-wise operations followed by concatenation. In the paper [33], the authors proposed a method to enhance the detection of abnormal actions, particularly human aggression and car accidents, using wavelet-based channel augmentation. The core of the proposed method was the MultiWave-Net, a spatiotemporal network designed to integrate wavelet transformation with traditional DL architectures such as CNNs and ConvLSTMs. The wavelet-based channel augmentation technique was applied to improve the feature extraction capabilities of these networks, allowing them to better capture both spatial and temporal aspects of the input data. The article [34] proposed an approach to recognising violent video behaviours by leveraging the MLP-Mixer architecture and a new dataset format called Sequential Image Collage (SIC). SIC aggregates video frames into collages, capturing spatial and temporal dimensions to enhance the model’s understanding of violent actions. These collages and original frames were processed through the MLP-Mixer architecture, which relies solely on Multilayer Perceptrons (MLPs) for computational efficiency. The method involved patch embedding, token mixing, and channel mixing operations to capture the dataset’s local and global features. The paper [35] proposed a VAD approach, namely AnomalyCLIP, to detect and classify abnormal events at the frame level using only video-level supervision. The method manipulated the latent CLIP feature space to define a normal event subspace, allowing it to detect anomalies based on text-driven prompts. It incorporated a Transformer architecture to enhance this capability, capturing both short- and long-term temporal dependencies between frames. The authors of [36] proposed a lightweight, trainable 3D CNN to recognise violent actions in public places by processing spatial and temporal information. This model is designed to be computationally efficient for deployment on resource-constrained devices like mobile systems. In [37], the authors introduced an ensemble approach comprising three CNN architectures integrated with a stacked LSTM to enhance model performance. It utilised the RLVS and RWF-2000 datasets to evaluate the model’s generalisation ability across diverse datasets.
Although the literature has introduced advanced AI methods to detect and recognise anomalous behaviours in videos, a significant challenge remains concerning generalisation. These methods necessitate retraining the entire model from scratch when a new anomaly class is introduced, resulting in increased time and computational resource requirements, especially when the model is complex or requires extensive training. This challenge presents a practicality and efficiency issue for AD systems. Therefore, new methods are urgently needed to address the generalisation issue in multitask AD without requiring extensive retraining.
3. Materials and Methods
3.1. Datasets
UCF-Crime and RLVS datasets were employed in this article. The UCF dataset [38] is widely recognised in criminal activity recognition research and was compiled by the UCF. It is publicly accessible for research purposes and includes 1900 untrimmed surveillance videos gathered from platforms such as YouTube, TV news, and documentaries, varying in quality and resolution. This dataset encompasses 128 h of footage depicting 13 real-world criminal activities, including abuse, arrest, arson, assault, road accidents, shoplifting, and more. For this investigation, samples were extracted explicitly from the “shoplifting” and “normal” classes, each comprising 50 video clips recorded in retail stores. On the other hand, the RLVS dataset [39] consists of 2000 video clips, equally divided between violent and everyday activities. The violent videos depict physical altercations in various environments, such as streets, prisons, and schools. Videos within the RLVS dataset feature high resolutions, ranging from 480p to 720p, with durations spanning 3–7 s. A frame extraction process was performed using a 10-frame interval, resulting in six frames per second. Notably, some frames within violent and shoplifting videos were eliminated during the data cleaning phase because they did not depict the relevant actions and were more similar to frames from normal videos. Figure 1 shows some samples from both the UCF and RLVS datasets.
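The frame-sampling step described above can be reproduced with a short script. The following is a minimal sketch using OpenCV, assuming the videos are organised in per-class folders; the folder names and the 224 × 224 resize target are illustrative, while the 10-frame interval follows the description above.

```python
import os
import cv2  # OpenCV

def extract_frames(video_path, out_dir, interval=10, size=(224, 224)):
    """Save every `interval`-th frame of a video as a resized JPEG."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"),
                        cv2.resize(frame, size))
            saved += 1
        idx += 1
    cap.release()
    return saved

# Illustrative usage for one class folder of the RLVS dataset.
for name in os.listdir("RLVS/Violence"):
    extract_frames(os.path.join("RLVS/Violence", name),
                   os.path.join("frames/violence", os.path.splitext(name)[0]))
```

Frames that do not depict the relevant action are then removed manually, as described in the data-cleaning step above.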

3.2. CNN Architectures
This study utilised four deep CNN models, MobileNetV2, InceptionV3, InceptionResNetV2, and Xception, to tackle the challenge of detecting anomalous video behaviours. These models have several advantages worth noting. They have achieved outstanding results on the ImageNet dataset, which is widely acknowledged as the benchmark for CV tasks. Their well-designed architectures excel at feature extraction, allowing them to capture a broad spectrum of features due to their diverse filter sizes, ranging from 1 × 1 to 7 × 7. Furthermore, incorporating ReLU activations and residual connections improves the feature representation quality and addresses the gradient vanishing problem. Dropout layers and global average pooling (GAP) are also utilised in these models to mitigate the risk of overfitting. Moreover, the incorporation of Batch Normalization layers expedites the training process. These advantages collectively make these models effective for VAD. The following subsections provide concise descriptions of the models used in this work.
3.2.1. MobileNetV2 Model
The MobileNetV2 model is an efficient and lightweight CNN explicitly designed for mobile and embedded devices. MobileNetV2’s architecture commences with a full convolutional layer containing 32 filters, followed by 19 residual bottleneck layers. Two types of bottleneck blocks are used, each comprising three layers: a 1 × 1 expansion convolution, a 3 × 3 depthwise convolution, and a 1 × 1 projection convolution, with ReLU activation applied at various levels throughout the architecture. The primary distinction between the two block types is their stride; the first employs a stride of 1 and includes a residual connection, while the second utilises a stride of 2 for downsampling [40]. The MobileNetV2 model successfully achieves a delicate equilibrium between model size and performance, rendering it particularly suitable for resource-constrained applications like mobile devices and embedded systems.
3.2.2. InceptionV3 Model
InceptionV3 [41] is a CNN architecture for image classification and object recognition tasks. It is renowned for its deep structure and utilisation of specialised convolutional layers, including Inception modules designed to capture features at varying scales. InceptionV3 employs parallel convolutional layers of different sizes and pooling operations within these modules to capture features at various scales effectively. It leverages factorized convolutions to reduce network parameters, includes auxiliary classifiers to enhance training, and utilises GAP to mitigate overfitting. Extensive use of batch normalisation accelerates training. InceptionV3’s depth and parameter count make it suitable for various CV applications. These attributes contribute to InceptionV3’s ability to achieve high accuracy in image classification tasks while maintaining manageable computational complexity.
3.2.3. InceptionResNetV2 Model
InceptionResNetV2 is a deep CNN architecture combining elements from the Inception and ResNet architecture [42]. It was designed to improve feature learning and representation in CV tasks. InceptionResNetV2 integrates residual connections, similar to ResNet, to facilitate the training of deep networks while also utilising Inception modules to capture features at multiple scales. This architecture typically includes a stem module to preprocess input data, multiple Inception-ResNet blocks that increase network depth, and final layers for classification or feature extraction. InceptionResNetV2’s unique combination of these architectural elements aims to achieve superior performance in tasks such as image classification and object detection.
3.2.4. Xception Model
The Xception network [43] evolves from Inception by replacing conventional convolution layers with depthwise separable convolution layers, a design that decouples spatial and cross-channel correlations within the network’s core functionality. Xception comprises 36 convolutional layers organised into 14 modules, all of which, except for the first and last, are surrounded by linear residual connections. Each depthwise separable convolution factorises a standard convolution into a depthwise convolution, applied independently to each input channel, followed by a 1 × 1 pointwise convolution that combines the resulting feature maps across channels.
The selected CNN models—MobileNetV2, InceptionV3, InceptionResNetV2, and Xception—were chosen due to their unique architectural strengths and their proven effectiveness on the ImageNet dataset, a benchmark for CV tasks. MobileNetV2 is a lightweight and efficient model optimized for resource-constrained environments, making it ideal for real-time applications. InceptionV3 and InceptionResNetV2 are renowned for their ability to capture multiscale features through parallel convolutional layers, with the latter incorporating residual connections for enhanced gradient flow in deep networks. Xception leverages depthwise separable convolutions to optimise spatial and cross-channel correlations, reducing computational complexity while maintaining high accuracy. These models provide complementary advantages in feature extraction, offering a diverse set of representations for the AD task.
These models utilise prior knowledge from large-scale datasets like ImageNet, allowing them to transfer generalised features to new tasks. This significantly reduces the need for extensive labelled data, speeds up convergence, and improves performance with minimal computational resources. In contrast, non-pretrained networks require training from scratch, which demands substantial computational resources, larger datasets, and longer training times, making them less practical for real-world AD scenarios. In this context, self-supervised learning can be an alternative to transfer learning from ImageNet [4]. By leveraging pretrained networks, the framework efficiently captures generalised features and refines them for specific tasks, ensuring robust and accurate AD.
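To make the transfer-learning setup concrete, the sketch below loads the four backbones with ImageNet weights and turns each into a feature extractor via global average pooling, using tf.keras. The input resolution is an assumption, and in the full framework the backbones are additionally trained on the anomaly datasets (Section 3.5) before their features are extracted; the snippet only illustrates the starting point.

```python
from tensorflow.keras import Model, applications, layers

IMG_SIZE = (224, 224)  # assumed input resolution

def build_extractor(backbone_cls):
    """Pretrained backbone (ImageNet weights, no classification head) + GAP."""
    base = backbone_cls(weights="imagenet", include_top=False,
                        input_shape=IMG_SIZE + (3,))
    inputs = layers.Input(shape=IMG_SIZE + (3,))
    features = layers.GlobalAveragePooling2D()(base(inputs, training=False))
    return Model(inputs, features)

extractors = {
    "mobilenetv2":       build_extractor(applications.MobileNetV2),
    "inceptionv3":       build_extractor(applications.InceptionV3),
    "inceptionresnetv2": build_extractor(applications.InceptionResNetV2),
    "xception":          build_extractor(applications.Xception),
}

def extract_features(frames):
    """frames: float array (N, 224, 224, 3); each backbone normally expects its
    own preprocess_input, which is omitted here for brevity."""
    return {name: m.predict(frames, verbose=0) for name, m in extractors.items()}
```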
3.3. Part1: Proposed Framework, Deep Feature Fusion Approach
Each CNN model has its own architecture and filter sizes for extracting features from the input data, so combining the features captured by different models provides a richer feature representation for detecting anomalies in surveillance videos, which in turn improves overall performance. The proposed deep feature fusion approach (Figure 2) used four CNN models, MobileNetV2, InceptionV3, InceptionResNetV2, and Xception, as feature extractors to capture features from the input video frames. Next, the features extracted by the individual models are combined into a unified feature pool. The different colours in the feature pool correspond to the features extracted from different CNN models. The feature pool is then used to train ML classifiers. Finally, the ML classifiers were employed to assign class labels and classify human behaviours as normal or abnormal. Six classifiers were used to recognise anomalies in the captured videos: SVM, SoftMax, K-Nearest Neighbor (KNN), AdaBoost, Logistic Regression (LogReg), and Naïve Bayes.

The deep fusion approach offers several benefits. It provides a flexible means of combining multiple CNN models without the need to train them from scratch. This approach allows for the incorporation of new models trained on specific datasets by extracting features from the final fully connected layer and then inserting them into the feature space. This incorporation method saves time and computational costs, eliminating the need to retrain the already-used individual models. Moreover, the fusion of features extracted from multiple models yields a broader and more inclusive array of information from which classifiers can acquire knowledge. This technique empowers ML classifiers to harness the distinct strengths and characteristics of the individual models, thus enhancing the holistic understanding of the target task. Furthermore, combining diverse models can mitigate the risk of overfitting and enhance generalisation capability.
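A sketch of the fusion step is given below: the per-frame feature vectors produced by the individual (already trained) models are concatenated into one pool, and a classifier is then fitted on that pool. The concatenation order and the logistic-regression classifier shown here are illustrative choices, not the only ones used in the experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def fuse_features(per_model_features):
    """Concatenate per-frame feature vectors from several CNNs into one pool.

    per_model_features: dict mapping model name -> array of shape (N, d_model).
    Returns an array of shape (N, sum of the d_model values).
    """
    names = sorted(per_model_features)  # fixed order for reproducibility
    return np.concatenate([per_model_features[n] for n in names], axis=1)

def train_fused_classifier(train_parts, y_train, test_parts, y_test):
    """Fit one ML classifier on the fused pool and report test accuracy."""
    X_train, X_test = fuse_features(train_parts), fuse_features(test_parts)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf, accuracy_score(y_test, clf.predict(X_test))
```

Incorporating a further model then amounts to appending its feature columns to the pool, with no retraining of the other extractors.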
3.4. Part2: Proposed Framework, Multitask Classification
Traditional AD models struggle with handling new or unforeseen anomalies, making them susceptible to false negatives (FN) or misclassifications. In contrast to binary AD, which deals with only one class of anomalies and one class of normal data, multi-AD addresses scenarios where multiple types of anomalies are present.
The proposed multitask classification approach introduces a critical innovation that eliminates the need to retrain the entire model from scratch when introducing a new anomaly class. This approach unifies features extracted from various CNN models and multiple datasets with different classes into a single feature space. These features are then used to train ML classifiers. Figure 3 provides a schematic diagram of the multitask classification model. In the scenario presented in this paper, two feature fusion pools were used: the UCF-features pool and the RLVS-features pool, which fused the features extracted from the UCF and RLVS datasets, respectively. The UCF-features pool comprises features related to the normal and shoplifting classes, while the RLVS-features pool includes features representing the normal and violence classes.

To categorise incoming frames as violent, shoplifting, or normal, the proposed approach utilised the UCF-features pool and RLVS-features pool, along with their corresponding class labels, to create a unified feature space. This feature space was then used to train the ML classifiers to categorise and classify incoming frames based on their respective classes.
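A minimal sketch of this step follows, assuming the two fused pools from Section 3.3 are available as arrays: normal frames from both datasets are mapped to one shared label, each anomaly keeps its own label, and a single classifier is trained on the stacked space (the integer encoding and the AdaBoost choice are illustrative).

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

NORMAL, SHOPLIFTING, VIOLENCE = 0, 1, 2  # assumed label encoding

def build_multitask_pool(ucf_feats, ucf_is_anomaly, rlvs_feats, rlvs_is_anomaly):
    """Merge the UCF and RLVS fused feature pools into one labelled space.

    ucf_is_anomaly / rlvs_is_anomaly are boolean arrays: anomalous UCF frames
    are shoplifting, anomalous RLVS frames are violence, and normal frames from
    both datasets share the same class.
    """
    y_ucf = np.where(ucf_is_anomaly, SHOPLIFTING, NORMAL)
    y_rlvs = np.where(rlvs_is_anomaly, VIOLENCE, NORMAL)
    X = np.vstack([ucf_feats, rlvs_feats])
    y = np.concatenate([y_ucf, y_rlvs])
    return X, y

# Given the fused pools from Section 3.3:
# X, y = build_multitask_pool(ucf_feats, ucf_is_anomaly, rlvs_feats, rlvs_is_anomaly)
# clf = AdaBoostClassifier().fit(X, y)  # best-performing classifier in Section 4.4
```

Introducing a further anomaly class would then amount to extracting its features, appending them to the pool with a new label, and refitting only this lightweight classifier, leaving the CNN backbones untouched.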
There are several advantages to incorporating new classes without going through full retraining. This approach can save time and computational resources, especially when dealing with complex models or models that require significant training time. It also enables the system to adapt to emerging anomaly types, which enhances its robustness in dynamic environments. This approach effectively addresses a considerable challenge in ML by allowing models to adapt to evolving anomaly patterns efficiently.
3.5. Training
The training and evaluation process comprised the following stages:
- Training and testing each individual CNN model on the UCF dataset (a minimal fine-tuning sketch is given after this list).
- Training and testing each individual CNN model on the RLVS dataset.
- Assessing the performance of the deep fusion model through distinct tests on the UCF and RLVS datasets.
- Finally, evaluating the proposed multitask classification approach, which combines the captured features from the UCF and RLVS datasets, each containing different abnormal behaviors.
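For the first two stages, each backbone can be trained as a binary frame classifier. The sketch below uses tf.keras; the classification head, optimiser, learning rate, and epoch count are assumptions, as they are not specified here.

```python
import tensorflow as tf
from tensorflow.keras import Model, applications, layers

def build_binary_classifier(backbone_cls, img_size=(224, 224)):
    """Pretrained backbone + GAP + dropout + sigmoid head (assumed head design)."""
    base = backbone_cls(weights="imagenet", include_top=False,
                        input_shape=img_size + (3,))
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# train_ds / val_ds: tf.data.Dataset of (frame, label) pairs built from the
# extracted frames, e.g. normal = 0 and shoplifting = 1 for the UCF split.
# model = build_binary_classifier(applications.MobileNetV2)
# model.fit(train_ds, validation_data=val_ds, epochs=20)  # epoch count is illustrative
```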
3.6. Explainability Tools
Two explainability tools, Grad-CAM and t-SNE, were employed to interpret the models’ decisions and the learned feature space:
1. Grad-CAM is an interpretability technique designed to elucidate the predictions of any DL model in a coherent and comprehensible manner. This is achieved through a focus on visualizing crucial regions within images. The approach capitalizes on gradients, effectively highlighting areas of the image that wield significant influence over the model’s decision-making process. Grad-CAM starts with a forward pass, processing an image through a pretrained CNN. Backpropagation calculates how changes in each feature map influence the final prediction score for a specific class. GAP computes importance scores for each feature map by averaging gradients. These importance scores are used as weights for a weighted sum of feature maps, indicating their impact. The ReLU activation focuses on positive contributions, and the final output is a heatmap highlighting image regions that influenced the CNN’s decision. Brighter heatmap areas correspond to more influential image regions in classification [45, 46]. Researchers can analyze and comprehend the crucial regions to better understand the model’s reasoning process. This approach also helps them to verify whether the influential regions align with their expectations, boosting confidence in the model’s predictions. In our study, we applied Grad-CAM to the final convolutional layers of the CNN, as these layers contain more detailed feature representations, providing a better insight into the model’s focus on higher-level patterns. We selected these layers because they capture complex features that are essential for making final predictions, unlike earlier layers that focus on basic edges and textures.
2. t-SNE, which stands for t-Distributed Stochastic Neighbor Embedding, is a nonlinear technique for reducing the dimensions of data while preserving the structure at various scales. It is particularly well-suited for visualizing high-dimensional datasets. The low-dimensional representation produced by t-SNE can be plotted, making it possible to visualise clusters, patterns, and relationships in the data that might be difficult to discern in the high-dimensional space. This paper uses t-SNE to understand the fusion techniques and how they improve the feature space. Feature fusion techniques often combine features from multiple layers or sources to create more comprehensive representations, which can improve model performance. By using t-SNE to visualize the extracted features, we aimed to observe how feature fusion enhances the distinction between different classes or categories in a lower-dimensional space. This visualization helps reveal patterns in the data that show the improved discriminative power of fused features compared to nonfused ones. Additionally, t-SNE allowed us to identify any potential biases in the feature space, as unintended clustering or overlap between certain categories could indicate problematic areas in the model’s learned representations. This insight was crucial in refining the feature fusion process to ensure that it contributed to both the model’s robustness and fairness. A minimal sketch of both tools follows this list.
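The sketch below illustrates both tools under the assumptions that the model is a tf.keras CNN (the convolutional layer is selected by name) and that scikit-learn's TSNE is used for the 2D embedding; the perplexity value is illustrative.

```python
import numpy as np
import tensorflow as tf
from sklearn.manifold import TSNE

def grad_cam(model, image, conv_layer_name, class_index=None):
    """Return a [0, 1] heatmap of the regions that most influenced the prediction."""
    grad_model = tf.keras.Model(
        model.input, [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)        # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))  # GAP of the gradients
    cam = tf.reduce_sum(weights[:, tf.newaxis, tf.newaxis, :] * conv_out, axis=-1)
    cam = tf.nn.relu(cam)[0]                      # keep positive contributions only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

def tsne_projection(features, perplexity=30, seed=0):
    """Project a (N, d) feature pool to 2D for plotting class clusters."""
    return TSNE(n_components=2, perplexity=perplexity,
                random_state=seed).fit_transform(features)
```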
4. Experimental Results
This section discusses the experimental results of the various DL models used for AD, focusing specifically on violence and shoplifting detection in video datasets. It outlines the performance of the individual CNN models, the proposed deep feature fusion model, and the multitask classification approach, assessed using metrics such as accuracy, recall, precision, and F1 score.
4.1. Evaluation Metrics
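Model performance is reported using accuracy, recall, precision, and F1 score. Their standard definitions, in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), are given below; for the multiclass experiments, the per-class scores are assumed to be averaged.

```latex
\mathrm{Accuracy}  = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Recall}    = \frac{TP}{TP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{F1\ score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```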
4.2. Experimental Results of Individual CNN Models
The four deep CNN models used in this work were evaluated for performance in VAD tasks by being tested on the UCF and RLVS datasets, as described in subsequent sections.
4.2.1. Experiment Results on UCF Dataset
The results of the pretrained CNN models were evaluated based on their accuracy, loss curves for training and validation, and the confusion matrix, as shown in Figure 4. Table 1 presents the evaluation metrics obtained by testing these models on the UCF dataset. MobileNetV2 performed the best in accuracy, precision, and F1 score, making it the ideal choice when minimizing false positives (FP) is a priority. Xception, on the other hand, had high recall but lower precision, making it a suitable option when capturing true positives (TP) is a priority. InceptionV3 and InceptionResNetV2 achieved a balance between these two metrics. Additionally, Figure 5 displays the Grad-CAM-generated heatmaps for these individual models.

Model | Accuracy (%) | Recall (%) | Precision (%) | F1 score (%) |
---|---|---|---|---|
MobileNet | 83.48 | 74.56 | 90.14 | 81.61 |
Inception | 80.71 | 70.58 | 87.81 | 78.26 |
InceptionResNet | 79.60 | 78.15 | 79.92 | 79.03 |
Xception | 70.50 | 89.10 | 64.47 | 74.81 |

4.2.2. Experiment Results on RLVS Dataset
The RLVS dataset served as the training and testing data for four distinct models in this specific scenario. These models underwent training and validation across multiple epochs, allowing the measurement of losses, accuracies, and confusion matrices to assess their performance. The results of this evaluation are presented in Figure 6. Table 2 provides a summary of the performance metrics of these models on the RLVS dataset. All four models demonstrated high accuracy, highlighting their proficiency in correctly classifying videos as either violent or normal. The differences in accuracy, recall, and F1 score between the models are minimal, with MobileNet holding a slight edge, demonstrating a strong capability to detect violent instances. InceptionResNet exhibited the highest precision. These models deliver robust performance in violence recognition, characterized by their high accuracy. Figure 7 displays the heatmap generated by Grad-CAM.

Model | Accuracy (%) | Recall (%) | Precision (%) | F1 score (%) |
---|---|---|---|---|
MobileNet | 96.57 | 97.74 | 95.51 | 96.61 |
Inception | 96.0 | 95.88 | 96.18 | 96.0 |
InceptionResNet | 96.19 | 95.32 | 97.0 | 96.16 |
Xception | 96.17 | 96.75 | 95.65 | 96.20 |

4.3. Experimental Results of the Deep Feature Fusion Model
In the following subsections, we provide detailed information about the experimental results achieved by utilizing the deep fusion model on both the UCF and RLVS datasets. Each model focused on a distinct ROI, and the fusion of these four models proved to be highly effective in capturing features for the ML classifiers.
4.3.1. Experimental Results of the Deep Fusion Model on the UCF Dataset
The proposed feature fusion model was evaluated on the UCF dataset to assess its effectiveness in recognizing shoplifting behavior in surveillance videos. As mentioned earlier, this model involved combining features extracted from the individual CNN models, which were trained on the UCF dataset, into a unified feature pool. Subsequently, we assessed the model’s performance on the UCF testing dataset using six classifiers. Table 3 and Figure 8 present the results and their corresponding confusion matrices. The results demonstrate that the proposed model outperforms the individual CNN models, achieving an accuracy of 83.59%, a recall of 84.62%, a precision of 82.46%, and an F1 score of 83.53%. This represents an improvement of 0.11% over the MobileNetV2 model, which achieved the highest accuracy among the individual models. Figure 9 shows the feature distribution for the UCF dataset, visualized using t-SNE, for the individual CNN models and for the features after fusion.
Classifier | Accuracy (%) | Recall (%) | Precision (%) | F1 score (%) |
---|---|---|---|---|
LogReg | 83.59 | 84.62 | 82.46 | 83.53 |
SoftMax | 83.55 | 84.53 | 82.45 | 83.48 |
KNN | 83.39 | 82.41 | 83.59 | 82.99 |
SVM | 81.28 | 72.70 | 87.10 | 79.25 |
AdaBoost | 81.28 | 72.70 | 87.10 | 79.25 |
Naïve Bayes | 77.25 | 53.92 | 99.67 | 69.98 |


4.3.2. Experimental Results of the Deep Feature Fusion Model on the RLVS Dataset
Once again, the proposed deep feature fusion approach outperformed the individual CNN models used in this work when tested on the RLVS dataset for detecting violent activities in videos, achieving an accuracy of 97.99%. This represents an increase of 1.42% over the MobileNetV2 model, which achieved the highest accuracy among the individual models. Table 4 and Figure 10 present the experimental results, including the confusion matrices of the ML classifiers. Figure 11 shows the feature distribution for the RLVS dataset, visualized using t-SNE, for the individual CNN models and for the features after fusion.
Classifier | Accuracy (%) | Recall (%) | Precision (%) | F1 score (%) |
---|---|---|---|---|
KNN | 97.99 | 98.57 | 97.45 | 98.01 |
LogReg | 97.89 | 98.57 | 97.26 | 97.91 |
SoftMax | 97.89 | 98.57 | 97.26 | 97.91 |
AdaBoost | 97.82 | 97.78 | 98.86 | 97.82 |
SVM | 97.60 | 99.01 | 96.30 | 97.63 |
Naïve Bayes | 97.34 | 99.28 | 95.57 | 97.39 |


4.4. Experimental Results of the Multitask Classification Model
We assessed the effectiveness of our multitask classification model in identifying multiple anomaly classes by using two video anomaly behavior datasets, UCF, which has normal and shoplifting classes, and RLVS, which has normal and violent classes. The proposed model used four pretrained CNN models as feature extractors to extract the features from the UCF and RLVS datasets. The extracted features from these different models for each dataset were fused to create a single feature set (Figure 12). It is worth noting that the features for normal behavior from both datasets were merged due to their similarity in behavior. After that, ML classifiers were trained to categorize and classify incoming frames as shoplifting, violent, or normal behavior. Table 5 shows our results, including accuracy, recall, precision, and F1 scores for six different classifiers—AdaBoost, KNN, LogReg, SoftMax, Naïve Bayes, and SVM. We also included their respective confusion matrices in Figure 13. AdaBoost showed the highest accuracy at 88.37%, recall at 84.34%, precision at 88.0%, and an F1 score of 85.43%, indicating strong overall performance. KNN also performed well with a high accuracy of 86.45%, recall of 82.05%, precision of 82.62%, and an F1 score of 82.08%. LogReg and SoftMax yielded similar performance metrics with moderate accuracy and F1 scores. SVM and Naïve Bayes showed lower accuracy, recall, precision, and F1 scores compared to the other classifiers.

Classifier | Accuracy (%) | Recall (%) | Precision (%) | F1 score (%) |
---|---|---|---|---|
AdaBoost | 88.37 | 84.34 | 88.00 | 85.43 |
KNN | 86.45 | 82.05 | 82.62 | 82.08 |
LogReg | 78.49 | 71.98 | 72.01 | 71.27 |
SoftMax | 78.49 | 71.98 | 72.01 | 71.27 |
SVM | 72.86 | 64.86 | 63.86 | 62.02 |
Naïve Bayes | 71.87 | 63.70 | 62.17 | 60.35 |

We evaluated six ML classifiers (SVM, SoftMax, KNN, AdaBoost, LogReg, and Naïve Bayes) using the unified feature pool generated by the proposed deep feature fusion framework. Each classifier was trained on the same feature pool to ensure a fair comparison, and their performance was assessed using four metrics: accuracy, recall, precision, and F1 score. The results indicated that AdaBoost achieved the highest overall performance, particularly in multitask classification, due to its ability to balance precision and recall effectively. AdaBoost demonstrated robustness in identifying anomalies across datasets, benefiting from its adaptive boosting mechanism, which enhances performance by focusing on misclassified samples during training. Other classifiers, such as SVM and KNN, exhibited strong performance in specific metrics (e.g., precision or recall) but did not achieve the same balance as AdaBoost. This comparative evaluation highlights the suitability of AdaBoost as the best-performing classifier for the proposed framework.
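The comparison protocol can be summarised as follows: every classifier is fitted on the same fused training pool and scored with the same four metrics. The sketch below uses scikit-learn with default hyperparameters and macro averaging, both of which are assumptions; SoftMax is represented here by multinomial logistic regression, so it coincides with LogReg in this sketch.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

CLASSIFIERS = {
    "SVM": SVC(),
    "SoftMax": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "LogReg": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

def compare_classifiers(X_train, y_train, X_test, y_test):
    """Fit each classifier on the same fused feature pool and report its metrics."""
    rows = []
    for name, clf in CLASSIFIERS.items():
        y_pred = clf.fit(X_train, y_train).predict(X_test)
        rows.append({
            "classifier": name,
            "accuracy":  accuracy_score(y_test, y_pred),
            "recall":    recall_score(y_test, y_pred, average="macro"),
            "precision": precision_score(y_test, y_pred, average="macro"),
            "f1":        f1_score(y_test, y_pred, average="macro"),
        })
    return rows
```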
4.5. Comparative Studies With Recent Research
In this section, we compared the performance of the proposed deep feature fusion approach to recent research that utilizes DL models to detect anomalous behaviors in surveillance videos, particularly those related to violence and shoplifting crimes. Table 6 presents a comparison of the accuracy values achieved by the proposed deep feature fusion model with existing methods for automatic shoplifting detection using the UCF dataset. Table 7 presents the results of the experiments conducted by applying our deep feature fusion model and various DL methods for violence detection on the RLVS dataset. Our proposed deep fusion approach outperformed contemporary methodologies, achieving accuracies of 83.59% and 97.99% on the UCF and RLVS datasets, respectively. These results confirm the quality and efficacy of our approach in VAD and represent a clear advance over the current state of the art in this intricate field. The accuracy attained on both datasets underscores our approach’s potential to address real-world challenges in video surveillance and AD scenarios, signifying its value in practical applications and its potential for further research and development.
Ref., year | Model | Accuracy % |
---|---|---|
[12], 2020 | 3D CNN | 75.0 |
[16], 2021 | 3D CNN | 75.7 |
[17], 2021 | InceptionV3 and LSTM networks | 74.53 |
[28], 2023 | InceptionV3 and bidirectional LSTM | 81.0 |
[35], 2024 | AnomalyCLIP | 75.03 |
This work | Proposed deep feature fusion model | **83.59** |
- Note: The bold value indicates the result of the proposed method.
Ref., year | Model | Accuracy % |
---|---|---|
[39], 2019 | VGG16 + LSTM | 88.20 |
[11], 2020 | ValdNet2 + GRU | 96.74 |
[47], 2021 | Flow gated RGB | 87.25 |
[21], 2022 | Keyframe + ResNet18 network | 94.60 |
[22], 2022 | 3DCNN + LSTM networks | 96.50 |
[26], 2023 | BoTNet152 + TCN | 93.15 |
[33], 2024 | MultiWave-Net | 96.0 |
[34], 2024 | MLP-mixer architecture | 96.0 |
[36], 2024 | Lightweight 3D CNN | 88.0 |
[37], 2024 | Ensemble approach (CNN + LSTM) | 96.6 |
This work | Proposed deep feature fusion model | **97.99** |
- Note: The bold value indicates the result of the proposed method.
The multiclassification model achieved a high accuracy of 88.37% using the AdaBoost classifier. Its primary task was to recognize three classes of behavior: two distinct types of abnormal behavior (shoplifting and violence) and a normal class. The model leveraged the UCF and RLVS datasets, making it highly effective in precisely recognising anomalies. It provided a robust solution for multianomaly recognition and significantly enhanced the capacity to detect and classify diverse anomalous behaviors across various scenarios. This approach is innovative in that it addresses the challenge of integrating multiple tasks within video anomaly systems without requiring complete model retraining when a new anomaly class is incorporated. As a result, a direct comparison with prior studies is not possible for this specific aspect.
4.6. Independent Test
An independent test involves evaluating a model’s performance on data it has not seen or been trained on, a crucial step in assessing the model’s generalization capabilities. In this study, we conducted an independent test on the proposed multitask classification model to evaluate its performance on unseen data with new scenarios and gauge its ability to generalize learned patterns to new information. The model underwent training on the UCF dataset, encompassing shoplifting actions, and the RLVS dataset, which includes instances of violent actions. The model was then tested on two independent datasets: a Movie dataset [48] featuring both normal and violent actions, and a shoplifting dataset [49] consisting of normal and shoplifting actions. Table 8 lists the results achieved by the multitask classification model on the Movie test set. KNN stands out with the highest accuracy at 87.25% among the presented classifiers. It also exhibits balanced recall at 80.25%, precision at 97%, and an F1 score at 87.87%, indicating robust performance across different metrics. AdaBoost also performs well across all metrics: accuracy at 86.23%, recall at 77.80%, precision at 99.43%, and F1 score at 87.30%. On the other hand, the results in Table 9 show how the different classifiers performed on the shoplifting dataset for the multitask classification model. Naïve Bayes performed the best, achieving the highest accuracy (79.39%), recall (76.79%), precision (83.28%), and F1 score (79.91%). KNN followed closely, with a good balance between precision (82.17%) and recall (69.86%). SoftMax, AdaBoost, LogReg, and SVM all had similar results, with accuracy around 75%. This noteworthy result suggests the model’s proficiency in generalizing to sequences of violence and shoplifting behaviors across diverse scenarios, validating its robustness and potential practical utility. Figure 14 displays heatmaps generated by the Grad-CAM method on samples from the independent datasets.
Classifier | Accuracy (%) | Recall (%) | Precision (%) | F1 score (%) |
---|---|---|---|---|
KNN | 87.25 | 80.25 | 97.00 | 87.87 |
AdaBoost | 86.23 | 77.80 | 99.43 | 87.30 |
SoftMax | 82.90 | 74.60 | 99.50 | 85.25 |
LogReg | 82.90 | 74.60 | 99.50 | 85.25 |
Naïve Bayes | 81.79 | 75.90 | 97.30 | 85.36 |
SVM | 78.46 | 71.58 | 98.50 | 82.97 |

Classifier | Accuracy (%) | Recall (%) | Precision (%) | F1 score (%) |
---|---|---|---|---|
Naïve Bayes | 79.39 | 76.79 | 83.28 | 79.91 |
KNN | 75.83 | 69.86 | 82.17 | 75.52 |
SoftMax | 75.05 | 68.06 | 82.12 | 74.43 |
AdaBoost | 74.96 | 67.80 | 82.15 | 74.29 |
LogReg | 74.92 | 67.80 | 82.07 | 74.26 |
SVM | 74.92 | 67.80 | 82.07 | 74.26 |

5. Conclusion
This paper addresses the challenging task of detecting anomalies in complex video data characterized by noise and diverse actions such as violence, shoplifting, and property destruction. While DL has shown promising results in this domain, previous studies have struggled with the crucial problem of generalization across different AD tasks without resorting to training from scratch for each new task. This approach is not only time-consuming and computationally expensive but also unfair. To mitigate these issues, our paper introduces a novel DL framework that comprises three key components. First, TL is used to enhance feature generalization. Second, model fusion is employed to improve feature representation, which enhances generalization. Finally, multitask classification is used to enable the generalization of the classifier across various tasks. Empirical results demonstrate the effectiveness of our approach, surpassing state-of-the-art methods. Using a single classifier, we achieved an accuracy of 97.99% on the RLVS dataset for violence detection, 83.59% on the UCF dataset for shoplifting detection, and 88.37% on both datasets, all without the necessity of training from scratch for each task. To the best of our knowledge, this represents the first successful resolution of the generalization problem in AD, which is a significant advancement in this domain. In conclusion, our novel DL framework provides a more efficient and fair approach to AD in complex video data. The limitation of our framework lies in the potential increase in model complexity and computational overhead, particularly in multitask learning scenarios. This limitation will be addressed in future work by introducing a lightweight DL approach aimed at reducing complexity while maintaining performance efficiency. Our goal is to optimise the framework for practical applications, ensuring it remains suitable for deployment in real-world environments.
Disclosure
A preprint has previously been published [50]: https://arxiv.org/abs/2408.00792.
Conflicts of Interest
The authors declare no conflicts of interest.
Funding
The authors would like to acknowledge the support received through the following funding schemes of the Australian Government: Australian Research Council (ARC) Industrial Transformation Training Centre (ITTC) for Joint Biomechanics under Grant IC190100020. Open access publishing facilitated by Queensland University of Technology, as part of the Wiley-Queensland University of Technology agreement via the Council of Australian University Librarians.
Open Research
Data Availability Statement
All used datasets were cited within the article.