International Journal of Intelligent Systems

Volume 2025, Issue 1 4370827

Research Article

Open Access

An Anomaly Detection System for High-Dimensional Industry Time Series Data Based on GNN and Apache Flink

Feng Ye (Corresponding Author)

[email protected]

orcid.org/0000-0003-0005-2073

College of Computer Science and Software Engineering, Hohai University, Nanjing, China, hhu.edu.cn

Key Laboratory of Hydrologic-Cycle and Hydrodynamic System of Ministry of Water Resources, Hohai University, Nanjing, China, hhu.edu.cn

Kaibo Zhang

College of Computer Science and Software Engineering, Hohai University, Nanjing, China, hhu.edu.cn

Jun Sun

College of Computer Science and Software Engineering, Hohai University, Nanjing, China, hhu.edu.cn

Engineering Service Center, Nanjing Nari-Relays Electric Company, Nanjing, China

Na Li

College of Computer Science and Software Engineering, Hohai University, Nanjing, China, hhu.edu.cn

First published: 10 July 2025

https://doi.org/10.1155/int/4370827

Academic Editor: Vasudevan Rajamohan

Abstract

Intelligent systems are widely used in many fields. During operation, they generate large volumes of high-dimensional time series monitoring data, which often conceal potential abnormal conditions that threaten the stable operation of the system. Existing anomaly detection methods mainly focus on the sequential characteristics of time series data but often ignore the correlations between different variables of multivariate data, and their detection efficiency is low on high-dimensional time series data. To solve these problems, we propose a deep anomaly detection method based on a graph neural network and combine it with the big data computing framework Apache Flink to construct a real-time anomaly detection system for large-scale high-dimensional time series data. Experimental results on SWaT and WADI show that the proposed method accurately detects anomalies in multivariate time series data and performs low-latency real-time anomaly detection on high-dimensional industrial streaming data.

1. Introduction

In the real world, real-time monitoring of infrastructure such as aircraft, server clusters, and water conservancy projects is of great significance, helping managers keep track of the operation of large equipment and systems in a timely manner [1]. Monitoring generates a large amount of high-dimensional time series data that reflects the operating status of this infrastructure. These data not only include information about normal operation but also record the abnormal conditions caused by network attacks, equipment failures, and other events, reflecting the problems and threats that may be encountered during operation. Therefore, timely anomaly detection on high-dimensional time series data, in order to detect and resolve abnormal events in industrial systems and prevent potential system failures, has become a major challenge [2–4].

In the past, experts set thresholds for normal events, and if the system’s measurements exceeded the threshold defined by the expert, the system was considered abnormal [5]. However, with the increase in the number and types of sensors deployed in industrial systems, the hidden correlations between different variables in the collected time series data become more complex, and the traditional threshold approach is no longer applicable in some cases. Therefore, in order to adapt to the changes of this environment and the diversity of data, it is of great significance to model the complex correlation between different variables of multivariate data and realize a more efficient and automated anomaly detection method of multivariate time series data [6].

At present, anomaly detection technology is widely used in many industrial fields, but research on real-time anomaly detection systems is still at the development stage [7, 8]. Traditional anomaly detection systems mainly rely on expert-set thresholds, which cannot respond effectively to complex data relationships and dynamically changing environments. When dealing with high-dimensional data in particular, their performance is limited by the breadth and depth of expert knowledge. However, monitoring large-scale systems generates a large amount of real-time multidimensional time series data with complex correlations. The anomaly detection system therefore needs to handle high-dimensional, fast-flowing, and constantly changing data, and to maintain detection efficiency and accuracy in the complex situations that arise in industrial processes.

We propose a deep anomaly detection method based on graph neural networks (GNNs) that captures the complex spatial relationships in multivariate time series. In addition, this research combines the deep anomaly detection method with the big data computing framework Apache Flink to design a real-time anomaly detection system for high-dimensional time series streaming data. The system divides the time series data into correlation partitions and distributes the computing load while maintaining the data correlation, so as to improve the scalability and efficiency of the system. Flink’s distributed real-time computing capability is then used to realize real-time anomaly detection of streaming data. The main contributions can be summarized as follows:

1. We propose a deep anomaly detection method based on GNNs to address the complex correlations among the variables of multivariate time series data. We improve the graph attention network so that it pays more attention to edge information, and we add the output structure of the Transformer to improve the representation ability of the graph features and make the training process more stable.
2. To realize real-time anomaly detection on high-dimensional time series data, we design a distributed real-time anomaly detection system based on associative partitioning, which addresses the low efficiency of the proposed deep anomaly detection method on high-dimensional data. The system combines the deep anomaly detection method with the big data computing framework Apache Flink.
3. We conduct experiments on the public SWaT and WADI datasets for multivariate time series anomaly detection to evaluate the algorithm in diverse scenarios, compare the proposed anomaly detection method against multiple baseline methods, and test the designed real-time anomaly detection system, verifying its performance and scalability.

2. Related Work

Traditional anomaly detection methods mainly include methods based on density, distance, linear models, and classification models, such as isolation forest [9], LOF [10], and OCSVM [11]. These methods use statistics or machine learning techniques to achieve anomaly detection. They are based on statistical principles and assume that data is generated from a specific distribution. However, in complex real-world data, the data distribution can change and is difficult to represent accurately with a single statistical model.

Deep learning methods have been developed in the field of anomaly detection of multivariate time series data [12–14]. The current anomaly detection based on deep learning mainly adopts prediction method or reconstruction method. The prediction method regards the anomaly score as the deviation between the observed value and the predicted value, while the reconstruction method regards the observed value deviating from the reconstructed value as the anomaly [15, 16]. Compared with traditional methods, deep anomaly detection can learn hierarchical discriminable features from large-scale complex data to achieve high-precision anomaly detection.

Reconstruction methods mainly include autoencoder [17], USAD [18], OmniAnomaly [14], etc. This kind of method uses an encoder to map the original data to a low-dimensional feature space and a decoder to restore the data from that space, optimizing the parameters of the two networks by minimizing the reconstruction loss. Since abnormal data is difficult for the model to reconstruct accurately, its reconstruction error is large, and the reconstruction error is used as an anomaly score to identify outliers. Prediction-based methods mainly include LSTM [19], DAGMM [20], and so on. Normal data usually follows temporal dependencies well and can be easily predicted, whereas abnormal data usually violates this dependency and is unpredictable. The above deep anomaly detection methods mainly focus on the time dependence of data, that is, the trend of data change over time and the correlation between time points, but do not model the spatial structure of multivariate time series data. These methods therefore have limitations when applied to multivariate time series data, especially when complex spatial association structures between multiple variables are involved.

GNNs are developing rapidly in the field of anomaly detection in time series. The graph network structure can learn the relationship between different variables of multidimensional time series data, and promote the interpretability of time series tasks. GDN [12] and GTA [21] are typical representatives of anomaly detection using GNN. When using GNN to model time series, normal data will fall into a specific potential data distribution space due to the correlation of data nodes, while abnormal data will be difficult to establish connections with other data, resulting in a large distribution gap compared with normal data.

Research on anomaly detection systems is still in its infancy, and open problems remain, especially for large-scale data in industrial scenarios. Existing systems, such as CADF [22], MIDS [23], and RDAD [24], are often validated on small-scale or domain-specific data, so their performance and scalability in real-world applications are limited.

3. Proposed Framework

3.1. Overview

We propose an anomaly detection model based on GNNs to model the spatial structure of multivariate time series data. The overall structure of the model is shown in Figure 1, which uses the following steps to extract the spatial associations of multivariate time series data to capture the interactions among multiple variables more comprehensively and accurately: Firstly, the association graph is constructed by using cosine similarity and TopK strategy for variable features, and the graph structure is optimized by using Gumbel-Softmax sampling method. Secondly, the improved graph attention network is used to learn the features of nodes in the graph. Finally, the variable features are combined with the graph node features, and a deep prediction model is used to realize the time series prediction, and the anomaly score is calculated by comparing the error between the predicted value and the measured value.

Figure 1. The overall structure of the proposed model.

3.2. Model Implementation

3.2.1. Graph Construction

To construct the association graph, we employ cosine similarity to measure the pairwise correlation between variables in multivariate time series data. Cosine similarity is particularly suitable for this task as it focuses on the directional alignment of variable dynamics rather than their absolute magnitude variations. In industrial scenarios, sensors may exhibit heterogeneous measurement scales or inherent noise, making amplitude-based metrics (e.g., Euclidean distance) less reliable for capturing true functional dependencies. Cosine similarity, by normalizing vectors to unit length, effectively isolates the temporal correlation patterns between variables, which is critical for identifying anomalies rooted in coordinated behavioral deviations. Furthermore, we apply the TopK strategy to retain the top k most correlated variables for each node, where k is adaptively determined based on the desired sparsity level of the correlation matrix. This step reduces computational complexity while preserving the most salient inter-variable relationships.
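The graph construction step can be illustrated with a minimal NumPy sketch. It assumes each variable is described by a feature vector (the rows of a hypothetical `features` array) and that k is a fixed integer; the function name `topk_cosine_graph` is illustrative and is not taken from the paper's implementation.

```python
import numpy as np

def topk_cosine_graph(features: np.ndarray, k: int) -> np.ndarray:
    """Return a directed 0/1 adjacency matrix linking each node to its k most similar nodes."""
    # Normalize every feature vector to unit length so the dot product is the cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    # Exclude self-loops from the ranking.
    np.fill_diagonal(sim, -np.inf)
    # Indices of the k largest similarities in each row.
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros_like(sim, dtype=int)
    rows = np.repeat(np.arange(features.shape[0]), k)
    adj[rows, neighbors.ravel()] = 1
    return adj

# Four toy variables: two point roughly along the first axis, two along the second.
features = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]])
adj = topk_cosine_graph(features, k=1)
```

Because the vectors are normalized first, variables 0 and 1 are linked even though their magnitudes differ by a factor of two, which is the scale-invariance property the paragraph above relies on.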

The number of neighbor nodes in the association graph obtained in this way is relatively fixed, but in practical applications the structure of the graph may change dynamically, and the number of neighbor nodes needs to be adjusted for different scenarios. We therefore optimize the graph structure with the Gumbel-Softmax sampling method, which hides part of the connections in the graph.

First, the edge sampling probabilities s_ij from node i to node j in the graph are learned using a linear model with node features as input, and second, Gumbel noise g_k is generated for each possible edge state and added to the corresponding sampling probabilities, as shown in equations (1) and (2).

()

u_k is randomly sampled from the Uniform distribution, that is, u_k ~ Uniform(0, 1).

Finally, the probability of edge existence c_ij is calculated, which determines whether the edge from node i to node j in the association graph is hidden. The specific calculation method is shown in (3).

()

Here, τ is a temperature parameter that controls the smoothness of Gumbel-Softmax. The smaller τ is, the sparser the graph structure is.

The Gumbel-Softmax sampling method is used to fine-tune the graph generated by the TopK strategy, balancing the stability and dynamic adaptability of the graph structure, so this study does not select a small τ value. In practice, π is chosen as the value of τ. This method offers unique advantages: compared with random pruning (which tends to lose potential correlation information) and entropy-based sampling (which relies on unstable probability thresholds), Gumbel-Softmax employs a controlled noise injection and temperature regulation mechanism to dynamically hide redundant edges while preserving critical correlations. This effectively addresses the robustness issue of multivariate time-series data distribution changes under varying industrial operating conditions.
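As a rough illustration of the sampling step, the sketch below applies the standard Gumbel-Softmax relaxation to a two-state choice (keep or hide) per edge. It is not the paper's exact formulation: the keep-probabilities `s` are passed in directly rather than learned by a linear model, and the function name `gumbel_softmax_edge_keep` is hypothetical.

```python
import numpy as np

def gumbel_softmax_edge_keep(s, tau, rng):
    """Relaxed probability that each edge is kept, given keep-probabilities s in (0, 1)."""
    s = np.clip(np.asarray(s, dtype=float), 1e-6, 1.0 - 1e-6)
    # Log-probabilities of the two edge states: index 0 = keep, index 1 = hide.
    logits = np.log(np.stack([s, 1.0 - s], axis=-1))
    # Gumbel(0, 1) noise from uniform samples: g = -log(-log(u)).
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    # Temperature-scaled softmax over the two states.
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    return probs[..., 0]

rng = np.random.default_rng(0)
# tau = pi follows the value reported in the text above.
c = gumbel_softmax_edge_keep([0.9, 0.5, 0.1], tau=np.pi, rng=rng)
```

With a relatively large temperature such as π, the outputs stay away from 0 and 1, so the sampling perturbs the TopK graph gently instead of pruning it aggressively.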

3.2.2. Graph Feature Learning

The model uses a GNN to learn the features of the spatial association graph. By iteratively passing node information, the GNN integrates local correlation information into global features, so as to understand the entire spatial structure more comprehensively. The obtained spatial features play a key role in understanding the dynamic changes and associations between different data variables.

Different from the graph attention network used in GDN [12], we propose a graph feature extraction method based on an improved graph attention network, which adds edge information to the original graph attention network and uses the residual connection and layer normalization of the Transformer as its output structure to improve the representation ability of the GNN. Suppose that, at time t, the input data to the model is X^(t) = [x^{(t − w)}, x^{(t − w + 1)}, …, x^{(t − 1)}], where w is the window length. The component of every graph node i is the series of that variable's values within this window.

Firstly, in order to fuse the feature information, the component of each node and the node feature r_i are fused and concatenated, as shown in (4), where W is a trainable weight matrix.

()

Secondly, the attention weight α_i,j is calculated, which measures the attention score contributed by node i to node j. The original graph attention network does not consider edge information. However, the interaction between nodes is not homogeneous, and some connections may be more important than others. The edge in the association graph represents the association strength between nodes. In this study, the edge weight is introduced into the calculation of the attention score, so that the improved graph attention network models the interaction strength between nodes more accurately and captures the intrinsic characteristics of data with complex graph structure more effectively. The calculation method is shown in equations (5) and (6).

()

Here, W_e is the edge weight matrix and e_ij is the edge information between nodes. Then, for each graph node, the graph feature extraction is completed by aggregating the information of neighbor nodes, where the edge feature only plays a role in the weight calculation and is not part of the new node feature, as shown in (7).

()

Finally, the graph feature output z^(t) is obtained through the residual connection structure and layer normalization, as shown in equations (8) and (9).

()

where LayerNorm stands for layer normalization and FNN represents feedforward neural network.
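The attention steps described in this subsection can be sketched in NumPy as follows. This is only an illustrative layer under stated assumptions, not the paper's equations (4) to (9): the edge information is taken to be a scalar per edge, the attention logit is a LeakyReLU over the concatenation of both fused node representations and the projected edge term, and the output structure is a single residual feedforward step followed by layer normalization. All parameter names (`W`, `w_e`, `a`, `W_ffn`) and shapes are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def edge_aware_attention(x, r, adj, edge, W, w_e, a, W_ffn):
    """x: (N, w) node windows, r: (N, d) node features, adj/edge: (N, N)."""
    n = x.shape[0]
    proj = x @ W                                    # (N, d) projected windows
    g = np.concatenate([r, proj], axis=1)           # (N, 2d) fused node representation
    mask = adj.astype(bool) | np.eye(n, dtype=bool) # every node also attends to itself
    gi = np.repeat(g[:, None, :], n, axis=1)        # (N, N, 2d), row node
    gj = np.repeat(g[None, :, :], n, axis=0)        # (N, N, 2d), column node
    e = edge[:, :, None] * w_e[None, None, :]       # (N, N, d_e) projected edge weight
    logits = leaky_relu(np.concatenate([gi, gj, e], axis=2) @ a)
    logits = np.where(mask, logits, -np.inf)        # non-neighbors get zero attention
    logits = logits - logits.max(axis=1, keepdims=True)
    alpha = np.exp(logits)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    # Edge terms only influence alpha; the aggregated features come from nodes alone.
    h = np.maximum(alpha @ proj, 0.0)
    z = layer_norm(h + np.maximum(h @ W_ffn, 0.0))  # residual feedforward + LayerNorm
    return alpha, z

rng = np.random.default_rng(0)
N, w, d, d_e = 4, 5, 3, 2
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
alpha, z = edge_aware_attention(
    x=rng.normal(size=(N, w)), r=rng.normal(size=(N, d)), adj=adj,
    edge=rng.uniform(size=(N, N)), W=rng.normal(size=(w, d)),
    w_e=rng.normal(size=d_e), a=rng.normal(size=4 * d + d_e),
    W_ffn=rng.normal(size=(d, d)),
)
```

Each row of `alpha` sums to one over the node itself and its neighbors, and `z` has one normalized feature vector per node.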

3.2.3. Time Series Forecasting

Through the above GNN, the model obtains a set of associated graph features, and uses these graph features and time series data as input to realize the time series prediction of future data points. Although there are many complex multidimensional time series prediction methods, such as LSTM, Informer, etc., in the anomaly detection task, only single-step prediction is needed, and too many complex structures are not required. Therefore, in order to reduce the computational complexity, we adopt multilayer stacked fully connected layers to obtain the prediction results, as shown in (10).

()

where {r₁, r₂, …, r_N} is the data feature and z^(t) is the graph feature obtained by the GNN. The output of this fully connected layer is x̂^(t), which represents the prediction at time t. To evaluate the performance of the model, mean square error is used as the loss function, which measures the difference between the predicted output x̂^(t) and the observed data x^(t). The prediction loss is calculated as shown in (11).

()

Then, the optimization algorithm is used to continuously adjust the weight and bias of the model, so that the predicted results are closer to the actual observed values. The anomaly score in anomaly detection is used to measure how a data point deviates from the normal pattern, with a higher anomaly score indicating that the data point is more likely to be an outlier. In the prediction model, the difference between the observed value and the predicted value of the model is calculated as the anomaly score, as shown in (12).

$$a_i = \left| x_i^{(t)} - \hat{x}_i^{(t)} \right| \tag{12}$$

where x̂^(t) is the predicted value and x^(t) is the measured value.

However, because different variables have different characteristics, their prediction errors are measured on different scales. Therefore, in this paper, the anomaly scores from (12) are not used directly but are first standardized. Consider the set of anomaly scores a = {a₁, a₂, …, a_N} of the different variables acquired at time t; the anomaly scores of this group are standardized by using (13), where μ is the mean and σ is the standard deviation.

$$\tilde{a}_i = \frac{a_i - \mu}{\sigma} \tag{13}$$

Meanwhile, in order to further reduce the influence of the error of the spike noise type in the prediction process of the model on the anomaly detection results, this paper uses the exponential moving average method to generate smooth anomaly scores.
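The standardization and smoothing of anomaly scores can be sketched as below. Two points are assumptions rather than details from the paper: the mean and standard deviation are computed per variable over time (the `axis=0` choice), and the smoothing factor `beta` is a hypothetical parameter name and value.

```python
import numpy as np

def standardize(scores: np.ndarray) -> np.ndarray:
    """Z-score a (T, N) array of anomaly scores column by column."""
    mean = scores.mean(axis=0, keepdims=True)
    std = scores.std(axis=0, keepdims=True)
    return (scores - mean) / np.clip(std, 1e-12, None)

def ema(values: np.ndarray, beta: float) -> np.ndarray:
    """Exponential moving average: s_t = beta * a_t + (1 - beta) * s_{t-1}."""
    smoothed = np.empty_like(values, dtype=float)
    smoothed[0] = values[0]
    for t in range(1, len(values)):
        smoothed[t] = beta * values[t] + (1.0 - beta) * smoothed[t - 1]
    return smoothed

# Two variables whose errors live on very different scales.
raw = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
z = standardize(raw)

# A single spike is damped and spread over the following steps.
spike = ema(np.array([0.0, 0.0, 10.0, 0.0, 0.0]), beta=0.5)
```

After standardization the two columns become identical, so neither variable dominates the combined score, and the moving average halves an isolated spike instead of letting it trigger an alarm at full height.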

3.2.4. Anomaly Detection

The model realizes the judgment of abnormal data by defining a threshold, and the data output by the model with an anomaly score higher than the threshold is regarded as abnormal. Determining an appropriate anomaly threshold is challenging because it directly affects the sensitivity and accuracy of anomaly detection models. Some existing works use some dynamic outlier threshold selection algorithms based on extreme value theory, but the algorithms are complex and difficult to understand. Therefore, an easily understood threshold selection algorithm is used in this study, which takes the maximum anomaly score on the validation dataset as the threshold of the dataset.
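The threshold rule described above is simple enough to state in a few lines. The sketch assumes one scalar anomaly score per timestamp; the function names are illustrative.

```python
import numpy as np

def fit_threshold(validation_scores) -> float:
    """Use the maximum anomaly score seen on the validation data as the threshold."""
    return float(np.max(validation_scores))

def detect(test_scores, threshold: float) -> np.ndarray:
    """Flag timestamps whose score is strictly higher than the threshold."""
    return np.asarray(test_scores) > threshold

threshold = fit_threshold([0.2, 0.9, 0.4])
flags = detect([0.5, 0.9, 1.3], threshold)
```

Because the validation data is assumed to be normal, any test score that exceeds every validation score is treated as abnormal; a score equal to the threshold is not flagged.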

3.3. Real-Time Anomaly Detection System

Although anomaly detection has been applied in various fields, real-time anomaly detection in the context of high-dimensional time series data is still a challenge. In the process of large-scale facility monitoring, there are often tens of thousands of sensor devices. Using deep learning methods to model time series data with high dimension will face unacceptable overhead. We design a real-time anomaly detection system based on associative partitioning, which combines the Apache Flink stream processing framework with the deep anomaly detection model described above. The design of the system aims to make full use of the advantages of Flink’s stream processing and the powerful representation learning ability of deep learning models to achieve fast and accurate anomaly detection for high-dimensional real-time data streams. Aiming at the problem of low computational efficiency of anomaly detection methods for high-dimensional data, we combine the association graph structure in anomaly detection methods and use the idea of correlation partition to reduce the computational load and improve the parallel performance of anomaly detection. At the same time, the distributed ability of Flink framework is used to realize efficient real-time anomaly detection.

3.3.1. Data Partitioning Module

The data partitioning module is responsible for classifying the original time series data to organize the data effectively and provide a distributed basis for the subsequent real-time anomaly detection module based on Flink. The partition module is divided into two stages: offline stage and online stage. In the offline stage, it mainly realizes the data partition and partition structure storage. In the online stage, the partition structure is loaded, and the real-time data stream is partitioned and delivered. Figure 2 shows the structure of the data partitioning module.

The purpose of the partitioning strategy is to divide the data nodes with strong correlation into the same partition, but the data between partitions is not strongly correlated. Firstly, a preliminary division is made according to the physical area of the data measurement points. This step helps to aggregate data points that are adjacent to the physical area together and establish physical links, which lays the foundation for subsequent processing. Secondly, in each physical area, the partition module uses the partition algorithm based on spectral clustering to partition the data measurement points. Spectral clustering is a clustering method based on graph theory and matrix characteristics, which is mainly applied to the segmentation of graph data. It maps the graph data into a low-dimensional subspace via eigenvalue decomposition of the Laplacian matrix, followed by K-means clustering in the reduced space. Spectral clustering is chosen for its ability to handle graph-structured data and preserve global structural information, which is critical for capturing cross-variable dependencies in industrial time series. The computational complexity of spectral clustering depends primarily on the eigenvalue decomposition of the Laplacian matrix (O(n³)) and the K-means iteration (O(tkn)), where n is the number of nodes, t the iterations, and k the number of clusters. To ensure efficiency, we limit the number of clusters during the initial phase and adopt a multiscale clustering strategy to dynamically adjust the granularity of partitioning based on the correlation strength. Once partitions are determined, the time series data are streamed to the Apache Flink cluster for distributed anomaly detection.
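A compact NumPy sketch of spectral clustering on a correlation graph is shown below. It is a simplified illustration under assumptions: it uses the symmetric normalized Laplacian, a plain K-means with farthest-point initialization, and a toy affinity matrix; the multiscale strategy and the physical-area pre-partitioning described above are not shown.

```python
import numpy as np

def kmeans(points: np.ndarray, k: int, iters: int = 50) -> np.ndarray:
    # Deterministic farthest-point initialization.
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def spectral_partition(affinity: np.ndarray, k: int) -> np.ndarray:
    """Assign each node of a symmetric affinity graph to one of k partitions."""
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.clip(degree, 1e-12, None))
    n = affinity.shape[0]
    laplacian = np.eye(n) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the k smallest eigenvectors.
    _, vecs = np.linalg.eigh(laplacian)
    embedding = vecs[:, :k]
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    embedding = embedding / np.clip(norms, 1e-12, None)
    return kmeans(embedding, k)

# Toy graph: two groups of three strongly correlated points, weakly linked to each other.
affinity = np.full((6, 6), 0.01)
affinity[:3, :3] = 1.0
affinity[3:, 3:] = 1.0
np.fill_diagonal(affinity, 0.0)
labels = spectral_partition(affinity, k=2)
```

The eigendecomposition is the O(n³) step mentioned above, which is why the number of measurement points per physical area matters for the cost of the offline stage.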

3.3.2. Anomaly Detection Module

Anomaly detection module is the key component of the system. Its main responsibility is to detect anomalies in real-time streaming data, and solve problems such as data merging and missing values processing. This module is deployed on multiple servers in a distributed Flink cluster. Through this distributed structure, the system is able to process a large amount of streaming data more efficiently in order to detect potential abnormal events in time. Figure 3 shows the structure of the anomaly detection module.

The data that the anomaly detection module receives from the upstream message queue has a single-point structure, in which each record consists of a timestamp, a point ID, and a measurement value. However, the deep anomaly detection model needs multidimensional input. Therefore, the single-point records need to be aggregated into a wide table by timestamp, where each row represents a timestamp and each column represents the measurement value of one point at that timestamp. We have implemented a data aggregation operator based on Flink’s window functionality to concatenate the data. In the anomaly detection stage, we design a deep anomaly detection operator to realize anomaly detection on real-time streaming data. This operator has two functions in the Flink program: it connects to the upstream data source to perform real-time anomaly detection, and it connects to the downstream output operator to raise the system's anomaly alarms.
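The pivot from single-point records to a wide table can be illustrated in plain Python. This sketch only shows the reshaping logic; it does not use the Flink API, and the record layout `(timestamp, point_id, value)` and the `missing` placeholder are assumptions for illustration.

```python
from collections import defaultdict

def to_wide_rows(records, point_ids, missing=None):
    """Group (timestamp, point_id, value) records into one row per timestamp."""
    by_time = defaultdict(dict)
    for timestamp, point_id, value in records:
        by_time[timestamp][point_id] = value
    # Columns follow the fixed order of point_ids; absent points get the placeholder.
    return [
        (timestamp, [by_time[timestamp].get(pid, missing) for pid in point_ids])
        for timestamp in sorted(by_time)
    ]

records = [
    (1, "p1", 0.5), (1, "p2", 1.5),
    (2, "p2", 1.7), (2, "p1", 0.6),
    (3, "p1", 0.7),  # p2 is missing at timestamp 3
]
rows = to_wide_rows(records, point_ids=["p1", "p2"])
```

In a streaming setting the same grouping would be done per window, and rows with a placeholder value are where missing-value handling would apply before the model is called.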

4. Experiments

4.1. Datasets and Experiments Environment

4.1.1. Datasets

In the experiment, the multivariate time series anomaly detection public datasets SWaT and WADI are used. The SWaT dataset was collected by the iTrust laboratory in Singapore in the Water Treatment testbed. Data were acquired from 51 sensors and actuators of critical infrastructure systems under continuous operation, totaling 11 days of recordings. Among them, the normal time series data were recorded in the first 7 days as the training dataset, and in the last 4 days, the laboratory attacked the water treatment testbed and collected the time series data with abnormal conditions as the test dataset. The WADI dataset was collected from the Water Distribution System testbed as an extension of the SWaT dataset. The time series data generated by the continuous operation of the system for a total of 16 days are included, of which the regular operation for 14 days is the training set and the attack scenario for 2 days is used as the test set. The number of datasets, dimensions, and ratio of anomalies are listed in Table 1.

Table 1. Datasets.

Name	Dimension	Train	Test	Anomaly ratio (%)
SWaT	51	495000	449919	11.97
WADI	123	784537	172801	5.99

4.1.2. Experiments Environment

The experiments are conducted on a Dell T5820 workstation with Windows 10 Professional operating system, 64 GB memory, Intel Xeon W-2223 3.60 GHz CPU, and a single NVIDIA GeForce RTX 4090 24 GB GPU.

4.2. Baseline

A total of four time series anomaly detection methods are used as the baseline methods of the experiment, including the traditional method IF [9], the reconstruction based method USAD [18], the prediction based method LSTM [19] and the GNN based method GDN [12]. In order to facilitate the presentation and comparison, EGAD (EGAT for anomaly detection) is used as the name of the anomaly detection method proposed in this paper.

4.3. Evaluation Metrics

In the experiments, standard evaluation metrics commonly used in time series anomaly detection tasks, namely Precision, Recall, and F1 score, are used to comprehensively evaluate the performance of the proposed anomaly detection method. Specifically, these evaluation metrics are calculated in the following equations:

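$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.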
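As a small worked example, the metrics can be computed from binary labels as follows. The function names are illustrative, and the final line checks the F1 definition against the EGAD precision and recall reported for SWaT in Table 2.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])

# Consistency check with Table 2: P = 97.20, R = 72.92 for EGAD on SWaT.
egad_swat_f1 = f1_from(97.20, 72.92)
```

The toy labels give precision, recall, and F1 of 2/3 each, and the Table 2 values reproduce the reported F1 of about 83.33.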

4.4. Results

The experimental results on the SWaT and WADI public datasets are shown in Table 2. Compared with the four baseline methods, the proposed method EGAD achieves the highest recall and F1 score on both datasets, while GDN attains slightly higher precision.

Table 2. Main results.

Model	SWaT P (%)	SWaT R (%)	SWaT F1 (%)	WADI P (%)	WADI R (%)	WADI F1 (%)
IF	51.92	70.66	59.86	43.70	25.71	32.37
LSTM	91.21	58.60	71.36	87.86	27.52	41.91
USAD	95.49	58.59	72.62	93.24	28.26	43.37
GDN	98.31	68.25	80.57	93.50	43.18	59.08
EGAD	97.20	72.92	83.33	90.83	46.32	61.35

The experimental results show that all models perform better on SWaT than on WADI. The WADI dataset has a much higher data dimension than SWaT, and its anomaly rate of only 5.99% is lower than the 11.97% of SWaT. The abnormal samples in WADI are therefore sparser, and the higher dimension also increases the complexity of node association. The IF algorithm performs relatively poorly, especially on the WADI dataset, indicating that traditional anomaly detection algorithms struggle to fully capture the characteristics of complex multidimensional time series data; EGAD improves substantially over such methods. Although USAD performs well in precision, it has a low recall. Compared with USAD, EGAD improves the F1 score by 10.71 and 17.98 percentage points on SWaT and WADI, respectively. The GNN-based method GDN shows strong performance on both datasets, but EGAD still improves the F1 score over GDN by an average of 2.52 percentage points across the two datasets. This is because EGAD realizes dynamic graph connection and integrates edge information, which allows it to fully mine the correlations in the data, so it obtains the best overall detection performance as measured by F1.
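The F1 differences quoted above follow directly from Table 2, as this short check shows. It uses `Decimal` so that the two-decimal table values are subtracted exactly; the dictionary layout is only for illustration.

```python
from decimal import Decimal

# F1 scores (SWaT, WADI) copied from Table 2.
f1 = {
    "USAD": (Decimal("72.62"), Decimal("43.37")),
    "GDN": (Decimal("80.57"), Decimal("59.08")),
    "EGAD": (Decimal("83.33"), Decimal("61.35")),
}

def gain(model: str, baseline: str):
    """Per-dataset F1 difference between a model and a baseline, in percentage points."""
    return tuple(m - b for m, b in zip(f1[model], f1[baseline]))

usad_gain = gain("EGAD", "USAD")
gdn_gain = gain("EGAD", "GDN")
gdn_mean_gain = sum(gdn_gain) / 2
```

The gains over USAD are 10.71 and 17.98 points, and the gains over GDN are 2.76 and 2.27 points, whose mean of 2.515 corresponds to the 2.52 quoted in the text.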

4.5. Ablation

The ablation experiment verifies the contribution of each functional module, and the experimental results are shown in Table 3. In the table, w/o embedding removes the encoding module, w/o edge removes the edge encoding, w/o output removes the output structure of the GAT, and w/o GAT removes the GNN. Removing any module of the EGAD model lowers all three anomaly detection indicators P, R, and F1.

Table 3. Ablation results.

Model	SWaT P (%)	SWaT R (%)	SWaT F1 (%)	WADI P (%)	WADI R (%)	WADI F1 (%)
W/o embedding	94.25	58.70	72.34	82.14	40.79	54.51
W/o edge	96.68	69.03	80.55	84.31	43.12	57.06
W/o output	96.16	68.48	79.99	90.56	42.28	57.65
W/o GAT	88.28	54.14	67.12	83.13	34.14	48.40
EGAD	97.20	72.92	83.33	90.83	46.32	61.35

4.6. Parameter Analysis

In this experiment, we mainly verify the influence of the model hyperparameters on the experimental results. We study the three main parameters of the model: the time window length w, the window step size s, and the k value of the graph TopK to explore their influence on the F1 score. Except for the above target parameters, the remaining hyperparameters remain unchanged. The experiments are conducted on both SWaT dataset and WADI dataset.

The experimental results are shown in Figure 4. On both datasets, the model performs best when the time window length is 100. When the window length is 10, the model has the worst performance, with F1 scores of 52.19% and 40.25% for SWaT and WADI datasets, respectively. This is because a shorter window length fails to capture the long-term dependencies of the system, leading to performance degradation. After the window length exceeds 100, the F1 score starts to decrease because the redundant information increases and the model becomes too focused on the historical data and ignores the current state, which leads to a decrease in detection performance. The model achieves the best performance with a window step size of 20. When the window step size is 5, the F1 scores of SWaT and WADI datasets are decreased by 2.78% and 2.70%, respectively. Due to the tighter input data, although some redundant information is brought, the time series information is well preserved, and the performance only decreases slightly. When the window step size increases to 100, the F1 scores of SWaT and WADI datasets decrease by 4.37% and 5.30%, respectively, and tend to continue to decrease, because large time span will lose a lot of timing information, leading to significant performance degradation. On the SWaT dataset, the model has the best performance when the k value is 13, and the F1 score is 83.33%. In the WADI dataset, due to its higher dimensionality, the best performance is achieved with a k value of 30, with an F1 score of 61.35%.
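To make the roles of the window length w and the step size s concrete, the sketch below cuts a multivariate series into model inputs and single-step prediction targets. It follows the window definition given in Section 3.2.2, where the w points before time t are used to predict time t; the function name and array layout are assumptions.

```python
import numpy as np

def sliding_windows(series: np.ndarray, w: int, s: int):
    """series: (T, N). Returns inputs of shape (num, w, N) and targets of shape (num, N)."""
    # Each window covers rows [i, i + w) and its target is row i + w.
    starts = range(0, series.shape[0] - w, s)
    inputs = np.stack([series[i:i + w] for i in starts])
    targets = np.stack([series[i + w] for i in starts])
    return inputs, targets

# Toy series with T = 10 time points and N = 2 variables.
series = np.arange(20).reshape(10, 2)
inputs, targets = sliding_windows(series, w=3, s=2)
```

A smaller step produces more, heavily overlapping windows, while a larger step skips time points between consecutive windows, which matches the trade-off discussed above.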

4.7. System Test

4.7.1. Data Generator

Current public anomaly detection datasets have relatively low data dimensions, which limits comprehensive verification of the performance of anomaly detection systems. To remedy this deficiency, this study designs an efficient multidimensional data generator that simulates high-dimensional real-time streaming data in practical application scenarios. The test data are generated according to real industrial scenarios and include timestamps, point names, point values, and point locations. The point values are continuous or discrete and are used to model sensor data and data from actuators such as valves. The generator converts the data into JSON strings and transmits them to the Kafka message queue, with the data for each measurement point sent as a single message.
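A generator of this kind could be sketched as follows. The JSON field names, the value distributions, the share of discrete points, and the location labels are all hypothetical, and the `send` callable stands in for a Kafka producer call so that the example runs without a broker.

```python
import json
import random
from typing import Callable, Iterator


def make_point_messages(num_points: int, timestamp_ms: int, rng: random.Random,
                        discrete_every: int = 10) -> Iterator[str]:
    """Yield one JSON string per measurement point for a single time step."""
    for idx in range(num_points):
        # Assumption: every tenth point is a discrete actuator (e.g. a valve),
        # the others are continuous sensor readings.
        is_actuator = idx % discrete_every == 0
        value = rng.choice([0, 1]) if is_actuator else round(rng.gauss(50.0, 5.0), 3)
        yield json.dumps({
            "timestamp": timestamp_ms,             # hypothetical field names
            "point_name": f"P{idx:04d}",
            "point_value": value,
            "point_location": f"zone-{idx // 50}",
        })


def run_generator(num_points: int, num_steps: int,
                  send: Callable[[str], None], seed: int = 0) -> int:
    """Emit num_steps time steps of data; each point is sent as its own message."""
    rng = random.Random(seed)
    base_ms = 1_700_000_000_000  # arbitrary start time in milliseconds
    sent = 0
    for step in range(num_steps):
        for message in make_point_messages(num_points, base_ms + step * 1000, rng):
            send(message)  # in a deployment this would hand the string to a Kafka producer
            sent += 1
    return sent


buffer = []
count = run_generator(num_points=1000, num_steps=3, send=buffer.append)
print(count)                  # 3000
print(json.loads(buffer[0]))  # first message, decoded back into a dict
```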

4.7.2. Test Result

We compare the detection efficiency of anomaly detection on a single node with that of the real-time anomaly detection system designed in this paper. Both setups use the same deep anomaly detection model and process the same amount of data. The data generator produces 1000-dimensional time series data with 1000 time points, for a total of 1 million data points. The data partition module divides the data into 20 partitions, each containing 50 dimensions. The detection input window length is 100 time points and the window step is 10. The single-machine baseline runs the anomaly detection method on one machine. Figure 5 shows the time taken on the same dataset by the single-machine method and by the proposed real-time anomaly detection system running on a Flink cluster.
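The sizes in this setup can be checked with a short sketch. Splitting the dimensions into contiguous index blocks is an assumption made only for illustration; the data partitioning module of Section 3.3.1 may group measurement points differently. The function names are likewise hypothetical.

```python
from typing import List


def partition_dimensions(num_dims: int, num_partitions: int) -> List[List[int]]:
    """Split dimension indices 0..num_dims-1 into equally sized contiguous blocks."""
    if num_dims % num_partitions != 0:
        raise ValueError("num_dims must be divisible by num_partitions")
    size = num_dims // num_partitions
    return [list(range(p * size, (p + 1) * size)) for p in range(num_partitions)]


def window_count(num_time_points: int, w: int, s: int) -> int:
    """Number of full windows of length w taken every s time points."""
    if num_time_points < w:
        return 0
    return (num_time_points - w) // s + 1


partitions = partition_dimensions(num_dims=1000, num_partitions=20)
print(len(partitions), len(partitions[0]))  # 20 50
print(1000 * 1000)                          # 1000000 data points in total
print(window_count(1000, w=100, s=10))      # 91 windows per partition
```

Under these settings each partition yields 91 input windows, each covering 100 time points of 50 dimensions.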

In the experimental results, the detection time is 98.8 s on a single machine, whereas the Flink-based real-time detection system takes 135.4 s when the parallelism is set to 1. At the lowest parallelism, the system is limited by the processing power of a single computing node and cannot exploit the advantages of distributed computing. In addition, the Flink cluster introduces extra overhead, so the detection time is higher than in the standalone case. When the parallelism of the task is increased to 4, the detection time falls by 77.9 s compared with a parallelism of 1, because Flink can then use the distributed computing power of multiple nodes. The data partitioning strategy adopted by the system divides the data nodes into regions, which is highly consistent with Flink's processing mode and further improves data processing efficiency. When the parallelism is further increased to 8, the detection efficiency does not improve significantly: the gain over a parallelism of 4 is only 18.3 s, which indicates that the system has reached a performance bottleneck. When using this system, the effect of the parallelism setting on performance should therefore be considered, and the parallelism should be chosen according to the requirements and available resources of the specific task.
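The reported figures imply the following detection times and speedups. The 18.3 s figure for a parallelism of 8 is read here as a further reduction in detection time relative to a parallelism of 4; that reading is an assumption, so the last row should be treated as indicative only.

```python
single_machine_s = 98.8  # reported single-machine detection time
flink_p1_s = 135.4       # reported Flink detection time at parallelism 1

# Parallelism 4 is reported as 77.9 s faster than parallelism 1.
flink_p4_s = round(flink_p1_s - 77.9, 1)

# Assumption: parallelism 8 is a further 18.3 s faster than parallelism 4.
flink_p8_s = round(flink_p4_s - 18.3, 1)


def speedup(baseline_s: float, measured_s: float) -> float:
    """Ratio of baseline time to measured time; values above 1 mean faster."""
    return baseline_s / measured_s


for label, seconds in [("Flink, parallelism 1", flink_p1_s),
                       ("Flink, parallelism 4", flink_p4_s),
                       ("Flink, parallelism 8 (assumed)", flink_p8_s)]:
    ratio = speedup(single_machine_s, seconds)
    print(f"{label}: {seconds:.1f} s, {ratio:.2f}x relative to a single machine")
```

At a parallelism of 1 the ratio is below 1, which reflects the cluster overhead described above, while at a parallelism of 4 the system is roughly 1.7 times faster than the single machine.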

Next, we evaluate the performance of the real-time anomaly detection system, focusing on its delay and throughput. These two key metrics directly affect the practicability and efficiency of the system when dealing with large-scale real-time data streams. The experimental results are shown in Figure 6.

The delay of the real-time anomaly detection system grows with the data dimension. With the existing computing resources, the delay shows no significant increase and remains within 1000 ms while the data dimension is below 1500. Once the data dimension exceeds 1500, the system latency increases significantly. This is because the computational cost grows steeply with the data dimension, for example in the adjacency matrix construction and attention computations of the GNN, while memory usage and distributed communication overhead escalate at the same time. Additional computing resources are required to address this issue. In terms of throughput, the system throughput rises as the data dimension increases and reaches its peak at 1500 dimensions, with a throughput exceeding 20000. Beyond 1500 dimensions, throughput declines because the system encounters computational bottlenecks. In practical applications, it is crucial to balance throughput against latency and to adjust computing resources promptly according to the data scale, so that the real-time system maintains good performance.
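For reference, delay and throughput of a streaming job can be derived from per-record timestamps as sketched below. The paper does not state how the two metrics were measured or the unit of the throughput figure, so the definitions used here (mean end-to-end latency and records per second) and all names are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecordTiming:
    ingest_ms: int  # time the record entered the system
    emit_ms: int    # time the detection result for the record was produced


def mean_latency_ms(timings: List[RecordTiming]) -> float:
    """Average end-to-end delay per record, in milliseconds."""
    if not timings:
        raise ValueError("no timings recorded")
    return sum(t.emit_ms - t.ingest_ms for t in timings) / len(timings)


def throughput_per_s(timings: List[RecordTiming]) -> float:
    """Records processed per second between the first ingest and the last emit."""
    if not timings:
        raise ValueError("no timings recorded")
    span_ms = max(t.emit_ms for t in timings) - min(t.ingest_ms for t in timings)
    if span_ms <= 0:
        raise ValueError("time span must be positive")
    return len(timings) / (span_ms / 1000.0)


# Synthetic example: 100 records ingested every 10 ms, each delayed by 400 ms.
timings = [RecordTiming(ingest_ms=i, emit_ms=i + 400) for i in range(0, 1000, 10)]
print(mean_latency_ms(timings))             # 400.0
print(round(throughput_per_s(timings), 2))  # 71.94
```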

5. Conclusions

In this work, we propose an anomaly detection system for high-dimensional industrial time-series data. To achieve accurate anomaly detection, we propose a GNN-based anomaly detection method. To address low detection efficiency, we combine data partitioning with Apache Flink to implement distributed real-time anomaly detection. Future work can improve the proposed method by using lightweight graph structure modeling to reduce computational complexity, integrating approximate inference algorithms to reduce real-time inference overhead, and applying heterogeneous computing acceleration strategies to improve distributed processing efficiency. These efforts aim to overcome the real-time bottleneck for high-dimensional data while maintaining detection accuracy.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This work was supported by the National Key Research and Development Program Project (2022YFC3202600) and the Jiangsu Province Water Conservancy Science and Technology Project (2022003).

Open Research

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.
