Volume 2022, Issue 1 7276756
Research Article
Open Access

[Retracted] Dynamic Analysis of Network Protocol Hiding Behavior Based on Perceptual Mining

Xiaoyan Xu

Corresponding Author

Xiaoyan Xu

School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China qust.edu.cn

Search for more papers by this author
First published: 27 February 2022
Citations: 2
Academic Editor: Gengxin Sun

Abstract

This paper analyzes the typical abnormal behaviors such as hidden and invisible attacks of network protocols and proposes a new method to perceive and mine the abnormal behaviors of protocols so as to evaluate the operational security of protocols. Aiming at the problems of long latency and difficulty to expose the hidden behavior of network protocols, a scheme combining dynamic stain analysis and instruction clustering analysis is proposed. The proposed method cannot only quickly distinguish the hidden behavior but also accurately grasp the hidden behavior instruction sequence. Experimentation results show that the proposed method can mine unknown protocols for abnormal behavior with high accuracy.

1. Introduction

Network protocol is the basis of cyberspace and one of the most important research fields in information science. With the rapid expansion of network development scale and application fields, the security problems caused by network protocols are becoming more and more prominent. As an important technical basis for active defense and network attack and defense confrontation, network protocol abnormal behavior analysis can preperceive and immediately find a variety of abnormal behaviors hidden in network protocols, quickly detect and locate network fault points and problems affecting performance and take decisive measures to quickly solve problems so as to shorten the network failure time.

A network protocol defines the communication rules between one or more entities, and the protocol specification is usually composed of vocabulary and grammar [1]. Vocabulary is a collection of messages and their formats, and grammar defines all the process rules. It is expressed as an effective sequence of messages, which is usually described by state machines. In the process of network communication, each network entity usually carries out operations such as message parsing, business logic, message encapsulation, and state transformation. Inferring an unknown protocol usually requires knowledge of vocabulary and grammar, which are mainly covered by the current protocol’s reverse analysis.

The behavior security of network protocols is the basis and guarantee of cyberspace security. However, at present, there is no complete and clear understanding of protocol behavior. The definition, mining, and analysis methods of protocol behavior need to be discussed. So far, there is still no set of protocol behavior models or protocol behavior analysis methods to ensure the safe operation of network protocol. Therefore, this paper explores the analysis methods of protocol behavior, studies the basic characteristics of normal behavior and abnormal behavior, and studies the analysis methods of abnormal behavior of unknown protocols with the analysis and mining of protocol behavior.

At present, there are two research hotspots on protocol reverse analysis [2]: message format identification and protocol behavior pattern identification. The main analysis methods include network message flow analysis [3] and instruction sequence analysis based on parsing messages [4]. The analysis based on instruction sequence has high accuracy and focuses on analyzing the instruction sequence related to network messages in the communication process of the protocol program. The knowledge base of instruction sequence analysis is that protocol entities parse protocol messages by using protocol specification. Through analyzing the procedure of a protocol entity parsing protocol messages, we can accurately obtain the inside structure and semantics of protocol messages.

The early protocol reverse mainly analyzed network traffic. The pioneer of protocol reverse analysis introduced the multiple sequence alignment (MSA) algorithm [5] in bioinformatics into the reverse analysis of protocol message formats. Xiao et al. [6] extracted the protocol message specification and the data output through the protocol program by statically analyzing the program binary code. The method needs users to give the prototype function of data input to the data output buffer. However, this kind of function is often difficult to provide. For example, when the program does not output data, the writing data function would not be easy to obtain. Kleber et al. [7] proposed a scheme to mine the message format based on the data change distribution characteristics of the message. The theoretical basis of the scheme is to statistically analyze the change characteristics of the value range of each field in the message. This method first statistically analyzes the frequency distribution of each byte in a session, then statistically analyzes the frequency distribution of each byte in multiple sessions, and finally infers which byte sequence is a keyword or delimiter according to the frequency distribution.

Private protocols, especially malware protocols, often use encryption technology to protect the transmitted messages, which greatly limits the performance of protocol inference algorithms. Some existing solutions are committed to directly extracting text information before encryption by bypassing encryption methods, but these methods would face great difficulties when encountering strong or customized encryption schemes. Traditional protocol reverse analysis only focuses on the syntax of unknown protocols, the inference and recovery of message formats, and does not involve the study of protocol behavior, which is closely related to network security. Protocol software is a special kind of software. Category theory and type theory are the mathematical basis of software behavior. The granularity of protocol behavior is different. Existing research mostly uses system call sequence to represent the behavior of protocol software [8,9]. Due to the basic and key role of system calls, the related attributes of system calls have become an important basis for describing process behavior and implementing intrusion detection. Liu et al. [10] proposed a method to implement process behavior modeling based on static analysis of executable files. Its basic idea is to establish a set of all possible system calls in the process of process execution by statically analyzing the executable code of the application program and monitoring the process based on the system call set.

2. Hidden Behavior Mining of Network Protocols Based on Cluster Analysis

Protocol reverse analysis can extract the type and message format of an unknown protocol, infer its internal structure, analyze protocol behavior, and obtain other important information about the protocol when the protocol specification cannot be obtained. The great majority of protocol reverse analysis focuses on analyzing and deducing the specifications and fields of unknown protocols’ messages but pays little attention to the behavior of protocols. In view of the importance and basic role of protocol behavior analysis, we believe that protocol behavior, especially hiding behavior, affects the security of protocol operation and is directly bound up with the basis of cyberspace security.

Because of the lack of knowledge of unknown protocol specifications and other details, protocol reverse analysis might become the best method to research protocol hiding behavior. Hiding behavior usually hides or disguises itself through binary code of a specific protocol, so this key information is suitable for reverse acquisition from the program’s native code. In this paper, a method for mining the hiding behavior of unknown protocols through combining dynamic analysis and instruction clustering analysis is proposed. The method only relies on the original binary protocol data, which can quickly cluster the protocol behavior and finally expose the hidden behavior.

In this paper, a virtual analysis platform is developed based on the analysis of the message processing process of an unknown protocol binary program. According to the common instruction-level characteristics, the analysis platform adopts the scheme of instruction clustering to automatically group the protocol behaviors and generate different behavior clusters.

This method describes the protocol, protocol behavior, and hidden behavior analysis from a new perspective. Related concepts are defined as follows:

Network protocols [11]: P can be regarded as a collection composed of protocol program C and protocol message S. P  = (C, S) is defined as a binary and S is generated by C. A protocol program C = {c1, c2, …, cn|nN} is defined as a collection of functions, which is generated from instruction sequences cj, cj = {i1, i2, …, ik}(k > 0,1 ≤ jn) is an ordered sequence of executable instructions. S = (Ss, Sr) represents a binary, where Ss represents a collection of sent messages and Sr represents a collection of received messages.

Protocol behavior: B : SC is regarded as the mapping from S to C. The behavior of a protocol comes from the message parsing process. Protocol behavior B can be described as: Br = Recieve(S), Bp = Process(S),  Bg = Generate(S),  Bs = Send(S), and Bt = Trigger(S) = BpublicBhidden, where Bpublic is the public behavior and Bhidden is the hidden behavior. The relationship between the behaviors of each part is BrBpBgBsBtB, and BrBpBgBsBt = ∅.

Public behavior instruction sequence: it is the expression of protocol disclosure behavior at the instruction level. They are usually instruction sequences that can be captured by dynamic analysis. Protocol hiding behavior analysis based on public behavior instruction sequence.

Hidden behavior instruction sequence [12]: protocol hiding is the expression of protocol hiding behavior at the instruction level. They need special methods to mine and trigger execution.

Analysis of protocol hidden behavior: it refers to the process of obtaining C and capturing S, guarding and analyzing C, and parsing S without protocol specification.

Both public and hidden behaviors are represented by instruction sequences, which makes it difficult to calculate the semantic differences between them. For protocol analysts, two behavior instruction sequences may contain a large number of the same instructions, or there is a small Hamming distance between them, but from the perspective of behavior, they have very different functions. For this reason, we use the distribution of gene instructions to construct the evaluation method. Bhidden = Disthidden(F, D, C) is the gene instruction distribution vector of hidden behavior, Bpublic = Distpublic(F, D, C) is gene instruction distribution vector of public behavior. Therefore, the characteristic distance between public behavior and hidden behavior is used as the evaluation criterion of protocol operation security, it is denoted as RS
(1)

The value of RS indicates the distance between public behavior and hidden behavior, which means the deviation degree of hidden behavior. The larger the RS value, the higher the possibility that the protocol may have hidden behavior. It also shows the security situation of the protocol operation. Massive case researches show that when RS = 0, it means that there is no significant difference between public behavior and hidden behavior, and the behavior of the protocol is visible. We believe that the operation of the protocol is safe in this situation.

Protocol behavior is regarded as a collection of all instruction sequences of the protocol.

Protocol behavior instruction sequence: a specific sequence of instructions defined to the protocol program parsing protocol data, which is expressed as A = {a1, a2, …, an|nN}. Instruction granularity aj = {i1, i2, …, ik}(1 ≤ jn, k ≥ 1) is a single machine instruction or a basic program cycle time that consists of some machine instructions and application program interface.

Hidden transformation: A = {a1, a2, …, an|nN} is a protocol behavior instruction sequence, Torm(A, D) = D(A) is a hidden transformation of instruction sequence A by using the D method. Hidden transformation can be realized by a variety of confusion methods, such as encryption, data conversion, and dynamic code generation.

Antihidden transformation: it is the inverse process of hiding transformation. For a given hidden transformation Torm(A, D), the antihidden transformation is defined as Re(Torm(A, D)) = D−1(Torm(A, D)) = A.

Protocol behavior trigger condition: A = {a1, a2, …, an|nN} is a sequence of protocol behavior instructions. It contains m conditional jump blocks, the set of all conditional jump blocks can be represented as CMP = {cmpi}. Set O = {oi} is defined as a potential behavior trigger condition. If all comparison instructions related to oi trigger the execution of instruction sequence A, it is said that O triggers A.

A new message that triggers hidden behavior: if M is the protocol message space and B is the behavior space, the operation of the protocol can be described as the mapping function F : MB from the message space M to the behavior space B.

Instruction sequence clustering: C = {c1, c2, …, cn|nN} is a collection of n instruction sequences generated from protocol binary code, ci is a sequence of public behavior instructions captured by dynamic binary analysis. Then the instruction clustering algorithm is applied to the instruction sequence set C, the instruction sequence P = {P1P2PT} formed by T clusters can be obtained. Pi is a deterministic behavior represented by a specific sequence of instructions. Each behavior Pt,  t = 1,2, …, T constitutes a cluster , where k is the quantity of instruction sequences in behavior Pt. For different clusters, k will be distinct. For behavior Pt, the connection matrix M(Pt) is defined as follows:
(2)
Through the connection matrix, the distance between two behaviors P1 and P2 can be defined as follows:
(3)

Given ciP1 and ciP2, if dij(P1, P2) = 1, then P2 is a hidden behavior, otherwise it is not.

The hidden behavior of the protocol [13] can be hidden anywhere in the protocol program in different forms. Simple static analysis is hard to recognize the special transformed instruction sequences, whereas dynamic analysis is also hard to recognize hidden behavior, and it is almost impossible to fully cover total instruction sequences. The proposed analysis scheme in this paper does not care how to hide and transform the hidden behavior instruction sequence, because it has a process of self-recognition and self-transformation anyway, and the hidden behavior instruction sequence is quite distinct from the public behavior instruction sequence. Therefore, based on public behavior instruction sequence analysis and instruction clustering algorithm, hidden behavior instruction sequences and their triggering conditions could be identified and located.

The research method of this paper includes three major steps:

Step 1. we supervise the message processing process of the protocol program and dynamically construct the instruction sequence through basic blocks. Every byte of the message is labeled as tainted data, which can be traced throughout the operation of the protocol. All comparison instructions with regard to tainted data are stored, including jump statements and system call instructions. These characteristics could describe protocol behavior. The analysis scheme could not only label the executed instruction sequence but also infer the unexecuted instruction sequence. The analysis results show that the protocol would contain underlying hidden behavior.

Step 2. static analysis of instruction sequences by an instruction clustering algorithm. Instructions are labeled as three categories of gene instructions, and the labeled instruction sequences can be transformed into n-grams. Based on the instruction clustering method, n-gram characteristics from these instruction sequences can be collected and some behavioral clusters can be established. The instruction clustering procedure is shown in Figure 1.

In Figure 1, instruction sequences A and B represent two captured public behaviors, while P1, P2, and P3 represent three unknown protocol program binaries. By using an instruction clustering algorithm, the mined behavior P1.A conforms to instruction sequence A, so the two gathers to forge behavior clustering 1. Newly recognized behavior P1.B and P2.B are very similar to the known instruction sequence B, then they can be clustered together to forge behavior clustering 2. If excavated behavior P3.C is completely distinct from behaviors A and B, then it forges an independent cluster, and P3.C is likely to be an underlying occult behavior.

The hiding behavior of the protocol can be hidden anywhere in the protocol program. In order to make up for the deficiency of protocol reverse analysis, a virtual analysis scheme platform is designed and constructed. The scheme platform is used to reverse analyze the hidden behavior of unknown protocols, regardless of their transformation mode. It can capture the hidden behavioral instruction sequence of the protocol, trigger its execution, and achieve more comprehensive information on protocol behavior. The proposed analysis scheme platform framework is shown in Figure 2.

Details are in the caption following the image
An example of instruction cluster generation.
Details are in the caption following the image
The proposed analysis framework.

3. Experiment and Analysis

In the analysis, the best features that distinguish hidden behavior from normal behavior should be selected and used in the training and classification process. If the existing feature-based solutions are used, mining will become extremely difficult for a variety of hidden behaviors, at least in experiments. Therefore, we detect the hidden behavior of the protocol through instruction clustering analysis.

The experimental results reflect ordinary message flow analysis or dynamic binary analysis, which makes it hard to obtain the hidden behavior characteristics of a protocol. We start with the cluster analysis of the instruction category, the quantity, and execution frequency of the protocol program, and then analyze the open and hidden behavior characteristics of the protocol. From a microscopic point of view, all instructions consist of three gene categories: function system call instructions (F), control flow instructions (C), and data processing instructions (D). The different behaviors of the protocol could be made a clear distinction by cluster analysis category, quantity, and frequency of gene instructions. Behavioral instruction sequences are labeled as three types of gene instruction tags. It is assumed that the protocol binary is composed of labeled gene instructions, which are grouped into clusters. Based on marked behavior instruction sequence, all the behaviors of a protocol sample would be automatically clustered into behavior groups. The behavior instruction sequence in this paper refers to the marked instruction sequence.

Through using the proposed instruction clustering algorithm [14,15], 1297 protocol samples are automatically divided into five behavior clusters, in which CBOT is a sample with hidden behavior, and Hidden-data is a protocol with hidden behavior intentionally developed by us. The cluster analysis results are shown in Figure 3.

Details are in the caption following the image
The proposed analysis framework.

As shown in Figure 3, the number of function calls and control flow-related instructions in protocols with hidden behavior is significantly higher than that in normal protocols. The main behavior of normal protocols focuses on data processing.

The protocol behaviors captured during the experiment were shown in Table 1.

Table 1. Characteristic data description of protocol behavior clustering.
Protocol program Bpublic : Bhidden Distpublic(F1, D1, C1) Disthidden(F2, D2, C2) RS value Security
Cluster 1 5150 : 160 (0.3,0.3,0.4) (0.25,0.25,0.5) 0 Y
Cluster 2 4410 : 120 (0.4,0.3,0.3) (0.3,04.,0.3) 0.015 Y
Cluster 3 4380 : 80 (0.25,0.25,0.5) (0.25,0.25.0.5) 0.055 Y
CBot 2600 : 1390 (0.15,0.15,0.7) (0.7,0.15,0.15) 0.975 N
Hidden-data 4900 : 1880 (0.1,0.45,0.5) (0.7,0.2,0.1) 0.375 N

As shown in Table 1, different distributions of gene instructions mean different protocol behaviors. In the protocol sample of cluster 1, the distribution of gene instructions is exactly the same and there is no difference. Although a small number of hidden behaviors have been excavated, there is no essential difference between these behaviors and public behaviors resulting from the distribution of gene instructions. RS = 0, which indicates that all behaviors are completely open and transparent. Therefore, we believe that this clustering protocol is safe in operation. In some protocols of cluster 2 and cluster 3, we found hidden behavior, and the distribution of gene instructions also showed some differences. The calculation results of RS show that although there are some differences between public behavior and hidden behavior, the feature distance is very small, which is far lower than our preset threshold. Verification execution finds that these so-called hidden behaviors are actually some protection code. Therefore, the two clustering protocols are also considered safe in operation. However, CBot and Hidden-data are completely different. Firstly, the proportion of hidden behavior increased significantly. Secondly, the distribution of gene instructions is also significantly different. The calculation results of RS exceed the threshold, and the RS value is much higher than the first three clusters, which also means that the feature distance between hidden and public behavior increases significantly. Therefore, we believe that the operation of these two protocols is unsafe. The final verification execution shows that it is accurate and reliable to determine the operation security of the protocol through the characteristic distance.

The proposed method cannot only perceive the unknown hidden behavior but also discover, utilize, and construct the trigger conditions of the hidden behavior. While the hidden behavior has not been revealed, it can be triggered to execute on the virtual platform, so that the hidden behavior can be recognized and analyzed at any moment. Only from the perspective of a clustering algorithm, public and hidden behavior can be naturally distinguished without understanding the specific content of public behavior. However, there are many kinds of hidden behaviors and great differences, while the types of public behaviors are very limited, including sending and receiving messages and other related functions. Public behavior is used as the input of instruction clustering. In this way, behavior that is close to public behavior is also regarded as public behavior, and all behaviors that deviate from it can be classified as hidden behavior so that the hidden behavior of unknown protocols can be mined more effectively.

The proposed scheme is compared with the current classical analysis mechanism, especially in terms of the recognition ability of hidden behavior. The comparison data were shown in Table 2.

Table 2. Comparison between the proposed scheme and the mainstream schemes.
Scheme Analysis unit Resist encryption Recognize hidden behavior Recognition rate
Protocol status inference machine Network data flow × × 0
Dynamic taint analysis Protocol program 10%–20%
Protocol program fortress reverse engineering Network data flow <50%
Identify hidden functions Malware program × 20%–40%
Malware classification Malware program 20%–30%
Detecting multiexecution path malware analysis Malware program 50%–60%
Zombie group activity detector Network data flow 20%–30%
The proposed scheme Protocol program and network data flow >85%

Since the above analysis tools are not available, we cannot use these methods to analyze our protocol samples, but according to the description of relevant literature, we can infer the recognition rate of these tools for hidden behavior. The comparative study shows that after gradual evolution and improvement, a large number of schemes and algorithms can well recognize encryption and confusion algorithms, and the recognition rate has been significantly improved. However, the recognition accuracy of these methods for hidden behavior is poor. As shown in Table 2, the proposed method in this paper is the most powerful tool for detecting and analyzing hidden behavior, and the recognition rate would be more than 85%, which is much higher than the existing methods and tools.

4. Conclusions

This paper presents a novel scheme for detecting and analyzing protocol hiding behavior. Different from the existing methods, this scheme integrates dynamic stain analysis into instruction sequence clustering of an unknown protocol and effectively integrates dynamic and static analysis. Dynamic stain analysis can effectively monitor the runtime disclosure behavior of the protocol, while instruction clustering can mine the unexecuted hidden behavior. On the designed analysis platform, the function of the instruction clustering algorithm is extended so that protocols with similar behavior can be divided into the same cluster, and hidden behavior can be divided into different clusters. Experiment results demonstrate that the proposed method could effectively analyze large amounts of unknown protocols without a prior-knowledge. The proposed scheme is more efficient than most existing schemes and can resist a wider range of attacks.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Data Availability

The basic data used in this paper is downloaded from the online public dataset: Cyber Grand Challenge http://www.ll.mit.edu/r-d/datasets?keywords=DARPA.

    The full text of this article hosted at iucr.org is unavailable due to technical difficulties.