Volume 2025, Issue 1 4469975
Research Article
Open Access

A Lightweight Deep Learning Approach for Encrypted Proxy Traffic Detection

Yanjie He

Corresponding Author


Xi’an Jiaotong University, Xi’an, China, xjtu.edu.cn

Wei Li


Xi’an Jiaotong University, Xi’an, China, xjtu.edu.cn

First published: 31 March 2025
Academic Editor: Vincenzo Conti

Abstract

Because encrypted proxies can circumvent internet censorship, criminals widely use them in illegal activities (e.g., online gambling and darknet transactions). Detecting encrypted proxy traffic is therefore important. In recent years, deep learning-based approaches have become mainstream. Many of them transform internet traffic into images, but the transformed images are normally large, which leads to heavy computational and storage overhead. To solve this issue, we propose a novel approach that compresses the image size, reducing the overhead of detecting encrypted proxy traffic while still achieving comparable performance. By analyzing the spatiotemporal features of flows, we found that the sequences of sizes, directions, and interval times of the first few packets of a flow can be used to detect encrypted proxy traffic. We compare and analyze the characteristics of packet size, packet direction, packet interval time, and image pixel values, and design several equations that encode the sequences of sizes, directions, and interval times of only the first N packets of a flow into an image. Furthermore, a lightweight convolutional neural network (CNN) is constructed to classify the converted images. The experimental results show that the proposed approach reduces the image size by at least 90% and achieves F1 scores of 99.67% in ShadowsocksR traffic detection and 99.44% in VPN traffic detection. These results show that the proposed approach is effective and efficient. Because of its high efficiency, the proposed method can be applied to large-scale network traffic analysis tasks.

1. Introduction

Over the past few years, more and more network users have used encrypted proxy services (e.g., V2Ray [1], virtual private network (VPN) [2], and ShadowsocksR [3]).

Encrypted proxies are effective tools for users to circumvent internet supervision and censorship. However, they are also used by criminals in illegal activities (e.g., pornographic propagation, online gambling, cyberattacks, and online fraud) [4]. Therefore, encrypted proxy traffic detection is crucial for cyber security surveillance.

Currently, there are two mainstream families of methods for encrypted proxy traffic detection: traditional machine learning approaches and deep learning-based approaches. Many traditional machine learning approaches rely on human-engineered features. Feature extraction and selection depend on expert experience, and the process is time-consuming and labor-intensive, which greatly limits the generalizability of these methods. In recent years, deep learning-based detection has attracted significant attention, since deep learning can learn features automatically. Many deep learning-based approaches transform internet traffic into images, but the transformed images are usually large. For example, the approaches in references [5-8] convert the first S bytes of the packet payloads of a flow into an image. They concatenate the payload bytes of the flow, take the first S bytes as a byte stream, and transform each byte into an integer from 0 to 255. Since the byte stream consists of a large number of bytes, the transformed images are large, such as 784 and 1521 bytes. Large images lead to huge computational and storage overhead.

To solve the above issues, a novel approach is proposed to compress the image size for reducing overhead in detecting encrypted proxy traffic while still achieving comparable performance. The method was inspired by the common fact that the first few packets of the flow exhibit application-specific traffic features (e.g., client-server key negotiation) [9, 10].

By comparing and analyzing the characteristics of packet size, packet direction, packet interval time, and image pixel values, we discovered that the size and direction of a packet can be encoded into pixel values by compression, whereas the packet interval time (the time difference between two packets sent or received by the same host) can be encoded by expansion. We design several equations to encode the sequences of sizes, directions, and interval times of only the first N packets of a flow into an image. Following the compression idea, the packet size and packet direction are compressed and encoded into pixel values of the image. Following the expansion idea, the packet interval time is expanded and encoded into a pixel value of the image. Finally, the sequences of pixel values of all features are concatenated. Furthermore, a lightweight CNN is constructed to classify the converted images.

Since the method employs a mapping-based feature encoding mechanism and uses only a small amount of feature data (i.e., the sequences of sizes, directions, and interval times of the first N packets of a flow) instead of a packet payload containing a large number of bytes, the size of the transformed images is greatly reduced. The transformed images of the approach are at least 90% smaller than those of many image-based deep learning approaches. Thus, the method significantly reduces the required computational and storage resources and works in a lightweight fashion. Meanwhile, because the transformed images comprehensively capture flow differences through both bidirectional and unidirectional spatiotemporal features, the approach is accurate and achieves performance comparable to state-of-the-art approaches, such as those in references [5, 11].

In addition, the approach is capable of automatically extracting and selecting features, obviating the labor-intensive and time-consuming work of manually designing and selecting features. The method can detect encrypted proxy traffic in real time. Because of its high efficiency, the approach can be used for traffic analysis tasks on large-scale networks. The major contributions of this research are as follows:
  • We discover that the sequences of sizes, directions, and interval times of the first few packets of a flow can be used to detect encrypted proxy traffic.

  • We propose a novel approach for encrypted proxy traffic detection. The approach can reduce computational and storage resource overheads by compressing the image size.

  • The approach has broad applicability and can be used for network traffic classification, VPN traffic detection, and ShadowsocksR and Shadowsocks traffic detections.

Roadmap. Section 2 presents the related work. Section 3 analyzes features. Section 4 describes the proposed method, and Section 5 performs the evaluation. Section 6 discusses the limitations of the proposed approach. Section 7 presents the conclusion.

2. Related Work

We summarize encrypted proxy traffic detection approaches and network traffic classification approaches. According to their characteristics, these approaches are classified into deep learning-based approaches and traditional machine learning approaches.

Traditional machine learning approaches categorize network traffic or detect encrypted proxy traffic using traditional machine learning models and statistical characteristics. Deng et al. [12] extract some characteristics (e.g., the number of packets of the flow, the number of outgoing packets of the flow, and the maximal burst length) from network traffic, and then use Random Forest Algorithms to detect the Shadowsocks traffic. Zeng et al. [4] extract some characteristics (e.g., the max of flow burst length and the number of flow bursts) from the DNS behavior of the host, the flow behavior of the host, and the relationship between flows, and then use Random Forest algorithms to detect Shadowsocks traffic. Cheng et al. [13] propose an approach that combines active detection and machine learning. They collect the port and IP of the server as databases and then use XGBoost algorithms to categorize the Shadowsocks servers.

Shim et al. [9] propose an approach for application-level traffic classification. They utilize statistical information (i.e., the orders, payload sizes, and directions) of the first N packets of a flow to produce distinctive payload size sequence (PSS) signatures for the application. They identify application traffic through matching PSS signatures. Lu et al. [10] propose two statistics-based classifiers, i.e., the message size sequence classifier (MSSC) and the message size distribution classifier (MSDC). MSSC is a real-time approach. MSSC utilizes the sequence of sizes and directions of the first 15 packets of a flow to classify network traffic into applications. MSDC classifies network traffic into applications (e.g., Skype and SMTP) using packet size distribution.

Traditional machine learning methods usually extract and select features manually in a trial-and-error mode. This process relies on expert experience and is time-consuming and labor-intensive. With the development of the internet, network applications have become more diverse and network traffic has become more complex. Therefore, it has become more difficult to design some features with good generalization abilities.

Deep learning-based approaches are capable of automatically extracting and selecting features [14, 15]. Wang et al. [16] were the first to propose converting the payloads of a TCP flow into an image. They identify protocol traffic and detect anomalous protocol traffic using a stacked auto-encoder (SAE) or an artificial neural network (ANN). Many methods improve on the method in reference [16]. Wang et al. [17] apply the method in reference [16] to encrypted traffic classification. They convert only the payload of the first few packets of a flow into an image. Cheng et al. [6] propose a lightweight convolutional neural network (CNN) to classify network traffic in real time. They significantly reduce the number of parameters and the running time of the network. Lotfollahi et al. [18] propose converting each packet into an image. They classify encrypted traffic and identify application traffic using a stacked autoencoder and a CNN.

Tang et al. [19] propose a novel model for encrypted traffic identification. The model comprises the long short-term memory network (LSTM) and CapsNet. Guo et al. [5] present two models based on deep learning to identify VPN traffic. They transform the internet traffic into pictures using convolutional auto-encoding and then identify VPN traffic using CNN. Lan et al. [7] present a deep learning approach to identify applications and classify darknet traffic. They obtain local spatiotemporal characteristics from packet payloads using the 1D-CNN and the bidirectional LSTM network. Moreover, to enhance the classification performance, they extract the side-channel characteristic (e.g., number of inbound packets/bytes) from packet payload statistics. Lin et al. [8] propose a novel network traffic identification scheme. They extract the spatial features and temporal features using CNN and stack bidirectional LSTM, respectively. Zheng et al. [20] propose a multitask learning approach for application in traffic identification and network traffic classification. They extract features using a multihead attention mechanism.

Shapira et al. [11] propose a novel approach for application traffic identification and encrypted traffic classification. They transform pairs of packet arrival time and packet size of the flow into two-dimensional square histograms, with the packet size on the Y-axis and the packet arrival time on the X-axis. They classify the transformed histograms using a CNN.

Many deep learning-based approaches have a common limitation: large transformed images. This limitation causes these methods to have very large computational and storage overheads. In this paper, we propose a novel approach to compress the size of transformed images, thereby reducing the computational and storage overheads.

3. Feature Analysis

3.1. Spatial Features

The early stage (i.e., the first few packets) of a flow is the key exchange phase of an application. The communication of this phase is based on predefined regulations by the application. This phase is distinct in each application. Thus, the size sequence of the first few packets of a flow can be used as a distinguishing characteristic to identify application traffic [9, 10, 21].

3.2. Working Mechanism of Encrypted Proxies

The working mechanism of many encrypted proxies is similar, such as ShadowsocksR, V2Ray, and VPN. ShadowsocksR and VPN are typical encrypted proxy systems. They are composed of two parts: the remote proxy server and the local proxy server. The local proxy server is usually set up on a host or other machine on the local network. The remote proxy server is set outside the firewall [4, 12, 21].

The communication process of encrypted proxies is as follows: the client transmits request packets to the local proxy server, the local proxy server encrypts the request packets and sends them to the remote proxy server. The remote proxy server decrypts the request packets and sends them to the real target server. Similarly, the reply packets of the real target server are encrypted and relayed by the remote proxy server back to the local proxy server. The local proxy server decrypts the reply packets and returns them to the original client [4, 12, 21]. Throughout the process, the data have been encrypted. The proxy servers do not know the content of the data. Furthermore, the local proxy server and the remote proxy server use a variety of encryption algorithms to prevent firewalls from detecting and interfering with their communication data. The working mechanism of encrypted proxies is exhibited in Figure 1.

Figure 1: The working mechanism of encrypted proxies.

In a conventional network environment, by contrast, the client and the target server communicate directly. By analyzing the working mechanism of encrypted proxies, we inferred that encrypted proxy traffic and ordinary network traffic differ in their temporal characteristics.

3.3. Temporal Features

In this section, the packet interval time sequences of encrypted proxy (i.e., ShadowsocksR and VPN) traffic and ordinary network traffic are analyzed. According to the size, the application traffic can be classified as small flows and large flows. In general, the small flow comprises a small number of packets, while the large flow comprises a large number of packets.

We extracted small flows and large flows of encrypted proxy traffic and ordinary network traffic. For the small flow, the packet interval time sequence of the entire flow is analyzed. For the large flow, the interval time sequence of the first 100 packets of the flow is analyzed. Moreover, most of the packet interval times of the flow are very short, less than 1 s. If there is an unusually long packet interval time in a flow, the difference in packet interval time sequences between the encrypted proxy traffic and the ordinary network traffic cannot be clearly shown in the figure. To clearly display the difference in the figure, the packet interval times longer than 1 s in the flow were removed. Figures 2, 3, 4, and 5 show the analysis results. In these figures, ssr represents ShadowsocksR traffic, and regular represents ordinary network traffic.
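The preparation of the plotted sequences described above can be sketched as follows. This is a minimal illustration, assuming the interval time is taken between adjacent packets in arrival order; the function name `plotted_intervals` and its parameters are hypothetical and do not come from the original work.

```python
def plotted_intervals(arrival_times, max_packets=100, max_gap_s=1.0):
    """Return the interval times of the first `max_packets` packets of a flow,
    dropping intervals longer than `max_gap_s` seconds (as done for plotting)."""
    times = arrival_times[:max_packets]
    # Interval time between each pair of adjacent packets.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    # Remove unusually long gaps so that short gaps stay visible in a figure.
    return [gap for gap in gaps if gap <= max_gap_s]


# Hypothetical arrival times in seconds; the 2.0 s gap is filtered out.
print(plotted_intervals([0.0, 0.5, 2.5, 2.75]))
```

For a small flow the slice simply keeps every packet, so the same function covers both the small-flow and large-flow cases.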

Figure 2: The packet interval time sequence of web page traffic.
Figure 3: The packet interval time sequence of video traffic.
Figure 4: The packet interval time sequence of file transfer traffic and P2P traffic.
Figure 5: The packet interval time sequence of Spotify traffic and E-mail traffic.

As shown in these figures, the packet interval time distributions of ShadowsocksR traffic and VPN traffic are more discrete than that of ordinary network traffic. Most packet interval times of ShadowsocksR traffic and VPN traffic are longer than those of ordinary network traffic. Most packet interval times of ordinary network traffic are between 0 and 0.1 s. Most packet interval times of ShadowsocksR traffic and VPN traffic are between 0 and 0.3 s. As shown in Figures 2, 3, and 4, for large flows, the interval time distribution of the first few packets of the flow is more discrete than that of the remaining packets. Most interval times of the first few packets of the flow are longer than those of the remaining packets. The difference between packet interval time sequences of ordinary network traffic and encrypted proxy traffic is relatively big at the first few packets of the flow, whereas the difference of interval time sequences of the remaining packets is small. Therefore, the sequence of the interval time of the first few packets of a flow can serve as a distinguishing feature to detect ShadowsocksR traffic and VPN traffic.

4. Method Design

We use a framework to show the proposed method. The framework consists of the data preprocessing phase, image transformation phase, and CNN model training and classification phase. Figure 6 shows the details of the framework.

Figure 6: The framework of the proposed method.

4.1. Image Transformation Approach

In deep learning, the CNN is a mature and well-validated architecture. CNNs have achieved excellent performance in image classification [22], image recognition [23], natural language processing, etc. Therefore, many researchers convert network traffic into images to exploit these advantages of CNNs.

The sequences of the size, direction, and interval time of the first N packets of a flow can be imagined as a grayscale image, where the size, direction, and interval time of the packet are considered as pixel values. For spatial characteristics of the flow (i.e., the size and direction of the packet), there is a one-to-one correspondence between the pixel value sequence of the image and the sequence of the first N packets of a flow. For temporal characteristics (i.e., the interval time of the packet) of the flow, there is a one-to-one correspondence between the pixel value sequence and the sequence of the interval time of the first N packets of a flow.

The size of the packet is encoded into a pixel value by Equation (1).
\[ \mathrm{Pixel} = \mathrm{round}\left( \frac{PS}{MTU} \times 255 \right) \tag{1} \]
where the round() is the rounding operation, the PS is the size of the packet, and the MTU is the maximum transmission unit of the network. By rounding, the value of Equation (1) is an integer. Generally, MTU is used to inform the receiver of the maximum payload size that the sender can accept. As different link media have different MTU values, and different open system interconnect (OSI) levels lead to different MTU values even under the same link, the MTU in Equation (1) is not a fixed value. The MTU can be adjusted according to different datasets or networks. In this work, for ShadowsocksR and VPN traffic detections, the MTU in Equation (1) is set to 1514 bytes.
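As a minimal sketch of the size encoding, assume that Equation (1) scales the packet size by 255/MTU and rounds the result, so that a full-size packet maps to the largest pixel value. The function name `size_to_pixel` and the clamp at 255 are illustrative assumptions, not part of the original method. Note also that Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding rule used by the authors.

```python
MTU = 1514  # bytes; the value the text uses for ShadowsocksR and VPN detection


def size_to_pixel(packet_size: int, mtu: int = MTU) -> int:
    """Compress a packet size into a pixel value in [0, 255].

    Assumed form of Equation (1): round(PS / MTU * 255).
    """
    pixel = round(packet_size / mtu * 255)
    # Assumption: clamp in case a packet exceeds the configured MTU.
    return min(pixel, 255)


# A small and a full-size packet.
print([size_to_pixel(size) for size in (60, 1514)])
```

Because the MTU is a parameter, the same function can be reused for a dataset captured on a link with a different maximum packet size.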

In this paper, the flow (five-tuple) consists of forward (client to server) packets and reverse (server to client) packets. For the direction of the packet, two different integers are extracted from [1, 255] to represent the forward direction and the reverse direction. For instance, 100 represents the forward direction of the flow, and 255 represents the reverse direction of the flow. The reason for this choice is that flows have different lengths; for the flow with less than N packets, 0 is added at the end of its pixel value sequence to unify the same length. To avoid interference, two different integers are extracted from [1, 255] to represent the directions of the packet.

The threshold of the packet interval time can be set according to the packet interval time analysis of encrypted proxy traffic and ordinary network traffic. The interval time of the packet is encoded into a pixel value by Equations (2) and (3). If the value obtained by Equation (3) is greater than 255, it is set to 255.
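\[ \mathrm{Expansion} = \frac{255}{\mathrm{Threshold}} \tag{2} \]
\[ \mathrm{Pixel} = \mathrm{round}\left( \mathrm{Time} \times \mathrm{Expansion} \right) \tag{3} \]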
where the round () is the rounding operation, the Threshold is the threshold of the packet interval time, the Time is the interval time of the packet. The server’s response time is affected by many factors, such as the hardware configuration of the server, the network bandwidth, etc. Hence, the Threshold in Equation (2) is not a fixed value, which can be adjusted according to different tasks. For example, for ShadowsocksR traffic detection, the Threshold in Equation (2) is set to 0.3 s. For VPN traffic detection, the Threshold in Equation (2) is set to 0.2 s. Finally, the sequences of pixel values of all features are concatenated together.
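The interval time encoding can be sketched in the same way. The sketch assumes that Equation (2) derives an expansion factor of 255/Threshold and that Equation (3) multiplies the interval time by that factor and rounds; only the clipping at 255 is stated explicitly in the text. The function name `interval_to_pixel` is hypothetical.

```python
def interval_to_pixel(interval_s: float, threshold_s: float) -> int:
    """Expand a packet interval time (seconds) into a pixel value in [0, 255]."""
    expansion = 255 / threshold_s          # assumed form of Equation (2)
    pixel = round(interval_s * expansion)  # assumed form of Equation (3)
    # Stated in the text: values greater than 255 are set to 255.
    return min(pixel, 255)


# Threshold of 0.3 s, as used for ShadowsocksR traffic detection.
print([interval_to_pixel(t, 0.3) for t in (0.012, 0.06, 0.5)])
```

Under this reading, every interval at or above the threshold saturates at 255, so the threshold controls how much of the pixel range is spent on short intervals.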

The 1D-CNN is used for encrypted proxy traffic detection. As the images input to the CNN must have a uniform size, the transformed images of the approach are set to the same size. In ShadowsocksR traffic detection, we unified the size of transformed images to 64 bytes. In VPN traffic detection, we unified the size of transformed images to 49 bytes. The size of transformed images of the approach can be set according to the parameter N (the first N packets of the flow) used in a task. For example, in VPN traffic detection, when the spatial characteristics of the first 10 packets (i.e., the bidirectional packet size sequence, the unidirectional packet size sequences, and the packet direction sequence) and the bidirectional packet interval time sequences of the first 15 packets of the flow are encoded into an image, the method achieves better performance. The approach encodes the spatial characteristics of the first 10 packets into 30 pixel values. The approach encodes the bidirectional packet interval time sequences of the first 15 packets into 14 pixel values. Therefore, in VPN traffic detection, we set the size of transformed images to 49 bytes.
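To make the size reduction concrete, the short calculation below compares the 64-byte and 49-byte images with the 784-byte and 1521-byte images mentioned in the Introduction, and checks that the 30 spatial and 14 temporal pixel values of the VPN configuration fit into 49 bytes. The helper name `reduction` is illustrative only.

```python
def reduction(proposed_bytes: int, baseline_bytes: int) -> float:
    """Fraction by which the proposed image is smaller than a baseline image."""
    return 1 - proposed_bytes / baseline_bytes


for proposed in (64, 49):
    for baseline in (784, 1521):
        ratio = reduction(proposed, baseline)
        print(f"{proposed} vs {baseline} bytes: {ratio:.1%} smaller")

# VPN configuration: 30 spatial + 14 temporal pixel values, zero-padded to 49.
spatial_pixels, temporal_pixels, image_size = 30, 14, 49
print(image_size - (spatial_pixels + temporal_pixels), "padding bytes")
```

Every combination yields a reduction above 90%, which is consistent with the "at least 90%" figure stated in the text.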

Finally, if the concatenated pixel value sequence is smaller than the uniform image size (e.g., 64 bytes), 0 is added to the end to supplement it to the uniform image size. We converted the files containing the labels and pixel values into IDX files. The IDX format files are used to store vectors and multidimensional matrices [24].
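The padding and storage step can be sketched as follows. The IDX layout used here (two zero bytes, a type code of 0x08 for unsigned bytes, the number of dimensions, then big-endian 32-bit dimension sizes followed by the raw data) is the commonly documented one. The function names `pad_image` and `idx_images_bytes` are hypothetical, and the exact files produced by the authors may be organized differently.

```python
import struct


def pad_image(pixels, image_size):
    """Append zeros so that every pixel sequence has the uniform image size."""
    pixels = list(pixels)
    if len(pixels) > image_size:
        raise ValueError("pixel sequence is longer than the image size")
    return pixels + [0] * (image_size - len(pixels))


def idx_images_bytes(images):
    """Serialize equally sized images as an IDX array of unsigned bytes."""
    count, length = len(images), len(images[0])
    # Magic number: 0, 0, type code 0x08 (unsigned byte), 2 dimensions.
    header = struct.pack(">BBBBII", 0, 0, 0x08, 2, count, length)
    return header + bytes(value for image in images for value in image)


image = pad_image([255, 10, 100, 255], 64)
blob = idx_images_bytes([image])
print(len(image), len(blob))
```

Labels can be stored in the same way as a one-dimensional array, with the dimension count in the header set to 1.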

The method proposed in this paper is different from the method in reference [21]. The method in reference [21] is based on byte conversion. It transforms the size and interval time of the packet into binary integers. It concatenates the binary integers of a feature into a byte stream, and then transforms a byte into a whole number. However, the method proposed in this paper is based on mapping. The method encodes the size, direction, and interval time of the packet into pixel values of the image using Equations (1), (2), and (3). Moreover, compared to the method in reference [21], the method in this paper uses the direction of the packet in addition to the size and interval time of the packet. The method proposed in this paper achieves better performance on ShadowsocksR and Shadowsocks traffic detections.

The image transformation approaches are presented in Algorithms 1 and 2 and in Figures 7 and 8. In these figures, + represents the forward direction of the flow, while − represents the reverse direction of the flow.

    Algorithm 1: The image transformation process of spatial features.
    Input: The size of the packet, PktSize; The port of the packet, Port; The number of packets of the flow, Nrow; The first N packets of a flow, N; The port of the first packet of a flow, Port1.
    Output: The pixel value sequence of the image.

    i ← 2
    num ← 0
    while i ≤ Nrow do
        pixel ← Equation (1) (PktSize)
        Ptwo.append(pixel) // Use a list to save the pixel value
        if Port = Port1 then // Determine the direction of the packet based on the port number
            Psrc.append(pixel)
            Direction.append(100)
        else
            Pdes.append(pixel)
            Direction.append(255)
        end if
        num ← num + 1
        i ← i + 1
        if num = N then
            break
        end if
    end while
    Concatenate sequences of pixel values of all characteristics
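Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under several assumptions: packets are given as (size, source port) pairs in arrival order, the size encoding takes the form round(PS / MTU × 255), the direction codes are the example values 100 and 255, and the sequences are concatenated in the order bidirectional sizes, forward sizes, reverse sizes, directions. The function name `spatial_pixels` and the concatenation order are not taken from the original work.

```python
from typing import List, Tuple


def spatial_pixels(packets: List[Tuple[int, int]], n: int, mtu: int = 1514) -> List[int]:
    """Encode sizes and directions of the first n packets into pixel values.

    packets: (size_in_bytes, source_port) pairs in arrival order.
    """
    if not packets:
        return []
    first_port = packets[0][1]  # port of the first packet defines "forward"
    p_two, p_src, p_des, direction = [], [], [], []
    for size, port in packets[:n]:
        pixel = min(round(size / mtu * 255), 255)  # assumed Equation (1)
        p_two.append(pixel)                        # bidirectional size sequence
        if port == first_port:
            p_src.append(pixel)                    # forward size sequence
            direction.append(100)
        else:
            p_des.append(pixel)                    # reverse size sequence
            direction.append(255)
    # Assumed concatenation order; yields 3 values per encoded packet.
    return p_two + p_src + p_des + direction


flow = [(1514, 5000), (60, 443), (1514, 5000)]
print(spatial_pixels(flow, n=2))
```

With N = 10 this produces 30 pixel values for a flow of at least 10 packets, which matches the count given for VPN traffic detection.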

    Algorithm 2: The image transformation process of temporal features.
    Input: The arrival time of adjacent packets, Time1, Time2; The port of the packet, Port; The number of packets of the flow, Nrow; The first N packets of a flow, N; The threshold of the packet interval time, Threshold; The port of the first packet of a flow, Port1.
    Output: The pixel value sequence of the image.

    i ← 2
    num ← 0
    while i ≤ Nrow do
        interval ← Time2 - Time1
        Expansion ← Equation (2) (Threshold)
        pixel ← Equation (3) (interval)
        Ptwo.append(pixel)
        if Port = Port1 then
            Tsrc.append(Time2)
        else
            Tdes.append(Time2)
        end if
        num ← num + 1
        i ← i + 1
        if num = N then
            break
        end if
    end while
    Calculate the interval time sequences of two unidirectional packet arrival time sequences and encode them into pixel value sequences
    Concatenate sequences of pixel values of all characteristics
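Algorithm 2 can be sketched along the same lines. The sketch assumes that packets are given as (arrival time, source port) pairs, that the expansion factor is 255/Threshold, and that the arrival time of the first packet is included in the forward sequence, which the pseudocode does not state explicitly. The names `encode_gaps` and `temporal_pixels` and the concatenation order are illustrative assumptions.

```python
def encode_gaps(times, threshold_s):
    """Encode the gaps between consecutive arrival times as pixel values."""
    expansion = 255 / threshold_s  # assumed Equation (2)
    return [
        min(round((later - earlier) * expansion), 255)  # assumed Equation (3), clipped
        for earlier, later in zip(times, times[1:])
    ]


def temporal_pixels(packets, n, threshold_s):
    """Encode interval times of the first n packets into pixel values.

    packets: (arrival_time_s, source_port) pairs in arrival order.
    """
    head = packets[:n]
    if not head:
        return []
    first_port = head[0][1]
    t_all = [t for t, _ in head]                         # bidirectional times
    t_src = [t for t, port in head if port == first_port]  # forward times
    t_des = [t for t, port in head if port != first_port]  # reverse times
    return (
        encode_gaps(t_all, threshold_s)
        + encode_gaps(t_src, threshold_s)
        + encode_gaps(t_des, threshold_s)
    )


flow = [(0.0, 5000), (0.06, 443), (0.12, 5000), (0.72, 443)]
print(temporal_pixels(flow, n=4, threshold_s=0.3))
```

The bidirectional part alone contributes N − 1 values, so the first 15 packets give the 14 pixel values mentioned for VPN traffic detection.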

Figure 7: The pixel value transformation process of spatial features.
Figure 8: The pixel value transformation process of temporal features.

4.2. Detection Model

CNN has been applied to many fields and achieved excellent results, e.g., image recognition, video analysis [25], and natural language processing. The 1D-CNN model is similar to the LeNet-5 architecture. It consists of an input layer, two pooling layers, two convolutional layers, a fully connected layer, and an output layer. Pooling layers and convolutional layers are the core modules of CNN to realize feature extraction and selection. The convolutional layer uses local connection and parameter sharing instead of full connectivity. The pooling layer selects salient characteristics. Pooling layers and convolutional layers can greatly cut down the parameters of the model and reduce the complexity of the model. In the model, we use max pooling. The activation function adds nonlinear properties to the network. In the model, we use the Rectified Linear Units (ReLU) activation function. The fully connected layer integrates feature representations learned by previous layers. The Softmax function serves as the output layer. The constructed 1D-CNN model has several advantages: simple structure, small number of parameters, low computational complexity, and fast training speed.

As previously mentioned, the inputs of the 1D-CNN model are grayscale images of 49 bytes (7 × 7 pixels) or 64 bytes (8 × 8 pixels). We take the 64-byte input image as an example to introduce the parameters of the model. The first layer is the first one-dimensional convolutional layer (C1) with a filter of size 1 × 25. The output of C1 is 32 feature maps of size 1 × 64. The second layer is the first max-pooling layer (P1) with a filter of size 1 × 3. The output of P1 is 32 feature maps of size 1 × 22. The third layer is the second convolutional layer (C2) with a filter of size 1 × 25. The output of C2 is 64 feature maps of size 1 × 22. The fourth layer is the second max-pooling layer (P2) with a filter of size 1 × 3. The output of P2 is 64 feature maps of size 1 × 8. The next layer is the fully connected layer, whose output size is 1024. Finally, the output layer of the model is the Softmax layer, whose output is the classification result. Furthermore, a dropout layer is used to reduce overfitting. Table 1 shows the main parameters of the 1D-CNN model.

Table 1. The main parameters of the model.
Layer | Operation | Input | Filter | Stride | Pad | Output
1 | Conv + ReLU | 1 × 64 | 1 × 25 | 1 | Same | 32 × (1 × 64)
2 | Maxpool | 32 × (1 × 64) | 1 × 3 | 3 | Same | 32 × (1 × 22)
3 | Conv + ReLU | 32 × (1 × 22) | 1 × 25 | 1 | Same | 64 × (1 × 22)
4 | Maxpool | 64 × (1 × 22) | 1 × 3 | 3 | Same | 64 × (1 × 8)
5 | Full connect | 64 × (1 × 8) | - | - | - | 1024
6 | Softmax | 1026/2/6 | - | - | - | 2/6
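The output sizes in Table 1 follow from the usual rule for "same" padding, under which a layer with stride s maps an input of length L to an output of length ceil(L / s). The short check below traces the feature map sizes through the four convolutional and pooling layers; the function names are illustrative, and the channel counts 32 and 64 are taken from the table.

```python
import math


def same_padding_length(length: int, stride: int) -> int:
    """Output length of a 1D layer with 'same' padding and the given stride."""
    return math.ceil(length / stride)


def trace_shapes(input_length: int):
    """Return (channels, length) after C1, P1, C2, and P2."""
    c1 = same_padding_length(input_length, 1)  # conv, stride 1
    p1 = same_padding_length(c1, 3)            # max pooling, stride 3
    c2 = same_padding_length(p1, 1)            # conv, stride 1
    p2 = same_padding_length(c2, 3)            # max pooling, stride 3
    return [(32, c1), (32, p1), (64, c2), (64, p2)]


print(trace_shapes(64))  # [(32, 64), (32, 22), (64, 22), (64, 8)]
print(trace_shapes(49))
```

For the 64-byte input, the fully connected layer therefore receives 64 × 8 = 512 values.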

5. Experiment

In this section, the datasets, the data preprocessing process, and performance evaluation metrics are presented. The parameter settings of the approach are analyzed. We compare the proposed approach with state-of-the-art approaches.

5.1. Datasets

In this work, two datasets are employed to validate the performance of the proposed method, namely the public dataset ISCX VPN-nonVPN [26] and the proprietary dataset ShadowsocksR-ordinary. The ShadowsocksR-ordinary dataset was captured with Wireshark [27]. While collecting ShadowsocksR traffic, we set the local ShadowsocksR server to global mode. The ShadowsocksR-ordinary dataset is made up of traffic encrypted via ShadowsocksR and ordinary network traffic. Both the ShadowsocksR traffic and the ordinary traffic comprise video data, music data, and web page data from several popular applications. For ShadowsocksR traffic, the video data consists of traffic from YouTube and Twitter, the music data comprises Spotify traffic, and the web page data consists of traffic from some blogs, some Q&A websites, and Instagram. For ordinary traffic, the video data consists of traffic from Youku, Iqiyi, Bilibili, Tengxun, and Weibo, the music data comprises Wangyiyun traffic, and the web page data consists of traffic from Weibo, some blogs, and some Q&A websites.

The public dataset ISCX VPN-nonVPN is composed of 7 kinds of encrypted traffic via VPN (i.e., VPN-Streaming, VPN-P2P, VPN-Browsing, VPN-Chat, VPN-VoIP, VPN-Email, and VPN-File transfer) and 7 kinds of ordinary network traffic (i.e., Streaming, P2P, Browsing, Chat, Email, VoIP, and File transfer). Each kind of network traffic consists of data from several applications. The VPN-Streaming traffic and Streaming traffic consist of data from Vimeo, YouTube, Netflix, and Spotify. The VPN-P2P traffic and P2P traffic comprise Bittorrent data and Torrent data, respectively. The VPN-Browsing traffic and Browsing traffic consist of data from Chrome and Firefox. The VPN-VoIP traffic and VoIP traffic consist of voice data from Hangouts, Facebook, VoipBuster, and Skype. The VPN-Chat traffic and Chat traffic consist of data from ICQ, AIM, Facebook, Hangouts, and Skype. The VPN-Email traffic and Email traffic comprise email data. The VPN-File transfer traffic and File transfer traffic consist of data from FTPS, SFTP, and Skype. Tables 2 and 3 show these datasets. In the tables, the Number is the number of flows in the dataset.

Table 2. ShadowsocksR-ordinary dataset.
Categories | Applications | Number
ShadowsocksR | Twitter, YouTube, Instagram, Spotify, some blogs, and some network Q & A websites | 10,625
Ordinary | Youku, Iqiyi, Bilibili, Tengxun, Weibo, Wangyiyun, some blogs, and some network Q & A websites | 10,672
Table 3. ISCX VPN-nonVPN dataset.
Categories | Applications | Number
VPN | Chrome, Firefox, AIM, Hangouts, Facebook, ICQ, VoipBuster, Skype, Email, SFTP, FTPS, YouTube, Vimeo, Spotify, Netflix, and Bittorrent | 1331
Non-VPN | Youku, Iqiyi, Bilibili, Tengxun, Weibo, Chrome, Firefox, AIM, Hangouts, Facebook, ICQ, VoipBuster, Skype, Email, SFTP, FTPS, YouTube, Vimeo, Spotify, Netflix, and Torrent | 2009

5.2. Data Preprocessing

Data preprocessing consists of traffic segmentation and traffic clearance.

Traffic Segmentation: We define the flow as a group of packets with the same five-tuple (i.e., source IP, source port, destination IP, destination port, and transport-level protocol) [28]. The PCAP files are split into discrete flows. The packets in the flow are listed in the order of arrival time. We saved each flow as a file.
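
The segmentation step can be illustrated with a minimal Python sketch. It assumes that both directions of a connection belong to the same flow, so the five-tuple is normalized before it is used as a key; the `Packet` record and its field names are hypothetical and are not taken from the original implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical minimal packet record; the field names are illustrative.
    ts: float       # arrival time in seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str      # transport-level protocol, e.g. "TCP" or "UDP"
    size: int       # packet size in bytes


def flow_key(p: Packet) -> tuple:
    """Five-tuple key that is identical for both directions of a connection.

    Assumption: forward and reverse packets are grouped into one flow.
    """
    lo, hi = sorted(((p.src_ip, p.src_port), (p.dst_ip, p.dst_port)))
    return (lo, hi, p.proto)


def split_into_flows(packets):
    """Group packets by flow key and order each flow by arrival time."""
    flows = defaultdict(list)
    for p in packets:
        flows[flow_key(p)].append(p)
    for pkts in flows.values():
        pkts.sort(key=lambda p: p.ts)
    return dict(flows)
```

In practice each resulting list would then be written to its own flow file, as described above.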

Traffic Clearance: For TCP flows, the flow is considered to have ended when a packet with the RST or FIN flag is first observed. If a TCP flow contains no packets with RST/FIN flags, the end of the flow file marks the flow’s end [21]. For UDP flows, the end of the flow file marks the flow’s end. While collecting ShadowsocksR traffic, some ordinary network traffic was still captured even though the local ShadowsocksR server was set to global mode. Therefore, we filtered out the ordinary network traffic according to the port of the ShadowsocksR server.

We removed traffic that is irrelevant to the task, such as Simple Service Discovery Protocol (SSDP) traffic and Domain Name System (DNS) traffic, as well as flows without payloads. Moreover, for VPN traffic detection, very small flows and incomplete TCP flows were removed. A very small flow is one that has fewer than two packets with payloads. An incomplete TCP flow is one that lacks a connection establishment stage.
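
The clearance rules can be sketched as two small Python helpers. This is an illustration under stated assumptions rather than the original implementation: the `Pkt` record is hypothetical, the first RST/FIN packet is kept as the last packet of the flow, and a missing connection establishment stage is approximated by the first packet of a TCP flow not carrying the SYN flag.

```python
from typing import NamedTuple


class Pkt(NamedTuple):
    # Hypothetical packet record used only for this sketch.
    proto: str          # "TCP" or "UDP"
    flags: frozenset    # TCP flag names, e.g. frozenset({"SYN"}); empty for UDP
    payload_len: int    # bytes of application payload


def truncate_at_end(flow):
    """Cut a flow at the first TCP packet carrying RST or FIN.

    Assumption: the RST/FIN packet itself is kept as the last packet.
    Flows without such a packet (and all UDP flows) are returned whole.
    """
    out = []
    for p in flow:
        out.append(p)
        if p.proto == "TCP" and (p.flags & {"RST", "FIN"}):
            break
    return out


def keep_flow(flow, min_payload_packets=1, require_handshake=False):
    """Decide whether a flow survives the clearance step.

    min_payload_packets=1 drops flows without payloads; the VPN setting
    described above corresponds to min_payload_packets=2 together with
    require_handshake=True.
    """
    with_payload = sum(1 for p in flow if p.payload_len > 0)
    if with_payload < min_payload_packets:
        return False
    if require_handshake and flow[0].proto == "TCP" and "SYN" not in flow[0].flags:
        return False  # assumption: a missing SYN means no connection establishment
    return True
```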

5.3. Evaluation Metrics

We use four metrics to evaluate the proposed method, namely, accuracy, precision, recall, and F1 score. These evaluative metrics are defined as follows:
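\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \]
\[ F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.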
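
As a quick check of how these metrics relate, the following Python sketch computes them from confusion-matrix counts (true positives `tp`, true negatives `tn`, false positives `fp`, false negatives `fn`) using the standard definitions. The function names are illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

For instance, the VPN row of Table 5 reports a precision of 98.88% and a recall of 100%; `f1_from(98.88, 100)` rounds to the reported F1 score of 99.44%.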

5.4. Parameter Settings

We assess the validity of the features. We analyze the parameter N and the threshold of the packet interval time that enable the proposed approach to obtain better detection performance. The performance of the approach using the sequences of payload sizes and packet directions of the first N packets of a flow is evaluated. The performance of the approach on VPN traffic classification and Shadowsocks traffic detection is evaluated. The parameters of the model that enable the method to attain better performance are evaluated.

5.4.1. The Features

We evaluate the effectiveness of spatial features (i.e., the bidirectional packet size sequence, the unidirectional packet size sequence, and the packet direction sequence) and temporal features (i.e., the bidirectional and unidirectional packet interval time sequences) of the flow. We analyze the parameter N (the first N packets of the flow) and the threshold of the packet interval time that enable the approach to attain better performance. Since the method does not work well on VPN-VoIP and VoIP traffic classification, we validate the proposed method using the 12 kinds of traffic in the ISCX VPN-nonVPN dataset other than VPN-VoIP and VoIP traffic.

The evaluation results are presented in Figures 9, 10, 11, 12, 13, and 14. In Figures 9, 10, 12, and 13, the two-way represents the bidirectional packet interval time sequence of a flow or the bidirectional packet size sequence of a flow. The one-way represents the unidirectional packet interval time sequences of a flow or the unidirectional packet size sequences of a flow. The Direction represents the packet direction sequence of a flow. The 10 packets represent the first 10 packets of the flow.

Figure 9. The performance of the approach on ShadowsocksR traffic detection when employing the bidirectional packet size sequence, the unidirectional packet size sequence, and the packet direction sequence of the flow, respectively.
Figure 10. The performance of the approach on ShadowsocksR traffic detection when employing the bidirectional and unidirectional packet interval time sequences of the flow, respectively, with the threshold of the packet interval time set to 0.3 s.
Figure 11. The performance of the approach on ShadowsocksR traffic detection when employing different thresholds of the packet interval time.
Figure 12. The performance of the approach on VPN traffic detection when employing the bidirectional packet size sequence, the unidirectional packet size sequence, and the packet direction sequence of the flow, respectively.
Figure 13. The performance of the approach on VPN traffic detection when employing the bidirectional and unidirectional packet interval time sequences of the flow, respectively, with the threshold of the packet interval time set to 0.2 s.
Figure 14. The performance of the approach on VPN traffic detection when employing different thresholds of the packet interval time.

As shown in Figures 9, 10, 12, and 13, these spatiotemporal features of the flow are effective in both ShadowsocksR and VPN traffic detection. Each feature has a value of N at which the approach performs best, and the temporal features additionally have a threshold of the packet interval time at which the approach performs best.

As shown in Figures 9 and 10, in ShadowsocksR traffic detection, the best-performing N depends on the feature that is encoded into images: N is 10 for the bidirectional size sequence, 30 for the unidirectional size sequences, 20 for the direction sequence, and 20 for both the bidirectional and the unidirectional interval time sequences. As shown in Figure 11, the method performs best when the threshold of the packet interval time is 0.2 or 0.3 s.

As shown in Figures 12 and 13, in VPN traffic detection, the best-performing N is 20 for the bidirectional size sequence, 20 for the unidirectional size sequences, 40 for the direction sequence, and 10 for both the bidirectional and the unidirectional interval time sequences. As shown in Figure 14, the method performs best when the threshold of the packet interval time is 0.2 s.

Encoding several features with different N values into an image requires reading the same flow file multiple times, which increases the computational resource overhead. To reduce computational overhead, for spatial features and temporal features, we employ a uniform N value. When several features have the same N value, the flow file is read only once. Moreover, for ShadowsocksR traffic detection, 0.2 or 0.3 s is set as the threshold of the packet interval time. For VPN traffic detection, 0.2 s is set as the threshold of the packet interval time.
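
The benefit of a uniform N per feature group can be sketched in Python: the spatial and temporal sequences are all taken from a single pass over the head of the flow. The function name and the flow representation are hypothetical, and capping each interval at the threshold is an assumption about how the threshold is applied, not a statement of the encoding equations.

```python
def extract_sequences(flow, n_space, n_time, threshold):
    """Extract size, direction, and interval sequences from one read of a flow.

    flow: list of (timestamp_s, size_bytes, is_forward) tuples in arrival order.
    n_space: number of leading packets used for the spatial features.
    n_time: number of leading packets used for the temporal features.
    threshold: packet interval time threshold in seconds
               (assumption: larger intervals are capped at this value).
    """
    head = flow[:max(n_space, n_time)]  # the flow is read only once

    sizes = [size for _, size, _ in head[:n_space]]
    forward = [is_fwd for _, _, is_fwd in head[:n_space]]

    times = [ts for ts, _, _ in head[:n_time]]
    gaps = [b - a for a, b in zip(times, times[1:])]
    # The first packet has no predecessor, so its interval is taken as 0.
    intervals = ([0.0] if times else []) + [min(g, threshold) for g in gaps]
    return sizes, forward, intervals
```

With `n_space=10`, `n_time=15`, and `threshold=0.3`, this corresponds to the ShadowsocksR setting reported in Table 5.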

5.4.2. Packet Direction and N (The First N Packets of a Flow)

The performance of the proposed method using different integers to represent packet directions is evaluated. Some different integers are extracted from [1, 255] to represent the forward direction and the reverse direction of the flow. The evaluation results are shown in Table 4. In this table, (20, 200) means that 20 represents the forward direction of the flow and 200 represents the reverse direction of the flow.

Table 4. The performance of the proposed method using different integers to represent packet directions (F1 score, %).
Task | (20, 200) | (50, 220) | (100, 255) | (150, 250)
ShadowsocksR | 99.14 | 99.38 | 99.24 | 99.29
VPN | 90.59 | 90.12 | 91.43 | 89.86

As shown in Table 4, in ShadowsocksR traffic detection, the maximum difference between the F1 scores obtained with different integer pairs is 0.24 percentage points. In VPN traffic detection, the maximum difference is 1.57 percentage points. These experimental results show that, in both ShadowsocksR and VPN traffic detection, the method achieves comparable performance regardless of which integers represent the direction of the packet. Therefore, any two distinct integers in [1, 255] can be chosen to represent the forward direction and the reverse direction of the flow.
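
The direction encoding and the spread of the F1 scores in Table 4 can be reproduced with a few lines of Python. The helper names are illustrative; the F1 values are copied from Table 4.

```python
def encode_directions(is_forward_seq, pair=(100, 255)):
    """Map each packet direction to one of two pixel values in [1, 255].

    pair = (forward_value, reverse_value), e.g. (20, 200) as in Table 4.
    """
    forward_value, reverse_value = pair
    return [forward_value if fwd else reverse_value for fwd in is_forward_seq]


# F1 scores (%) from Table 4 for the pairs (20, 200), (50, 220), (100, 255), (150, 250).
TABLE_4_F1 = {
    "ShadowsocksR": [99.14, 99.38, 99.24, 99.29],
    "VPN": [90.59, 90.12, 91.43, 89.86],
}


def f1_spread(scores):
    """Largest difference between any two F1 scores, in percentage points."""
    return round(max(scores) - min(scores), 2)
```

Here `f1_spread(TABLE_4_F1["ShadowsocksR"])` gives 0.24 and `f1_spread(TABLE_4_F1["VPN"])` gives 1.57, matching the differences discussed above.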

Furthermore, we analyze N values of spatial features and temporal features that enable the approach to attain better performance. Figures 15, 16, 17, 18 and Table 5 show the analysis results.

Figure 15. The performance of the approach utilizing the spatial characteristics of the flow on ShadowsocksR traffic detection.
Figure 16. The performance of the approach utilizing the temporal characteristics of the flow on ShadowsocksR traffic detection.
Figure 17. The performance of the approach utilizing the spatial characteristics of the flow on VPN traffic detection.
Figure 18. The performance of the approach utilizing the temporal characteristics of the flow on VPN traffic detection.
Table 5. The detection performance of the approach utilizing the temporal and spatial features of the flow (%).
Task | Feature | N | Threshold (s) | Precision | Recall | F1 score
SSR | Space & Time | 10 & 15 | 0.3 | 99.81 | 99.52 | 99.67
VPN | Space & Time | 10 & 10 | 0.2 | 98.88 | 100 | 99.44

As shown in Figures 15 and 16, in ShadowsocksR traffic detection, for spatial features, the approach acquires better performance when the first 10 packets of a flow are employed. For temporal features, the approach attains better performance when the first 15 packets of a flow are employed.

As shown in Figures 17 and 18, in VPN traffic detection, for spatial features, the approach achieves better performance when the first 25 packets of a flow are employed. For temporal features, the approach acquires better performance when the first 10 packets of a flow are employed.

Comparing Figure 18 with Figure 13, the approach using only the bidirectional packet interval time sequence of the flow outperforms the approach employing both the bidirectional and unidirectional packet interval time sequences of the flow. Thus, for VPN traffic detection, we employ only the bidirectional packet interval time sequence of the flow. Table 5 shows the performance of the approach employing the temporal and spatial features of the flow on VPN and ShadowsocksR traffic detection. In the table, N denotes the number of leading packets of the flow that are used, and Threshold denotes the threshold of the packet interval time.

As exhibited in Table 5, for ShadowsocksR traffic detection, when the spatial features of the first 10 packets and the temporal features of the first 15 packets of a flow are encoded into images, the approach achieves better performance. For VPN traffic detection, when the spatiotemporal features of the first 10 packets of a flow are encoded into images, the approach achieves better performance.

These experimental results show that the sequences of the size, direction, and interval time of the first N packets of a flow have important distinguishable features. They can be used to detect ShadowsocksR and VPN traffic.

5.4.3. PSS of the Flow

The detection performance of the approach employing the sequences of payload sizes and packet directions of the first N packets of a flow is evaluated. Compared with the sequences of the packet size and direction of a flow, the sequences of payload sizes and packet directions of the flow do not include the packets during the connection establishment phase between the client and the server and the ACK packets during the communication process between the two parties. The sequences of payload sizes and packet directions of the flow include only packets with data in the flow. In ShadowsocksR and VPN traffic detections, the MTU in Equation (1) is set to 1460 bytes. The evaluation results are exhibited in Table 6. In this table, the payload represents the sequences of payload sizes and packet directions of the flow. The packet represents the sequences of the packet size and direction of a flow.

Table 6. The detection performance of the approach utilizing the sequences of payload sizes and packet directions of the flow (%).
Task | Feature | N | Precision | Recall | F1 score
ShadowsocksR | Payload | 10 | 99.33 | 99.43 | 99.38
ShadowsocksR | Packet | 10 | 98.96 | 99.52 | 99.24
VPN | Payload | 15 | 76.41 | 84.18 | 80.11
VPN | Packet | 25 | 92.49 | 90.4 | 91.43

As exhibited in Table 6, for ShadowsocksR traffic detection, the F1 score obtained with the sequences of payload sizes and packet directions of the flow is 0.14 percentage points higher than that obtained with the sequences of the packet size and direction of the flow. For VPN traffic detection, it is 11.32 percentage points lower.

5.4.4. Model Parameters

We evaluate the parameters of the model that enable the method to attain better performance. Table 7 presents the evaluation results. In this table, 2D-6L (5 × 5) represents a 2D-CNN, which comprises six layers (i.e., two pooling layers, two convolutional layers, a fully connected layer, and an output layer), and the size of convolutional layers’ filters is 5 × 5. 1D-4L is the 1D-CNN consisting of four layers (i.e., one pooling layer, one convolutional layer, a fully connected layer, and an output layer).

Table 7. The performance of the approach utilizing different detection models (F1 score, %).
Task | 2D-6L (5 × 5) | 1D-4L (1 × 25) | 1D-6L (1 × 9) | 1D-6L (1 × 25)
SSR | 99.24 | 99.47 | 99.57 | 99.67
VPN | 98.05 | 98.88 | 98.87 | 99.44

As exhibited in Table 7, in ShadowsocksR and VPN traffic detections, the performance of the method using 1D-CNN is better than that using 2D-CNN. The performance of the method using 1D-CNN composed of six layers is better than that using 1D-CNN composed of four layers. For the 1D-CNN with six layers, the performance of the method using the convolutional layer filter of size 1 × 25 is better than that using the convolutional layer filter of size 1 × 9. Therefore, a 1D-CNN consisting of six layers is used as the detection model. Furthermore, the size of the convolutional layer filter is 1 × 25.
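
To give a feel for how a 1 × 25 filter interacts with such small inputs, the Python sketch below traces the sequence length through two convolution and two pooling layers using the standard output-length formula. The padding of 12 and the pooling window of 2 are assumptions made for illustration only; the actual padding, pooling, and channel settings of the model are not stated in this section.

```python
def conv1d_out_len(n, kernel, stride=1, padding=0):
    """Output length of a 1D convolution over an input of length n."""
    return (n + 2 * padding - kernel) // stride + 1


def pool1d_out_len(n, window, stride=None):
    """Output length of 1D pooling; the stride defaults to the window size."""
    stride = stride or window
    return (n - window) // stride + 1


def trace_lengths(n, kernel=25, padding=12, pool=2):
    """Sequence lengths after each of conv, pool, conv, pool (hypothetical settings)."""
    lengths = [n]
    for _ in range(2):
        n = conv1d_out_len(n, kernel, padding=padding)
        lengths.append(n)
        n = pool1d_out_len(n, pool)
        lengths.append(n)
    return lengths
```

Under these assumed settings, a 64-byte input keeps its length through each convolution and is halved by each pooling layer. Without padding, a single 1 × 25 convolution followed by pooling would shrink a 64-byte input to 20 values, which is shorter than the filter of a second 1 × 25 layer, so some form of padding appears necessary for this configuration.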

5.4.5. VPN Traffic Classification and Shadowsocks Traffic Detection

The performance of the proposed approach on VPN traffic classification and Shadowsocks traffic detection is evaluated. Shadowsocks and ShadowsocksR are different versions of encrypted proxies. Like ShadowsocksR traffic, the Shadowsocks traffic also comprises the traffic of some popular applications (i.e., Twitter, YouTube, Instagram, Spotify, some blogs, and some network Q & A websites). The VPN traffic consists of seven kinds of traffic (i.e., VPN-Streaming, VPN-P2P, VPN-Browsing, VPN-VoIP, VPN-Chat, VPN-File transfer, and VPN-Email). VPN-Browsing traffic does not correspond to a primary activity: it is generated whenever users perform other activities through a browser. Thus, we excluded VPN-Browsing traffic and classify VPN traffic into the remaining six categories using the spatial characteristics (i.e., the size and direction of the packet) of the flow.

Table 8 displays the evaluation results. In the table, SS stands for Shadowsocks traffic detection. SSR stands for ShadowsocksR traffic detection, and VPN stands for VPN traffic classification. Space represents spatial features of the flow, and Time represents temporal features of the flow.

Table 8. The performance of the approach on VPN traffic classification, ShadowsocksR traffic detection, and Shadowsocks traffic detection (%).
Task | Feature | N | Threshold | Precision | Recall | F1 score
SS | Space | 15 | - | 96.32 | 97.52 | 96.92
SS | Time | 20 | 0.1 s | 98.1 | 98.76 | 98.43
SS | Space & time | 10 & 15 | 0.1 s | 98.96 | 99.62 | 99.29
SSR | Space | 10 | - | 98.96 | 99.52 | 99.24
SSR | Time | 15 | 0.3 s | 98.66 | 98.1 | 98.38
SSR | Space & time | 10 & 15 | 0.3 s | 99.81 | 99.52 | 99.67
VPN | Space | 15 | - | 96.25 | 96.03 | 96.14

As shown in Table 8, the proposed approach can be used for VPN traffic classification and Shadowsocks traffic detection. The approach obtains comparable performance on ShadowsocksR traffic detection and Shadowsocks traffic detection. The approach obtains good performance on VPN traffic classification. These experimental results reveal that the approach has broad applicability.

Furthermore, as exhibited in Tables 5 and 8, in VPN traffic detection, VPN traffic classification, ShadowsocksR traffic detection, and Shadowsocks traffic detection, the method achieves better performance when using the first 10 or 15 packets of a flow. Therefore, it can be inferred that the optimal value of N (the first N packets of a flow) lies in the range [10, 15].

5.5. Performance Comparison

In this section, the proposed approach is compared with other deep learning-based approaches. Concretely, we compare our approach with the approaches in references [5, 11, 21] from three aspects: the size of the transformed images (Imgsize), accuracy, and F1 score.

Like our method, the approaches in references [5, 11, 21] also perform VPN traffic detection on the public dataset ISCX VPN-nonVPN. Both the proposed approach and the approach in reference [21] perform ShadowsocksR traffic detection on the proprietary dataset ShadowsocksR-Ordinary. Thus, we take the results directly from references [5, 11, 21].

As exhibited in Table 9, in VPN traffic detection, the accuracy of the proposed approach is 0.12 percentage points lower than that of the approach in reference [11], 0.29 percentage points lower than that of the approach in reference [5], and 0.27 percentage points lower than that of the approach in reference [21]. However, the transformed images of the proposed approach are much smaller than those of the approaches in references [5, 11]. Therefore, the proposed approach is more efficient than the approaches in references [5, 11]. Under the same circumstances (e.g., the same task and hardware devices), the proposed approach can save a great deal of computational and storage resource overhead.

Table 9. The performance comparison of VPN traffic detection (%).
Methods | Imgsize | Accuracy | Precision | Recall | F1 score
This Paper | 49 B | 99.58 | 98.88 | 100 | 99.44
[11] | 2250 KB | 99.7 | - | - | -
[5] | 1521 B | 99.87 | - | - | -
[21] | 16 B | 99.85 | 99.6 | 100 | 99.8

As exhibited in Table 10, in ShadowsocksR and Shadowsocks traffic detection, the size of the transformed image of the proposed approach is comparable to that of the approach in reference [21]. However, the F1 score of the proposed approach is 1.16 percentage points higher than that of the approach in reference [21] in ShadowsocksR traffic detection and 1.27 percentage points higher in Shadowsocks traffic detection. For both tasks, the proposed approach therefore performs better than the approach in reference [21].

Table 10. The performance comparison of ShadowsocksR and Shadowsocks traffic detections (%).
Task | Methods | Imgsize (B) | Precision | Recall | F1 score
SSR | This Paper | 64 | 99.81 | 99.52 | 99.67
SSR | [21] | 49 | 99.32 | 97.71 | 98.51
SS | This Paper | 64 | 98.96 | 99.62 | 99.29
SS | [21] | 49 | 98.51 | 97.54 | 98.02

In VPN traffic classification, we compare our approach with the approach in reference [5]. Like our approach, the approach in reference [5] also classifies VPN traffic into six categories. As exhibited in Table 11, the accuracy of the proposed approach is 3.11 percentage points higher than that of the approach in reference [5], so the proposed approach performs better on VPN traffic classification.

Table 11. The performance comparison of VPN traffic classification (%).
Methods | Imgsize (B) | Accuracy | Precision | Recall | F1 score
This Paper | 36 | 96.03 | 96.25 | 96.03 | 96.14
[5] | 1521 | 92.92 | - | - | -

Furthermore, the computational resource overhead of different image sizes is compared. Currently, the minimum converted image size of many image-based deep learning approaches is 784 bytes. The maximum converted image size of the proposed method is 64 bytes. Table 12 shows the computational resource overhead of different image sizes in the model training and execution phases. As shown in Table 12, in the same case (e.g., the same hardware devices, task, and detection model), when the size of the converted images is 784 bytes, the model training and execution time is about 761 s. When the size of the converted images is 64 bytes, the model training and execution time is about 89 s. The model training and execution time of the proposed method is at least 88% less than that of many image-based deep learning approaches.

Table 12. Computational resource overhead for different image sizes.
Image size (B) | Model training and execution time (s) | CPU (%) | Memory (MB)
64 | 89 | 85.5 | 132
784 | 761 | 96.5 | 279–1484

For the computational overhead, in the process of model training and execution, when the size of the converted images is 784 bytes, the CPU resource overhead is approximately 96.5% of the total CPU resources, and the memory resource overhead varies greatly, from approximately 279 MB to 1484 MB. When the size of the converted images is 64 bytes, the CPU resource overhead is approximately 85.5% of the total CPU resources, and the memory resource overhead is approximately 132 MB. The CPU resource overhead of the proposed method is therefore about 11 percentage points lower than that of many image-based deep learning approaches, and its memory resource overhead is at least 147 MB lower. Since the proposed method and other image-based deep learning methods use different features and image conversion methods, we do not compare their computational overhead in the image conversion process.
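
The savings quoted above follow directly from the values in Table 12, as the short Python check below illustrates. The helper name is illustrative; all numbers are taken from Table 12.

```python
def pct_reduction(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100 * (baseline - new) / baseline


# Values from Table 12 (784-byte images versus 64-byte images).
time_saving = pct_reduction(761, 89)     # training and execution time, about 88.3%
size_saving = pct_reduction(784, 64)     # image size, about 91.8%
cpu_saving_points = 96.5 - 85.5          # CPU usage, in percentage points
memory_saving_mb = 279 - 132             # against the lower end of 279-1484 MB
```

The image size reduction of about 91.8% is consistent with the statement that the transformed images are at least 90% smaller.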

6. Discussion

We propose a novel method to reduce the computational and storage resource overheads of encrypted proxy traffic detection. Compared with many image-based deep learning methods, the proposed method can save a lot of computing and storage resource overheads. However, the method also has limitations. The approach encodes the size of the packet into the pixel value of the image. This encoding loses a small number of distinguishable characteristics since different packet sizes are encoded into the same pixel value. For example, the packet sizes 54 and 56 are encoded into the same pixel value of 9. In addition, network traffic can also be converted into a nonimage form to input into deep learning algorithms. We leave these problems as open questions for future research.
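
The information loss mentioned above can be made concrete with a small Python sketch. It assumes a linear scaling of the packet size to the pixel range followed by truncation, with the MTU set to the 1460 bytes used in Section 5.4.3; this form is consistent with the example of sizes 54 and 56 but is an assumption and may differ from Equation (1).

```python
def size_to_pixel(size, mtu=1460):
    """Hypothetical size encoding: scale to [0, 255] and truncate.

    Assumption: pixel = floor(255 * size / mtu), capped at 255.
    """
    return min(255, 255 * size // mtu)


def sizes_sharing_pixel(pixel, mtu=1460):
    """All packet sizes in [1, mtu] that collapse onto the given pixel value."""
    return [s for s in range(1, mtu + 1) if size_to_pixel(s, mtu) == pixel]
```

Under this assumed mapping, every size from 52 to 57 bytes is encoded as pixel value 9, so roughly six neighboring sizes become indistinguishable per pixel value.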

7. Conclusion

We propose a novel deep learning-based approach for encrypted proxy traffic detection. The approach is effective and very lightweight. It obtains comparable performance to state-of-the-art approaches. At the same time, the transformed images are at least 90% smaller than those of many existing deep learning-based approaches. Thus, it can greatly reduce the computational and storage resource overhead. The approach is capable of automatically extracting and selecting features. The method can be applied to VPN traffic detection, VPN traffic classification, and ShadowsocksR and Shadowsocks traffic detections. Thus, the approach has broad applicability. The method can detect encrypted proxy traffic in real time. The approach is efficient and can be used for large-scale network traffic analysis tasks.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 61672026).


Data Availability Statement

The data that support the findings of this study are available from the corresponding author.
