Volume 2022, Issue 1, Article ID 9876777
Research Article
Open Access

Classification of Videos Based on Deep Learning

Yinghui Liu (Corresponding Author)

Jiangxi Industry Polytechnic College, Nanchang 330099, China

First published: 06 September 2022
Academic Editor: Yaxiang Fan

Abstract

Automatic classification of videos is a basic task in content archiving and video scene understanding for broadcasters, and temporal modeling is the key to video classification. To address this problem, this paper proposes a new video classification method based on temporal difference networks (TDN), which focuses on capturing multiscale temporal information for effective action classification. The core idea of TDN is to design an effective temporal module by explicitly using the temporal difference operator and to systematically evaluate its impact on short-term and long-term motion modeling. In order to fully capture the temporal information of the entire video, TDN establishes a two-level difference modeling paradigm. For local motion modeling, the temporal difference of consecutive frames is used to provide a more refined motion pattern for the convolutional neural network (CNN). For global motion modeling, the temporal difference across segments is incorporated to capture the long-range structure for motion feature extraction. Experimental results on two public sports video data sets, namely, the UCF Sports data set and the SVW data set, show that the proposed method outperforms several existing methods.

1. Introduction

Due to the enormous viewership of sports events around the world, sports broadcasters produce a large amount of video content. According to statistics, more than half of the world's population (about 3.6 billion people) watched the 2018 Men's Football World Cup, and the global viewership of the 2020 Olympic Games in Japan reached about 4 billion. Similar increases in viewership have occurred in other sports around the world, which makes it extremely challenging to analyze and process such a large amount of video content manually. Therefore, there is an urgent need for effective methods that automatically process and analyze the large number of sports videos appearing in cyberspace. Among them, automatic video classification provides an important technical tool for a series of applications, such as video indexing, browsing, annotation, and retrieval, and improves the efficiency and effectiveness of access to sports video archives.

Since a video is composed of many individual image frames, video processing can be computationally challenging [1, 2]. Therefore, one approach to video classification is simply to treat the frames as single images, classify them individually, and then combine the results into a single output classification for the entire video. Although this idea is intuitive, and a single frame is often enough to distinguish one shot from another, most of the temporal information encoded in the video is discarded. To address this problem, many researchers have classified videos based on visual features, audio features, and other modalities. Literature [3] used two different neural network methods and a texture feature method and compared the results of all three methods, both individually and in combination. In literature [4], Gade et al. used thermal imaging technology to generate heat maps and then used principal component analysis (PCA) and a Fisher linear discriminant to project the heat maps into a low-dimensional space before classification; as reported, this method achieved good results over five categories. On this basis, Gade et al. [5] combined Mel frequency cepstral coefficient (MFCC) audio features with visual motion features to classify sports videos and obtained very good classification results. Literature [6] fused video, audio, and sensor data from sports videos and used a multiclass support vector machine (SVM) for classification. Literature [7] used a hidden Markov model (HMM) to identify and classify events in videos, but the reported results covered only computational time without mentioning accuracy.

In the past ten years, deep learning has been widely applied in computer vision fields such as image and video processing [8], with great breakthroughs. Unlike images, video contains a large amount of temporal information, so many researchers have studied how to represent the temporal information in video. Ji et al. [9] proposed a 3D CNN spanning the spatial and temporal dimensions to extract information about the motion occurring between frames. Meanwhile, literature [10] employed temporal pooling and long short-term memory (LSTM) methods to represent temporal information. Wang et al. [11] proposed a more widely applicable action recognition method that goes beyond specific layer types and model architectures. The method divides each video into multiple short clips, classifies each clip individually, and then aggregates all the predictions to generate a final prediction for the complete video. Literature [12] studied video classification relying on the fusion of temporal information from different frames, using three different strategies: early fusion, late fusion, and slow fusion, and found that slow fusion achieved the best results among these models.

This paper uses the temporal difference network (TDN) to achieve video classification. The proposed method can capture multiscale temporal information and thus model the temporal structure of different videos. The core of TDN is to design an effective temporal module by explicitly using the temporal difference operator and to systematically evaluate its impact on short-term and long-term motion modeling. In order to fully capture the temporal information of the entire video, TDN establishes a two-level difference modeling paradigm. For local motion modeling, the temporal difference over consecutive frames is used to provide a fine motion pattern for the CNN. For global motion modeling, the temporal difference across segments is combined to capture the long-term structure for motion feature excitation. Experiments on two public data sets show that the classification performance of the proposed method exceeds the state of the art.

2. Algorithm Principle

The TDN-based method proposed in this paper uses the entire video to learn the video action model. Owing to the limitation of GPU memory, and following the temporal segment networks (TSN) framework [13], a sparse and holistic sampling strategy is adopted for each video. Different from the TSN method, the TDN proposed in this paper mainly uses the temporal difference operator in the network design to explicitly capture short-term and long-term motion information. In order to improve the efficiency of the algorithm, residual connections are incorporated into the main network to complete the motion supplement within the local window and the motion enhancement across different segments. Specifically, each video is first divided into T nonoverlapping segments of equal duration. Next, a frame is randomly sampled from each segment, yielding a total of T frames I = [I1, I2, ⋯, IT] ∈ ℝ^(T×C×H×W). These frames are input to a CNN to extract frame-level features F = [F1, F2, ⋯, FT], in which Fi denotes the hidden-layer feature representation of frame Ii. The purpose of the short-time difference module is to provide local motion information for these frame-wise representations in the early layers to improve their representation ability:
$\hat{F}_i = F_i + \mathcal{S}(I_i), \quad i = 1, 2, \dots, T$ (1)
where $\hat{F}_i$ represents the enhanced representation produced by the temporal difference module and $\mathcal{S}(\cdot)$ represents the short-time difference module, which extracts the local motion of Ii from its surrounding adjacent frames. The long-time difference module is used to enhance the frame-level feature representation by exploiting the cross-segment temporal structure:
$\tilde{F}_i = \hat{F}_i + \mathcal{L}(\hat{F}_{i-1}, \hat{F}_i, \hat{F}_{i+1})$ (2)
where $\mathcal{L}(\cdot)$ represents the long-time difference module. Since only adjacent segment-level information is considered, each long-time difference module performs long-term modeling over a limited temporal window; by stacking multiple long-time difference modules, the long-range temporal structure can be captured.
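To make the sampling step concrete, the following is a minimal sketch of the sparse, holistic TSN-style sampling described above, assuming the video is already decoded into an indexable sequence of frames; the function name sample_segment_frames and its defaults are illustrative and not taken from the original implementation.

```python
import numpy as np

def sample_segment_frames(num_frames, num_segments=16, training=True):
    """Sparse TSN-style sampling: split the frame range into equal,
    nonoverlapping segments and pick one frame index per segment."""
    edges = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    indices = []
    for start, end in zip(edges[:-1], edges[1:]):
        end = max(end, start + 1)  # guard against very short videos
        if training:
            indices.append(np.random.randint(start, end))  # random frame per segment
        else:
            indices.append((start + end) // 2)             # deterministic center frame for testing
    return np.asarray(indices)

# Example: sample T = 16 frame indices from a 250-frame video.
print(sample_segment_frames(250, num_segments=16, training=True))
```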

2.1. Short-Time Difference Module

Adjacent frames in a video are very similar within a local time window, and directly stacking multiple frames for subsequent processing is inefficient. On the other hand, sampling a single frame from each window captures appearance information but cannot capture local motion information. Therefore, the short-time difference module supplies a single RGB frame with temporal difference information to produce an effective video representation that simultaneously encodes appearance and motion.

Specifically, the short-time difference module performs low-level feature extraction in the first few layers of the neural network and enables a single RGB frame to capture local motion by fusing temporal difference information. For each sampled frame Ii, the RGB differences within a local window are extracted and stacked along the channel dimension, D(Ii) = [D−2, D−1, D1, D2]. Accordingly, the effective form of the short-time difference module can be expressed as
$\mathcal{S}(I_i) = \mathrm{Upsample}\big(\mathrm{CNN}\big(\mathrm{AvgPool}(D(I_i))\big)\big)$ (3)
where D(Ii) represents the stacked RGB differences around frame Ii and CNN denotes a stage-specific neural network. In order to maintain efficiency, a lightweight CNN module is designed to process the stacked RGB differences D(Ii). It follows a low-resolution processing strategy: (1) average pooling downsamples the RGB difference to half resolution; (2) a CNN extracts motion features; (3) the motion features are upsampled to match the RGB features. This works because the RGB difference takes very small values in most areas and has high responses only where motion is significant, so a low-resolution architecture is sufficient for this sparse signal without losing much accuracy. The output of the short-time difference module is merged with the single RGB frame, so that the original frame-level representation is aware of the motion pattern and can better describe the local time window. This fusion is achieved through a lateral connection, and the fusion connection of the short-time difference module is attached to the frame-level representation at each early stage. In practice, the residual connection is also compared with other fusion strategies.
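As a rough illustration of this low-resolution processing path (stack the RGB differences, downsample, apply a lightweight CNN, upsample, and fuse laterally), the following TensorFlow 2 sketch mirrors the steps described above; the channel sizes, kernel sizes, and function names are assumptions for illustration, not the paper's exact configuration.

```python
import tensorflow as tf

def stacked_rgb_diff(clip):
    """clip: [5, H, W, 3] frames (I_{i-2}, ..., I_{i+2}) around a sampled frame.
    Returns D(I_i): the four RGB differences stacked on the channel axis, [H, W, 12]."""
    center = clip[2]
    diffs = [clip[j] - center for j in (0, 1, 3, 4)]   # D_-2, D_-1, D_1, D_2
    return tf.concat(diffs, axis=-1)

def short_term_tdm(frame_feat, rgb_diff):
    """Fuse low-resolution motion features with the frame-level RGB features.
    frame_feat: [B, H, W, C] early-stage features of the single RGB frame.
    rgb_diff:   [B, H, W, 12] stacked RGB differences D(I_i) at the same resolution."""
    x = tf.keras.layers.AveragePooling2D(pool_size=2)(rgb_diff)              # (1) downsample to half
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)  # (2) lightweight CNN
    x = tf.keras.layers.Conv2D(frame_feat.shape[-1], 3, padding="same")(x)
    x = tf.keras.layers.UpSampling2D(size=2)(x)                              # (3) upsample to match
    return frame_feat + x                                                    # lateral/residual fusion

# Example with dummy tensors: a 56x56 feature map with 64 channels.
print(stacked_rgb_diff(tf.random.normal([5, 56, 56, 3])).shape)  # (56, 56, 12)
feat = tf.random.normal([2, 56, 56, 64])
diff = tf.random.normal([2, 56, 56, 12])
print(short_term_tdm(feat, diff).shape)                          # (2, 56, 56, 64)
```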

2.2. Long-Time Difference Module

The frame-wise representation produced by the short-time difference module is very effective at capturing spatio-temporal information within a local segment (window). However, this representation has a limited temporal receptive field, so it cannot explore the long-term temporal structure of the learned action model. Therefore, the long-time difference module attempts to use cross-segment information to enhance the original representation through a new bidirectional and multiscale temporal difference module. Besides efficiency, the lack of spatial alignment between temporally distant frames is another problem that needs to be addressed. Therefore, this method designs a multiscale architecture that smooths the features over a large receptive field before the difference calculation. The channel dimension is first compressed by a ratio through convolution to improve efficiency, and the aligned temporal difference is calculated between adjacent segments:
$C(F_i, F_{i+1}) = F_{i+1} - \mathrm{Conv}(F_i)$ (4)
where $C(F_i, F_{i+1})$ represents the aligned temporal difference between adjacent segments and $\mathrm{Conv}(\cdot)$ is a channel-wise convolution used for spatial smoothing, thereby alleviating the misalignment problem. Then, long-term motion information is extracted from the aligned temporal difference through the multiscale module:
$M_i = \sum_{n=1}^{N} \mathrm{Branch}_n\big(C(F_i, F_{i+1})\big)$ (5)
In Equation (5), different spatial scales are designed to extract motion information from different receptive fields; in practice, N = 3. The fusion of these scales makes the module more robust to the misalignment problem. In terms of implementation, it involves three branches: (1) a shortcut connection, (2) a 3 × 3 convolution, and (3) average pooling. Finally, the bidirectional cross-segment temporal difference is used to enhance the frame-level features, as shown below:
$\tilde{F}_i = F_i + F_i \odot A_i$ (6)
where ⊙ denotes element-wise multiplication and $A_i$ is the attention map computed from the forward and backward aligned temporal differences.
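The following TensorFlow 2 sketch illustrates one plausible realization of the long-time difference module as described above: channel compression, a channel-wise (depthwise) convolution for spatial smoothing before the difference, a three-branch multiscale aggregation, and a sigmoid attention map applied bidirectionally. The compression ratio, branch weighting, and layer choices are assumptions for illustration, not the authors' exact implementation.

```python
import tensorflow as tf

def aligned_difference(f_i, f_next, ratio=4):
    """Aligned temporal difference between adjacent segment features [B, H, W, C]."""
    c = f_i.shape[-1] // ratio
    f_i_c = tf.keras.layers.Conv2D(c, 1)(f_i)            # compress channels by the ratio
    f_next_c = tf.keras.layers.Conv2D(c, 1)(f_next)
    smoothed = tf.keras.layers.DepthwiseConv2D(3, padding="same")(f_i_c)  # channel-wise spatial smoothing
    return f_next_c - smoothed                            # C(F_i, F_{i+1})

def multiscale(diff):
    """Three branches: shortcut, 3x3 convolution, and average pooling."""
    b1 = diff
    b2 = tf.keras.layers.Conv2D(diff.shape[-1], 3, padding="same")(diff)
    b3 = tf.keras.layers.AveragePooling2D(pool_size=3, strides=1, padding="same")(diff)
    return (b1 + b2 + b3) / 3.0

def long_term_tdm(f_prev, f_i, f_next):
    """Enhance F_i with an attention map built from forward and backward differences."""
    fwd = multiscale(aligned_difference(f_i, f_next))
    bwd = multiscale(aligned_difference(f_i, f_prev))
    attn = tf.keras.layers.Conv2D(f_i.shape[-1], 1, activation="sigmoid")((fwd + bwd) / 2.0)
    return f_i + f_i * attn                               # Equation (6): residual attention enhancement

# Example with dummy segment-level features.
f = [tf.random.normal([2, 14, 14, 256]) for _ in range(3)]
print(long_term_tdm(f[0], f[1], f[2]).shape)              # (2, 14, 14, 256)
```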

The proposed method also combines the original frame-level representation with the enhanced representation through a residual connection. Note that Equation (2) differs slightly from the short-time difference module: the difference representation is used as an attention map to enhance the frame-level features, based on the observation that attention-style modeling is more effective in the later stages of the CNN. In practice, the residual connection is also compared with other fusion strategies.

In summary, the TDN-based method proposed in this paper builds on the sparse sampling of TSN, operating on a series of frames evenly distributed across the entire video. TDN provides a two-level motion modeling mechanism that captures temporal information in a local-to-global manner. In particular, the short-time difference module is inserted in the early stages to perform finer, low-level motion extraction, and the long-time difference module is inserted in the later stages to model coarser, high-level temporal structure. This method adopts the residual network [14] as its main structure: the first two stages use the short-time difference module to extract short-term information, and the latter three stages are equipped with the long-time difference module to capture the cross-segment temporal structure. To improve computational efficiency, two measures are adopted. For local motion modeling, a residual connection is added between the short-time difference module in the first and second stages and the main network. For long-term motion modeling, a long-time difference module is added to each residual block.
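To make the stage placement concrete, here is a simplified, self-contained TensorFlow 2 schematic of how the two module types could be arranged around a backbone: short-term fusion in the early stages and cross-segment attention enhancement in the later stages, followed by segment averaging for the video-level prediction. This is a toy two-stage stand-in for the ResNet50 backbone, with made-up layer sizes and a made-up function name (tdn_like_backbone), intended only to show the data flow.

```python
import tensorflow as tf

def tdn_like_backbone(frames, rgb_diffs, num_classes=10):
    """frames:    [B, T, H, W, 3]  one sampled RGB frame per segment.
    rgb_diffs: [B, T, H, W, 12] stacked RGB differences per segment.
    Returns video-level class probabilities [B, num_classes]."""
    T = frames.shape[1]
    feats = []
    for t in range(T):
        # Early stages: per-frame appearance features plus short-term motion fusion.
        x = tf.keras.layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(frames[:, t])
        m = tf.keras.layers.AveragePooling2D(2)(rgb_diffs[:, t])
        m = tf.keras.layers.Conv2D(64, 3, padding="same")(m)
        x = x + m                                                    # short-time difference fusion
        x = tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
        feats.append(x)
    outputs = []
    for t in range(T):
        # Later stages: cross-segment enhancement using neighboring segments.
        prev_f, next_f = feats[max(t - 1, 0)], feats[min(t + 1, T - 1)]
        diff = (next_f - feats[t]) + (prev_f - feats[t])
        attn = tf.keras.layers.Conv2D(128, 1, activation="sigmoid")(diff)
        x = feats[t] + feats[t] * attn                               # long-time difference enhancement
        outputs.append(tf.keras.layers.GlobalAveragePooling2D()(x))
    video_feat = tf.reduce_mean(tf.stack(outputs, axis=1), axis=1)   # consensus over segments
    return tf.keras.layers.Dense(num_classes, activation="softmax")(video_feat)

# Example: 4 segments of 112x112 frames, 10 sport classes.
probs = tdn_like_backbone(tf.random.normal([2, 4, 112, 112, 3]),
                          tf.random.normal([2, 4, 112, 112, 12]))
print(probs.shape)   # (2, 10)
```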

3. Experiment and Discussion

In this section, the effectiveness of the proposed method is verified and compared with other existing methods. First, the two public data sets used for evaluation are briefly introduced. Then, the experimental details are given. Finally, the experimental results are analyzed.

3.1. Experimental Data

Two data sets are used in the experiments, and their basic descriptions are given in Table 1. The UCF Sports Action data set [15] is composed of a set of actions collected from various sports that are typically broadcast on television channels, with video sequences obtained from various stock footage websites. The data set contains the following 10 sports: diving (14 videos), golf swing (18 videos), kicking (20 videos), weightlifting (6 videos), horse riding (12 videos), running (13 videos), skateboarding (12 videos), swing bench (20 videos), swing side (13 videos), and walking (22 videos); sample frames are shown in Figure 1. The data set includes a total of 150 sequences with a resolution of 720 × 480 at 10 fps. The available annotations are bounding boxes for action localization and class labels for activity recognition. In addition, the data set provides annotations collected from human observers.

Table 1. Basic descriptions of the two data sets.
Data set Number of sports Number of videos Resolution
UCF 10 150 720 × 480
SVW 30 4200 480 × 272 to 1280 × 720
Figure 1: Sample frames from the UCF Sports Action and SVW data sets.

In addition, this paper also uses the SVW (Sports Videos in the Wild) data set for experiments. The data set consists of 4200 videos captured with the Coach's Eye smartphone application, a sports training application developed by TechSmith. The data set includes 30 types of sports and 44 different actions. Compared with the UCF Sports Action data set, this data set is more complex: most of the videos are amateur sports videos, and the shooting is not as professional as broadcast footage. First, the static image context has low discriminative power for classification. Second, cluttered backgrounds and everyday environments make unconstrained sports video classification difficult. In addition, nonprofessional shooting by amateur users brings further challenges, such as extreme camera shake, incorrect camera movement, occlusion by spectators, judges, and fences due to improper camera placement, and unusual viewing angles. Some examples of the SVW data set can also be observed in Figure 1.

3.2. Experimental Setup

In the experiments, the proposed method uses the ResNet50 network to implement the temporal difference modules and samples T = 16 frames from each video. During training, each video frame is resized along its shorter side and randomly cropped to 224 × 224. The network is pretrained on the ImageNet data set. The batch size is 128, the initial learning rate is 0.001, and the number of iterations is set to 100. When the performance on the validation set saturates, the learning rate is reduced to 0.0001. For testing, the shorter side of each video is resized to 256, and only the 224 × 224 center crop of a single clip is used for evaluation. The hardware environment of the experiments is an Intel Core i7-10700 2.9 GHz CPU, an NVIDIA GeForce RTX 2080 Ti GPU (11 GB video memory), and 32 GB RAM; the computing platform is Python 3.7 and TensorFlow 2.0.
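For reference, the following TensorFlow 2 snippet sketches the reported input pipeline and optimization settings (224 × 224 crops, batch size 128, initial learning rate 0.001 reduced to 0.0001 when validation performance saturates). The optimizer choice, the plateau patience, the use of 256 as the training resize target, and the treatment of the 100 iterations as epochs are assumptions made for illustration; train_ds and val_ds are assumed to be prebuilt tf.data datasets of (frame, label) pairs.

```python
import tensorflow as tf

def preprocess(frame, training=True):
    """Resize along the shorter side, then take a 224 x 224 crop
    (random crop for training, center crop for testing)."""
    h = tf.cast(tf.shape(frame)[0], tf.float32)
    w = tf.cast(tf.shape(frame)[1], tf.float32)
    scale = 256.0 / tf.minimum(h, w)                     # assumed resize target
    new_size = tf.cast([h * scale, w * scale], tf.int32)
    frame = tf.image.resize(frame, new_size)
    if training:
        frame = tf.image.random_crop(frame, [224, 224, 3])
    else:
        frame = tf.image.resize_with_crop_or_pad(frame, 224, 224)   # center crop
    return frame / 255.0

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)   # optimizer is an assumption
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_accuracy", factor=0.1, patience=5, min_lr=0.0001)       # drop LR when validation saturates

# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds.batch(128), validation_data=val_ds.batch(128),
#           epochs=100, callbacks=[reduce_lr])
```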

3.3. Experimental Results

For the UCF Sports Action data set, 75% of the video clips are used for training and the remaining 25% for testing. To demonstrate the effectiveness of the proposed method for video classification, this paper compares it with some existing methods, including both handcrafted-feature methods and deep learning methods: Wang et al. [16], Le et al. [17], Kovashka et al. [18], dense trajectories [19], Weinzaepfel et al. [20], SGSH [21], snippets [22], and two-stream LSTM [23]. Among them, the latter two are deep learning methods, and the others are handcrafted-feature methods.

Table 2 gives the classification accuracy for each sport in the UCF data set, and Table 3 compares the results of the various methods. It can be observed that the deep learning methods perform better than the handcrafted-feature methods, and the method proposed in this paper achieves the best classification result of 99.3%, exceeding the two-stream LSTM [23] method by 0.4%. This is mainly because the TDN used in this paper better describes the temporal structure of sports videos. From the per-class results in Table 2, most sports in this data set reach a classification accuracy of 100%, except for golf, skateboarding, and walking. The confusion among these classes arises because they all contain the same action, namely walking.

Table 2. Classification accuracy for each class of the UCF Sports data set.
Class Accuracy
Diving 100%
Golf 89.7%
Skate boarding 93.1%
Swing bench 100%
Kicking 100%
Lifting 100%
Riding horse 100%
Running 100%
Swing side 100%
Walking 91.1%
Table 3. Comparison with state-of-the-art methods (UCF data set).
Method Mean Acc.
Wang et al. [16] 85.6%
Le et al. [17] 86.5%
Weinzaepfel et al. [20] 90.5%
SGSH [21] 90.9%
Snippets [22] 97.8%
Two stream LSTM [23] 98.9%
Proposed 99.3%

For the SVW data set, the experiments adopt the same training/testing configuration described in [24], which defines three different training/testing splits. The experimental comparison is shown in Table 4. From Table 4, it can be seen that the proposed method clearly outperforms the motion-assisted and context-based baselines [24] (motion features, HOG features, and an SVM classifier) provided with the original data set, and it also achieves higher accuracy than the CNN-based methods [25, 26]. In terms of individual classes, the accuracy of the "running" category is the worst; most errors occur when the classifier misclassifies "running" videos as "long jump". These errors arise because long jump and running share many of the same actions, especially during the long jumper's run-up.

Table 4. Comparison with state-of-the-art methods (SVW data set).
Method Test I Test II Test III Mean Acc.
Motion-assisted [24] — — — 39.1%
Context-based [24] — — — 37.8%
Combined CNN [25] 81.9% 82.1% 83.4% 82.5%
RWRS [26] 84.5% 84.3% 85.3% 84.4%
Proposed 87.5% 88.5% 86.3% 86.8%

4. Conclusion

This paper proposes a video classification method based on TDN, which learns sports action models from the entire video. The core of the temporal difference network is to generalize the temporal difference operator into an efficient, general-purpose temporal module with a specific design, which captures short-term and long-term temporal information in the video. Experimental results on two public data sets show that the proposed temporal difference method performs better than previous methods. In future work, the temporal difference network will be further improved as a replacement for the 3D CNNs commonly used in video modeling for video classification.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Data Availability

The data sets used in this paper can be accessed upon request.
