Classification of Videos Based on Deep Learning
Abstract
Automatic classification of videos is a fundamental task in content archiving and video scene understanding for broadcasters, and temporal modeling is the key to video classification. To address this problem, this paper proposes a new video classification method based on the temporal difference network (TDN), which focuses on capturing multiscale temporal information for effective action classification. The core idea of TDN is to design an efficient temporal module by explicitly using the temporal difference operator and to systematically evaluate its impact on short-term and long-term motion modeling. To fully capture the temporal information of the entire video, TDN establishes a two-level difference modeling paradigm. For local motion modeling, temporal differences over consecutive frames provide finer motion patterns for a convolutional neural network (CNN). For global motion modeling, temporal differences across segments capture the long-range structure for motion feature extraction. Experimental results on two public sports video data sets, namely, the UCF sports data set and the SVW field sports data set, demonstrate that the proposed method outperforms several existing methods.
1. Introduction
Due to the massive viewership of sports events around the world, sports broadcasters produce a large amount of video content. According to statistics, more than half of the world's population (about 3.6 billion people) watched the 2018 Men's Football World Cup, and the global viewership of the 2020 Olympic Games in Japan reached about 4 billion. Similar growth in viewership has occurred in other sports worldwide, which makes manually analyzing and processing such a large amount of video content extremely challenging. There is therefore an urgent need to develop effective methods to automatically process and analyze the large number of sports videos appearing in cyberspace. Among them, automatic video classification provides an important technical tool for a series of applications, such as video indexing, browsing, annotation, and retrieval, and it improves the efficiency and effectiveness of access to sports video archives.
Since a video is composed of many single image frames, video processing can be computationally challenging [1, 2]. One approach to video classification is therefore to treat each frame as a single image, classify it, and then combine the per-frame results into a single classification of the entire video. Although this idea is intuitive, and a single frame is often enough to distinguish certain shots, most of the temporal information encoded in the video is discarded. To address this problem, many researchers have classified videos based on visual features, audio features, and other modalities. Literature [3] applied two different neural network methods and a texture feature method and combined them to compare the results of all three. In literature [4], Gade et al. used thermal imaging to generate heat maps, projected the heat maps into a low-dimensional space with principal component analysis (PCA) and a Fisher linear discriminant, and then classified them; as reported, this method achieved good results on five categories. On this basis, Gade et al. [5] combined the Mel-frequency cepstral coefficient (MFCC) features of audio with visual motion features to classify sports videos and obtained very good classification results. Literature [6] fused the video, audio, and sensor data in sports videos and used a multiclass support vector machine (SVM) for classification. Literature [7] used a hidden Markov model (HMM) to identify and classify events in videos, but the results only reported computation time without mentioning accuracy.
In the past ten years, deep learning has been widely applied in computer vision fields such as image and video processing [8], with great breakthroughs. Unlike images, video contains rich temporal information, so many researchers have studied how to represent the temporal information in video. Ji et al. [9] proposed a 3D CNN spanning the spatial and temporal dimensions to extract information about the motion occurring between frames. Meanwhile, literature [10] employed temporal pooling and long short-term memory (LSTM) to represent temporal information. Wang et al. [11] proposed a more widely applicable action recognition method that goes beyond specific layer types and model architectures: each video is divided into multiple blocks, each block is classified as a micro video, and the predictions of all blocks are aggregated to produce a final prediction for the complete video. Literature [12] studied video classification relying on the fusion of temporal information from different frames using three strategies, early fusion, late fusion, and slow fusion, and found that slow fusion achieved the best results among these models.
This paper uses the temporal difference network (TDN) to achieve video classification. The proposed method can capture multiscale temporal information and thus describe the temporal structure of different videos. The core of TDN is to design an effective temporal module by explicitly using the temporal difference operator and to systematically evaluate its impact on short-term and long-term motion modeling. To fully capture the temporal information of the entire video, TDN establishes a two-level difference modeling paradigm. For local motion modeling, temporal differences over consecutive frames provide fine motion patterns for the CNN. For global motion modeling, temporal differences across segments capture the long-term structure for motion feature excitation. Experiments on two public data sets show that the classification performance of the proposed method exceeds the state of the art.
2. Algorithm Principle
2.1. Short-Time Difference Module
Adjacent frames in a video are highly similar within a local time window, so directly stacking multiple frames for subsequent processing is inefficient. On the other hand, sampling a single frame from each window extracts appearance information but cannot capture local motion. The short-time difference module therefore supplements a single RGB frame with temporal differences, producing an efficient video representation that encodes appearance and motion information simultaneously.
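As a concrete illustration, the following minimal TensorFlow sketch shows one way such a module could be realized; the layer names, channel width, and fusion by element-wise addition are illustrative assumptions, not the exact TDN design.

```python
import tensorflow as tf

class ShortTermDifference(tf.keras.layers.Layer):
    """Illustrative sketch of short-term difference modeling; channel
    width and fusion scheme are assumptions, not the exact TDN design."""

    def __init__(self, channels=64):
        super().__init__()
        self.motion_conv = tf.keras.layers.Conv2D(
            channels, 3, padding="same", activation="relu")
        self.appear_conv = tf.keras.layers.Conv2D(
            channels, 3, padding="same", activation="relu")

    def call(self, frames):
        # frames: (batch, T, H, W, 3), the consecutive RGB frames of one
        # local window around a sampled centre frame
        diffs = frames[:, 1:] - frames[:, :-1]  # (batch, T-1, H, W, 3)
        # Stack the T-1 difference maps along the channel axis so a 2D
        # convolution can turn raw differences into motion features
        motion = tf.concat(tf.unstack(diffs, axis=1), axis=-1)
        motion_feat = self.motion_conv(motion)
        # Appearance features come from the single centre frame; adding
        # the motion features yields a joint appearance + motion encoding
        center = frames[:, frames.shape[1] // 2]
        return self.appear_conv(center) + motion_feat
```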
2.2. Long-Time Difference Module
The proposed method also retains the original frame-level representation and enhances it through a residual connection. Slightly differently from the short-time difference module, the difference representation here is used as an attention map to enhance the frame-level features. This design is based on the observation that attention models are more effective in the later stages of a CNN. In practice, the residual connection is also compared with other fusion strategies.
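The sketch below illustrates this attention-style enhancement; the single conv + sigmoid attention path is an assumed simplification for clarity, not the module's exact multi-scale design.

```python
import tensorflow as tf

class LongTermDifference(tf.keras.layers.Layer):
    """Illustrative sketch of the long-term (cross-segment) difference
    module; the conv + sigmoid attention path is an assumed
    simplification of the actual design."""

    def __init__(self, channels):
        super().__init__()
        self.attn_conv = tf.keras.layers.Conv2D(channels, 3, padding="same")

    def call(self, feat_cur, feat_next):
        # feat_cur, feat_next: (batch, H, W, C) features of two
        # neighbouring segments at the same network stage
        diff = feat_next - feat_cur
        # The cross-segment difference yields an attention map in [0, 1]
        attn = tf.sigmoid(self.attn_conv(diff))
        # Residual connection: the original frame-level features are
        # kept and only enhanced by the attended difference signal
        return feat_cur + feat_cur * attn
```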
In summary, the proposed TDN-based method builds on the sparse sampling of TSN, operating on a series of frames evenly distributed across the entire video. TDN provides a two-level motion modeling mechanism that captures temporal information in a local-to-global manner. In particular, the short-time difference module is inserted in the early stages to perform finer, low-level motion extraction, and the long-time difference module is inserted in the later stages to model coarser, higher-level temporal structure. The method adopts the residual network [14] as its main structure: the first two stages use the short-time difference module to extract short-term information, while the latter three stages are equipped with the long-time difference module to capture the cross-segment temporal structure. To improve computational efficiency, two measures are adopted. For local motion modeling, a residual connection is added between the short-time difference module of the first two stages and the main network. For long-term motion modeling, a long-time difference module is added to each residual block.
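Reusing the two sketch layers above, the overall local-to-global placement could be wired up as follows; the backbone here is a stripped-down stand-in for the real ResNet50 [14], and feeding the same tensor as both "current" and "next" segment feature is a stub that merely keeps the sketch self-contained and runnable.

```python
import tensorflow as tf

def build_tdn_sketch(num_classes=10, window=5):
    """Skeleton of the two-level design described above; a simplified
    stand-in for the actual ResNet50 backbone, not the real TDN."""
    frames = tf.keras.Input(shape=(window, 224, 224, 3))
    # Early stages: the short-term module supplies fine, low-level motion
    x = ShortTermDifference(channels=64)(frames)
    x = tf.keras.layers.Conv2D(128, 3, strides=2, padding="same",
                               activation="relu")(x)
    # Later stages: each block is followed by a long-term module; the
    # "next segment" input is stubbed with the same tensor here purely
    # to avoid carrying a full segment pipeline in the sketch
    for ch in (256, 512, 1024):
        x = tf.keras.layers.Conv2D(ch, 3, strides=2, padding="same",
                                   activation="relu")(x)
        x = LongTermDifference(ch)(x, x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(frames, out)
```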
3. Experiment and Discussion
In this section, the effectiveness of the proposed method is verified and compared with other existing methods. First, the two public data sets used for evaluation are briefly introduced. Then, the experimental details are given. Finally, the experimental results are analyzed.
3.1. Experimental Data
Two data sets are used in the experiments, and their basic descriptions are given in Table 1. The UCF Sports Action data set [15] is composed of a set of actions collected from various sports that are typically broadcast on television channels, with video sequences obtained from various stock footage websites. The data set contains the following 10 sports: diving (14 videos), golf swing (18 videos), kicking (20 videos), weightlifting (6 videos), horse riding (12 videos), running (13 videos), skateboarding (12 videos), swing bench (20 videos), swing side (13 videos), and walking (22 videos); sample frames are shown in Figure 1. The data set includes a total of 150 sequences at a resolution of 720 × 480 and 10 fps. The available annotations are bounding boxes for action localization and class labels for activity recognition. In addition, the data set also provides annotations collected from human observers.
Table 1: Basic description of the two data sets.

| Data set | Number of sports | Number of videos | Resolution |
|---|---|---|---|
| UCF | 10 | 150 | 720 × 480 |
| SVW | 30 | 4200 | 480 × 272 to 1280 × 720 |

In addition, this paper also uses the SVW field sports data set for experiments. The data set consists of 4200 videos captured with the Coach's Eye smartphone application, a sports training app developed by TechSmith. The data set covers 30 sports and 44 different actions. Compared with the UCF sports action data set, this data set is more complex: most of the videos are amateur sports videos, and the shooting is not as professional as broadcast television. First, the static image context offers little discriminative power for classification. Second, cluttered backgrounds and everyday environments also make unconstrained sports video classification difficult. Furthermore, nonprofessional shooting by amateur users brings additional challenges, such as extreme camera shake, incorrect camera movement, occlusion by the audience, judges, and fences due to improper camera placement, and unusual viewing angles. Some examples of the SVW field sports data set can also be seen in Figure 1.
3.2. Experimental Setup and Evaluation Indicators
In the experiments, the method uses the ResNet50 network to implement the temporal difference modules and samples T = 16 frames from each video. During training, each frame is resized along its shorter side and randomly cropped to 224 × 224. The network is pretrained on the ImageNet data set. The batch size is 128, the initial learning rate is 0.001, and the number of iterations is set to 100; when performance on the validation set saturates, the learning rate is reduced to 0.0001. For testing, the shorter side of each video is resized to 256, and only the 224 × 224 center crop of a single clip is used for evaluation. The hardware environment of the whole experiment is an Intel Core i7-10700 2.9 GHz CPU, an NVIDIA GeForce RTX 2080 Ti GPU (11 GB video memory), and 32 GB of RAM; the computing platform is Python 3.7 and TensorFlow 2.0.
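A minimal sketch of this preprocessing in TensorFlow 2 is shown below; the training-time resize target of 256 is an assumption (the paper only specifies it for testing), and the function names are illustrative.

```python
import tensorflow as tf

def resize_shorter_side(frames, target):
    """Resize a clip of shape (T, H, W, 3) so its shorter side == target."""
    shape = tf.shape(frames)
    h = tf.cast(shape[1], tf.float32)
    w = tf.cast(shape[2], tf.float32)
    scale = tf.cast(target, tf.float32) / tf.minimum(h, w)
    new_hw = tf.cast(tf.round(tf.stack([h, w]) * scale), tf.int32)
    return tf.image.resize(frames, new_hw)

def preprocess_train(frames, crop=224):
    # Resize along the shorter side (256 is an assumed training value),
    # then take one random 224 x 224 crop shared by the whole clip
    frames = resize_shorter_side(frames, 256)
    t = tf.shape(frames)[0]
    return tf.image.random_crop(frames, tf.stack([t, crop, crop, 3]))

def preprocess_test(frames, crop=224):
    # Shorter side to 256, then a single centre crop, as stated above
    frames = resize_shorter_side(frames, 256)
    shape = tf.shape(frames)
    top = (shape[1] - crop) // 2
    left = (shape[2] - crop) // 2
    return frames[:, top:top + crop, left:left + crop, :]
```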
3.3. Experimental Results
For the UCF sports action data set, 75% of the video clips are used for training, and the remaining 25% are used for testing. To demonstrate the effectiveness of the proposed method for video classification, this paper compares it with some existing methods, covering both handcrafted feature methods and deep learning methods: Wang et al. [16], Le et al. [17], Kovashka et al. [18], dense trajectories [19], Weinzaepfel et al. [20], SGSH [21], snippets [22], and two-stream LSTM [23]. Among them, the latter two are deep learning methods, and the others are handcrafted feature methods.
Table 2 gives the classification accuracy for each sport in the UCF data set, and Table 3 shows the classification results of the various methods. It can be observed that the deep learning methods perform better than the handcrafted feature methods, and the method proposed in this paper achieves the best classification result of 99.3%, exceeding the two-stream LSTM [23] method by 0.5%. This is mainly because the TDN used in this paper better describes the temporal structure in sports videos. From the per-class results in Table 2, most sports in this data set are classified with 100% accuracy, except for golf, skateboarding, and walking. These sports videos are confused because they all contain the same action, namely, the action of walking.
Table 2: Per-class classification accuracy on the UCF sports data set.

| Class | Accuracy |
|---|---|
| Diving | 100% |
| Golf | 89.7% |
| Skateboarding | 93.1% |
| Swing bench | 100% |
| Kicking | 100% |
| Lifting | 100% |
| Riding horse | 100% |
| Running | 100% |
| Swing side | 100% |
| Walking | 91.1% |
For the SVW data set, the experiments adopt the same training/testing configuration described in [24], which provides 3 different training/testing splits. The experimental comparison is shown in Table 4. From Table 4, it can be seen that the method proposed in this paper outperforms the motion-based method [24] (motion-based features, HOG features, and an SVM classifier) provided with the original data set, and its performance is 27% higher than that of the CNN method [25, 26]. Specifically, Table 4 shows that the accuracy of the "running" category is the worst; most errors occur when the classifier misclassifies "running" videos as the "long jump" category. These errors arise because long jump and running share many of the same actions, especially during the long jumper's run-up.
4. Conclusion
This paper proposes a video classification method based on TDN, which learns sports action models from the entire video. The core of the temporal difference network is to generalize the temporal difference operator into an efficient, general-purpose temporal module with a dedicated design for capturing both short-term and long-term temporal information in video. As the experimental results on two public data sets show, the proposed temporal difference method performs better than previous methods. In future work, the temporal difference network will be further improved to replace the 3D CNNs commonly used in video modeling for video classification.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Data Availability
The data sets used in this paper can be accessed upon request.