Abstract
Human action recognition, one of the most important topics in computer vision, has been researched extensively over the last few decades owing to its diverse potential applications. However, it remains a challenging task, especially in realistic scenarios. The main challenge lies in designing a human action representation that is sufficiently descriptive yet computationally efficient. Local and holistic representations have both been studied extensively for human action recognition, and both achieve state-of-the-art performance on commonly used benchmarks. In this article, we provide an introduction to human action recognition and a comprehensive review of recent progress in both local and holistic action representations. In addition, we describe the widely used benchmark human action datasets on which action recognition methods are evaluated and compared.