Abstract
Human action recognition, one of the most important topics in computer vision, has been researched extensively over the last few decades owing to its diverse potential applications. However, it remains a challenging task, especially in realistic scenarios. The main challenge lies in designing a human action representation that is sufficiently descriptive yet computationally efficient. Local and holistic representations have both been studied extensively for human action recognition, and both achieve state-of-the-art performance on commonly used benchmarks. In this article, we provide an introduction to human action recognition and a comprehensive review of recent progress in both local and holistic action representations. In addition, we describe the widely used benchmark human action datasets on which action recognition methods are evaluated and compared.