3D convolutional neural network-based one-stage model for real-time action detection in video of construction equipment
Seunghoon Jung, Jaewon Jeoung, Hyuna Kang, and Taehoon Hong
Department of Architecture & Architectural Engineering, Yonsei University, Seoul, Korea
Correspondence
Taehoon Hong, Department of Architecture & Architectural Engineering, Yonsei University, Seoul, Republic of Korea.
Email: [email protected]
Funding Information
National Research Foundation of Korea, Grant/Award Number: NRF-2018R1A5A1025137
Abstract
This study proposes a three-dimensional convolutional neural network (3D CNN)-based one-stage model for real-time action detection in video of construction equipment (ADVICE). A 3D CNN-based single-stream feature extraction network and a detection network are designed, incorporating the 3D attention module and feature pyramid network developed in this study to improve performance. For model evaluation, 130 videos covering four types of construction equipment at various construction sites were collected from YouTube. Trained on 520 clips and tested on 260 clips, ADVICE achieved a precision of 82.1% and a recall of 83.1% at an inference speed of 36.6 frames per second. These results indicate that the proposed 3D CNN-based one-stage model can detect the actions of construction equipment in real time in videos of diverse, variable, and complex construction sites, paving the way toward improved safety, productivity, and environmental management of construction projects.
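To give a concrete picture of the architecture summarized above, the following is a minimal PyTorch sketch of a 3D CNN one-stage detector with a CBAM-style attention gate extended to the temporal axis and an FPN-style top-down fusion. All names (Attention3D, ADVICESketch), layer widths, strides, anchor counts, and the temporal pooling in the head are illustrative assumptions, not the authors' ADVICE implementation.

```python
# Minimal sketch of a 3D CNN one-stage spatiotemporal action detector.
# Illustrative only: the module names, layer widths, strides, and anchor
# count are assumptions, not the authors' ADVICE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention3D(nn.Module):
    """CBAM-style attention extended to 3D: a channel gate followed by a
    spatiotemporal gate over the (T, H, W) volume."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv3d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):  # x: (B, C, T, H, W)
        b, c = x.shape[:2]
        # Channel attention from globally average- and max-pooled features.
        gate = torch.sigmoid(self.mlp(x.mean(dim=(2, 3, 4))) +
                             self.mlp(x.amax(dim=(2, 3, 4))))
        x = x * gate.view(b, c, 1, 1, 1)
        # Spatiotemporal attention from channel-pooled feature maps.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))


class ADVICESketch(nn.Module):
    """Single-stream 3D backbone -> FPN-style top-down fusion -> dense
    one-stage detection head."""

    def __init__(self, num_classes=4, num_anchors=3):
        super().__init__()

        def block(cin, cout, stride):  # one backbone stage with 3D attention
            return nn.Sequential(
                nn.Conv3d(cin, cout, 3, stride=(1, stride, stride), padding=1),
                nn.BatchNorm3d(cout), nn.ReLU(inplace=True), Attention3D(cout))

        self.c3 = block(3, 64, 4)  # coarse strides stand in for a deep backbone
        self.c4 = block(64, 128, 2)
        self.c5 = block(128, 256, 2)
        self.lat4 = nn.Conv3d(128, 256, 1)  # FPN lateral connections
        self.lat3 = nn.Conv3d(64, 256, 1)
        # Per anchor: 4 box offsets + 1 objectness score + class scores.
        self.head = nn.Conv3d(256, num_anchors * (5 + num_classes), 1)

    def forward(self, clip):  # clip: (B, 3, T, H, W)
        f3 = self.c3(clip)
        f4 = self.c4(f3)
        f5 = self.c5(f4)
        p4 = self.lat4(f4) + F.interpolate(f5, size=f4.shape[2:])  # top-down
        p3 = self.lat3(f3) + F.interpolate(p4, size=f3.shape[2:])
        # Collapse the temporal axis so each clip yields one detection map.
        return self.head(p3.mean(dim=2, keepdim=True))


if __name__ == "__main__":
    model = ADVICESketch().eval()
    clip = torch.randn(1, 3, 16, 224, 224)  # one 16-frame RGB clip
    with torch.no_grad():
        print(model(clip).shape)  # torch.Size([1, 27, 1, 56, 56])
```

Each cell of the output map carries, for each anchor, four box offsets, an objectness score, and per-class action scores; decoding these into scored boxes and applying non-maximum suppression is the standard final step for a one-stage detector.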