3D convolutional neural network-based one-stage model for real-time action detection in video of construction equipment
Seunghoon Jung, Jaewon Jeoung, Hyuna Kang, and Taehoon Hong
Department of Architecture & Architectural Engineering, Yonsei University, Seoul, Korea
Correspondence
Taehoon Hong, Department of Architecture & Architectural Engineering, Yonsei University, Seoul, Republic of Korea.
Email: [email protected]
Funding Information
National Research Foundation of Korea, Grant/Award Number: NRF-2018R1A5A1025137
Abstract
This study proposes a three-dimensional convolutional neural network (3D CNN)-based one-stage model for real-time action detection in video of construction equipment (ADVICE). A 3D CNN-based single-stream feature extraction network and a detection network are designed, incorporating the 3D attention module and feature pyramid network developed in this study to improve performance. For model evaluation, 130 videos covering four types of construction equipment at various construction sites were collected from YouTube. Trained on 520 clips and tested on 260 clips, ADVICE achieved a precision of 82.1% and a recall of 83.1% at an inference speed of 36.6 frames per second. These results indicate that the proposed 3D CNN-based one-stage model can detect the actions of construction equipment in real time in videos of diverse, variable, and complex construction sites, paving the way toward improved safety, productivity, and environmental management of construction projects.
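To give a concrete picture of the architecture summarized above, the following is a minimal PyTorch sketch of a 3D CNN one-stage detector with a CBAM-style attention gate extended to the temporal axis and an FPN-style top-down fusion. All names (Attention3D, ADVICESketch), layer widths, strides, anchor counts, and the temporal pooling in the head are illustrative assumptions, not the authors' ADVICE implementation.

```python
# Minimal sketch of a 3D CNN one-stage spatiotemporal action detector.
# Illustrative only: the module names, layer widths, strides, and anchor
# count are assumptions, not the authors' ADVICE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention3D(nn.Module):
    """CBAM-style attention extended to 3D: a channel gate followed by a
    spatiotemporal gate over the (T, H, W) volume."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv3d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):  # x: (B, C, T, H, W)
        b, c = x.shape[:2]
        # Channel attention from globally average- and max-pooled features.
        gate = torch.sigmoid(self.mlp(x.mean(dim=(2, 3, 4))) +
                             self.mlp(x.amax(dim=(2, 3, 4))))
        x = x * gate.view(b, c, 1, 1, 1)
        # Spatiotemporal attention from channel-pooled feature maps.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))


class ADVICESketch(nn.Module):
    """Single-stream 3D backbone -> FPN-style top-down fusion -> dense
    one-stage detection head."""

    def __init__(self, num_classes=4, num_anchors=3):
        super().__init__()

        def block(cin, cout, stride):  # one backbone stage with 3D attention
            return nn.Sequential(
                nn.Conv3d(cin, cout, 3, stride=(1, stride, stride), padding=1),
                nn.BatchNorm3d(cout), nn.ReLU(inplace=True), Attention3D(cout))

        self.c3 = block(3, 64, 4)  # coarse strides stand in for a deep backbone
        self.c4 = block(64, 128, 2)
        self.c5 = block(128, 256, 2)
        self.lat4 = nn.Conv3d(128, 256, 1)  # FPN lateral connections
        self.lat3 = nn.Conv3d(64, 256, 1)
        # Per anchor: 4 box offsets + 1 objectness score + class scores.
        self.head = nn.Conv3d(256, num_anchors * (5 + num_classes), 1)

    def forward(self, clip):  # clip: (B, 3, T, H, W)
        f3 = self.c3(clip)
        f4 = self.c4(f3)
        f5 = self.c5(f4)
        p4 = self.lat4(f4) + F.interpolate(f5, size=f4.shape[2:])  # top-down
        p3 = self.lat3(f3) + F.interpolate(p4, size=f3.shape[2:])
        # Collapse the temporal axis so each clip yields one detection map.
        return self.head(p3.mean(dim=2, keepdim=True))


if __name__ == "__main__":
    model = ADVICESketch().eval()
    clip = torch.randn(1, 3, 16, 224, 224)  # one 16-frame RGB clip
    with torch.no_grad():
        print(model(clip).shape)  # torch.Size([1, 27, 1, 56, 56])
```

Each cell of the output map carries, for each anchor, four box offsets, an objectness score, and per-class action scores; decoding these into scored boxes and applying non-maximum suppression is the standard final step for a one-stage detector.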