Environmental sound classification using convolutional neural networks with different integrated loss functions
Joy Krishan Das
Department of Computer Science and Engineering, Brac University, Dhaka, Bangladesh
Amitabha Chakrabarty
Department of Computer Science and Engineering, Brac University, Dhaka, Bangladesh
Md. Jalil Piran (Corresponding Author)
Department of Computer Science and Engineering, Sejong University, Seoul, South Korea
Correspondence
Md. Jalil Piran, Department of Computer Science and Engineering, Sejong University, Seoul 05006, South Korea.
Email: [email protected]
Abstract
The rising demand for smart cities has drawn researchers' interest to environmental sound classification. In the field of audio classification, most researchers aim to approach the Bayes optimal error. It is difficult, however, to extract meaning directly from a raw one-dimensional audio signal, and this is where spectrograms become effective. Using benchmark spectral features such as mel-frequency cepstral coefficients (MFCCs), the chromagram, and the log-mel spectrogram (LM), audio can be converted into meaningful two-dimensional representations. In this paper, we propose a convolutional neural network (CNN) model integrated with additive angular margin loss (AAML), large margin cosine loss (LMCL), and A-softmax loss. Although these loss functions were proposed for face recognition, they hold their value in other fields of study when implemented systematically. They are more effective than the conventional softmax loss for classification tasks because of their capability to increase intra-class compactness and inter-class discrepancy. With the resulting MCAAM-Net, MCAS-Net, and MCLCM-Net models, classification accuracies of 99.60%, 99.43%, and 99.37%, respectively, are achieved on the UrbanSound8K dataset without any augmentation. This paper also demonstrates the benefit of stacking features together: the above validation accuracies are obtained after stacking the MFCCs and the chromagram along the x-axis. For further confirmation of our results, we visualize the clusters formed by the embedded vectors of the test data after passing them through the proposed models. Finally, we show that MCAAM-Net, at 99.60% accuracy on UrbanSound8K, outperforms benchmark models introduced in recent years, such as TSCNN-DS, ADCNN-5, and ESResNet-Attention.
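To make the described pipeline concrete, the following minimal sketch (not the authors' released code) shows how MFCCs and a chromagram can be extracted with librosa, the audio library cited in the references, and stacked into a single 2D input for a CNN. The sampling rate, the value of n_mfcc, and the choice of row-wise concatenation are illustrative assumptions, not the paper's exact settings.

```python
import librosa
import numpy as np

def stacked_features(path, sr=22050, n_mfcc=40):
    """Extract MFCCs and a chromagram for one clip and stack them.

    sr and n_mfcc are placeholder values, not the paper's settings.
    """
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)        # (12, frames)
    # Both features share librosa's default frame grid, so they can be
    # stacked into one 2D feature map; the paper's "stacking on the x-axis"
    # is realised here as row-wise concatenation.
    return np.vstack([mfcc, chroma])                        # (n_mfcc + 12, frames)
```

Likewise, a hedged sketch of the additive angular margin loss (AAML, as formulated in the cited ArcFace paper) that MCAAM-Net integrates: logits are cosines between L2-normalized embeddings and class weight vectors, a margin m is added to the target-class angle, and the result is scaled by s before the usual cross-entropy. PyTorch is an assumed framework here, and s and m are common defaults rather than the values tuned in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMLoss(nn.Module):
    """Additive angular margin (ArcFace-style) classification head."""

    def __init__(self, in_features, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, embeddings, labels):
        # Cosine similarity between normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m to the target-class angle only; this is
        # what increases intra-class compactness and inter-class discrepancy.
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).float()
        logits = torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(self.s * logits, labels)
```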
CONFLICT OF INTEREST
None.
DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
REFERENCES
- Abdoli, S., Cardinal, P., & Koerich, A. L. (2019). End-to-end environmental sound classification using a 1D convolutional neural network. Expert Systems with Applications, 136, 252–263.
- Agrawal, D. M., Sailor, H. B., Soni, M. H., & Patil, H. A. (2017). Novel TEO-based gammatone features for environmental sound classification. In Proc. 2017 25th European Signal Processing Conference (EUSIPCO) (pp. 1809–1813). IEEE.
- Ajmera, P. K., Jadhav, D. V., & Holambe, R. S. (2011). Text-independent speaker identification using radon and discrete cosine transforms based features from speech spectrogram. Pattern Recognition, 44(10–11), 2749–2759.
- Ali, F., El-Sappagh, S., Islam, S. R., Ali, A., Attique, M., Imran, M., & Kwak, K.-S. (2021). An intelligent healthcare monitoring framework using wearable sensors and social networking data. Future Generation Computer Systems, 114, 23–43.
- Ali, F., El-Sappagh, S., Islam, S. R., Kwak, D., Ali, A., Imran, M., & Kwak, K.-S. (2020). A smart healthcare monitoring system for heart disease prediction based on ensemble deep learning and feature fusion. Information Fusion, 63, 208–222.
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proc. 3rd International Conference on Learning Representations, ICLR 2015.
- Baum, E., Harper, M., Alicea, R., & Ordonez, C. (2018). Sound identification for fire-fighting mobile robots. In Proc. 2018 Second IEEE International Conference on Robotic Computing (IRC) (pp. 79–86). IEEE.
- Blumensath, T., & Davies, M. E. (2009). Sampling theorems for signals from the union of finite-dimensional linear subspaces. IEEE Transactions on Information Theory, 55(4), 1872–1882.
- Boddapati, V., Petef, A., Rasmusson, J., & Lundberg, L. (2017). Classifying environmental sounds using image recognition networks. Procedia Computer Science, 112, 2048–2056. https://doi.org/10.1016/j.procs.2017.08.250
- Butt, U. A., Mehmood, M., Shah, S. B. H., Amin, R., Shaukat, M. W., Raza, S. M., Suh, D. Y., Piran, M., et al. (2020). A review of machine learning algorithms for cloud computing security. Electronics, 9(9), 1379.
- Dang, L. M., Min, K., Wang, H., Piran, M. J., Lee, C. H., & Moon, H. (2020). Sensor-based and vision-based human activity recognition: A comprehensive survey. Pattern Recognition, 108, 107561.
- Das, J. K., Ghosh, A., Pal, A. K., Dutta, S., & Chakrabarty, A. (2020). Urban sound classification using convolutional neural network and long short term memory based on multiple features. In Proc. 2020 Fourth International Conference on Intelligent Computing in Data Sciences (ICDS) (pp. 1–9). IEEE.
- Davis, S., & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357–366.
- Demir, F., Turkoglu, M., Aslan, M., & Sengur, A. (2020). A new pyramidal concatenated CNN approach for environmental sound classification. Applied Acoustics, 170, 107520.
- Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). ArcFace: Additive angular margin loss for deep face recognition. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4690–4699).
- Gadekallu, T. R., Alazab, M., Kaluri, R., Maddikunta, P. K. R., Bhattacharya, S., Lakshmanna, K., & Parimala, M. (2021). Hand gesture classification using a novel CNN-crow search algorithm. Complex & Intelligent Systems, 7, 1855–1868. https://doi.org/10.1007/s40747-021-00324-x
- Guzhov, A., Raue, F., Hees, J., & Dengel, A. (2020). ESResNet: Environmental sound classification based on visual domain models. CoRR, abs/2004.07301.
- Hanin, B. (2018). Which neural net architectures give rise to exploding and vanishing gradients? In Proc. of the 32nd International Conference on Neural Information Processing Systems, NIPS'18 (pp. 580–589). Red Hook, NY: Curran Associates Inc.
- Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. International Conference on Machine Learning (pp. 448–456). PMLR.
- Jerri, A. J. (1977). The Shannon sampling theorem its various extensions and applications: A tutorial review. Proceedings of the IEEE, 65(11), 1565–1596.
- Khan, L. U., Yaqoob, I., Imran, M., Han, Z., & Hong, C. S. (2020). 6G wireless systems: A vision, architectural elements, and future directions. IEEE Access, 8, 147029–147044.
- Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proc. 3rd International Conference on Learning Representations, ICLR 2015.
- Li, J., Dai, W., Metze, F., Qu, S., & Das, S. (2017). A comparison of deep learning methods for environmental sound detection. In Proc. 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 126–130). IEEE.
- Li, S., Yao, Y., Hu, J., Liu, G., Yao, X., & Hu, J. (2018). An ensemble stacked convolutional neural network model for environmental event sound recognition. Applied Sciences, 8(7), 1152. https://doi.org/10.3390/app8071152
- Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22.
- Liu, J.-M., You, M., Li, G.-Z., Wang, Z., Xu, X., Qiu, Z., Xie, W., An, C., & Chen, S. (2013). Cough signal recognition with gammatone cepstral coefficients. In Proc. 2013 IEEE China Summit and International Conference on Signal and Information Processing (pp. 160–164). IEEE.
- Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., & Song, L. (2017). SphereFace: Deep hypersphere embedding for face recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 212–220).
- Lu, J., Ma, R., Liu, G., & Qin, Z. (2021). Deep convolutional neural network with transfer learning for environmental sound classification. In 2021 International Conference on Computer, Control and Robotics (ICCCR) (pp. 242–245).
- Lu, L., Zhang, X., Cho, K., & Renals, S. (2015). A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.
- McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., & Nieto, O. (2015). librosa: Audio and music signal analysis in Python. In Proc. of the 14th Python in Science Conference, volume 8 (pp. 18–25). Citeseer.
- Muda, L., Begam, M., & Elamvazuthi, I. (2010). Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. CoRR, abs/1003.4083.
- Müller, M., Kurth, F., & Clausen, M. (2005). Chroma-based statistical audio features for audio matching. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (pp. 275–278). IEEE.
- Mushtaq, Z., Su, S.-F., & Tran, Q.-V. (2021). Spectral images based environmental sound classification using CNN with meaningful data augmentation. Applied Acoustics, 172, 107581.
- Mydlarz, C., Salamon, J., & Bello, J. P. (2017). The implementation of low-cost urban acoustic monitoring devices. Applied Acoustics, 117, 207–218.
- Piczak, K. J. (2015a). Environmental sound classification with convolutional neural networks. In Proc. 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP) (pp. 1–6). IEEE.
- Piczak, K. J. (2015b). ESC: Dataset for environmental sound classification. In Proc. of the 23rd ACM International Conference on Multimedia, MM'15 (pp. 1015–1018). New York, NY: Association for Computing Machinery.
- Piran, M. J., Tran, N. H., Suh, D. Y., Song, J. B., Hong, C. S., & Han, Z. (2016). QoE-driven channel allocation and handoff management for seamless multimedia in cognitive 5G cellular networks. IEEE Transactions on Vehicular Technology, 66(7), 6569–6585.
- Rafique, W., Qi, L., Yaqoob, I., Imran, M., Rasool, R. U., & Dou, W. (2020). Complementing IoT services through software defined networking and edge computing: A comprehensive survey. IEEE Communications Surveys & Tutorials, 22(3), 1761–1804.
- Rehman, A., Rehman, S. U., Khan, M., Alazab, M., & Reddy, T. (2021). CANintelliIDS: Detecting in-vehicle intrusion attacks on a controller area network using CNN and attention-based GRU. IEEE Transactions on Network Science and Engineering.
- Ridley, M., & MacQueen, D. (2004). Sampling plan optimization: A data review and sampling frequency evaluation process. Bioremediation Journal, 8(3–4), 167–175.
- Salamon, J., Jacoby, C., & Bello, J. P. (2014). A dataset and taxonomy for urban sound research. In Proc. of the 22nd ACM International Conference on Multimedia (pp. 1041–1044).
- Shafer, G. (1976). A mathematical theory of evidence (Vol. 42). Princeton University Press. https://doi.org/10.1515/9780691214696
- Sharma, J., Granmo, O.-C., & Goodwin, M. (2020). Environment sound classification using multiple feature channels and attention based deep convolutional neural network. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020 (pp. 1186–1190). ISCA.
- Sharma, J., Granmo, O.-C., & Goodwin, M. (2021). Emergency detection with environment sound using deep convolutional neural networks. In Proc. of Fifth International Congress on Information and Communication Technology (pp. 144–154). Springer.
- Su, Y., Zhang, K., Wang, J., & Madani, K. (2019). Environment sound classification using a two-stream CNN based on decision-level fusion. Sensors, 19(7), 1733.
- Sun, Y., Chen, Y., Wang, X., & Tang, X. (2014). Deep learning face representation by joint identification-verification. In Proc. of the 27th International Conference on Neural Information Processing Systems, NIPS'14 (Vol. 2, pp. 1988–1996). Cambridge, MA: MIT Press.
- Tokozume, Y., & Harada, T. (2017). Learning environmental sounds with end-to-end convolutional neural network. In Proc. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2721–2725). IEEE.
- Ur Rehman, M. H., Yaqoob, I., Salah, K., Imran, M., Jayaraman, P. P., & Perera, C. (2019). The role of big data analytics in industrial internet of things. Future Generation Computer Systems, 99, 247–259.
- Vasan, D., Alazab, M., Wassan, S., Safaei, B., & Zheng, Q. (2020). Image-based malware classification using ensemble of CNN architectures (IMCEC). Computers & Security, 92, 101748.
- Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., & Liu, W. (2018). CosFace: Large margin cosine loss for deep face recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5265–5274).
- Wang, H., Zou, Y., Chong, D., & Wang, W. (2020). Environmental sound classification with parallel temporal-spectral attention. In Proc. Interspeech 2020 (pp. 821–825).
- Wen, Y., Zhang, K., Li, Z., & Qiao, Y. (2016). A discriminative feature learning approach for deep face recognition. In Proc. European Conference on Computer Vision (pp. 499–515). Springer.
- Yu, D., & Deng, L. (2015). Deep neural network-hidden Markov model hybrid systems. In Automatic speech recognition: A deep learning approach (pp. 99–116). Springer.
- Zhu, B., Wang, C., Liu, F., Lei, J., Huang, Z., Peng, Y., & Li, F. (2018). Learning environmental sounds with multi-scale convolutional neural network. In Proc. 2018 International Joint Conference on Neural Networks (IJCNN) (pp. 1–8). IEEE.