Environmental sound classification using convolutional neural networks with different integrated loss functions
Joy Krishan Das
Department of Computer Science and Engineering, Brac University, Dhaka, Bangladesh
Amitabha Chakrabarty
Department of Computer Science and Engineering, Brac University, Dhaka, Bangladesh
Md. Jalil Piran (Corresponding Author)
Department of Computer Science and Engineering, Sejong University, Seoul, South Korea
Correspondence
Md. Jalil Piran, Department of Computer Science and Engineering, Sejong University, Seoul 05006, South Korea.
Email: [email protected]
Abstract
The rising demand for smart cities has drawn researchers' interest to environmental sound classification. In the field of audio classification, most researchers aim to approach the Bayes optimal error. It is difficult, however, to extract meaning directly from a raw one-dimensional audio signal, and this is where spectrograms become effective. Using benchmark spectral features such as mel-frequency cepstral coefficients (MFCCs), the chromagram, and the log-mel spectrogram (LM), audio can be converted into meaningful two-dimensional representations. In this paper, we propose a convolutional neural network (CNN) model integrated with additive angular margin loss (AAML), large margin cosine loss (LMCL), and A-softmax loss. Although these loss functions were proposed for face recognition, they hold their value in other fields of study when implemented systematically. They are more effective than the conventional softmax loss for classification tasks because of their capability to increase intra-class compactness and inter-class discrepancy. With the resulting MCAAM-Net, MCAS-Net, and MCLCM-Net models, classification accuracies of 99.60%, 99.43%, and 99.37%, respectively, are achieved on the UrbanSound8K dataset without any augmentation. This paper also demonstrates the benefit of stacking features together: the above validation accuracies are obtained after stacking the MFCCs and the chromagram along the x-axis. For further confirmation of our results, we visualize the clusters formed by the embedded vectors of the test data after passing them through the proposed models. Finally, we show that MCAAM-Net, at 99.60% accuracy on UrbanSound8K, outperforms benchmark models introduced in recent years, such as TSCNN-DS, ADCNN-5, and ESResNet-Attention.
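To make the described pipeline concrete, the following minimal sketch (not the authors' released code) shows how MFCCs and a chromagram can be extracted with librosa, the audio library cited in the references, and stacked into a single 2D input for a CNN. The sampling rate, the value of n_mfcc, and the choice of row-wise concatenation are illustrative assumptions, not the paper's exact settings.

```python
import librosa
import numpy as np

def stacked_features(path, sr=22050, n_mfcc=40):
    """Extract MFCCs and a chromagram for one clip and stack them.

    sr and n_mfcc are placeholder values, not the paper's settings.
    """
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)        # (12, frames)
    # Both features share librosa's default frame grid, so they can be
    # stacked into one 2D feature map; the paper's "stacking on the x-axis"
    # is realised here as row-wise concatenation.
    return np.vstack([mfcc, chroma])                        # (n_mfcc + 12, frames)
```

Likewise, a hedged sketch of the additive angular margin loss (AAML, as formulated in the cited ArcFace paper) that MCAAM-Net integrates: logits are cosines between L2-normalized embeddings and class weight vectors, a margin m is added to the target-class angle, and the result is scaled by s before the usual cross-entropy. PyTorch is an assumed framework here, and s and m are common defaults rather than the values tuned in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMLoss(nn.Module):
    """Additive angular margin (ArcFace-style) classification head."""

    def __init__(self, in_features, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, embeddings, labels):
        # Cosine similarity between normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m to the target-class angle only; this is
        # what increases intra-class compactness and inter-class discrepancy.
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).float()
        logits = torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(self.s * logits, labels)
```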
CONFLICT OF INTEREST
None.
DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
REFERENCES
- Abdoli, S., Cardinal, P., & Koerich, A. L. (2019). End-to-end environmental sound classification using a 1D convolutional neural network. Expert Systems with Applications, 136, 252–263.
- Agrawal, D. M., Sailor, H. B., Soni, M. H., & Patil, H. A. (2017). Novel TEO-based gammatone features for environmental sound classification. In Proc. 2017 25th European Signal Processing Conference (EUSIPCO) (pp. 1809–1813). IEEE.
- Ajmera, P. K., Jadhav, D. V., & Holambe, R. S. (2011). Text-independent speaker identification using radon and discrete cosine transforms based features from speech spectrogram. Pattern Recognition, 44(10–11), 2749–2759.
- Ali, F., El-Sappagh, S., Islam, S. R., Ali, A., Attique, M., Imran, M., & Kwak, K.-S. (2021). An intelligent healthcare monitoring framework using wearable sensors and social networking data. Future Generation Computer Systems, 114, 23–43.
- Ali, F., El-Sappagh, S., Islam, S. R., Kwak, D., Ali, A., Imran, M., & Kwak, K.-S. (2020). A smart healthcare monitoring system for heart disease prediction based on ensemble deep learning and feature fusion. Information Fusion, 63, 208–222.
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proc. 3rd International Conference on Learning Representations, ICLR 2015.
- Baum, E., Harper, M., Alicea, R., & Ordonez, C. (2018). Sound identification for fire-fighting mobile robots. In Proc. 2018 Second IEEE International Conference on Robotic Computing (IRC) (pp. 79–86). IEEE.
- Blumensath, T., & Davies, M. E. (2009). Sampling theorems for signals from the union of finite-dimensional linear subspaces. IEEE Transactions on Information Theory, 55(4), 1872–1882.
- Boddapati, V., Petef, A., Rasmusson, J., & Lundberg, L. (2017). Classifying environmental sounds using image recognition networks. Procedia Computer Science, 112, 2048–2056. https://doi.org/10.1016/j.procs.2017.08.250
- Butt, U. A., Mehmood, M., Shah, S. B. H., Amin, R., Shaukat, M. W., Raza, S. M., Suh, D. Y., Piran, M., et al. (2020). A review of machine learning algorithms for cloud computing security. Electronics, 9(9), 1379.
- Dang, L. M., Min, K., Wang, H., Piran, M. J., Lee, C. H., & Moon, H. (2020). Sensor-based and vision-based human activity recognition: A comprehensive survey. Pattern Recognition, 108, 107561.
- Das, J. K., Ghosh, A., Pal, A. K., Dutta, S., & Chakrabarty, A. (2020). Urban sound classification using convolutional neural network and long short term memory based on multiple features. In Proc. 2020 Fourth International Conference on Intelligent Computing in Data Sciences (ICDS) (pp. 1–9). IEEE.
- Davis, S., & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357–366.
- Demir, F., Turkoglu, M., Aslan, M., & Sengur, A. (2020). A new pyramidal concatenated CNN approach for environmental sound classification. Applied Acoustics, 170, 107520.
- Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). ArcFace: Additive angular margin loss for deep face recognition. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4690–4699).
- Gadekallu, T. R., Alazab, M., Kaluri, R., Maddikunta, P. K. R., Bhattacharya, S., Lakshmanna, K., & Parimala, M. (2021). Hand gesture classification using a novel CNN-crow search algorithm. Complex & Intelligent Systems, 7, 1855–1868. https://doi.org/10.1007/s40747-021-00324-x
- Guzhov, A., Raue, F., Hees, J., & Dengel, A. (2020). ESResNet: Environmental sound classification based on visual domain models. CoRR, abs/2004.07301.
- Hanin, B. (2018). Which neural net architectures give rise to exploding and vanishing gradients? In Proc. of the 32nd International Conference on Neural Information Processing Systems, NIPS'18 (pp. 580–589). Red Hook, NY: Curran Associates Inc.
- Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. International Conference on Machine Learning (pp. 448–456). PMLR.
- Jerri, A. J. (1977). The Shannon sampling theorem its various extensions and applications: A tutorial review. Proceedings of the IEEE, 65(11), 1565–1596.
- Khan, L. U., Yaqoob, I., Imran, M., Han, Z., & Hong, C. S. (2020). 6G wireless systems: A vision, architectural elements, and future directions. IEEE Access, 8, 147029–147044.
- Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proc. 3rd International Conference on Learning Representations, ICLR 2015.
- Li, J., Dai, W., Metze, F., Qu, S., & Das, S. (2017). A comparison of deep learning methods for environmental sound detection. In Proc. 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 126–130). IEEE.
- Li, S., Yao, Y., Hu, J., Liu, G., Yao, X., & Hu, J. (2018). An ensemble stacked convolutional neural network model for environmental event sound recognition. Applied Sciences, 8(7), 1152. https://doi.org/10.3390/app8071152
- Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22.
- Liu, J.-M., You, M., Li, G.-Z., Wang, Z., Xu, X., Qiu, Z., Xie, W., An, C., & Chen, S. (2013). Cough signal recognition with gammatone cepstral coefficients. In Proc. 2013 IEEE China Summit and International Conference on Signal and Information Processing (pp. 160–164). IEEE.
- Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., & Song, L. (2017). SphereFace: Deep hypersphere embedding for face recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 212–220).
- Lu, J., Ma, R., Liu, G., & Qin, Z. (2021). Deep convolutional neural network with transfer learning for environmental sound classification. In 2021 International Conference on Computer, Control and Robotics (ICCCR) (pp. 242–245).
- Lu, L., Zhang, X., Cho, K., & Renals, S. (2015). A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.
- McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., & Nieto, O. (2015). librosa: Audio and music signal analysis in Python. In Proc. of the 14th Python in Science Conference, volume 8 (pp. 18–25). Citeseer.
- Muda, L., Begam, M., & Elamvazuthi, I. (2010). Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. CoRR, abs/1003.4083.
- Müller, M., Kurth, F., & Clausen, M. (2005). Chroma-based statistical audio features for audio matching. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (pp. 275–278). IEEE.
- Mushtaq, Z., Su, S.-F., & Tran, Q.-V. (2021). Spectral images based environmental sound classification using CNN with meaningful data augmentation. Applied Acoustics, 172, 107581.
- Mydlarz, C., Salamon, J., & Bello, J. P. (2017). The implementation of low-cost urban acoustic monitoring devices. Applied Acoustics, 117, 207–218.
- Piczak, K. J. (2015a). Environmental sound classification with convolutional neural networks. In Proc. 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP) (pp. 1–6). IEEE.
- Piczak, K. J. (2015b). ESC: Dataset for environmental sound classification. In Proc. of the 23rd ACM International Conference on Multimedia, MM'15 (pp. 1015–1018). New York, NY: Association for Computing Machinery.
- Piran, M. J., Tran, N. H., Suh, D. Y., Song, J. B., Hong, C. S., & Han, Z. (2016). QoE-driven channel allocation and handoff management for seamless multimedia in cognitive 5G cellular networks. IEEE Transactions on Vehicular Technology, 66(7), 6569–6585.
- Rafique, W., Qi, L., Yaqoob, I., Imran, M., Rasool, R. U., & Dou, W. (2020). Complementing IoT services through software defined networking and edge computing: A comprehensive survey. IEEE Communications Surveys & Tutorials, 22(3), 1761–1804.
- Rehman, A., Rehman, S. U., Khan, M., Alazab, M., & Reddy, T. (2021). CANintelliIDS: Detecting in-vehicle intrusion attacks on a controller area network using CNN and attention-based GRU. IEEE Transactions on Network Science and Engineering.
- Ridley, M., & MacQueen, D. (2004). Sampling plan optimization: A data review and sampling frequency evaluation process. Bioremediation Journal, 8(3–4), 167–175.
- Salamon, J., Jacoby, C., & Bello, J. P. (2014). A dataset and taxonomy for urban sound research. In Proc. of the 22nd ACM International Conference on Multimedia (pp. 1041–1044).
- Shafer, G. (1976). A mathematical theory of evidence (Vol. 42). Princeton University Press. https://doi.org/10.1515/9780691214696
- Sharma, J., Granmo, O.-C., & Goodwin, M. (2020). Environment sound classification using multiple feature channels and attention based deep convolutional neural network. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020 (pp. 1186–1190). ISCA.
- Sharma, J., Granmo, O.-C., & Goodwin, M. (2021). Emergency detection with environment sound using deep convolutional neural networks. In Proc. of Fifth International Congress on Information and Communication Technology (pp. 144–154). Springer.
- Su, Y., Zhang, K., Wang, J., & Madani, K. (2019). Environment sound classification using a two-stream CNN based on decision-level fusion. Sensors, 19(7), 1733.
- Sun, Y., Chen, Y., Wang, X., & Tang, X. (2014). Deep learning face representation by joint identification-verification. In Proc. of the 27th International Conference on Neural Information Processing Systems, NIPS'14 (Vol. 2, pp. 1988–1996). Cambridge, MA: MIT Press.
- Tokozume, Y., & Harada, T. (2017). Learning environmental sounds with end-to-end convolutional neural network. In Proc. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2721–2725). IEEE.
- Ur Rehman, M. H., Yaqoob, I., Salah, K., Imran, M., Jayaraman, P. P., & Perera, C. (2019). The role of big data analytics in industrial internet of things. Future Generation Computer Systems, 99, 247–259.
- Vasan, D., Alazab, M., Wassan, S., Safaei, B., & Zheng, Q. (2020). Image-based malware classification using ensemble of CNN architectures (IMCEC). Computers & Security, 92, 101748.
- Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., & Liu, W. (2018). CosFace: Large margin cosine loss for deep face recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5265–5274).
- Wang, H., Zou, Y., Chong, D., & Wang, W. (2020). Environmental sound classification with parallel temporal-spectral attention. In Proc. Interspeech 2020 (pp. 821–825).
- Wen, Y., Zhang, K., Li, Z., & Qiao, Y. (2016). A discriminative feature learning approach for deep face recognition. In Proc. European Conference on Computer Vision (pp. 499–515). Springer.
- Yu, D., & Deng, L. (2015). Deep neural network-hidden Markov model hybrid systems. In Automatic speech recognition: A deep learning approach (pp. 99–116). Springer.
- Zhu, B., Wang, C., Liu, F., Lei, J., Huang, Z., Peng, Y., & Li, F. (2018). Learning environmental sounds with multi-scale convolutional neural network. In Proc. 2018 International Joint Conference on Neural Networks (IJCNN) (pp. 1–8). IEEE.