Volume 39, Issue 5 e12804
ORIGINAL ARTICLE

Environmental sound classification using convolution neural networks with different integrated loss functions

Joy Krishan Das

Department of Computer Science and Engineering, Brac University, Dhaka, Bangladesh

Amitabha Chakrabarty

Department of Computer Science and Engineering, Brac University, Dhaka, Bangladesh

Md. Jalil Piran

Corresponding Author

Department of Computer Science and Engineering, Sejong University, Seoul, South Korea

Correspondence

Md. Jalil Piran, Department of Computer Science and Engineering, Sejong University, Seoul 05006, South Korea.

Email: [email protected]

First published: 15 September 2021

Abstract

The growing demand for smart cities has drawn researchers' interest to environmental sound classification. Most researchers aim to approach the Bayes optimal error in audio classification. However, it is difficult to extract meaning directly from a raw one-dimensional audio signal, and this is where spectrogram representations become effective. Using benchmark spectral features such as mel-frequency cepstral coefficients (MFCCs), chromagram, and the log-mel spectrogram (LM), audio can be converted into meaningful 2D representations. In this paper, we propose a convolutional neural network (CNN) model integrated with additive angular margin loss (AAML), large margin cosine loss (LMCL), and A-softmax loss. Although these loss functions were originally proposed for face recognition, they retain their value in other fields of study when applied systematically. They are more effective than the conventional softmax loss for classification tasks because of their capability to increase intra-class compactness and inter-class discrepancy. With the resulting MCAAM-Net, MCAS-Net, and MCLCM-Net models, classification accuracies of 99.60%, 99.43%, and 99.37%, respectively, are achieved on the UrbanSound8K dataset without any augmentation. This paper also demonstrates the benefit of stacking features together: the above validation accuracies are achieved after stacking MFCCs and the chromagram along the x-axis. We also visualize the clusters formed by the embedding vectors of the test data after passing them through the proposed models, to further corroborate our results. Finally, we show that the MCAAM-Net model achieves an accuracy of 99.60% on the UrbanSound8K dataset, outperforming recently introduced benchmark models such as TSCNN-DS, ADCNN-5, and ESResNet-Attention.
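The feature-stacking idea mentioned in the abstract (MFCCs and chromagram concatenated into one 2D input) can be sketched as follows. This is a minimal illustration assuming librosa is used for feature extraction; the sampling rate, number of coefficients, and the axis chosen for stacking are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: extract MFCCs and a chromagram and stack them into one
# image-like 2D feature map for a CNN (assumes librosa; parameters illustrative).
import numpy as np
import librosa

def stacked_features(path, sr=22050, n_mfcc=40):
    # Load the raw one-dimensional audio signal.
    y, sr = librosa.load(path, sr=sr)
    # Both features share the same time-frame axis, so they can be stacked.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)         # shape (12, frames)
    # Concatenate along the feature axis to form a single 2D input.
    return np.concatenate([mfcc, chroma], axis=0)            # shape (n_mfcc + 12, frames)
```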
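To make the margin-based objectives concrete, the following is a minimal PyTorch sketch of an additive angular margin (ArcFace-style) classification head of the kind the abstract refers to as AAML. The scale s, margin m, and embedding size are illustrative assumptions, not the values used in the paper, and the class name is hypothetical.

```python
# Sketch of an additive angular margin head: the margin m is added to the
# angle of the target class before softmax cross-entropy (assumes PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAngularMarginHead(nn.Module):
    def __init__(self, embedding_dim, num_classes, s=30.0, m=0.50):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class angle, then rescale.
        target = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * target)
        # Cross-entropy over the margin-adjusted logits yields the AAML objective.
        return F.cross_entropy(logits, labels)
```

The large margin cosine loss and A-softmax loss differ only in where the margin enters (subtracted from the cosine, or multiplied into the angle, respectively), which is why all three can be integrated into the same CNN backbone.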

CONFLICT OF INTEREST

None.

DATA AVAILABILITY STATEMENT

Data sharing is not applicable to this article as no new data were created or analyzed in this study.
