Volume 42, Issue 9 e70103
REVIEW

A Comprehensive Review of Unimodal and Multimodal Emotion Detection: Datasets, Approaches, and Limitations

Priyanka Thakur

Department of CSE, University Institute of Engineering and Technology, Panjab University, Chandigarh, India
Nirmal Kaur (Corresponding Author)

Department of CSE, University Institute of Engineering and Technology, Panjab University, Chandigarh, India

Correspondence: Nirmal Kaur ([email protected])
Naveen Aggarwal

Department of CSE, University Institute of Engineering and Technology, Panjab University, Chandigarh, India
Sarbjeet Singh

Department of CSE, University Institute of Engineering and Technology, Panjab University, Chandigarh, India
First published: 21 July 2025

Funding: This work was supported by the IIT Mandi iHub & HCI Foundation under grant number IIT MANDI iHub/RD/2023-2025/04, as part of the project 'Development of Multimodal and Multilingual Human Emotion Detection System'.

ABSTRACT

Emotion detection from face and speech is integral to human–computer interaction, mental health assessment, social robotics, and emotional intelligence. Traditional machine learning methods typically depend on handcrafted features and are primarily centred on unimodal systems. However, the unique characteristics of facial expressions and the variability of speech features make complex emotional states difficult to capture. Accordingly, deep learning models have been instrumental in automatically extracting intrinsic emotional features with greater accuracy across multiple modalities. This article presents a comprehensive review of recent progress in emotion detection, spanning unimodal to multimodal systems, with a focus on the facial and speech modalities. It examines state-of-the-art machine learning, deep learning, and the latest transformer-based approaches for emotion detection. The review provides an in-depth analysis of both unimodal and multimodal emotion detection techniques, highlighting their limitations, popular datasets, challenges, and the best-performing models. Such analysis aids researchers in the judicious selection of the most appropriate datasets and audio-visual emotion detection models. Key findings suggest that integrating multimodal data significantly improves emotion recognition, particularly when deep learning methods are trained on synchronised audio and video datasets. By assessing recent advancements and current challenges, this article serves as a fundamental resource for researchers and practitioners in the field of emotional AI, thereby aiding the creation of more intuitive and empathetic technologies.
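To make the multimodal integration idea concrete, the sketch below (in PyTorch) illustrates one common fusion pattern: encode each modality separately, then concatenate the embeddings before a shared classification head (late fusion). This is a minimal illustrative assumption, not a model from the reviewed literature; all dimensions, layer choices, and the mean-pooling step are hypothetical.

```python
import torch
import torch.nn as nn

class LateFusionEmotionClassifier(nn.Module):
    """Toy late-fusion model: separate audio and video encoders whose
    embeddings are concatenated before a shared classification head."""

    def __init__(self, audio_dim=40, video_dim=512, hidden_dim=128, num_emotions=7):
        super().__init__()
        # Audio branch: e.g., frame-level acoustic features pooled over time.
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        # Video branch: e.g., per-frame face embeddings pooled over frames.
        self.video_encoder = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.ReLU())
        # Fusion head over the concatenated modality embeddings.
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, time, audio_dim); video_feats: (batch, frames, video_dim).
        # Mean-pool the synchronised sequences to fixed-length vectors.
        a = self.audio_encoder(audio_feats.mean(dim=1))
        v = self.video_encoder(video_feats.mean(dim=1))
        return self.classifier(torch.cat([a, v], dim=-1))

# Example: a batch of 2 clips with 100 audio frames and 16 video frames.
model = LateFusionEmotionClassifier()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 7])
```

Early fusion (concatenating raw features before encoding) and attention-based cross-modal fusion are common alternatives; the late-fusion form is shown here only because it is the simplest to sketch.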

Data Availability Statement

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
