Volume 42, Issue 9 e70103
REVIEW

A Comprehensive Review of Unimodal and Multimodal Emotion Detection: Datasets, Approaches, and Limitations

Priyanka Thakur

Department of CSE, University Institute of Engineering and Technology, Panjab University, Chandigarh, India
Nirmal Kaur (Corresponding Author)

Department of CSE, University Institute of Engineering and Technology, Panjab University, Chandigarh, India

Correspondence: Nirmal Kaur ([email protected])
Naveen Aggarwal

Department of CSE, University Institute of Engineering and Technology, Panjab University, Chandigarh, India
Sarbjeet Singh

Department of CSE, University Institute of Engineering and Technology, Panjab University, Chandigarh, India
First published: 21 July 2025

Funding: This work was supported by the IIT Mandi iHub & HCI Foundation under grant number IIT MANDI iHub/RD/2023-2025/04, as part of the project 'Development of Multimodal and Multilingual Human Emotion Detection System'.

ABSTRACT

Emotion detection from face and speech is integral to human–computer interaction, mental health assessment, social robotics, and emotional intelligence. Traditional machine learning methods typically depend on handcrafted features and are primarily centred on unimodal systems. However, the unique characteristics of facial expressions and the variability of speech features make complex emotional states difficult to capture. Accordingly, deep learning models have been instrumental in automatically extracting intrinsic emotional features with greater accuracy across multiple modalities. This article presents a comprehensive review of recent progress in emotion detection, spanning unimodal to multimodal systems, with a focus on the facial and speech modalities. It examines state-of-the-art machine learning, deep learning, and the latest transformer-based approaches for emotion detection. The review provides an in-depth analysis of both unimodal and multimodal emotion detection techniques, highlighting their limitations, popular datasets, challenges, and the best-performing models. Such analysis aids researchers in the judicious selection of the most appropriate datasets and audio-visual emotion detection models. Key findings suggest that integrating multimodal data significantly improves emotion recognition, particularly when deep learning methods are trained on synchronised audio and video datasets. By assessing recent advancements and current challenges, this article serves as a fundamental resource for researchers and practitioners in the field of emotional AI, thereby aiding the creation of more intuitive and empathetic technologies.
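To make the multimodal integration idea concrete, the sketch below (in PyTorch) illustrates one common fusion pattern: encode each modality separately, then concatenate the embeddings before a shared classification head (late fusion). This is a minimal illustrative assumption, not a model from the reviewed literature; all dimensions, layer choices, and the mean-pooling step are hypothetical.

```python
import torch
import torch.nn as nn

class LateFusionEmotionClassifier(nn.Module):
    """Toy late-fusion model: separate audio and video encoders whose
    embeddings are concatenated before a shared classification head."""

    def __init__(self, audio_dim=40, video_dim=512, hidden_dim=128, num_emotions=7):
        super().__init__()
        # Audio branch: e.g., frame-level acoustic features pooled over time.
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        # Video branch: e.g., per-frame face embeddings pooled over frames.
        self.video_encoder = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.ReLU())
        # Fusion head over the concatenated modality embeddings.
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, time, audio_dim); video_feats: (batch, frames, video_dim).
        # Mean-pool the synchronised sequences to fixed-length vectors.
        a = self.audio_encoder(audio_feats.mean(dim=1))
        v = self.video_encoder(video_feats.mean(dim=1))
        return self.classifier(torch.cat([a, v], dim=-1))

# Example: a batch of 2 clips with 100 audio frames and 16 video frames.
model = LateFusionEmotionClassifier()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 7])
```

Early fusion (concatenating raw features before encoding) and attention-based cross-modal fusion are common alternatives; the late-fusion form is shown here only because it is the simplest to sketch.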

Data Availability Statement

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
