Advances and Challenges in Multi-Modal Emotion Recognition: A Comprehensive Investigation
DOI:
https://doi.org/10.54097/0ah5h819

Keywords:
Multi-modal emotion recognition, machine learning, deep learning.

Abstract
Multi-Modal Emotion Recognition (MER) combines information from two or more modalities, such as speech, facial expressions, video, and physiological signals, to infer people's emotional states more accurately. Previous work shows that relying on a single modality often misses important cues: audio may capture tone but not facial micro-expressions, while video may capture expression but not internal arousal. Recent systems built on CNNs, Transformers, or hybrid fusion architectures, applied in driving-safety and healthcare contexts, have improved accuracy significantly, especially when handling missing or noisy modalities. This survey reviews such methods, discusses open challenges such as interpretability, modality mismatch, and real-time deployment, and suggests future directions including lightweight models, privacy-preserving fusion, and cross-domain generalization. It also highlights the growing importance of explainable, adaptive models that can adjust dynamically to changing environments and user contexts. As MER continues to evolve, these innovations will strengthen emotion-aware applications in human-computer interaction, mental health, and intelligent systems.
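To ground the fusion pattern described above, the following is a minimal sketch of model-level (late) fusion in PyTorch. It is not the implementation of any surveyed system: the feature sizes (AUDIO_DIM, VIDEO_DIM), the seven-class emotion set, and the mask that zeroes out a missing audio stream are all illustrative assumptions.

import torch
import torch.nn as nn

# Illustrative sizes; not taken from any surveyed paper.
AUDIO_DIM, VIDEO_DIM, HIDDEN, N_EMOTIONS = 40, 512, 128, 7

class LateFusionMER(nn.Module):
    """Encode each modality separately, then fuse by concatenation."""
    def __init__(self):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(AUDIO_DIM, HIDDEN), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(VIDEO_DIM, HIDDEN), nn.ReLU())
        self.classifier = nn.Linear(2 * HIDDEN, N_EMOTIONS)

    def forward(self, audio, video, audio_present=None):
        a = self.audio_enc(audio)
        v = self.video_enc(video)
        # One simple way to tolerate a missing modality: zero its
        # embedding so the classifier falls back on the other stream.
        if audio_present is not None:
            a = a * audio_present.unsqueeze(-1)
        return self.classifier(torch.cat([a, v], dim=-1))

# Random features stand in for real audio/video descriptors.
model = LateFusionMER()
logits = model(torch.randn(8, AUDIO_DIM), torch.randn(8, VIDEO_DIM))
print(logits.shape)  # torch.Size([8, 7])

Concatenation is only the simplest fusion baseline; the Transformer-based systems surveyed here replace it with cross-modal attention over modality tokens, but the encode-then-fuse skeleton is the same.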
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.