Attention Mechanism in Multimodal Emotion Recognition From 2020 to 2025: Technological Evolution, Challenges, and Future Prospects
DOI: https://doi.org/10.54097/tgf6bd67

Keywords: Multimodality, Emotion Recognition, Attention Mechanisms, Natural Language Processing, Large-scale Models

Abstract
Multimodal emotion recognition, a core technology of human-computer interaction, has seen breakthrough progress in recent years through the integration of attention mechanisms. This article systematically reviews research in the field from 2020 to 2025 and, for the first time, establishes a classification framework along the dual dimensions of "attention refinement" and "deepening of modality interaction". It proposes a four-stage evolutionary model that traces the progression from foundational methods through optimization and multi-scale modeling to cross-modal fusion, and presents a technical matrix that organizes innovations in cross-modal, recursive, multi-scale, and dynamic attention mechanisms. It further reveals the feature-alignment bottleneck caused by modal heterogeneity. Finally, it proposes development paths in directions such as meta-learning-based regulation and interpretability enhancement, providing a theoretical reference for building robust multimodal systems. Overall, this review not only summarizes current progress in multimodal attention research but also highlights future opportunities for advancing human-computer interaction, encouraging continued exploration of more intelligent, inclusive, and adaptive multimodal systems.
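The cross-modal attention mechanisms surveyed here share a common core: features of one modality act as queries against the keys and values of another, so that, for example, text tokens can selectively pool information from audio frames. The following is a minimal NumPy sketch of that core idea under stated assumptions; the function name, shapes, and toy data are hypothetical illustrations, not the method of any specific cited paper.

```python
import numpy as np

def cross_modal_attention(query_feats, key_feats, value_feats):
    """Scaled dot-product attention where queries come from one modality
    (e.g. text) and keys/values from another (e.g. audio).

    Returns the fused features and the attention weights.
    """
    d = query_feats.shape[-1]
    # (Tq, Tk) affinity between query tokens and key tokens
    scores = query_feats @ key_feats.T / np.sqrt(d)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each query token becomes a weighted mix of the other modality's values
    return weights @ value_feats, weights

# toy example: 3 text tokens attend over 4 audio frames, feature dim 8
rng = np.random.default_rng(0)
text = rng.standard_normal((3, 8))
audio = rng.standard_normal((4, 8))
fused, attn = cross_modal_attention(text, audio, audio)
# fused has shape (3, 8); each row of attn sums to 1
```

Recursive variants feed the fused output back as the next round's queries, and multi-scale variants apply the same operation at several temporal granularities before combining the results.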
Copyright (c) 2026 Academic Journal of Science and Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.








