Research Advanced in Multimodal Emotion Recognition Based on Deep Learning
DOI:
https://doi.org/10.54097/p3yprn36
Keywords:
Emotion recognition; deep learning; multimodal.
Abstract
Emotion recognition, which seeks to decode the emotional content carried by data, has long been of interest in computer science. Early approaches were predominantly based on single-modality data sources, such as textual sentiment analysis, speech-based emotion detection, or the analysis of facial expressions. In recent years, as data representations have become increasingly abundant, multimodal emotion recognition has gradually attracted more attention. Multimodal emotion recognition involves not only text but also audio, image, and video, and is of great significance for enhancing human-computer interaction, improving user experience, and advancing emotion-aware applications. This paper discusses the research advances and primary techniques of multimodal emotion recognition. Specifically, it first introduces representative methods for single-modal emotion recognition based on graphic data, including their basic process, advantages, and disadvantages. Second, it introduces pertinent studies on multimodal emotion recognition and offers a quantitative comparison of how different approaches perform on standard multimodal datasets. Lastly, it addresses the challenges inherent in multimodal emotion recognition research and suggests potential areas for future study.
License
Copyright (c) 2024 Highlights in Science, Engineering and Technology

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.