LMCM: Large Pre-trained Model Driven Multimodal Emotion Recognition via Cross-Modal Attention
DOI: https://doi.org/10.54097/tzf73f09

Keywords: Emotion Recognition, Multimodal Learning, CLIP, Natural Language Processing, Cross-Attention

Abstract
With the continuous progress of deep learning technologies, emotion recognition has become an essential research topic in fields including human-computer interaction (HCI), intelligent assistants, and mental health applications. Compared to unimodal methods, emotion recognition across multiple modalities offers deeper insight into human affective states. However, existing methods still suffer from insufficient cross-modal feature alignment, limited generalization ability, and underutilization of large pre-trained models. To tackle these problems, this study proposes a new framework termed LMCM. For the visual modality, CLIP is employed as the feature extractor to leverage its strong image-text alignment capability; for the textual modality, the DeBERTa-V3-base model is adopted to obtain high-quality semantic representations. In the fusion stage, a residual cross-attention and a dual-branch parallel cross-attention mechanism are designed to maximize the use of complementary cues between the visual and textual modalities. Experiments conducted on the IEMOCAP dataset demonstrate that the proposed method surpasses earlier baselines, achieving higher Weighted Accuracy (WA) and Weighted F1-score (WF1). This study not only validates the potential of large-scale pre-trained models for multimodal emotion recognition but also provides a reproducible paradigm for future research on cross-modal feature fusion.
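To make the fusion stage concrete, the following is a minimal PyTorch sketch of a residual, dual-branch parallel cross-attention module operating on CLIP visual tokens and DeBERTa-V3 text tokens. The embedding dimension, number of attention heads, pooling strategy, number of emotion classes, and classifier head are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch (assumptions noted above): two parallel cross-attention branches
# with residual connections, followed by pooling and a linear emotion classifier.
import torch
import torch.nn as nn

class DualBranchCrossAttentionFusion(nn.Module):
    def __init__(self, dim=768, heads=8, num_classes=4):
        super().__init__()
        # Branch 1: text tokens query the visual tokens.
        self.text_to_vision = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Branch 2: visual tokens query the text tokens.
        self.vision_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_text = nn.LayerNorm(dim)
        self.norm_vision = nn.LayerNorm(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, text_feats, vis_feats):
        # Residual cross-attention: attended output is added back to each query stream.
        t_att, _ = self.text_to_vision(text_feats, vis_feats, vis_feats)
        t = self.norm_text(text_feats + t_att)
        v_att, _ = self.vision_to_text(vis_feats, text_feats, text_feats)
        v = self.norm_vision(vis_feats + v_att)
        # Pool each branch and concatenate for emotion classification (assumed readout).
        fused = torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Toy usage: random tensors stand in for DeBERTa-V3-base token embeddings and
# CLIP visual patch embeddings (both assumed projected to 768 dimensions).
model = DualBranchCrossAttentionFusion()
text_feats = torch.randn(2, 32, 768)   # [batch, text tokens, dim]
vis_feats = torch.randn(2, 50, 768)    # [batch, visual patches, dim]
logits = model(text_feats, vis_feats)  # [batch, num_classes]

In this sketch, each branch keeps its own query stream intact through the residual addition, so complementary cues from the other modality refine rather than replace the original representation; how the two branches are pooled and combined is an assumed design choice.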
License
Copyright (c) 2026 Academic Journal of Science and Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.








