Multimodal Affective Computing Method with Cross-Modal Attention

Bo He; Hao Han; Hongqian Zhang; Weining Bai

doi:10.54097/jcw9e446

Keywords:

Multimodal Affective Computing, Cross-Modal Attention, Sentiment Analysis, Emotion Recognition in Conversation, Sarcasm Detection

Abstract

Multimodal affective computing (MAC) aims to enable machines to recognize and understand human emotions by integrating information from multiple modalities, including text, speech, facial expressions, and gestures. Traditional fusion methods such as early fusion, late fusion, and tensor fusion often struggle to capture fine-grained inter-modal dependencies and to resolve conflicts between modalities. Cross-modal attention addresses these challenges by selectively aligning and weighting features across modalities, highlighting relevant cues while suppressing noise. This survey reviews recent advances in MAC with a focus on cross-modal attention, covering core tasks such as Multimodal Sentiment Analysis (MSA), Multimodal Emotion Recognition in Conversations (MERC), and Multimodal Sarcasm and Humor Detection (MSD). We analyze the principles of cross-modal attention, summarize representative models, and discuss their empirical performance. Finally, we identify current challenges, including dataset limitations, asynchronous modality alignment, and interpretability, and propose future directions such as large-scale pretraining, knowledge-enhanced modeling, and interpretable attention mechanisms. Overall, the surveyed work indicates that cross-modal attention improves robustness, context-awareness, and fine-grained emotion understanding in multimodal affective computing.
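
To make "selectively aligning and weighting features across modalities" concrete, the following is a minimal sketch of cross-modal attention, assuming the common scaled dot-product formulation: each feature of one modality (here, text) queries the features of another (here, audio) and receives a weighted summary of them. The function names and the toy feature values are hypothetical, and the learned projection matrices used in practice are omitted for brevity; this is not the implementation of any model covered by the survey.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention from one modality onto another.

    queries: feature vectors of the target modality (e.g. text tokens)
    keys, values: feature vectors of the source modality (e.g. audio frames)
    Returns the attended features and the attention weights.
    """
    dim = len(keys[0])
    attended, weights = [], []
    for q in queries:
        # Similarity of this query to every source-modality position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        # Weighted sum of the source-modality values.
        out = [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
        attended.append(out)
        weights.append(w)
    return attended, weights


# Toy features, made up for illustration: 2 text tokens, 3 audio frames, dim 2.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]

attended, weights = cross_modal_attention(text, audio, audio)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy setup each text token places most of its weight on the audio frame that points in the same direction, and the uninformative all-zero frame receives little weight, which is the "highlight relevant cues, suppress noise" behavior in miniature.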
