Noise-Aware Causal Orthogonal Gating Network for Robust and Interpretable Multimodal Sentiment Analysis
DOI: https://doi.org/10.54097/qy8p7839
Keywords: Multimodal Sentiment Analysis, Signal-to-Noise Causal Orthogonality, Gating Mechanism, Deep Learning
Abstract
The rapid expansion of multimodal content across online social platforms has made Multimodal Sentiment Analysis (MSA) both indispensable and technically difficult within computational social systems. The core of the problem is cross-modal heterogeneity: textual, visual, and acoustic data streams rarely align cleanly. Beyond this basic mismatch, noise, missing modalities, and erroneous inter-modal associations degrade both the reliability and the interpretability of existing models. To address these issues, this study proposes the Noise-Aware Causal Orthogonal Gating Network (NCOG), a framework that integrates causal inference theory into adaptive multimodal fusion. Rather than using the traditional unimodal nonlinear gating that most methods rely on, NCOG splits gating into two orthogonal but complementary submodules. The first, the Reliability Gate, filters signals using explicit noise proxies such as inter-frame variation, pitch energy fluctuation, and ASR confidence. The second, the Causal Contribution Gate, works differently from this front-end screening step: it applies timestep-level counterfactual masking to estimate the actual causal effect of each modality while suppressing spurious correlations. Alongside this two-gate design, a cross-modal attention module dynamically adjusts the weight assigned to each modality, which also makes the model more stable when the input is noisy or partially missing. In experiments, NCOG-MSA achieved 88.12% accuracy, 87.11% F1, an MAE of 0.681, and a PCC of 0.809 on the CMU-MOSI dataset. On the CMU-MOSEI dataset, the corresponding figures are 87.44% accuracy, 89.04% F1, an MAE of 0.483, and a PCC of 0.816. Compared with the strongest existing fusion baselines, including ACMG, TMFN, and HCAN, the framework performs better on every one of these evaluation metrics. Taken together, these results indicate that combining explicit noise modeling with causal reasoning makes multimodal sentiment analysis more robust, more interpretable, and better at generalizing to new data. The approach also offers a foundation for building affective intelligence that can operate in complex real-world social systems.
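The two-gate idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 1/(1 + noise) reliability mapping, the zero-masking counterfactual, and the toy tanh predictor are all assumptions made for illustration only. Features are assumed to have shape (modalities, timesteps, dims), and each noise proxy is assumed to be a nonnegative score where higher means noisier.

```python
import numpy as np

def reliability_gate(noise):
    """Map nonnegative noise proxies (M, T) to weights in (0, 1].

    Assumed form: higher noise gives a lower weight.
    """
    return 1.0 / (1.0 + noise)

def toy_predictor(x, w):
    """Hypothetical stand-in for a sentiment head: x is (M, T, D), w is (D,)."""
    pooled = x.sum(axis=0).mean(axis=0)  # sum modalities, average over time
    return float(np.tanh(pooled @ w))

def causal_gate(x, w):
    """Timestep-level counterfactual masking (illustrative).

    Zero out one modality at one timestep and measure how much the
    prediction changes; larger changes get larger gate values.
    """
    num_modalities, num_steps, _ = x.shape
    base = toy_predictor(x, w)
    effect = np.zeros((num_modalities, num_steps))
    for m in range(num_modalities):
        for t in range(num_steps):
            masked = x.copy()
            masked[m, t] = 0.0  # counterfactual input
            effect[m, t] = abs(base - toy_predictor(masked, w))
    return effect / (effect.sum() + 1e-8)

def fuse(x, noise, w):
    """Combine both gates and fuse modalities at each timestep."""
    gate = reliability_gate(noise) * causal_gate(x, w)
    gate = gate / (gate.sum(axis=0, keepdims=True) + 1e-8)  # normalize over modalities
    fused = (gate[..., None] * x).sum(axis=0)  # (T, D)
    return fused, gate
```

In this sketch the two gates are multiplied, so a modality at a given timestep contributes only when it is judged both reliable and causally relevant; how NCOG actually combines its gates with cross-modal attention is not specified in the abstract.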
References
[1] Poria, S., Cambria, E., Hazarika, D., & Majumder, N. (2017). A review of multimodal sentiment analysis. IEEE Transactions on Affective Computing, 8(4), 437–451. https://doi.org/10.1109/TAFFC.2017.2716188.
[2] Zadeh, A., Chen, M., Poria, S., Cambria, E., & Morency, L. P. (2017). Tensor fusion network for multimodal sentiment analysis. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1103–1114. https://doi.org/10.18653/v1/D17-1115.
[3] Liu, Z., Cai, L., Yang, W., & Liu, J. (2024). Sentiment analysis based on text information enhancement and multimodal feature fusion. Pattern Recognition, 147, 109989. https://doi.org/10.1016/j.patcog.2024.109989.
[4] Tsai, Y. H. H., Bai, S., Liang, P. P., Kolter, J. Z., Morency, L. P., & Salakhutdinov, R. (2019). Multimodal transformer for unaligned multimodal language sequences. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 6558–6569. https://doi.org/10.18653/v1/P19-1664.
[5] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
[6] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. https://arxiv.org/abs/1810.04805.
[7] Rahman, W., Hasan, M. K., Lee, S., Zadeh, A., Mao, C., Morency, L. P., & Hoque, E. (2020). Integrating multimodal information in large pretrained transformers. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2359–2369. https://doi.org/10.18653/v1/2020.acl-main.212.
[8] Li, Z., Zhou, Y., Zhang, W., Liu, Y., Yang, C., Lian, Z., & Hu, S. (2025). TMFN: A text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis. Complex & Intelligent Systems, 11(1), 133. https://doi.org/10.1007/s40747-024-01724-5.
[9] Han, W., Chen, H., & Poria, S. (2021). Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 9180–9192. https://doi.org/10.18653/v1/2021.emnlp-main.720.
[10] Baltrušaitis, T., Ahuja, C., & Morency, L. P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443. https://doi.org/10.1109/TPAMI.2018.2798316.
[11] Liu, Z., Shen, Y., Lakshminarasimhan, V. B., Liang, P. P., Zadeh, A., & Morency, L. P. (2018). Efficient low-rank multimodal fusion with modality-specific factors. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2247–2256. https://doi.org/10.18653/v1/P18-1213.
[12] Wang, Y., Shen, Y., Liu, Z., Liang, P. P., & Morency, L. P. (2019). Words can shift: Dynamically adjusting word representations using nonverbal behaviors. Proceedings of the AAAI Conference on Artificial Intelligence, 33(1), 7216–7223. https://doi.org/10.1609/aaai.v33i01.33017216.
[13] Yang, X., Feng, S., Zhang, Y., & Wang, D. (2021). Multimodal sentiment detection based on multi-channel graph neural networks. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 328–339. https://doi.org/10.18653/v1/2021.acl-long.28.
[14] Sun, T., Wang, W., Jing, L., Cui, Y., Song, X., & Nie, L. (2022). Counterfactual reasoning for out-of-distribution multimodal sentiment analysis. Proceedings of the 30th ACM International Conference on Multimedia, 5219–5227. https://doi.org/10.1145/3503161.3548111.
[15] Huang, C., Chen, J., Huang, Q., Wang, S., Tu, Y., & Huang, X. (2025). AtCAF: Attention-based causality-aware fusion network for multimodal sentiment analysis. Information Fusion, 114, 102725. https://doi.org/10.1016/j.inffus.2024.102725.
[16] Pan, X., & Others. (2024). Hybrid uncertainty calibration for multimodal sentiment analysis. Electronics, 13(3), 662. https://doi.org/10.3390/electronics13030662.
[17] Wang, C., & Zhou, Y. (2024). Rethinking the role of attention mechanism: A causality perspective. Applied Intelligence, 54(10), 12791–12806. https://doi.org/10.1007/s10489-023-05129-2.
[18] Gandhi, A., Adhvaryu, K., Poria, S., Cambria, E., & Hussain, A. (2023). Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion, 91, 424–444. https://doi.org/10.1016/j.inffus.2022.10.001.
[19] Lai, S., & Others. (2023). Multimodal sentiment analysis: A survey. Displays, 69, 102073. https://doi.org/10.1016/j.displa.2023.102073.
[20] Liu, Y., & Others. (2024). Data uncertainty-aware learning for multimodal aspect-based sentiment analysis. arXiv preprint arXiv:2412.01249. https://arxiv.org/abs/2412.01249.
[21] Du, P. Y., Gao, Y., Li, L., & Li, X. (2024). SGAMF: Sparse gated attention-based multimodal fusion method for fake news detection. IEEE Transactions on Big Data. https://doi.org/10.1109/TBDATA.2024.3414341.
[22] Han, W., & Others. (2024). Beyond simple fusion: Adaptive gated fusion for robust multimodal sentiment analysis. arXiv preprint arXiv:2510.01677. https://arxiv.org/abs/2510.01677.
[23] Pearl, J. (2009). Causality: Models, reasoning and inference (2nd ed.). Cambridge University Press. https://doi.org/10.1017/CBO9780511803161.
[24] Pearl, J. (2018). The book of why: The new science of cause and effect. Basic Books.
[25] Huang, P., & Others. (2024). Multimodal sentiment analysis based on causal reasoning. arXiv preprint arXiv:2412.07292. https://arxiv.org/abs/2412.07292.
[26] A general debiasing framework with counterfactual reasoning for multimodal public speaking anxiety detection. (2025). Neural Networks. https://doi.org/10.1016/j.neunet.2025.02.017.
[27] Li, Z., & Zou, Z. (2025). H²CAN: Heterogeneous hypergraph attention network with counterfactual learning for multimodal sentiment analysis. Complex & Intelligent Systems. https://doi.org/10.1007/s40747-025-01806-y.
[28] Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI Blog.
[29] Zadeh, A., Liang, P. P., Poria, S., Cambria, E., & Morency, L. P. (2018). Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2236–2246. https://doi.org/10.18653/v1/P18-1208.
[30] Johnson, E., Patel, R., & Smith, M. (2024). Advanced cross-modal gating for enhanced multimodal sentiment analysis. Preprints. https://doi.org/10.20944/preprints202408.0265.v1.
[31] Research on cross-modal emotion recognition based on multi-layer semantic fusion (CM-MSF model). (2024). Mathematical Biosciences and Engineering, 21(2), 2520–2544. https://doi.org/10.3934/mbe.2024110.
[32] Fu, J., Fu, Y., Xue, H., & Xu, Z. (2025). TMFN: A text-based multimodal fusion network with multi-scale feature extraction and unsupervised contrastive learning for multimodal sentiment analysis. Complex & Intelligent Systems, 11(1), 133. https://doi.org/10.1007/s40747-024-01724-5.
[33] Lin, J., & Others. (2024). Semi-IIN: Semi-supervised intra-inter modal interaction learning network for multimodal sentiment analysis. arXiv preprint arXiv:2412.09784. https://arxiv.org/abs/2412.09784.
[34] Yu, W., Xu, H., Yuan, Z., & Wu, J. (2021). Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12), 10790–10797. https://doi.org/10.1609/aaai.v35i12.19314.
[35] Multimodal GRU with directed pairwise cross-modal attention for sentiment analysis. (2025). Scientific Reports, 15, 93023. https://doi.org/10.1038/s41598-025-93023-3.
[36] Xie, S., Chen, Q., Fang, X., & Others. (2024). Global information regulation network for multimodal sentiment analysis. Image and Vision Computing, 151, 105297. https://doi.org/10.1016/j.imavis.2024.105297.
License
Copyright (c) 2026 Frontiers in Computing and Intelligent Systems

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

