The Emotion Recognition Triathlon: DeepSeek vs. ChatGPT vs. Doubao
DOI:
https://doi.org/10.54097/xvcbsd93
Keywords:
Multimodal Emotion Recognition, Large Language Models, Comparative Analysis.
Abstract
This study presents a systematic empirical comparison of three leading large language models, namely DeepSeek, ChatGPT (GPT-4o), and Doubao, on multimodal emotion recognition tasks. Using a self-constructed dataset of 1,200 annotated text-image samples across three emotional scenarios (social gatherings, stress-induced tension, and anticipation-anxiety), the models were evaluated on overall performance, fine-grained emotion recognition, and context sensitivity. Results indicate that ChatGPT achieves the highest overall accuracy (89.5%) and demonstrates superior cross-modal reasoning and interpretability. Doubao excels in Chinese social contexts, with an F1 score of 91.5%, but shows limited cross-lingual generalization. DeepSeek performs stably in text-dominant tasks but lags in multimodal fusion scenarios. The findings highlight the context-dependent strengths of each model and provide practical guidance for model selection in real-world applications, such as global platforms, Chinese social media, and resource-constrained environments. This study addresses a critical gap in the comparative evaluation of multimodal LLMs and offers insights for future research on cross-cultural and lightweight multimodal emotion recognition.
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.







