The Emotion Recognition Triathlon: DeepSeek vs. ChatGPT vs. Doubao

Zhichang Liu

doi:10.54097/xvcbsd93

Keywords:

Multimodal Emotion Recognition, Large Language Models, Comparative Analysis.

Abstract

This study presents a systematic empirical comparison of three leading large language models (LLMs) on multimodal emotion recognition tasks: DeepSeek, ChatGPT (GPT-4o), and Doubao. Using a self-constructed dataset of 1,200 annotated text-image samples across three emotional scenarios (social gatherings, stress-induced tension, and anticipation-anxiety), the models were evaluated on overall performance, fine-grained emotion recognition, and context sensitivity. Results indicate that ChatGPT achieves the highest overall accuracy (89.5%) and demonstrates superior cross-modal reasoning and interpretability. Doubao excels in Chinese social contexts, reaching an F1 score of 91.5%, but shows limited cross-lingual generalization. DeepSeek performs stably in text-dominant tasks but lags in multimodal fusion scenarios. The findings highlight the context-dependent strengths of each model and provide practical guidance for model selection in real-world applications such as global platforms, Chinese social media, and resource-constrained environments. This study addresses a critical gap in the comparative evaluation of multimodal LLMs and offers insights for future research in cross-cultural and lightweight multimodal emotion recognition.
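
As an illustration of the two metrics reported above, the following is a minimal sketch of how accuracy and an F1 score could be computed from gold and predicted emotion labels. It assumes a macro-averaged F1 over classes; the abstract does not state which averaging was used, and the function names and label strings are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the gold label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging, not confirmed by the paper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # gold class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for four samples, for illustration only
gold = ["joy", "joy", "tension", "anxiety"]
pred = ["joy", "tension", "tension", "anxiety"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Accuracy alone can mask weak performance on rarer emotion classes, which is one reason a per-class F1 is often reported alongside it.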
