Study on Sentiment Analysis Methods for Uyghur Videos

Authors

  • Jiazhi Wang
  • Jiarong Zhang
  • Haijiao Guan
  • Wenxiu He

DOI:

https://doi.org/10.54097/v396ne87

Keywords:

Video Sentiment Analysis, Low-resource Language, Multimodality

Abstract

As a representative minority language in China, Uyghur has important application value in cross-regional communication, trade, and multilingual information services. However, research on sentiment analysis for Uyghur videos lags far behind that for high-resource languages, particularly in terms of systematically constructed multimodal sentiment annotation datasets and well-adapted analytical models. To address this gap, this paper investigates sentiment analysis for Uyghur videos. A Uyghur video sentiment dataset covering three modalities (text, audio, and video) was constructed. Building on the ToxVidLM framework, the original audio and text encoders, Whisper and H-RoBERTa, were replaced with XLS-R-uyghur-cv and CINO, which are better suited to Uyghur-language scenarios, while VideoMAE was retained as the video encoder, yielding an improved model. Experimental results show that the improved model outperforms the original framework on three sentiment-related tasks: emotional tension, emotion intensity, and sentiment polarity. This indicates that, in low-resource language scenarios, combining dataset construction with encoder adaptation can effectively improve sentiment analysis for Uyghur videos.
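To make the encoder substitution concrete, the sketch below shows one way such a model could be assembled with the Hugging Face transformers library. It is a minimal illustration, not the paper's released code: the checkpoint identifiers are assumptions (facebook/wav2vec2-xls-r-300m stands in as a placeholder for the Uyghur Common Voice fine-tuned XLS-R-uyghur-cv checkpoint), the label counts per task are placeholders, and plain feature concatenation substitutes for ToxVidLM's actual multimodal fusion module.

# Minimal sketch of the encoder substitution described in the abstract.
# Checkpoint IDs, label counts, and the concatenation-based fusion head
# are illustrative assumptions, not the authors' released configuration.
import torch
import torch.nn as nn
from transformers import AutoModel

class UyghurVideoSentimentModel(nn.Module):
    def __init__(self,
                 text_ckpt="hfl/cino-base-v2",               # CINO text encoder
                 audio_ckpt="facebook/wav2vec2-xls-r-300m",  # placeholder for XLS-R-uyghur-cv
                 video_ckpt="MCG-NJU/videomae-base",         # VideoMAE video encoder
                 num_labels=(2, 3, 3)):                      # tension / intensity / polarity (assumed)
        super().__init__()
        self.text_enc = AutoModel.from_pretrained(text_ckpt)
        self.audio_enc = AutoModel.from_pretrained(audio_ckpt)
        self.video_enc = AutoModel.from_pretrained(video_ckpt)
        fused_dim = (self.text_enc.config.hidden_size
                     + self.audio_enc.config.hidden_size
                     + self.video_enc.config.hidden_size)
        # One classification head per task, matching the three tasks in the abstract.
        self.heads = nn.ModuleList(nn.Linear(fused_dim, n) for n in num_labels)

    def forward(self, input_ids, attention_mask, audio_values, pixel_values):
        # Mean-pool each modality's sequence of hidden states into one vector.
        t = self.text_enc(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state.mean(1)
        a = self.audio_enc(input_values=audio_values).last_hidden_state.mean(1)
        v = self.video_enc(pixel_values=pixel_values).last_hidden_state.mean(1)
        fused = torch.cat([t, a, v], dim=-1)          # simple stand-in for ToxVidLM fusion
        return [head(fused) for head in self.heads]   # one logit vector per task

Mean pooling is used here only as the simplest modality summary; any pooling or cross-modal attention scheme could replace it without changing the overall encoder-swap idea.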


References

[1] Wu Y, Mi Q W, Gao T H. A comprehensive review of multimodal emotion recognition: techniques, challenges, and future directions[J]. Biomimetics, 2025, 10(7): 418.

[2] Gladys A A, Vetriselvi V. Sentiment analysis on a low-resource language dataset using multimodal representation learning and cross-lingual transfer learning[J]. Applied Soft Computing, 2024, 157: 111553.

[3] Chen L, Guan S, Huang X, et al. Cross-lingual multimodal sentiment analysis for low-resource languages via language family disentanglement and rethinking transfer[C]//Findings of the Association for Computational Linguistics: ACL 2025. Vienna, Austria: Association for Computational Linguistics, 2025: 6513-6522.

[4] Qin L, Chen Q, Zhou Y, et al. A survey of multilingual large language models[J]. Patterns, 2025, 6(1): 101118.

[5] Xu Y, Hu L, Zhao J, et al. A survey on multilingual large language models: corpora, alignment, and bias[J]. Frontiers of Computer Science, 2025, 19(11): 1911362.

[6] Areshey A, Mathkour H. Exploring transformer models for sentiment classification: A comparison of BERT, RoBERTa, ALBERT, DistilBERT, and XLNet[J]. Expert Systems, 2024, 41(11): e13701.

[7] Krugmann J O, Hartmann J. Sentiment Analysis in the Age of Generative AI[J]. Customer Needs and Solutions, 2024, 11(1): 3.

[8] Hsu W N, Bolte B, Tsai Y H H, et al. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3451-3460.

[9] Chen S, Wang C, Chen Z, et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing[J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1505-1518.

[10] Ma Z, Zheng Z, Ye J, et al. emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation[C]//Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics, 2024: 15747-15760.

[11] Bertasius G, Wang H, Torresani L. Is Space-Time Attention All You Need for Video Understanding?[C]//Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, 2021, 139: 813-824.

[12] Tong Z, Song Y, Wang J, et al. VideoMAE: Masked Autoencoders Are Data-Efficient Learners for Self-Supervised Video Pre-Training[C]//Advances in Neural Information Processing Systems 35. 2022.

[13] Sun J, Han S, Ruan Y P, et al. Layer-wise Fusion with Modality Independence Modeling for Multi-modal Emotion Recognition[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: Association for Computational Linguistics, 2023: 658-670.

[14] Wu Z, Gong Z, Koo J, et al. Multimodal Multi-loss Fusion Network for Sentiment Analysis[C]//Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics, 2024: 3588-3602.

[15] Firdaus M, Chauhan H, Ekbal A, et al. MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations[C]//Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, 2020: 4441-4453.

[16] Yu W, Xu H, Meng F, et al. CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020: 3718-3727.

[17] Maity K, Poornash A S, Saha S, et al. ToxVidLM: A Multimodal Framework for Toxicity Detection in Code-Mixed Videos[C]//Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics, 2024: 11130-11142.


Published

30-03-2026

Issue

Vol. 15 No. 3 (2026)

Section

Articles

How to Cite

Wang, J., Zhang, J., Guan, H., & He, W. (2026). Study on Sentiment Analysis Methods for Uyghur Videos. Frontiers in Computing and Intelligent Systems, 15(3), 161-164. https://doi.org/10.54097/v396ne87