Research on Speech Intonation Recognition Technology Based on Deep Learning in Human-Computer Interaction

Authors

  • Jiajun Ge
  • Beize Lin
  • Jiaqi Zhang

DOI:

https://doi.org/10.54097/wmfh7e61

Keywords:

Speech Intonation Recognition, Deep Learning, Human-Computer Interaction (HCI), Speech Emotion Recognition (SER).

Abstract

As a critical subfield of speech signal processing, speech intonation recognition aims to interpret the paralinguistic features of an utterance (such as pitch, rhythm, and energy) that lie beyond its textual content. Advances in this technology are a core driver of more natural and emotionally intelligent human-computer interaction. This study reviews the development of intonation recognition, which has progressed from rule-based methods to statistical models and now to deep learning models, with recognition accuracy improving at each stage. In feature extraction, characteristics of the speech signal such as pitch, duration, and volume provide the data foundation for recognition models. Recognition algorithms have evolved from early Hidden Markov Models (HMMs) and Support Vector Machines (SVMs) to the currently mainstream deep learning models, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The technology is widely applied in voice assistants, intelligent customer service, and speech synthesis. For instance, it allows a voice assistant to analyze a user's emotion and adjust its service strategy, and it makes synthesized speech sound more natural. Current research focuses on algorithms that are robust to noise and accent interference and on integration with cognitive science. Deep learning is expected to bring further breakthroughs in model complexity and recognition accuracy. Furthermore, driven by Internet of Things (IoT) and 5G technologies, applications are expected to expand into smart homes, telemedicine, and other domains.
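As an illustration of the feature extraction step named in the abstract, the following is a minimal sketch of how pitch, volume, and duration could be computed from raw audio samples. It is not the authors' method: the function names, the 75 to 400 Hz search range, and the autocorrelation-based pitch estimator are all assumptions chosen for illustration, and a practical system would typically use a dedicated audio library and a more robust pitch tracker.

```python
import math


def rms_energy(frame):
    """Root-mean-square energy of one frame, used here as a proxy for volume."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_pitch(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame.

    Picks the lag with the largest autocorrelation inside the lag range
    implied by [fmin, fmax] (an assumed range for speech). Returns 0.0
    when no positive correlation peak is found, e.g. for silence.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def intonation_features(signal, sample_rate, frame_len):
    """Return the duration (s) and a per-frame list of (pitch_hz, energy) pairs."""
    frames = [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, frame_len)
    ]
    contour = [(estimate_pitch(f, sample_rate), rms_energy(f)) for f in frames]
    return len(signal) / sample_rate, contour
```

The resulting per-frame pitch and energy contour is the kind of input that the sequence models mentioned in the abstract (HMMs, RNNs) could consume; how the reviewed systems actually compute or normalize these features is not specified here.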

Published

29-01-2026

Issue

Vol. 19 No. 2 (2026)

Section

Articles

How to Cite

Ge, J., Lin, B., & Zhang, J. (2026). Research on Speech Intonation Recognition Technology Based on Deep Learning in Human-Computer Interaction. Academic Journal of Science and Technology, 19(2), 36-40. https://doi.org/10.54097/wmfh7e61