Uyghur Keyword Spotting and Speech Representation Learning Based on an E-Branchformer Encoder-Decoder Architecture

Authors

  • Haiyang Wang
  • Jiazhi Wang

DOI:

https://doi.org/10.54097/5ynv3r62

Keywords:

Uyghur Speech, Low-resource Language, Keyword Spotting, Speech Representation, Multi-task Learning

Abstract

Keyword spotting for Uyghur remains challenging because of limited labeled resources, agglutinative morphology, speaker diversity, and unstable boundary cues under partial observation. This paper presents a non-streaming E-Branchformer encoder-decoder framework that unifies Uyghur keyword spotting and speech representation analysis. Beyond a standard keyword spotting pipeline, the study explicitly investigates how hidden representations evolve when only a prefix of an utterance is available. To support this goal, the corpus undergoes systematic data cleaning, including duplicate removal, damaged-file filtering, language-mix exclusion, and low-quality-sample screening. After unified preprocessing and normalization, a prefix-stage dataset is built by extracting the first 25%, 50%, 75%, and 100% of each utterance, enabling controlled analysis of completeness and discriminability across scanning stages. The proposed model employs an E-Branchformer encoder, an attention-based decoder, and joint CTC/attention training. A representation-oriented multi-task objective combines keyword classification with completeness prediction, while encoded features from different prefix stages are used for discriminability analysis. Experiments on a 134.1-hour Uyghur speech corpus demonstrate that the proposed method improves keyword spotting performance over competitive baselines and yields more stable hidden representations under incomplete input. The model reaches an EER of 4.9% and an ATWV of 0.901, while the prefix-stage representation study shows consistent gains in 5-NN discrimination and decreasing completeness-prediction error as more of the utterance is observed. These results indicate that representation-oriented training benefits both keyword spotting accuracy and interpretability.
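As a concrete illustration of the prefix-stage construction described above, the following minimal Python sketch cuts each utterance at the 25%, 50%, 75%, and 100% marks and pairs each prefix with its completeness ratio. The soundfile-based I/O and the use of the ratio itself as the completeness-prediction target are assumptions for illustration, not details taken from the paper.

    import soundfile as sf

    # Prefix ratios used for the prefix-stage dataset.
    PREFIX_RATIOS = (0.25, 0.50, 0.75, 1.00)

    def make_prefix_stages(wav_path, keyword_label):
        """Cut one utterance into its 25/50/75/100% prefixes.

        Returns (audio_prefix, keyword_label, completeness) tuples; treating
        the ratio as the completeness target is an illustrative assumption.
        """
        audio, sr = sf.read(wav_path)      # waveform samples and sample rate
        stages = []
        for ratio in PREFIX_RATIOS:
            cut = int(len(audio) * ratio)  # number of leading samples to keep
            stages.append((audio[:cut], keyword_label, ratio))
        return stages

The training objective pairs joint CTC/attention with the two representation-oriented heads. A hedged PyTorch-style sketch of one standard way to combine such terms is shown below; the interpolation weight lam and the head weights alpha and beta are illustrative values, not ones reported in the paper.

    import torch.nn.functional as F

    def multitask_loss(ctc_loss, att_loss, kw_logits, kw_targets,
                       comp_pred, comp_target, lam=0.3, alpha=1.0, beta=0.5):
        """Joint CTC/attention loss plus keyword classification and
        completeness prediction; all weights are assumptions."""
        asr = lam * ctc_loss + (1.0 - lam) * att_loss  # joint CTC/attention
        kw = F.cross_entropy(kw_logits, kw_targets)    # keyword classification
        comp = F.mse_loss(comp_pred, comp_target)      # completeness regression
        return asr + alpha * kw + beta * comp

Finally, the "5-NN discrimination" analysis can be read as a k-nearest-neighbor probe of the encoded features at each prefix stage. A minimal scikit-learn sketch, assuming pooled encoder features and keyword labels as NumPy arrays:

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def five_nn_accuracy(features, labels):
        """Cross-validated 5-NN accuracy as a proxy for how separable
        keyword classes are at a given prefix stage."""
        knn = KNeighborsClassifier(n_neighbors=5)
        return cross_val_score(knn, features, labels, cv=5).mean()

Run per prefix stage (25%, 50%, 75%, 100%); increasing accuracy across stages would mirror the discriminability trend the abstract reports.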



Published

30-04-2026

Issue

Vol. 16 No. 2 (2026)

Section

Articles

How to Cite

Wang, H., & Wang, J. (2026). Uyghur Keyword Spotting and Speech Representation Learning Based on an E-Branchformer Encoder-Decoder Architecture. Frontiers in Computing and Intelligent Systems, 16(2), 85-88. https://doi.org/10.54097/5ynv3r62