End-to-End Speech Hash Retrieval Algorithm based on Speech Content and Pre-training

Authors

  • Yian Li
  • Yibo Huang

DOI:

https://doi.org/10.54097/xa8rfs68

Keywords:

Speech Retrieval, Deep Hashing, WaveNet, Transformer, Pre-training

Abstract

Traditional speech retrieval tasks, such as audio fingerprinting and query-by-example spoken term detection (QbE-STD), focus on feature matching for speech or keyword retrieval. In this paper, we present a content-based speech retrieval algorithm. The algorithm matches on the complete content of sentences in speech, not just local features or keywords. Importantly, it bypasses the Automatic Speech Recognition (ASR) transcription process entirely by mapping the acoustic features of a sentence directly to the Hamming space. Retrieval of identical content is then achieved by comparing Hamming distances, which eliminates the potential impact of transcription errors on retrieval performance. To achieve this, our approach employs Connectionist Temporal Classification (CTC) speech recognition to pre-train the model to learn content-dependent representations of speech features. Through experiments, we demonstrate that our approach achieves excellent performance in speech retrieval tasks.
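The retrieval step described in the abstract — compare binary hash codes by Hamming distance and rank candidates — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hash codes here are hypothetical 8-bit toy values, whereas the paper derives them from a pre-trained acoustic model.

```python
# Hamming-space retrieval sketch. Each utterance is assumed to have already
# been mapped to a fixed-length binary hash code (toy 8-bit values below;
# the acoustic-feature-to-hash mapping itself is not shown).

def hamming_distance(a, b):
    """Number of differing bits between two equal-length binary codes."""
    return bin(a ^ b).count("1")

def retrieve(query_hash, database, top_k=3):
    """Rank database utterances by Hamming distance to the query hash."""
    ranked = sorted(database.items(),
                    key=lambda kv: hamming_distance(query_hash, kv[1]))
    return ranked[:top_k]

# Toy database: utterance IDs mapped to illustrative 8-bit hash codes.
db = {"utt1": 0b10110010, "utt2": 0b10110011, "utt3": 0b01001100}
print(retrieve(0b10110010, db, top_k=2))  # utt1 (distance 0), then utt2 (distance 1)
```

Because Hamming distance on binary codes reduces to an XOR and a popcount, ranking a large database this way is far cheaper than transcribing the query with ASR and matching text.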

Downloads

Download data is not yet available.

References

M. B. Akçay and K. Oğuz, “Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers,” Speech Communication, vol. 116, pp. 56–76, 2020.

L.-s. Lee, J. Glass, H.-y. Lee, and C.-a. Chan, “Spoken content retrieval—beyond cascading speech recognition with text retrieval,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 9, pp. 1389–1420, 2015.

Y.-b. Huang, Y. Wang, H. Li, Y. Zhang, and Q.-y. Zhang, “Encrypted speech retrieval based on long sequence biohashing,” Multimedia Tools and Applications, vol. 81, no. 9, pp. 13065–13085, 2022.

W. Khan and K. Kuru, “An intelligent system for spoken term detection that uses belief combination,” IEEE Intelligent Systems, vol. 32, no. 1, pp. 70–79, Feb 2017.

T. K. Chia, K. C. Sim, H. Li, and H. T. Ng, “A lattice-based approach to query-by-example spoken document retrieval,” in Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 2008, pp. 363–370.

C. Parada, A. Sethy, and B. Ramabhadran, “Query-by-example spoken term detection for oov terms,” in 2009 IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE, 2009, pp. 404–409.

Y. Moriya and G. J. Jones, “Improving noise robustness for spoken content retrieval using semi-supervised asr and n-best transcripts for bert-based ranking models,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 398–405.

W. Shen, C. M. White, and T. J. Hazen, “A comparison of query-by-example methods for spoken term detection,” Massachusetts Institute of Technology, Lexington, Lincoln Laboratory, Tech. Rep., 2009.

H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.

X. Anguera and M. Ferrarons, “Memory efficient subsequence dtw for query-by-example spoken term detection,” in 2013 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2013, pp. 1–6.

H. Kamper, W. Wang, and K. Livescu, “Deep convolutional acoustic word embeddings using word-pair side information,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4950–4954.

C. Jacobs, Y. Matusevych, and H. Kamper, “Acoustic word embeddings for zero-resource languages using self-supervised contrastive learning and multilingual adaptation,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 919–926.

H. Kamper, Y. Matusevych, and S. Goldwater, “Improved acoustic word embeddings for zero-resource languages using multilingual transfer,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1107–1118, 2021.

Q.-y. Zhang, X.-j. Zhao, Q.-w. Zhang, and Y.-z. Li, “Content-based encrypted speech retrieval scheme with deep hashing,” Multimedia Tools and Applications, pp. 10221–10242, Mar 2022.

Y. Yuan, L. Xie, C.-C. Leung, H. Chen, and B. Ma, “Fast query-by-example speech search using attention-based deep binary embeddings,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1988–2000, 2020.

S.-W. Fan-Jiang, T.-H. Lo, and B. Chen, “Spoken document retrieval leveraging bert-based modeling and query reformulation,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 8144–8148.

H. Muaidi, A. Al-Ahmad, T. Khdoor, S. Alqrainy, and M. Alkoffash, “Arabic audio news retrieval system using dependent speaker mode, mel frequency cepstral coefficient and dynamic time warping techniques,” Research Journal of Applied Sciences, Engineering and Technology, pp. 5082–5097, Oct 2016. [Online]. Available: http://dx.doi.org/10.19026/rjaset.7.903

F. Shen, C. Du, and K. Yu, “Acoustic word embeddings for end-to-end speech synthesis,” Applied Sciences, p. 9010, Sep 2021. [Online]. Available: http://dx.doi.org/10.3390/app1119901

A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” SSW, Sep 2016.

C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165.

H. Wang, F. Gao, Y. Zhao, L. Yang, J. Yue, and H. Ma, “Multitask learning with local attention for tibetan speech recognition,” Complexity, vol. 2020, pp. 1–10, 2020.

B.-H. Sung and S.-C. Wei, “Becmer: A fusion model using bert and cnn for music emotion recognition,” in 2021 IEEE 22nd International Conference on Information Reuse and Integration for Data Science (IRI). IEEE, 2021, pp. 437–444.

J. Mingyu, Z. Jiawei, and W. Ning, “Afr-bert: Attention-based mechanism feature relevance fusion multimodal sentiment analysis model,” Plos one, vol. 17, no. 9, p. e0273936, 2022.

D. Wang and X. Zhang, “Thchs-30: A free chinese speech corpus,” arXiv preprint arXiv:1512.01882, 2015.

Published

28-05-2024

Issue

Section

Articles

How to Cite

Li, Y., & Huang, Y. (2024). End-to-End Speech Hash Retrieval Algorithm based on Speech Content and Pre-training. Frontiers in Computing and Intelligent Systems, 8(2), 22-28. https://doi.org/10.54097/xa8rfs68