Evaluation of LSTM, Transformer, and TCN in the Field of Auditory Research
DOI: https://doi.org/10.54097/zjtb2s91
Keywords: auditory domain; deep learning; LSTM; Transformer; TCN
Abstract
Accurately capturing the temporal correlations inherent in auditory signals, together with the interconnections among complex features, remains the core technical challenge in the field. This paper presents a systematic review of three mainstream deep learning architectures: Long Short-Term Memory (LSTM) networks, Transformers, and Temporal Convolutional Networks (TCNs). First, it elaborates on the core mechanism of each model: LSTMs rely on gating, Transformers on the self-attention mechanism, and TCNs on causal, dilated convolutions. Second, it summarizes their typical applications in core auditory tasks, including speech recognition, speech emotion recognition, and audio classification, and analyzes their adaptation strategies for special scenarios such as low-resource environments and noisy conditions. Finally, the paper evaluates the strengths and weaknesses of each model across three dimensions and puts forward scenario-specific selection recommendations. A key finding is that the three architectures are complementary: lightweight LSTMs suit edge computing in resource-constrained environments; Transformers excel at high-precision, large-scale tasks thanks to their superior global feature capture; and TCNs are well suited to tasks demanding local feature sensitivity and real-time processing. This work offers a comprehensive reference for both auditory research and engineering practice.
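The three core mechanisms named in the abstract can be sketched in a few lines of NumPy. This is an illustrative, untrained toy under assumed shapes and names (`lstm_step`, `self_attention`, `causal_dilated_conv` are hypothetical helpers, not code from the works surveyed), meant only to make the contrast between gating, global attention, and causal convolution concrete:

```python
import numpy as np

# --- LSTM cell step: gates decide what is forgotten, written, and emitted ---
def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; W, U, b stack the input/forget/cell/output gates."""
    z = W @ x + U @ h + b                       # pre-activations, shape (4H,)
    H = h.shape[0]
    i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    c_new = sig(f) * c + sig(i) * np.tanh(g)    # forget old state, write new
    h_new = sig(o) * np.tanh(c_new)             # output gate exposes the state
    return h_new, c_new

# --- Scaled dot-product self-attention: every frame attends to all frames ---
def self_attention(X, Wq, Wk, Wv):
    """X is (T, D); returns (T, D_v) context vectors (single head, no mask)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])      # global pairwise similarities
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)           # softmax over the time axis
    return A @ V

# --- Causal dilated 1-D convolution: the TCN building block ---
def causal_dilated_conv(x, w, dilation):
    """y[t] depends only on x[t], x[t-d], x[t-2d], ... (no future leakage)."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])     # left-pad to preserve causality
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])
```

The sketch highlights the trade-offs the review discusses: the LSTM processes one step at a time through a recurrent state, self-attention compares every frame with every other frame in one shot, and the dilated causal convolution reaches far back in time at fixed cost while never looking ahead.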
Copyright (c) 2026 Frontiers in Computing and Intelligent Systems

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

