Evaluation of LSTM, Transformer and TCN in the Field of Auditory Research

Authors

  • Ran Xue, Department of Electrical Engineering, Universiti Sains Malaysia, Penang, 14300, Malaysia

DOI:

https://doi.org/10.54097/zjtb2s91

Keywords:

auditory domain; deep learning; LSTM; Transformer; TCN.

Abstract

Accurately capturing the temporal correlations inherent in auditory signals, together with the interdependencies among complex features, remains a core technical challenge in auditory research. This paper presents a systematic review of three mainstream deep learning architectures: Long Short-Term Memory (LSTM) networks, Transformers, and Temporal Convolutional Networks (TCNs). First, it describes the core mechanism of each model: LSTMs rely on gating, Transformers on the self-attention mechanism, and TCNs on causal, dilated convolutions. Second, it surveys their typical applications in core auditory tasks, including speech recognition, speech emotion recognition, and audio classification, and analyzes their adaptation strategies for special scenarios such as low-resource settings and noisy conditions. Finally, it weighs the strengths and weaknesses of each model across three evaluation dimensions and offers scenario-specific selection recommendations. The key finding is that the three architectures are complementary: lightweight LSTMs suit edge deployment in resource-constrained environments; Transformers, with their superior capture of global features, excel at high-precision, large-scale tasks; and TCNs are well suited to tasks demanding local feature sensitivity and real-time processing. This work offers a practical reference for both auditory research and engineering practice.
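As a concrete illustration of the TCN building block named in the abstract, below is a minimal sketch of a causal dilated 1-D convolution in plain Python. This sketch is our own illustration, not code from the paper; the function names and the zero left-padding convention are assumptions made for clarity.

```python
def causal_dilated_conv(x, weights, dilation):
    """Causal dilated 1-D convolution.

    output[t] depends only on x[t], x[t - d], x[t - 2d], ... (d = dilation),
    so no future samples leak into the present. Positions before the start
    of the sequence are treated as zeros (left padding).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            j = t - i * dilation  # tap i looks back i * dilation steps
            if j >= 0:
                acc += w * x[j]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of causal dilated conv layers:
    1 + sum over layers of (kernel_size - 1) * dilation."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf
```

With kernel size 2 and dilations 1, 2, 4, 8, the stack sees 1 + 1 + 2 + 4 + 8 = 16 past time steps; doubling the dilation at each layer is what lets a TCN's receptive field grow exponentially with depth while every output stays causal, which underpins the real-time suitability noted above.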

Published

27-03-2026

How to Cite

Xue, R. (2026). Evaluation of LSTM, Transformer and TCN in the Field of Auditory Research. Frontiers in Computing and Intelligent Systems, 16(1), 164-172. https://doi.org/10.54097/zjtb2s91