Performance Optimization and Interpretability Analysis of Lightweight Transformer Models for Speech Emotion Recognition Using Open-Source Emotional Speech Datasets

Authors

  • Shaoyang Zhang

DOI:

https://doi.org/10.54097/1y0ms617

Keywords:

Speech emotion recognition, Multi-corpus labelling, Residual CNN, Squeeze-and-excitation attention, Data augmentation.

Abstract

Accurate speech emotion recognition (SER) remains challenging across speakers, languages, and recording conditions, especially under tight compute budgets. We present a lightweight three-class SER framework that unifies acted corpora and pairs frozen self-supervised speech embeddings with a compact residual CNN. A Multi-Corpus Unified Labelling Protocol (MCULP) harmonises CREMA-D, RAVDESS, and EmoDB into a balanced three-class taxonomy (negative, sad_fear, pos_neutral), yielding 7,618 utterances with an 80/10/10 speaker-independent split. Our 3.9 M-parameter Residual Squeeze-and-Excitation 1-D CNN (ResAttn1D-CNN) stacks five residual blocks with channel attention on top of 768×400 wav2vec 2.0-base embeddings. A Tri-Aug pipeline (Gaussian noise, random crop-pad, and SpecAugment-style temporal masking) improves robustness. Trained with AdamW and mixed precision, the model converges in about 10 hours. On the held-out test set, it reaches 62.9% accuracy and 0.627 macro-F1, outperforming a strong three-layer CNN baseline by 20.9 points and exceeding prior CNN results on comparable tasks. Ablations attribute +4.9 pp to channel attention, +2.9 pp to Tri-Aug, and +3.2 pp to depth; the confusion matrix shows residual ambiguity between low-arousal negative and neutral speech. Inference runs in real time (<3.5 ms per utterance, <15 MB of GPU memory), which makes edge deployment feasible. Code, manifests, and Docker recipes will be released for reproducibility and benchmarking.
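The Tri-Aug pipeline named in the abstract can be illustrated with a minimal NumPy sketch. This is not the released code: the function names, the noise standard deviation, the mask width and count, zero-padding, and the order in which the three steps are applied are all assumptions made for illustration. Only the three augmentation types and the 768×400 embedding shape come from the abstract.

```python
import numpy as np

TARGET_FRAMES = 400  # time dimension of the 768x400 embeddings in the abstract


def add_gaussian_noise(x, rng, std=0.01):
    """Add zero-mean Gaussian noise (std is an assumed value)."""
    return x + rng.normal(0.0, std, size=x.shape).astype(x.dtype)


def random_crop_pad(x, rng, target=TARGET_FRAMES):
    """Crop a random window of `target` frames, or zero-pad at a random offset."""
    dims, frames = x.shape
    if frames >= target:
        start = int(rng.integers(0, frames - target + 1))
        return x[:, start:start + target]
    out = np.zeros((dims, target), dtype=x.dtype)
    offset = int(rng.integers(0, target - frames + 1))
    out[:, offset:offset + frames] = x
    return out


def time_mask(x, rng, max_width=40, num_masks=2):
    """Zero out random spans of frames, in the style of SpecAugment time masking.

    max_width and num_masks are assumed values.
    """
    out = x.copy()
    frames = out.shape[1]
    for _ in range(num_masks):
        width = min(int(rng.integers(0, max_width + 1)), frames)
        start = int(rng.integers(0, frames - width + 1))
        out[:, start:start + width] = 0.0
    return out


def tri_aug(x, rng):
    """Apply noise, crop-pad, then time masking to one (768, T) embedding.

    The order follows the listing in the abstract and is an assumption.
    """
    x = add_gaussian_noise(x, rng)
    x = random_crop_pad(x, rng)
    return time_mask(x, rng)


rng = np.random.default_rng(0)
long_emb = rng.standard_normal((768, 523)).astype(np.float32)   # longer than 400 frames
short_emb = rng.standard_normal((768, 120)).astype(np.float32)  # shorter than 400 frames
aug_long = tri_aug(long_emb, rng)
aug_short = tri_aug(short_emb, rng)
print(aug_long.shape, aug_short.shape)
```

Both outputs have the fixed 768×400 shape the classifier expects, regardless of the input length. In this sketch the augmentations operate on the frozen embeddings rather than on the raw waveform; the abstract does not state which of the two the authors use.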

Published

13-03-2026

Section

Articles

How to Cite

Zhang, S. (2026). Performance Optimization and Interpretability Analysis of Lightweight Transformer Models for Speech Emotion Recognition Using Open-Source Emotional Speech Datasets. Academic Journal of Science and Technology, 19(3), 104-112. https://doi.org/10.54097/1y0ms617