Multimodal Speech Emotion Recognition via Transformer-Based Hybrid Fusion and Dual Cross-entropy Techniques

Zhixing Yu

doi:10.54097/6pzrwp08

Keywords:

Speech emotion recognition, deep learning, Transformer.

Abstract

Speech emotion recognition (SER) is attracting growing academic interest as machine intelligence advances in the service industries. Previous research has validated the efficacy of multimodality in SER, yet most studies have focused on one-time fusion techniques. This paper proposes a hybrid fusion architecture that combines the advantages of multiple fusion techniques and modalities. The model is based predominantly on the Transformer architecture. The study also extends the classic cross-entropy loss into a novel loss function that differentiates between misprediction patterns. The architecture is evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) with thorough cross-validation. It reaches 89.7% accuracy and outperforms state-of-the-art (SOTA) methods. The proposed loss function raises accuracy further, to 91.1%. In addition, the models scale well computationally and need little hyperparameter tuning. The article concludes that more comprehensive fusion techniques are worth exploring for multimodal speech emotion recognition, and that Transformers are well suited to capturing emotional characteristics and to leading the classification process.
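
The abstract does not give the form of the proposed loss, so the following is only an illustrative sketch of how a cross-entropy loss could be extended to treat misprediction patterns differently. It is not the paper's formulation. The function name `pattern_weighted_ce`, the cost matrix `cost`, the weight `lam`, and the three-class setup are all assumptions made for this example. The idea shown is a standard cross-entropy term plus a second cross-entropy-style term that penalizes probability placed on wrong classes, scaled by how costly each specific confusion is taken to be.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pattern_weighted_ce(logits, target, cost, lam=1.0, eps=1e-12):
    """Hypothetical two-term loss (an assumption, not the paper's loss).

    Term 1: ordinary cross-entropy on the true class.
    Term 2: for every wrong class k, -log(1 - p_k) weighted by
            cost[target][k], so some confusions cost more than others.
    """
    probs = softmax(logits)
    ce = -math.log(probs[target] + eps)
    penalty = sum(
        cost[target][k] * -math.log(1.0 - probs[k] + eps)
        for k in range(len(probs))
        if k != target
    )
    return ce + lam * penalty


# Hypothetical 3-class cost matrix: confusing class 0 with class 1
# is treated as worse than confusing class 0 with class 2.
cost = [
    [0.0, 2.0, 0.5],
    [2.0, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]

loss_costly = pattern_weighted_ce([0.0, 2.0, 0.0], target=0, cost=cost)
loss_mild = pattern_weighted_ce([0.0, 0.0, 2.0], target=0, cost=cost)
print(loss_costly > loss_mild)
```

Both example predictions assign the same probability to the true class, so plain cross-entropy cannot tell them apart. The second term separates them according to which wrong class received the probability mass. Setting `lam=0` recovers ordinary cross-entropy.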
