Integrating Multimodal Data for Deep Learning-Based Facial Emotion Recognition

Authors

  • Jialu Li

DOI:

https://doi.org/10.54097/gpy08650

Keywords:

Emotion recognition; convolutional neural networks; multilayer perceptron; model fusion.

Abstract

With the rapid development of neural networks, emotion recognition has become a research area of considerable interest. It has important applications in marketing and human-computer interaction, and it is also significant for advancing affective computing and improving user experience. This paper studies various methods for emotion recognition in images and videos, using convolutional neural networks (CNNs), multilayer perceptrons (MLPs), and fusion models. The Facial Expression Recognition 2013 (FER2013) image dataset and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) audio-visual dataset serve as the basis for this study. The experimental results indicate that ResNet18 outperforms the other image models in image emotion recognition, which is attributed to its residual-block design and the use of regularization techniques. For video emotion recognition, the MLP-based audio model demonstrates a superior ability to identify emotional information. Although fusing the image and audio models could in theory improve accuracy, the randomness of the sampled video frames prevents the fusion model from achieving the desired effect. Future research might explore the application of time-series models to video emotion recognition to capture continuous emotional changes within videos.
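The paper's full implementation is not shown on this page, so the following is a minimal, illustrative PyTorch sketch of the fusion idea described in the abstract: a ResNet18 image branch, an MLP audio branch over a fixed-length feature vector, and late fusion by averaging class probabilities. The module names, the 40-dimensional audio feature size, and the averaging rule are assumptions for illustration, not the paper's exact design.

# Minimal sketch of the image/audio fusion idea from the abstract.
# All module names, feature dimensions, and the averaging-based fusion
# rule are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
from torchvision.models import resnet18

NUM_EMOTIONS = 7  # e.g., the seven FER2013 expression classes

class AudioMLP(nn.Module):
    """MLP over a fixed-length audio feature vector (assumed: 40 dims)."""
    def __init__(self, in_dim=40, hidden=128, num_classes=NUM_EMOTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)  # class logits

class LateFusion(nn.Module):
    """Average the class probabilities of the image and audio branches."""
    def __init__(self, num_classes=NUM_EMOTIONS):
        super().__init__()
        self.image_branch = resnet18(num_classes=num_classes)
        self.audio_branch = AudioMLP(num_classes=num_classes)

    def forward(self, frames, audio_feats):
        p_img = torch.softmax(self.image_branch(frames), dim=1)
        p_aud = torch.softmax(self.audio_branch(audio_feats), dim=1)
        return (p_img + p_aud) / 2  # fused class probabilities

model = LateFusion()
frames = torch.randn(4, 3, 224, 224)  # batch of sampled video frames
audio = torch.randn(4, 40)            # batch of audio feature vectors
probs = model(frames, audio)          # shape: (4, NUM_EMOTIONS)

For a full video, the image branch's probabilities would typically be averaged over several sampled frames before fusion; as the abstract notes, the randomness of those frames is what limited the fusion model's gains in this study.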

Published

18-02-2025

How to Cite

Li, J. (2025). Integrating Multimodal Data for Deep Learning-Based Facial Emotion Recognition. Highlights in Science, Engineering and Technology, 124, 362-367. https://doi.org/10.54097/gpy08650