A Survey of Deep Learning-Based Facial Expression Recognition Research

Abstract: Facial expression is one of the principal ways people convey emotion. Deep learning is used to analyze facial expressions in order to understand people's true feelings, and this capability is increasingly integrated into human-computer interaction. However, in natural real-world environments with various sources of interference (such as lighting, age, and ethnicity), facial expression recognition faces many challenges. In recent years, with the development of artificial intelligence, scholars have increasingly studied facial expression recognition under such interference, which has advanced both theoretical research and practical applications. Facial expression recognition identifies facial expressions in order to carry out emotion analysis; emotion analysis itself can draw on facial expressions, speech, text, video, and other signals, so facial expression recognition can be regarded as one research direction within emotion analysis. This paper summarizes the field from the perspective of facial expression recognition. In practice, researchers often combine multiple modalities of information, such as voice, text, images, and video, for analysis. Given the differences between single-modal and multi-modal data sets, this paper analyzes static facial expression recognition, dynamic facial expression recognition, and multi-modal fusion. The research has a wide range of applications, including smart elderly care, medical research, and fatigue-driving detection.


Introduction
In recent years, facial expression recognition has been one of the important research topics in computer vision: by recognizing and analyzing various facial expressions, a person's emotional state can be determined. This not only helps us better understand human communication, but also expands the scope of human-computer interaction to include more of people's emotional elements. Facial expression recognition originated from psychological research. In 1872, Darwin [1] first proposed that human expressions evolved from the expressive features of animals, and elaborated on the correlation between humans and animals. Since then, expression recognition began to rise as a research topic, and research on expression continues to this day. In the 1970s, Ekman and Friesen [2] defined six basic human expressions through research, namely happiness, surprise, sadness, fear, disgust, and anger, and first proposed the Facial Action Coding System, which later underpinned efforts to recognize facial expressions by computer. In recent years, with the rapid development of artificial intelligence and deep learning, more and more experts and scholars have turned their attention to facial expression recognition.
The advent of the data era has driven explosive growth in multimodal data, and many researchers have built multimodal data sets to provide data support and a research basis for future experiments. Multimodal emotional information can be obtained from different modes of emotional expression, including video, voice, text, body posture, gait, and facial expression. Multimodal emotion recognition uses cues from the various modes in the same data segment to identify emotions, exploiting the complementarity between modes to eliminate ambiguity in emotion recognition [3]. Early expression recognition mainly focused on single-mode data. Although results have steadily improved, the emotion information contained in a single mode is not comprehensive, and the complex emotions that humans express cannot be accurately recognized. With the development of multimedia, people share daily life on multimedia platforms and express complex emotions through rich channels such as images, videos, and text.
Cai et al. [4] combined speech and facial expression features: they used a CNN and an LSTM to learn the emotional features of speech, designed multiple small-kernel convolution blocks to extract facial expression features, and finally used a DNN to fuse the two. Li et al. [5] used several multimodal fusion strategies to combine acoustic, visual, and textual features. Mittal et al. [6] combined cues from multiple simultaneous modes such as face, text, and speech. Fusing the information of multiple modes can thus supplement missing information, improve the accuracy of prediction results, and improve the robustness of prediction models, making the final results more reliable.
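To make this bimodal design concrete, the following is a minimal PyTorch sketch in the spirit of the architecture described by Cai et al. [4]: a CNN+LSTM branch for speech, a small-kernel convolution branch for faces, and a DNN head that fuses the two. All layer sizes and input shapes here are illustrative assumptions, not the authors' actual configuration.

```python
# Hedged sketch of a CNN+LSTM speech branch, small-kernel face branch,
# and DNN fusion head. Sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class SpeechBranch(nn.Module):
    """CNN over a mel-spectrogram, then an LSTM over the time axis."""
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(64 * (n_mels // 4), hidden, batch_first=True)

    def forward(self, spec):                       # spec: (B, 1, n_mels, T)
        f = self.conv(spec)                        # (B, 64, n_mels/4, T/4)
        f = f.permute(0, 3, 1, 2).flatten(2)       # (B, T/4, 64 * n_mels/4)
        _, (h, _) = self.lstm(f)
        return h[-1]                               # (B, hidden)

class FaceBranch(nn.Module):
    """Stacked small-kernel (3x3) convolution blocks for face images."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, img):                        # img: (B, 3, H, W)
        return self.fc(self.conv(img).flatten(1))  # (B, out_dim)

class BimodalFER(nn.Module):
    """DNN fusion head over concatenated speech and face features."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.speech, self.face = SpeechBranch(), FaceBranch()
        self.head = nn.Sequential(
            nn.Linear(128 + 128, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, spec, img):
        return self.head(torch.cat([self.speech(spec), self.face(img)], dim=1))

model = BimodalFER()
logits = model(torch.randn(2, 1, 64, 100), torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 7])
```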
Facial expression recognition is widely used in many fields, such as medicine, education, games, security, and entertainment. In medicine, for example, doctors could use the technology to help identify what patients are really feeling when they cannot accurately express their pain. In education, teachers can analyze students' facial expressions to determine whether they truly understand what is being taught. In the entertainment industry, facial expression recognition can create more interactive and immersive experiences.
However, facial expression recognition also has its challenges and issues, including accuracy, privacy concerns, and cultural differences. Despite these challenges, facial expression recognition offers great potential to enrich and enhance human-computer interaction.

Single-Mode Facial Expression Recognition
Facial expression recognition based on a single mode classifies emotion through the analysis and recognition of facial expressions alone, relying only on data from the facial-expression modality. First, a data set is collected; the images are then preprocessed; finally, machine learning or deep learning algorithms recognize and determine the category of the facial expression. Through these steps, the emotional information conveyed by facial expressions can be effectively understood and interpreted. Figure 1 shows the flow chart of facial expression recognition based on a single mode.
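The following is a minimal sketch of this three-step pipeline, assuming FER2013-style 48x48 grayscale inputs and the seven basic expression categories; the tiny CNN and the preprocessing choices are illustrative only, and a real system would load trained weights.

```python
# Hedged sketch of the single-mode pipeline in Figure 1:
# collect image -> preprocess -> classify. Untrained toy CNN.
import torch
import torch.nn as nn
from torchvision import transforms
from PIL import Image

EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "neutral"]

preprocess = transforms.Compose([      # step 2: image preprocessing
    transforms.Grayscale(),
    transforms.Resize((48, 48)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),
])

classifier = nn.Sequential(            # step 3: deep model predicts category
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 12 * 12, len(EMOTIONS)),
)

def recognize(path: str) -> str:
    """Run the full collect -> preprocess -> classify pipeline on one image."""
    x = preprocess(Image.open(path)).unsqueeze(0)  # (1, 1, 48, 48)
    with torch.no_grad():
        return EMOTIONS[classifier(x).argmax(dim=1).item()]
```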

Related Data Sets
This paper lists common unimodal expression recognition data sets, with descriptions of their acquisition methods, sample counts, and annotated categories. Some of the listed data sets were collected under laboratory conditions. These data sets are typically small and contain clear frontal face images whose annotations have been repeatedly confirmed by psychology experts, so they are generally considered reliable; CK+ and JAFFE are examples. Other data sets, such as RAF-DB and FER2013, are large-scale collections gathered in uncontrolled environments; their quality is relatively low and is strongly affected by the subjective judgment of the annotators. Common unimodal data sets are shown in Table 1.

Current data sets are limited in both quantity and quality: the data volume is too small to train large deep network structures well. The lack of large-scale facial expression data sets annotated with occlusion types and head poses also limits the ability of deep networks to handle the large variation within classes and to identify facial features efficiently.

Recognition Network
Facial expression recognition methods based on deep learning are applied to both static images and dynamic videos. Because video data sets are relatively scarce and the relationships between frames are difficult to model, few methods train deep learning models on video data sets; most work addresses facial expression recognition in static images. The single-mode facial expression recognition methods are shown in Table 2.
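For dynamic videos, one common design, sketched below under assumed sizes (it is not a specific method from Table 2), applies a shared CNN to each frame and models the relationship between frames with a recurrent layer:

```python
# Hedged sketch of video-based recognition: shared per-frame CNN features
# aggregated over time by a GRU. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VideoFER(nn.Module):
    def __init__(self, n_classes=7, feat=64, hidden=128):
        super().__init__()
        self.frame_cnn = nn.Sequential(    # shared across all frames
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, feat, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.gru = nn.GRU(feat, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, clip):               # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        f = self.frame_cnn(clip.flatten(0, 1)).view(b, t, -1)  # (B, T, feat)
        _, h = self.gru(f)                 # h: (1, B, hidden)
        return self.fc(h[-1])              # classify the whole sequence

clip = torch.randn(2, 16, 3, 64, 64)       # two 16-frame clips
print(VideoFER()(clip).shape)              # torch.Size([2, 7])
```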
Despite the exploration and research into these methods, the accuracy of facial expression recognition based on a single mode is generally low; the field is still at the research stage and cannot yet be widely applied in real life. Recognition methods should therefore continue to be studied and improved, with continual optimization of accuracy, so that they can be better applied to practical scenarios.

Multimodal Facial Expression Recognition
Multi-modal facial expression recognition refers to the use of multiple information sources, or multiple modalities of data, for facial expression recognition and classification. Facial expression recognition based on a single mode uses only the information of the visual modality of facial expression. However, facial expressions are often accompanied by other modes of information, such as text, voice, movement, and EEG signals. Multi-modal facial expression recognition therefore exploits the relationships between the information of different modes to improve recognition accuracy and enhance robustness.
The methods of multimodal facial expression recognition involve two main aspects: modal fusion and multimodal learning. Modal fusion combines the features of different modes, and can be feature-level fusion, decision-level fusion, or a hybrid of the two. Multimodal learning synthesizes the information of different modes by means of joint modeling or joint training.
Using multiple modalities allows a more comprehensive understanding of a person's emotional state and provides more accurate recognition results. Although multimodal facial expression recognition faces challenges such as data set acquisition, modal fusion, and model optimization, it has broad application prospects and can provide more possibilities and solutions for sentiment analysis, human-computer interaction, virtual reality, and other fields. Figure 2 shows the flow chart of facial expression recognition based on multiple modes.

Related Data Sets
Several common multimodal data sets are summarized below. Most of them contain expression pictures or videos, supplemented by one or more modes such as audio, text, and EEG signals to improve recognition accuracy. Collection settings include laboratory collection, talk shows, news videos, natural-environment recordings, and film and television clips. The abbreviations used for the data modes are: video (V), physiological signal (PS), audio (A), text (T), body movement (BM), facial movement (FM), image (I), and electroencephalogram (E). Common multimodal data sets are shown in Table 3.

Recognition Network
There are many multi-modal facial expression recognition methods, and the most important aspect is fusion between the modes. Modal fusion is divided into three approaches: feature fusion, decision fusion, and hybrid fusion. Effective multi-modal fusion allows information to be shared and complemented across different modes, thereby improving on the accuracy of single-mode recognition.
Feature fusion integrates the corresponding features obtained from different modal data through feature extraction; this approach can effectively learn the correlation and complementarity between different modal features. Decision fusion takes the outputs of deep learning models trained on the individual modal features and uses them as the input of a second-stage combination model. Hybrid fusion combines feature-level and decision-level fusion; the resulting network model is difficult to build, and although it combines the advantages of feature fusion and decision fusion, it performs relatively poorly in practical applications. Multi-modal recognition remains challenging and innovative, and it will be widely used in the future. The current facial expression recognition methods applied in bimodal and multi-modal settings are shown in Table 4.
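The following minimal sketch contrasts feature-level and decision-level fusion using generic per-modality encoders; all module names, feature dimensions, and the averaging rule in the decision-level head are illustrative assumptions rather than any specific published method.

```python
# Hedged sketch of the two basic fusion strategies over toy features.
import torch
import torch.nn as nn

def make_encoder(in_dim, out_dim=64):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                         nn.Linear(128, out_dim))

class FeatureFusion(nn.Module):
    """Feature-level: concatenate modality features, then classify once."""
    def __init__(self, dims=(40, 512), n_classes=7):
        super().__init__()
        self.encoders = nn.ModuleList(make_encoder(d) for d in dims)
        self.head = nn.Linear(64 * len(dims), n_classes)

    def forward(self, *modalities):
        feats = [enc(x) for enc, x in zip(self.encoders, modalities)]
        return self.head(torch.cat(feats, dim=1))

class DecisionFusion(nn.Module):
    """Decision-level: classify each modality separately, then combine
    the per-modality predictions (here, a simple average of logits)."""
    def __init__(self, dims=(40, 512), n_classes=7):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(make_encoder(d), nn.Linear(64, n_classes))
            for d in dims)

    def forward(self, *modalities):
        logits = [b(x) for b, x in zip(self.branches, modalities)]
        return torch.stack(logits).mean(dim=0)

audio, face = torch.randn(2, 40), torch.randn(2, 512)  # toy features
print(FeatureFusion()(audio, face).shape)   # torch.Size([2, 7])
print(DecisionFusion()(audio, face).shape)  # torch.Size([2, 7])
```

Feature-level fusion lets the classifier learn cross-modal correlations directly, while decision-level fusion keeps the branches independent, which is more tolerant of a missing or noisy modality.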

Conclusion
With the continuous improvement of computing power, deep learning networks, and fusion algorithms, expression recognition based on multi-modal data will develop rapidly, but some shortcomings remain: (1) Multi-modal facial expression data sets are seriously insufficient, and the distribution of data categories is unbalanced. The amount of data for each expression in existing databases is relatively small, and the expressions are largely posed rather than natural; they differ from expressions in natural situations, making them difficult to use as accurate and effective data, and dynamic sequence images are seriously lacking. The category distribution is also skewed: happiness has a high recognition rate, while anger and contempt have low recognition rates.
(2) Research settings are mostly laboratories, lacking training in real situations. Most research on expression recognition is carried out under ideal conditions. In the natural environment, however, objects and faces may be occluded, brightness varies at different times, and other environmental factors intrude, all of which have a large impact on facial expression recognition results and ultimately cause practical results to differ from experimental ones.
(3) There are differences among the faces of different ethnic groups. Each person's ethnicity, age, upbringing, and other factors affect recognition accuracy, and the differing habits of people from different ethnic groups make it difficult to classify faces with a unified model, increasing the difficulty of recognition.
(4) Problems remain in optimizing effective fusion methods between modes. Existing methods cannot integrate the information between multiple modes well, nor analyze, from multiple angles and aspects, which category a given facial expression is closest to.

Figure 1. Single-mode facial expression recognition flow chart

Figure 2. Multimodal facial expression recognition flow chart

Table 1. Common unimodal data sets

Table 2. Single-mode facial expression recognition methods

Table 3. Common multimodal data sets

Table 4. Multimodal facial expression recognition methods