Research on Sketch Face Headshot Generation Based on Improved CycleGAN

Abstract: Sketch headshots generated from realistic headshots still suffer from problems such as blurred contours and missing textures. To address this, this work proposes a sketch headshot generation method based on CycleGAN. Firstly, a self-attention mechanism in the form of a Squeeze-and-Excitation Networks (SENet) module is added to the UNet self-encoder; secondly, the base model is transformed into a supervised learning model so that constraints can be imposed between the generated avatars and the real avatars. The experimental results show that the sketched avatars generated by the proposed method have better visual quality on the CUHK student test set, with an SSIM value 0.0274 higher than that of the base model.


Introduction
Sketch avatar generation is the task of producing a corresponding sketch avatar from a given real avatar. With the rapid development of image generation technology, sketch avatars have been widely used in the field of digital entertainment. Sketched avatars used as personal profile pictures are favored by more and more Internet users, and social networking software that converts real avatars into sketch style is likewise welcomed by many users.
At present, there are two main approaches to sketch avatar generation: model-driven and data-driven. Model-driven methods are mainly based on Bayesian learning [1] and multivariate output regression [2]. Wang and Tang [3] proposed a data-driven approach: a Markov random field model based on probabilistic graphs, together with a synthesis method based on subspace learning using locally linear embedding [4]. Compared with an artist's hand-drawn work, the sketched headshots generated by both model-driven and data-driven approaches fail to capture fine head details, so the generated sketches are not similar enough to the real faces, while at the same time being excessively smooth and lacking the sketch art style [5].
In 2014, Goodfellow et al. [6] proposed the Generative Adversarial Network (GAN), which has achieved great success in the field of image generation thanks to its powerful generative capability and has become one of the most rapidly developing directions in deep neural networks [7]. GAN and its variants have achieved good results in image generation [8] and other fields, compensating for the shortcomings of traditional methods. Pix2Pix [9] has achieved good results in image generation, but the images it generates tend to be blurred, because Pix2Pix is a single-network conversion structure that cannot guarantee the structural consistency of images before and after conversion. CycleGAN is a generative adversarial network model proposed by Zhu et al. [10] that contains a cyclic reconstruction process between two generative adversarial networks. Compared with other models, CycleGAN can map images from one domain to another and then map the synthesized images back; this dual mapping structure preserves the structure of the generated images well. CycleGAN is an unsupervised learning model that uses a cycle consistency loss to constrain the correlation between the generated and input images.
The literature [11] notes that when the cycle consistency loss is used, the generated images suffer from feature hiding. The UNet [12] self-encoder consists of an encoder and a decoder, and the skip connections between corresponding layers of the encoder and decoder can greatly improve the quality of the generated images. Therefore, in this paper, we propose a sketch headshot generation method based on CycleGAN and the UNet self-encoder. Firstly, a self-attention mechanism in the form of a Squeeze-and-Excitation Networks (SENet) [13] module is embedded in the UNet self-encoder to improve the model's feature extraction ability. Secondly, the model is converted into a supervised learning model so that an L1 constraint can be added between the generated avatars and the real avatars. The overall network framework of the model is shown in Fig. 1. For later description, the set of realistic avatars is denoted the X domain and the set of sketched avatars the Y domain. Let the distribution of face avatars in domain X be P(x) and that of face avatars in domain Y be Q(y). The whole network consists of two generators and two discriminators. The generator G represents the mapping from X-domain images to Y-domain images, and the generator F represents the mapping from Y-domain images to X-domain images. The discriminator D_X distinguishes real X-domain images from translated images F(y); the discriminator D_Y distinguishes real Y-domain images from translated images G(x).

Overall Network Framework Diagram
The conversion of an image from the X domain to the Y domain is explained as an example. A cycle consistency loss is introduced for an image x in the X domain to ensure that it remains correlated with x after conversion to the Y domain. Firstly, the image is converted from the X domain to the Y domain by the generator G, i.e., x is converted to G(x). Then, G(x) is converted back to the X domain by the generator F, i.e., G(x) is converted to F(G(x)). After the two conversions, the image x should satisfy F(G(x)) ≈ x. The network structure embedding the SENet module in the UNet self-encoder is shown in Fig. 2. The UNet self-encoder compressively encodes the input image, during which the feature map shrinks continuously. When the feature map reaches its minimum size, the SENet module is embedded to enhance the model's feature extraction ability; the feature map is then upsampled to achieve the cross-domain conversion of the image.
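The bottleneck recalibration described above can be sketched as a standard SE block in PyTorch. The `reduction` ratio and the bottleneck channel count below are illustrative assumptions; the paper does not report them.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block, a minimal sketch of the SENet module
    embedded at the UNet bottleneck. reduction=16 is an assumed value."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global spatial average
        self.fc = nn.Sequential(                 # excitation: channel-wise gating
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)              # (B, C) channel statistics
        w = self.fc(w).view(b, c, 1, 1)          # per-channel weights in (0, 1)
        return x * w                             # recalibrate the feature maps

# Applied where the feature map is smallest, before upsampling begins
# (256 channels at 16x16 is a hypothetical bottleneck size):
feat = torch.randn(1, 256, 16, 16)
out = SEBlock(256)(feat)
print(out.shape)                                 # torch.Size([1, 256, 16, 16])
```

The block leaves the tensor shape unchanged, so it can be dropped into the bottleneck without altering the rest of the UNet.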

The Model is Converted into a Supervised Learning Model
CycleGAN constrains the correlation between the generated image and the input image through a cycle consistency loss. The literature [14] notes that using the cycle consistency loss leads to feature-hiding problems; for this reason, the model is converted here into a supervised learning model to improve the quality of the generated avatars.

Image Space Constraint for Generated Avatar and Real Avatar

To alleviate the feature hiding of generated images brought about by the cycle consistency constraint, an image-space L1 constraint is added between the generated avatars and the real avatars. The L1 constraint is calculated as

L_{L1} = \lambda_{xy}\,\mathbb{E}_{x,y}\big[\lVert G(x)-y\rVert_1\big] + \lambda_{yx}\,\mathbb{E}_{x,y}\big[\lVert F(y)-x\rVert_1\big]

where λ_xy and λ_yx are the weighting coefficients.

Adversarial Loss

Using the least-squares loss as the adversarial loss of the GAN leads to more stable model training. The GAN loss function of the model is therefore

L_{GAN}(G, D_Y) = \mathbb{E}_{y \sim Q(y)}\big[(D_Y(y)-1)^2\big] + \mathbb{E}_{x \sim P(x)}\big[D_Y(G(x))^2\big]

and symmetrically for the mapping F with discriminator D_X:

L_{GAN}(F, D_X) = \mathbb{E}_{x \sim P(x)}\big[(D_X(x)-1)^2\big] + \mathbb{E}_{y \sim Q(y)}\big[D_X(F(y))^2\big]
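As a minimal sketch, the least-squares adversarial terms can be written as two small functions, assuming the discriminators output raw scores that are pushed toward 1 for real images and 0 for fakes:

```python
import torch

def d_loss_lsgan(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Discriminator's least-squares loss: real scores toward 1, fake toward 0."""
    return ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()

def g_loss_lsgan(d_fake: torch.Tensor) -> torch.Tensor:
    """Generator's least-squares loss: fake scores toward 1 (fooling D)."""
    return ((d_fake - 1) ** 2).mean()

# A discriminator that scores real as 1 and fake as 0 incurs zero loss:
real_scores = torch.ones(4, 1)
fake_scores = torch.zeros(4, 1)
print(d_loss_lsgan(real_scores, fake_scores).item())   # 0.0
```

Unlike the original GAN's log-loss, the quadratic penalty keeps gradients non-vanishing for samples far from the decision boundary, which is the stability property the paper relies on.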

Cyclic Consistency Loss
Each face avatar x_i in the X domain, after being mapped by the generator G and then by the generator F, should be as consistent as possible with the original image x_i; similarly for each face avatar y_i in the Y domain. The cycle consistency loss is calculated as

L_{cyc} = \lambda_{cyc}\,\mathbb{E}_{x \sim P(x)}\big[\lVert F(G(x))-x\rVert_1\big] + \lambda_{cxc}\,\mathbb{E}_{y \sim Q(y)}\big[\lVert G(F(y))-y\rVert_1\big]

where λ_cyc and λ_cxc are the weighting coefficients.

Identity Mapping Loss

Each face avatar x_i in the X domain should be as consistent as possible with x_i after being transformed by the generator F; similarly for each face avatar y_i in the Y domain under the generator G. The identity mapping loss is calculated as

L_{id} = \lambda_{yc}\,\mathbb{E}_{x \sim P(x)}\big[\lVert F(x)-x\rVert_1\big] + \lambda_{xc}\,\mathbb{E}_{y \sim Q(y)}\big[\lVert G(y)-y\rVert_1\big]

where λ_yc and λ_xc are the weighting coefficients.

The Total Loss Function of the Improved Model
In summary, the total loss function of the improved model is

L_{total} = L_{GAN}(G, D_Y) + L_{GAN}(F, D_X) + L_{cyc} + L_{id} + L_{L1}
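Putting the pieces together, a sketch of the combined generator-side objective might look as follows. The weight defaults mirror the values given later in the experimental settings, but the function signature and the exact grouping of terms are assumptions for illustration:

```python
import torch
import torch.nn.functional as F_nn

def total_generator_loss(x, y, G, F, D_X, D_Y,
                         lam_cyc=10.0, lam_cxc=10.0,
                         lam_yc=0.5, lam_xc=0.5,
                         lam_xy=2.0, lam_yx=2.0):
    """Sketch of the improved model's generator objective.
    G: X -> Y, F: Y -> X; (x, y) is a paired photo/sketch sample."""
    fake_y, fake_x = G(x), F(y)
    # Least-squares adversarial terms (generator side).
    adv = ((D_Y(fake_y) - 1) ** 2).mean() + ((D_X(fake_x) - 1) ** 2).mean()
    # Cycle-consistency terms: F(G(x)) should recover x, G(F(y)) recover y.
    cyc = lam_cyc * F_nn.l1_loss(F(fake_y), x) + lam_cxc * F_nn.l1_loss(G(fake_x), y)
    # Identity mapping terms: F(x) should leave x unchanged, G(y) leave y unchanged.
    idt = lam_yc * F_nn.l1_loss(F(x), x) + lam_xc * F_nn.l1_loss(G(y), y)
    # Paired L1 terms between generated and real avatars (the added supervision).
    sup = lam_xy * F_nn.l1_loss(fake_y, y) + lam_yx * F_nn.l1_loss(fake_x, x)
    return adv + cyc + idt + sup
```

With perfect generators on a perfectly paired sample, every L1 term vanishes and only the adversarial terms remain, which matches the intuition that the supervision tightens, rather than replaces, the CycleGAN objective.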

Experimental Setup, Results and Analysis
To verify the effectiveness of the proposed method, comparative experiments with different models are conducted on the CUHK student face dataset.

Data Set Introduction
The CUHK student dataset contains 188 face photo-sketch pairs; 88 were selected as the training set and 100 as the test set. The sketch images were drawn by artists from real face photographs.

Evaluation Metrics
Structural Similarity (SSIM) [15] is a metric measuring the similarity of two images. The larger the SSIM value, the closer the generated image is to the real image.
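As an illustration of the metric, a simplified single-window SSIM can be computed directly from the means, variances, and covariance in the formula; note that the standard metric instead averages this quantity over local Gaussian-weighted windows:

```python
import numpy as np

def ssim_global(a: np.ndarray, b: np.ndarray, data_range: float = 255.0) -> float:
    """Single-window SSIM sketch: compares luminance, contrast, and structure
    over the whole image instead of over local 11x11 Gaussian windows."""
    c1 = (0.01 * data_range) ** 2        # stabilizing constants from the
    c2 = (0.03 * data_range) ** 2        # original SSIM definition
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2)) /
                 ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)))

img = np.random.rand(64, 64) * 255
print(ssim_global(img, img))             # ≈ 1.0 for identical images
```

The value is bounded above by 1, attained only when the two images agree in mean, variance, and structure, which is why a higher SSIM indicates a generated sketch closer to the ground truth.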

Experimental Environment Setup
The experiments are conducted under Windows 10; the GPU is an NVIDIA GeForce RTX 3060 with 12 GB of video memory, the CPU is an Intel(R) Core(TM) i7-4770, and the PyTorch deep learning framework is used.

Experimental Parameter Settings
For training on the CUHK student face dataset, the Adam optimizer with momentum 0.5 is selected, the initial learning rate of both the generator and the discriminator is set to 0.0002, the learning rate is dynamically adjusted with the MultiStepLR scheduler, the batch size is set to 1, and the number of iterations is 200.
In the CUHK dataset training, λ_cyc and λ_cxc are both set to 10, λ_yc and λ_xc are both set to 0.5, and λ_xy and λ_yx are set to 2.
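A sketch of the corresponding optimizer setup in PyTorch follows. The MultiStepLR milestones are illustrative assumptions, since the paper does not report them, and the one-layer network stands in for the actual generator:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

# Hypothetical stand-in for the paper's UNet generator.
net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Adam with momentum (beta1) 0.5 and initial learning rate 2e-4, as in the paper.
opt = torch.optim.Adam(net.parameters(), lr=2e-4, betas=(0.5, 0.999))

# MultiStepLR decays the learning rate by gamma at each milestone epoch;
# the milestones [100, 150] are assumed, not taken from the paper.
sched = MultiStepLR(opt, milestones=[100, 150], gamma=0.1)

for epoch in range(200):              # 200 iterations, batch size 1 per the paper
    # ... per-batch forward / backward / opt.step() would go here ...
    sched.step()

print(opt.param_groups[0]["lr"])      # learning rate after both decays
```

The same optimizer configuration would be duplicated for the discriminators, which the paper states share the same initial learning rate.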

Analysis of Experimental Results
In this paper, comparison experiments were conducted against mainstream image generation methods (LLE [16], Pix2Pix, CycleGAN, and ComboGAN [17]). The sketched face avatars generated by each model on the CUHK student face test set are shown in Fig. 3, and the SSIM values are listed in Table 1.
Table 1 shows that the SSIM of the sketched avatars generated by the proposed method is higher than that of all other models. Fig. 3 shows that the sketch face headshots generated by the proposed method are clearer, more distinctive, and richer in facial detail than those generated by the other models, giving a better subjective impression.

Conclusion
In this paper, a sketch headshot generation method based on an improved CycleGAN is proposed. Firstly, a SENet module is embedded in the UNet self-encoder to enhance the model's feature extraction ability. Secondly, in order to further improve the quality of the generated sketch face avatars, the model is converted into a supervised learning model by adding an L1 constraint between the generated avatars and the real avatars. The experimental results show that the sketch avatars generated by the proposed method are better than those of the base model in both subjective perception and objective evaluation metrics.