Research of Speech Style Transfer Based on Neural Network

Abstract: This paper draws inspiration from neural style transfer, a model for image style transfer, and applies the idea to speech style transfer based on neural networks. First, the article describes how the 2D spectrogram of a speech signal is extracted. Then, a speech style transfer model based on a convolutional neural network is constructed.


Introduction
Voice Conversion (VC) [1] refers to converting the phonetic style features of the Source Speaker to those of the Target Speaker while keeping the semantic information of the Source Speaker unchanged. Speech style transfer can be applied to communication, medical care, entertainment and other fields. In a text-to-speech (TTS) model [2], the synthesized speech may sound more like a real person's voice if it is also processed by speech style transfer. For confidentiality and security, speech style transfer can be used to change the style characteristics of a speaker's voice [3]. In the medical field, speech produced by patients with a damaged larynx can be repaired with the help of speech style transfer [4]. In film dubbing, especially when a film is dubbed into another language, speech style transfer can make the voice style of the dubbing actor match that of the film actor, giving the ideal dubbing effect [5].
These examples show that speech style transfer, as a strongly interdisciplinary subject, has considerable research value and has attracted many researchers. In 1988, Abe's group [6] first proposed speech style transfer based on Vector Quantization (VQ) and codebook mapping. The spectral envelope parameters of the original and stylized speech are divided into a series of codebooks by vector quantization, and style transfer is realized by establishing a mapping between the codebooks. Although this method is simple and easy to implement, the converted voice is of poor quality and sounds discontinuous. In 1992, Savic et al. [7] replaced the codebook mapping in Abe's approach with a neural network, which greatly enhanced the quality of the converted speech. This was the first application of an artificial neural network model to speech style transfer and marked a breakthrough. Subsequently, neural-network-based speech style transfer became the mainstream research direction. In 1995, Narendranath et al. [8] [11] used the generalized regression neural network (GRNN) to transform the style parameters of speech signals. In 2015, Ghorbandoost et al. [12] combined the personality characteristics of two kinds of speech into a new speech personality characteristic, and realized speech style transformation through the classical Gaussian mixture model and an artificial neural network model.
The above references show that neural networks have greatly improved the performance and stability of the generated speech in speech style transfer. However, training a neural network requires data that is difficult to acquire or produce, which hinders the study of speech style transfer. Moreover, even with plenty of data, slow training makes the research considerably harder. Therefore, the innovation of this article is to study a speech style transfer model that uses little training data, or no additional data at all. Inspired by image style transfer, the paper uses a convolutional neural network to extract features from the spectrograms of speech signals, so as to generate the stylized spectrogram and obtain the corresponding stylized speech.

The Spectrogram of Speech Signal
This section introduces some basic knowledge of speech signals and focuses on how the 2D spectrogram of a speech signal is extracted.

The 2D Spectrogram of Speech Signal
Before the speech style transfer experiments, features are usually extracted from the 2D spectrogram of the speech signal. That is, the discernible information contained in a speech signal is obtained through the extracted features. Speech signals are generated by the vocal tract, and the shape of the vocal tract determines, to some extent, what kind of speech is produced. This shape shows up in the envelope of the short-time power spectrum of speech, and the 2D spectrogram is the feature representation that describes this envelope. Next, the spectrum curve is rotated by 90 degrees to obtain the middle graph in figure 2. The amplitudes in the middle graph are then mapped to a range of gray levels, with gray level 255 and gray level 0 represented by the black area and the white area, respectively. In other words, the larger the amplitude, the darker the corresponding area. This gives the right-most graph in figure 2.
The purpose of these steps is to add the time dimension, so that the spectrum of a whole utterance, rather than a single frame, can be displayed. Finally, we obtain a spectrum diagram over time, which is the 2D spectrogram describing the speech signal, as shown in figure 3.
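The extraction described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function name `spectrogram_gray`, the Hann window, the frame length of 256 samples, the hop of 128 samples and the decibel scaling are all choices made here, not parameters given in the paper.

```python
import numpy as np

def spectrogram_gray(signal, frame_len=256, hop=128):
    """Return a (freq_bins, n_frames) uint8 image of the short-time power spectrum.

    frame_len, hop and the Hann window are assumed values, not taken from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames (one row per frame).
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Short-time power spectrum of every frame, expressed in decibels.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_power = 10.0 * np.log10(power + 1e-10)
    # Map amplitudes linearly onto gray levels 0..255; following the text,
    # level 255 marks the largest amplitude (the darkest area when drawn).
    lo, hi = log_power.min(), log_power.max()
    gray = np.round(255.0 * (log_power - lo) / (hi - lo + 1e-12)).astype(np.uint8)
    # Transpose so frequency runs vertically and time runs horizontally.
    return gray.T

# Example input: one second of a 440 Hz tone sampled at 8 kHz.
t = np.arange(8000) / 8000.0
image = spectrogram_gray(np.sin(2 * np.pi * 440.0 * t))
```

Each column of `image` is the spectrum of one frame, and placing the columns side by side is what adds the time dimension.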

Speech Style Transfer Model Based on Neural Network
This paper proposes an innovative solution: a speech style transfer model based on a convolutional neural network.

The Speech Style Transfer Model Based on Convolutional Neural Network
The principle of the image style transfer model is shown roughly in the flow diagram of figure 4. Since the convolutional neural network is good at processing image data, converting the speech signal into image data is the key point and innovation of speech style transfer. As introduced in section 2.1, the spectrogram can be regarded, to some extent, as 2-dimensional image data. Therefore, when the research object is a speech signal, the content image, style image and generated image in figure 4 are replaced by the 2D spectrograms of the content speech, style speech and generated speech, respectively. In theory, this yields the speech style transfer model, shown in figure 5. In figure 5, the essential function of the convolutional neural network is to extract the feature information of the input (the 2D spectrogram) layer by layer. After layered extraction by the convolutional, pooling, fully connected and other network layers, the feature information of the 2D spectrogram becomes more and more advanced and abstract. That is, the low-layer convolution filters tend to extract style feature information (edge, texture, color, etc.) from the 2D spectrogram, while the high-layer convolution filters tend to extract content feature information (a rough skeleton or layout, etc.).
The 2D spectrograms of content speech, style speech and generated speech are denoted by $C_v$, $S_v$ and $G_v$, respectively. (1) The content loss function $J_{Content}(C_v, G_v)$ measures the similarity between the content speech spectrogram $C_v$ and the generated speech spectrogram $G_v$ in content features such as skeleton and layout.
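The exact form of the content loss is not recoverable from this text, so the following is only a hedged sketch. It assumes the loss is the mean squared difference between the feature maps that one network layer produces for the content and generated spectrograms. The function name `content_loss` and the choice of normalisation are assumptions made here.

```python
import numpy as np

def content_loss(feat_content, feat_generated):
    """Mean squared difference between two feature maps of identical shape.

    The inputs stand for the activations of one (assumed) network layer
    for the content spectrogram and the generated spectrogram.
    """
    diff = (np.asarray(feat_content, dtype=float)
            - np.asarray(feat_generated, dtype=float))
    return float(np.mean(diff ** 2))
```

The loss is zero when both spectrograms yield identical features at that layer and grows as their content features diverge.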

The Extraction of Style Features from The 2D Spectrogram of Style Speech
The features in the low-layer feature maps extracted by the convolution filters are selected as the style features for the spectrogram of the generated speech.
Step 1 The style matrix of the spectrogram, also known as the Gram matrix, measures the correlation between different sections (1, 2, ...) of a given feature map.

Define the style matrix. (3)
The Gram matrix of the spectrogram measures whether two features appear together in the spectrogram and how strongly they respond when they do.
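A minimal sketch of a Gram matrix computation follows. It assumes that the "sections" of a feature map are its channels and that the feature map is stored as a (height, width, channels) array. Both are assumptions made here, as is the function name `gram_matrix`.

```python
import numpy as np

def gram_matrix(feature_map):
    """Map a (height, width, channels) array to a (channels, channels) Gram matrix."""
    h, w, c = feature_map.shape
    # Flatten the spatial positions: one row per position, one column per channel.
    flat = feature_map.reshape(h * w, c)
    # Entry (k, k') sums the product of channels k and k' over all positions,
    # so it is large when the two features respond strongly at the same places.
    return flat.T @ flat
```

The result is symmetric, and its diagonal holds the total energy of each channel.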
Step 2 Define the style loss function for the $l$-th layer feature map. (4)
Step 3 Finally, the style loss function of the speech spectrogram is defined as the weighted sum of the style loss functions of multiple layers, $l = 1, 2, \ldots$
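Steps 2 and 3 can be sketched as follows. The per-layer loss is taken here to be the squared difference between the Gram matrices of the style and generated feature maps, with a normalisation of $1/(2 n_H n_W n_C)^2$ that is common in the image style transfer literature. The paper's own normalisation and layer weights are not recoverable from this text, so the constants and the names `layer_style_loss` and `style_loss` are assumptions.

```python
import numpy as np

def gram_matrix(feature_map):
    """Map a (height, width, channels) array to a (channels, channels) Gram matrix."""
    h, w, c = feature_map.shape
    flat = feature_map.reshape(h * w, c)
    return flat.T @ flat

def layer_style_loss(feat_style, feat_generated):
    """Style loss of one layer: normalised squared Gram-matrix difference."""
    h, w, c = feat_style.shape
    diff = gram_matrix(feat_style) - gram_matrix(feat_generated)
    # The factor (2 * h * w * c) ** 2 is an assumed normalisation.
    return float(np.sum(diff ** 2) / (2.0 * h * w * c) ** 2)

def style_loss(style_feats, generated_feats, weights):
    """Weighted sum of per-layer style losses (the weights are hypothetical)."""
    return sum(wt * layer_style_loss(s, g)
               for wt, s, g in zip(weights, style_feats, generated_feats))
```

Each list passed to `style_loss` holds one feature map per selected layer, in the same layer order as `weights`.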
The style loss function of the generated speech is minimized iteratively by the gradient descent method. Finally, the audio file of the generated speech signal is obtained from the stylized spectrogram. Figure 8 shows the iterative procedure for obtaining the audio file of the generated speech.
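The iteration can be illustrated with a deliberately simplified sketch. To stay self-contained it swaps in the raw spectrograms for the convolutional features, so it is not the paper's model. The weights `alpha` and `beta`, the step size `lr`, the step count and the random test data are all hypothetical. The gradients are written out by hand for this simplified loss.

```python
import numpy as np

def total_loss_and_grad(G, C, S, alpha=1.0, beta=1.0):
    """Loss and its gradient with respect to G (raw spectrograms as stand-in features)."""
    n, c = G.shape
    # Content term: mean squared difference to the content spectrogram.
    content = np.mean((G - C) ** 2)
    grad_content = 2.0 * (G - C) / (n * c)
    # Style term: squared difference between the two Gram matrices.
    M = G.T @ G - S.T @ S
    norm = (2.0 * n * c) ** 2
    style = np.sum(M ** 2) / norm
    grad_style = 4.0 * (G @ M) / norm  # valid because M is symmetric
    return alpha * content + beta * style, alpha * grad_content + beta * grad_style

def stylize(C, S, steps=100, lr=5.0):
    """Plain gradient descent on the generated spectrogram, starting from C."""
    G = C.copy()
    history = []
    for _ in range(steps):
        loss, grad = total_loss_and_grad(G, C, S)
        history.append(loss)
        G = G - lr * grad
    return G, history

rng = np.random.default_rng(0)
C = rng.random((20, 8))  # stand-in content spectrogram (frames x frequency bins)
S = rng.random((20, 8))  # stand-in style spectrogram
G, history = stylize(C, S)
```

Only the generated spectrogram is updated during the iteration. No network weights are trained, which is in line with the paper's aim of working with little or no additional training data.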