Lightweight Multi-Attention Fusion Network for Image Super-Resolution

Abstract: Single image super-resolution (SISR) reconstruction is one of the important techniques in computer vision and image processing. Most existing SISR methods process different spatial positions and channels equally, so a large amount of computational resources is wasted on unimportant features. To address this problem, a novel lightweight multi-attention fusion network (LMAFN) is proposed, in which the multi-attention fusion block allocates computational resources more efficiently by separately capturing the weight information implied by the channel domain and the spatial domain, thus effectively reducing the number of parameters. The synthetic channel attention block within the multi-attention fusion block makes full use of inter-channel correlation by introducing both global standard deviation pooling and maximum pooling. Global features are fused through residual connections to alleviate the loss of high-frequency information. Experimental results on several benchmark datasets show that the proposed method effectively reduces the number of parameters and the computational cost without excessive loss of reconstruction performance, and achieves better performance than the compared models.


Introduction
Image super-resolution reconstruction is the process of restoring a low-resolution image to a high-resolution image by algorithms, and it is one of the key techniques in computer vision and image processing. The concept has attracted great academic interest since it was proposed. In 2015, Dong et al. [1] proposed SRCNN by combining SISR with CNN. Subsequently, more complex architectures were proposed to improve the performance of SR methods, such as SRGAN [2]. However, it is difficult to apply them directly on portable mobile devices because of their heavy computation and large numbers of parameters. Lightweight research starts with FSRCNN [3], which applies the SR network directly to the LR image and removes the costly pre-upsampling step to reduce training time. DRCN [4] was the first to apply a recursive algorithm to SISR, reducing the parameter count by reusing the same parameters multiple times. Later, DRRN [5] utilizes recursive layers to reduce parameters while maintaining the depth of the network. EDSR [6] redesigns the residual block by removing the BN layers, which saves about 40% of GPU memory. LapSRN [7] uses a pyramidal framework to gradually increase the image size so that super-resolution can be performed efficiently starting from very low resolution. CBPN [8] replaces the previous up-down projection module with a pixel shuffle layer.
With the further development of SISR, several issues have arisen. Firstly, most available CNN-based models primarily stack convolution operations and enlarge convolution kernels to enhance the reconstruction quality, which brings a large amount of computation. Secondly, most current CNN networks obtain features through successive convolution operations that treat every channel and spatial position identically; since features differ in importance, such equal treatment wastes computational resources and consequently causes huge memory consumption.
In view of the shortcomings mentioned above, and inspired by [9], we propose an improved lightweight network. Within it, the multi-attention fusion block (MAFB) focuses on the significance of distinct channel and spatial position features to acquire the corresponding weight parameters. Moreover, global feature supervision is established via long skip connections to speed up network convergence and enable features to be used effectively. In the reconstruction part, shallow and deep features are fused by residual learning and sub-pixel convolution, complementing the preceding feature extraction and feature fusion parts.
The main contributions of this paper are as follows: (1) We propose an improved lightweight multi-attention image super-resolution reconstruction network (LMAFN), which reduces the number of parameters and the computational cost without losing too much reconstruction performance. (2) We propose MAFB, a module that allocates computational resources according to the importance of features by applying channel attention enhancement and spatial attention enhancement separately. The synthetic channel attention block (SCAB) introduces two different pooling operations, which makes full use of the correlation information between channels. In addition, a global residual connection is added to fuse low-level and high-level features more effectively. (3) Experimental results on several benchmark datasets show that the proposed method achieves better performance than existing state-of-the-art models.

Single Image Super-Resolution Reconstruction based on Deep Learning
With Dong et al. [1] combining CNN with SISR, more and more neural-network-based SISR models have been proposed. SRCNN proposes a simple three-layer network that first up-samples the input low-resolution (LR) image with a bicubic interpolation algorithm to obtain an image of the target size. The LR image is then processed by a three-layer convolutional network to obtain a high-resolution SR image, with the goal of making it as similar as possible to the original HR image. The first part of the network extracts multiple patches from the input LR image Y, each of which is represented as a multidimensional vector by a convolution operation; all the feature vectors form an n1-dimensional feature map. The process is expressed as:

F_1(Y) = max(0, W_1 * Y + B_1)  (1)

In the second part, the n1-dimensional feature map is non-linearly mapped by a convolution operation to an n2-dimensional feature map, expressed by the following equation:

F_2(Y) = max(0, W_2 * F_1(Y) + B_2)  (2)

The last step, reconstruction, is equivalent to a deconvolution: the n2-dimensional feature map is restored to the HR image. It can be expressed as:

F_3(Y) = W_3 * F_2(Y) + B_3  (3)
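The three-stage pipeline above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the 9-1-5 kernel sizes and the n1=64, n2=32 channel widths are the common setting from the SRCNN paper, and the module is assumed to receive a bicubic-upscaled input.

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Minimal sketch of the three-stage SRCNN pipeline: patch extraction,
    non-linear mapping, and reconstruction."""
    def __init__(self, num_channels=1, n1=64, n2=32):
        super().__init__()
        self.extract = nn.Conv2d(num_channels, n1, 9, padding=4)      # patch extraction
        self.map = nn.Conv2d(n1, n2, 1)                               # non-linear mapping
        self.reconstruct = nn.Conv2d(n2, num_channels, 5, padding=2)  # reconstruction
        self.relu = nn.ReLU(inplace=True)

    def forward(self, y):
        f1 = self.relu(self.extract(y))    # F1(Y) = max(0, W1*Y + B1)
        f2 = self.relu(self.map(f1))       # F2(Y) = max(0, W2*F1(Y) + B2)
        return self.reconstruct(f2)        # F3(Y) = W3*F2(Y) + B3
```

With `padding` chosen to preserve spatial size, the output has the same resolution as the (pre-upscaled) input.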

Image Super-Resolution based on Attention Mechanism
The visual attention mechanism is a unique signal processing mechanism of the human brain. Specifically, after observing the global image, the brain selects some local focus areas and then pays more attention to these areas to obtain more detailed information while suppressing other useless information. Its essence is to learn a weight distribution over image features, which is then applied to the original features to give different features different influence in image processing tasks such as image classification and image recognition.
Considering that the features extracted by different convolution kernels differ in importance, the SENet [10] network first introduced the concept of channel attention to deep neural networks. It enhances valuable features by learning a weight value for each channel, effectively improving the learning capacity of the network when computing resources are limited; it can also be used to design lightweight network architectures. Non-local [11], proposed by Liu et al., aimed to generate long-range attention maps by calculating the correlation matrix between spatial points in the feature map, and then uses them to guide the aggregation of dense contextual information. However, due to its heavy computation, it is difficult to apply in practice. RCAN [9] significantly improved the reconstruction quality through the use of channel attention in the field of SISR.

Framework of the Proposed Model
In this part, we describe the details of the proposed model, shown in Figure 1. First, a convolutional layer is used to extract the shallow features of the LR image, and further feature extraction is then performed by multiple stacked MAFBs. Finally, the HR image is reconstructed by an upsampling module, whose output is fused by a convolution and added to an interpolated copy of the input to obtain the reconstructed HR image. Furthermore, we add the features of the first layer and those of the last layer through a residual connection, fusing shallow and deep features so as to preserve the influence of the shallow features on the deep layers to the greatest extent.
As shown in Fig. 1, considering the lightweight design of the model, this part consists only of a simple 3×3 convolution. The process can be expressed as:

x_0 = f_ext(I_LR)  (4)

where f_ext(·) represents the 3×3 convolution operation that extracts features from the input LR image I_LR, and x_0 is the output of this layer.
Then, we use a nonlinear mapping module consisting of 16 stacked MAFBs to generate new feature representations, denoted as:

x_n = H_n(H_{n−1}(···H_1(x_0)···))  (5)

where H_i(·) denotes the i-th MAFB and x_n represents the n-th MAFB's output. Finally, the feature maps x_0 and x_n are added and used as the input of the reconstruction module, and further feature fusion is performed by a 3×3 convolution after the reconstruction module. Additionally, we add a global residual connection that performs bilinear interpolation on the input and adds its output to that of the reconstruction branch, upsampling the image to the target size. Finally, we obtain:

I_SR = f_Fusion(f_REC(x_0 + x_n)) + f_UP(I_LR)  (6)

where f_REC(·) denotes the operation of the upsampling module, f_Fusion(·) denotes the 3×3 convolution, f_UP(·) denotes interpolated upsampling, and I_SR is the final output of the network.
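The overall pipeline of Eq. (6) can be sketched as follows. This is an illustrative skeleton under our assumptions, not the paper's code: the MAFBs are stubbed as plain residual 3×3 convolutions, and the channel width follows common lightweight-SR settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LMAFNSkeleton(nn.Module):
    """Illustrative skeleton: a 3x3 head conv producing x_0, a chain of
    MAFBs (stubbed here as residual 3x3 convs), sub-pixel reconstruction of
    x_0 + x_n, a 3x3 fusion conv, and a bilinear global residual f_UP(I_LR)."""
    def __init__(self, channels=64, num_blocks=16, scale=2):
        super().__init__()
        self.scale = scale
        self.head = nn.Conv2d(3, channels, 3, padding=1)        # shallow features x_0
        self.body = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_blocks)]
        )  # stand-ins for the 16 MAFBs
        self.reconstruct = nn.Sequential(                       # upsampling module
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.fusion = nn.Conv2d(channels, 3, 3, padding=1)      # f_Fusion

    def forward(self, lr):
        x0 = self.head(lr)
        x = x0
        for block in self.body:
            x = x + torch.relu(block(x))                        # placeholder MAFB
        sr = self.fusion(self.reconstruct(x0 + x))              # f_Fusion(f_REC(x_0 + x_n))
        up = F.interpolate(lr, scale_factor=self.scale,
                           mode="bilinear", align_corners=False)  # f_UP(I_LR)
        return sr + up
```

The bilinear branch gives the network a cheap baseline upscaling, so the learned branch only needs to model the residual high-frequency detail.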

Fig. 1 Structure of the proposed lightweight multi-attention fusion network

Multi-Attention Fusion Block
Inspired by SCNet [12], we improve its structure in this paper. Figure 2 shows that MAFB consists of two layers: the upper layer performs higher-level feature operations, i.e., it applies a weight distribution to the features, while the other layer retains the primitive information. In the upper layer, we adopt two attention modules, namely SCAB and the modified spatial attention block (MSAB).

Fig. 2 Multi-Attention Fusion Block
The x_{n−1} and x_n are defined as the input and output of the n-th MAFB, respectively. Similar to SENet, the two branches of MAFB first undergo dimensionality reduction through a 1×1 convolutional layer, denoted f_D-reduction(·). The two pathways in the first layer correspond to the channel attention and spatial attention modules, respectively. Given the input features x_{n−1}:

x'_{n−1} = f_D-reduction(x_{n−1})  (7)

where the number of channels of x'_{n−1} is only half of that of x_{n−1}.
Next, the output of the first layer is fed to the two attention blocks to calculate the channel and spatial position weights:

a = f_SCAB(x'_{n−1}), b = f_MSAB(x'_{n−1})  (8)

Finally, the learned weight values are superimposed on each corresponding feature position by multiplying the corresponding positions, which can be expressed as:

u = a ⊗ b  (9)

where a and b represent the weight values output by the channel attention block and the spatial attention block, respectively, ⊗ is the multiplication operation on the corresponding position weight information, and u is the output of the upper layer in MAFB, which is then passed through a 3×3 convolution and output as x'_n.
The operation of the lower layer converts x_{n−1} into x''_n. Likewise, a 1×1 convolution is used for dimensionality reduction, followed by a 3×3 convolutional layer that generates x''_n to preserve the original information. Finally, the outputs x'_n and x''_n of the two layers are concatenated, and x_n is generated by a 1×1 convolution. To accelerate training, a shortcut is added to generate the output features of this part.
The main difference between this structure and [12] is that we use two attention mechanisms instead of its pooling and upsampling layers. The block can be expressed as:

x_n = f_1×1([x'_n, x''_n]) + x_{n−1}  (10)

where [·] is the concatenation operation, f_1×1(·) is the 1×1 convolution, and x_{n−1} is the input (i.e., the output of the previous MAFB).
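The MAFB wiring described above can be sketched as follows. `channel_att` (standing in for SCAB) and `spatial_att` (for MSAB) are passed in as callables that return weight maps; the half-channel 1×1 reduction and the two-layer layout follow the text, while the exact way the two weight maps are applied is our reading of Eq. (9), not a confirmed detail.

```python
import torch
import torch.nn as nn

class MAFBSketch(nn.Module):
    """Sketch of the multi-attention fusion block: an upper layer that
    weights features with channel and spatial attention, a lower layer that
    preserves the original information, and a fused residual output."""
    def __init__(self, channels, channel_att, spatial_att):
        super().__init__()
        half = channels // 2
        # upper layer: 1x1 reduction, attention weighting, 3x3 conv -> x'_n
        self.reduce_up = nn.Conv2d(channels, half, 1)
        self.channel_att = channel_att
        self.spatial_att = spatial_att
        self.conv_up = nn.Conv2d(half, half, 3, padding=1)
        # lower layer: 1x1 reduction + 3x3 conv keeps original info -> x''_n
        self.reduce_low = nn.Conv2d(channels, half, 1)
        self.conv_low = nn.Conv2d(half, half, 3, padding=1)
        # concatenate both layers and restore the channel count
        self.fuse = nn.Conv2d(2 * half, channels, 1)

    def forward(self, x):
        f = self.reduce_up(x)
        a = self.channel_att(f)              # per-channel weights
        b = self.spatial_att(f)              # per-position weights
        x_up = self.conv_up(f * a * b)       # weighted upper-layer features
        x_low = self.conv_low(self.reduce_low(x))
        return self.fuse(torch.cat([x_up, x_low], dim=1)) + x  # shortcut
```

For a quick shape check, simple SENet-style stand-ins such as `lambda f: torch.sigmoid(f.mean(dim=(2, 3), keepdim=True))` can be plugged in for either attention callable.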

Synthetic Channel Attention Block
Inspired by [13], SCAB first replaces global average pooling with global standard deviation pooling and maximum pooling, and concatenates their outputs:

z_c = [std_c, Max_c]  (11)

where std_c and Max_c stand for the c-th elements of the standard deviation pooling and max pooling outputs, respectively, and [·] denotes concatenation.
To make the most of the detailed information contained in the features, the two one-dimensional vectors output by the standard deviation pooling and the maximum pooling are concatenated into one two-dimensional matrix. Fig. 3 shows that dimensionality reduction and expansion are carried out by two consecutive 1×1 convolutions, after which the weights are normalized by a Sigmoid activation function and applied to the input features. The final output has dimension 1×1×C. The process is:

a = f(W_i(σ(W_s(z_c))))  (12)

where a and z_c are the output of SCAB and the result of the two poolings and their concatenation, respectively; σ(·) and f(·) represent the ReLU function and the Sigmoid activation function; W_s and W_i are 1×1 convolutions, in which W_s compresses and W_i expands the feature channels according to the reduction ratio r.
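A sketch of SCAB under our assumptions follows: the two global pooling descriptors are concatenated along the channel axis, squeezed and re-expanded by two 1×1 convolutions (the W_s and W_i of Eq. (12)), and normalized by a sigmoid. The reduction ratio `r` and the channel-axis concatenation are assumptions, not confirmed details.

```python
import torch
import torch.nn as nn

class SCABSketch(nn.Module):
    """Sketch of the synthetic channel attention block: global standard
    deviation pooling plus global max pooling, squeeze-expand 1x1 convs,
    and a sigmoid gate over the channels."""
    def __init__(self, channels, r=8):
        super().__init__()
        self.squeeze = nn.Conv2d(2 * channels, channels // r, 1)  # W_s
        self.expand = nn.Conv2d(channels // r, channels, 1)       # W_i
        self.relu = nn.ReLU(inplace=True)                         # sigma

    def forward(self, x):
        flat = x.flatten(2)                                       # B x C x (H*W)
        # population std over each channel's H*W positions
        std = flat.std(dim=2, unbiased=False).unsqueeze(-1).unsqueeze(-1)
        mx = flat.max(dim=2).values.unsqueeze(-1).unsqueeze(-1)   # global max pooling
        z = torch.cat([std, mx], dim=1)                           # B x 2C x 1 x 1
        a = torch.sigmoid(self.expand(self.relu(self.squeeze(z))))
        return x * a                                              # re-weight channels
```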

Fig. 3 Synthetic Channel Attention Block
Max pooling takes the largest feature point in the neighborhood and learns the edge and texture structure of the image well, while global standard deviation pooling provides more effective information for channel weight learning. Let x'_{n−1} = [x_1, ..., x_c, ..., x_C] be the input, consisting of C feature maps of size H×W. The formula for standard deviation pooling is:

std_c = sqrt( (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} (x_c(i, j) − μ_c)^2 ),  μ_c = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} x_c(i, j)  (13)

where std_c is the standard deviation of the c-th channel.
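As a sanity check, the per-channel standard deviation pooling above can be written directly; this NumPy sketch uses the population standard deviation over the H×W positions of each channel.

```python
import numpy as np

def std_pool(x):
    """Global standard deviation pooling for one C x H x W feature map:
    the population standard deviation over all H*W positions per channel."""
    c = x.shape[0]
    flat = x.reshape(c, -1)
    mean = flat.mean(axis=1, keepdims=True)
    return np.sqrt(((flat - mean) ** 2).mean(axis=1))

x = np.random.rand(4, 8, 8)
# agrees with NumPy's built-in std (ddof=0)
assert np.allclose(std_pool(x), x.reshape(4, -1).std(axis=1))
```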

Modified Spatial Attention Block
Texture details vary at different spatial locations. Fig. 4 shows the structure of the proposed MSAB. Inspired by SENet [10], we design a structure that applies global average pooling and standard deviation pooling along the channel axis when building the spatial attention mechanism. First, we perform a global average pooling operation on the input features, whose dimension is assumed to be C×H×W; the formula for the c-th channel is:

F_GAP(x_c) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} x^c_{i,j}  (14)

where x^c_{i,j} represents the pixel value at position (i, j) in the c-th channel of the input feature map, and F_GAP(·) represents the global average pooling operation, which is used to generate descriptors describing the significance of distinct location features.

Fig. 4 Modified Spatial Attention Block
The pooled features are first concatenated, and the number of channels is compressed by a 1×1 convolution. The nonlinear mapping of the spatial weight information is realized by a deconvolution operation, which further reduces the amount of computation and ensures that information from multiple spatial positions can be strengthened. Ultimately, the calculated spatial weight feature maps are normalized by the Sigmoid activation function. The final output can be expressed as:

b = f(W_i(σ(W_D(g_c))))  (15)

where b is the output feature of MSAB, g_c is the result of concatenating the two pooled maps, and W_i and W_D denote a 1×1 convolution and a deconvolution, respectively.

Fig. 5 shows the upsample block, which uses sub-pixel convolution [14] to transform low-resolution features into high-resolution features in the reconstruction block. "Conv with Shuffle ×2" denotes a 3×3 convolution followed by a pixel shuffle of scale 2, and "Conv with Shuffle ×3" a 3×3 convolution followed by a pixel shuffle of scale 3. One and two "Conv with Shuffle ×2" stages are used for the ×2 and ×4 scales, respectively, and "Conv with Shuffle ×3" is used for the ×3 scale.
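The "Conv with Shuffle" stages can be sketched with PyTorch's sub-pixel `PixelShuffle` layer. Keeping the channel width constant across stages is our assumption for illustration.

```python
import torch
import torch.nn as nn

def make_upsampler(channels, scale):
    """Sketch of the sub-pixel reconstruction block: each 'Conv with
    Shuffle xs' stage is a 3x3 convolution expanding the channels by s^2
    followed by PixelShuffle(s). x2 and x3 use one stage; x4 stacks two
    x2 stages, as described above."""
    factors = [2, 2] if scale == 4 else [scale]
    stages = []
    for s in factors:
        stages += [nn.Conv2d(channels, channels * s * s, 3, padding=1),
                   nn.PixelShuffle(s)]
    return nn.Sequential(*stages)
```

PixelShuffle rearranges an s²·C×H×W tensor into C×sH×sW, so all spatial enlargement is done by channel-to-space reshuffling rather than interpolation.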

Datasets and Metrics
The training process uses the DIV2K [15] dataset, which contains 1000 high-resolution images with rich scene, edge and texture detail; we selected 800 of them for training. The LR images were obtained by bicubic downsampling of the HR images at the ×2, ×3 and ×4 scales. Five standard benchmark datasets, Set5 [16], Set14 [8], B100 [4], Urban100 [17] and Manga109 [18], were used as the test sets. Peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [19] are used as evaluation metrics. In this paper, the reconstructed images are converted to YCbCr space and compared on the Y channel. We use the number of parameters and multi-adds to measure the lightness of the model and compare it with existing mainstream models.
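A minimal sketch of the Y-channel PSNR evaluation described above follows. The ITU-R BT.601 RGB-to-Y coefficients are the conventional choice in SR benchmarking and an assumption here, since the text does not list the exact constants.

```python
import numpy as np

def rgb_to_y(img):
    """Luminance (Y) channel of an H x W x 3 RGB image with values in
    [0, 255], using the conventional ITU-R BT.601 coefficients."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr):
    """PSNR between two RGB images, computed on the Y channel only."""
    mse = np.mean((rgb_to_y(sr.astype(np.float64)) -
                   rgb_to_y(hr.astype(np.float64))) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```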

Implementation Details
In order to avoid under-fitting during training, the training data are augmented by random rotations of 90°, 180° and 270° and by horizontal flipping, making the dataset 8 times larger than the original. During training at the three scales (×2, ×3 and ×4), 32 image patches of sizes 128×128, 192×192 and 256×256, respectively, are randomly cropped for each batch as input. The L1 loss function [19] and the Adam optimizer are chosen for training. We adopt a cosine annealing learning rate schedule, with the maximum and minimum learning rates set to 1e-3 and 1e-7, respectively, and a cosine period of 250k iterations. Our model is implemented in the PyTorch framework and trained on an Nvidia Tesla V100 GPU with 32 GB of memory.

Ablation Study

Table 1 shows the comparison of the evaluation metrics on the Set5 dataset (×3). We can see that variant (1), with standard deviation pooling, improves PSNR by 0.04 dB and SSIM by 0.0003 relative to variant (2) with maximum pooling, whereas variant (3), which adaptively combines maximum pooling and global standard deviation pooling, improves PSNR by 0.05 dB and SSIM by 0.0002 relative to (1). Global standard deviation pooling thus has a more significant effect on the model's PSNR than maximum pooling. The experimental results show that the designed double-pooling structure learns the channel weights better, which proves the effectiveness of the structure.

Table 2 shows the comparison of the evaluation metrics on the Set5 dataset (×2), where the effectiveness of the proposed MAFB is verified by ablation experiments. We first remove SCAB and keep the rest unchanged; the resulting model is trained and tested on the same Set5 dataset (×2). After removing SCAB, PSNR decreases by 0.08 dB and SSIM also decreases, which proves the effectiveness of the module.
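The optimizer and cosine-annealing learning-rate schedule described in the implementation details can be sketched as follows; the dummy parameter stands in for the real model, and per-iteration scheduler stepping is our assumption.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# Adam with the cosine annealing schedule from the text: maximum learning
# rate 1e-3, minimum 1e-7, cosine period of 250k iterations.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=250_000, eta_min=1e-7)

for _ in range(10):          # one scheduler step per training iteration
    optimizer.step()
    scheduler.step()
```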
Then MSAB is removed for training, and both the PSNR and SSIM values decrease. The comparison results are shown in Table 2. It can be seen that the proposed MAFB is significantly better than either single-attention variant, improving PSNR by about 0.06–0.08 dB, which also proves that MAFB benefits SISR.

Comparison with state-of-the-art Methods
To demonstrate the effectiveness of the proposed method for image super-resolution, we visualize partially reconstructed images from the B100 and Urban100 datasets. Three groups of images are selected for visual comparison at the ×2, ×3 and ×4 scales. As shown in Fig. 6, the blue tall building (img012) and the gray building (img001) from the Urban100 dataset are visualized at the ×2 and ×3 scales, and the bird (img_8023) from the B100 dataset at the ×4 scale. The reconstructed images are compared with existing models such as LapSRN [7], DRRN [5] and MSRN [20]. Fig. 6 first shows img012 (×2) from Urban100. LMAFN restores the detailed texture and edge information of the image very well. For the texture of the window in img012, most existing models generate abnormal textures that differ from the original image, and the window edges are largely blurred. Taking LapSRN [7] and MemNet [27] as examples, their reconstructions show poor linear detail in the brown-yellow window part, while MAFFSRN [28] and MSRN [20], which perform relatively well, still cannot fully recover most of the straight lines in the image (especially the blue window part). LMAFN restores the straight-line texture of the whole image and achieves a visual result closer to the ground-truth (GT) image.
The second part of Fig. 6 shows img001 (×3) from the Urban100 dataset, focusing on the details of the window frame in the middle. For the oblique frame, most existing models cannot recover the fine edge information, and the reconstructed region is very blurred. DRRN [5] and MemNet [27], which can recover the oblique border, still cannot recover the details of the double-striped border on the right side of the image. Compared with the GT image, MAFFSRN [28] and MSRN [20] obtain more realistic results and recover more image details, but LMAFN recovers the edge information of the window frame well without introducing blur artifacts, and its result is clearer than those of the other models. These results verify that LMAFN has a more powerful representation ability.
The last part of Fig. 6 shows img_8023 (×4) from the B100 dataset, focusing on the texture of the bird's feathers. The feather texture reconstructed by LapSRN [7] is very fuzzy; DRCN [4] and DRRN [5] retain some stripe information, but their edges are very fuzzy; and several other models produce erroneous mesh patterns. LMAFN recovers the feather texture better, with relatively clear edges. Overall, LMAFN obtains sharper results and recovers more high-contrast, sharp edges, improving the reconstruction quality to a certain extent.
To illustrate the validity of our proposed model, we compared the super-resolution (SR) reconstruction results of 11 advanced deep-learning-based SR models, including SRCNN [1], DRCN [4], LapSRN [7], MemNet [27] and MAFFSRN [28], at different scales on five mainstream benchmark datasets. The experimental results are shown in Table 3.
The best results are bolded and underlined. Compared with other methods, LMAFN maintains better accuracy while keeping the model lightweight, and has the best overall performance. Both LMAFN and MAFFSRN [28] adopt channel attention to learn the interdependence among features so that the network can focus on more important features, which yields similar or better results than the other methods. However, the parameters and multi-adds of our LMAFN (×2) are about 211.9K and 34.3G lower, respectively, which shows that our model is more lightweight. Compared with another lightweight model, CARN [21], the PSNR of our model at the ×3 scale is 0.08, 0.05 and 0.16 dB higher on datasets with more texture information (Set5, B100 and Urban100, respectively). Although the result on the edge-dominated Manga109 dataset is 0.11 dB lower than CARN, LMAFN has only about 13% of CARN's parameters and about 20% of its multi-adds. Texture information is a higher-order pattern with more complex statistics, whereas edge information is a first-order pattern that can be extracted by a first-order gradient operator; therefore, LMAFN offers better reconstruction quality on images dominated by higher-order information such as textures. In summary, compared with the ×3 and ×4 scales, LMAFN's advantage is more obvious at the ×2 scale.

Conclusion
We propose a novel lightweight multi-attention fusion image super-resolution network, LMAFN. The model effectively obtains the weights of different features by integrating channel attention and spatial attention while keeping the design lightweight. For SCAB, two kinds of pooling are introduced; compared with a single pooling, the correlation between channels is fully exploited, allowing richer high-level features to be extracted. The experimental results show that LMAFN improves the evaluation metrics while producing better visual quality in the reconstructed images.