Hybrid Attention Fusion in Dense Crowd Counting

Attentional supervision is an appealing approach to guiding deep parameter optimization: it instils intelligence into complex networks at a fraction of the cost, yet there is still room for improvement. First, in real dense scenes with varying scales and an uneven spatial distribution of human heads, the density map cannot be expressed clearly. Second, heavily occluded areas closely resemble complex backgrounds, which further aggravates the counting error. We therefore propose a dual-track attention network that separates global from local information, handling the target-overlap and background-confusion problems respectively; the two tracks finally converge, are normalized, and are fused with the feature map so that the multi-channel attention map is transformed into a single-channel density map. Meanwhile, the heterogeneous pyramid design alleviates the difficulties of scale variation and density dissimilarity. Experiments on several official datasets demonstrate the effectiveness of the scheme in enhancing key information and overcoming confounding factors.


Introduction
Dense crowd counting is defined as estimating the number of people in an image or video clip, generally using heads as the counting unit. In the density map estimation strategy, each pixel represents the probability that this location is the center of a head, thus reducing the counting task to an accumulation of probabilities. However, in real scenarios, a robust counting model requires strong generalization against external disturbances such as noisy backgrounds, scale variation, mutual occlusion, and perspective distortion. Traditional counting networks rely on multi-column architectures to extract features at different scales while focusing on valuable visual information via attention mechanisms.
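The reduction of counting to density-map accumulation can be sketched in a few lines; `count_from_density_map` and the toy map below are illustrative names for this note, not part of the proposed method:

```python
def count_from_density_map(density_map):
    """Estimate the crowd count by summing per-pixel head probabilities."""
    return sum(sum(row) for row in density_map)

# Toy 3x3 density map: two heads, each spread over neighbouring pixels.
dmap = [
    [0.1, 0.4, 0.0],
    [0.3, 0.2, 0.5],
    [0.0, 0.3, 0.2],
]
print(round(count_from_density_map(dmap), 2))  # -> 2.0
```

Since each annotated head contributes a unit-mass blob, the integral over the map recovers the head count regardless of where the blobs overlap.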

Multi-scale Feature Extraction Strategy
This strategy emphasizes that targets at different scales need to be perceived by receptive fields of different sizes, which is generally implemented with a multi-column convolutional architecture. MCNN [1] first uses a three-column network to extract multi-scale features to accommodate scale variations caused by different camera angles. Inspired by MCNN [1], Switching-CNN [2] retains the multi-column mode, adds a classifier that selects the branch best suited to the current scale, and then adaptively fuses the multi-scale information. Further, CSRNet [3] adopts dilated convolutional layers to enlarge the receptive field as an alternative to pooling operations, but this tends to cause gridding effects and thus a loss of local information. SANet [4] uses an Inception-style layout in the encoder to extract multi-scale features and adds transposed convolutions in the decoder to generate high-resolution density maps.

Attention Mechanism Guidance Strategy
An attention map, activated by the sigmoid function, directs the model to focus on regions with an obvious signal response and suppresses background noise, thus acting as top-level supervision. ASNet [5] observes that the density of different regions in an image varies greatly, leading to heterogeneous counting performance, and therefore proposes a density attention network that provides multi-scale attention masks for the convolutional extraction units. HANet [6] progressively embeds scale-context-fused channel attention into spatial attention, without considering that attention supervises different objects in the local and global cases. RANet [7] emphasizes attention optimization, using two modules to handle global and local attention separately and finally fusing them based on the interdependencies between features, but these dependencies are difficult to determine. Recognizing that it is often difficult to generate accurate attention maps directly, CFANet [8] turns to a coarse-to-fine progressive attention mechanism through two branches, the crowd region recognizer (CRR) and the density level estimator (DLE).

The Main Network Structure
This paper aims to establish a crowd counting framework suitable for dense scenes. The architecture of the proposed method is illustrated in Figure 1. It includes a primary feature extractor taken from the VGG-16 [9] model as the backbone and two heterogeneous pyramid modules responsible for global and local information encoding, respectively; the dual-track outputs are stacked to obtain hybrid features, which are activated to produce attention and then fused with the multi-channel feature map to obtain the final predicted density map.
The local information encoding stage is a shallow network capable of exploiting fine-grained feature information and thus filtering out complex backgrounds and other non-target entities. Specifically, a four-branch pyramidal architecture is used, with convolutional kernel size, and hence receptive field, increasing from top to bottom.
The global information encoding stage is a deeper network with enhanced nonlinearity, a fuller receptive field, and richer semantic information. It is used to deal with dense occlusion of human heads and also learns irregular density distributions well. To reduce the parameter overhead, a three-way merge is used, with filter sizes chosen among 1×1, 3×3, and 5×5. The dual-track features are merged and then split in two again. In the top-side pathway, a probability distribution over [0,1] is generated via ReLU and sigmoid activation functions in turn, i.e., the hybrid attention. In the bottom-side pathway, two 3×3 convolution kernels first reform the feature information, and then a 1×1 convolution adjusts it to exactly the same size as the hybrid attention, i.e., the multi-channel feature map. For the fusion strategy, we introduce the softmax function, which converts the multi-channel attention map into a single-channel density map, a move that eliminates the need for costly and poorly robust attention labels. In detail, the hybrid attention is normalized by softmax so that each pixel learns a dynamic weight for this location across all channel layers. It is then multiplied element-wise with the multi-channel feature maps, and finally all channels are summed to obtain the final prediction map fusing attention with features, denoted by P.
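The channel-wise softmax fusion can be sketched in plain Python (a minimal illustration of the normalize-weight-sum step; `softmax_fuse` is a name coined here, and a real implementation would operate on framework tensors):

```python
import math

def softmax_fuse(attention, features):
    """Fuse a multi-channel attention map with a multi-channel feature map.

    attention, features: [C][H][W] nested lists of the same shape.
    At each pixel, softmax is taken over the channel axis so the weights
    sum to 1, then the weighted features are summed over channels,
    yielding a single-channel density map of shape [H][W].
    """
    C = len(attention)
    H, W = len(attention[0]), len(attention[0][0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            logits = [attention[c][y][x] for c in range(C)]
            m = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(v - m) for v in logits]
            z = sum(exps)
            out[y][x] = sum((e / z) * features[c][y][x]
                            for c, e in enumerate(exps))
    return out
```

With uniform attention logits the fusion degenerates to a plain channel average; strongly peaked logits let a single channel dominate the pixel, which is what allows the multi-channel attention to collapse into one density map without explicit attention labels.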

Loss Function
In this paper, we choose the mean absolute error (MAE) to measure the pixel-level error between the final prediction map and the labels, denoted by L_pre.

L_pre(Θ) = (1/N) Σ_{i=1}^{N} || P(X_i; Θ) − G_i^{GT} ||_1

where N is the number of images in a training batch, X_i denotes the current training image, Θ is the set of learnable parameters, P(X_i; Θ) represents the prediction map for it, and G_i^{GT} refers to its ground-truth density map.
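The batch-level pixel MAE described here can be sketched as follows (an illustrative snippet on nested lists; `mae_loss` is a name coined for this sketch):

```python
def mae_loss(preds, gts):
    """Pixel-level mean absolute error over a batch.

    preds, gts: lists of same-shaped 2-D density maps (nested lists);
    the absolute per-pixel differences are summed per image and
    averaged over the batch.
    """
    n = len(preds)
    total = 0.0
    for pred, gt in zip(preds, gts):
        total += sum(abs(p - g)
                     for prow, grow in zip(pred, gt)
                     for p, g in zip(prow, grow))
    return total / n
```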

Experimental Detail
To ensure an authoritative evaluation, four official datasets are used in this paper: ShanghaiTech Part A and Part B [1], UCF_CC_50 [10], and UCF-QNRF [11]. Among them, the UCF_CC_50 dataset has a limited number of samples, so we follow the official recommendation of 5-fold cross-validation for testing.
We generate training labels by blurring each head annotation with a Gaussian function. In detail, for crowd-sparse datasets such as ShanghaiTech Part B we use fixed-size kernels, while for other datasets with denser scenes a geometry-adaptive kernel based on the nearest-neighbor algorithm is utilized.
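Label generation can be sketched as below. The geometry-adaptive rule sigma = β × (mean distance to the k nearest annotated heads) with β = 0.3 and k = 3 follows the common MCNN-style convention; those hyper-parameter values are assumptions of this sketch, not values stated in this paper:

```python
import math

def gaussian_density_map(points, h, w, beta=0.3, k=3, fixed_sigma=None):
    """Blur head annotations into an h x w density map.

    points: list of (row, col) head centers. If fixed_sigma is None, a
    geometry-adaptive sigma = beta * mean distance to the k nearest
    neighbours is used (beta=0.3, k=3 assumed); otherwise the fixed
    kernel width is applied. Each Gaussian is normalised over the grid
    so the whole map sums to the head count.
    """
    dmap = [[0.0] * w for _ in range(h)]
    for i, (py, px) in enumerate(points):
        if fixed_sigma is not None:
            sigma = fixed_sigma
        else:
            dists = sorted(math.hypot(py - qy, px - qx)
                           for j, (qy, qx) in enumerate(points) if j != i)
            neigh = dists[:k] or [1.0]  # lone point: fall back to sigma=beta
            sigma = beta * (sum(neigh) / len(neigh))
        weights = [[math.exp(-((y - py) ** 2 + (x - px) ** 2)
                             / (2 * sigma ** 2))
                    for x in range(w)] for y in range(h)]
        z = sum(sum(row) for row in weights)
        for y in range(h):
            for x in range(w):
                dmap[y][x] += weights[y][x] / z  # each head contributes 1
    return dmap
```

Normalising each blob to unit mass is what keeps the integral of the label equal to the annotated count, matching the counting-by-accumulation formulation.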
Except for the primary feature extractor, the parameters of the subsequent layers are randomly initialized from a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. For training, we choose the Adam optimizer to retrain the model, with an initial learning rate of 1e-4, halved every 100 epochs.
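The initialization and learning-rate schedule stated above amount to the following (a framework-free sketch; `init_weights` and `lr_at_epoch` are names coined here, and a real pipeline would use the equivalent framework utilities):

```python
import random

def init_weights(numel, mean=0.0, std=0.01, seed=0):
    """Draw `numel` weights from N(0, 0.01^2), as used for the
    layers after the backbone."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(numel)]

def lr_at_epoch(epoch, base_lr=1e-4, step=100):
    """Step schedule: halve the learning rate every `step` epochs."""
    return base_lr * (0.5 ** (epoch // step))
```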
There are two mainstream metrics for evaluating performance in the crowd counting task: mean absolute error (MAE) and mean squared error (MSE). They are defined as follows.
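In crowd counting these metrics are conventionally computed on image-level counts: MAE = (1/M) Σ |C_i − C_i^GT| and MSE = sqrt((1/M) Σ (C_i − C_i^GT)²), where M is the number of test images and C_i, C_i^GT are the predicted and ground-truth counts (note that the reported "MSE" is actually a root-mean-squared error by convention). A minimal sketch of these standard definitions:

```python
import math

def mae(pred_counts, gt_counts):
    """Mean absolute error over image-level counts."""
    return (sum(abs(p - g) for p, g in zip(pred_counts, gt_counts))
            / len(gt_counts))

def mse(pred_counts, gt_counts):
    """Root of the mean squared count error (reported as MSE
    by convention in the crowd counting literature)."""
    return math.sqrt(sum((p - g) ** 2
                         for p, g in zip(pred_counts, gt_counts))
                     / len(gt_counts))
```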

Comparison Experiment
We demonstrate the effectiveness of the proposed method on four official datasets; the experimental results are shown in Table 1 (the best performance is indicated in bold and the second best is underlined). On the ShanghaiTech Part_A dataset, our MAE is 0.33% ahead of HANet; on the UCF_CC_50 dataset, our MAE outperforms the ASNet result by 4.5%, while the MSE is 2.59% ahead. Also, on the UCF-QNRF dataset, we improve the MSE metric by 1.03%.
To visually compare the effectiveness of the proposed method with RANet, we select representative samples from each dataset for counting tests, as shown in Figure 2. Further, to observe the overall prediction behavior of the two methods on the ShanghaiTech Part_A dataset, we aggregate the prediction-versus-ground-truth (PRE-GT) information for every sample in this dataset and plot it as a scatter diagram with regression lines, shown in Figure 3, where the red auxiliary line y=x indicates the ideal case of 100% counting accuracy. Qualitatively, the closeness of the blue regression line to the auxiliary line y=x is positively correlated with prediction quality; quantitatively, the closer the coefficient of determination R² of the regression line is to 1, the lower the overall error fluctuation.
Figure 2. Visual comparison of different methods

Conclusion
In this paper, we dissect the current problems faced by dense crowd counting, including noisy backgrounds, mutual occlusion, and variable scale. A dual-track network is designed that uses heterogeneous pyramid modules to obtain global and local features, respectively; these are transformed into hybrid attention via the softmax function and fused with high-resolution feature maps to deal effectively with target overlap and background confusion, while the pyramid paradigm itself has a strong ability to learn variable scales. Experiments show that this strategy is effective and delivers more stable performance.