Small pedestrian target detection based on YOLOv5

Abstract: YOLOv5s is the YOLOv5 variant with the smallest network depth and feature-map width and the fastest image inference, but when applied to small pedestrian target detection in complex scenes it still suffers from false and missed detections. To address this problem, an improved model based on YOLOv5s is proposed that adds a new convolutional building block, SPD-Conv, which improves the network's accuracy on detection tasks involving low-resolution images or smaller objects. Compared with the original network, the improved YOLOv5s-SPD model obtains better detection results, with an accuracy improvement of 3.9% and an increase in mAP of about 9.9%.


Introduction
Target detection is a hot research topic in the field of machine vision [1], and pedestrian detection [2], one of the important components of the target detection task, has high research and commercial value. It is a prerequisite for pedestrian segmentation [3] and pedestrian re-identification [4], and drives the development of other target detection tasks.
Pedestrian detection in realistic environments is unavoidably affected by environmental conditions: overexposure and shadows caused by strong daylight; blurred pedestrian features in foggy [5] and rainy [6] weather; large scale variation caused by the varying distance between pedestrians and the camera in surveillance scenes; and the small-pedestrian problem caused by dense crowds in scenes such as high-speed railway stations, airports, and other public gathering places [7]. All of these degrade detection performance. To address the low detection accuracy for occluded pedestrian targets and small pedestrian targets in real scenes, Zou Ziyin and Li Jinyu [8] proposed a series of solutions that further improve detection accuracy and optimise model performance. However, for small pedestrian targets, the features extracted by the model still contain a large amount of redundant background information, and detection accuracy remains insufficient. To address these problems, this paper proposes an improved algorithm based on YOLOv5s [9] that incorporates SPD-Conv (a space-to-depth layer followed by a non-strided convolution layer) for detecting small pedestrian targets in complex scenes [10], improving the network's detection capability for small pedestrian targets.

YOLOv5 network structure
The YOLOv5 model was proposed in June 2020 by Glenn Jocher of the Ultralytics team, who developed YOLOv5 after studying YOLOv3. The initial version of YOLOv5 is fast, efficient, and easy to use. YOLOv5s is the network with the smallest depth and feature-map width, and has the fastest inference speed, about 0.007 s per image. The network structure of YOLOv5s consists of four main components: the input, the Backbone network, the Neck network, and the Head output. The network structure of YOLOv5s is shown in Figure 1.

YOLOv5 effect demonstration
Figure 2 shows the test results of the different versions of the YOLOv5 detection algorithm on the MS COCO dataset, without using any other datasets or pre-trained weights. The grey dashed curve is the EfficientDet model, and the remaining four curves are the different network models of the YOLOv5 family.

Improvements to the YOLOv5 algorithm
Convolutional neural networks (CNNs) have achieved great success in computer vision tasks such as image classification and target detection. However, the loss of fine-grained information caused by strided convolution and pooling layers, together with limited feature extraction capability, leads to a rapid degradation of the network's detection accuracy on low-resolution images or smaller objects. Therefore, this paper adds a new convolutional building block, SPD-Conv, composed of a space-to-depth (SPD) layer and a non-strided convolution (Conv) layer. Space-to-depth rearranges the spatial (height and width) dimensions into the depth (channel) dimension. It plays a role similar to a pooling layer, but whereas pooling keeps only one value from each local block and discards the rest, this method keeps one value in place and stacks the remaining values along the depth direction, thereby preserving fine-grained features. The figure below shows the schematic diagram of SPD-Conv when block_size = 2 (block_size is the size of the local block, analogous to the pooling window size). The YOLOv5s-SPD model is obtained by simply replacing each stride-2 convolution in YOLOv5 with SPD-Conv; its structure is shown below.
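The mechanism above can be sketched as a small PyTorch module. This is a minimal illustration of the SPD-Conv idea, not the paper's exact configuration: the 3×3 kernel and the channel widths used below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SPDConv(nn.Module):
    """Sketch of SPD-Conv: a space-to-depth layer followed by a
    non-strided convolution.

    Every block_size x block_size patch of the feature map is rearranged
    into the channel dimension, so no pixel is discarded (unlike pooling
    or strided convolution); a stride-1 convolution then mixes the
    stacked channels while halving neither height nor width further.
    """

    def __init__(self, in_channels, out_channels, block_size=2):
        super().__init__()
        self.b = block_size
        # Non-strided conv: kernel size and padding are illustrative.
        self.conv = nn.Conv2d(in_channels * block_size * block_size,
                              out_channels,
                              kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        b = self.b
        # Slice into b*b interleaved sub-maps and stack them on channels:
        # (N, C, H, W) -> (N, C*b*b, H/b, W/b). All values are kept.
        x = torch.cat([x[..., i::b, j::b]
                       for i in range(b) for j in range(b)], dim=1)
        return self.conv(x)
```

For example, with block_size = 2 an input of shape (1, 64, 32, 32) becomes (1, 256, 16, 16) after the SPD step, and the non-strided convolution then maps it to the desired number of output channels. Replacing a stride-2 convolution this way keeps the same downsampling factor while retaining the fine-grained information in the channel dimension.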

Comparison of detection accuracy of different models
In this experiment, 2000 images from the Caltech Pedestrian dataset were randomly selected as the training set and 400 images as the validation set, and training ran for 300 epochs. The evaluation results of YOLOv5s and YOLOv5s-SPD are shown in Figure 5 and Figure 6, and the specific evaluation values for YOLOv5s and the modified YOLOv5s-SPD are given in Table 1 below. The experimental results show that the improved YOLOv5s gains 3.9% in precision, 6.2% in recall, and approximately 10% in mAP. This shows that the improved YOLOv5s does provide a clear improvement on the small pedestrian target detection problem.
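The mAP figures reported above are built from per-class average precision. As a reference for how such a value is computed, the following is a small, self-contained sketch of AP as the area under the precision envelope of a precision-recall curve (all-point interpolation); the confidence thresholding and box-matching steps that produce the curve are omitted here.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve using the precision
    envelope (all-point interpolation).

    recalls, precisions: parallel sequences of operating points,
    sorted by increasing recall.
    """
    # Add sentinel points at recall 0 and 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Precision envelope: make precision monotonically non-increasing
    # by sweeping from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangular areas wherever recall increases.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))
```

mAP is then the mean of this quantity over all classes (here, the single pedestrian class) and, in COCO-style evaluation, over a range of IoU thresholds as well.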
After training, 267 images were randomly selected from the Caltech Pedestrian test set for testing. A comparison of the specific test results is shown in Figure 7. The small pedestrian target in Figure 7(a) is not detected, whereas in the improved result (b) the same small pedestrian target is correctly detected with an accuracy of 83%. YOLOv5s-SPD thus alleviates missed detections, false detections, and poor detection quality, though the model could be further optimised by expanding the dataset and training for more epochs. In summary, the YOLOv5s-SPD algorithm improves the network's ability to detect small pedestrian targets, with better accuracy and a reduced miss rate.

Conclusion
In this paper, an improved pedestrian detection model for complex scenes based on YOLOv5s is proposed to address the problems of missed detection, false detection, and poor detection results that YOLOv5s exhibits in complex scenes. A new convolutional building block, SPD-Conv, is added to improve the network's accuracy on detection tasks with low-resolution images or smaller objects. Compared with the original network, the improved YOLOv5s-SPD model yields better detection results: accuracy is improved by 3.9% and the mAP value increases by about 9.9%.