Improved Regional Proposal Generation and Proposal Selection Method for Weakly Supervised Object Detection

Abstract: In recent years, object detection has made great progress with the continuous development of deep neural networks. Fully supervised object detection algorithms are now numerous and their performance is close to saturation, whereas object detection in a weakly supervised manner remains considerably more challenging. Mature object detection algorithms rely heavily on strongly labeled datasets, which are expensive to annotate and must be large in order to train a good detection model, so weakly supervised object detection has received more and more attention. In this paper, three modules are introduced into the weakly supervised object detection framework. They generate high-quality proposals and screen them, so that more accurate proposal boxes that benefit subsequent training are finally selected. Their effectiveness is demonstrated on the PASCAL VOC2007 and PASCAL VOC2012 datasets, where the proposed method achieves a significant improvement over existing classic weakly supervised object detection algorithms.


Introduction
Object detection [3,4,8,9,17,18,19,21,27,35], one of the most fundamental tasks in computer vision, has made remarkable progress with the continuous development of convolutional neural networks [10,13,14] in recent years, and its accuracy has reached a very good level. Simply put, object detection extends image classification by enclosing objects in bounding boxes, that is, it both locates and classifies the object instances in an image.
At present, these object detection algorithms rely heavily on precisely annotated large-scale datasets [6,7,20,23,24], and the acquisition of such instance-level strongly labeled datasets is very labor-intensive and costly. In addition, strongly supervised object detection algorithms still have some inevitable limitations, such as the possibility of inadvertently introducing labeling noise during manual annotation, which makes it more difficult for the detector to learn a good model. Therefore, researchers have begun to explore weakly supervised object detection (WSOD), which requires only image-level labeled data for training: the dataset no longer has precise bounding box annotations, only annotations of image categories. Because such labels are very simple and noisy, the performance of weakly supervised object detection is still far from that of strongly supervised object detection, although many methods [1,2,5,25,26,28,29,37,38] have been proposed.
In recent work, a number of approaches have been proposed for solving the WSOD problem. When the dataset only has image-level annotations, most of them formulate the task as a multiple instance learning (MIL) problem. Integrating the idea of multiple instance learning into a CNN can compensate for the deficiencies of the training labels and improve detection performance.
The main problem of weakly supervised object detection is poor localization accuracy caused by the lack of precise labels. The predicted bounding box tends to focus on only a part of the object, and the proposals are generally generated with the SS [30] and EB [41] algorithms, which is very time consuming. Fig. 1 shows this classic problem encountered in weakly supervised object detection.

Figure 1.
Typical WSOD problem. The first, second and third rows show partial, correct and oversized detection results of an object instance, respectively.
Both OICR [2] and PCL [1] are weakly supervised object detectors based on multiple instance learning. Since both use the output of the initial object detector as the ground-truth label, their performance depends heavily on the accuracy of the initial detection results, and they do not learn the key step of bounding box regression. WSOD2 [11] builds on OICR to obtain the initial object bounding boxes; based on the localization of each proposed bounding box, bottom-up object evidence is used to guide the conversion from image-level to instance-level annotation. The challenge of weakly supervised object detection is that the dataset is weakly labeled, with only image-level labels available, as shown in Fig. 2, yet a good detector must be trained on such a dataset and must output both category and location information. The limitation of object detection based on multiple instance learning is that the most discriminative instance is easily picked out, so the network easily falls into a local optimum. How to generate high-quality proposals, and how to select among them, therefore becomes the key to weakly supervised object detection.
This paper proposes a framework that takes OICR [2] as the baseline and introduces three modules for generating and screening proposals. First, high-quality proposals are generated specifically for weakly supervised object detection: the selective search algorithm [30] is combined with an improved version of gradient-weighted class activation mapping [31], and an improved attention module is added on top to extract an enhanced feature map from the CNN, after which ROI pooling processes the generated regions. The proposals are then filtered using a combination of two kinds of evidence, bottom-up and top-down. Finally, they are fed into the basic multi-instance detector, the K-stage instance refiner and the bounding box regression branch for iterative training to improve performance.
The contributions of this paper are summarized as follows:
1. In the proposal generation module, this paper combines a Grad-CAM++ based class activation map with the selective search algorithm and incorporates an improved CBAM attention mechanism, which makes it possible to generate high-quality candidate boxes.
2. In the proposal selection module, in order to better select positive proposals for the weakly supervised object detection task, this paper combines bottom-up object evidence and top-down class confidence scores in a new way to select the most suitable bounding boxes.
3. This paper adds a bounding box regression branch and introduces modules that generate and select proposals respectively, all of which are unified into one weakly supervised object detection framework for end-to-end training.

Weakly supervised object detection
In recent years, weakly supervised object detection has attracted a lot of attention from researchers. The classic WSOD framework, WSDDN, solves the WSOD problem with a multiple instance learning (MIL) approach and contributes a two-stream design that performs object localization and classification simultaneously. However, since only image-level labels can be accessed during training, the most discriminative parts receive more attention than the whole object instance, so the model suffers from the discriminative region problem, which later work has tried to alleviate.
To alleviate the discriminative region problem, the online instance classifier refinement strategy (OICR) [2] takes WSDDN as the baseline and adds three instance classifier refinement stages after it. This improves the performance of weakly supervised object detection, but it also easily falls into a local optimum because only the most discriminative instances are selected for refinement. By combining WSDDN and OICR, Zhang et al. [40] designed a framework that goes from weakly supervised to fully supervised detection, which is also implemented with MIL. PCL is a further improvement of OICR: on top of OICR, it divides all proposals into different bags by proposal clustering and then applies the classifiers for refinement.
Arun et al. [32] designed a new dissimilarity coefficient based WSOD framework that performs the WSOD task by minimizing the difference between the annotation-agnostic prediction distribution and the annotation-aware conditional distribution. Shen et al. [33] proposed weakly supervised joint detection and segmentation (WS-JDS), which combines these two tasks in a multi-task learning framework. Li et al. [34] proposed a segmentation collaboration network that uses segmentation maps as prior information to supervise the learning of object detection. Ze Chen et al. [36] proposed a spatial likelihood voting module that makes the localization of proposals converge without any bounding box annotation. Chenhao Lin et al. [37] proposed an end-to-end object instance mining framework for weakly supervised object detection, which introduces an information propagation mechanism based on a spatial graph and an appearance graph in order to mine all object instances in each image during iterative network learning.

Bounding box regression
Image-level labels only indicate whether an object category appears in the image. To train a standard object detector with a regression task, however, instance-level supervision such as bounding box annotations has to be mined. Yang et al. [22] therefore introduce a MIL branch to obtain pseudo ground-truth annotations and use a WSDDN-based OICR network for end-to-end training; bounding box regression is applied only once, after refinement with multiple box classifiers. C-WSL [28] also explores bounding box regression for weakly supervised object detection networks, as in [22]. Both use bounding box regression in an online fashion, and C-WSL uses a box regressor to refine each box classifier after the MIL branch.
Bounding box regression is a key step in object detection for predicting the rectangular boxes that locate objects, so almost all recent fully supervised object detectors [3,4,12,13,21,24] use it to reduce the localization error of the predicted boxes. In weakly supervised learning, however, the data lacks bounding box labels, so only a small number of works have introduced bounding box regression into detection, and some of them treat it as a post-processing module.

Attention mechanism
Attention modules first appeared in natural language processing and were later introduced into computer vision. Mixed spatial and channel attention mechanisms are widely used in weakly supervised object detection because they not only focus on important parts of the image but also assign more weight to important channels.
In this paper, the attention module of CBAM [39] is adopted and improved so that it embeds better in the network. Attention mechanisms resemble human attention in that both focus on one part of the information and ignore the rest. The network first learns new features through channel attention, then passes them serially to the spatial attention module to learn the locations of key features, so that discriminative image features are acquired and the network attends adaptively.

Method
In this section, we describe in detail the introduced proposal generation module and the proposal selection and attention modules.

Figure 3.
Network structure of our method. Each proposal feature is extracted using a VGG16 base network. Then, the proposal features are passed through two fully connected layers and the generated feature vectors are branched to the basic MIL module and to a new module (reclassification branch). In the basic MIL module, there is one WSDDN branch and three refinement branches. The average classification scores of the three refinement branches are input to the new module to generate supervision.
The overall architecture of the proposed network framework is shown in Fig. 3. The framework is based on OICR and introduces three modules for generating and filtering proposals. First, high-quality proposals are generated specifically for weakly supervised object detection: in the proposal generation part, the selective search algorithm is combined with the improved gradient-weighted class activation mapping. On this basis, an attention module is added to extract enhanced feature maps from the CNN, and the enhanced feature maps are sent to the ROI pooling layer to process the generated regions. In the proposal selection module, the proposals are screened by combining low-level and high-level semantic information and are sent to the basic multi-instance detector, the K-stage instance refiner and the bounding box regression branch for iterative training, so as to improve performance.
The input image passes through the convolution layers, ReLU activations and pooling layers of the convolutional neural network to produce the feature map from which proposal features are later extracted. The selective search algorithm is combined with Grad-CAM++ to generate proposals, and an improved CBAM attention module is added to generate an enhanced feature map. The proposals and the enhanced feature map are sent to the ROI pooling layer to generate a 7×7 ROI-pooled feature. Finally, the feature vector is processed by the multiple instance learning module and the refinement branches for subsequent classification and bounding box regression, and the object category and localization predictions are output. During the forward pass of training, the extracted proposal features are sent to the basic MIL module to generate proposal score matrices. The proposal selection module then selects more plausible positive proposals, and these proposal score matrices are used to supervise subsequent training.

Proposal generation
First, the VGG16 model is used to train the basic multi-instance classifier with only image-level labels, using the multi-label cross-entropy loss in Eq. (1):

$$L_{cls} = -\sum_{i=1}^{c}\left[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\,\right] \quad (1)$$

where $c$ is the total number of image categories, $y_i$ is the label of the $i$-th image category, and $p_i$ is the prediction of the $i$-th sigmoid classifier. For each image containing a given category, a group of feature maps is weighted and combined using the basic multi-instance classifier to obtain its category-specific activation map, as in Eq. (2), where $A^k$ is the $k$-th convolutional feature map and $w_k^c$ is the importance of feature map $A^k$ for that class, computed as in Eq. (3).
Grad-CAM++ improves on Grad-CAM: it locates the complete object more accurately and gives a better representation when the image contains multiple objects. It obtains the importance of each pixel in the feature map mainly by adding a ReLU and pixel-level weighting to the feature-map weights of the corresponding class, which yields more accurate position information. This paper combines Grad-CAM++ with SS to generate a large number of category-specific object proposals with higher overlap with the objects.
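To make the weighted combination of feature maps concrete, the following minimal Python sketch computes a class activation map as the ReLU of a weighted sum of feature maps and then derives one box around its strongest region. It is an illustration under assumptions, not the paper's implementation: the weights are simply passed in rather than computed with Grad-CAM++, the ReLU follows the usual Grad-CAM convention, and `box_from_map` with its `rel_threshold` parameter is a hypothetical way of turning an activation map into a region, since the text does not specify how the map is combined with selective search.

```python
def activation_map(feature_maps, weights):
    """ReLU of the weighted sum of K feature maps, each an H x W nested list."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, w in zip(feature_maps, weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w * fmap[i][j]
    # ReLU keeps only the locations that support the class.
    return [[max(v, 0.0) for v in row] for row in cam]


def box_from_map(cam, rel_threshold=0.5):
    """Tightest (x1, y1, x2, y2) box, in feature-map cells, around strong activations.

    Hypothetical helper: the thresholding rule is an assumption of this sketch.
    """
    peak = max(max(row) for row in cam)
    if peak <= 0.0:
        return None  # no positive evidence for this class
    cells = [(i, j) for i, row in enumerate(cam)
             for j, v in enumerate(row) if v >= rel_threshold * peak]
    rows = [i for i, _ in cells]
    cols = [j for _, j in cells]
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


# Toy input: two 4 x 4 feature maps and made-up class weights.
feature_maps = [
    [[0, 0, 0, 0], [0, 2, 2, 0], [0, 2, 2, 0], [0, 0, 0, 0]],
    [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]],
]
weights = [1.0, -0.5]
cam = activation_map(feature_maps, weights)
box = box_from_map(cam)
```

In a real pipeline the box would be rescaled from feature-map cells to image coordinates before being compared with selective search proposals.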

Attention Module
To generate higher-quality candidate proposals, an attention module is added on top of the proposal generation method described above. We first briefly describe its structure. The proposal feature maps generated by the SS algorithm combined with Grad-CAM++ are used as input to the attention module and are enhanced by a modified CBAM module. Fig. 4 shows a schematic diagram of the structure of CBAM. The input feature map F has size H×W×C. Global information is extracted over the width and height dimensions by a global average pooling layer and a global max pooling layer, each producing a 1×1×C feature map, and these are fed into a two-layer neural network with shared weights, i.e., a Multi-Layer Perceptron (MLP), which learns inter-channel dependencies. Dimensionality reduction between the two layers is controlled by the compression ratio r. The channel attention weights are given by Eq. (4):

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big) \quad (4)$$

where $W_0$ and $W_1$ are the fully connected weights of the two MLP layers, with a hidden layer and a ReLU activation in between, and $\sigma$ denotes the sigmoid activation function. As shown in Fig. 4, spatial attention takes the output feature map of the channel attention module as its input and focuses on the most informative locations, which complements channel attention. First, max pooling and average pooling are carried out along the channel dimension to fuse the information of different channels at the same position, which serves as the feature information of that position. Then, the two pooled maps are concatenated along the channel dimension, and a heat map of spatial importance is obtained through convolution.
Finally, the heat map is passed through the sigmoid activation function and multiplied by the original input to obtain the calibrated feature map $M_s(F)$, which encodes the positions to be emphasized or suppressed.
As shown in Eq. (5), two feature maps, $F^s_{avg}$ and $F^s_{max}$, are obtained by the two pooling operations:

$$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big) \quad (5)$$

These two feature maps are concatenated along the channel dimension, and the result is reduced to a single-channel feature map by $f^{7\times 7}$, a convolution with a 7×7 filter. Finally, the sigmoid turns the learned dependencies between spatial elements into the spatial attention weights.
In the improved module, the fused features after concatenation are sent to the MLP, which is composed of two fully connected layers. The first fully connected layer reduces the dimension of the input features $X$ to obtain features $Y_0$, and the second fully connected layer restores the dimension to obtain the output features $Y_1$, as shown in Eq. (7) and Eq. (8). The first fully connected layer of the improved attention module therefore has more weight parameters, and the model performance is enhanced accordingly. This improved attention module is embedded into the weakly supervised network architecture proposed in this paper to achieve better detection results.
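The channel and spatial attention steps described above can be sketched with NumPy as follows. This is a minimal sketch of the standard CBAM formulation (shared MLP over average-pooled and max-pooled descriptors, then a single convolution over two channel-pooled maps), not of the modified MLP in Eqs. (7) and (8), whose exact form is not reproduced here. The weight shapes, the random initialization and the zero padding are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w0, w1):
    """feat: (C, H, W); w0: (C // r, C); w1: (C, C // r). Returns (C,) weights."""
    avg = feat.mean(axis=(1, 2))  # global average pooling -> (C,)
    mx = feat.max(axis=(1, 2))    # global max pooling -> (C,)

    def mlp(v):
        # Shared two-layer MLP with a ReLU in the hidden layer.
        return w1 @ np.maximum(w0 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))


def spatial_attention(feat, kernel):
    """feat: (C, H, W); kernel: (2, k, k) with odd k. Returns (H, W) weights."""
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    height, width = feat.shape[1:]
    out = np.zeros((height, width))
    for i in range(height):
        for j in range(width):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)


def cbam(feat, w0, w1, kernel):
    # Channel attention first, then spatial attention on the refined features.
    refined = feat * channel_attention(feat, w0, w1)[:, None, None]
    return refined * spatial_attention(refined, kernel)[None, :, :]


rng = np.random.default_rng(0)
C, H, W, r = 8, 6, 6, 4
feat = rng.standard_normal((C, H, W))
w0 = rng.standard_normal((C // r, C)) * 0.1
w1 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1
out = cbam(feat, w0, w1, kernel)
```

Because both attention maps lie in (0, 1), the output keeps the shape of the input and never exceeds it in magnitude; the module only rescales features.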

Basic Multiple Instance Detector
This paper takes OICR as its main framework. OICR consists of two parts: the first trains the basic instance classifier, the multiple instance detection network (MIDN), which is derived from the WSDDN network; the second is the set of refinement classifiers, each of which is supervised by the output of the previous stage. On this basis, three modules are introduced, namely the proposal generation, proposal selection and attention modules. First, Grad-CAM++ is combined with the SS algorithm to generate candidate boxes, and an improved attention module is added to achieve better results. Because learning is weakly supervised, only image-level annotations can be used; that is, the dataset contains classification information but no location information. To better understand the semantic information inside the image, the feature map has to be examined at the region level and the characteristics of each bounding box analyzed. A basic detector is first used to obtain preliminary detection results; following the idea of WSDDN, it is optimized with multiple instance learning by transforming the weakly supervised object detection problem into a multi-label classification problem. The proposal scores obtained from the basic detector guide the first stage of the multi-stage instance refiner, and each stage of the refiner is supervised by the output of the previous stage. Through multiple refinements, a larger part of the object is gradually detected, as shown in Fig. 6.
The region features $x$ are then fed into two streams through two separate fully connected layers, producing two feature matrices denoted $x^{cls}$ and $x^{det}$. The proposal score $x^R$ is generated by element-wise multiplication of the outputs of the two streams, and the image score for category $c$ is obtained by summing the scores of all proposals. Given the image labels, the multi-label classification task is trained in this phase with the standard multi-class cross-entropy loss, formulated as Eq. (13), and the instance classifier is then obtained from the proposal scores $x^R$.
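The two-stream scoring just described can be sketched in plain Python as below. The sketch assumes the standard WSDDN formulation that OICR inherits: a softmax over classes in the classification stream, a softmax over proposals in the detection stream, their element-wise product as the proposal score matrix, and a sum over proposals as the image score. The softmax normalization and the function names are assumptions, since the corresponding equations are not reproduced in the text.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def two_stream_scores(x_cls, x_det):
    """x_cls, x_det: C x R nested lists (classes x proposals)."""
    num_classes, num_props = len(x_cls), len(x_cls[0])
    # Classification stream: softmax over classes, one column per proposal.
    cls_cols = [softmax([x_cls[c][r] for c in range(num_classes)])
                for r in range(num_props)]
    # Detection stream: softmax over proposals, one row per class.
    det_rows = [softmax(x_det[c]) for c in range(num_classes)]
    # Element-wise product gives the proposal score matrix x^R.
    x_r = [[cls_cols[r][c] * det_rows[c][r] for r in range(num_props)]
           for c in range(num_classes)]
    # Summing over proposals gives one score per class for the whole image.
    image_scores = [sum(row) for row in x_r]
    return x_r, image_scores


def multilabel_bce(image_scores, labels, eps=1e-8):
    """Multi-label cross-entropy between image scores and image-level labels."""
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(image_scores, labels))


# Toy example: 2 classes, 3 proposals, image labeled with class 0 only.
x_cls = [[2.0, 0.1, 0.3], [0.5, 1.5, 0.2]]
x_det = [[3.0, 0.2, 0.1], [0.1, 2.5, 0.3]]
x_r, image_scores = two_stream_scores(x_cls, x_det)
loss = multilabel_bce(image_scores, labels=[1, 0])
```

Each image score lies strictly between 0 and 1 because the detection stream sums to one over proposals while every classification probability is below one, which is what allows it to be trained as a per-class probability.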

Proposal Selection
After all the region proposal boxes and their scores have been obtained through the above modules, how to adaptively select high-quality proposals becomes the key. Because the data lacks accurate location labels, it is difficult for a weakly supervised object detector to select the most suitable bounding box from all the proposals of an object. The proposals with the highest classification scores usually cover only discriminative parts of the object, while many proposals that cover a larger part of it often have lower scores. Inspired by WSOD2, a simple strategy that incorporates low-level semantic information is used to train the weakly supervised object detector. Low-level semantic information summarizes the boundary characteristics of common objects, which helps compensate for the weakness of CNNs in discovering boundaries.
In the OICR method, given an image that contains the target category, only the candidate box with the highest class score and the candidate boxes that spatially overlap it are selected as positives, and all the rest are treated as negatives. If the image contains multiple objects of the same category, however, positive and negative examples cannot be distinguished well, which leads to the omission of some genuine and valuable positive proposals and the introduction of some inaccurate negative ones. In this paper, a simple but effective proposal selection strategy is proposed, which incorporates low-level semantic information to screen high-quality proposals.
First, a simple strategy is used to select the candidate proposals with high scores, and the low-level semantic information is then used to screen and adjust them. Specifically, given an input image, a group of proposals is generated through the above modules. For category $c$, each proposal $r$ has an objectness score [42], denoted $O_{bu}(r)$. The proposals are selected as follows:
1. If the image label for class $c$ is positive, i.e., the image contains at least one object of class $c$, the proposal $j_c$ with the highest score is selected and given the pseudo-label of class $c$.
2. If the IoU between a proposal box and $j_c$ is higher than the defined threshold (IoU = 0.5) and its $O_{bu}(r)$ is the highest, the proposal box is labeled as category $c$.
3. Among the proposals other than those already selected, the highest-scoring proposal box is selected next, as described above.
4. This step is repeated until the IoU of the highest-scoring proposal box is higher than 0.5.
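One possible reading of the selection steps above is sketched below in plain Python. It is a hedged interpretation rather than the authors' exact procedure: the stopping rule is read as "stop when the best remaining proposal overlaps an already selected seed by more than the threshold", boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names `iou` and `select_positives` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_positives(boxes, cls_scores, objectness, iou_thr=0.5):
    """Indices of proposals pseudo-labeled as positive for one class."""
    order = sorted(range(len(boxes)), key=lambda i: cls_scores[i], reverse=True)
    seeds = []
    for idx in order:
        # Steps 3-4 (assumed reading): stop once the best remaining
        # proposal overlaps an earlier seed by more than the threshold.
        if any(iou(boxes[idx], boxes[s]) > iou_thr for s in seeds):
            break
        seeds.append(idx)  # step 1 for the first seed, step 3 afterwards
    positives = set(seeds)
    for s in seeds:
        # Step 2: among boxes overlapping the seed, keep the one with
        # the highest bottom-up objectness score.
        neighbours = [i for i in range(len(boxes))
                      if i != s and iou(boxes[i], boxes[s]) > iou_thr]
        if neighbours:
            positives.add(max(neighbours, key=lambda i: objectness[i]))
    return sorted(positives)


# Toy example: two objects of the same class, four proposals.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 9, 9)]
cls_scores = [0.9, 0.5, 0.8, 0.7]
objectness = [0.2, 0.9, 0.5, 0.1]
selected = select_positives(boxes, cls_scores, objectness)
```

In this toy case the second object at `(50, 50, 60, 60)` becomes its own seed instead of being treated as a negative, which is the situation with multiple same-class objects that the text sets out to handle.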

Object Detector Refinement
After the basic multi-instance learning detector, K classifier branches, Cls 1 to Cls K, are trained iteratively in the instance refinement stage. The final bounding box is obtained by box regression after the last classifier branch Cls K. Each classifier outputs pseudo-labels that supervise the next classifier, so in the whole process only the initial classifier Cls 0, i.e., the basic multi-instance learning detector, uses the real image labels. The subsequent K classifiers are trained with pseudo-labels, and the loss function of the k-th classifier is given in Eq. (15), which involves the prediction and the label of proposal $r$ for the $c$-th category; the label is the pseudo-label generated by the previous classifier. The key element is the weighting factor, which is obtained from the top-down and the bottom-up information, as shown in Eq. (17). The bottom-up object evidence is the objectness score mentioned above, which combines four measures: MS (Multi-scale Saliency), CC (Color Contrast), ED (Edge Density) and SS (Superpixels Straddling). The top-down evidence is the classification score computed from the output of the previous, i.e., the (k-1)-th, classifier, as shown in Eq. (18). A manually set balance factor controls the relative weight of these two kinds of information. Intuitively, this loss function penalizes the current classifier's result for each proposal against the pseudo-label generated by the previous classifier, and the higher the weight of a proposal, the stronger the penalty.
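The weighted refinement loss can be illustrated with the short Python sketch below. It assumes that the weighting factor is a linear blend of the bottom-up and top-down evidence controlled by the balance factor, as in WSOD2, and that the loss is a weighted cross-entropy averaged over proposals; both are assumptions standing in for Eqs. (15), (17) and (18), and the names `proposal_weight`, `refinement_loss` and `balance` are hypothetical.

```python
import math


def proposal_weight(bottom_up, top_down, balance):
    """Linear blend of the two kinds of evidence; `balance` lies in [0, 1]."""
    return balance * bottom_up + (1.0 - balance) * top_down


def refinement_loss(probs, pseudo_labels, weights, eps=1e-8):
    """Weighted cross-entropy averaged over proposals.

    probs, pseudo_labels: one row per proposal, one column per class;
    weights: one weighting factor per proposal.
    """
    total = 0.0
    for p_row, y_row, w in zip(probs, pseudo_labels, weights):
        total -= w * sum(y * math.log(p + eps) for p, y in zip(p_row, y_row))
    return total / len(probs)


# Toy example: 2 proposals, 3 classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
pseudo_labels = [[1, 0, 0], [0, 0, 1]]
bottom_up = [0.9, 0.2]   # objectness scores
top_down = [0.8, 0.3]    # scores from the previous classifier
weights = [proposal_weight(b, t, balance=0.5)
           for b, t in zip(bottom_up, top_down)]
loss = refinement_loss(probs, pseudo_labels, weights)
```

A proposal with a larger weight contributes more to the loss for the same prediction error, which matches the intuition that confident, well-bounded proposals should dominate the supervision.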

Bounding Box Regression
Because learning is weakly supervised, the dataset contains no strong supervision. OICR relies on the location of the highest-scoring region proposal in the multiple instance learning branch, but this label is coarse, and such a coarse prediction inevitably harms the detector. This paper therefore adds a regression branch to the previous module.
Although convolutional neural networks learn features well, they have shortcomings in discovering boundaries, so during training we explore how to use bottom-up object evidence to guide the update of the object's bounding box.
An object detector is in effect a bounding box ranking function, in which an important factor is the objectness measure. In weakly supervised object detection, if the classification confidence is used as the objectness score, the shortcoming is that even very good detectors have difficulty distinguishing complete objects from salient object parts or irrelevant background. What matters most for detection is clear boundaries and centers. We therefore expect to eventually find a bounding box that encloses the complete object, and the bottom-up object features mentioned above, which capture object boundaries, compensate for the CNN's deficiency in this respect.
The localization loss uses an L1, L2 or smooth L1 loss to regress the four coordinate values. The goal of the regressor is to output, for each box, a correction for each of the four parameters x, y, w, h:

$$t_r = (t_r^x, t_r^y, t_r^w, t_r^h) \quad (19)$$

There are K classifier branches in total, Cls 1 to Cls K, and the final bounding box is obtained by bounding box regression after the last classifier branch Cls K, as shown in Eq. (20). The final loss function combines the losses of all the modules above.
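As an illustration of the four regression targets and the smooth L1 option mentioned above, the following Python sketch encodes a pseudo ground-truth box relative to a proposal and evaluates the loss. It assumes the usual R-CNN style parameterization (center offsets normalized by proposal size, log-scale width and height); the source does not state which parameterization is used, and `encode_box`, `smooth_l1` and `beta` are hypothetical names.

```python
import math


def encode_box(proposal, target):
    """Offsets (t_x, t_y, t_w, t_h) that map `proposal` onto `target`.

    Boxes are (x1, y1, x2, y2); the R-CNN parameterization is an assumption.
    """
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    px, py = proposal[0] + 0.5 * pw, proposal[1] + 0.5 * ph
    gw, gh = target[2] - target[0], target[3] - target[1]
    gx, gy = target[0] + 0.5 * gw, target[1] + 0.5 * gh
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over the four box parameters."""
    loss = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear for large errors.
        loss += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return loss


proposal = (0.0, 0.0, 10.0, 10.0)
pseudo_gt = (5.0, 5.0, 15.0, 15.0)
targets = encode_box(proposal, pseudo_gt)
loss = smooth_l1((0.0, 0.0, 0.0, 0.0), targets)
```

Here the pseudo ground-truth box would come from the proposal selection step, since no annotated boxes are available in the weakly supervised setting.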

Overall Training Framework
First, given an image, a set of region proposals R is generated by selective search combined with Grad-CAM++, and the region features are extracted by the CNN, the ROI pooling layer and two fully connected layers. The region features then enter two streams through two fully connected layers, one classification stream and one localization stream, and the region proposal scores are obtained by element-wise multiplication of the two streams. Next, the scores are aggregated over the proposals in R to obtain the image-level classification vector, and the image-level labels are used as supervision to train the network by optimizing a binary cross-entropy loss, with bottom-up object evidence summarizing the boundary features.

Experimental setup
The proposed method is evaluated on two object detection benchmarks: PASCAL VOC2007 and PASCAL VOC2012. The bounding box annotations provided with the datasets are removed, and only the images and their classification labels are used for training.
PASCAL VOC2007 and PASCAL VOC2012 are the most widely used benchmarks for weakly supervised object detection. Detection performance is measured by the mean average precision (mAP) over all object classes, and CorLoc, a widely used WSOD localization metric, is also reported. Precision, recall and mAP can all be used to evaluate the performance of an object detection algorithm. The mAP and CorLoc reported in this paper follow the PASCAL VOC criterion, i.e., a predicted box counts as correct when its IoU with the ground-truth box is greater than 0.5.
Region proposals are generated by combining the selective search algorithm with Grad-CAM++, and the proposal features are fed into a modified CBAM attention module. VGG16 is used as the base network, trained with stochastic gradient descent (SGD) with an initial learning rate of 0.001, weight decay of 0.0005 and momentum of 0.9. On the VOC2007 dataset, the total number of iterations is set to 80,000, and the learning rate is reduced to 0.0001 at the 40,000th step. On the VOC2012 dataset, the number of iterations is doubled and the learning rate decay step is moved to the 80,000th step. The multi-scale settings of PCL and OICR are followed in training: the short edge of the input image is randomly rescaled to one of {480,576,588,864,1280}, and the long edge is restricted to no more than 2000.
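The training schedule described above can be written down as a small configuration sketch. The numbers are taken from the text; the dictionary layout, the function name `learning_rate` and the convention that the decayed rate applies from the decay step onward are assumptions of this sketch, not details confirmed by the source.

```python
# Optimizer settings stated in the text.
SGD_CONFIG = {"base_lr": 0.001, "decayed_lr": 0.0001,
              "weight_decay": 0.0005, "momentum": 0.9}

# Iteration budgets: VOC2012 doubles both values used for VOC2007.
SCHEDULES = {
    "VOC2007": {"total_steps": 80_000, "decay_step": 40_000},
    "VOC2012": {"total_steps": 160_000, "decay_step": 80_000},
}


def learning_rate(step, dataset="VOC2007"):
    """Step-wise schedule: base rate before the decay step, decayed rate after."""
    schedule = SCHEDULES[dataset]
    if step < schedule["decay_step"]:
        return SGD_CONFIG["base_lr"]
    return SGD_CONFIG["decayed_lr"]
```

For example, `learning_rate(39_999)` returns the base rate, while `learning_rate(40_000)` returns the decayed rate on VOC2007 and still the base rate on VOC2012.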

Ablation experiments
To verify the effectiveness of the three modules, namely proposal generation (PG), proposal selection (PS) and the CBAM module, ablation experiments on the improved weakly supervised object detection network are carried out on the PASCAL VOC2007 test set. The best detection results are shown in bold, and Table 1 reports the performance of the three modules on weakly supervised object detection for the 20 categories. Table 1 shows that each sub-module added to the proposed model improves its performance to a certain extent. Adding the PG and PS modules to the MIL baseline clearly improves the model, and adding the improved CBAM module on top of them improves performance further, which shows that the improvement to CBAM is useful.
In Table 2, the first column gives the result obtained when proposals generated directly by the Selective Search algorithm are sent to the basic multi-instance detector. The second column introduces the PG (Proposal Generation) and PS (Proposal Selection) modules on top of the MIL detector. The third column additionally applies the improved CBAM module after the candidate boxes are generated to enhance the feature map, i.e., it adds both the PG-PS and CBAM modules to the MIL detector; this is the improved method proposed in this paper, and it clearly improves on the previous settings.

Table 2 shows that each sub-module added to the proposed weakly supervised object detection model improves its performance. Starting from the OICR baseline, adding the PG and PS modules yields a significant improvement, and adding the improved CBAM module on top of them improves performance further, which indicates that our modification of CBAM is effective.
Table 3 and Table 4 show, respectively, the detection performance and the localization performance of the proposed model and of other weakly supervised object detection models on the 20 categories of the VOC 2007 dataset. Bold entries mark the highest accuracy in each category. Our model achieves the highest detection accuracy in 11 categories (aircraft, birds, cars, cats, chairs, cows, dogs, horses, sheep, sofas and TVs) and the highest localization accuracy in 7 categories (birds, cats, chairs, cows, dogs, horses and sheep). This significantly alleviates the partial localization problem that is especially common in animal categories, in which only the head of the animal is detected and the rest of the body is ignored. The per-category comparison in Table 4 shows that localization improves over other methods in categories such as birds and cats, which supports the effectiveness of the proposed method.
Table 5 compares the detection accuracy and localization performance of the proposed model with those of other weakly supervised object detection models on the VOC 2012 dataset. The proposed method improves on the other methods here as well, which further supports the effectiveness of this work.
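The positioning (localization) comparison above can be made concrete with a small sketch. It assumes that localization performance is measured in the way common for weakly supervised detection (CorLoc): an image counts as correctly localized when the top-scoring box for a class overlaps a ground-truth box with IoU of at least 0.5. The metric name, the threshold, and the box coordinates below are assumptions for illustration and are not taken from the tables in this paper. The example also shows why a box covering only the head of an animal is counted as a miss.

```python
from typing import List, Tuple

# A box is (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(top_boxes: List[Box], gt_boxes: List[List[Box]],
           thresh: float = 0.5) -> float:
    """Fraction of images whose top-scoring box matches any ground-truth box.

    top_boxes[i] is the highest-scoring prediction for the class in image i,
    gt_boxes[i] is the list of ground-truth boxes of that class in image i.
    """
    if not top_boxes:
        return 0.0
    hits = sum(
        any(iou(pred, gt) >= thresh for gt in gts)
        for pred, gts in zip(top_boxes, gt_boxes)
    )
    return hits / len(top_boxes)


# Hypothetical boxes: the ground truth covers the whole animal.
whole_animal = (10.0, 10.0, 110.0, 110.0)
head_only = (10.0, 10.0, 50.0, 50.0)      # partial detection, low IoU
full_body = (15.0, 12.0, 112.0, 108.0)    # near-complete detection, high IoU

score = corloc([head_only, full_body], [[whole_animal], [whole_animal]])
```

Under these assumptions the head-only box has a low IoU with the ground truth and is rejected, so only one of the two hypothetical images is counted as correctly localized.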

Comparison with other methods
Compared with the WSOD2 model, the method proposed in this paper improves the quality of the proposal boxes through both proposal generation and proposal selection. The improved attention module also benefits subsequent training and raises the performance of the weakly supervised object detection model. The PCL algorithm addresses the multi-instance problem in weakly supervised object detection by clustering proposals according to whether they overlap, whereas our method makes several improvements to proposal generation and uses a simple and effective proposal selection strategy, which gives it an advantage. The comparison with existing methods supports the effectiveness of the proposed algorithm, and the performance gain comes mainly from the improved proposals. Fig. 7 shows some detection results of the proposed model on the PASCAL VOC2007 dataset, where the green boxes are the ground-truth annotations and the red boxes are the detections of the proposed algorithm. The predictions are generally close to the ground truth. For people, the detector may still fall into a local optimum in which the detection box is too small, but this does not occur for other objects, which is an improvement over OICR and WSOD2.

Conclusion
In this paper, a detection method based on multi-instance learning is first used to obtain the initial object bounding boxes; weakly supervised object detection is treated as a multi-instance learning problem in which each input image is a bag of object proposals. Three modules that can be embedded in a weakly supervised object detection framework are then introduced to generate high-quality proposals and filter them. Finally, more accurate proposals that benefit subsequent training are selected. The effectiveness of these modules is demonstrated on the PASCAL VOC 2007 and PASCAL VOC 2012 datasets, where the method significantly improves on existing weakly supervised object detection algorithms.
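The multi-instance learning view summarized above, in which an image is a bag of proposals and only image-level labels are available, can be illustrated with a minimal sketch. It assumes the widely used two-stream formulation (as in WSDDN, on which OICR-style detectors build): one softmax over classes for each proposal, one softmax over proposals for each class, and a sum over proposals to obtain image-level scores. The function names and the toy logits are hypothetical, and this is not presented as the exact detector used in this paper.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # rows = proposals, columns = classes


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mil_scores(cls_logits: Matrix, det_logits: Matrix) -> Tuple[Matrix, List[float]]:
    """Combine two streams into proposal scores and image-level scores."""
    n, c = len(cls_logits), len(cls_logits[0])
    # Classification stream: which class does each proposal look like?
    cls = [softmax(row) for row in cls_logits]
    # Detection stream: which proposal is most relevant for each class?
    det = [softmax([det_logits[i][k] for i in range(n)]) for k in range(c)]
    proposal_scores = [[cls[i][k] * det[k][i] for k in range(c)] for i in range(n)]
    # Summing over the bag gives one score per class, each in [0, 1].
    image_scores = [sum(proposal_scores[i][k] for i in range(n)) for k in range(c)]
    return proposal_scores, image_scores


def image_level_loss(image_scores: List[float], labels: List[int],
                     eps: float = 1e-6) -> float:
    """Binary cross-entropy against image-level labels only (no boxes)."""
    loss = 0.0
    for s, y in zip(image_scores, labels):
        s = min(max(s, eps), 1.0 - eps)
        loss -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return loss


# Toy bag: 3 proposals, 2 classes; proposal 0 strongly supports class 0.
toy_cls = [[4.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
toy_det = [[4.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
toy_proposal_scores, toy_image_scores = mil_scores(toy_cls, toy_det)
```

In this sketch the proposal with the highest score for a class that is present in the image would serve as the initial bounding box, which is why the quality of the proposals in the bag directly limits the quality of the detector.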