Multi-Stage Transformer 3D Object Detection Method

Abstract: With the development of autonomous driving, 3D object detection has attracted great attention. As the light detection and ranging (LiDAR) sensor can precisely measure the distance between the environment and itself, it has become the key component of current 3D object detection methods. However, the varying density and unstructured storage of LiDAR point clouds make feature learning difficult. To tackle this problem, this paper proposes a multi-stage transformer 3D object detection method. This method includes a fast transformer-based 3D encoder and a multi-stage transformer decoder. Extensive experiments demonstrate that our method can surpass other current 3D object detection methods by a clear margin.


Introduction
3D object detection from point clouds has flourished and become a practical solution for robotic vision. Different from images, the point cloud generated by LiDAR precisely measures the distance from the sensor to the environment, which is crucial for 3D object detection. However, the varying density of LiDAR point clouds makes feature representation learning challenging.
Like detection methods on 2D images, 3D detectors can be divided into two groups in terms of model structure: one-stage detectors and two-stage detectors. As two-stage detectors can take advantage of multi-stride features through their region of interest (ROI) head, they usually achieve better classification confidence and box regression accuracy than one-stage detectors. In these two-stage methods, a region proposal network first provides bounding box proposals. Then, based on those proposals, the ROI head performs an ROI pooling operation over the original point cloud and a temporary voxel space to refine them. Through a virtual point sampling strategy, these ROI heads can learn more valuable geometry information from objects that are far from the LiDAR sensor or under severe occlusion. However, the computation brought by these ROI heads makes it challenging for two-stage detectors to meet the real-time requirements of autonomous driving.
Different from two-stage detectors, one-stage detectors usually achieve better speed. As the size of objects is invariant in the 3D scene, one-stage 3D detectors usually do not consider pyramid features. Yet, recent work on 2D tiny object detection shows that pyramid features can help detectors perceive the difference between the local and global context of tiny objects, and thus achieve better accuracy. This suggests that neglecting pyramid features causes the poor performance of one-stage 3D detectors on small objects. This motivates us to design a one-stage 3D detector that pays more attention to pyramid features.
This paper presents a one-stage object detection framework consisting of a fast version transformer 3D encoder and a multi-stage transformer decoder module. An anchor-free center-point head is then adopted to predict the final bounding boxes. In each proposed fast version transformer 3D encoder block, a fast version multi-head self-attention module is deployed. To help the proposed 3D detector achieve better performance, we design a multi-stage transformer 3D decoder module, which consists of fast version transformer decoder blocks. The proposed multi-stage transformer decoder module collects multi-stride information from the temporal outputs of the transformer 3D encoder. Extensive experiments conducted on the KITTI and Waymo Open Datasets show that our proposed method achieves a better balance between speed and accuracy.
In summary, our contributions are three-fold. First, we propose a one-stage 3D object detector that uses a multi-stride residual 3D backbone and a path aggregation 2D backbone to make full use of pyramid features. Second, we design a voxel-level auxiliary network, which removes the need for voxel-to-point operations and achieves better generality. Third, we conduct extensive experiments on both the KITTI Dataset and the Waymo Open Dataset; the results show that our method achieves a great balance between accuracy and computation cost.

Methods
In this section, we first present the fast transformer 3D encoder. Then, a transformer decoder is introduced to integrate multi-stage information from the 3D encoder. We let $p_i$ denote the centroid of a voxel cell. As the transformer block does not encode the position information of the input point cloud sequence, we adopt a position embedding. As mentioned in [], the Fourier position embedding can help a deep learning module learn high-dimension texture information in a low-dimension feature space. In this paper, we adopt this Fourier position embedding method to help our transformer module efficiently learn high-dimension information. For a given point $p_i$, the position embedding process can be described as:
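As a concrete illustration, the Fourier position embedding step can be sketched as follows. This is a minimal sketch only: the number of frequency bands, the frequency spacing, and the scaling by $\pi$ are assumptions, since the paper does not specify them.

```python
import numpy as np

def fourier_position_embedding(points, num_bands=8, max_freq=10.0):
    """Map 3D voxel centroids to a higher-dimensional Fourier feature space.

    points: (N, 3) array of voxel centroids.
    Returns: (N, 3 * 2 * num_bands) embedding (a sin and a cos per band).
    The frequency layout (linear spacing up to max_freq) is an assumption.
    """
    freqs = np.linspace(1.0, max_freq, num_bands)            # (num_bands,)
    scaled = points[..., None] * freqs * np.pi               # (N, 3, num_bands)
    emb = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return emb.reshape(points.shape[0], -1)                  # (N, 6 * num_bands)
```

The embedded vector is later concatenated with the voxel feature vector before entering the encoder.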

Fast Transformer 3D Encoder
Then, we concatenate the Fourier position vector with the voxel feature vector to obtain the input feature set $F$. With these embedded feature sequences $F$ and positions $V$, the proposed fast transformer 3D encoder extracts abstract semantic features. As shown in Fig. 1, our fast transformer 3D encoder consists of a voxel-level group block and a transformer encoder block. The transformer block can be described as:
$$\tilde{F}_l = \mathrm{LN}\big(F_{l-1} + \mathrm{FMSA}(F_{l-1} W_Q, F_{l-1} W_K, F_{l-1} W_V)\big), \qquad F_l = \mathrm{LN}\big(\tilde{F}_l + \mathrm{FFN}(\tilde{F}_l)\big),$$
where $F_{l-1}$ denotes the input feature sequence, $\tilde{F}_l$ denotes the temporal feature, $\mathrm{FMSA}$ denotes the fast version multi-head self-attention module, $\mathrm{LN}$ denotes the layer normalization, and $W_Q$, $W_K$, $W_V$ denote the projection matrices, which project the input feature sequence into a high-dimension space. As mentioned in [], the softmax operation consumes relatively large computing resources. To reduce the overhead of the model and improve its adaptability to autonomous driving scenarios, this paper uses a cosine function to replace the softmax operation. Thus, the fast version multi-head self-attention module can be described as:
$$\mathrm{Attention}(Q, K, V) = \big(\cos(Q, K) / \beta + B\big) V,$$
where $B$ is the relative position bias matrix of each point and $\beta$ is a learnable scalar, non-shared across heads and layers; its initial value should be set larger than 0.01. The cosine function is naturally normalized, and thus yields milder attention values.
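The cosine-based attention above can be sketched as follows. This is a single-head illustration under assumptions: `beta` is a fixed float here rather than a learnable parameter, and the small epsilon in the normalization is added for numerical safety.

```python
import numpy as np

def cosine_attention(Q, K, V, B, beta=1.0):
    """Fast attention variant: cosine similarity plus a relative position
    bias B replaces the softmax-normalized dot product.

    Q, K, V: (N, d) arrays; B: (N, N) relative position bias matrix.
    beta is a learnable scalar in the paper; fixed here for illustration.
    """
    Qn = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + 1e-8)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + 1e-8)
    attn = Qn @ Kn.T / beta + B   # cosine similarities, scaled and biased
    return attn @ V               # no softmax: weights are applied directly
```

Because each row of `Qn @ Kn.T` is bounded in $[-1, 1]$, the attention weights stay in a mild range without an explicit normalization pass.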

Ball Query Based Downsample Strategy
As the input point cloud sequence is too large, we adopt a point set aggregation operation like PointNet++. A set abstraction layer takes an $N \times (d + C)$ matrix as input, which is output by our fast version transformer 3D encoder. It outputs an $N' \times (d + C')$ matrix of $N'$ subsampled points with $d$-dim coordinates and new $C'$-dim feature vectors summarizing local context. The grouping step produces point sets of size $N' \times K \times (d + C)$, where each group corresponds to a local region and $K$ is the number of points in the neighborhood of the centroid points. As shown in Fig. 1, a ball query downsample module is connected behind each transformer 3D encoder block.
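The ball query grouping step can be sketched as below. This is a minimal, loop-based sketch assuming PointNet++-style conventions: neighbor slots beyond the points found inside the radius are filled by repeating earlier neighbors, and grouped coordinates are centered on their centroid.

```python
import numpy as np

def ball_query_group(points, feats, centroids, radius=1.0, K=16):
    """Group up to K neighbors within `radius` of each centroid.

    points: (N, 3) coordinates; feats: (N, C) features; centroids: (M, 3).
    Returns grouped tensor of shape (M, K, 3 + C).
    """
    M = centroids.shape[0]
    # pairwise distances from each centroid to every point: (M, N)
    dist = np.linalg.norm(points[None, :, :] - centroids[:, None, :], axis=-1)
    grouped = np.zeros((M, K, points.shape[1] + feats.shape[1]))
    for m in range(M):
        idx = np.where(dist[m] <= radius)[0]
        if idx.size == 0:
            idx = np.array([np.argmin(dist[m])])  # fall back to nearest point
        idx = np.resize(idx, K)                   # repeat indices to fill K slots
        local = points[idx] - centroids[m]        # center coords on the centroid
        grouped[m] = np.concatenate([local, feats[idx]], axis=-1)
    return grouped
```

In practice this loop would be vectorized or run on GPU; the sketch only shows the grouping semantics.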

Mult-Stage Transformer 3D Decoder
As the size of objects is invariant in the 3D scene, one-stage 3D detectors usually do not consider multi-stage features. In addition, current point-based methods usually adopt an encoder-only architecture to perform 3D detection tasks. This motivates us to design a multi-stage transformer 3D decoder. The architecture of our multi-stage transformer 3D detector is shown in Fig. 2. The multi-stage transformer 3D decoder consists of several transformer decoder blocks. Different from the encoder block, the transformer decoder block consists of two fast version multi-head self-attention modules. The first module uses the final output of the transformer 3D encoder, while the second uses the temporal output of the 3D encoder as its query and key input matrices.
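The data flow of one decoder block might be sketched as follows. This is a hypothetical reading of the text: the exact query/key/value assignment is ambiguous, so the sketch takes the stage (temporal) encoder output as the query of the second module and the running decoder state as key/value, and it stands in plain scaled-dot-product attention for the paper's fast cosine attention to stay short.

```python
import numpy as np

def _attn(Q, K, V):
    # plain scaled-dot-product attention, standing in for the paper's
    # fast cosine attention purely to keep the sketch compact
    w = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def decoder_block(final_enc, stage_enc):
    """One hypothetical decoder block.

    final_enc: (Nf, d) final encoder output, used by the first module.
    stage_enc: (Ns, d) one stage's temporal encoder output, used as the
    query of the second module. Returns an (Ns, d) refined feature.
    """
    x = final_enc + _attn(final_enc, final_enc, final_enc)  # first module
    return stage_enc + _attn(stage_enc, x, x)               # second module
```

Stacking one such block per encoder stage yields the multi-stage decoder described above.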

Experiment Setting
KITTI Dataset: The KITTI dataset provides 7481 training samples and 7518 test samples. The training samples are provided with annotations of the car, pedestrian, and cyclist categories. In this paper, we follow the common protocol of [] and divide these training samples into a training set and a validation set. The training set contains 3712 samples, while the validation set contains 3769 samples. Objects in each category are split into three difficulty levels based on occlusion and their distance from the LiDAR sensor. As the test set does not have ground-truth labels, we submit the prediction results to the KITTI official test server to get the final result.
Implementation Settings: We adopt the ADAM [] optimizer and a one-cycle learning rate schedule. We train our model for 80 epochs on the KITTI dataset. The initial learning rate is set to 0.003 and the lower bound is set to 0.0000001. Our detector contains 5 transformer encoder blocks and 5 transformer decoder blocks.
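To make the schedule concrete, a one-cycle learning rate curve matching the stated settings (peak 0.003, floor 1e-7) can be sketched as follows. The warm-up fraction and cosine decay shape are assumptions; in practice one would use an off-the-shelf scheduler such as PyTorch's `OneCycleLR`.

```python
import math

def onecycle_lr(step, total_steps, max_lr=0.003, min_lr=1e-7, warmup_frac=0.4):
    """Hypothetical one-cycle schedule: linear warm-up from min_lr to
    max_lr, then cosine decay back down to min_lr."""
    warm = int(total_steps * warmup_frac)
    if step < warm:
        return min_lr + (max_lr - min_lr) * step / warm
    frac = (step - warm) / max(1, total_steps - warm)
    return min_lr + (max_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * frac))
```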
Experiment Environment: Our local server uses a Core i9 10980XE CPU with two RTX 3090 GPUs. It runs Ubuntu 20.04 with Python 3.8 and PyTorch 1.8.1.

Comparison on KITTI
We compare our method with state-of-the-art 3D detectors on the KITTI validation set and test set. The validation results come from our local experiments running each detector's officially released code, and are shown in Table 1 and Table 2. The test results come from the KITTI official benchmark and are shown in Table 3. We adopt 3D average precision (AP) and bird's-eye-view average precision (BEV AP) as metrics.
From the results in Table 1, we can see that our method achieves significantly better performance than other one-stage detectors. Even compared with some two-stage methods, our method's accuracy remains highly competitive. On car objects at the moderate difficulty level, our method outperforms SECOND by xxx points 3D AP, PointPillars by xxx points 3D AP, and CIA-SSD by xxx points 3D AP. On car objects at the hard difficulty level, it outperforms SECOND by xxx points, PointPillars by xxx points, and CIA-SSD by xxx points 3D AP. On pedestrian objects at the moderate difficulty level, it outperforms SECOND by xxx points, PointPillars by xxx points, and CIA-SSD by xxx points 3D AP. On cyclist objects at the moderate difficulty level, it outperforms SECOND by xxx points, PointPillars by xxx points, and CIA-SSD by xxx points 3D AP. In the results of Table 2, our method still maintains a clear margin over the other listed one-stage 3D detectors. This shows that our proposed detector strikes a better balance between speed and accuracy than other 3D detection methods.

Conclusion
In this paper, we proposed a multi-stage transformer 3D object detection method. In this framework, a fast version transformer 3D encoder block is proposed, combined with the ball query group downsample method, to learn features from the original point cloud space. Then, a multi-stage transformer 3D decoder is added to collect information from the 3D encoder's temporal outputs. Extensive experimental results show that our method achieves significantly better performance than other one-stage detectors. In the future, we will attempt to integrate multi-modal methods into this framework.