Automatic Label Assignment for Object Detection

Hao Wang, Tong Jia, Qilong Wang, Wangmeng Zuo

Published: 01 Jan 2025, Last Modified: 04 Nov 2025. Venue: IEEE Transactions on Circuits and Systems for Video Technology. License: CC BY-SA 4.0

Abstract: Label assignment, which classifies region proposals as positive or negative samples according to how well their classification and localization predictions agree with the corresponding ground truth, is an essential ingredient of object detection and strongly affects detection performance. Recently, dynamic label assignment methods have been proposed to overcome the limitations of static methods, and they achieve promising performance improvements. Although they remove the reliance on the hand-crafted sampling priors of static methods, existing dynamic methods usually suffer from two weaknesses. First, most of them deploy mixture models or an implicit branch in the prediction head to coarsely estimate the spatial distribution of the positive samples for each object, and they pay little attention to the appearance information of the objects. Second, these methods still cannot perceive the quality distribution of the positive samples, and low-quality positive samples adversely affect detection performance. To address these issues, this paper presents a novel automatic label assignment (ALA) method for object detection. Specifically, our method first introduces an instance property branch into the object detection pipeline to distinguish the foreground from the background. Then, an objectness prediction module, composed of confidence and weight mechanisms, is developed to generate positive and negative weight maps for the objects. Together, the instance property branch and the objectness prediction module provide a coarse-to-fine optimization framework that makes our method aware of the appearance of the objects. Finally, a positive sample selection strategy is proposed to explore the statistical quality distribution of the positive samples, which are trained with differently designed label targets.
We evaluate our method on the MS COCO dataset and achieve 48.4%, 47.9%, 48.0% and 49.3% $\text{AP}_{0.5:0.95}$ with ResNet-101, ResNeXt-101, DCN-ResNet-101 and DCN-ResNeXt-101 backbones, respectively. We also assess the time complexity of ALA by measuring inference speed: the frames per second (FPS) for these four backbones are 11.9, 10.4, 9.9 and 8.0, respectively. The experimental results demonstrate that our method obtains clear improvements over competing methods and performs favorably against the state of the art.
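For readers unfamiliar with the reported quantities, the following is a minimal sketch, not the authors' evaluation code. It assumes the standard COCO convention that AP over 0.5:0.95 is the mean of the per-threshold AP values at ten IoU thresholds (0.50 to 0.95 in steps of 0.05), and it converts an FPS figure into per-image latency. The function names `box_iou`, `coco_style_ap` and `fps_to_latency_ms` are hypothetical and introduced only for illustration.

```python
# Illustrative helpers only; names and structure are assumptions,
# not part of the paper or of any official COCO toolkit.

# Ten IoU thresholds 0.50, 0.55, ..., 0.95 used by the COCO-style metric.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def coco_style_ap(ap_per_threshold):
    """Average a mapping {iou_threshold: AP} over the ten COCO thresholds."""
    return sum(ap_per_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


def fps_to_latency_ms(fps):
    """Convert frames per second into average milliseconds per image."""
    return 1000.0 / fps


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    # Per-image latency implied by the FPS values quoted in the abstract.
    for fps in (11.9, 10.4, 9.9, 8.0):
        print(fps, "FPS ->", round(fps_to_latency_ms(fps), 1), "ms per image")
```

Under this conversion, the slowest reported configuration (8.0 FPS) corresponds to 125 ms per image.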

External IDs: doi:10.1109/tcsvt.2025.3578021