Keywords: Object Detection, Multi-Level Learning, Head Network
TL;DR: We rethink the efficiency bottleneck in multi-level learning; the outcome is SlimHead, which pursues a better speed-accuracy trade-off for object detection.
Abstract: Dense object detection is crucial and widely favored in industry, and has been popular for years thanks to the success of the multi-level learning framework. By distributing the learning of objects across a multi-level feature pyramid, such a divide-and-conquer solution eases the optimization difficulty. However, this learning paradigm leaves behind a major shortcoming: the shallow levels carry a heavy computational burden due to their high-resolution feature maps, which substantially slows down inference. In this paper, we aim for minimal modifications in exchange for a better speed-accuracy trade-off. The outcome is SlimHead, a very simple, efficient, and generalizable head network that further unleashes the potential of multi-level learning for dense object detectors. It operates in two stages, Slim and Fat: it first plugs an interpolator before the head network to "slim" the feature pyramid, and then recovers the features to the original resolution by "fattening" the feature pyramid. Thanks to its flexibility, operations with higher computational complexity can be easily integrated to improve accuracy without loss of inference efficiency. We also extend SlimHead to multiple high-level vision tasks such as arbitrary-oriented object detection, pedestrian detection, and instance segmentation. Extensive experiments on PASCAL VOC, MS COCO, DOTA, and CrowdHuman demonstrate the broad applicability and high practical value of our method.
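A minimal sketch of the slim-then-fat flow as described in the abstract. The class name, the slim_factor parameter, and the choice of bilinear interpolation are illustrative assumptions based only on the abstract, not the authors' implementation:

# Hypothetical sketch: shrink each pyramid level before the head ("slim"),
# run the head on the smaller map, then restore the original resolution ("fat").
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimHeadSketch(nn.Module):
    def __init__(self, head: nn.Module, slim_factor: float = 0.5):
        super().__init__()
        self.head = head              # any dense detection head (e.g., shared cls/reg convs)
        self.slim_factor = slim_factor  # assumed downscaling ratio

    def forward(self, pyramid):
        outputs = []
        for feat in pyramid:          # one feature map per pyramid level
            h, w = feat.shape[-2:]
            # "Slim": interpolate the feature map to a lower resolution before the head.
            slim = F.interpolate(feat, scale_factor=self.slim_factor,
                                 mode="bilinear", align_corners=False)
            out = self.head(slim)
            # "Fat": recover the original spatial resolution of the head output.
            fat = F.interpolate(out, size=(h, w),
                                mode="bilinear", align_corners=False)
            outputs.append(fat)
        return outputs

Under this reading, the head's FLOPs scale with the squared slim factor on each level, which is where the claimed speed-up on the high-resolution shallow levels would come from.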
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7481