SimPLR: A Simple and Plain Transformer for Efficient Object Detection and Segmentation

Published: 22 Feb 2025, Last Modified: 22 Feb 2025
Accepted by TMLR
License: CC BY 4.0
Abstract: The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and pyramid designs remain a key factor for their empirical success. In this paper, we show that shifting the multi-scale inductive bias into the attention mechanism can work well, resulting in a plain detector, ‘SimPLR’, whose backbone and detection head are both non-hierarchical and operate on single-scale features. We find through our experiments that SimPLR with scale-aware attention is a plain and simple architecture, yet competitive with multi-scale vision transformer alternatives. Compared to the multi-scale and single-scale state-of-the-art, our model scales better with bigger-capacity (self-supervised) models and more pre-training data, allowing us to report consistently better accuracy and faster runtime for object detection, instance segmentation, and panoptic segmentation. Code is released at https://github.com/kienduynguyen/SimPLR.
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Evan_G_Shelhamer1
Submission Number: 3114