SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, Venue: MM2024 Poster, License: CC BY 4.0
Abstract: Recent years have seen an increase in the use of gigapixel-level image and video capture systems and benchmarks with high-resolution wide (HRW) shots. However, unlike close-up shots in the MS COCO dataset, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, which make existing close-up detectors inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the object detection gap between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly explore global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel Cross-slice non-maximum suppression (C-NMS) algorithm to precisely localize objects from noisy windows and a simple yet effective multi-scale strategy to improve accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (up to 5.8%) and speed (up to 3x) over the state-of-the-art approaches.
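Illustrative sketch: the abstract does not spell out how C-NMS works, so the snippet below is not the paper's algorithm. It only shows the general problem such a step addresses: detections produced independently on overlapping slices of a large image are mapped back to global coordinates and de-duplicated with plain greedy NMS. The names `iou` and `merge_slices`, the slice offsets, and the 0.5 IoU threshold are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Det = Tuple[Box, float]                   # (box, confidence score)
Slice = Tuple[Tuple[float, float], List[Det]]  # ((offset_x, offset_y), detections)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_slices(slices: List[Slice], iou_thr: float = 0.5) -> List[Det]:
    """Shift per-slice boxes into image coordinates, then apply greedy NMS.

    This is standard NMS over the merged set, used here as a stand-in;
    it is an assumption, not the C-NMS procedure from the paper.
    """
    merged: List[Det] = []
    for (ox, oy), dets in slices:
        for (x1, y1, x2, y2), score in dets:
            merged.append(((x1 + ox, y1 + oy, x2 + ox, y2 + oy), score))

    # Highest-scoring boxes are considered first.
    merged.sort(key=lambda d: d[1], reverse=True)

    kept: List[Det] = []
    for box, score in merged:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(box, k[0]) <= iou_thr for k in kept):
            kept.append((box, score))
    return kept


# Two hypothetical slices; the same object is seen by both.
slices = [
    ((0.0, 0.0), [((90.0, 10.0, 110.0, 30.0), 0.9)]),
    ((100.0, 0.0), [((-10.0, 10.0, 10.0, 30.0), 0.6),
                    ((50.0, 50.0, 70.0, 70.0), 0.8)]),
]
result = merge_slices(slices)
```

In this toy input the object straddling the slice boundary is reported twice, and the lower-scoring duplicate is dropped after merging, leaving two detections.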
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: SparseFormer's innovative approach in utilizing a model-agnostic sparse vision transformer for object detection in both close-up and HRW (high-resolution wide) shots propels the multimedia field toward greater scene understanding, especially in the context of gigapixel detection. This technology paves the way for future research into wide field-of-view, high-resolution scenarios for cross-modal retrieval, expanding the capability from detecting a few objects to several hundreds. Such a leap in technology could open up exciting new research directions, significantly enhancing the scope and depth of multimedia analysis. The ability to accurately process and interpret vast and detailed visual information in such scenarios represents a substantial advancement, suggesting that the exploration of wide field-of-view and high-resolution multimedia research will emerge as a new trend. We believe that the emphasis on developing technologies capable of handling extensive and intricate multimedia content will define future research directions, making high-resolution wide field-of-view multimedia studies an exciting and pivotal area of focus in the years to come.
Supplementary Material: zip
Submission Number: 273