Aligning and Prompting Everything All at Once for Universal Visual Perception

Published: 01 Jan 2024 · Last Modified: 11 Apr 2025 · CVPR 2024 · CC BY-SA 4.0
Abstract: Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding. Another line of work that focuses on pixel-level tasks often encounters a large annotation gap between things and stuff, and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to the prevailing methods, we present APE, a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks, i.e., detection, segmentation, and grounding, as an instance-level sentence-object matching paradigm. Specifically, APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection, which efficiently scales up model prompting to thousands of category vocabularies and region descriptions while maintaining the effectiveness of cross-modality fusion. To bridge the granularity gap of different pixel-level tasks, APE equalizes semantic and panoptic segmentation to proxy instance learning by considering any isolated regions as individual instances. APE aligns vision and language representations on broad data with natural and challenging characteristics all at once, without task-specific fine-tuning. Extensive experiments on over 160 datasets demonstrate that, with only one suite of weights, APE outperforms (or is on par with) the state-of-the-art models, proving that an effective yet universal perception for aligning and prompting anything is indeed feasible. Code and trained models are released at https://github.com/shenyunhang/APE.
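The abstract's idea of equalizing semantic and panoptic segmentation to proxy instance learning, by treating any isolated region as an individual instance, can be illustrated with a minimal sketch: split a semantic-class mask into connected components and treat each component as one proxy instance. This is only an illustrative approximation of the concept, not code from the APE repository; the function and variable names below are hypothetical.

```python
# Minimal sketch (assumed, not APE's actual implementation): turn the isolated
# regions of one semantic/stuff class into separate proxy-instance masks via
# connected-component labeling.
import numpy as np
from scipy import ndimage


def semantic_mask_to_proxy_instances(semantic_mask: np.ndarray, class_id: int):
    """Split the binary mask of one semantic class into per-region instance masks.

    semantic_mask: (H, W) integer array of per-pixel class ids.
    class_id:      the class whose isolated regions become proxy instances.
    Returns a list of (H, W) boolean masks, one per connected region.
    """
    binary = semantic_mask == class_id
    # Label 8-connected regions; each labeled region becomes one proxy instance.
    labeled, num_regions = ndimage.label(binary, structure=np.ones((3, 3), dtype=int))
    return [labeled == i for i in range(1, num_regions + 1)]


# Toy example: a 5x5 mask with two disjoint patches of class 7 yields two proxy instances.
mask = np.zeros((5, 5), dtype=int)
mask[0:2, 0:2] = 7
mask[3:5, 3:5] = 7
instances = semantic_mask_to_proxy_instances(mask, class_id=7)
print(len(instances))  # -> 2
```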