Abstract: Generalist models have achieved remarkable success in both
language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating
fine-grained perception tasks like detection and segmentation
into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task-specific
designs and architectures that can complicate the modeling process. To address this challenge, we present UFO, a
framework that Unifies Fine-grained visual perception tasks
through an Open-ended language interface. By transforming
all perception targets into the language space, UFO unifies
object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Additionally,
we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation
tasks. Our framework bridges the gap between fine-grained
perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with
intricate task-specific designs. After multi-task training on
five standard visual perception datasets, UFO outperforms
the previous state-of-the-art generalist models by 12.3 mAP
on COCO instance segmentation and 3.3 mIoU on ADE20K
semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining
fine-grained perception capabilities with their advanced language abilities, thereby enabling more challenging tasks
such as reasoning segmentation. Code and models are available at https://github.com/nnnth/UFO.
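To make the embedding retrieval idea concrete, below is a minimal sketch of how a segmentation mask could be read out through the language interface alone: the model emits a mask-token embedding, and the mask is recovered by scoring that embedding against every position of the image feature map. The function name `retrieve_mask`, the `(C, H, W)` feature layout, and the cosine-similarity-with-threshold scoring are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def retrieve_mask(token_embed: torch.Tensor,
                  image_feats: torch.Tensor,
                  threshold: float = 0.0) -> torch.Tensor:
    """Turn one predicted mask-token embedding into a binary mask
    by similarity retrieval over image features (illustrative sketch).

    token_embed: (C,)       embedding of the predicted mask token
    image_feats: (C, H, W)  per-position image feature map
    """
    C, H, W = image_feats.shape
    feats = image_feats.flatten(1)  # (C, H*W)
    # Cosine similarity between the token and every spatial position.
    sim = F.cosine_similarity(token_embed[:, None], feats, dim=0)  # (H*W,)
    # Positions whose features match the token form the mask.
    return (sim > threshold).view(H, W)

# Usage: a 256-dim token retrieving a mask over a 64x64 feature map.
mask = retrieve_mask(torch.randn(256), torch.randn(256, 64, 64))
```

Because the mask is produced by retrieval over embeddings rather than by a dedicated mask decoder, segmentation stays inside the same open-ended token-prediction interface as the other tasks.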