Abstract: Open-vocabulary object detection seeks to recognize objects from arbitrary language inputs, extending detection beyond fixed training categories. While recent methods have made progress in detecting unseen categories, they typically require a set of predefined categories at inference time, which hinders practical deployment in open-world scenarios. To overcome this limitation, we propose UniPerception, a novel universal perception framework based on open-vocabulary object detection. It not only excels at open-vocabulary object detection but also generates labels for target objects in the absence of a predefined vocabulary, and it can be adapted to a broad range of vision-language tasks simply by modifying the language instructions. UniPerception integrates three key innovations: 1) a robust visual detector trained on diverse data sources to capture rich, generalizable visual representations; 2) a language model with interleaved cross-modality fusion layers that interprets instructions and generates fine-grained responses conditioned on visual features; and 3) a tailored multi-stage training strategy that bridges detection-specific learning with general vision-language understanding. We conduct extensive experiments on benchmarks for open-vocabulary object detection (COCO, LVIS, ODinW), referring expression comprehension (RefCOCO/+/g, D3), and vision-language understanding (Flickr30k, VQAv2, GQA). The results show that UniPerception achieves strong open-world generalization and multi-modal understanding, outperforming existing state-of-the-art methods and establishing itself as a unified, instruction-driven perception system.
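The abstract names interleaved cross-modality fusion layers as the mechanism that conditions the language model's responses on detector features. The following is a minimal, illustrative Python sketch of one plausible form of such a fusion block, not the authors' implementation: the class name, hidden size, head count, and normalization placement are all assumptions for illustration only.

```python
# Illustrative sketch (not the paper's code): a language-model block interleaved
# with a cross-modality fusion layer that attends to detector region features.
# All module names, dimensions, and layout choices here are assumptions.
import torch
import torch.nn as nn


class CrossModalityFusionBlock(nn.Module):
    """Text self-attention interleaved with cross-attention to visual region features."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # Instruction tokens attend to themselves (language understanding).
        x = text_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Instruction tokens attend to detector region features (visual conditioning).
        x = x + self.cross_attn(self.norm2(x), visual_tokens, visual_tokens)[0]
        # Position-wise feed-forward refinement.
        x = x + self.ffn(self.norm3(x))
        return x


# Toy usage: 4 instruction tokens conditioned on 10 region proposals.
block = CrossModalityFusionBlock()
text = torch.randn(1, 4, 512)      # (batch, text tokens, dim)
regions = torch.randn(1, 10, 512)  # (batch, region features, dim)
print(block(text, regions).shape)  # torch.Size([1, 4, 512])
```

In this sketch the fusion layer is stacked after a standard self-attention block, so a language model can be "interleaved" with such layers by inserting one between selected transformer blocks; how often and where the paper inserts them is not specified in the abstract.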
DOI: 10.1145/3746027.3755017