Uni-YOLO: Vision-Language Model-Guided YOLO for Robust and Fast Universal Detection in the Open World
Abstract: Universal object detectors aim to detect any object in any scene without human annotation, exhibiting superior generalization. However, current universal object detectors show degraded performance in harsh weather, and their limited real-time capability restricts their applications. In this paper, we present Uni-YOLO, a universal detector designed for complex scenes with real-time performance. Uni-YOLO is a one-stage object detector that uses a general object confidence to distinguish objects from the background and employs a grid cell regression method for real-time detection. To improve its robustness in harsh weather, the input of Uni-YOLO is adaptively enhanced with a physical model-based enhancement module. During training and inference, Uni-YOLO is guided by the extensive knowledge of the vision-language model CLIP. An object augmentation method is proposed to improve generalization during training by utilizing multiple source datasets with heterogeneous annotations. Furthermore, an online self-enhancement method is proposed to allow Uni-YOLO to further focus on specific objects in a given scene through self-supervised fine-tuning. Extensive experiments on public benchmarks and a UAV deployment validate its superiority and practical value.
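To make the CLIP-guided, objectness-based scoring described above concrete, the following is a minimal illustrative sketch (not the paper's implementation). It assumes a class-agnostic detector that outputs boxes with a general object confidence, and it uses an off-the-shelf CLIP model from Hugging Face to match each cropped region against user-provided candidate category prompts; the model name, function names, and scoring rule are assumptions for illustration only.

```python
# Hedged sketch: open-vocabulary classification of class-agnostic detections with CLIP.
# Assumes an upstream detector supplies boxes and general object confidences.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify_detections(image: Image.Image, boxes, obj_conf, candidate_classes):
    """Assign each detected box an open-vocabulary class via CLIP similarity,
    weighted by the detector's general object confidence (illustrative rule)."""
    prompts = [f"a photo of a {c}" for c in candidate_classes]
    text_in = processor(text=prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**text_in)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    results = []
    for (x1, y1, x2, y2), conf in zip(boxes, obj_conf):
        crop = image.crop((x1, y1, x2, y2))           # region proposed by the detector
        img_in = processor(images=crop, return_tensors="pt")
        with torch.no_grad():
            img_emb = model.get_image_features(**img_in)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        probs = (img_emb @ text_emb.T).softmax(dim=-1).squeeze(0)  # class probabilities
        cls = int(probs.argmax())
        # Combine open-vocabulary similarity with objectness for the final score.
        results.append((candidate_classes[cls], float(probs[cls]) * float(conf),
                        (x1, y1, x2, y2)))
    return results
```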
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: In this work, we present Uni-YOLO, a new open-world universal object detector that is a multimodal (vision-and-language) model. Uni-YOLO can detect unknown object categories prompted by the user's candidate language input, and it is designed as a one-stage detector to achieve real-time detection. We use the large-scale pre-trained vision-language model CLIP to guide Uni-YOLO for zero-shot detection in the open world during both training and inference. We propose an object augmentation strategy that uses existing datasets from multiple sources with heterogeneous annotations for generalization training. To improve detection in a specific scene, we propose an online self-enhancement strategy that allows Uni-YOLO to focus more on specific objects through self-supervised fine-tuning. We also develop a UAV platform for multimedia-interactive object detection to demonstrate the practical value of Uni-YOLO. Extensive experiments on public benchmarks validate the superiority of Uni-YOLO. In conclusion, Uni-YOLO is a new multimodal model for multimedia research applications. We believe this work is closely related to multimedia/multimodal processing and will contribute to further research in this area.
Supplementary Material: zip
Submission Number: 3039