Zero-Shot Detection of Out-of-Context Objects Using Foundation Models

Published: 01 Jan 2025, Last Modified: 05 Jul 2025 · WACV 2025 · CC BY-SA 4.0
Abstract: We address the problem of detecting out-of-context (OOC) objects in a scene. Given an image, we aim to detect whether the image contains objects that are not in their usual context and to localize such OOC objects. Existing approaches for OOC detection rely on defining the common context in terms of manually constructed features, such as the co-occurrence of objects, spatial relations between objects, and the shape and size of objects, and then learning such context for a given dataset. But context is often nuanced, ranging from very common to very surprising. Further, context learned from a specific dataset may not generalize, as datasets may not truly represent the human notion of what is in context. Motivated by the success of large language models and, more generally, foundation models (FMs) in common sense reasoning, we investigate the FM's ability to capture a more generalized notion of context. We find that a pre-trained FM, such as GPT-4, provides a more nuanced notion of OOC and enables zero-shot OOC detection when coupled with other pre-trained FMs for caption generation, such as BLIP-2, and image inpainting with Stable Diffusion 2.0. Our approach does not need any dataset-specific training. We demonstrate the efficacy of our approach on two OOC object detection datasets, achieving 90.8% zero-shot accuracy on the MIT-OOC dataset and 87.26% on the IJCAI22-COCO-OOC dataset.
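The abstract describes a three-stage zero-shot pipeline: caption the image with a vision-language FM (e.g. BLIP-2), ask an LLM (e.g. GPT-4) which mentioned objects are out of context, then localize the flagged objects (with inpainting via Stable Diffusion 2.0 used in that stage). A minimal structural sketch of such a pipeline, with each model call passed in as a function argument — the function names, prompts, and wiring here are illustrative assumptions, not the paper's actual implementation:

```python
from typing import Any, Callable, List, Tuple

def detect_ooc(image: Any,
               caption_fn: Callable[[Any], str],
               reason_fn: Callable[[str], List[str]],
               localize_fn: Callable[[Any, str], Tuple[int, int, int, int]]
               ) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Hypothetical zero-shot OOC detection pipeline.

    caption_fn  -- e.g. a BLIP-2 wrapper: image -> scene caption
    reason_fn   -- e.g. a GPT-4 wrapper: caption -> names of OOC objects
    localize_fn -- e.g. an inpainting-based check: (image, object name) -> box
    """
    caption = caption_fn(image)            # 1) describe the scene
    ooc_objects = reason_fn(caption)       # 2) LLM flags out-of-context objects
    # 3) localize each flagged object in the image
    return [(obj, localize_fn(image, obj)) for obj in ooc_objects]
```

With stub functions standing in for the real models, an image whose caption mentions an unusual object would yield that object with its box, e.g. `detect_ooc(img, cap, reason, loc)` returning `[("elephant", (0, 0, 10, 10))]`. Passing the model calls as arguments keeps the pipeline itself training-free and lets each FM be swapped independently.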