Abstract: Out-of-Distribution (OOD) detection is crucial when deploying machine learning models in open-world applications. The core
challenge in OOD detection is mitigating the model’s overconfidence on OOD data. While recent methods using auxiliary outlier
datasets or synthesizing outlier features have shown promising OOD detection performance, they are limited by costly data collection or simplified assumptions. In this paper, we propose a novel OOD detection framework, FodFoM, which innovatively combines
multiple foundation models to generate two types of challenging fake outlier images for classifier training. The first type is based
on BLIP-2’s image captioning capability, CLIP’s vision-language knowledge, and Stable Diffusion’s image generation ability. Jointly
utilizing these foundation models constructs fake outlier images which are semantically similar to but different from in-distribution
(ID) images. For the second type, GroundingDINO's object detection ability is utilized to help construct pure background images by blurring foreground ID objects in ID images. The proposed framework can be flexibly combined with multiple existing OOD detection
methods. Extensive empirical evaluations show that image classifiers trained with the help of the constructed fake images can more accurately differentiate real OOD images from ID ones. New state-of-the-art OOD detection performance is achieved on multiple benchmarks.
The source code will be publicly released.
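To make the first pipeline concrete, below is a minimal sketch of how BLIP-2 captioning, CLIP text encoding, and Stable Diffusion generation can be chained to produce semantically close fake outliers. The model checkpoints, the Gaussian perturbation of the prompt embedding, and the scale `sigma` are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: generate a "type 1" fake outlier for one ID image.
# Assumptions: HuggingFace transformers + diffusers; checkpoints and the
# Gaussian perturbation of the prompt embedding are illustrative choices.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from diffusers import StableDiffusionPipeline

device = "cuda"

# 1) BLIP-2 captions the ID image.
blip2_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to(device)

id_image = Image.open("id_sample.jpg").convert("RGB")
inputs = blip2_proc(images=id_image, return_tensors="pt").to(device, torch.float16)
caption = blip2_proc.batch_decode(
    blip2.generate(**inputs, max_new_tokens=30), skip_special_tokens=True
)[0].strip()

# 2) Encode the caption with the pipeline's CLIP text encoder and perturb it,
#    so the prompt stays semantically close to, yet different from, the ID image.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)
tokens = pipe.tokenizer(
    caption, padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True, return_tensors="pt",
).input_ids.to(device)
prompt_embeds = pipe.text_encoder(tokens)[0]
sigma = 0.1  # hypothetical perturbation scale
prompt_embeds = prompt_embeds + sigma * torch.randn_like(prompt_embeds)

# 3) Stable Diffusion turns the perturbed embedding into a fake outlier image.
fake_outlier = pipe(prompt_embeds=prompt_embeds).images[0]
fake_outlier.save("fake_outlier.png")
```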
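Similarly, a minimal sketch of the second pipeline, assuming the HuggingFace port of GroundingDINO: foreground ID objects are detected from a class-name text prompt and blurred in place, leaving a pure background image. The checkpoint, text prompt, detection thresholds, and blur radius are illustrative, not the paper's exact settings.

```python
# Sketch: build a "type 2" pure-background image from one ID image.
# Assumptions: HuggingFace GroundingDINO port; the prompt, thresholds, and
# blur radius below are illustrative choices.
import torch
from PIL import Image, ImageFilter
from transformers import AutoProcessor, GroundingDinoForObjectDetection

device = "cuda"
proc = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
dino = GroundingDinoForObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-tiny"
).to(device)

id_image = Image.open("id_sample.jpg").convert("RGB")
# GroundingDINO expects lowercase text queries ending with a period,
# e.g. the ID class name of this image.
inputs = proc(images=id_image, text="a dog.", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = dino(**inputs)
results = proc.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[id_image.size[::-1]],  # (height, width)
)[0]

# Blur every detected foreground box so only background context remains.
blurred = id_image.filter(ImageFilter.GaussianBlur(radius=25))
background = id_image.copy()
for box in results["boxes"]:
    x0, y0, x1, y1 = (int(v) for v in box.tolist())
    background.paste(blurred.crop((x0, y0, x1, y1)), (x0, y0))
background.save("background_outlier.png")
```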
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Our work investigates how to fully utilize multimodal foundation models to assist out-of-distribution detection in the visual modality, and how to jointly exploit visual and textual data to guide foundation models in generating pseudo-anomaly data that are highly useful for out-of-distribution detection. Our main contributions to multimedia and multimodality are the following:
1) By exploiting relationships among multimodal data and the text-to-image generation capability of multimodal foundation models, our framework provides an innovative method for visual out-of-distribution detection and expands the application areas of multimodal processing.
2) Our out-of-distribution detection framework can be easily extended to other modalities such as text and video, opening up new possibilities for research and practice in multimedia data analysis and anomaly detection.
3) Our framework offers a viable way to generate pseudo-anomaly data in the multimodal or multimedia domain, helping multimodal models better learn from out-of-distribution data and improving the generalization and security of multimedia and multimodal models.
Supplementary Material: zip
Submission Number: 3347