Chain of Visual Perception: Harnessing Multimodal Large Language Models for Zero-shot Camouflaged Object Detection

Lv Tang; Peng-Tao Jiang; Zhihao Shen; Hao Zhang; Jinwei Chen; Bo Li

Published: 20 Jul 2024, Last Modified: 31 Jul 2024, MM2024 Poster, CC BY 4.0

Abstract: In this paper, we introduce a novel multimodal camo-perceptive framework (MMCPF) for zero-shot Camouflaged Object Detection (COD) that leverages the capabilities of Multimodal Large Language Models (MLLMs). Current COD methods predominantly rely on supervised learning, which demands extensive and accurately annotated datasets and results in weak generalization; our zero-shot MMCPF circumvents these limitations. Although MLLMs hold significant potential for broad applications, their effectiveness in COD is limited, and they tend to misinterpret camouflaged objects. To address this challenge, we further propose a strategic enhancement called the Chain of Visual Perception (CoVP), which significantly improves the perceptual capabilities of MLLMs in camouflaged scenes by leveraging both linguistic and visual cues more effectively. We validate the effectiveness of MMCPF on five widely used COD datasets: CAMO, COD10K, NC4K, MoCA-Mask and OVCamo. Experiments show that MMCPF outperforms all existing state-of-the-art zero-shot COD methods and achieves competitive performance compared to weakly-supervised and fully-supervised methods, demonstrating its potential. The GitHub repository for this paper is https://github.com/luckybird1994/MMCPF.

Primary Subject Area: [Content] Media Interpretation

Secondary Subject Area: [Content] Media Interpretation

Relevance To Conference: Our paper leverages a Multimodal Large Language Model (MLLM), along with a specifically designed enhancement mechanism, to address the challenges of detecting camouflaged objects in both images and videos. We believe that using MLLMs to solve image and video tasks aligns closely with the topics of interest at the ACMMM conference.

Supplementary Material: zip

Submission Number: 1686
