Tracing the Hidden: Segment Anything in Camouflaged Videos via Prompt-Free Multimodal LLM Guidance

15 Sept 2025 (modified: 04 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Camouflaged Object Segmentation; Video Understanding; Multimodal Large Language Model
Abstract: Camouflaged object segmentation in videos faces inherent challenges due to the targets' indistinguishable appearance and irregular motion patterns. While the Segment Anything Model 2 (SAM2) provides a flexible framework for prompt-driven segmentation, it relies heavily on handcrafted or external prompts, limiting its potential in complex, real-world scenarios. To address this issue, we present CamoTracer, a prompt-free yet prompt-rich framework that leverages multimodal large language models (MLLMs) to generate diverse and informative prompts, i.e., point, mask, and text prompts, to guide SAM2 without any human intervention. We introduce two key components: (1) a Semantic-Guided Adapter that aligns CLIP and SAM2 representations via cross-attention, injecting rich semantic context into high-resolution visual features; and (2) a Semantic-Aware Prompter that transforms semantic response maps into coarse masks and Gumbel-Softmax-based sampling points, enabling end-to-end differentiable optimization. In parallel, the MLLM outputs text tokens from which we derive implicit text prompts that encode rich vision-language priors. These prompts collaboratively guide the SAM2 mask decoder in a self-adaptive manner. Further, we devise a memory-guided bi-directional keyframe selection strategy to enhance temporal context propagation and prompt reliability across video frames. Extensive experiments on the video camouflaged object segmentation (VCOS) benchmarks MoCA-Mask and CAD demonstrate that CamoTracer achieves new state-of-the-art performance, strong generalization, and robust prompt adaptation, outperforming previous approaches by a significant margin. Our results highlight the potential of self-prompted segmentation empowered by multimodal understanding, bringing SAM2 one step closer to human-like perception in camouflaged scenes.
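To make the Semantic-Aware Prompter's sampling step concrete, the sketch below illustrates how Gumbel-Softmax can turn a semantic response map into differentiable point prompts. This is a minimal illustration under our own assumptions, not the paper's released code: the function name `sample_prompt_points`, the tensor shapes, and the straight-through (`hard=True`) variant are hypothetical choices.

```python
# Hypothetical sketch of Gumbel-Softmax point sampling (not the authors' code).
import torch
import torch.nn.functional as F


def sample_prompt_points(response_map: torch.Tensor,
                         num_points: int = 4,
                         tau: float = 1.0) -> torch.Tensor:
    """Differentiably sample point prompts from a semantic response map.

    response_map: (B, H, W) unnormalized logits over spatial locations,
    higher where the camouflaged target is more likely.
    Returns: (B, num_points, 2) (x, y) coordinates in pixel space.
    """
    b, h, w = response_map.shape
    # One independent categorical draw over all H*W locations per point.
    logits = response_map.reshape(b, 1, h * w).expand(b, num_points, h * w)

    # Straight-through Gumbel-Softmax: near-one-hot selection in the
    # forward pass, soft gradients w.r.t. the response map in the backward.
    probs = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)  # (B, K, HW)

    # Expected (x, y) under the (near) one-hot selection = chosen pixel.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(1, 1, h * w, 2).float()
    coords = coords.to(response_map.device)
    return (probs.unsqueeze(-1) * coords).sum(dim=2)  # (B, K, 2)
```

In this form, the sampled (x, y) coordinates could be passed to SAM2's prompt encoder as point prompts while gradients still flow back into the response map, which is what would enable the end-to-end differentiable optimization the abstract describes.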
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6213