X-Prompt: Multi-modal Visual Prompt for Video Object Segmentation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Multi-modal Video Object Segmentation (VOS), including RGB-Thermal, RGB-Depth, and RGB-Event, has garnered attention due to its capability to address challenging scenarios where traditional VOS methods struggle, such as extreme illumination, rapid motion, and background distraction. Existing approaches often involve designing specific additional branches and performing full-parameter fine-tuning for fusion in each task. However, this approach not only duplicates research efforts and hardware costs but also risks model collapse with the limited multi-modal annotated data. In this paper, we propose a universal framework named X-Prompt for all multi-modal video object segmentation tasks, designated as RGB+X. The X-Prompt framework first pre-trains a video object segmentation foundation model using RGB data, and then uses the additional modality as a prompt to adapt it to downstream multi-modal tasks with limited data. Within the X-Prompt framework, we introduce the Multi-modal Visual Prompter (MVP), which enables prompting the foundation model with various modalities to segment objects precisely. We further propose Multi-modal Adaptation Experts (MAEs) to adapt the foundation model with pluggable modality-specific knowledge without compromising its generalization capacity. To evaluate the effectiveness of the X-Prompt framework, we conduct extensive experiments on 3 tasks across 4 benchmarks. The proposed universal X-Prompt framework consistently outperforms the full fine-tuning paradigm and achieves state-of-the-art performance. Code will be made available.
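The abstract's recipe — a frozen RGB foundation model, a Multi-modal Visual Prompter that injects the extra modality as a prompt, and pluggable per-task Adaptation Experts — can be sketched in miniature. This is our own minimal NumPy illustration under assumed shapes and module designs (e.g. additive prompting and low-rank experts), not the authors' implementation:

```python
# Hedged sketch of the X-Prompt idea. All names, shapes, and module
# designs here are our assumptions for illustration, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
D = 16  # assumed feature dimension

# Frozen foundation encoder: stands in for the RGB-pre-trained VOS model.
W_frozen = rng.standard_normal((D, D)) / np.sqrt(D)

def foundation_encode(tokens):
    # Weights stay fixed during multi-modal adaptation.
    return tokens @ W_frozen

class VisualPrompter:
    """Assumed MVP analogue: maps X-modality features (thermal/depth/event)
    into additive prompts for the RGB token stream."""
    def __init__(self, d):
        self.proj = rng.standard_normal((d, d)) * 0.01  # trainable

    def __call__(self, rgb_tokens, x_tokens):
        return rgb_tokens + x_tokens @ self.proj

class AdaptationExpert:
    """Assumed MAE analogue: a pluggable low-rank adapter that adds
    task-specific knowledge without touching the frozen weights."""
    def __init__(self, d, rank=4):
        self.A = rng.standard_normal((d, rank)) * 0.01
        self.B = rng.standard_normal((rank, d)) * 0.01

    def __call__(self, feats):
        return feats + feats @ self.A @ self.B

# One pluggable expert per downstream RGB+X task.
experts = {
    "rgb-thermal": AdaptationExpert(D),
    "rgb-depth": AdaptationExpert(D),
    "rgb-event": AdaptationExpert(D),
}
prompter = VisualPrompter(D)

def segment_features(rgb_tokens, x_tokens, task):
    prompted = prompter(rgb_tokens, x_tokens)  # fuse X modality as a prompt
    feats = foundation_encode(prompted)        # frozen RGB foundation model
    return experts[task](feats)                # plug in the task expert

rgb = rng.standard_normal((8, D))  # 8 tokens of RGB features
x = rng.standard_normal((8, D))    # paired thermal features
out = segment_features(rgb, x, "rgb-thermal")
print(out.shape)  # (8, 16)
```

The design point this toy mirrors is that only the small prompter and expert modules would be trained on the scarce multi-modal data, while the shared foundation model is reused across all RGB+X tasks.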
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This paper addresses the challenge of Multi-modal Video Object Segmentation, which encompasses the RGB-Thermal, RGB-Depth, and RGB-Event tasks. Video, a classic multimedia format, is an inherently multi-modal topic. The proposed X-Prompt framework addresses both the degradation of model generalization caused by limited and expensive densely annotated multi-modal data and the redundant research effort and deployment cost of designs tailored to each specific modal task. Ultimately, this universal framework achieves state-of-the-art performance across 4 benchmarks in 3 multi-modal VOS tasks. All code will be open-sourced, in the hope that our universal framework can benefit the community and further research.
Supplementary Material: zip
Submission Number: 1107
