Multimodal Inplace Prompt Tuning for Open-set Object Detection

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · License: CC BY 4.0
Abstract: The integration of large language models into open-world detection frameworks significantly improves versatility in new environments. Prompt representations derived from these models help establish classification boundaries for both base and novel categories within open-world detectors. However, we are the first to observe that directly fine-tuning language models in detection systems produces redundant attention patterns and leads to suboptimal prompt representations. To fully leverage the capabilities of large language models and strengthen prompt encoding for detection, this study introduces a redundancy assessment metric that identifies uniform attention patterns. In regions of high redundancy, we then apply multimodal inplace prompt tuning (MIPT) to enrich the text prompt with visual cues. Experimental results validate the efficacy of our MIPT framework, which achieves notable gains across benchmarks, e.g., elevating GLIP-L from 22.6% to 25.0% on ODinW-35 and yielding a 9.0% improvement on LVIS.
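The abstract does not spell out how the redundancy metric is computed; as a minimal sketch, assuming the "Jensen-Shannon Redundancy" mentioned below scores how close a layer's attention distributions are to uniform (low divergence from uniform meaning highly redundant, uniform attention), one plausible implementation looks like the following. The function names `js_divergence` and `js_redundancy` are illustrative, not the paper's.

```python
import torch

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two distributions over the last dim."""
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(-1)
    return 0.5 * (kl_pm + kl_qm)

def js_redundancy(attn):
    """
    Score how far a layer's attention is from a uniform pattern.

    attn: (num_heads, query_len, key_len) softmax-normalized attention weights.
    Returns a scalar in [0, ln 2]; small values indicate near-uniform
    (redundant) attention, flagging the layer for multimodal prompt tuning.
    """
    uniform = torch.full_like(attn, 1.0 / attn.size(-1))
    # Average the per-head, per-query divergence against the uniform distribution.
    return js_divergence(attn, uniform).mean()
```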
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work advances multimodal processing by proposing a Multimodal Inplace Prompt Tuning (MIPT) framework for open-set object detection. By integrating large language models with visual data, it innovatively addresses the challenge of enhancing object detection capabilities with minimal parameter increase. The introduction of Jensen-Shannon Redundancy identifies inefficiencies in the language model when adapted to detection tasks, pinpointing areas where improvements can significantly impact performance. MIPT further optimizes the process by recalibrating text prompts with visual cues, facilitating a deeper, more efficient integration of multimodal data. This enables the model to leverage the nuanced understanding of language models and the specificity of visual information, resulting in improved detection accuracy. By enhancing the interaction between textual and visual inputs within a detection framework, this work contributes a novel approach to the field, offering significant improvements in the efficiency and effectiveness of multimodal processing systems.
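The submission describes MIPT as recalibrating text prompts with visual cues at high-redundancy layers but does not expose the module itself; a minimal, hypothetical sketch of such an "inplace" recalibration, assuming a lightweight cross-attention block (the class name `VisualPromptTuner` and its interface are assumptions, not the paper's API), is shown below.

```python
import torch
import torch.nn as nn

class VisualPromptTuner(nn.Module):
    """Illustrative sketch: enrich text prompt embeddings with visual cues
    via cross-attention, inserted at layers flagged as highly redundant."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_prompts, visual_feats):
        # text_prompts: (B, num_prompt_tokens, dim) prompt embeddings from the language model
        # visual_feats: (B, num_visual_tokens, dim) image features from the detector backbone
        cues, _ = self.cross_attn(text_prompts, visual_feats, visual_feats)
        # Residual update keeps the original prompt and adds visual cues "inplace".
        return self.norm(text_prompts + cues)
```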
Supplementary Material: zip
Submission Number: 3199