RCA: Region Conditioned Adaptation for Visual Abductive Reasoning

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM 2024 Poster, CC BY 4.0
Abstract: Vision foundation models (e.g., CLIP) generalize well to various downstream visual perception tasks. However, their ability to reason beyond mere perception is limited, as they are pre-trained only on image-text pairs that carry semantically equivalent meanings. To tackle this, we propose a simple yet effective \textit{Region Conditioned Adaptation} (RCA), a hybrid parameter-efficient fine-tuning method that equips the frozen CLIP with the ability to infer hypotheses from local visual cues. Specifically, RCA contains two novel modules: a Regional Prompt Generator and Adapter$^\textbf{+}$. The former encodes ``local hints'' and ``global contexts'' into visual prompts separately, at fine and coarse granularities. The latter enhances vanilla adapters with a newly designed Map Adapter, which directly steers the focus of the attention map with trainable query and key projections. Finally, we train RCA with a new Dual-Contrastive Loss that regresses the visual feature simultaneously toward the features of the literal description (a.k.a. clue text) and the plausible hypothesis (abductive inference text). The loss enables CLIP to retain both perception and reasoning abilities. Experiments on the Sherlock visual abductive reasoning benchmark show that RCA significantly outperforms previous SOTAs, ranking \nth{1} on the leaderboard (e.g., Human Acc: RCA 31.74 \textit{vs.} CPT-CLIP 29.58; higher is better). We also show that RCA generalizes to local perception benchmarks such as RefCOCO. We will open-source our code for future research.
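To make the two trainable components concrete, the minimal PyTorch sketches below illustrate the ideas named in the abstract. They are illustrative only: module names, dimensions, temperature, and loss weighting are assumptions for exposition, not details taken from the paper's implementation.

The first sketch captures the Dual-Contrastive Loss idea: the region-conditioned visual feature is pulled toward both the clue-text feature and the inference-text feature with two contrastive (InfoNCE) terms; the weight alpha and the temperature are assumed.

import torch
import torch.nn.functional as F

def info_nce(image_feat, text_feat, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired image/text features.
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = image_feat @ text_feat.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def dual_contrastive_loss(region_feat, clue_feat, inference_feat, alpha=0.5):
    # Pull the visual feature toward the literal clue text and the
    # abductive inference text at the same time (alpha is an assumption).
    return (alpha * info_nce(region_feat, clue_feat)
            + (1 - alpha) * info_nce(region_feat, inference_feat))

The second sketch illustrates the Map Adapter idea described above: a residual bias on the attention map, computed from small trainable query and key projections and added to the frozen backbone's attention logits. The low-rank size and the exact insertion point are assumptions.

import torch.nn as nn

class MapAdapter(nn.Module):
    # Adds a learnable bias to (single-head) attention logits via extra
    # low-rank query/key projections, steering where frozen attention looks.
    def __init__(self, dim, rank=64):
        super().__init__()
        self.q_proj = nn.Linear(dim, rank, bias=False)
        self.k_proj = nn.Linear(dim, rank, bias=False)
        self.scale = rank ** -0.5

    def forward(self, tokens, frozen_attn_logits):
        # tokens: (batch, num_tokens, dim)
        # frozen_attn_logits: (batch, num_tokens, num_tokens)
        q = self.q_proj(tokens)
        k = self.k_proj(tokens)
        delta = (q @ k.transpose(-2, -1)) * self.scale
        return frozen_attn_logits + delta  # steered attention logits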
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Our work is relevant to multimedia foundation model research. It focuses on a new multimodal retrieval problem, visual abductive reasoning, which retrieves textual hypotheses given partial visual observations. Our framework connects the visual and textual modalities with a newly designed hybrid "prompt + adapter" tuning scheme, which falls within the scope of multimedia research.
Supplementary Material: zip
Submission Number: 396