Robust Reasoning with Contextualized Visual Representation Learning

ACL ARR 2025 May Submission 3546 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Visual question answering (VQA) requires vision-language models (VLMs) to reason over images and answer questions about diverse details of, and inferences from, those images. Typically, VLMs use pre-trained vision encoders to map visual inputs to feature representations and fuse these representations with large language models (LLMs), which generate responses to questions. However, such query-agnostic visual representations reflect only a static set of features of the visual input, which hinders VLMs from robustly responding to queries about out-of-distribution (OOD) features. To address this challenge, we propose fusing the query into early-stage vision encoding as additional context, enabling models to learn context-aware visual representations that flexibly adapt to different queries. Our contextualized vision transformer, C-ViT, learns this early fusion of vision and context via a fine-grained curriculum learning scheme based on a novel Contextual Vision-Inference Alignment (CVIA) dataset. We apply C-ViT to two VLM architectures, and results on both demonstrate that C-ViT effectively improves the reasoning robustness of VLMs, particularly when generalizing to OOD VQA data.
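To make the early-fusion idea concrete, the following is a minimal, hypothetical sketch (in PyTorch) of a vision transformer block whose image patch tokens cross-attend to embedded question tokens, so the resulting visual features become query-dependent. All names and dimensions here (ContextualViTBlock, d_model, the 196-patch / 16-token shapes) are illustrative assumptions, not the paper's actual C-ViT implementation or its CVIA-based curriculum training.

# Hypothetical sketch of early fusion of query context into a ViT block.
# Assumes PyTorch; names and sizes are illustrative, not from the paper.
import torch
import torch.nn as nn

class ContextualViTBlock(nn.Module):
    """Transformer block whose patch tokens also cross-attend to query-text tokens."""
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, patch_tokens: torch.Tensor, query_tokens: torch.Tensor) -> torch.Tensor:
        # Standard self-attention over image patch tokens.
        x = patch_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Early fusion: patch tokens attend to the embedded question tokens,
        # so the visual representation is conditioned on the query.
        h = self.norm2(x)
        x = x + self.cross_attn(h, query_tokens, query_tokens, need_weights=False)[0]
        # Position-wise feed-forward network.
        return x + self.mlp(self.norm3(x))

# Usage: a batch of 2 images with 196 patch tokens and 16 question tokens.
block = ContextualViTBlock()
patches = torch.randn(2, 196, 768)
question = torch.randn(2, 16, 768)
out = block(patches, question)
print(out.shape)  # torch.Size([2, 196, 768])

In this sketch, stacking such blocks in the early layers of the encoder is what would make the learned visual representation query-aware, in contrast to a query-agnostic encoder that produces the same features regardless of the question.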
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: visual question answering, multimodality
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Keywords: Visual Question Answering, Vision-Language Models, Early Fusion, Visual Representation
Submission Number: 3546