Keywords: Visual Understanding, Mathematical Reasoning, In-context Learning
TL;DR: Testing and Enhancing Visual Capabilities of Multimodal Models
Abstract: Vision-language models (VLMs) have transformed tasks requiring visual and reasoning abilities, such as image retrieval and visual question answering (VQA). Despite their success, VLMs still struggle with tasks involving geometric reasoning, algebraic problem solving, and counting. These limitations stem from difficulties in effectively integrating multiple modalities and accurately interpreting such tasks. We propose an efficient, question-driven image-captioning pipeline to enhance visual question answering in mathematical contexts. Our method extracts keywords from the question, generates a targeted caption for each image-question pair using those keywords, and uses the caption as a prompt for question answering. We further propose supplying task-specific guidance as an "approach" to enhance both the captioning and VQA stages. Additionally, we evaluate the robustness of these models against adversarial prompts to ensure that our captioning-based approach does not substantially compromise robustness. Our pipeline is tested on diverse math-related and visual reasoning tasks across multiple datasets and VLMs.
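The sketch below illustrates one way the described pipeline could be wired together: keyword extraction from the question, a keyword-focused captioning call, and a final QA prompt that includes the caption and optional task-specific guidance. The helper names, the stopword list, and the generic `vlm` callable are assumptions for illustration, not the authors' actual implementation.

```python
import re
from typing import Callable

# Minimal stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "what", "how", "many"}


def extract_keywords(question: str) -> list[str]:
    """Pull content words from the question to steer the caption."""
    tokens = re.findall(r"[a-zA-Z]+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]


def question_driven_caption(image, question: str,
                            vlm: Callable[[object, str], str]) -> str:
    """Generate a caption focused on the keywords found in the question."""
    keywords = extract_keywords(question)
    caption_prompt = "Describe the image, focusing on: " + ", ".join(keywords) + "."
    return vlm(image, caption_prompt)


def answer_with_caption(image, question: str,
                        vlm: Callable[[object, str], str],
                        approach_hint: str = "") -> str:
    """Answer the question using the targeted caption (and optional
    task-specific 'approach' guidance) as added context."""
    caption = question_driven_caption(image, question, vlm)
    qa_prompt = (
        f"Image description: {caption}\n"
        + (f"Approach: {approach_hint}\n" if approach_hint else "")
        + f"Question: {question}\nAnswer:"
    )
    return vlm(image, qa_prompt)
```

In practice, `vlm` would wrap whatever multimodal model is being evaluated (taking an image and a text prompt and returning text), so the same pipeline can be reused across the datasets and VLMs mentioned above.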
Submission Number: 2