Captioning and Task-Specific Prompting for Improved VLM Performance

AAAI 2025 Workshop NeurMAD Submission 2 Authors

15 Nov 2024 (modified: 30 Dec 2024) · AAAI 2025 Workshop NeurMAD Submission · CC BY 4.0
Keywords: Visual Understanding, Mathematical Reasoning, In-context Learning
TL;DR: Testing and Enhancing Visual Capabilities of Multimodal Models
Abstract: Vision-language models (VLMs) have transformed tasks requiring visual and reasoning abilities, such as image retrieval and visual question answering (VQA). Despite their success, VLMs face significant challenges with tasks involving geometric reasoning, algebraic problem-solving, and counting. These limitations stem from difficulties in effectively integrating multiple modalities and accurately interpreting such tasks. We propose an efficient, question-driven image captioning pipeline to enhance visual question answering abilities in mathematical contexts. Our method extracts keywords from the question, generates targeted captions for each image-question pair using those keywords, and uses the caption as a prompt for question answering. We further propose utilizing task-specific guidance as an “approach” to enhance the VQA and captioning process. Additionally, we evaluate the robustness of these models against adversarial prompts to ensure that our captioning-based approach does not substantially compromise robustness. Our pipeline is tested on diverse math-related and visual reasoning tasks across multiple datasets and VLMs.
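The pipeline described in the abstract (keyword extraction, question-driven captioning, then caption-conditioned QA with a task-specific “approach” hint) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `vlm_generate` callable, the stopword-based keyword extractor, and all prompt templates are assumptions standing in for whatever VLM interface and extraction method the paper actually uses.

```python
from typing import Callable

# Hypothetical stopword list for naive keyword extraction (assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "on",
             "what", "how", "many", "which", "does", "do"}


def extract_keywords(question: str) -> list[str]:
    """Keep non-stopword tokens from the question as caption-guiding keywords."""
    tokens = (w.strip("?.,!").lower() for w in question.split())
    return [t for t in tokens if t and t not in STOPWORDS]


def question_driven_caption(image, question: str, vlm_generate: Callable) -> str:
    """Generate a caption focused on the parts of the image relevant to the question."""
    keywords = extract_keywords(question)
    caption_prompt = "Describe the image, focusing on: " + ", ".join(keywords) + "."
    return vlm_generate(image=image, prompt=caption_prompt)


def answer_with_caption(image, question: str, approach: str,
                        vlm_generate: Callable) -> str:
    """Answer the question using the targeted caption plus a task-specific 'approach' hint."""
    caption = question_driven_caption(image, question, vlm_generate)
    qa_prompt = (
        f"Image caption: {caption}\n"
        f"Approach: {approach}\n"   # task-specific guidance, e.g. "count objects one by one"
        f"Question: {question}\n"
        f"Answer:"
    )
    return vlm_generate(image=image, prompt=qa_prompt)
```

In this sketch, `vlm_generate(image=..., prompt=...)` is a placeholder for any multimodal model call; swapping in a real keyword extractor or model API changes only those two pieces while preserving the caption-then-answer structure the abstract describes.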
Submission Number: 2