Keywords: vision-language models, large language models, visual question answering
TL;DR: We propose a training-free visual cropping method that leverages MLLM-internal representations for VQA tasks focusing on small details, achieving strong performance with significantly higher efficiency than prior methods.
Abstract: While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text inputs, Visual Question Answering (VQA) that focuses on small image details remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task on the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and three types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3–6.5× less compute.
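To make the region-ranking step of the abstract concrete, here is a minimal illustrative sketch (not the authors' implementation): given a patch-level relevance map, which in FOCUS would be derived from the MLLM's KV cache, candidate crops are scored and the top-ranked one is passed to the fine-grained VQA step. The function name, the box format, and the toy map below are all hypothetical.

```python
import numpy as np

def rank_candidate_regions(relevance_map, candidates):
    """Rank candidate crop boxes by mean relevance.

    relevance_map: 2D array over image patches (stand-in for a
        KV-cache-derived object relevance map).
    candidates: list of (x0, y0, x1, y1) boxes in patch coordinates.
    Returns the boxes sorted from most to least relevant.
    """
    scores = []
    for (x0, y0, x1, y1) in candidates:
        region = relevance_map[y0:y1, x0:x1]
        scores.append(region.mean() if region.size else 0.0)
    order = np.argsort(scores)[::-1]  # highest mean relevance first
    return [candidates[i] for i in order]

# Toy usage: a 24x24 patch grid with a small "hot" detail in the upper-right quadrant.
rng = np.random.default_rng(0)
rel = rng.random((24, 24)) * 0.1
rel[5:8, 15:18] += 1.0  # the small detail the question asks about
boxes = [(0, 0, 12, 12), (12, 0, 24, 12), (0, 12, 12, 24), (12, 12, 24, 24)]
print(rank_candidate_regions(rel, boxes)[0])  # -> (12, 0, 24, 12)
```

The top-ranked box would then be cropped from the original image and fed back to the MLLM together with the question, which is the final step described in the abstract.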
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 23842