Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping

ICLR 2026 Conference Submission 13292 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026, ICLR 2026, CC BY 4.0
Keywords: Multimodal LLMs, Vision-Language Models, Fine-Grained Visual Grounding, Image Warping
TL;DR: We warp images using the model’s own attention so it “looks closer” at important parts, boosting accuracy without changing the model.
Abstract: Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content and compresses less informative areas while preserving global context. At test time, the approach uses an MLLM's cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across ten benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU, MIA-Bench, MMVP, VQAv2, RealWorldQA, BLINK) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.
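To make the core idea of attention-guided rectilinear warping concrete, the sketch below shows one possible implementation under simplifying assumptions; it is not the authors' code. It assumes a precomputed (H, W) attention map (e.g., cross-modal attention aggregated over layers and heads and resized to the image), and the function name `rectilinear_warp`, the `strength` mixing parameter, and the nearest-neighbor resampling are illustrative choices. The key step is separable: marginal attention profiles along each axis define coordinate remappings via inverse cumulative distributions, so high-attention rows and columns receive more output pixels while the full image content and layout are retained.

```python
import numpy as np
from PIL import Image

def rectilinear_warp(image, attention, strength=1.0, eps=1e-6):
    """Illustrative attention-guided rectilinear warp (not the paper's code).

    image     : PIL RGB image of size (W, H)
    attention : non-negative (H, W) saliency map aligned with the image
    strength  : 0.0 keeps the image unchanged; 1.0 warps purely by attention
    """
    img = np.asarray(image, dtype=np.float32)
    H, W = attention.shape

    # Marginal attention profiles along each axis; a uniform floor (eps and
    # the strength mix) keeps every row/column from collapsing to zero size.
    col_profile = attention.sum(axis=0) + eps
    row_profile = attention.sum(axis=1) + eps
    col_profile = (1.0 - strength) / W + strength * col_profile / col_profile.sum()
    row_profile = (1.0 - strength) / H + strength * row_profile / row_profile.sum()

    # Cumulative distributions over pixel edges define a separable mapping:
    # output coordinates are spaced uniformly in CDF space, so source regions
    # with high attention (steep CDF) are sampled densely, i.e., magnified.
    col_cdf = np.concatenate([[0.0], np.cumsum(col_profile)])
    row_cdf = np.concatenate([[0.0], np.cumsum(row_profile)])
    col_cdf /= col_cdf[-1]
    row_cdf /= row_cdf[-1]

    out_u = (np.arange(W) + 0.5) / W          # output column centers in [0, 1]
    out_v = (np.arange(H) + 0.5) / H          # output row centers in [0, 1]
    src_x = np.interp(out_u, col_cdf, np.arange(W + 1)) - 0.5  # inverse CDF
    src_y = np.interp(out_v, row_cdf, np.arange(H + 1)) - 0.5

    # Nearest-neighbor gather keeps the sketch short; bilinear sampling would
    # be the natural refinement in practice.
    xi = np.clip(np.round(src_x).astype(int), 0, W - 1)
    yi = np.clip(np.round(src_y).astype(int), 0, H - 1)
    warped = img[yi][:, xi]
    return Image.fromarray(warped.astype(np.uint8))
```

In this hypothetical usage, the warped image would simply replace the original input to the same MLLM, with the question unchanged; because the warp is rectilinear, axis-aligned structure (text lines, tables, object layouts) is stretched or compressed but not sheared.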
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13292