IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models

ACL ARR 2026 January Submission3882 Authors

04 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: IRIS, eye-tracking, human gaze, fixations, scanpaths, multimodal large language models, MLLM, vision-language models, VLM, visual question answering, VQA, referential ambiguity, disambiguation, grounding, referent grounding, gaze-informed VQA, human-in-the-loop, inference-time steering, training-free method, ambiguity resolution, attention alignment, visual attention, saliency, speech-aligned gaze, temporal dynamics, fixation filtering, real-time protocol, interactive VQA, benchmark dataset, evaluation suite, embedding-based similarity, open-ended generation, ambiguous questions, unambiguous questions, multimodal reasoning, instruction following, zero-shot generalization, AR/VR eye tracking, cognitive signals for AI, human attention signals, gaze augmentation, dataset release

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in Visual Question Answering (VQA), yet they often struggle with referential ambiguity, where multiple objects in an image could satisfy a given query. We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that the fixations closest to the moment participants start verbally asking their questions are the most informative for disambiguation in Large VLMs, more than doubling response accuracy on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs and observe consistent improvements when gaze data is incorporated for ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset for using eye-movement data to disambiguate VQA, a novel real-time interactive protocol, and an evaluation suite.
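
The abstract's central finding, that fixations nearest the onset of the spoken question are the most informative, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Fixation` record, the `fixations_near_onset` helper, the use of fixation midpoints, and the cutoff `k` are all hypothetical choices made for illustration, since the abstract does not specify how fixations are represented or filtered.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    # Hypothetical record layout; the abstract does not define one.
    x: float        # horizontal gaze position in image pixels
    y: float        # vertical gaze position in image pixels
    start_s: float  # fixation start time in seconds
    end_s: float    # fixation end time in seconds


def fixations_near_onset(fixations, speech_onset_s, k=3):
    """Return the k fixations whose midpoints lie closest in time to speech onset.

    Assumption: temporal distance is measured from the fixation midpoint.
    The result is returned in chronological order.
    """
    def distance(f):
        return abs((f.start_s + f.end_s) / 2 - speech_onset_s)

    nearest = sorted(fixations, key=distance)[:k]
    return sorted(nearest, key=lambda f: f.start_s)


# Illustrative values only, not data from the study.
scanpath = [
    Fixation(10, 10, 0.0, 0.2),
    Fixation(50, 60, 0.9, 1.1),
    Fixation(200, 40, 1.2, 1.4),
    Fixation(300, 300, 3.0, 3.2),
]
selected = fixations_near_onset(scanpath, speech_onset_s=1.0, k=2)
```

The selected gaze points could then be passed to a VLM alongside the image and question, for instance as marked coordinates; how IRIS actually conditions the model on gaze is not described in the abstract.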

Paper Type: Long

Research Area: Human-AI Interaction/Cooperation and Human-Centric NLP

Research Area Keywords: Human-Centered NLP, Question Answering

Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis

Languages Studied: English

Submission Number: 3882
