Gaze-to-text Generation: Beyond Categorical Decoding of Human Attention

ICLR 2026 Conference Submission 14747 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Human Attention, Multimodal LLMs, Instruction Tuning
TL;DR: We propose Gazette, a multimodal LLM-based framework instruction-tuned with our specialized primary and auxiliary task instructions to generate natural language that effectively decodes goal-directed human attention.
Abstract: We introduce a novel learning problem: decoding gaze into natural language descriptions of human goals across diverse visual tasks. Unlike prior work, which frames gaze decoding as a classification task over predefined categories, we formulate it as a generative learning problem: training a model to produce free-form descriptions that capture the rich context, nuance, and open-ended nature of human intentions beyond fixed labels. To this end, we introduce Gazette, the first gaze-to-text decoding framework. Built on multimodal large language models (MLLMs), Gazette learns to decode gaze scanpaths into natural language descriptions of goals that extend beyond categorical labels and require articulation in free-form text. To help Gazette filter out individual differences in gaze behavior and learn the goal-specific spatiotemporal dynamics crucial for generating accurate goal descriptions, we propose a novel strategy that leverages the encyclopedic knowledge and reasoning abilities of a large language model to synthesize think-aloud transcripts: natural language explanations of goal-directed attentional behavior. Instruction tuning on these synthetic narratives enables Gazette to achieve state-of-the-art gaze-decoding performance across multiple tasks, demonstrating its generalizability and versatility and allowing gaze to serve as a powerful, non-intrusive cue for inferring human goals and intentions in diverse scenarios.
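To make the gaze-to-text setup concrete, below is a minimal, illustrative sketch of how a fixation scanpath might be serialized into a chat-style instruction-tuning record that pairs gaze with a free-form goal description. The field names, prompt wording, and fixation format are assumptions for illustration, not the authors' actual data schema or the Gazette pipeline.

```python
# Illustrative sketch: serializing a gaze scanpath into an instruction-tuning
# example for a gaze-to-text MLLM. All names and formats are hypothetical.
import json


def scanpath_to_example(fixations, image_id, goal_description):
    """Turn (x, y, duration_ms) fixations into a chat-style training record."""
    gaze_str = "; ".join(
        f"fixation {i + 1}: x={x:.0f}, y={y:.0f}, {d} ms"
        for i, (x, y, d) in enumerate(fixations)
    )
    return {
        "image": image_id,
        "conversations": [
            {
                "role": "user",
                "content": (
                    "Given the image and this gaze scanpath, describe the "
                    f"viewer's goal in natural language.\nScanpath: {gaze_str}"
                ),
            },
            # Target output: a free-form goal description rather than a class label.
            {"role": "assistant", "content": goal_description},
        ],
    }


if __name__ == "__main__":
    example = scanpath_to_example(
        fixations=[(412, 230, 180), (455, 241, 320), (610, 390, 250)],
        image_id="kitchen_0042.jpg",
        goal_description=(
            "The viewer appears to be searching for a coffee mug near the "
            "sink before checking the countertop."
        ),
    )
    print(json.dumps(example, indent=2))
```

The key design point this sketch highlights is that the supervision target is open-ended text rather than a category index, which is what distinguishes the generative formulation from prior classification-based gaze decoding.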
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14747