LogitGaze: Predicting Human Attention Using Semantic Information from Vision-Language Models

Published: 05 Mar 2025, Last Modified: 19 Mar 2025 · Reasoning and Planning for LLMs @ ICLR 2025 · CC BY 4.0
Keywords: Scanpath prediction, visual attention, vision-language models, LLMs, multimodal reasoning, semantic priors, gaze modeling, logit-lens, fixation prediction, interpretability
TL;DR: We enhance scanpath prediction by integrating semantic cues from VLMs. This multimodal approach refines fixation modeling, improves prediction accuracy by ~15%, and deepens our understanding of visual attention dynamics.
Abstract:

Modeling human scanpaths remains a challenging task due to the complexity of visual attention dynamics. Traditional approaches rely on low-level visual features and often fail to capture the semantic and contextual factors that guide human gaze. To address this, we propose a novel method that integrates large language models (LLMs) and vision-language models (VLMs) to enrich scanpath prediction with semantic priors. By leveraging word-level representations extracted through interpretability tools such as the logit lens, our approach aligns spatiotemporal gaze patterns with high-level scene semantics. Our method establishes a new state of the art, improving all key scanpath prediction metrics by approximately 15% on average and demonstrating the effectiveness of integrating linguistic and visual knowledge for gaze modeling.
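To make the logit-lens idea concrete, below is a minimal sketch (not the paper's implementation) of how word-level semantic cues can be read out from a VLM's visual tokens: hidden states at the image-patch positions are projected through the model's final norm and unembedding head, and the top-k vocabulary tokens per patch serve as semantic labels. The argument names (`hidden_states`, `final_norm`, `lm_head`, `tokenizer`) are assumptions about a generic LLaVA-style architecture, not identifiers from the paper.

```python
import torch

@torch.no_grad()
def logit_lens_patch_labels(hidden_states: torch.Tensor,
                            final_norm: torch.nn.Module,
                            lm_head: torch.nn.Module,
                            tokenizer,
                            top_k: int = 3):
    """Logit-lens readout: map patch hidden states to word-level semantic cues.

    hidden_states: [num_patches, d_model] activations at the image-token positions,
                   taken from some intermediate (or final) decoder layer.
    final_norm, lm_head: the VLM's output normalization and unembedding layers.
    Returns, for each patch, the top-k (token, probability) pairs.
    """
    logits = lm_head(final_norm(hidden_states))      # [num_patches, vocab_size]
    probs = logits.softmax(dim=-1)
    top_p, top_ids = probs.topk(top_k, dim=-1)       # [num_patches, top_k]
    return [
        [(tokenizer.decode([tid]).strip(), float(p)) for tid, p in zip(ids, ps)]
        for ids, ps in zip(top_ids.tolist(), top_p.tolist())
    ]
```

Such per-patch token distributions can then be rasterized back onto the image grid as a semantic prior map and combined with low-level saliency when predicting fixation sequences; how exactly the two signals are fused is specific to the paper's model and is not shown here.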

Submission Number: 30