Modeling human scanpaths remains challenging due to the complexity of visual attention dynamics. Traditional approaches rely on low-level visual features, which often fail to capture the semantic and contextual factors that guide human gaze. To address this, we propose a novel method that integrates large language models (LLMs) and vision-language models (VLMs) to enrich scanpath prediction with semantic priors. By leveraging word-level representations extracted through interpretability tools such as the logit lens, our approach aligns spatio-temporal gaze patterns with high-level scene semantics. Our method establishes a new state of the art, improving all key scanpath prediction metrics by approximately 15% on average and demonstrating the effectiveness of integrating linguistic and visual knowledge for gaze modeling.
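To make the logit-lens idea concrete, below is a minimal sketch of how intermediate hidden states can be decoded into word-level predictions by reusing a model's final layer norm and unembedding matrix. This is an illustration only, not the paper's pipeline: the model name (`gpt2`), the example sentence, and the choice to inspect the last token position are all assumptions for demonstration.

```python
# Minimal logit-lens sketch (illustrative; not the paper's actual method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any causal LM with an lm_head works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

text = "A person looks at the red car in the street"  # hypothetical example input
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of (num_layers + 1) tensors,
# each of shape [batch, seq_len, hidden_dim].
for layer_idx, h in enumerate(outputs.hidden_states):
    # Logit lens: apply the final layer norm and the unembedding matrix to an
    # intermediate hidden state, reading it as a distribution over the vocabulary.
    h_normed = model.transformer.ln_f(h)
    logits = model.lm_head(h_normed)            # [batch, seq_len, vocab_size]
    top_token = logits[0, -1].argmax().item()   # most likely word at the last position
    print(f"layer {layer_idx:2d}: {tokenizer.decode(top_token)!r}")
```

Each printed line shows which word the model "has in mind" at a given layer; such per-layer word-level readouts are the kind of semantic signal that could be aligned with gaze targets in a scanpath model.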