Effective Text-Directed Scanpath Prediction via Comprehensive Multi-modal Information Fusion

Published: 2025 · Last Modified: 04 Feb 2026 · PRCV 2025 · CC BY-SA 4.0
Abstract: Text-directed scanpath prediction is the task of forecasting the sequence of visual attention shifts (the scanpath) that an observer exhibits while viewing a visual scene and simultaneously receiving a linguistic description of it. In this setting, effectively fusing multi-modal information—particularly textual and visual cues—is essential for accurate prediction. In this paper, we propose a novel model, the Text-directed Scanpath Prediction Transformer (TSPT), which deeply integrates linguistic and visual features to improve scanpath prediction. Specifically, TSPT introduces a comprehensive multi-level feature fusion strategy and a mirrored encoder-decoder architecture, which together enable fine-grained cross-modal interactions and joint modeling of scanpath prediction and text description generation. Extensive experiments on multiple benchmark datasets demonstrate that our approach achieves state-of-the-art performance, validating its effectiveness and robustness.
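The abstract describes cross-modal fusion of textual and visual features as the core mechanism. As a rough illustration only (the paper's actual architecture is not reproduced here), the sketch below shows one common form of such fusion: caption-token features act as attention queries over visual patch features, producing text-conditioned representations that a downstream decoder could use for fixation prediction. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fuse(text_feats, vis_feats):
    """Illustrative cross-attention fusion (not TSPT itself):
    text tokens (queries) attend over visual patches (keys/values),
    with a residual connection back to the text features."""
    d = text_feats.shape[-1]
    attn = softmax(text_feats @ vis_feats.T / np.sqrt(d), axis=-1)
    return text_feats + attn @ vis_feats

rng = np.random.default_rng(0)
text_feats = rng.standard_normal((5, 64))    # 5 caption tokens, dim 64
vis_feats = rng.standard_normal((49, 64))    # 7x7 grid of visual patches
fused = cross_modal_fuse(text_feats, vis_feats)
print(fused.shape)  # one fused vector per text token: (5, 64)
```

A full model would stack several such fusion layers at multiple feature levels and feed the result to a scanpath decoder; this snippet only conveys the single-layer idea.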