Abstract: Text-directed scanpath prediction refers to forecasting the sequence of visual attention shifts (scanpaths) that observers exhibit while viewing a visual scene and simultaneously receiving a linguistic description of it. In this setting, the effective fusion of multi-modal information, particularly textual and visual cues, is essential for accurate prediction. In this paper, we propose a novel model, the Text-directed Scanpath Prediction Transformer (TSPT), which exploits deep integration of language and visual features to improve scanpath prediction. Specifically, TSPT introduces a comprehensive multi-level feature fusion strategy and a mirrored encoder-decoder architecture, which together enable fine-grained cross-modal interactions and the joint modeling of scanpath prediction and text description generation. Extensive experiments on multiple benchmark datasets demonstrate that our approach achieves state-of-the-art performance, validating its effectiveness and robustness.
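Since the abstract does not spell out the architecture, the following PyTorch sketch is only one plausible interpretation of "multi-level feature fusion" (repeated cross-attention between visual and textual tokens) and a "mirrored" decoding scheme (twin decoders over a shared cross-modal memory, one predicting fixations and one predicting words). All names (`CrossModalFusionLayer`, `TSPTSketch`), dimensions, and output heads are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a TSPT-style model; details are assumptions, not the paper's code.
import torch
import torch.nn as nn


class CrossModalFusionLayer(nn.Module):
    """One fusion level: visual tokens attend to text tokens, and vice versa."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.vis_to_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_to_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_t = nn.LayerNorm(d_model)

    def forward(self, vis, txt):
        v, _ = self.vis_to_txt(vis, txt, txt)   # visual queries, text keys/values
        t, _ = self.txt_to_vis(txt, vis, vis)   # text queries, visual keys/values
        return self.norm_v(vis + v), self.norm_t(txt + t)


class TSPTSketch(nn.Module):
    """Mirrored decoders over a shared memory: one head for fixations, one for text."""

    def __init__(self, vocab_size: int = 10000, d_model: int = 256,
                 n_levels: int = 3, max_fixations: int = 16):
        super().__init__()
        self.txt_embed = nn.Embedding(vocab_size, d_model)
        self.vis_proj = nn.Linear(2048, d_model)  # e.g. CNN/ViT patch features (assumed dim)
        self.fusion = nn.ModuleList(
            CrossModalFusionLayer(d_model) for _ in range(n_levels))
        # nn.TransformerDecoder deep-copies its layer, so each decoder gets its own weights.
        self.scanpath_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, 8, batch_first=True), num_layers=2)
        self.caption_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, 8, batch_first=True), num_layers=2)
        self.fix_queries = nn.Parameter(torch.randn(max_fixations, d_model))
        self.fix_head = nn.Linear(d_model, 3)     # (x, y, duration) per fixation (assumed)
        self.word_head = nn.Linear(d_model, vocab_size)

    def forward(self, vis_feats, txt_tokens):
        # vis_feats: (B, N_patches, 2048); txt_tokens: (B, L) token ids.
        # Positional encodings are omitted here for brevity.
        vis = self.vis_proj(vis_feats)
        txt = self.txt_embed(txt_tokens)
        for layer in self.fusion:                 # multi-level cross-modal fusion
            vis, txt = layer(vis, txt)
        memory = torch.cat([vis, txt], dim=1)     # joint cross-modal memory
        q = self.fix_queries.unsqueeze(0).expand(vis.size(0), -1, -1)
        fixations = self.fix_head(self.scanpath_decoder(q, memory))
        words = self.word_head(self.caption_decoder(txt, memory))
        return fixations, words


if __name__ == "__main__":
    model = TSPTSketch()
    fix, words = model(torch.randn(2, 49, 2048), torch.randint(0, 10000, (2, 12)))
    print(fix.shape, words.shape)  # (2, 16, 3), (2, 12, 10000)
```

In this reading, the scanpath branch regresses a fixed-length fixation sequence from learned queries while the caption branch mirrors it as a language-modeling head, so both tasks are trained against the same fused representation; the actual TSPT design may differ.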
External IDs: dblp:conf/prcv/LiuQ25