Abstract: Goal-directed scanpath prediction aims to predict the sequence of gaze shifts people make when searching for a target object in a visual scene. Most existing goal-directed scanpath prediction methods cannot generalize to target classes not present during training. In addition, they usually use different pre-trained models to extract features for the target prompt and the image, which creates a large gap between the two feature spaces and makes subsequent feature matching and fusion difficult. To address these problems, we propose a novel zero-shot goal-directed scanpath prediction model named CLIPGaze. We use CLIP to extract pre-matched features for the target prompt and the input image, which makes feature fusion easier to perform. Using a large model such as CLIP also improves the model’s ability to generalize to target classes not present during training. We propose a hierarchical visual-semantic feature fusion module to fuse the target and image features more comprehensively. Furthermore, because goal-directed scanpath datasets contain a limited number of classes, we employ image segmentation as a proxy task to help train the feature fusion module, significantly enhancing our model’s performance in the zero-shot setting. Extensive experiments demonstrate the effectiveness of our method on both seen and unseen target classes.
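To illustrate why pre-matched features simplify matching, here is a minimal sketch, not the CLIPGaze fusion module itself. It assumes the target prompt and each image patch have already been embedded into one shared space, as a CLIP-style model would do. The vectors are toy values, and the function names `l2_normalize` and `target_similarity_map` are hypothetical. Under that assumption, a plain cosine similarity already gives a rough target-relevance score per patch.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def target_similarity_map(text_feature, patch_features):
    """Cosine similarity between one target embedding and each patch embedding.

    Assumption: both inputs live in the same embedding space (for example,
    produced by the text and image encoders of a CLIP-style model).
    """
    t = l2_normalize(text_feature)
    return [
        sum(a * b for a, b in zip(t, l2_normalize(p)))
        for p in patch_features
    ]


# Toy 3-dimensional embeddings standing in for real encoder outputs.
text_feature = [1.0, 0.0, 0.0]
patch_features = [
    [2.0, 0.0, 0.0],  # same direction as the target embedding
    [0.0, 3.0, 0.0],  # orthogonal to the target embedding
    [1.0, 1.0, 0.0],  # partially aligned with the target embedding
]

similarity = target_similarity_map(text_feature, patch_features)
# Index of the patch most similar to the target prompt.
best_patch = max(range(len(similarity)), key=similarity.__getitem__)
```

When the two encoders are trained separately, their outputs do not share a space, so a direct comparison like this carries no meaning and an alignment step has to be learned first. That is the feature gap the abstract refers to.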
External IDs:dblp:conf/icassp/LaiQ0Q25