Abstract: Most models of visual attention aim to predict either top-down or bottom-up control, as studied using different visual search and free-viewing tasks. In this paper we propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control. HAT uses a novel transformer-based architecture and a simplified foveated retina that collectively create a spatiotemporal awareness akin to the dynamic visual working memory of humans. HAT not only establishes a new state-of-the-art in predicting the scanpath of fixations made during target-present and target-absent visual search and “taskless” free viewing, but also makes human gaze behavior interpretable. Unlike previous methods that rely on a coarse grid of fixation cells and suffer information loss from fixation discretization, HAT features a sequential dense prediction architecture and outputs a dense heatmap for each fixation, thereby avoiding discretizing fixations. HAT sets a new standard in computational attention, one that emphasizes effectiveness, generality, and interpretability. HAT's demonstrated scope and applicability will likely inspire the development of new attention models that can better predict human behavior in various attention-demanding scenarios. Code is available at https://github.com/cvlab-stonybrook/HAT.
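
To make the contrast with grid-based approaches concrete, the sketch below illustrates how a dense per-fixation heatmap can be decoded into continuous fixation coordinates, whereas a coarse fixation grid snaps each prediction to a cell center. This is a minimal conceptual sketch, not the authors' implementation: the tensor shapes, the soft-argmax readout, and the 16-pixel cell size are illustrative assumptions; the actual HAT decoding is described in the paper and repository.

```python
# Conceptual sketch (illustrative only): decoding one predicted fixation
# from a dense heatmap vs. from a coarse grid of fixation cells.
# Shapes and readouts are assumptions, not HAT's actual code.
import torch


def soft_argmax(heatmap: torch.Tensor) -> tuple[float, float]:
    """Read out continuous (x, y) pixel coordinates from a dense heatmap.

    heatmap: (H, W) tensor of unnormalized scores for one fixation.
    The expectation over the softmax-normalized map yields sub-cell
    coordinates, so no discretization is imposed.
    """
    h, w = heatmap.shape
    probs = torch.softmax(heatmap.flatten(), dim=0).reshape(h, w)
    ys = torch.linspace(0, h - 1, h)
    xs = torch.linspace(0, w - 1, w)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    x = (probs * grid_x).sum()
    y = (probs * grid_y).sum()
    return x.item(), y.item()


def grid_readout(heatmap: torch.Tensor, cell: int = 16) -> tuple[float, float]:
    """Grid-based baseline style: pool scores into coarse cells and return
    the center of the best cell, which discretizes the fixation location."""
    h, w = heatmap.shape
    pooled = heatmap.reshape(h // cell, cell, w // cell, cell).mean(dim=(1, 3))
    idx = int(pooled.flatten().argmax())
    row, col = divmod(idx, pooled.shape[1])
    return (col + 0.5) * cell, (row + 0.5) * cell


if __name__ == "__main__":
    torch.manual_seed(0)
    hm = torch.randn(320, 512)  # one per-fixation heatmap (H, W)
    print("dense readout:", soft_argmax(hm))
    print("grid readout :", grid_readout(hm))
```

The point of the comparison is only that a dense heatmap preserves continuous fixation coordinates, while a coarse cell grid can represent a fixation no more precisely than its cell size allows.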