ScanTD: 360° Scanpath Prediction based on Time-Series Diffusion

Published: 20 Jul 2024, Last Modified: 05 Aug 2024 · MM 2024 Oral · CC BY 4.0
Abstract: Scanpath generation in 360° images aims to model the realistic trajectories of gaze points that viewers follow when exploring panoramic environments. Existing methods for scanpath generation suffer from several limitations, including a lack of global attention to the panoramic environment, insufficient diversity in the generated scanpaths, and inadequate consideration of the temporal order of gaze points. To address these challenges, we propose ScanTD, a conditional diffusion model-based approach that generates multiple scanpaths. Notably, a transformer-based time-series (TTS) module with a novel attention mechanism is integrated into ScanTD to effectively capture the temporal dependencies among gaze points. Additionally, ScanTD uses a Vision Transformer-based method for image feature extraction, enabling better learning of scene semantics. Experimental results demonstrate that our approach outperforms state-of-the-art methods on three datasets. We further demonstrate its generalizability by applying it to the 360° saliency detection task.
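To make the generation pipeline described in the abstract easier to picture, below is a minimal sketch of conditional diffusion sampling over a sequence of gaze coordinates, with a toy transformer denoiser conditioned on pooled image features. This is not the authors' released code: the class `ScanpathDenoiser`, the hyperparameters, the DDPM-style noise schedule, and the use of precomputed ViT features as a single conditioning vector are all illustrative assumptions.

```python
# Hedged sketch: conditional diffusion sampling for a (T, 2) gaze-point sequence.
# Hypothetical names and hyperparameters; the real ScanTD architecture differs.
import torch
import torch.nn as nn


class ScanpathDenoiser(nn.Module):
    """Toy transformer denoiser: predicts the noise on a (B, T, 2) gaze sequence."""

    def __init__(self, d_model=64, n_heads=4, n_layers=2, cond_dim=768, num_steps=1000):
        super().__init__()
        self.in_proj = nn.Linear(2, d_model)           # (lon, lat) -> d_model
        self.cond_proj = nn.Linear(cond_dim, d_model)  # image features -> d_model
        self.t_embed = nn.Embedding(num_steps, d_model)  # diffusion-step embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, 2)

    def forward(self, x_t, t, cond):
        # x_t: (B, T, 2) noisy gaze points, t: (B,) step indices, cond: (B, cond_dim)
        h = (self.in_proj(x_t)
             + self.t_embed(t)[:, None, :]
             + self.cond_proj(cond)[:, None, :])
        return self.out_proj(self.encoder(h))


@torch.no_grad()
def sample_scanpath(model, cond, seq_len=20, num_steps=1000):
    """DDPM-style ancestral sampling of one scanpath per conditioning vector."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(cond.shape[0], seq_len, 2)          # start from pure noise
    for step in reversed(range(num_steps)):
        t = torch.full((cond.shape[0],), step, dtype=torch.long)
        eps = model(x, t, cond)                          # predicted noise
        coef = betas[step] / torch.sqrt(1.0 - alpha_bars[step])
        mean = (x - coef * eps) / torch.sqrt(alphas[step])
        noise = torch.randn_like(x) if step > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[step]) * noise
    return x.clamp(-1.0, 1.0)                            # normalized (lon, lat) in [-1, 1]


if __name__ == "__main__":
    model = ScanpathDenoiser()
    vit_features = torch.randn(4, 768)   # placeholder for pooled ViT image features
    paths = sample_scanpath(model, vit_features)
    print(paths.shape)                   # torch.Size([4, 20, 2])
```

Sampling multiple times with the same conditioning vector yields different plausible scanpaths, which is how a conditional diffusion model naturally provides the diversity the abstract emphasizes.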
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Experience] Interactions and Quality of Experience
Relevance To Conference: Our research focuses on utilizing a diffusion model to generate scanpaths from 360° images. Scanpath prediction is critical for advancing multimedia processing and enhancing the capability of VR-based multimedia systems to provide better interactive experiences. By predicting scanpaths, i.e., sequences of gaze points that users are likely to focus on, our research offers an in-depth understanding of viewer interaction dynamics in immersive environments. Our method, ScanTD, advances this direction by employing a conditional diffusion model integrated with a transformer-based time-series module to capture the temporal and spatial patterns of human gaze behavior in panoramic environments. In VR-based multimedia systems, predicted scanpaths can also drive content rendering optimization and streamlined data transmission, leading to resource-efficient multimedia processing (see the sketch below). Additionally, because ScanTD accurately captures the temporal order of gaze points, it predicts not only where observers will look but also when they will look there, so it can guide the creation of more immersive and interactive VR content. Furthermore, the diversity of the scanpaths generated by ScanTD ensures that multiple realistic attention trajectories are available, reflecting the uniqueness of individual viewer interactions and further enriching the user experience.
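As a concrete (and purely illustrative) picture of the rendering/streaming use case mentioned above, the small sketch below maps a predicted scanpath of (longitude, latitude, time) points onto equirectangular viewport tiles to build a prefetch schedule. The 8×4 tiling, the function names, and the one-fetch-per-tile policy are assumptions for illustration and are not part of ScanTD.

```python
# Illustrative only: mapping predicted gaze points to equirectangular tiles
# for viewport-adaptive prefetching. The tiling layout is an assumption.
from typing import List, Tuple

TILE_COLS, TILE_ROWS = 8, 4  # hypothetical 8x4 tiling of the panorama


def gaze_to_tile(lon_deg: float, lat_deg: float) -> Tuple[int, int]:
    """Map a gaze point (longitude in [-180, 180], latitude in [-90, 90]) to a (row, col) tile."""
    col = min(int((lon_deg + 180.0) / 360.0 * TILE_COLS), TILE_COLS - 1)
    row = min(int((90.0 - lat_deg) / 180.0 * TILE_ROWS), TILE_ROWS - 1)
    return row, col


def prefetch_schedule(scanpath: List[Tuple[float, float, float]]) -> List[Tuple[float, Tuple[int, int]]]:
    """Turn a (lon, lat, time) scanpath into a time-ordered list of tiles to prefetch."""
    schedule, seen = [], set()
    for lon, lat, t in sorted(scanpath, key=lambda p: p[2]):
        tile = gaze_to_tile(lon, lat)
        if tile not in seen:          # fetch each tile once, at the earliest gaze time
            seen.add(tile)
            schedule.append((t, tile))
    return schedule


if __name__ == "__main__":
    predicted = [(-30.0, 10.0, 0.0), (-25.0, 12.0, 0.5), (60.0, -20.0, 1.5)]
    print(prefetch_schedule(predicted))  # [(0.0, (1, 3)), (1.5, (2, 5))]
```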
Supplementary Material: zip
Submission Number: 3355
