Gaze-Guided Multimodal LLMs for Social Scene Understanding

Published: 23 Sept 2025, Last Modified: 22 Nov 2025, LAW, CC BY 4.0
Keywords: Gaze following, Large language models (LLMs), Zero-shot learning, Human-centric AI, Scene description
TL;DR: We introduce GGVL, a zero-shot framework that integrates gaze estimation with vision-language models to produce semantic, socially aware scene descriptions and achieves state-of-the-art results.
Abstract: Understanding where a person is looking is fundamental to human communication and social interaction. In computer vision, this task, known as gaze following, predicts an individual's point of focus within an image. Existing methods typically estimate gaze as heatmaps or pixel coordinates, but these representations fail to capture the semantic meaning of the gaze target, limiting their value for deeper scene understanding. We introduce GGVL (\textbf{G}aze-\textbf{G}uided \textbf{V}ision-\textbf{L}anguage), a zero-shot framework for scene interpretation in static images. GGVL combines head detection, gaze estimation, and gaze-conditioned vision-language captioning. By leveraging the principle that gaze aligns with the most relevant elements of a scene, our framework generates more accurate and meaningful descriptions of what individuals are likely observing. It also produces holistic summaries of shared attention and overall scene activity, enabling richer social understanding. Comprehensive evaluation demonstrates the effectiveness of GGVL: it achieves state-of-the-art performance on two benchmark datasets for gaze target prediction, while qualitative results show that it often recognizes gaze targets more accurately and meaningfully than the ground-truth labels. In a user study, participants consistently preferred the gaze-guided captions produced by GGVL over those generated by baseline vision-language models. These findings highlight the value of integrating gaze into vision-language models to advance human-centric scene understanding.
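
The abstract describes GGVL as a three-stage pipeline (head detection, gaze estimation, gaze-conditioned captioning) but gives no implementation details. The following is a minimal Python sketch of that flow under stated assumptions: `detect_heads`, `estimate_gaze`, and `caption_region` are hypothetical placeholder interfaces standing in for the detector, gaze estimator, and vision-language captioner, and the fixed-size crop around the predicted gaze point is an illustrative heuristic, not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

from PIL import Image

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
Point = Tuple[int, int]           # (x, y) gaze target in pixels


@dataclass
class PersonGaze:
    head_box: Box          # detected head region
    gaze_point: Point      # predicted point of focus
    description: str       # gaze-conditioned caption for that point


def ggvl_pipeline(
    image: Image.Image,
    detect_heads: Callable[[Image.Image], List[Box]],      # stage 1 (assumed interface)
    estimate_gaze: Callable[[Image.Image, Box], Point],    # stage 2 (assumed interface)
    caption_region: Callable[[Image.Image, str], str],     # stage 3 (assumed interface)
    crop_size: int = 128,
) -> List[PersonGaze]:
    """Hypothetical sketch of the gaze-guided captioning flow from the
    abstract: detect heads, predict each person's gaze point, then ask a
    vision-language model to describe the region around that point."""
    half = crop_size // 2
    results: List[PersonGaze] = []
    for head_box in detect_heads(image):
        gx, gy = estimate_gaze(image, head_box)
        # Condition the captioner on the attended region rather than the
        # whole scene, per the abstract's "gaze-conditioned" stage.
        region = image.crop((
            max(gx - half, 0), max(gy - half, 0),
            min(gx + half, image.width), min(gy + half, image.height),
        ))
        prompt = "Describe the object or activity at the center of this image."
        results.append(PersonGaze(head_box, (gx, gy), caption_region(region, prompt)))
    return results
```

Under the same assumptions, the holistic summaries of shared attention mentioned in the abstract could be obtained by passing the collected per-person descriptions back to the language model with a scene-level summarization prompt.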
Submission Type: Research Paper (4-9 Pages)
NeurIPS Resubmit Attestation: This submission is not a resubmission of a NeurIPS 2025 submission.
Submission Number: 140