GazeVLM: Gaze-Guided Vision-Language Models for Efficient and Robust Inference

ICLR 2026 Conference Submission 7879 Authors

16 Sept 2025 (modified: 28 Nov 2025) · ICLR 2026 Conference Submission · Readers: Everyone · License: CC BY 4.0
Keywords: Efficient VLM, Gaze Guidance, Robust Preprocessing, Token Dropping, Human Computer Interaction
TL;DR: We propose a gaze-guided framework for efficient and robust VLM inference under a token budget constraint.
Abstract: Vision-language models (VLMs) are emerging as a core building block of modern intelligent assistants, enabling real-time human-machine interaction based on natural language and vision. However, the excessive number of visual tokens generated from images results in high latency, low throughput, and memory bottlenecks, which hinder real-time interaction in resource-constrained settings. To address this, we aim to reduce the number of visual tokens by prioritizing them according to user-relevant context. With the growing adoption of smart glasses, eye gaze has emerged as a promising sensing modality that naturally conveys user intent and interest through the user's viewing context, and can therefore provide useful hints for efficient inference. However, the robustness of a gaze-aware VLM depends heavily on the quality of the gaze data: when gaze data is inaccurate, the model may overlook informative visual content, leading to degraded inference accuracy. To this end, we introduce GazeVLM, a novel gaze-guided, context-aware VLM framework for efficient and robust inference under a token budget constraint. GazeVLM consists of two key phases: (i) GazeVLM-Pre, a gaze-aware preprocessing mechanism applied before image encoding that extracts user-attentive regions without losing the global scene understanding needed for robust inference; and (ii) GazeVLM-Post, a gaze-guided token selection method applied after image encoding that prioritizes tokens around the gaze area for efficient inference under the token budget constraint. Through extensive experiments on two visual question answering datasets with real human eye-tracking data, we demonstrate that GazeVLM achieves both efficiency and robustness under varying token budgets and gaze data quality, outperforming diverse gaze-aware and gaze-agnostic baselines. Specifically, given a budget of 500 tokens ($\approx$22\% of the tokens of the vanilla architecture), GazeVLM achieves up to 1.9$\times$ higher throughput and 37\% lower latency while slightly improving accuracy compared to the vanilla architecture.
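For intuition only, here is a minimal sketch of the kind of gaze-guided token selection under a budget that the abstract describes; it is not the authors' implementation. The function name, the patch-grid layout, and the distance-to-gaze ranking heuristic are all illustrative assumptions.

```python
# Illustrative sketch (assumed, not the paper's method): keep the visual
# tokens whose patch centers lie closest to the gaze point, up to a budget.
import numpy as np

def select_tokens_by_gaze(tokens: np.ndarray,
                          positions: np.ndarray,
                          gaze_xy: np.ndarray,
                          budget: int) -> np.ndarray:
    """Keep at most `budget` visual tokens, prioritized by proximity to gaze.

    tokens:    (N, D) encoded visual tokens.
    positions: (N, 2) patch-center coordinates in the same frame as the gaze.
    gaze_xy:   (2,) gaze fixation point.
    budget:    maximum number of tokens to keep.
    """
    dist = np.linalg.norm(positions - gaze_xy[None, :], axis=1)
    keep = np.argsort(dist)[:budget]   # tokens nearest the gaze point
    keep = np.sort(keep)               # preserve the original spatial order
    return tokens[keep]

# Toy usage: a hypothetical 48x48 patch grid (2304 tokens) and the
# 500-token budget mentioned in the abstract.
rng = np.random.default_rng(0)
toks = rng.standard_normal((2304, 1024))
grid = np.stack(np.meshgrid(np.arange(48), np.arange(48)), axis=-1)
grid = grid.reshape(-1, 2).astype(float)
kept = select_tokens_by_gaze(toks, grid, gaze_xy=np.array([12.0, 30.0]), budget=500)
print(kept.shape)  # (500, 1024)
```

A real system would presumably combine such a proximity criterion with safeguards against noisy gaze (the role of GazeVLM-Pre in the abstract), rather than relying on the gaze point alone.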
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 7879