Connecting Where You Look With What You Understand: Trajectory-Driven Localized Understanding for Interactive Vision-Language Models

03 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Trajectory; Interactive; Large Vision-Language Model
Abstract: Recent Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in image understanding and natural language generation. However, current approaches focus predominantly on global image understanding and struggle to model human visual attention trajectories or to ground generated descriptions in specific image regions. To address this challenge, we propose TraceVLM, a unified vision-language model that integrates trajectory-aware spatial understanding within an end-to-end framework. TraceVLM employs a Trajectory-aware Visual Perception (TVP) module for deep bidirectional fusion of visual features and trajectory information. We use a geometric simplification algorithm to extract semantic keypoints from raw trajectories and propose a three-stage training pipeline in which trajectory information guides both description generation and region localization. We further extend TraceVLM to attention trajectory-guided segmentation and video scene understanding, enabling cross-frame trajectory tracking and temporal attention analysis. Leveraging the reasoning capabilities of large vision-language models, we construct the Reasoning-based Interactive Localized Narratives (RILN) dataset to enhance logical reasoning and interpretability. Extensive experiments on trajectory-guided captioning, text-guided trajectory prediction, image understanding, and segmentation show that TraceVLM achieves state-of-the-art performance, establishing a foundation for intuitive human-computer spatial interaction and interpretable visual understanding.
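
The abstract refers to a geometric simplification algorithm for extracting semantic keypoints from raw trajectories but does not name it; the Ramer-Douglas-Peucker (RDP) procedure assumed below is one common choice for this kind of polyline reduction. This is a minimal illustrative sketch, not the paper's implementation; the rdp function and the epsilon tolerance are assumptions introduced here.

# Sketch of trajectory keypoint extraction via geometric simplification,
# assuming the Ramer-Douglas-Peucker algorithm (not confirmed by the paper).
import math

def rdp(points, epsilon):
    """Simplify a polyline, keeping points that deviate from the
    chord between the endpoints by more than epsilon."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    chord = math.hypot(x2 - x1, y2 - y1)

    def dist(p):
        # Perpendicular distance from p to the endpoint chord
        # (plain Euclidean distance if the chord degenerates to a point).
        if chord == 0.0:
            return math.hypot(p[0] - x1, p[1] - y1)
        return abs((x2 - x1) * (y1 - p[1]) - (x1 - p[0]) * (y2 - y1)) / chord

    # Find the interior point farthest from the chord.
    idx, dmax = max(((i, dist(p)) for i, p in enumerate(points[1:-1], 1)),
                    key=lambda t: t[1])
    if dmax <= epsilon:
        return [points[0], points[-1]]  # all interior points are redundant
    # Keep the farthest point as a keypoint and recurse on both halves.
    return rdp(points[:idx + 1], epsilon)[:-1] + rdp(points[idx:], epsilon)

# Example: a noisy horizontal-then-vertical attention trace collapses to
# three semantic keypoints (start, turn, end).
trace = [(0, 0), (1, 0.05), (2, 0), (3, 0.05), (4, 0), (4, 1), (4, 2)]
print(rdp(trace, epsilon=0.2))  # [(0, 0), (4, 0), (4, 2)]

Under this reading, epsilon controls how aggressively a dense gaze or pointing trace is compressed before being fused with visual features; the retained keypoints are the "semantic" turns of the trajectory.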
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1632