Hierarchical Instruction-aware Embodied Visual Tracking

ICLR 2026 Conference Submission 12899 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Goal-conditioned RL, Instruction following, Embodied visual tracking
Abstract: User-centric embodied visual tracking (UC-EVT) requires embodied agents to follow dynamic, natural-language instructions that specify not only which target to track but also how to track it, including distance, angle, and directional constraints. This dual demand for robust language understanding and low-latency control poses significant challenges: existing end-to-end RL, VLM/VLA, and LLM-based approaches fail to balance instruction comprehension with low-latency tracking. In this paper, we introduce \textbf{Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT)}, which decomposes the problem into high-level, on-demand instruction understanding with spatial goal generation and low-level, asynchronous goal-conditioned control. HIEVT employs an \textit{LLM-based Semantic-Spatial Goal Aligner} to parse diverse human instructions into spatial goals that directly specify the desired target positioning, coupled with an \textit{RL-based Adaptive Goal-Aligned Policy} that positions the target in real time according to the generated spatial goals. We establish a comprehensive UC-EVT benchmark with over 1.7 million training trajectories, evaluating performance on one seen environment and nine challenging unseen environments. Extensive experiments and real-world deployments demonstrate HIEVT's superior robustness, generalizability, and long-horizon tracking across diverse environments, varying target dynamics, and complex instruction combinations.
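To make the hierarchy concrete, here is a minimal Python sketch of the two-level decomposition the abstract describes: a high-level step that turns an instruction into a spatial goal (standing in for the LLM-based Semantic-Spatial Goal Aligner) and a low-level goal-conditioned controller that runs every control step (standing in for the RL-based Adaptive Goal-Aligned Policy). All names (`SpatialGoal`, `parse_instruction`, `goal_conditioned_action`) and the proportional-control logic are illustrative assumptions, not the paper's implementation.

```python
import math
from dataclasses import dataclass

# Hypothetical spatial goal: desired target placement relative to the agent.
@dataclass
class SpatialGoal:
    distance: float  # desired agent-target distance, meters
    angle: float     # desired bearing of target in agent frame, radians (0 = ahead)

def parse_instruction(instruction: str) -> SpatialGoal:
    """Stand-in for the LLM-based Semantic-Spatial Goal Aligner: maps a
    natural-language instruction to a spatial goal. A real system would
    query an LLM; here two cases are hard-coded for illustration."""
    if "closely" in instruction:
        return SpatialGoal(distance=1.5, angle=0.0)
    if "left" in instruction:
        return SpatialGoal(distance=3.0, angle=math.radians(30))
    return SpatialGoal(distance=3.0, angle=0.0)  # default: follow from behind

def goal_conditioned_action(obs_distance: float, obs_angle: float,
                            goal: SpatialGoal) -> tuple[float, float]:
    """Stand-in for the RL-based Adaptive Goal-Aligned Policy: a simple
    proportional controller that drives the observed target pose toward
    the goal pose. A trained policy network would replace this."""
    forward = 0.8 * (obs_distance - goal.distance)  # close the range error
    turn = 1.2 * (obs_angle - goal.angle)           # close the bearing error
    return forward, turn

# Asynchronous structure in miniature: the goal is refreshed only when a new
# instruction arrives, while the low-level controller runs at every step.
goal = parse_instruction("track the person closely")
for step in range(5):
    obs_distance, obs_angle = 4.0 - 0.5 * step, 0.2  # mock observations
    forward, turn = goal_conditioned_action(obs_distance, obs_angle, goal)
    print(f"step {step}: forward={forward:+.2f}, turn={turn:+.2f}")
```

The key design point this sketch mirrors is the decoupling: instruction parsing is invoked on demand (and may be slow), while goal-conditioned control runs continuously at the agent's control rate against the most recent spatial goal.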
Primary Area: applications to robotics, autonomy, planning
Submission Number: 12899