Keywords: Vision Transformers, Robotic Navigation, Computational Efficiency, Adaptive Attention, Edge Computing
TL;DR: This work presents a task-driven token reduction method for Vision Transformers that achieves 1.5–3× faster inference in robotic navigation while maintaining task performance, scaling from high-performance GPUs to resource-constrained edge devices.
Abstract: In robotics, vision is critical for enabling agents to perceive and interact with their environment. Recent vision models, particularly Vision Transformers (ViTs), have achieved remarkable performance on pure vision tasks such as object recognition and scene understanding, and hold great potential for robotic applications such as object navigation. However, their computational cost grows quadratically with the number of tokens, posing significant challenges for real-time deployment on resource-constrained robotic platforms. To improve ViT efficiency in robotic tasks, we propose a biologically inspired token reduction framework that dynamically allocates computation to task-relevant image regions while skipping irrelevant ones. Our method introduces two key components: (1) a task-driven spatial attention mechanism that selectively prunes redundant tokens based on the current task, and (2) a temporal feature-reuse module that carries stable visual features across frames to minimize redundant computation. Together, these components let the visual perception model focus only on relevant regions, significantly improving inference speed. Experiments show that our method substantially reduces inference time in object navigation tasks without significant performance degradation. It also enables practical ViT deployment on edge devices such as the Jetson Orin (high-performance GPU) and Raspberry Pi 4B (lightweight CPU), achieving 56.5 FPS and 2 FPS, respectively, a 1.5–3× speedup over standard ViTs that makes real-time robotic vision more feasible.
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 15230
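The two components described in the abstract lend themselves to a compact illustration. Below is a minimal, hypothetical PyTorch sketch of task-driven token pruning and temporal feature reuse, assuming ViT patch tokens of shape (batch, tokens, dim). The function names, the task embedding, `keep_ratio`, and `reuse_threshold` are all illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code) of task-driven token pruning and
# temporal feature reuse for a ViT whose patch tokens have shape (B, N, D).
import torch
import torch.nn.functional as F


def task_driven_prune(tokens, task_embed, keep_ratio=0.5):
    """Keep only the tokens most relevant to the current task.

    tokens:     (B, N, D) patch tokens from a ViT layer
    task_embed: (B, D) embedding of the navigation goal (assumed given)
    """
    # Relevance score: cosine similarity between each token and the task.
    scores = F.cosine_similarity(tokens, task_embed.unsqueeze(1), dim=-1)  # (B, N)
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                                    # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])               # (B, k, D)
    return tokens.gather(1, idx)                                           # (B, k, D)


def temporal_reuse(curr_tokens, prev_tokens, prev_features, compute_fn,
                   reuse_threshold=0.95):
    """Recompute per-token features only where the token changed since the
    previous frame; reuse cached features for stable tokens."""
    sim = F.cosine_similarity(curr_tokens, prev_tokens, dim=-1)            # (B, N)
    stale = sim < reuse_threshold                                          # changed tokens
    new_features = prev_features.clone()
    if stale.any():
        # Recompute only the changed tokens; boolean indexing flattens
        # them into an (M, D) batch for compute_fn.
        new_features[stale] = compute_fn(curr_tokens[stale])
    return new_features


if __name__ == "__main__":
    B, N, D = 1, 196, 384
    mlp = torch.nn.Linear(D, D)  # stand-in for a heavier per-token computation
    frames = [torch.randn(B, N, D) for _ in range(2)]
    task = torch.randn(B, D)
    feats = mlp(frames[0])                                   # full compute on frame 0
    feats = temporal_reuse(frames[1], frames[0], feats, mlp) # partial recompute on frame 1
    kept = task_driven_prune(frames[1], task, keep_ratio=0.25)
    print(kept.shape)  # torch.Size([1, 49, 384])
```

In a real pipeline, pruning would typically be applied inside intermediate ViT blocks so that subsequent attention layers operate on the reduced token set, which is where the quadratic cost savings would come from.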