Keywords: Imitation Learning, Robots, Vision Language Models
TL;DR: This paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible trade-off between control frequency and performance.
Abstract: Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained Vision-Language Model (VLM) backends, have shown promise in robotic control due to their impressive generalization ability. However, this success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational cost and inference latency, limiting testing scenarios mainly to quasi-static tasks and hindering performance in dynamic tasks that require rapid interaction. To address these limitations, this paper proposes \textbf{HiRT}, a \textbf{Hi}erarchical \textbf{R}obot \textbf{T}ransformer framework that enables a flexible trade-off between frequency and performance. HiRT keeps the VLM running at a low frequency to capture temporally invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experimental results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, we achieve a 58\% reduction in inference latency while maintaining comparable success rates. Additionally, on novel dynamic manipulation benchmarks that are challenging for previous VLA models, HiRT improves the success rate from 48\% to 75\%.
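The abstract describes a slow/fast hierarchical control loop: a large VLM refreshes a latent conditioning signal at low frequency, while a lightweight vision-based policy acts at every control step using the most recently cached latent. The sketch below illustrates that scheduling pattern only; all names (`SlowVLM`, `FastPolicy`, `HiRTStyleController`, `update_every`) and the dummy computations are illustrative assumptions, not the paper's actual implementation or API.

```python
# Minimal sketch of a hierarchical slow/fast control loop, assuming a
# low-frequency VLM latent and a high-frequency policy as described above.
import numpy as np


class SlowVLM:
    """Stand-in for a large VLM backend: maps (image, instruction) to a latent."""

    def encode(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real VLM would run billions of parameters here; we fake a latent.
        return np.tanh(image.mean(axis=(0, 1)))  # shape: (channels,)


class FastPolicy:
    """Stand-in for the lightweight vision-based policy conditioned on the latent."""

    def act(self, image: np.ndarray, latent: np.ndarray) -> np.ndarray:
        # A real policy would fuse the current observation with the cached latent.
        obs_feat = image.mean(axis=(0, 1))
        return 0.1 * (latent - obs_feat)  # dummy "action" vector


class HiRTStyleController:
    """Runs the slow VLM every `update_every` steps and the fast policy every step."""

    def __init__(self, vlm: SlowVLM, policy: FastPolicy, update_every: int = 10):
        self.vlm = vlm
        self.policy = policy
        self.update_every = update_every  # slow-loop period, in control steps
        self._latent = None
        self._step = 0

    def step(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # Refresh the slowly-varying latent only at the low frequency.
        if self._latent is None or self._step % self.update_every == 0:
            self._latent = self.vlm.encode(image, instruction)
        self._step += 1
        # The fast loop always acts on the freshest observation.
        return self.policy.act(image, self._latent)


if __name__ == "__main__":
    controller = HiRTStyleController(SlowVLM(), FastPolicy(), update_every=10)
    for t in range(30):
        frame = np.random.rand(64, 64, 3)  # placeholder camera frame
        action = controller.step(frame, "pick up the moving block")
        print(t, action.round(3))
```

In this pattern, latency per control step is dominated by the small policy, while the expensive VLM call is amortized over `update_every` steps; in practice the two loops could also run asynchronously in separate threads or processes.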
Supplementary Material: zip
Publication Agreement: pdf
Student Paper: yes
Spotlight Video: mp4
Submission Number: 633