HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers

Published: 05 Sept 2024, Last Modified: 08 Nov 2024, CoRL 2024, CC BY 4.0
Keywords: Imitation Learning, Robots, Vision Language Models
TL;DR: This paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible trade-off between control frequency and performance.
Abstract: Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained Vision-Language Model (VLM) backends, have shown promise in robotic control due to their impressive generalization ability. However, this success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency, limiting their evaluation mainly to quasi-static tasks and hindering performance in dynamic tasks that require rapid interaction. To address these limitations, this paper proposes \textbf{HiRT}, a \textbf{Hi}erarchical \textbf{R}obot \textbf{T}ransformer framework that enables a flexible trade-off between control frequency and performance. HiRT keeps the VLM running at a low frequency to capture temporally invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experimental results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, we achieve a 58% reduction in inference latency while maintaining comparable success rates. Additionally, on novel dynamic manipulation benchmarks that are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
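To make the hierarchical scheme concrete, below is a minimal sketch of the control pattern the abstract describes: a slow, expensive VLM forward pass refreshed only every few steps, and a cheap high-frequency policy that acts every step using the cached latent. The functions `slow_vlm_encode`, `fast_policy`, and the period/latent sizes are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
import numpy as np

def slow_vlm_encode(image, instruction):
    """Placeholder for a billion-parameter VLM forward pass (expensive, low frequency)."""
    rng = np.random.default_rng(abs(hash((image.tobytes(), instruction))) % (2**32))
    return rng.standard_normal(512)  # latent conditioning vector

def fast_policy(image, latent):
    """Placeholder for a lightweight vision-based policy (cheap, high frequency)."""
    return np.tanh(latent[:7] + image.mean())  # e.g. a 7-DoF action

def control_loop(get_image, instruction, steps=100, vlm_period=10):
    """Act every step with the fast policy; refresh the VLM latent only every `vlm_period` steps."""
    latent = None
    for t in range(steps):
        image = get_image()
        if t % vlm_period == 0:              # low-frequency VLM update
            latent = slow_vlm_encode(image, instruction)
        yield fast_policy(image, latent)     # high-frequency action from cached latent

if __name__ == "__main__":
    actions = list(control_loop(lambda: np.zeros((224, 224, 3)), "pick up the cup"))
    print(len(actions), actions[0].shape)
```

Under this assumed setup, overall latency is dominated by the small policy, while the VLM's semantic features still condition every action through the cached latent.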
Supplementary Material: zip
Publication Agreement: pdf
Student Paper: yes
Spotlight Video: mp4
Submission Number: 633