Hierarchical Autoregressive Modeling With Multi-Scale Refinement for Robot Policy Learning

Published: 01 Jan 2025, Last Modified: 14 Nov 2025. Venue: IEEE Robotics Autom. Lett. 2025. License: CC BY-SA 4.0
Abstract: While autoregressive models demonstrate remarkable success in text and image generation, their application to robot policies suffers from weak holistic comprehension, cumulative errors, and limited multi-modal modeling capabilities, particularly in long-horizon tasks or multi-modal scenarios. This letter presents HAMR, a novel Hierarchical Autoregressive framework with Multi-scale action Refinement that addresses these limitations through three key contributions: (1) a hierarchical architecture wherein high-level long-horizon motion features guide low-level fine-grained action generation, (2) a multi-scale action tokenization and generation mechanism that actively corrects previous prediction errors during inference, and (3) a lightweight diffusion-based decoder that enhances multi-modal modeling without compromising efficiency. HAMR achieves 12% and 15% improvements over state-of-the-art methods across 60+ tasks in simulation and real-world experiments, respectively, with a 25% improvement in perturbed experiments, demonstrating superior performance in both short- and long-horizon manipulation tasks.
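To make contribution (2) concrete, here is a minimal sketch of one way a coarse-to-fine, multi-scale decomposition of an action chunk could work. It is not HAMR's actual tokenizer, which the abstract does not specify. The function names, the scale schedule `[1, 2, 4, 8]`, the 7-dimensional action, and the use of average pooling are all assumptions made for illustration, and codebook quantization and the autoregressive predictor are omitted. The sketch only shows the residual idea: each finer scale encodes what the coarser scales failed to capture, so later scales can correct earlier errors.

```python
import numpy as np


def downsample(x, k):
    """Average-pool a (T, D) chunk to k time steps; assumes T is divisible by k."""
    T, D = x.shape
    return x.reshape(k, T // k, D).mean(axis=1)


def upsample(x, T):
    """Repeat each of the k coarse steps so the result has T time steps."""
    k = x.shape[0]
    return np.repeat(x, T // k, axis=0)


def multiscale_encode(actions, scales):
    """Split an action chunk into coarse-to-fine residual maps (hypothetical helper)."""
    T = actions.shape[0]
    residual = actions.astype(float).copy()
    maps = []
    for k in scales:
        coarse = downsample(residual, k)   # summary of what is still unexplained
        maps.append(coarse)
        residual = residual - upsample(coarse, T)
    return maps


def multiscale_decode(maps, T):
    """Sum the upsampled maps; each finer map corrects what coarser ones missed."""
    return sum(upsample(m, T) for m in maps)


# Illustrative data only: 8 time steps of an assumed 7-dimensional action.
rng = np.random.default_rng(0)
actions = rng.normal(size=(8, 7))
scales = [1, 2, 4, 8]  # assumed schedule; the last scale equals the chunk length

maps = multiscale_encode(actions, scales)
# Reconstruction error after using only the first i scales.
errors = [
    np.linalg.norm(actions - multiscale_decode(maps[: i + 1], 8))
    for i in range(len(maps))
]
```

Because each step subtracts block-wise means from the residual, the reconstruction error in `errors` cannot grow as scales are added, and it reaches zero (up to floating-point rounding) once the finest scale matches the chunk length. A learned system would replace the exact maps with predicted tokens, so the correction would be approximate rather than exact.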