Keywords: Autoregressive Vision-Language-Action Models, Action Tokenizer, Inference Instability
TL;DR: We introduce Stable-FAST, a tokenization strategy that stabilizes the token-sequence length of each action chunk and markedly improves the performance and action smoothness of autoregressive VLAs.
Abstract: Autoregressive Vision-Language-Action (VLA) models are a promising path toward generalist robot policies, yet their performance depends critically on action tokenization. The pioneering FAST tokenizer enabled autoregressive VLAs for dexterous, high-frequency tasks by compressing action sequences via the discrete cosine transform. However, this compression yields variable-length token sequences for fixed-length action chunks, which causes cascading errors, jerky motions, and reduced task success. We introduce Stable-FAST, a tokenization strategy that resolves this instability at its source by partitioning the action trajectories in the training dataset into variable-length action chunks whose token sequences have markedly reduced length variance. This simple but effective reframing enables the training of VLAs with stable inference and markedly smoother actions. Extensive experiments on real robots, using multiple VLA backbones including the $\pi_0$-FAST architecture, demonstrate that Stable-FAST improves action smoothness and increases the absolute task success rate by 10.8\% on average, while reducing task completion times by over 40\%. This offers a more reliable foundation for deploying autoregressive VLAs in the real world. Code and videos are available on the project page.
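The core idea can be illustrated with a small sketch. Below, `num_tokens` mimics a FAST-style tokenizer (a DCT followed by quantization, with nonzero coefficients counted as tokens), and `partition` mimics Stable-FAST-style chunking by growing each chunk until its token count reaches a fixed budget. All function names, thresholds, and parameters are illustrative assumptions, not the paper's actual implementation.

```python
import math

def dct2(x):
    """Type-II DCT of a 1-D sequence (pure-Python, for illustration)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def num_tokens(chunk, quant_step=0.5):
    """FAST-style sketch: token count ~ number of nonzero quantized
    DCT coefficients, so it varies with the chunk's content."""
    return sum(1 for c in dct2(chunk) if round(c / quant_step) != 0)

def partition(traj, target_tokens=4, max_len=32):
    """Stable-FAST-style sketch: grow each chunk until its token count
    reaches a target budget, producing variable-length action chunks
    whose token sequences have low length variance."""
    chunks, start = [], 0
    while start < len(traj):
        end = start + 2  # at least two samples per chunk
        while (end - start < max_len and end < len(traj)
               and num_tokens(traj[start:end]) < target_tokens):
            end += 1
        chunks.append(traj[start:end])
        start = end
    return chunks
```

In this toy setup, smooth trajectory segments compress into few tokens and therefore produce long chunks, while high-frequency segments produce short chunks, keeping the token budget per chunk roughly constant.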
Primary Area: applications to robotics, autonomy, planning
Submission Number: 4274