From Language to Action Streams: Bridging LLM Autoregression for Long-Horizon Robot Action Prediction
Keywords: vision language action model, large language model
TL;DR: We propose an autoregressive generation method for long-horizon action generation in robot control tasks.
Abstract: Vision-Language-Action (VLA) models are a transformative paradigm for robotic control, leveraging pre-trained vision-language models (VLMs) to directly translate natural language instructions and visual observations into low-level actions.
The prominent idea of ``Action-as-Language'' discretizes action spaces into tokens for large language models (LLMs), reframing action prediction as a standard sequential language generation task.
However, current implementations underutilize the LLM's full generation potential, confining action prediction to fixed-length, single-step token sequences and limiting the policy's generation horizon.
To overcome this limitation, we propose the \textbf{Action Stream} paradigm, which customizes LLM training and inference recipes to VLAs, enabling the generation of extended chains of action tokens and facilitating implicit long-horizon generation with task performance improvements.
For training action streams, we propose a two-phase approach: Long-horizon Behavior Cloning (L-BC) and Step-wise Action Alignment (S-AA).
L-BC enables VLA models to generate coherent multi-step action sequences, while S-AA mitigates exposure bias during sequential inference, creating a framework that enables long-horizon generation while reducing error accumulation.
During deployment, decoding strategies from language generation can be successfully transferred to action streams, enabling efficient solution search and task performance improvements.
Through extensive evaluations on simulation benchmarks and real-world robotic setups, we demonstrate that the Action Stream paradigm achieves improved task performance when extending the generation horizon, representing a significant step toward unified vision-language-action modeling.
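To make the ``Action-as-Language'' idea concrete, here is a minimal illustrative sketch of action discretization (not the paper's exact scheme): each continuous action dimension is mapped to one of a fixed number of bins, so a multi-step action chunk becomes a flat stream of integer tokens that an LLM can predict autoregressively. The bin count, action range, and function names below are assumptions for illustration only.

```python
import numpy as np

N_BINS = 256           # assumed per-dimension vocabulary size
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def actions_to_tokens(actions):
    """Map continuous actions in [LOW, HIGH] to integer token ids in [0, N_BINS-1]."""
    clipped = np.clip(actions, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)               # rescale to [0, 1]
    return np.minimum((scaled * N_BINS).astype(int), N_BINS - 1)

def tokens_to_actions(tokens):
    """Invert the discretization by decoding each token to its bin center."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

# A multi-step "action stream" is just the concatenation of per-step token
# sequences; generation proceeds token by token, like ordinary text.
chunk = np.array([[0.1, -0.3, 0.8],
                  [0.2, -0.2, 0.7]])   # 2 timesteps x 3 action dims
tokens = actions_to_tokens(chunk)
recovered = tokens_to_actions(tokens)  # matches chunk up to quantization error
```

The round-trip error is bounded by half a bin width, which is the usual trade-off of token-based action representations: finer bins reduce quantization error but enlarge the vocabulary.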
Primary Area: applications to robotics, autonomy, planning
Submission Number: 9151