Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions

Published: 12 May 2025, Last Modified: 27 May 2025 · ICRA-Safe-VLM-WS-2025 Spotlight · CC BY 4.0
Keywords: vision language action model, robotic manipulation
TL;DR: We introduce a vision-language-action framework that understands interleaved image-text instructions and generates continuous actions, improving zero-shot generalization in real-world robotics with the first large-scale interleaved embodied dataset.
Abstract: Vision-Language-Action (VLA) models have shown great promise for generalist robotic manipulation in the physical world. However, existing models are restricted to robot observations and text-only instructions, lacking the flexibility of interleaved multimodal instructions enabled by recent advances in foundation models in the digital world. In this paper, we present Interleave-VLA, the first framework capable of comprehending interleaved image-text instructions and directly generating continuous action sequences in the physical world. It offers a flexible, model-agnostic paradigm that extends state-of-the-art VLA models with minimal modifications while achieving strong zero-shot generalization. A key challenge in realizing Interleave-VLA is the absence of large-scale interleaved embodied datasets. To bridge this gap, we develop an automatic pipeline that converts text-only instructions from real-world datasets in Open X-Embodiment into interleaved image-text instructions, resulting in the first large-scale real-world interleaved embodied dataset with 210k episodes. Through comprehensive evaluation on simulation benchmarks and real-robot experiments, we demonstrate that Interleave-VLA offers significant benefits: 1) it improves out-of-domain generalization to unseen objects by 2-3× compared to state-of-the-art baselines, 2) it supports flexible task interfaces, and 3) it handles diverse user-provided image instructions in a zero-shot manner, such as hand-drawn sketches. We further analyze the factors behind Interleave-VLA's strong zero-shot performance, showing that the interleaved paradigm effectively leverages heterogeneous datasets and diverse instruction images, including those from the Internet, which demonstrates strong potential for scaling up. More information can be found at the project [link](https://interleave-vla-anonymous.github.io/Interleave-VLA-Anonymous/).
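The abstract describes a pipeline that rewrites text-only instructions into interleaved image-text instructions. The sketch below illustrates one plausible form such a conversion could take: detect the referenced object in the robot's observation, crop it, and splice the crop into the instruction in place of the object phrase. This is a minimal illustration, not the authors' released pipeline; all names here (`detect_object_box`, `InterleavedInstruction`, `to_interleaved`) are hypothetical, and the stubbed detector stands in for whatever grounding model the real pipeline uses.

```python
# Illustrative sketch only: converting a text-only episode into an
# interleaved image-text instruction. Names and APIs are assumptions,
# not taken from the paper or its codebase.
from dataclasses import dataclass
from typing import List, Tuple, Union

import numpy as np

# An interleaved instruction is an ordered mix of text spans and image crops,
# e.g. ["pick up the ", <crop of the red apple>].
Segment = Union[str, np.ndarray]


@dataclass
class InterleavedInstruction:
    segments: List[Segment]


def detect_object_box(image: np.ndarray, phrase: str) -> Tuple[int, int, int, int]:
    """Stub for an open-vocabulary detector that would return an
    (x0, y0, x1, y1) box for the referenced object. Here we just
    return a fixed central crop as a placeholder."""
    h, w = image.shape[:2]
    return w // 4, h // 4, 3 * w // 4, 3 * h // 4


def to_interleaved(text_instruction: str,
                   observation: np.ndarray,
                   object_phrase: str) -> InterleavedInstruction:
    """Replace the object phrase in a text-only instruction with an
    image crop of that object taken from the robot's observation."""
    x0, y0, x1, y1 = detect_object_box(observation, object_phrase)
    crop = observation[y0:y1, x0:x1]

    before, _, after = text_instruction.partition(object_phrase)
    segments: List[Segment] = [
        s for s in (before, crop, after)
        if not (isinstance(s, str) and s == "")
    ]
    return InterleavedInstruction(segments)


if __name__ == "__main__":
    obs = np.zeros((224, 224, 3), dtype=np.uint8)  # dummy camera frame
    inst = to_interleaved("pick up the red apple", obs, "red apple")
    # -> ["pick up the ", <112x112x3 crop>], ready for a VLA backbone
    #    that accepts interleaved image-text inputs.
    print([s.shape if isinstance(s, np.ndarray) else s for s in inst.segments])
```

Because the output is just an ordered list of text and image segments, the same interface can also accept user-provided images (e.g. hand-drawn sketches or Internet photos) in place of dataset crops, which is the flexibility the abstract highlights.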
Submission Number: 16