FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies

Published: 28 Feb 2025, Last Modified: 02 Mar 2025 · WRL@ICLR 2025 Poster · CC BY 4.0
Track: full paper
Keywords: Vision-Language-Action Policies, Flow Policies, Imitation Learning
TL;DR: A novel Vision-Language-Action policy with 1B parameters that achieves state-of-the-art performance across 9 benchmarks after just 200 GPU hours of pretraining.
Abstract: This work introduces FLOWER, an efficient, open-source Vision-Language-Action Flow policy. Vision-Language-Action (VLA) models have demonstrated remarkable potential for language-guided robotic manipulation by leveraging large-scale vision-language pretraining. However, existing approaches often rely on multi-billion-parameter architectures and massive datasets, making them prohibitively expensive to train. FLOWER is a novel generalist policy that not only outperforms current VLAs but also substantially lowers the computational burden for pretraining, fine-tuning, and inference. FLOWER combines a Rectified Flow Policy with a compact Vision-Language Model (VLM) backbone. The Flow Policy enables expressive, multimodal action generation. The compact VLM backbone provides robust semantic grounding while requiring only a fraction of the usual compute cost. Experiments across 4 simulated benchmarks and real-world settings on more than 100 tasks reveal that FLOWER consistently surpasses foundation policies, e.g., OpenVLA. FLOWER achieves superior performance while significantly reducing both training time and memory requirements. Both the performance and the training efficiency are maintained across different action spaces, showcasing the potential of FLOWER to handle diverse control tasks with affordable deployment, fine-tuning and customization. To encourage further research and the democratization of pretrained VLAs, we open-source the full pretraining and fine-tuning code along with the trained weights.
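The abstract describes action generation via a Rectified Flow policy: actions are produced by integrating a learned velocity field from Gaussian noise toward the data distribution. The sketch below illustrates that sampling loop only, with a toy hand-coded velocity field standing in for the learned network; `sample_action`, `velocity_fn`, and the target values are illustrative assumptions, not FLOWER's actual interface.

```python
import numpy as np

def sample_action(velocity_fn, action_dim, num_steps=10, rng=None):
    """Euler integration of the rectified-flow ODE from noise (t=0) to an action (t=1).

    In a real policy, velocity_fn would be a neural network conditioned on
    vision-language features; here it is any callable (x, t) -> velocity.
    """
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(action_dim)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)  # follow the velocity field toward the data
    return x

# Toy velocity field: straight-line (rectified) transport toward one fixed
# target action, mimicking the conditional velocity (x1 - x_t) / (1 - t).
target = np.array([0.5, -0.2, 0.1])
v = lambda x, t: (target - x) / max(1.0 - t, 1e-6)

print(np.round(sample_action(v, 3), 3))  # converges to the target action
```

With this straight-line velocity the Euler integrator recovers the target exactly after the final step, regardless of the starting noise; the expressiveness claimed in the abstract comes from replacing the toy field with a learned, observation-conditioned one.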
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Fabian_Otto1
Format: Yes, the presenting author will definitely attend in person because they are attending ICLR for other complementary reasons.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 34