FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Flow Models

Moritz Reuss; Hongyi Zhou; Marcel Rühle; Ömer Erdinç Yağmurlu; Fabian Otto; Rudolf Lioutikov

Published: 08 Aug 2025, Last Modified: 16 Sept 2025. Venue: CoRL 2025 Poster. License: CC BY 4.0

Keywords: Imitation Learning, VLA, Language-conditioned Manipulation

TL;DR: FLOWER is a resource-efficient, flow-based Vision-Language-Action (VLA) policy that achieves state-of-the-art performance across diverse robotics tasks while substantially lowering compute requirements, making it broadly accessible.

Abstract: Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers a 25.9% improvement over state-of-the-art baselines across 190 tasks spanning ten simulation and real-world benchmarks and demonstrates robustness across diverse robotic embodiments. All code, pretrained weights, and training recipes are publicly released to democratize efficient VLA development.
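
The abstract does not spell out how Global-AdaLN conditioning saves parameters. One plausible reading, offered here only as an assumption, is that a single adaptive LayerNorm modulation projection is shared by all layers of the action head instead of each layer owning its own. The NumPy sketch below illustrates that idea; the names `GlobalAdaLN` and `modulation_param_count`, the dimensions, and the initialization are hypothetical and are not taken from the FLOWER code release.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Parameter-free LayerNorm over the last (feature) axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class GlobalAdaLN:
    """Hypothetical sketch: one modulation projection reused by every layer.

    Assumption: "global" means the shift/scale projection is shared across
    layers. This is an illustration, not the paper's confirmed design.
    """

    def __init__(self, cond_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # A single projection produces both shift and scale vectors.
        self.weight = rng.normal(0.0, 0.02, size=(cond_dim, 2 * hidden_dim))
        self.bias = np.zeros(2 * hidden_dim)

    def modulation(self, cond):
        # cond: (batch, cond_dim) -> shift, scale: (batch, hidden_dim) each
        shift, scale = np.split(cond @ self.weight + self.bias, 2, axis=-1)
        return shift, scale

    def __call__(self, x, cond):
        # x: (batch, tokens, hidden_dim); modulation broadcasts over tokens.
        shift, scale = self.modulation(cond)
        return layer_norm(x) * (1.0 + scale[:, None, :]) + shift[:, None, :]


def modulation_param_count(num_layers, cond_dim, hidden_dim, shared):
    """Count modulation parameters for shared vs. per-layer projections."""
    per_projection = cond_dim * 2 * hidden_dim + 2 * hidden_dim
    return per_projection if shared else num_layers * per_projection
```

Under this assumption the modulation parameters no longer grow with depth: a shared projection costs the same for any number of layers, whereas per-layer projections scale linearly. The 20% figure in the abstract refers to the authors' full model and is not reproduced by this toy count.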

Supplementary Material: zip

Spotlight: mp4

Submission Number: 640
