Keywords: Vision-Language-Action, Robotic Manipulation, Diffusion Policy, Flow Matching, Reasoning Modulation, Trajectory Generation, Multimodal Learning, End-to-End Policy
TL;DR: FDVLA unifies flow-guided planning and diffusion-based correction with semantic reasoning to enable coherent and controllable robotic action generation.
Abstract: Recent advances in vision-language models (VLMs) have empowered robots to interpret natural language and perform complex manipulation tasks. Existing vision-language-action (VLA) frameworks typically adopt autoregressive decoding or diffusion-based strategies: the former can produce fragmented or non-smooth trajectories, while the latter often lacks explicit injection of reasoning semantics into the action generation process, degrading the quality of generated actions. In this paper, we propose FDVLA, a unified framework that integrates semantic reasoning with smooth, physically coherent action generation. We introduce a flow-diffusion mechanism that unifies global trajectory planning (via flow fields) and fine-grained action refinement (via diffusion) in a dual-headed policy, enabling physically coherent and stable action generation. In addition, we design DualMod, a lightweight module that injects semantic signals into both the velocity- and noise-prediction branches, thereby integrating high-level reasoning into action generation. Extensive experiments across diverse simulated and real-world robotic tasks demonstrate that FDVLA achieves strong performance and efficient inference, and generalizes robustly across a variety of task conditions.
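To make the dual-headed design concrete, below is a minimal PyTorch sketch of the architecture the abstract describes. The class names, the FiLM-style scale-and-shift modulation used for DualMod, and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: module names, FiLM-style modulation, and all
# dimensions are assumptions; the paper's actual architecture may differ.
import torch
import torch.nn as nn


class DualMod(nn.Module):
    """Hypothetical semantic modulation: maps a reasoning embedding to
    per-branch scale/shift parameters (FiLM-style)."""

    def __init__(self, sem_dim: int, hidden_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(sem_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(sem).chunk(2, dim=-1)
        return h * (1 + scale) + shift


class DualHeadPolicy(nn.Module):
    """Shared trunk with a velocity head (flow matching) and a noise head
    (diffusion), each modulated separately by the semantic signal."""

    def __init__(self, obs_dim: int, sem_dim: int, act_dim: int,
                 hidden_dim: int = 256):
        super().__init__()
        # Trunk consumes observation, noisy action, and a scalar timestep.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden_dim), nn.GELU())
        self.mod_v = DualMod(sem_dim, hidden_dim)
        self.mod_eps = DualMod(sem_dim, hidden_dim)
        self.velocity_head = nn.Linear(hidden_dim, act_dim)  # flow field v(x_t, t)
        self.noise_head = nn.Linear(hidden_dim, act_dim)     # noise eps(x_t, t)

    def forward(self, obs, noisy_action, t, sem):
        h = self.trunk(torch.cat([obs, noisy_action, t], dim=-1))
        v = self.velocity_head(self.mod_v(h, sem))    # global trajectory plan
        eps = self.noise_head(self.mod_eps(h, sem))   # fine-grained correction
        return v, eps


# Usage with arbitrary toy dimensions:
policy = DualHeadPolicy(obs_dim=64, sem_dim=32, act_dim=7)
v, eps = policy(torch.randn(8, 64), torch.randn(8, 7),
                torch.rand(8, 1), torch.randn(8, 32))
```

Under these assumptions, sharing a trunk while modulating each head independently lets the same semantic signal steer both the global flow field and the local denoising correction, matching the abstract's description of injecting reasoning into both branches.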
Primary Area: applications to robotics, autonomy, planning
Submission Number: 11529