Keywords: Vision-Language-Action, Robotic Manipulation, Diffusion Policy, Flow Matching, Reasoning Modulation, Trajectory Generation, Multimodal Learning, End-to-End Policy
TL;DR: FDVLA unifies flow-guided planning and diffusion-based correction with semantic reasoning to enable coherent and controllable robotic action generation.
Abstract: Recent advances in vision-language models (VLMs) have empowered robots to interpret natural language and perform complex manipulation tasks. Existing vision-language-action (VLA) frameworks typically adopt autoregressive decoding or diffusion-based strategies: the former can produce fragmented or non-smooth trajectories, while the latter often lacks explicit injection of reasoning semantics into the action generation process, degrading the quality of generated actions. In this paper, we propose FDVLA, a unified framework that integrates semantic reasoning with smooth, physically coherent action generation. We introduce a flow-diffusion mechanism that unifies global trajectory planning (via flow fields) and fine-grained action refinement (via diffusion) in a dual-headed policy, enabling physically coherent and stable action generation. In addition, we design DualMod, a lightweight module that injects semantic signals into both the velocity- and noise-prediction branches, thereby integrating high-level reasoning into action generation. Extensive experiments across diverse simulated and real-world robotic tasks demonstrate that FDVLA achieves strong performance and efficient inference, and generalizes robustly across a variety of task conditions.
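To make the dual-headed design concrete, below is a minimal PyTorch sketch of the architecture the abstract describes. The class names, the FiLM-style scale-and-shift modulation used for DualMod, and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: module names, FiLM-style modulation, and all
# dimensions are assumptions; the paper's actual architecture may differ.
import torch
import torch.nn as nn


class DualMod(nn.Module):
    """Hypothetical semantic modulation: maps a reasoning embedding to
    per-branch scale/shift parameters (FiLM-style)."""

    def __init__(self, sem_dim: int, hidden_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(sem_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(sem).chunk(2, dim=-1)
        return h * (1 + scale) + shift


class DualHeadPolicy(nn.Module):
    """Shared trunk with a velocity head (flow matching) and a noise head
    (diffusion), each modulated separately by the semantic signal."""

    def __init__(self, obs_dim: int, sem_dim: int, act_dim: int,
                 hidden_dim: int = 256):
        super().__init__()
        # Trunk consumes observation, noisy action, and a scalar timestep.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden_dim), nn.GELU())
        self.mod_v = DualMod(sem_dim, hidden_dim)
        self.mod_eps = DualMod(sem_dim, hidden_dim)
        self.velocity_head = nn.Linear(hidden_dim, act_dim)  # flow field v(x_t, t)
        self.noise_head = nn.Linear(hidden_dim, act_dim)     # noise eps(x_t, t)

    def forward(self, obs, noisy_action, t, sem):
        h = self.trunk(torch.cat([obs, noisy_action, t], dim=-1))
        v = self.velocity_head(self.mod_v(h, sem))    # global trajectory plan
        eps = self.noise_head(self.mod_eps(h, sem))   # fine-grained correction
        return v, eps


# Usage with arbitrary toy dimensions:
policy = DualHeadPolicy(obs_dim=64, sem_dim=32, act_dim=7)
v, eps = policy(torch.randn(8, 64), torch.randn(8, 7),
                torch.rand(8, 1), torch.randn(8, 32))
```

Under these assumptions, sharing a trunk while modulating each head independently lets the same semantic signal steer both the global flow field and the local denoising correction, matching the abstract's description of injecting reasoning into both branches.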
Primary Area: applications to robotics, autonomy, planning
Submission Number: 11529