DFA-VLA: Enhancing Robotic Manipulation via Embodied Intelligence

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Embodied AI, Robotics, Natural Language Processing, Computer Vision, Vision-Language-Action Model
TL;DR: This paper proposes a task execution method for single-arm robots by integrating embodied intelligence with the VLA model, enhancing their practical operation capabilities.
Abstract: With the rapid advancement of robotic hardware and software technologies, embodied intelligence has become pivotal, enabling physical agents to interact with the environment in real time via multimodal inputs and make autonomous decisions through a closed-loop sensor-actuator system. Among mainstream methods, end-to-end Vision-Language-Action (VLA) models efficiently execute robotic tasks by directly mapping perception to actions, but they suffer from critical limitations: poor modeling of fine-grained visual elements (e.g., occluded regions, small objects) and over-reliance on static cross-modal attention, which restricts adaptability and generalization in complex open environments. To address these limitations, this work focuses on enhancing task execution accuracy, timeliness, and generalization via embodied intelligence, with its core innovation being the Dynamic Fine-grained Alignment-based Vision-Language-Action (DFA-VLA) model, built on a pre-trained large language model backbone. It integrates two key modules: the Multi-scale Visual-Semantic Modeling (MVSM) Module, which combines a vision transformer and a segment anything model to extract high-resolution semantic features, using semantic masks to improve perception of small objects, occlusions, and cluttered backgrounds (with replaceable encoders for scene adaptation); and the Dynamic Fine-grained Alignment and Fusion (DFAF) Module, which employs mask-guided sparse dynamic attention for efficient language-visual alignment (reducing redundant computation) and a dynamic gating network (conditioned on text semantics) to adaptively switch between vision- and language-driven strategies. Evaluations on both the LIBERO benchmarks and real-world settings show that DFA-VLA outperforms state-of-the-art methods, especially in spatial reasoning and long-horizon tasks, with higher success rates and inference efficiency.
Parameter-efficient fine-tuning (e.g., LoRA) reduces resource use for task/hardware adaptation, while a Sim2Real pipeline validates real-world effectiveness on physical robots, confirming improved generalization in unstructured scenarios.
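The abstract's two DFAF mechanisms, mask-guided sparse attention and a text-conditioned gate, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; all shapes, the boolean keep-mask standing in for SAM segmentation output, and the sigmoid gate form are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mask_guided_sparse_attention(text_q, vis_kv, keep_mask):
    """Hypothetical sketch: language queries attend only to visual
    tokens kept by a semantic mask, skipping the rest entirely."""
    kept = vis_kv[keep_mask]                       # drop masked-out tokens
    scores = text_q @ kept.T / np.sqrt(text_q.shape[-1])
    return softmax(scores, axis=-1) @ kept         # (T, d) fused features

def dynamic_gate(pooled_text, w):
    """Assumed form of the gate: a scalar in (0, 1) computed from
    pooled text semantics, used to blend the two strategies."""
    return 1.0 / (1.0 + np.exp(-(pooled_text @ w)))

rng = np.random.default_rng(0)
T, V, d = 4, 16, 8                                 # text tokens, visual tokens, dim
text = rng.standard_normal((T, d))
vision = rng.standard_normal((V, d))
keep_mask = rng.random(V) > 0.5                    # stand-in for a SAM mask

fused_vis = mask_guided_sparse_attention(text, vision, keep_mask)
g = dynamic_gate(text.mean(axis=0), rng.standard_normal(d))
fused = g * fused_vis + (1.0 - g) * text           # vision- vs language-driven mix
print(fused.shape)                                 # (4, 8)
```

The sparsity comes from indexing out masked tokens before the score computation, so attention cost scales with the number of kept tokens rather than all visual tokens.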
Primary Area: applications to robotics, autonomy, planning
Submission Number: 6794