PEFT Methods for Embodied VLM Agents: A Systematic Study and MoE-DoRA

Published: 27 May 2026, Last Modified: 04 Jun 2026
Venue: FMEA @ CVPR 2026 Poster
License: CC BY 4.0
Keywords: Vision-Language Models, Embodied Agents, Parameter-Efficient Fine-Tuning
Abstract: Vision-Language Models (VLMs) deployed as embodied agents require domain-specific adaptation, yet parameter-efficient fine-tuning (PEFT) methods have been studied almost exclusively on NLP benchmarks, whose tasks demand neither visual grounding nor structured action planning from scarce demonstrations. In this work, we introduce $\textbf{MoE-DoRA}$ (Mixture of Directional Experts), a novel architecture that extends weight-decomposed low-rank adaptation (DoRA) with parallel directional experts governed by a token-level router. By specializing directional updates while sharing a common magnitude vector, MoE-DoRA provides a framework for grounding diverse multimodal inputs in complex action spaces. To evaluate our method, we conduct the first systematic benchmark of diverse PEFT methods on EmbodiedBench's EB-Habitat tasks using Qwen2-VL-7B. Our results reveal that QDoRA (Quantized DoRA) currently achieves the highest empirical performance (0.72 SR), which we attribute to the implicit regularization of 4-bit quantization in data-scarce regimes; we nonetheless hypothesize that MoE-DoRA offers a scalable path for increasing PEFT expressiveness as embodied training data grows. We additionally show through ablation that the directional component, not the magnitude, is the critical factor in DoRA's effectiveness for embodied tasks. Our benchmark provides practical guidance for practitioners adapting foundation models to embodied settings.
Submission Number: 52
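
Illustrative sketch (not from the paper): the abstract describes MoE-DoRA only in words, so the NumPy code below is one possible reading of "parallel directional experts governed by a token-level router" with "a common magnitude vector". Everything concrete here is an assumption made for illustration: the function names `row_normalize` and `moe_dora_forward`, the dense softmax routing (no top-k selection), the per-output-row normalization, and all tensor shapes. It should not be taken as the authors' implementation.

```python
import numpy as np

def row_normalize(w, eps=1e-8):
    """Scale each row of w to unit L2 norm, keeping only its direction."""
    return w / (np.linalg.norm(w, axis=1, keepdims=True) + eps)

def moe_dora_forward(x, w0, magnitude, experts, router_w):
    """Hypothetical MoE-DoRA forward pass (assumed formulation).

    x:         (tokens, d_in) input activations
    w0:        (d_out, d_in) frozen pretrained weight
    magnitude: (d_out,) magnitude vector shared by all experts
    experts:   list of (B, A) low-rank pairs, B: (d_out, r), A: (r, d_in)
    router_w:  (d_in, n_experts) token-level router weights
    """
    # Token-level router: a softmax over experts for every token.
    logits = x @ router_w
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    gates = np.exp(logits)
    gates /= gates.sum(axis=1, keepdims=True)            # (tokens, n_experts)

    out = np.zeros((x.shape[0], w0.shape[0]))
    for e, (b, a) in enumerate(experts):
        # Each expert contributes its own unit-norm direction matrix.
        direction = row_normalize(w0 + b @ a)            # (d_out, d_in)
        out += gates[:, e:e + 1] * (x @ direction.T)
    # The shared magnitude rescales the gate-weighted directional output.
    return out * magnitude

rng = np.random.default_rng(0)
d_in, d_out, rank, n_experts, tokens = 8, 6, 2, 3, 5
w0 = rng.normal(size=(d_out, d_in))
magnitude = np.linalg.norm(w0, axis=1)   # DoRA-style init from the frozen weight
# B starts at zero (as in LoRA), so every expert initially points along w0.
experts = [(np.zeros((d_out, rank)), rng.normal(size=(rank, d_in)))
           for _ in range(n_experts)]
router_w = rng.normal(size=(d_in, n_experts))
x = rng.normal(size=(tokens, d_in))
y = moe_dora_forward(x, w0, magnitude, experts, router_w)
```

With zero-initialized `B` matrices and the magnitude set to the row norms of `w0`, this sketch reproduces the frozen layer's output `x @ w0.T` regardless of the router, since the gates sum to one per token. Under this reading, the experts diverge only through their directional terms during training, while the magnitude stays shared.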