Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Mixture of Experts; VLA; Load Balancing; Robotics
Abstract: Vision-Language-Action (VLA) models are developing rapidly and demonstrating promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents several critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets. Given the current scarcity of robot data, it is particularly valuable to fully leverage well-pretrained VLA model weights during the scaling process. (2) Real-time control requires carefully balancing model capacity against computational efficiency. To address these challenges, we present a Mixture-of-Experts (MoE) architecture that naturally scales the VLA model's action expert by replacing dense feedforward layers with sparsely activated MoE layers. The conventional MoE framework, however, suffers from a critical drawback: the auxiliary loss for load balancing generates interfering gradients that misalign with the primary optimization trajectory. We therefore propose AdaMoE, an MoE architecture that decouples expert selection from expert weighting through an independent scale adapter working alongside the traditional router. This decoupling alleviates the gradient conflict between the primary and load-balancing objectives during training, yielding models with enhanced performance. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering performance gains of 1.8\% on LIBERO and 9.3\% on RoboTwin. Most importantly, a substantial 21.5\% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
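The following is a minimal PyTorch sketch of the selection/weighting decoupling described in the abstract; it is not the authors' implementation. The class and parameter names (AdaMoELayer, scale_adapter, num_experts, top_k) and the specific load-balancing loss are illustrative assumptions: the router produces logits used only for top-k expert selection and the auxiliary loss, while a separate scale adapter produces the mixing weights that carry the task-loss gradients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaMoELayer(nn.Module):
    """Illustrative sparsely activated MoE feed-forward layer (hypothetical names).
    The router only *selects* experts; an independent scale adapter produces the
    mixing weights, so the load-balancing auxiliary loss acts on the router's
    selection logits rather than on the weights used to combine expert outputs."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)         # expert selection only
        self.scale_adapter = nn.Linear(d_model, num_experts)  # expert weighting only
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (..., d_model), flattened to (N, d_model) for routing
        shape = x.shape
        x = x.reshape(-1, shape[-1])

        router_logits = self.router(x)                          # drives selection + aux loss
        _, topk_idx = router_logits.topk(self.top_k, dim=-1)    # (N, top_k)

        # Mixing weights come from the scale adapter, gathered at the selected
        # experts, so weighting is decoupled from the load-balanced selection.
        scale_logits = self.scale_adapter(x)
        weights = F.softmax(scale_logits.gather(-1, topk_idx), dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])

        # Standard load-balancing auxiliary loss (Switch-Transformer style),
        # computed from the router distribution only.
        probs = F.softmax(router_logits, dim=-1)
        load = F.one_hot(topk_idx, probs.size(-1)).float().mean(dim=(0, 1))
        importance = probs.mean(dim=0)
        aux_loss = probs.size(-1) * torch.sum(load * importance)

        return out.reshape(shape), aux_loss
```

In this sketch the auxiliary loss backpropagates only into `router`, while the primary task loss flows through `scale_adapter` and the experts, which is one way to realize the gradient-conflict mitigation the abstract describes.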
Primary Area: applications to robotics, autonomy, planning
Submission Number: 5617