Abstract: Diverse vision-language-action (VLA) models have been proposed and have demonstrated remarkable capabilities in robotic manipulation. However, how to effectively ensemble VLAs to further enhance performance remains largely unexplored, because conventional ensemble techniques designed for discriminative tasks cannot be directly applied to generative action policies with high-dimensional, multimodal distributions. To address this challenge, we propose EnsembleVLA, an energy-based framework that enables principled ensembling of diverse VLA models. We establish a unified theoretical framework showing that both diffusion-based and flow-based VLA models can be formulated as energy-based models, in which additive energy combination naturally induces policy composition at the distribution level. This theoretical foundation allows multiple pre-trained policies to be aggregated into a stronger ensemble policy. Building on this compositional framework, EnsembleVLA further incorporates learnable composition weights for dynamic policy balancing, coupled with a confidence-aware gating mechanism that adaptively modulates bounded residual corrections; together these ensure stable and robust task execution. Extensive experiments demonstrate that EnsembleVLA achieves competitive performance across various tasks in both simulated and real-world environments.
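The claim that adding energies composes policies at the distribution level can be illustrated with a toy example. The sketch below is not the EnsembleVLA implementation: it assumes two hypothetical one-dimensional action policies with hand-written energies (`energy_a`, `energy_b`) and fixed, equal composition weights, and it normalizes densities on a grid rather than sampling with a diffusion or flow model. It checks that the density induced by the weighted energy sum equals the normalized weighted product of the individual densities, and contrasts the ensemble mode with a naive average of the two policies' mean actions.

```python
import numpy as np


def density_from_energy(energy, grid):
    """Turn an energy E(a) into a normalized density p(a) ∝ exp(-E(a)) on a uniform grid."""
    unnorm = np.exp(-(energy - energy.min()))  # shift by the minimum for numerical stability
    dx = grid[1] - grid[0]
    return unnorm / (unnorm.sum() * dx)


grid = np.linspace(-4.0, 4.0, 2001)  # candidate 1-D actions
dx = grid[1] - grid[0]

# Hypothetical policy A: bimodal, with equally good actions near -1.5 and +1.5.
energy_a = -np.log(
    np.exp(-0.5 * ((grid + 1.5) / 0.3) ** 2)
    + np.exp(-0.5 * ((grid - 1.5) / 0.3) ** 2)
)
# Hypothetical policy B: unimodal and broad, preferring actions near +1.2.
energy_b = 0.5 * ((grid - 1.2) / 0.8) ** 2

# Assumed fixed composition weights (the paper describes learnable ones).
weights = np.array([0.5, 0.5])
energy_ens = weights[0] * energy_a + weights[1] * energy_b

p_a = density_from_energy(energy_a, grid)
p_b = density_from_energy(energy_b, grid)
p_ens = density_from_energy(energy_ens, grid)

# Additive energies correspond to a weighted product of densities.
product = p_a ** weights[0] * p_b ** weights[1]
product /= product.sum() * dx

# Most likely action under the ensemble vs. averaging each policy's mean action.
ensemble_mode = grid[np.argmax(p_ens)]
mean_a = (grid * p_a).sum() * dx
mean_b = (grid * p_b).sum() * dx
naive_average = 0.5 * (mean_a + mean_b)

print("energy sum matches density product:", np.allclose(p_ens, product))
print("ensemble mode:", ensemble_mode)
print("naive average of mean actions:", naive_average)
```

In this toy setting the ensemble concentrates on the mode of policy A that policy B also supports, whereas the averaged mean action falls between the modes of policy A, where policy A assigns very little probability.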
Lay Summary: Robots powered by Vision-Language-Action (VLA) models can now see their surroundings, follow spoken instructions, and carry out physical tasks. But even the best individual models still fail on harder jobs: they drop objects, bump into things, or struggle to coordinate two arms at once. Since different models have different strengths, a natural question arises: can we combine them into something stronger than any single one? The difficulty is that the obvious approach, simply averaging the motions that different models suggest, usually produces nonsensical movements that no robot can actually perform. Our method, EnsembleVLA, instead merges models at a deeper level. We show that two popular families of robot models can be reinterpreted through a shared mathematical picture called an "energy landscape," which lets us blend their decisions in a principled way. We also let the system learn how much to trust each model and apply small, cautious corrections only when it is confident they will help. In both simulated and real-world experiments, EnsembleVLA succeeded on tasks where individual models failed, including delicate two-arm coordination. We hope this brings us one step closer to robotic assistants reliable enough for everyday use.
Primary Area: Applications->Robotics
Keywords: Vision-Language-Action Models, Ensemble Learning, Energy-based Models
Submission Number: 7533