Energy-Based Action Heads Know When They Don’t Know

Published: 27 May 2026, Last Modified: 27 May 2026 | ICRA 2026 SRRA Workshop LightningTalkPoster | CC BY 4.0
Keywords: OOD-Detection, Failure Prediction, VLA, Energy-Based Models
TL;DR: An energy-based VLA that matches strong baselines on manipulation while flagging OOD inputs zero-shot
Abstract: Vision-language-action (VLA) models have become popular for robot manipulation but still often fail at their tasks, particularly in out-of-distribution (OOD) scenarios. Moreover, existing VLAs are rigid: they require additional post-hoc methods for OOD detection, failure prediction, and other test-time behaviours. We propose the Energy-Based VLA (EB-VLA) to address this limitation. Instead of generating actions directly, EB-VLA learns a scalar energy landscape over the action space conditioned on multimodal context, and generates action chunks via test-time optimization. Our experiments across multiple manipulation benchmarks demonstrate two benefits of EB-VLA. First, it achieves competitive manipulation performance, outperforming diffusion policies on contact-rich tasks and matching token-reasoning VLAs that are an order of magnitude larger. Second, the energy value serves as a learned confidence score, acting as an effective zero-shot OOD detector against visual perturbations that requires no additional training or calibration. Our results highlight the potential of EB-VLA for self-monitoring robot policies that can tell when they are uncertain.
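The two ideas in the abstract (actions obtained by minimizing an energy at test time, and the resulting energy value reused as a confidence score) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the hand-written quadratic `energy` stands in for a learned network, and the function names, context vectors, step count, and learning rate are hypothetical rather than taken from EB-VLA.

```python
# Toy stand-in for a learned energy network. The quadratic form and all
# names here are illustrative assumptions, not the EB-VLA architecture.

def energy(action, context):
    """Scalar energy: low when the action fits a familiar context."""
    target = [0.5 * c for c in context]  # hypothetical preferred action
    fit = sum((a - t) ** 2 for a, t in zip(action, target))
    # Hypothetical novelty term: grows when context leaves the range [-1, 1].
    novelty = sum(max(abs(c) - 1.0, 0.0) ** 2 for c in context)
    return 0.5 * fit + novelty

def energy_grad(action, context):
    """Analytic gradient of the energy with respect to the action only."""
    return [a - 0.5 * c for a, c in zip(action, context)]

def optimize_action(context, steps=100, lr=0.1):
    """Test-time optimization: plain gradient descent on the action."""
    action = [0.0] * len(context)
    for _ in range(steps):
        grad = energy_grad(action, context)
        action = [a - lr * g for a, g in zip(action, grad)]
    # The energy at the optimized action doubles as a confidence score.
    return action, energy(action, context)

nominal_ctx = [0.2, -0.4, 0.6]    # stands in for an in-distribution input
perturbed_ctx = [2.0, -3.0, 0.6]  # stands in for a visually perturbed input

nominal_action, nominal_energy = optimize_action(nominal_ctx)
perturbed_action, perturbed_energy = optimize_action(perturbed_ctx)
print(nominal_energy < perturbed_energy)
```

In this toy setting the optimized action converges in both cases, but the residual energy stays higher for the perturbed context, which is the sense in which the energy can act as an OOD signal without a separate detector.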
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 41