Energy-Based Action Heads Know When They Don’t Know

Published: 13 May 2026, Last Modified: 13 May 2026
Venue: ICRA 2026: From Data to Decisions (Poster)
License: CC BY 4.0
Keywords: OOD-Detection, Failure Prediction, VLA, Energy-Based Models
Abstract: Vision-language-action (VLA) models have become popular for robot manipulation but still often fail at their tasks, particularly in out-of-distribution (OOD) scenarios. Existing VLAs are rigid: they require additional post-hoc methods for OOD detection, failure prediction, and other test-time behaviours. We propose the Energy-Based VLA (EB-VLA) to address this limitation. Instead of generating actions directly, EB-VLA learns a scalar energy landscape over the action space, conditioned on multimodal context, and generates action chunks via test-time optimization. Our experiments across several manipulation benchmarks demonstrate two benefits of EB-VLA. First, it achieves competitive manipulation performance, outperforming diffusion policies on contact-rich tasks and matching token-reasoning VLAs that are an order of magnitude larger. Second, the energy value serves as a learned confidence score, acting as a highly effective zero-shot OOD detector under visual perturbations, with no additional training or calibration. Our results highlight the potential of EB-VLA for self-monitoring robot policies that can tell when they are uncertain.
Submission Number: 43
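
The two mechanisms named in the abstract, generating an action by test-time optimization over an energy landscape and reading the resulting energy as a confidence score, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: `toy_energy` is a hand-written stand-in for a learned energy network, the action is a flat vector rather than an action chunk, plain gradient descent with finite-difference gradients replaces whatever optimizer EB-VLA uses, and the threshold in `is_ood` is a hypothetical value (the abstract does not say how energy scores are turned into decisions).

```python
from typing import Callable, List, Sequence, Tuple

EnergyFn = Callable[[Sequence[float], Sequence[float]], float]


def toy_energy(action: Sequence[float], context: Sequence[float]) -> float:
    """Hypothetical stand-in for a learned energy E(action | context).

    A quadratic bowl centred on the context vector, plus a penalty that
    grows once the squared context norm exceeds 1. The penalty is a
    contrived way to give unfamiliar contexts a higher minimum energy.
    """
    fit = sum((a - c) ** 2 for a, c in zip(action, context))
    unfamiliarity = max(0.0, sum(c * c for c in context) - 1.0)
    return fit + unfamiliarity


def numerical_grad(energy: EnergyFn, action: Sequence[float],
                   context: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Central-difference gradient of the energy with respect to the action."""
    grad = []
    for i in range(len(action)):
        up, down = list(action), list(action)
        up[i] += eps
        down[i] -= eps
        grad.append((energy(up, context) - energy(down, context)) / (2 * eps))
    return grad


def optimize_action(energy: EnergyFn, context: Sequence[float],
                    init: Sequence[float], steps: int = 200,
                    lr: float = 0.1) -> Tuple[List[float], float]:
    """Descend the energy landscape; return the action and its final energy."""
    action = list(init)
    for _ in range(steps):
        grad = numerical_grad(energy, action, context)
        action = [a - lr * g for a, g in zip(action, grad)]
    return action, energy(action, context)


def is_ood(final_energy: float, threshold: float) -> bool:
    """Flag a context as OOD when the optimized energy stays high.

    The threshold is a hypothetical parameter of this sketch.
    """
    return final_energy > threshold


if __name__ == "__main__":
    for ctx in ([0.3, -0.4], [2.0, 1.0]):
        act, e = optimize_action(toy_energy, ctx, init=[0.0, 0.0])
        print(ctx, [round(a, 3) for a in act], round(e, 3), is_ood(e, 0.5))
```

In this sketch the same scalar does both jobs: its gradient drives action generation, and its value after optimization is the confidence score, so no separate detector has to be trained.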