Keywords: OOD-Detection, Failure Prediction, VLA, Energy-Based Models
Abstract: Vision-language-action (VLA) models have become
popular for robot manipulation but still often fail at
their tasks, particularly in out-of-distribution (OOD) scenarios.
Existing VLAs are also rigid: they require additional
post-hoc methods for OOD detection, failure prediction, and
other test-time behaviours. We propose the Energy-Based
VLA (EB-VLA) to address this limitation. Instead of generating
actions directly, EB-VLA learns a scalar energy landscape over
the action space conditioned on multimodal context, generating
action chunks via test-time optimization. Our experiments
across different manipulation benchmarks demonstrate two
benefits of EB-VLA: First, we achieve competitive manipulation
performance, outperforming diffusion policies on contact-rich
tasks and matching token-reasoning VLAs that are an order of
magnitude larger. Second, the energy value serves as a learned
confidence score, acting as a highly effective, zero-shot OOD
detector against visual perturbations that requires no additional
training or calibration. Our results highlight the potential of
EB-VLA for self-monitoring robot policies that can tell when
they are uncertain.
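
As a rough illustration of the mechanism described above (a scalar energy over actions, minimized at test time, with the final energy value reused as a score), the following is a minimal sketch and not the authors' implementation. The quadratic energy, the matrix `W`, and the names `energy`, `energy_grad`, and `infer_action` are all hypothetical stand-ins for a learned, context-conditioned network.

```python
import numpy as np


def energy(action, context, W):
    """Toy scalar energy (assumption): low when the action matches W @ context."""
    residual = action - W @ context
    return 0.5 * float(residual @ residual)


def energy_grad(action, context, W):
    """Gradient of the toy energy with respect to the action."""
    return action - W @ context


def infer_action(context, W, steps=100, lr=0.1, seed=0):
    """Test-time optimization: gradient descent on the energy over actions.

    Returns the optimized action and its final energy, which a learned
    model could expose as a confidence score.
    """
    rng = np.random.default_rng(seed)
    action = rng.normal(size=W.shape[0])  # random initial action chunk
    for _ in range(steps):
        action = action - lr * energy_grad(action, context, W)
    return action, energy(action, context, W)


if __name__ == "__main__":
    W = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])  # hypothetical weights
    context = np.array([0.5, -1.0])                     # hypothetical context features
    action, score = infer_action(context, W)
    print("action:", action)
    print("final energy:", score)
```

In this toy, the minimum energy is zero for every context, so the score carries no OOD information; the sketch only shows the inference loop and where a score would be read off. A learned energy would presumably reach different minimum values for familiar and unfamiliar inputs.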
Submission Number: 43