Keywords: Vision-Language-Action Models, Token-Level Uncertainty, Shared Autonomy, Failure Detection
Abstract: Robots operating in open-ended environments must be able to recognize the limits of their understanding and know when to rely on human input. While vision-language-action (VLA) models such as $\pi_0$-FAST provide scalable and expressive policies through next-token prediction, they often lack mechanisms for introspection or fallback under uncertainty. We present a system that leverages token-level uncertainty from a fine-tuned $\pi_0$-FAST model to enable \textit{uncertainty-aware human intervention} during robotic manipulation. When prediction uncertainty exceeds a threshold, the robot halts execution and explicitly requests a one-step corrective action from a human operator. We evaluate our system against two baselines---\textit{random intervention} and \textit{no intervention}---and demonstrate that using uncertainty as a trigger improves task success rates and reliability across manipulation tasks. As a supplementary investigation, we also explore a shared-control variant that blends human joystick input with model actions based on uncertainty, illustrating an alternate use of model introspection. Our results suggest that token-level uncertainty in VLA models provides meaningful signals for decision arbitration, failure prediction, and adaptive human-robot collaboration.
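The thresholding mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes uncertainty is measured as the mean Shannon entropy of the model's next-token distributions over an action chunk, and the names (`token_entropies`, `should_request_help`) and the threshold value are hypothetical.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_entropies(logits: np.ndarray) -> np.ndarray:
    """Per-token Shannon entropy (in nats) of the predicted
    next-token distributions; shape (num_tokens,)."""
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def should_request_help(logits: np.ndarray, threshold: float = 1.5) -> bool:
    """Halt and request a one-step human correction when the mean
    token-level uncertainty over the action chunk exceeds the threshold.
    The threshold value here is illustrative, not from the paper."""
    return float(token_entropies(logits).mean()) > threshold
```

In a shared-control variant, the same entropy score could instead be mapped to a continuous blending weight between the human's joystick command and the model's action, rather than a hard halt.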
Submission Number: 23