Probing a Vision-Language-Action Model for Symbolic States and Integration into a Cognitive Architecture

Published: 12 May 2025 · Last Modified: 27 May 2025 · ICRA-Safe-VLM-WS-2025 Spotlight · CC BY 4.0
Keywords: Probing, Vision-language-action model, Cognitive Architecture, Robotics, Explainable AI
TL;DR: We train probes to predict symbolic states from the hidden-layer activations of OpenVLA and demonstrate the integration of OpenVLA into a robotics Cognitive Architecture.
Abstract: Vision-language-action (VLA) models hold promise as generalist robotics solutions by translating visual and linguistic inputs into robot actions, yet they lack reliability due to their black-box nature and sensitivity to environmental changes. In contrast, cognitive architectures (CAs) excel in symbolic reasoning and state monitoring but are constrained by rigid, predefined execution. This work bridges these approaches by probing OpenVLA's hidden layers to uncover symbolic representations of object properties, relations, and action states, enabling integration with a CA for enhanced interpretability and robustness. Through experiments on LIBERO-spatial pick-and-place tasks, we analyze the encoding of symbolic states across different layers of OpenVLA's Llama backbone. Our probing results show consistently high accuracies ($>0.90$) for both object and action states across most layers, though contrary to our hypotheses, we did not observe the expected pattern of object states being encoded earlier than action states. In the appendix, we demonstrate an integrated DIARC-OpenVLA system that leverages these symbolic representations for real-time state monitoring, laying the foundation for more interpretable and reliable robotic manipulation.
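As a rough illustration of the probing setup described in the abstract, the sketch below trains a simple linear probe on hidden-layer activations to predict a binary symbolic state label. All names, shapes, and the probe choice (logistic regression) are assumptions for illustration only; the paper's actual probe architecture, activation-extraction pipeline, and label set may differ.

```python
# Minimal probing sketch (assumptions): per-timestep hidden-state activations
# from one layer of OpenVLA's Llama backbone are presumed to be already
# extracted into X (n_samples x hidden_dim), paired with binary symbolic
# state labels y (e.g., a hypothetical "gripper_holding_object" predicate).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_probe(X: np.ndarray, y: np.ndarray, seed: int = 0) -> float:
    """Fit a linear probe on hidden activations and return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_tr, y_tr)
    return accuracy_score(y_te, probe.predict(X_te))


if __name__ == "__main__":
    # Random placeholder data standing in for real activations and labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4096))   # 4096 ~ Llama hidden size (assumption)
    y = rng.integers(0, 2, size=500)   # binary symbolic-state labels
    print(f"probe accuracy: {train_probe(X, y):.2f}")
```

In this kind of setup, one such probe would typically be trained per layer and per symbolic predicate, and the per-layer accuracies compared to see where in the backbone each state becomes linearly decodable.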
Submission Number: 19