S3E: Semantic Symbolic State Estimation With Vision-Language Foundation Models

Guy Azran; Yuval Goshen; Kai Yuan; Sarah Keren

S3E: Semantic Symbolic State Estimation With Vision-Language Foundation Models

Guy Azran, Yuval Goshen, Kai Yuan, Sarah Keren

20 Sept 2024 (modified: 05 Feb 2025)Submitted to ICLR 2025EveryoneRevisionsBibTeXCC BY 4.0

Keywords: Task Planning, Symbolic AI, Computer Vision, Vision-Language Models

TL;DR: An empirical evaluation of state-of-the-art visual question-answering as state estimators for symbolic task planning.

Abstract: In automated task planning, state estimation is the process of translating an agent's sensor input into a high-level task state. It is important because real-world environments are unpredictable, and actions often do not lead to expected outcomes. State estimation enables the agent to manage uncertainties, adjust its plans, and make more informed decisions. Traditionally, researchers and practitioners relied on hand-crafted and hard-coded state estimation functions to determine the abstract state defined in the task domain. Recent advancements in Vision Language Models (VLMs) enable autonomous retrieval of semantic information from visual input. We present Semantic Symbolic State Estimation (S3E), the first general-purpose symbolic state estimator based on VLMs that can be applied in various settings without specialized coding or additional exploration. S3E takes advantage of the foundation model's internal world model and semantic understanding to assess the likelihood of certain symbolic components of the environment's state. We analyze S3E as a multi-label classifier, reveal different kinds of uncertainties that arise when using it, and show how they can be mitigated using natural language and targeted environment design. We show that S3E can achieve over 90\% state estimation precision in our simulated and real-world robot experiments.

Supplementary Material: zip

Primary Area: applications to robotics, autonomy, planning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 2206

Loading