Keywords: Symbolic Abstraction, Task and Motion Planning, Vision-Language Models, Neuro-Symbolic AI, Long-Horizon Planning, Robot Learning, CALVIN Benchmark
TL;DR: This paper introduces a method to extract sound symbolic representations from Vision-Language Models to enable robust Task and Motion Planning for long-horizon robotics tasks, outperforming end-to-end and open-loop baselines.
Abstract: Complex long-horizon robotics tasks require combining high-level reasoning with low-level control. This paper presents a method to bridge this gap by learning sound symbolic abstractions from Vision-Language Models (VLMs) to enable robust Task and Motion Planning (TAMP). We propose a pipeline that converts the continuous confidence scores of a pre-trained VLM into discrete, verifiable symbolic predicates through confidence thresholding and temporal filtering. Our approach is evaluated on the challenging CALVIN benchmark, where it outperforms end-to-end and open-loop baselines, achieving a 68.5% task success rate. We provide a detailed analysis of the soundness-completeness trade-off inherent in learning abstractions and demonstrate the superiority of our closed-loop, neuro-symbolic architecture for long-horizon tasks.
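The core pipeline step the abstract describes (confidence thresholding plus temporal filtering to discretize VLM scores into predicates) could be sketched as follows. This is an illustrative sketch, not the paper's implementation; the threshold `tau`, window length `k`, and the function name are hypothetical parameters chosen for the example.

```python
# Hedged sketch: turn per-frame VLM confidence scores for one predicate
# into a discrete, sound truth assignment. The predicate is asserted only
# after its score stays above a threshold for k consecutive frames, which
# suppresses transient false positives (favoring soundness over
# completeness, as discussed in the abstract). tau and k are assumed
# hyperparameters, not values from the paper.
from collections import deque

def discretize_predicate(scores, tau=0.8, k=3):
    """Return one boolean per frame: True once confidence >= tau
    has held for k consecutive frames."""
    window = deque(maxlen=k)
    decisions = []
    for s in scores:
        window.append(s >= tau)
        decisions.append(len(window) == k and all(window))
    return decisions

# A single noisy spike (0.9 at frame 1) is filtered out; the predicate
# only turns True once confidence is sustained for k = 3 frames.
print(discretize_predicate([0.2, 0.9, 0.3, 0.85, 0.9, 0.95]))
# → [False, False, False, False, False, True]
```

The temporal window trades a short detection delay for fewer spurious predicate flips, which is the soundness-biased behavior a downstream TAMP planner needs to trust its symbolic state.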
Submission Number: 11