Keywords: Vision-Language Models, Deformable Object Manipulation, Taxonomy
Abstract: Vision-Language Models (VLMs) can describe scenes in natural language, supporting tasks such as robot planning and action grounding. However, they struggle in deformable object manipulation (DOM), where reasoning about motion, interaction, and deformation is critical. In this work, we investigate whether guiding language models with a taxonomy for DOM can enable structured reasoning about DOM tasks. We evaluate the performance of our approach on three challenging DOM tasks: towel twisting, meat phantom transport, and cloth edge tracing. Our results demonstrate the potential of taxonomy-guided VLMs to interpret these tasks without fine-tuning or curated datasets.
Supplementary Material: pdf
Submission Number: 24