DeiSAM: Segment Anything with Deictic Prompting

Hikaru Shindo; Manuel Brack; Gopika Sudhakaran; Devendra Singh Dhami; Patrick Schramowski; Kristian Kersting

Published: 11 Dec 2023, Last Modified: 22 Dec 2023, NuCLeaR 2024

Keywords: neuro-symbolic reasoning, object segmentation, deictic representation, large language models

TL;DR: Segment objects from complex textual prompts using neuro-symbolic reasoning with large-scale neural networks.

Abstract: Large-scale, pre-trained neural networks have demonstrated strong capabilities in various tasks, including zero-shot image segmentation. To identify concrete objects in complex scenes, humans instinctively rely on *deictic* descriptions in natural language, i.e., descriptions that refer to something depending on the context, e.g., *"The object that is on the desk and behind the cup"*. However, deep learning approaches cannot reliably interpret these deictic representations due to their lack of reasoning capabilities in complex scenarios. To remedy this issue, we propose DeiSAM, which integrates large pre-trained neural networks with differentiable logic reasoners. Given a complex, textual segmentation description, DeiSAM leverages Large Language Models (LLMs) to generate first-order logic rules and performs differentiable forward reasoning on generated scene graphs. Subsequently, DeiSAM segments objects by matching them to the logically inferred image regions. As part of our evaluation, we propose the Deictic Visual Genome (DeiVG) dataset, containing paired visual input and complex, deictic textual prompts. Our empirical results demonstrate that DeiSAM substantially improves over data-driven neural baselines on deictic segmentation tasks.
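
To make the reasoning step in the abstract concrete, the following is a minimal illustrative sketch and not the DeiSAM implementation. It assumes a hand-written scene graph with confidence scores and a single hand-written rule for the example prompt, whereas the abstract describes LLM-generated rules and generated scene graphs. The names `SceneGraph` and `infer_targets` are hypothetical, and scoring a conjunction by the product of confidences is an assumption chosen for simplicity.

```python
from typing import Dict, List, Tuple

# Hypothetical scene graph: (subject, relation, object) -> confidence in [0, 1].
SceneGraph = Dict[Tuple[str, str, str], float]


def infer_targets(graph: SceneGraph, body: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score each candidate X for a rule target(X) :- rel1(X, o1), rel2(X, o2), ...

    `body` lists the (relation, object) atoms of the rule. The conjunction is
    scored as the product of the atom confidences (an assumed soft AND);
    a missing fact counts as confidence 0.
    """
    candidates = sorted({subject for (subject, _, _) in graph})
    scores: Dict[str, float] = {}
    for x in candidates:
        score = 1.0
        for relation, obj in body:
            score *= graph.get((x, relation, obj), 0.0)
        scores[x] = score
    return scores


# Toy scene graph with made-up confidences.
scene_graph: SceneGraph = {
    ("book", "on", "desk"): 0.9,
    ("book", "behind", "cup"): 0.8,
    ("pen", "on", "desk"): 0.95,
    ("cup", "on", "desk"): 0.9,
}

# Rule for "the object that is on the desk and behind the cup":
# target(X) :- on(X, desk), behind(X, cup).
rule_body = [("on", "desk"), ("behind", "cup")]

scores = infer_targets(scene_graph, rule_body)
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy setting only the book satisfies both atoms, so it receives the highest score. In the pipeline the abstract describes, the image region of the inferred object would then be handed to a segmentation model.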

Submission Number: 12
