Commonsense Storage Reasoning in Domestic Scenes: A Challenge for Vision-Language Models

Published: 14 Jun 2025, Last Modified: 16 Aug 2025
Venue: MKLM 2025
License: CC BY 4.0
Submission Type: Non-archival
Keywords: service robots, domestic robots, object detection, multimodal learning, vision-language models, commonsense reasoning, spatial reasoning, household robotics, grounding
TL;DR: We introduce a benchmark for evaluating multimodal models on commonsense storage prediction in kitchens, revealing that current vision-language systems fall well short of human performance.
Abstract: To operate effectively in household environments, service robots must reason about object placement not just through direct visual perception, but also by drawing on context, prior knowledge, and commonsense expectations. For example, when asked to locate a spoon in an unfamiliar kitchen, a human might infer it is most likely stored in a drawer near the countertop. Enabling similar reasoning in multimodal models presents a significant challenge. In this work, we introduce the Stored Household Item Challenge, a benchmark designed to evaluate commonsense spatial reasoning in domestic settings. The task requires models to predict the most likely storage location for a given item—such as a drawer or cabinet door—even when the item is not visible. We release two complementary datasets: a crowdsourced development set containing 6,500 annotated item-image pairs from kitchen scenes, and a real-world test set based on actual item storage in private homes. We evaluate a range of state-of-the-art models, including vision-language models (Kosmos-2, Grounding-DINO) and multimodal large language models (Gemini, GPT-4o). Our results reveal that many models perform at or below random chance, and none come close to matching human-level performance. This highlights key limitations in current models’ ability to integrate visual context with structured commonsense knowledge. By providing this task and dataset, we offer a novel testbed for advancing and benchmarking multimodal reasoning capabilities in real-world environments.
Submission Number: 13
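
To make the task format described in the abstract concrete, here is a minimal sketch of how a single storage-location query might be posed to a multimodal LLM such as GPT-4o. This is not the paper's evaluation harness: the example image path, item, candidate region labels, and prompt wording are illustrative assumptions, and the OpenAI chat API is used only as one possible backend.

"""Minimal sketch: asking a multimodal LLM which storage region likely holds an item.

Assumptions (not from the paper): the example image, item, candidate region
labels, the prompt wording, and the choice of the OpenAI chat completions API.
"""
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def encode_image(image_path: str) -> str:
    """Read a local kitchen image and return it as a base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def predict_storage_location(image_path: str, item: str, candidates: list[str]) -> str:
    """Ask the model which labeled storage region most likely contains `item`."""
    prompt = (
        f"The kitchen image shows several storage regions labeled "
        f"{', '.join(candidates)}. Which single region is the most likely "
        f"storage place for a {item} that is not visible in the image? "
        f"Answer with the region label only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{encode_image(image_path)}"
                        },
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content.strip()


if __name__ == "__main__":
    # Hypothetical example: the image file, item, and region labels are placeholders.
    answer = predict_storage_location(
        image_path="kitchen_001.jpg",
        item="spoon",
        candidates=["A (drawer)", "B (cabinet door)", "C (open shelf)"],
    )
    print("Predicted storage region:", answer)

Accuracy on such forced-choice queries can then be compared against a random-chance baseline (1/N for N candidate regions) and against human annotators, which is the kind of comparison the benchmark is designed to support.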