Keywords: VLM, VQA, Scene understanding, robotics
TL;DR: A new benchmark, Office Hours, tests Vision–Language Models on spatial and temporal object association in realistic, cluttered office videos—revealing major gaps in current models’ ability to ground identities and track changes across space and time.
Abstract: Associating objects with their owners and tracking changes over time are essential capabilities for autonomous robots operating in cluttered, visually redundant, and dynamic environments. Yet existing benchmarks focus on static, uncluttered, and synthetic scenes that fail to capture real-world challenges such as inter-workspace ambiguity and subtle intra-workspace changes. To fill this gap, we introduce the Office Hours benchmark dataset: a large-scale, two-part video benchmark comprising six robot-filmed walkthroughs of 23 cubicles over five temporal episodes (global subset) and handheld recordings of 10 cubicles across 20 temporal episodes (local subset). We annotate ~1,500 object-level changes across four categories (Object Detection, Count, Localization, State Detection) and provide over 1,600 multiple-choice visual question answering (VQA) questions spanning five complementary tasks: Spatial Association VQA, Static Association–Semantic Mapping VQA, Temporal Association VQA, Single-Cubicle-Multi-Temporal VQA, and Multi-Cubicle-Multi-Temporal VQA.
Using Gemini 2.5 Pro as a strong baseline, our experiments reveal persistent shortcomings. On Multi-Cubicle-Multi-Temporal VQA, localization accuracy barely exceeds the random-guessing level (~25%). On Single-Cubicle-Multi-Temporal VQA, overall accuracy reaches 56.8%, with object counting and object state change questions remaining challenging. These results, among others, highlight critical gaps in current VLMs' ability to maintain consistent object associations across space and time.
Croissant File: json
Dataset URL: https://kaggle.com/datasets/cfe2e8cb6905fe01f377cb55bfdd97d3bd287c3eaa98528681cf3ea3083de9ec
Code URL: https://github.com/Junf137/office_hours
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1392