Point-and-Click: A Procedural Benchmark for 2D Adventure Puzzle Solving

ICLR 2026 Conference Submission18509 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Procedural Generation, Visual Reasoning

TL;DR: Point-and-Click is a benchmark for 2D adventure games that procedurally generates rich puzzle rooms to test long-horizon reasoning, visual grounding, and implicit goal deduction.

Abstract: Point-and-click adventure games offer an ideal platform for testing multimodal large language model agents on long-horizon reasoning, commonsense knowledge, and language-perception grounding. Such games demand creative, compositional reasoning and the deduction of implicit goals. However, existing benchmarks provide limited support for compositional and generative puzzles, and often suffer from data contamination. To bridge this gap, we present Point-and-Click, a benchmark for 2D adventure games that procedurally generates rich puzzles and provides ground-truth solutions for evaluation. The environment instantiates controllable directed acyclic graphs of puzzle dependencies over primitives like keys/locks, codes, and pattern matching, spanning an exponentially scaling number of layouts with tunable difficulty. Experiments reveal the limitations of current multimodal LLM/VLM agents on this benchmark. We hope Point-and-Click serves as a rigorous testbed for progress on general-purpose embodied reasoning and implicit goal deduction in interactive environments.
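To make the idea of a puzzle-dependency DAG with a ground-truth solution concrete, here is a minimal sketch. It is not the authors' generator: the function names, the `num_puzzles`, `max_deps`, and `seed` parameters, and the primitive labels are assumptions chosen for illustration. Each puzzle may depend only on puzzles with a lower id, which guarantees acyclicity, and a topological order serves as one valid reference solve order.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Primitive types named in the abstract; the string labels are hypothetical.
PRIMITIVES = ["key_lock", "code", "pattern_match"]


def generate_puzzle_dag(num_puzzles, max_deps, seed):
    """Build a random puzzle-dependency DAG (illustrative, not the paper's method).

    Returns {puzzle_id: {"kind": str, "deps": set of prerequisite ids}}.
    Dependencies only point to lower ids, so no cycle can form.
    """
    rng = random.Random(seed)
    dag = {}
    for pid in range(num_puzzles):
        # A puzzle can have at most `pid` prerequisites, capped by max_deps.
        k = rng.randint(0, min(max_deps, pid))
        deps = set(rng.sample(range(pid), k))
        dag[pid] = {"kind": rng.choice(PRIMITIVES), "deps": deps}
    return dag


def reference_solution(dag):
    """Return one solve order in which every puzzle follows its prerequisites."""
    sorter = TopologicalSorter({pid: node["deps"] for pid, node in dag.items()})
    return list(sorter.static_order())


if __name__ == "__main__":
    dag = generate_puzzle_dag(num_puzzles=6, max_deps=2, seed=0)
    for pid in reference_solution(dag):
        node = dag[pid]
        print(pid, node["kind"], sorted(node["deps"]))
```

In this sketch, raising `num_puzzles` or `max_deps` acts as a rough difficulty knob, and fixing `seed` makes a generated room reproducible for evaluation.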

Primary Area: datasets and benchmarks

Submission Number: 18509
