PoolBench: Benchmarking Large Language Models on Continuous Physical Action Selection in Eight-Ball Pool

Prapti Patra; Dhruv Kumar

Published: 07 Jun 2026, Last Modified: 07 Jun 2026

Venue: ICML 2026 Workshop

Readers: Everyone

License: CC BY 4.0

Keywords: Large language models, Benchmark, Continuous control, Physical reasoning, Simulation-grounded evaluation

TL;DR: PoolBench asks LLMs to play 8-ball by emitting aim angle and cue speed, graded by a physics simulator. Across 300 shots from 6 models, zero balls were pocketed, while a geometric Oracle pots 100%. The contribution is the benchmark, not the gap.

Abstract: We introduce PoolBench, a reproducible benchmark that asks Large Language Models (LLMs) to play eight-ball pool by emitting real-valued cue parameters (aim angle and cue speed), whose physical effect is then resolved by a deterministic billiards simulator. Unlike physical-reasoning benchmarks that score textbook problem-solving or multiple-choice commonsense, PoolBench evaluates continuous-action selection: the model's ability to translate a textual board description into a numerical action that succeeds when executed in the simulator. We evaluate six LLMs (three frontier models and three small open models) on 50 deterministic scenarios spanning seven difficulty categories, and compare them against a pocket-aware geometric Oracle that searches over a small grid of speeds and angle perturbations. Across 300 LLM shots, no balls were pocketed, even though the system prompt explicitly teaches the ghost-ball geometric construction; the strongest model, Claude Sonnet 4, reached only 54% legal first contact (Wilson 95% CI [40.4, 67.0]). At this scale the LLMs are largely indistinguishable from one another, and all results are consistent with the findings of Memery et al. (2024), who documented similar failures on simpler systems. Our contribution is the benchmark infrastructure (scenario taxonomy, four-metric scoring, deterministic seeds, and a side-by-side LLM comparison protocol), not the discovery of the gap. We release the scenarios and per-shot trace data, and we outline targeted experiments (rounding ablation, target-vs-angle decomposition, reasoning-model evaluation) that future work should run on this scaffold.
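The Wilson interval quoted in the abstract can be checked with a few lines of Python. This is a minimal sketch, not the authors' scoring code: the function name `wilson_interval` is hypothetical, and the count of 27 legal first contacts out of 50 shots is an assumption inferred from the reported 54% and from 300 shots divided evenly across six models.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    # Centre of the interval is pulled from p toward 0.5.
    center = (p + z * z / (2 * n)) / denom
    # Half-width combines binomial variance with a small-sample correction.
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts: 27 legal first contacts in 50 shots (54%).
lo, hi = wilson_interval(27, 50)
print(f"[{100 * lo:.1f}, {100 * hi:.1f}]")  # [40.4, 67.0]
```

Under these assumed counts the result reproduces the reported interval. The Wilson form is a reasonable choice at n = 50 because, unlike the normal-approximation interval, it stays within [0, 1] and remains informative when the observed count is zero, as with the pocketing rate here.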

Paper Type: Standard paper

Submission Number: 17
