ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?

ICLR 2026 Conference Submission 19320 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: LLM, Agent, Virtual Reality, Benchmark, Evaluation
TL;DR: We benchmark 6 LLMs on 262 VR game tasks: they identify 80%+ of correct actions but achieve <30% accuracy on procedural sequencing, excelling at what to do but failing at how/when to do it.
Abstract: Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays. While humans perform this translation intuitively, drawing on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces ComboBench, a benchmark that evaluates LLMs' capability to translate semantic actions into VR device manipulations across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate six state-of-the-art LLMs (GPT-4o, GPT-4-turbo, Gemini-1.5-Pro, Llama-3-8B, Mixtral-8x7B, and GLM-4-flash) on their ability to generate appropriate manipulation sequences, comparing their outputs against annotated ground truth and human performance. Our results reveal that while top-performing models such as Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at \url{https://sites.google.com/view/combobench}.
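To make the gap described in the TL;DR concrete (identifying the right actions versus ordering them correctly), the following is a minimal illustrative sketch, not the paper's released evaluation code: it assumes hypothetical action labels and scores a predicted manipulation sequence against annotated ground truth with an order-insensitive recall and a simple order-sensitive (longest-common-subsequence) score.

```python
# Hypothetical sketch: action identification vs. procedural sequencing.
# Action names and scoring choices are illustrative assumptions, not ComboBench's metrics.

def action_recall(predicted, ground_truth):
    """Fraction of ground-truth manipulations that appear anywhere in the prediction."""
    gt = set(ground_truth)
    return len(gt & set(predicted)) / len(gt) if gt else 1.0

def sequence_accuracy(predicted, ground_truth):
    """Longest common subsequence length, normalized by ground-truth length,
    as a simple order-sensitive score."""
    m, n = len(predicted), len(ground_truth)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if predicted[i] == ground_truth[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / n if n else 1.0

# All required manipulations are identified (recall = 1.0),
# but two steps are out of order, so the order-sensitive score drops.
gt = ["grip_right", "move_to_object", "press_trigger", "raise_controller"]
pred = ["press_trigger", "grip_right", "move_to_object", "raise_controller"]
print(action_recall(pred, gt))       # 1.0
print(sequence_accuracy(pred, gt))   # 0.75
```

Under this kind of scoring, a model can look strong on "what to do" while still scoring poorly on "how/when to do it", which is the pattern the benchmark reports.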
Primary Area: datasets and benchmarks
Submission Number: 19320