Keywords: LLM, Agent, Virtual Reality, Benchmark, Evaluation
TL;DR: We benchmark 12 LLMs on 262 VR game tasks: they identify over 80% of correct actions but achieve under 30% accuracy on procedural sequencing, excelling at deciding what to do but failing at how and when to do it.
Abstract: Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans perform this translation intuitively, drawing on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces \methodname, a benchmark that evaluates LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate twelve LLMs (GPT-3.5, GPT-4, GPT-4o, GPT-5.1, Gemini-1.5-Pro, Gemini-3-Pro, Claude-Sonnet-4.5, Grok-4, GLM-4-Flash, LLaMA-3-8B, LLaMA-3-70B, and Mixtral-8x7B), comparing their outputs against annotated ground truth and human performance. Our results reveal that while top-performing models such as Gemini-3-Pro demonstrate strong task decomposition capabilities, they still lag behind humans in procedural reasoning and spatial understanding. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities.
We release all materials at \url{https://sites.google.com/view/combobench}.
Primary Area: datasets and benchmarks
Submission Number: 19320