Keywords: Lunar Exploration; Task-Oriented Reasoning; Large Language Models; Benchmark Evaluation
TL;DR: We introduce Lunar-Bench, a benchmark that evaluates task-oriented reasoning of large language models in realistic lunar mission scenarios using structured, process-level metrics.
Abstract: The deployment of large language models (LLMs) in lunar exploration presents significant challenges, requiring robust reasoning under partial observability, dynamic constraints, and severe resource limitations. Yet existing benchmarks largely overlook these demands, focusing instead on static, context-agnostic tasks. To fill this gap, we introduce **Lunar-Bench**, the first benchmark tailored to evaluating LLMs in realistic lunar mission scenarios. Constructed from authentic mission protocols and telemetry data, Lunar-Bench contains 3,000 high-fidelity tasks spanning diverse operational domains and difficulty levels. Beyond traditional accuracy-based evaluation, it introduces **Environmental Scenario Indicators (ESI)**, a set of process-centric metrics assessing safety, efficiency, factual integrity, and alignment. Evaluating 36 state-of-the-art LLMs, we find that the best-performing model achieves only 47.8% accuracy, far below the human expert baseline of 65.1%. Moreover, prompting strategies such as Chain-of-Thought yield limited and inconsistent gains while substantially increasing computational overhead. Our analysis highlights persistent deficiencies in ensuring safety, achieving reasoning completeness, and maintaining alignment. By addressing these gaps, Lunar-Bench provides a principled framework for diagnosing weaknesses and guiding the development of more robust and trustworthy LLMs for high-stakes, safety-critical environments.
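To make the process-centric evaluation idea concrete, the sketch below shows one plausible way a composite score like the ESI could aggregate per-task sub-scores for safety, efficiency, factual integrity, and alignment. The abstract does not specify the actual formula; all names, weights, and the weighted-mean aggregation here are illustrative assumptions, not the paper's definition.

```python
# Illustrative sketch only: the paper does not publish its ESI formula here.
# We assume each task yields four sub-scores in [0, 1] and combine them with
# hypothetical weights that emphasize safety, reflecting the paper's focus on
# safety-critical operation.

from dataclasses import dataclass


@dataclass
class ProcessScores:
    safety: float             # e.g., fraction of safety constraints respected
    efficiency: float         # e.g., resource use relative to an expert plan
    factual_integrity: float  # e.g., share of claims consistent with telemetry
    alignment: float          # e.g., adherence to mission-protocol intent


def esi(scores: ProcessScores,
        weights: tuple[float, float, float, float] = (0.4, 0.2, 0.2, 0.2)) -> float:
    """Hypothetical Environmental Scenario Indicator: a weighted mean of the
    four process-level sub-scores, normalized by the total weight."""
    w_s, w_e, w_f, w_a = weights
    total = w_s + w_e + w_f + w_a
    return (w_s * scores.safety
            + w_e * scores.efficiency
            + w_f * scores.factual_integrity
            + w_a * scores.alignment) / total


if __name__ == "__main__":
    # Under this scheme, a model that plans efficiently but violates safety
    # constraints still scores poorly, which is the intended behavior of a
    # process-level metric as opposed to outcome-only accuracy.
    print(f"{esi(ProcessScores(0.3, 0.9, 0.8, 0.7)):.3f}")
```

Weighting safety most heavily is one design choice among several; a benchmark could equally report the four sub-scores separately, as outcome accuracy alone would mask exactly the safety and alignment deficiencies the abstract describes.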
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 14966