Abstract: Hallucinations pose critical risks in large language model (LLM)-based agents. When outputs are inconsistent with the contextual or environmental reality, they manifest incorrect or harmful actions. While recent study have exposed such failures, existing evaluations remain fragmented and lack a principled testbed. In this paper, we present the first unified benchmarking framework for eliciting and evaluating hallucinations in interactive LLM-agent scenarios. We begin by introducing a three-part taxonomy to address agentic hallucinations: actions that are unfaithful to (i) task instructions, (ii) execution history, or (iii) environment observations. To analyze, we first elicit such failures by performing a systematic audit of existing agent benchmarks, then synthesize test cases using a snapshot strategy that isolates decision points in deterministic and reproducible manners. To evaluate hallucination behaviors, we adopt the LLM-as-a-Judge paradigm with tailored risk-aware prompts, enabling scalable, high-fidelity assessment of agent actions without enumerating full action spaces. Our framework provides actionable insights on failure modes of LLM agents and lays the groundwork for principled progress in mitigating hallucinations in interactive environments.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Hallucination, LLM agents, Benchmark
Languages Studied: English
Submission Number: 7557
Loading