Agent-as-a-Judge: Multi-Turn Rubric-Guided Hallucination Detection for Embodied Agents

08 Dec 2025 (modified: 08 Dec 2025) | NeurIPS 2025 Workshop FMEA Submission | CC BY 4.0
Keywords: Hallucination detection, LLM-as-a-judge, rubrics
TL;DR: We use agents as orchestrators, planners, and judges to improve embodied agent navigation
Abstract: Symbolic interfaces for embodied agents allow large language models (LLMs) to plan in terms of goals, subgoals, actions, and transition rules. However, even with complete and accurate environment context, LLM planners frequently hallucinate entities or effects that are incompatible with the environment, undermining reliability. We present an inference-time agent-as-judge interface based on a rubric-guided planner–judge architecture for the Transition Modeling (TM) module of the Embodied Agent Interface (EAI) benchmark [8]. A planner agent proposes multiple candidate transition rules, while a distinct judge agent assigns structured rubric scores that explicitly capture hallucination-related criteria and environment consistency. Our orchestrator uses these rubric scores to filter hallucinated candidates and select a final output, without any additional model training. Instantiated for VIRTUALHOME Transition Modeling [10] and evaluated on the local EAI VIRTUALHOME validation split, a 4-sample planner–judge variant improves TM accuracy and reduces obvious hallucinations relative to a single-sample planner baseline.
Submission Number: 15
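
Illustrative sketch (not the authors' implementation): the abstract describes a loop in which a planner samples several candidate transition rules, a judge scores each against a rubric, and an orchestrator filters and selects. The Python below is one minimal way such a loop could look. The rubric keys, the `min_score` threshold, the fallback to unfiltered candidates, and the `planner`/`judge` callables are all assumptions made for illustration; only the sample count of 4 comes from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical rubric criteria. The abstract mentions only "hallucination-related
# criteria and environment consistency"; these key names are assumptions.
RUBRIC_KEYS = ("no_hallucinated_entities", "environment_consistency")


@dataclass
class ScoredCandidate:
    """A candidate transition rule together with its per-criterion rubric scores."""
    candidate: str
    scores: Dict[str, float]

    @property
    def total(self) -> float:
        # Unweighted sum of rubric scores (an assumed aggregation rule).
        return sum(self.scores.values())


def select_transition_rule(
    task: str,
    planner: Callable[[str], str],
    judge: Callable[[str, str], Dict[str, float]],
    n_samples: int = 4,
    min_score: float = 0.5,
) -> Optional[str]:
    """Sample candidates, score them, drop low-scoring ones, return the best.

    `planner` and `judge` stand in for LLM calls; no model training is involved.
    """
    # 1. Planner proposes several candidate transition rules.
    candidates: List[str] = [planner(task) for _ in range(n_samples)]

    # 2. Judge assigns structured rubric scores to each candidate.
    scored = [ScoredCandidate(c, judge(task, c)) for c in candidates]

    # 3. Orchestrator filters candidates failing any rubric criterion.
    kept = [
        s for s in scored
        if all(s.scores.get(k, 0.0) >= min_score for k in RUBRIC_KEYS)
    ]

    # Assumed fallback: if every candidate is filtered out, rank them all.
    pool = kept or scored
    if not pool:
        return None

    # 4. Select the highest-scoring surviving candidate.
    return max(pool, key=lambda s: s.total).candidate
```

In this sketch the filter is a hard threshold on every criterion, so one hallucinated entity is enough to reject a candidate regardless of its other scores; a weighted sum would be an equally plausible reading of the abstract.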