Track: Main Papers Track (6 to 9 pages)
Keywords: Hallucination, LLM, Subgoals, procedural tasks
Abstract: Evaluating hallucinations in Large Language Models (LLMs) remains a challenge due to the stochastic nature of holistic 'LLM-as-a-judge' metrics. We propose HalluStep, a method that decomposes user queries into latent subgoals to provide a transparent, state-based assessment of model responses. By using a probabilistic filtering mechanism and maximum-weight bipartite matching, HalluStep anchors evaluation in structural milestones rather than mere semantic similarity. Our experiments in the Blocksworld and Logistics domains demonstrate that while G-Eval exhibits significant volatility, depending heavily on the specific judge model used (e.g., a 16.28-point drop for Gemini Pro across judges), HalluStep remains stable, with a marginal deviation of only 1.16 points for the same model.
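
To illustrate the matching step the abstract describes, here is a minimal sketch of scoring a response by maximum-weight bipartite matching between latent subgoals and response steps, with a threshold acting as the probabilistic filter. The weight matrix, the 0.5 threshold, and the `match_subgoals` helper are illustrative assumptions, not HalluStep's actual implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_subgoals(weights: np.ndarray, threshold: float = 0.5) -> float:
    """Score a response via maximum-weight bipartite matching.

    weights[i, j] is an assumed probability that response step j
    satisfies latent subgoal i. Returns the fraction of subgoals
    whose matched step clears the threshold.
    """
    # Maximum-weight assignment of subgoals (rows) to steps (columns).
    rows, cols = linear_sum_assignment(weights, maximize=True)
    # Probabilistic filtering: discard matches below the threshold.
    kept = weights[rows, cols] >= threshold
    return kept.sum() / weights.shape[0]


# Example: three subgoals, four candidate response steps.
weights = np.array([
    [0.9, 0.1, 0.2, 0.0],
    [0.2, 0.8, 0.1, 0.1],
    [0.1, 0.1, 0.3, 0.2],  # no step satisfies this subgoal well
])
print(match_subgoals(weights))  # 2 of 3 subgoals matched -> ~0.67
```

Because the assignment is computed over structural subgoals rather than whole-response similarity, an unmatched subgoal directly lowers the score, which is what makes the assessment state-based and transparent.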
Submission Number: 59