Track: Main Papers Track (6 to 9 pages)
Keywords: Hallucination, LLM, Subgoals, procedural tasks
Abstract: Evaluating hallucinations in Large Language Models (LLMs) remains a challenge due to the stochastic nature of holistic 'LLM-as-a-judge' metrics. We propose HalluStep, a method that decomposes user queries into latent subgoals to provide a transparent, state-based assessment of model responses. By using a probabilistic filtering mechanism and maximum-weight bipartite matching, HalluStep anchors evaluation in structural milestones rather than mere semantic similarity. Our experiments in the Blocksworld and Logistics domains demonstrate that while G-Eval exhibits significant volatility, depending heavily on the specific judge model used (e.g., a 16.28-point drop for Gemini Pro across judges), HalluStep remains stable, with a marginal deviation of only 1.16 points for the same model.
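
To illustrate the matching step the abstract describes, here is a minimal sketch of scoring a response by maximum-weight bipartite matching between latent subgoals and response steps, with a threshold acting as the probabilistic filter. The weight matrix, the 0.5 threshold, and the `match_subgoals` helper are illustrative assumptions, not HalluStep's actual implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_subgoals(weights: np.ndarray, threshold: float = 0.5) -> float:
    """Score a response via maximum-weight bipartite matching.

    weights[i, j] is an assumed probability that response step j
    satisfies latent subgoal i. Returns the fraction of subgoals
    whose matched step clears the threshold.
    """
    # Maximum-weight assignment of subgoals (rows) to steps (columns).
    rows, cols = linear_sum_assignment(weights, maximize=True)
    # Probabilistic filtering: discard matches below the threshold.
    kept = weights[rows, cols] >= threshold
    return kept.sum() / weights.shape[0]


# Example: three subgoals, four candidate response steps.
weights = np.array([
    [0.9, 0.1, 0.2, 0.0],
    [0.2, 0.8, 0.1, 0.1],
    [0.1, 0.1, 0.3, 0.2],  # no step satisfies this subgoal well
])
print(match_subgoals(weights))  # 2 of 3 subgoals matched -> ~0.67
```

Because the assignment is computed over structural subgoals rather than whole-response similarity, an unmatched subgoal directly lowers the score, which is what makes the assessment state-based and transparent.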
Submission Number: 59