Keywords: LLM, evaluation, hallucination, query, decomposition, NLP, Machine Learning, Deep Learning, Benchmark
Abstract: Current evaluation of large language model (LLM) generation mostly focuses on instruction following, which misses a critical aspect: a response that follows the instruction is not guaranteed to be factually accurate. This phenomenon, in which a response follows the instruction but is factually wrong, is what we call Intent Hallucination, and it remains under-explored in current LLM evaluation. To this end, we introduce FAITHQA, a novel benchmark for intent hallucination that contains 18,068 problems, covering both query-only and retrieval-augmented generation (RAG) setups with varying topics and difficulty. Further, we propose that intent hallucination in LLMs can manifest at two levels of granularity: minor fabrication, where the response introduces factually incorrect information at the sentence level, or major fabrication, where the response is entirely factually inaccurate or fabricated at the paragraph level. We further evaluate various state-of-the-art LLMs on the proposed FAITHQA benchmark. Our analysis of the results demonstrates that models exhibit varying degrees of omission and misinterpretation, which lead to intent hallucination. To facilitate future research, we further introduce an automatic LLM evaluation method, INTENT DECOMPOSE, that (1) breaks the query into constraints, each assigned a different importance label, and (2) calculates an importance-weighted score based on how well the response addresses the constraints. Our analysis shows that INTENT DECOMPOSE significantly outperforms the baseline.
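The two-step scoring idea described in the abstract (decompose the query into importance-labeled constraints, then compute an importance-weighted score) can be illustrated with a minimal sketch. The abstract does not give the label set, the weights, or how per-constraint satisfaction is judged, so the names `Constraint`, `IMPORTANCE_WEIGHTS`, and `importance_weighted_score`, the three labels, and the numeric weights below are all hypothetical choices made for illustration, not the authors' method. The sketch assumes each constraint already carries a satisfaction value in [0, 1] and uses a normalized weighted average.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical labels and weights: the abstract only says each constraint
# gets an importance label, not which labels or weights are used.
IMPORTANCE_WEIGHTS = {"high": 3.0, "medium": 2.0, "low": 1.0}


@dataclass
class Constraint:
    text: str            # one requirement extracted from the query
    importance: str      # a key of IMPORTANCE_WEIGHTS (assumed label set)
    satisfaction: float  # assumed judgment in [0, 1] of how well the response meets it


def importance_weighted_score(constraints: List[Constraint]) -> float:
    """Return the weighted average of satisfaction values, in [0, 1]."""
    if not constraints:
        raise ValueError("at least one constraint is required")
    weighted_sum = 0.0
    total_weight = 0.0
    for c in constraints:
        if not 0.0 <= c.satisfaction <= 1.0:
            raise ValueError(f"satisfaction out of range for: {c.text!r}")
        weight = IMPORTANCE_WEIGHTS[c.importance]
        weighted_sum += weight * c.satisfaction
        total_weight += weight
    return weighted_sum / total_weight


# Illustrative query: "List three poems by a named author, with years."
example = [
    Constraint("poems are by the named author", "high", 1.0),
    Constraint("exactly three poems are listed", "medium", 0.5),
    Constraint("each poem has a publication year", "low", 0.0),
]
score = importance_weighted_score(example)
```

Under these assumed weights, a response that fully meets the high-importance constraint but misses the low-importance one still scores well above one that does the reverse, which is the behavior an importance-weighted metric is meant to capture.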
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13113