Instruction Following is not all you need: Rethinking LLM Generation's Evaluation

28 Sept 2024 (modified: 25 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0
Keywords: LLM, evaluation, hallucination, query, decomposition, NLP, Machine Learning, Deep Learning, Benchmark
Abstract: Current evaluation of large language model (LLM) generation focuses mostly on instruction following, which misses a critical aspect: a response that follows the instruction is not guaranteed to be factually accurate. We call this phenomenon, in which a response follows the instruction but is factually wrong, Intent Hallucination; it remains under-explored in current LLM evaluation. To this end, we introduce FAITHQA, a novel benchmark for intent hallucination that contains 18,068 problems, covering both query-only and retrieval-augmented generation (RAG) setups with varying topics and difficulty. Further, we propose that LLM intent hallucination can manifest at two levels of granularity: minor fabrication, where the response introduces factually incorrect information at the sentence level, and major fabrication, where the response is entirely factually inaccurate or fabricated at the paragraph level. We evaluate various state-of-the-art LLMs on the proposed FAITHQA benchmark. Our analysis of the results demonstrates that models exhibit varying degrees of omission and misinterpretation, which lead to intent hallucination. To facilitate future research, we further introduce an automatic LLM evaluation method, INTENT DECOMPOSE, that (1) breaks the query into constraints, each assigned a different importance label, and (2) calculates an importance-weighted score based on how well the response addresses the constraints. Our analysis shows that INTENT DECOMPOSE significantly outperforms the baseline.
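The two steps of INTENT DECOMPOSE described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the label names, the numeric weights, the `Constraint` class, and the `weighted_score` function are all hypothetical, and the per-constraint satisfaction values are assumed to come from some external judge that the abstract does not specify.

```python
from dataclasses import dataclass

# Hypothetical importance labels and weights; the abstract does not
# state which labels or numeric values the method actually uses.
IMPORTANCE_WEIGHTS = {"high": 3.0, "medium": 2.0, "low": 1.0}


@dataclass
class Constraint:
    text: str          # one constraint extracted from the query
    importance: str    # a key of IMPORTANCE_WEIGHTS
    satisfied: float   # assumed judge output in [0, 1]


def weighted_score(constraints):
    """Importance-weighted average of how well each constraint is met."""
    total = sum(IMPORTANCE_WEIGHTS[c.importance] for c in constraints)
    if total == 0:
        return 0.0  # no constraints, nothing to score
    met = sum(IMPORTANCE_WEIGHTS[c.importance] * c.satisfied
              for c in constraints)
    return met / total


# Illustrative query decomposition (made up for this sketch).
example = [
    Constraint("answer is about the requested topic", "high", 1.0),
    Constraint("answer cites the retrieved passage", "medium", 0.0),
    Constraint("answer stays under the length limit", "low", 1.0),
]
score = weighted_score(example)
```

Under these assumed weights, a response that misses only the medium-importance constraint is penalized less than one that misses the high-importance constraint, which is the behavior an importance-weighted score is meant to capture.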
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13113