Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking Using Knowledge Graphs

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Large Language Model, Attributed Question Answering, Knowledge Graph
Abstract: Attribution in question answering (QA), i.e., providing evidence that supports a generated answer, has attracted wide research attention. Current methods for automatically evaluating attribution, which typically rely on Large Language Models (LLMs), remain inadequate, particularly at recognizing subtle differences between attributions and at measuring complex attribution reasoning. Existing benchmarks, which are primarily based on manual annotations, offer limited evaluation settings with incomplete and coarse attribution categories and reasoning scenarios, hindering the evaluation and advancement of attribution evaluators. To address this gap, we introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark automatically generated using Knowledge Graphs (KGs), containing more comprehensive attribution categories and complex attribution reasoning scenarios. Our experiments with two specifically developed evaluators and nine LLM evaluators reveal that they struggle to identify negative attribution categories and to handle complex attribution reasoning in both zero-shot and few-shot settings, though they mostly perform relatively well after fine-tuning. Moreover, all evaluators perform inadequately in fine-grained attribution identification scenarios. The experiments also demonstrate that CAQA is consistent with human annotations and is promising for selecting and developing more effective attribution evaluators in QA.
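The abstract describes the KG-based generation pipeline and the evaluator scoring only at a high level. As a rough illustration of the general idea (not the authors' actual method), the following minimal Python sketch shows how a KG triple might be turned into a positive or a negative attribution test case, and how an evaluator's categorical labels could be scored against the gold labels. All names, the label set, and the corruption strategy here are hypothetical assumptions for illustration.

# Hypothetical sketch, NOT the CAQA pipeline: build QA-attribution test
# cases from (subject, relation, object) triples and score an evaluator.

from dataclasses import dataclass

@dataclass
class AttributionExample:
    question: str
    answer: str
    citation: str    # evidence text shown to the evaluator
    gold_label: str  # assumed label set, e.g. "supportive" / "contradictory"

def example_from_triple(subj: str, rel: str, obj: str,
                        corrupt: bool = False) -> AttributionExample:
    """Turn a KG triple into a test case; corrupt=True perturbs the
    citation so it no longer supports the answer (a negative category)."""
    rel_text = rel.replace("_", " ")
    question = f"What is the {rel_text} of {subj}?"
    if corrupt:
        citation = f"{subj} {rel_text} an unrelated entity."
        return AttributionExample(question, obj, citation, "contradictory")
    citation = f"{subj} {rel_text} {obj}."
    return AttributionExample(question, obj, citation, "supportive")

def accuracy(predictions: list[str],
             examples: list[AttributionExample]) -> float:
    """Fraction of evaluator labels matching the gold attribution category."""
    correct = sum(p == e.gold_label for p, e in zip(predictions, examples))
    return correct / len(examples)

if __name__ == "__main__":
    examples = [
        example_from_triple("Marie Curie", "birth_place", "Warsaw"),
        example_from_triple("Marie Curie", "birth_place", "Warsaw",
                            corrupt=True),
    ]
    predictions = ["supportive", "contradictory"]  # stand-in LLM outputs
    print(f"Evaluator accuracy: {accuracy(predictions, examples):.2f}")

In this toy setup, corrupting the citation is one way to synthesize negative attribution categories automatically; the actual benchmark presumably uses richer KG reasoning scenarios than a single-triple template.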
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6937