Assessing the Knowledge-intensive Reasoning Capability of Large Language Models with Realistic Benchmarks Generated Programmatically at Scale
Keywords: Large Language Models, Evaluation, Reasoning, Hallucination
Abstract: Although LLMs demonstrate strong reasoning capabilities in tasks such as mathematical problem solving, less is known about their reasoning in settings that require extensive real-world knowledge, owing to the limited scale and knowledge coverage of existing benchmarks. To shed light on this, we propose a novel pipeline that programmatically generates realistic knowledge-intensive question-answering benchmarks requiring complex reasoning. Leveraging open knowledge graphs, the graph query language SPARQL, and LLMs, our pipeline requires no manual annotation and can therefore scale to unprecedented benchmark size and knowledge coverage. We evaluate several state-of-the-art LLMs on benchmarks generated by our pipeline and find that they struggle to recall and leverage world knowledge for reasoning, even for knowledge present in their pre-training corpora. Additionally, retrieval-augmented generation and chain-of-thought prompting do not fully resolve these problems. Our benchmarks further enable us to examine to what extent the confidence of LLMs in the outcomes of their reasoning transparently reflects their confidence in the underlying knowledge, a study that is, to the best of our knowledge, the first of its kind. We find that LLMs' confidence in the outcomes of their reasoning poorly reflects their confidence in the underlying knowledge, which suggests a direction for future improvement.
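To make the pipeline idea concrete, here is a minimal sketch (not the authors' actual implementation) of how a SPARQL query over an open knowledge graph such as Wikidata can be paired with a natural-language question template to yield a knowledge-intensive QA item programmatically; the specific properties (P19: place of birth, P17: country) and templates are illustrative choices, not taken from the paper:

```python
# Illustrative sketch: templating a Wikidata-style SPARQL query into a
# knowledge-intensive QA item. Executing the query against the knowledge
# graph would yield the gold answer, so no manual annotation is needed.
# The property IDs and templates below are hypothetical examples.

SPARQL_TEMPLATE = """
SELECT ?country WHERE {{
  ?person rdfs:label "{person}"@en .
  ?person wdt:P19 ?birthplace .    # P19: place of birth
  ?birthplace wdt:P17 ?country .   # P17: country
}}
"""

QUESTION_TEMPLATE = "In which country was {person} born?"

def make_qa_item(person: str) -> dict:
    """Pair a natural-language question with the SPARQL query whose
    result (when run against the knowledge graph) is its gold answer."""
    return {
        "question": QUESTION_TEMPLATE.format(person=person),
        "sparql": SPARQL_TEMPLATE.format(person=person),
    }

item = make_qa_item("Marie Curie")
print(item["question"])  # In which country was Marie Curie born?
```

Because both the question and its gold answer derive mechanically from the same graph pattern, such templates can be instantiated over millions of entities, which is what allows benchmark size and knowledge coverage to scale without human labeling.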
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10487