Assessing the Knowledge-intensive Reasoning Capability of Large Language Models with Realistic Benchmarks Generated Programmatically at Scale
Keywords: Large Language Models, Evaluation, Reasoning, Hallucination
Abstract: Although LLMs demonstrate strong reasoning capabilities in tasks such as mathematical problem solving, less is known about their reasoning in settings that require extensive real-world knowledge, owing to the limited scale and knowledge coverage of existing benchmarks. To shed light on this, we propose a novel pipeline that programmatically generates realistic knowledge-intensive question-answering benchmarks requiring complex reasoning. Leveraging open knowledge graphs, the graph query language SPARQL, and LLMs, our pipeline requires no manual annotation and can therefore scale to unprecedented benchmark size and knowledge coverage. We evaluate several state-of-the-art LLMs on benchmarks generated by our pipeline and find that they struggle to recall and leverage world knowledge for reasoning, even for knowledge present in their pre-training corpora. Additionally, retrieval-augmented generation and chain-of-thought prompting do not fully resolve these problems. Our benchmarks further enable us to examine to what extent the confidence of LLMs in the outcomes of their reasoning transparently reflects their confidence in the underlying knowledge, a study that is, to the best of our knowledge, the first of its kind. We find that LLMs' confidence in the outcomes of their reasoning poorly reflects their confidence in the underlying knowledge, which suggests a direction for future improvement.
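To make the pipeline idea concrete, here is a minimal sketch (not the authors' actual implementation) of how a SPARQL query over an open knowledge graph such as Wikidata can be paired with a natural-language question template to yield a knowledge-intensive QA item programmatically; the specific properties (P19: place of birth, P17: country) and templates are illustrative choices, not taken from the paper:

```python
# Illustrative sketch: templating a Wikidata-style SPARQL query into a
# knowledge-intensive QA item. Executing the query against the knowledge
# graph would yield the gold answer, so no manual annotation is needed.
# The property IDs and templates below are hypothetical examples.

SPARQL_TEMPLATE = """
SELECT ?country WHERE {{
  ?person rdfs:label "{person}"@en .
  ?person wdt:P19 ?birthplace .    # P19: place of birth
  ?birthplace wdt:P17 ?country .   # P17: country
}}
"""

QUESTION_TEMPLATE = "In which country was {person} born?"

def make_qa_item(person: str) -> dict:
    """Pair a natural-language question with the SPARQL query whose
    result (when run against the knowledge graph) is its gold answer."""
    return {
        "question": QUESTION_TEMPLATE.format(person=person),
        "sparql": SPARQL_TEMPLATE.format(person=person),
    }

item = make_qa_item("Marie Curie")
print(item["question"])  # In which country was Marie Curie born?
```

Because both the question and its gold answer derive mechanically from the same graph pattern, such templates can be instantiated over millions of entities, which is what allows benchmark size and knowledge coverage to scale without human labeling.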
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10487