ChemPaperBench: A Multi-Domain Benchmark for Literature-Grounded Chemical Reasoning of LLM-Based Multi-Agent Systems
Abstract: Existing benchmarks for chemistry-oriented LLMs and multi-agent systems often test foundational knowledge but fail to assess the complex, multi-step reasoning and literature analysis capabilities essential for cutting-edge research. The most advanced systems must be able to extract up-to-date information from scientific publications, perform domain-specific calculations, generate predictions, and integrate these outputs through multi-step reasoning. To enable systematic evaluation of such capabilities, we present ChemPaperBench: a benchmark that integrates expert-validated synthetic tasks grounded in real chemical literature. Covering a broad range of sub-disciplines and varying levels of difficulty, this benchmark uniquely assesses a system's ability to search, extract, and reason over scientific sources. We use ChemPaperBench to compare frontier LLMs with a newly created retrieval-augmented multi-agent architecture, highlighting current strengths and limitations of both paradigms. Beyond benchmarking, ChemPaperBench contributes to the vision of AI-ready data for scientific discovery, offering a reusable and extensible framework for interdisciplinary evaluation at the intersection of chemistry and artificial intelligence. Code is available at https://github.com/ITMO-NSS-team/chempaperbench. Data is available at https://huggingface.co/datasets/ITMO-NSS/ChemPaperBench.