BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science

Xinna Lin; Siqi Ma; Junjie Shan; Xiaojing Zhang; Shell Xu Hu; Tiannan Guo; Stan Z. Li; Kaicheng Yu

26 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: benchmark, biomedical agent, knowledge graph, literature

TL;DR: We introduce a benchmark that evaluates biomedical agents on checking knowledge hallucinations in knowledge graphs (KGs) by cross-verifying them against the literature.

Abstract: Artificial intelligence for biomedical science, a.k.a. the AI Scientist, is drawing increasing attention, and one common approach is to build a copilot agent driven by Large Language Models (LLMs). To evaluate such systems, however, researchers typically rely either on direct Question-Answering (QA) posed to the LLM itself or on biomedical experiments. How to benchmark biomedical agents precisely from an AI Scientist perspective remains largely unexplored. To this end, we draw inspiration from scientists’ crucial ability to understand the literature and introduce BioKGBench. In contrast to traditional evaluation benchmarks that focus solely on factual QA, where LLMs are known to hallucinate, we first disentangle “Understanding Literature” into two atomic abilities: i) “Understanding” the unstructured text of research papers by performing scientific claim verification, and ii) interacting with structured knowledge graphs through Knowledge Graph Question-Answering (KGQA) as a form of “Literature” grounding. We then formulate a novel agent task, dubbed KGCheck, which uses KGQA and domain-based Retrieval-Augmented Generation (RAG) to identify factual errors in existing large-scale knowledge graphs. We collect over two thousand data points for the two atomic tasks and 225 high-quality annotated samples for the agent task. Surprisingly, we find that state-of-the-art general and biomedical agents either fail or perform poorly on our benchmark. We then introduce a simple yet effective baseline, dubbed BKGAgent. On a widely used knowledge graph, we discover over 90 factual errors, which provide scenarios for agents to make discoveries and demonstrate the effectiveness of our approach.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 7569
