Keywords: Large Language Models, Benchmark, Life Sciences, Graph
Abstract: Traditional evaluation benchmarks reduce the inherently interconnected scientific knowledge of the life sciences to flat lists of questions, disregarding the underlying topological structure of that knowledge. We introduce LG-Bench, the first graph-structured benchmark for life sciences, featuring over 10,000 high-quality multiple-choice questions across medicine, biology, and chemistry. Our approach constructs a weighted evaluation graph using bidirectional matching and semantic similarity algorithms, where nodes represent questions and edge weights capture their semantic relationships. Leveraging this graph topology, we design two novel evaluation metrics. The Global Coherence Score (GCS) measures a model’s consistency within semantically related neighborhoods, while the Knowledge Balance Score (KBS) analyzes how model errors are distributed across the graph to reveal conceptual blind spots. LG-Bench facilitates fine-grained comparison of LLMs by surfacing differences in conceptual coherence and patterns of knowledge organization across models. Our framework shifts the evaluation paradigm from flat accuracy metrics to structure-aware analysis, offering a new lens for diagnosing and improving LLM performance in the life sciences domain.
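To make the abstract's pipeline concrete, here is a minimal Python sketch of one plausible instantiation. Everything in it is an assumption rather than the paper's actual method: "bidirectional matching" is read as mutual k-nearest neighbors over cosine similarity of precomputed question embeddings, and the GCS/KBS definitions below (`build_graph`, `gcs`, `kbs`, and the parameter `k` are all illustrative names) are one reasonable way to realize the metrics as described.

```python
# Hypothetical sketch of LG-Bench-style graph construction and metrics.
# All function names, the mutual-kNN reading of "bidirectional matching",
# and the exact GCS/KBS formulas are assumptions, not the paper's code.
import numpy as np

def build_graph(emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Weighted adjacency via mutual k-nearest neighbors on cosine similarity.

    emb: (n, d) array of question embeddings (assumed precomputed).
    W[i, j] > 0 only when i and j appear in each other's top-k lists,
    i.e. the match is bidirectional.
    """
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)            # exclude self-edges
    n = emb.shape[0]
    topk = np.argsort(sim, axis=1)[:, -k:]    # each node's k nearest neighbors
    in_topk = np.zeros((n, n), dtype=bool)
    in_topk[np.repeat(np.arange(n), k), topk.ravel()] = True
    mutual = in_topk & in_topk.T              # keep only bidirectional matches
    return np.where(mutual, sim, 0.0)

def gcs(W: np.ndarray, correct: np.ndarray) -> float:
    """Global Coherence Score: weight-averaged agreement of a model's
    correctness across connected question pairs (one plausible reading
    of "consistency within semantically related neighborhoods")."""
    agree = (correct[:, None] == correct[None, :]).astype(float)
    return float((W * agree).sum() / W.sum())

def kbs(W: np.ndarray, correct: np.ndarray) -> float:
    """Knowledge Balance Score: normalized entropy of how error mass
    spreads over neighborhoods; near 1.0 means errors are evenly spread,
    near 0 means they concentrate in a few regions (blind spots).
    Again an illustrative choice, not the paper's definition."""
    err = (~correct).astype(float)
    mass = W @ err                  # error mass reaching each neighborhood
    if mass.sum() == 0:             # no errors: trivially balanced
        return 1.0
    p = mass / mass.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(correct)))

# Usage: embeddings of n questions plus a boolean per-question result vector.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))
correct = rng.random(200) > 0.3
W = build_graph(emb, k=5)
print(f"GCS = {gcs(W, correct):.3f}, KBS = {kbs(W, correct):.3f}")
```

Under this reading, GCS rewards models whose right and wrong answers cluster coherently along strong edges, while KBS distinguishes a model whose errors are scattered noise from one with a concentrated conceptual gap, matching the diagnostic framing in the abstract.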
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 267