SciGuide: Evaluating Literature Understanding, Inductive Reasoning, and Knowledge Utilization in Scientific Research
Keywords: Benchmarking, AI4Research, Evidence-Based Medicine, Inductive Reasoning, Knowledge Utilization, Large Language Models
Abstract: We introduce SciGuide, a benchmark for large language models (LLMs) in scientific research scenarios, designed to evaluate model performance in evidence-based clinical guideline development. Compared with existing benchmarks, SciGuide provides three key advances: (1) Scan-oriented scientific literature understanding. We introduce two novel tasks, PICO extraction and quality appraisal, that lack explicit retrieval targets and therefore require comprehensive document scanning: models must capture detailed PICO elements (17.03 PICOs and 451.75 factors per study on average) and methodological features (12.39 per study on average); (2) Inductive reasoning under uncertainty. Grounded in the GRADE framework, models must synthesize multiple studies (3.04 on average, up to 13) under varying or conflicting evidence quality; (3) Priors-driven knowledge utilization. Models must rely on prior knowledge to complete expert-level scientific research tasks (7 task settings). We further conduct quantitative experiments to analyze the impact of prior knowledge and reasoning ability. We evaluate 18 LLMs; the best-performing model achieves a score of only 37.64. We expect SciGuide to facilitate the application and improvement of LLMs in real-world scientific research. Data and code are available: https://anonymous.4open.science/r/SciGuide-628/
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, NLP datasets, evaluation, metrics
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 3261