SciGuide: Evaluating Literature Understanding, Inductive Reasoning, and Knowledge Utilization in Scientific Research
Keywords: Benchmarking, AI4Research, Evidence-Based Medicine, Inductive Reasoning, Knowledge Utilization, Large Language Models
Abstract: We introduce SciGuide, a benchmark for large language models (LLMs) in scientific research scenarios, designed to evaluate model performance in evidence-based clinical guideline development. Compared with existing benchmarks, SciGuide provides three key advances: (1) Scan-oriented scientific literature understanding. We introduce two novel tasks, PICO extraction and quality appraisal, that lack explicit retrieval targets and therefore require comprehensive document scanning: models must capture detailed PICO elements (17.03 PICOs and 451.75 factors per study on average) and methodological features (12.39 per study on average); (2) Inductive reasoning under uncertainty. Grounded in the GRADE framework, models must synthesize multiple studies (3.04 on average, up to 13) under varying or conflicting evidence quality; (3) Priors-driven knowledge utilization. Models must rely on prior knowledge to complete expert-level scientific research tasks (7 task settings). We further conduct quantitative experiments to analyze the impact of prior knowledge and reasoning ability. We evaluate 18 LLMs; the best-performing model achieves a score of only 37.64. We expect SciGuide to facilitate the application and improvement of LLMs in real-world scientific research. Data and code are available: https://anonymous.4open.science/r/SciGuide-628/
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, NLP datasets, evaluation, metrics
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 3261