Keywords: Benchmark, Dataset, Science, Policy, LLM
TL;DR: The first benchmark and training dataset for evaluating and fine-tuning large language models (LLMs) on policy brief generation from a scientific paper.
Abstract: We propose Sci2Pol-Bench and Sci2Pol-Corpus, the first benchmark and training dataset for evaluating and fine-tuning large language models (LLMs) on policy brief generation from a scientific paper.
We build Sci2Pol-Bench on a five-stage taxonomy to mirror the human writing process:
(i) Autocompletion, (ii) Understanding, (iii) Summarization, (iv) Generation, and (v) Verification.
It features 18 tasks in multiple-choice and open-ended formats.
For the Generation stage in particular, we show that BERTScore and ROUGE fail to capture the quality of brief writing, and we introduce a new LLM-based evaluation metric aligned with expert judgement.
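As a rough sketch only (this is not the metric defined in the paper), the snippet below shows how a generated brief could be scored with ROUGE-L and with a generic LLM judge; the judge model, prompt, and rubric dimensions are illustrative assumptions.

```python
# Illustrative sketch: ROUGE-L vs. an LLM-judge rating of a policy brief.
# The judge model, prompt wording, and rubric are assumptions, not the
# paper's metric.
from rouge_score import rouge_scorer
from openai import OpenAI

client = OpenAI()

def rouge_l(reference_brief: str, generated_brief: str) -> float:
    # Surface-overlap baseline: ROUGE-L F-measure against a reference brief.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference_brief, generated_brief)["rougeL"].fmeasure

def llm_judge(paper_text: str, generated_brief: str) -> str:
    # Ask an LLM to rate the brief on rubric dimensions and return its rating.
    prompt = (
        "You are an expert science-policy editor. Rate the policy brief below "
        "on a 1-5 scale for accuracy, clarity, and policy relevance, then give "
        "an overall score.\n\n"
        f"Scientific paper:\n{paper_text}\n\nPolicy brief:\n{generated_brief}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```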
Using this benchmark, we evaluate 13 leading open-source and commercial LLMs to uncover key limitations.
To improve LLM performance on brief writing, we curate the Sci2Pol-Corpus for fine-tuning.
We start by linking each cited scientific paper to the policy document that cites it, drawing on 5.6 million policy records.
This produces 140,000 candidate pairs.
We then employ an LLM-as-a-judge to filter for high-quality pairs, followed by in-context polishing with three expert-written samples as references.
This process yields a final set of 639 new pairs.
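For illustration, here is a minimal sketch of the two LLM steps in this curation pipeline; the prompts and the `chat` helper are hypothetical and stand in for whatever model call is used, not the exact procedure behind Sci2Pol-Corpus.

```python
# Illustrative sketch of the corpus pipeline's two LLM steps:
# (1) LLM-as-a-judge filtering of candidate (paper, policy document) pairs,
# (2) in-context polishing of kept pairs using expert-written briefs.
# The prompts and the `chat` callable are assumptions for illustration.
from typing import Callable

def judge_pair(chat: Callable[[str], str], paper: str, policy_doc: str) -> bool:
    # Keep a pair only if the judge finds the policy document a faithful
    # brief of the paper for a policy audience.
    verdict = chat(
        "Does the policy document below accurately summarize the scientific "
        "paper for a policy audience? Answer YES or NO.\n\n"
        f"Paper:\n{paper}\n\nPolicy document:\n{policy_doc}"
    )
    return verdict.strip().upper().startswith("YES")

def polish_brief(chat: Callable[[str], str], paper: str, policy_doc: str,
                 expert_examples: list[str]) -> str:
    # Rewrite the kept policy document in the style of expert-written briefs
    # supplied as in-context references.
    references = "\n\n---\n\n".join(expert_examples)
    return chat(
        "Here are expert-written policy briefs as style references:\n\n"
        f"{references}\n\nRewrite the following policy document as a polished "
        f"policy brief of the paper.\n\nPaper:\n{paper}\n\nDraft:\n{policy_doc}"
    )
```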
Finally, we fine-tune three models on Sci2Pol-Corpus: LLaMA-3.1-8B, Gemma-12B, and Gemma-27B.
Fine-tuning leads to consistent performance improvements across Sci2Pol-Bench.
Notably, after fine-tuning, Gemma-27B surpasses the much larger GPT-4o and DeepSeek-V3 (671B).
These results demonstrate the effectiveness of our corpus in bridging the gap between science and policy.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 2230