# Sci2Pol-Bench: A Benchmark for LLM Policy Brief Generation from Scientific Research

This Repo is the official implementatation of [Sci2Pol-Bench: A Benchmark for LLM Policy Brief Generation from Scientific Research]().

## Contents

- [1. Introduction](#1-introduction)
- [2. Setup Environment](#3-setup-environment)
- [3. SFT](#4-sft)
- [4. Inference](#5-inference)
- [5. Evaluation](#6-evaluation)
- [6. Citation](#7-citation)

## 1. Introduction

<div style="display:flex; gap:12px; align-items:flex-start;">
  <img src="Figure/figure1a.png" alt="Figure 1a — Sci2Pol-Bench overview" style="width:80%; height:auto;" />
</div>

Sci2Pol-Bench and Sci2Pol-Corpus, a benchmark and a training dataset for evaluating and finetuning the ability of large language models (LLMs) to generate policy briefs from scientific research papers. 

- **Five stages of tasks**: Autocompletion, Understanding, Summarization, Generation, and Verification. 
- **Four components of policy brief**: policy problem, scientific research findings, scientific research study methods, and policy implications. 
- **Evaluation Methods**: F1-score for classification tasks, LLM-as-judge with 4-dimension scoring, BERTScore and ROUGE for text generation, and Section-specific evaluation with specialized rubrics.
- **Comprehensive Scope**: 19 tasks, evaluated over 14 leading open-source and commercial LLMs.

## 2. Setup environment
```bash
conda create -y -n Sci2Pol python=3.10
conda activate Sci2Pol
# Install LLaMA-Factory and dependencies for SFT, inference, and evaluation
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install --no-deps -e . --no-build-isolation
pip install wandb
pip install "vllm==0.8.2"
pip install -r requirements.txt
pip install bert-score rouge-score scikit-learn google-generativeai anthropic
# Authentication for HuggingFace model access
huggingface-cli login --token hf_YOUR_TOKEN_HERE
```

## 3. SFT

**Fine-Tuning** - Train models on policy brief generation using LoRA:
```bash
llamafactory-cli train SFT/config/llama3_8B.yaml
```

**Merging LoRA Adapters** - Combine trained adapters with base model:
```bash
python SFT/src/merge_sft.py \
  --model_name_or_path "meta-llama/Llama-3.1-8B-Instruct" \
  --adapter_name_or_path SFT_Results/llama3-8b/checkpoint-216 \
  --template "llama3"
```

## 4. Inference

**Download Sci2Pol-Bench dataset** from HuggingFace:
```bash
python Inference/src/download_sci2pol_db.py
```

**Method 1: Local Model Inference** - For fine-tuned models:

Initialize LLaMA-Factory API server:
```bash
API_PORT=8000 llamafactory-cli api Inference/config/llama3-8b.yaml infer_backend=vllm vllm_enforce_eager=true
```

Run inference via LLaMA-Factory:
```bash
python Inference/src/inference_LF.py --model_name_or_path SFT_Results/llama3-8b/checkpoint-216-merge [--dataset_folder] [--output_folder]
```

**Method 2: API-based Inference** - For published models:
```bash
python Inference/src/inference_api.py --model_name meta-llama/Llama-3.1-8B-Instruct --task task1 [--dataset_folder] [--output_folder]
```

## 5. Evaluation

**Comprehensive Evaluation** across all 19 tasks:
```bash
python Evaluation/src/evaluation.py [--dataset_folder] [--response_folder] [--output_folder]
```

**Single Task Evaluation** for focused testing:
```bash
python Evaluation/src/evaluation.py --task task16 [--dataset_folder] [--response_folder] [--output_folder]
```

## 6. Citation

If you use Sci2Pol-Bench, please cite our paper: 
