# README

This repository contains the code, datasets, and instructions for training and evaluation used in the work "BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs".

## Environment

```shell
pip install -r requirements.txt
```

Adjust the PyTorch and DeepSpeed versions to match your CUDA version.
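
To see which versions are currently installed before changing anything, a small check like the following may help. It is only a sketch: the helper names `installed_versions` and `torch_cuda_version` are made up for this example and are not part of the repository.

```python
from importlib import metadata


def installed_versions(packages=("torch", "deepspeed")):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def torch_cuda_version():
    """Return the CUDA version PyTorch was built against, or None.

    None means either that PyTorch is not installed or that it is a CPU-only build.
    """
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return None
    return torch.version.cuda
```

Printing `installed_versions()` and `torch_cuda_version()` and comparing the result with the CUDA version reported by your driver shows whether the pinned versions in `requirements.txt` need to change.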

## Codebase Directory

| Directory                     | Description                                                  |
| ----------------------------- | ------------------------------------------------------------ |
| `./data` | Test and training data (covering the SFT and GRPO stages) used in our work |
| `./evaluation` | Evaluation scripts for factual scores on the test sets and reasoning performance on math datasets |
| `./gen_code` | Generation scripts and code for batch generation |
| `./training_scripts` | Training scripts for our SFT and GRPO training |

## Models and Datasets

### Pre-trained Models

- **DeepSeek-R1-Distill-Llama-8B**  
  [https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)

- **DeepSeek-R1-Distill-Qwen-7B**  
  [https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)

### Datasets

1. **TriviaQA**  
   Open-domain question–answering corpus drawn from Wikipedia and the web  
   *License:* Apache 2.0  
   [https://github.com/mandarjoshi90/triviaqa](https://github.com/mandarjoshi90/triviaqa)

2. **SciQ**  
   13,679 multiple-choice science questions spanning physics, chemistry, biology, and more  
   *License:* CC BY-NC 3.0  
   [https://huggingface.co/datasets/allenai/sciq](https://huggingface.co/datasets/allenai/sciq)

3. **NQ-Open**  
   Open-domain variant of Natural Questions covering real Google queries  
   *License:* CC BY-SA 3.0  
   [https://github.com/efficientqa/nq-open](https://github.com/efficientqa/nq-open)

4. **SimpleQA**  
   Factuality benchmark of short, fact-seeking questions  
   *License:* MIT  
   [https://github.com/openai/simple-evals](https://github.com/openai/simple-evals)

5. **MATH-500**  
   500-problem subset of the MATH benchmark for compact maths evaluation  
   *License:* MIT  
   [https://huggingface.co/datasets/HuggingFaceH4/MATH-500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500)

6. **MATH**  
   Full-scale mathematics problem benchmark  
   *License:* MIT  
   [https://github.com/hendrycks/math](https://github.com/hendrycks/math)

7. **SelfAware**  
   Unanswerable questions benchmark  
   *License:* Apache 2.0  
   [https://github.com/yinzhangyue/SelfAware](https://github.com/yinzhangyue/SelfAware)

## Detailed Instructions

### Evaluation

We provide a script to support batch generation of model responses using the vLLM toolkit. Before running the script, make sure to update the `model/tokenizer` path as well as the `input/output` paths accordingly.

To run the generation:

```bash
cd gen_code
bash generate.sh
```
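
For orientation, batch generation with vLLM's offline `LLM.generate` interface can be sketched roughly as follows. The JSONL layout, the `question` field, the helper names, and the sampling values are assumptions made for illustration and do not describe the contents of `generate.sh`; chat-template formatting is omitted.

```python
import json


def load_prompts(path, field="question"):
    """Read a JSONL file and return the text stored under `field` on each line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)[field] for line in f if line.strip()]


def generate(model_path, prompts, max_tokens=2048, temperature=0.0):
    """Generate one completion per prompt; vLLM batches the requests internally."""
    # Imported lazily so the file helpers can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]


def save_responses(path, prompts, responses):
    """Write prompt/response pairs as JSONL (hypothetical output format)."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, response in zip(prompts, responses):
            record = {"prompt": prompt, "response": response}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A run would chain the three helpers: load the prompts from the input path, pass them with the model path to `generate`, and save the returned texts to the output path.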

We also provide quick evaluation scripts to compute factual scores on relevant datasets and accuracy scores on math datasets. Before running the evaluations, remember to update the `name`, `model`, `data_path`, and `output_path` fields in both `factual_eval.py` and `math_eval.py` to reflect your own generation results.

To run the evaluation:

```bash
cd evaluation
python factual_eval.py
python math_eval.py
```
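
To illustrate the kind of bookkeeping a factual evaluation involves, the sketch below sorts each response into correct, refused, or wrong and reports the proportions. It is not the scoring used in `factual_eval.py`: the substring matching, the refusal markers, and the `response`/`answers` record fields are all assumptions for this example. It scores only the text after `</think>`, since the DeepSeek-R1-Distill models emit their reasoning inside `<think>` tags.

```python
import string
from collections import Counter

# Hypothetical refusal phrases; the real evaluation may detect refusals differently.
REFUSAL_MARKERS = ("i don't know", "i do not know")


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def final_answer(response):
    """Keep only the text after the last </think> tag, if one is present."""
    return response.split("</think>")[-1]


def classify(response, gold_answers):
    """Label a response as 'refused', 'correct', or 'wrong'."""
    norm = normalize(final_answer(response))
    if any(normalize(marker) in norm for marker in REFUSAL_MARKERS):
        return "refused"
    golds = [normalize(g) for g in gold_answers]
    if any(g and g in norm for g in golds):
        return "correct"
    return "wrong"


def score(records):
    """Return the proportion of each label over a list of records."""
    counts = Counter(classify(r["response"], r["answers"]) for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in ("correct", "refused", "wrong")}
```

Substring matching is deliberately crude and will over-credit long responses that mention a gold answer in passing; it is used here only to keep the example short.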

### Training

We provide training scripts for both the SFT stage and the GRPO stage. Before running the scripts, adjust the configuration as needed (e.g., model path, data path, output directories, and hyperparameters).

#### SFT Stage

To run SFT training:

```bash
cd ./training_scripts/sft_stage
bash run_sft.sh
```

#### GRPO Stage

To run GRPO training:

```bash
cd ./training_scripts/grpo_stage
bash run_grpo.sh
```

Before running, change the default model and tokenizer paths so that they point to your SFT-trained checkpoint.
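
A quick pre-flight check can catch a wrong path before a long GRPO run starts. The sketch below assumes the SFT checkpoint was saved in the Hugging Face format, where `config.json` and `tokenizer_config.json` sit in the checkpoint directory; the helper name and the list of required files are illustrative and not part of the training scripts.

```python
from pathlib import Path

# Files expected in a Hugging Face-style checkpoint directory (assumed layout).
REQUIRED_FILES = ("config.json", "tokenizer_config.json")


def missing_checkpoint_files(checkpoint_dir, required=REQUIRED_FILES):
    """Return the names of required files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in required if not (root / name).is_file()]
```

An empty list means the directory looks like a usable checkpoint; any returned names indicate that the path is wrong or the SFT run did not save a tokenizer alongside the model.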