# IUPO

Code for the paper "Improving Reasoning Ability of Large Language Models via Iterative Uncertainty-based Preference Optimization"

## Overview
<div align=center>
<img src="resource/intro.png" width="75%" height="75%" />
</div>

Direct Preference Optimization (DPO) has recently emerged as an efficient and effective method for aligning large language models with human preferences.
However, constructing high-quality preference datasets remains challenging, often requiring expensive annotation by humans or by powerful LMs. Additionally, standard DPO exhibits suboptimal performance in complex reasoning tasks, such as mathematical and code reasoning.
In this paper, we introduce an approach to collect preference pairs through iterative sampling and execution feedback, tailored to the current learning state (e.g., well-learned, mis-learned, and unlearned) of the policy model.
To alleviate the failures of DPO and improve its applicability in reasoning tasks, we propose IUPO, an iterative uncertainty-based preference optimization method that achieves fine-grained preference control by assessing model confidence.
We validate our approach across three reasoning tasks, incorporating five established reasoning datasets and one self-curated dataset. Our experimental results demonstrate an overall improvement of 3.6% over the standard DPO method. 
Furthermore, our approach exhibits promising generalizability involving weak-to-strong (8B to 70B) and cross-model (Llama to Mistral) generalizations.

## Preference Data Construction

<div align="center">
  <img src="resource/data.png">
</div>

### SFT Datasets
- BIRD: https://bird-bench.github.io/
- APPS+: https://github.com/Ablustrund/APPS_Plus
- DartMath: https://github.com/hkust-nlp/dart-math (SFT model)

### Response Sampling

You can deploy the SFT models and naive models with vLLM. A demo script is provided at `scripts/preference_data_construction/run_vllm_serving.sh`. You can then generate N responses per question by calling the deployed LLM.
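
As a rough illustration of the sampling call, the sketch below assumes the model is served through vLLM's OpenAI-compatible server and queries its `/v1/completions` endpoint using only the Python standard library. The helper names `build_sampling_request` and `sample_responses`, the default sampling values, and the model name and URL in the usage note are placeholders for illustration, not part of this repository.

```python
import json
import urllib.request


def build_sampling_request(model, prompt, n=8, temperature=1.0, max_tokens=1024):
    """Build the JSON body for an OpenAI-style /v1/completions call.

    `n` is the number of responses sampled for the single prompt.
    The default values here are illustrative assumptions.
    """
    return {
        "model": model,
        "prompt": prompt,
        "n": n,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def sample_responses(base_url, payload, timeout=600):
    """POST the payload to the serving endpoint and return the generated texts."""
    request = urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Each entry in "choices" holds one sampled completion.
    return [choice["text"] for choice in body["choices"]]
```

With a server running locally, a call might look like `sample_responses("http://localhost:8000", build_sampling_request("your-sft-model", question, n=8))`; adjust the host, port, and model name to match your serving script.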

### Execution Feedback
Run `get_rewards.py` to get the execution feedback for the sampled responses.
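
To make the idea of execution feedback concrete, here is a minimal sketch for the code-generation case: a sampled program is run against input/output test cases and receives a binary reward. This is not the logic of `get_rewards.py`; the function name `execution_reward`, the test-case format, and the binary reward are assumptions for illustration. It runs untrusted code without sandboxing, so use it only on trusted inputs or inside an isolated environment.

```python
import subprocess
import sys


def execution_reward(program, test_cases, timeout=5.0):
    """Return 1.0 if `program` passes every (stdin, expected_stdout) case, else 0.0.

    Hypothetical helper: the program is run in a fresh Python subprocess,
    and crashes, timeouts, or wrong output all count as failures.
    """
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        if result.returncode != 0:
            return 0.0
        # Compare outputs ignoring leading/trailing whitespace.
        if result.stdout.strip() != expected.strip():
            return 0.0
    return 1.0
```

For SQL or math tasks the same pattern would apply with a different checker, for example comparing query results or final answers instead of standard output.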

### Preference Data Generation
We collect the preference data based on the learning state of the policy model; this is implemented in `main.py`.
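
The sketch below shows one way such state-dependent pairing could look. It is not the logic of `main.py`: the mapping from sample rewards to the three states (all samples pass = well-learned, none pass = unlearned, mixed = mis-learned), the choice to build pairs only for mixed questions, and the names `learning_state` and `build_pairs` are all assumptions made for illustration.

```python
from itertools import product


def learning_state(rewards):
    """Classify a question from the rewards of its N sampled responses.

    Assumed mapping (not taken from the repository):
    all pass -> "well-learned", none pass -> "unlearned", otherwise "mis-learned".
    """
    if not rewards:
        raise ValueError("rewards must contain at least one sample")
    passed = sum(1 for r in rewards if r > 0)
    if passed == len(rewards):
        return "well-learned"
    if passed == 0:
        return "unlearned"
    return "mis-learned"


def build_pairs(question, responses, rewards, max_pairs=4):
    """Pair passing responses (chosen) with failing ones (rejected).

    Only questions with mixed outcomes yield pairs in this sketch, since a
    pair needs at least one passing and one failing sample.
    """
    if learning_state(rewards) != "mis-learned":
        return []
    chosen = [r for r, s in zip(responses, rewards) if s > 0]
    rejected = [r for r, s in zip(responses, rewards) if s <= 0]
    pairs = [
        {"prompt": question, "chosen": c, "rejected": r}
        for c, r in product(chosen, rejected)
    ]
    return pairs[:max_pairs]
```

The `prompt` / `chosen` / `rejected` record layout is a common convention for DPO training data; check the training config for the field names the scripts actually expect.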


## Training

```shell
bash scripts/run_dpo.sh configs/code/llama3_8b_dpo_lora.yaml
```
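
For reference, the standard DPO objective that this training stage builds on can be written per preference pair as below. This is only the baseline DPO loss, not the uncertainty-based IUPO objective from the paper; the function name `dpo_loss` and the default `beta` are illustrative assumptions, and the inputs are summed token log-probabilities of each response under the policy and the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model.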

## Evaluation

Evaluate your models:
- BIRD: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird
- HumanEval and MBPP: https://github.com/open-compass/opencompass/tree/main/opencompass
- GSM8K and MATH: https://github.com/hkust-nlp/dart-math