## Building Math Agents with Multi-Turn Iterative Preference Learning



We consider math problem solving with a Python interpreter: the model can write Python code, ask the external environment to execute it, and receive the execution result before making its next decision.
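
The interaction described above can be sketched as a simple loop. This is a minimal illustration, not the code used in this repository: the names `run_python`, `solve`, and `toy_policy` are hypothetical, the fenced `python`/`output` message format is an assumption, and `toy_policy` is a hard-coded stand-in for the LLM.

```python
import contextlib
import io
import re
from typing import Callable

FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_python(code: str) -> str:
    """Execute code and return what it printed, or the error message.

    exec() is NOT a sandbox; a real setup should isolate untrusted code.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def solve(problem: str, policy: Callable[[str], str], max_turns: int = 4) -> str:
    """Alternate between model turns and interpreter feedback."""
    history = problem
    for _ in range(max_turns):
        action = policy(history)  # the model decides based on the full history
        history += "\n" + action
        match = CODE_RE.search(action)
        if match is None:  # no code to run: treat this turn as the final answer
            break
        observation = run_python(match.group(1))
        history += f"\n{FENCE}output\n{observation}\n{FENCE}"
    return history


def toy_policy(history: str) -> str:
    """Hypothetical stand-in for the LLM: writes code first, then answers."""
    if FENCE + "output" not in history:
        return f"{FENCE}python\nprint(sum(range(1, 11)))\n{FENCE}"
    result = history.rsplit(FENCE + "output\n", 1)[1].split("\n")[0]
    return f"The answer is {result}."


trajectory = solve("What is 1 + 2 + ... + 10?", toy_policy)
print(trajectory)
```

Each loop iteration is one turn: the model's message is appended to the history, any code it contains is executed, and the result is appended as an observation for the next turn. Replacing `toy_policy` with a call to an actual model would keep the same structure.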

## Structure

The main pipeline is divided into three steps:


- [`SFT`](./SFT/) to train the SFT model.
- [`Inference`](./inference/) to generate new data and evaluate the model.
- [`Multi-turn Alignment Algorithms`](./alignment_algorithms/) to conduct the multi-turn DPO/KTO training.


It is recommended to have three separate environments for **sft**, **inference**, and **alignment_train**. Please refer to the corresponding part of this project for detailed installation instructions.

## Collection
We have removed private information to meet the requirements of double-blind review.

- SFT Dataset
- Prompt set
- SFT Model
- Aligned Model


## Acknowledgement

The authors would like to thank the great open-source communities, including the Hugging Face TRL team, the Axolotl team, and the Tora project, for sharing their models and code.


