<h1 align="center"> 💡Toward Effective Tool-Integrated Reasoning via
Self-Evolved Preference Learning</h1>


## 😃 Overview


**Tool-Light** is a framework for training models to complete tool-integrated reasoning (TIR) tasks efficiently. It introduces the **Entropy-Guided Sampling Strategy** to construct the training set and trains the model through the **Self-Evolved DPO Pipeline**. This design lets the model gradually learn to call tools efficiently and accurately. Results on two types of reasoning tasks show superior performance compared to traditional methods.

## 😋 Quick Start for Data Construction
### 1. Environment Setup

In this step, we first run supervised fine-tuning (SFT) on the Qwen2.5-7B-Instruct model. Start by setting up the environment for [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).

### 2. Conduct SFT on Qwen2.5-7B-Instruct

1. Download the SFT dataset from Tool-Star and save it as `LLaMA-Factory-main/data/final_sft_edition9.json`. Then register the dataset in `dataset_info.json`.

2. Run the SFT training command in `LLaMA-Factory-main/examples/train_full/llama_factory.sh`.
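
Registering the SFT file amounts to adding one entry to `dataset_info.json`. The sketch below builds such an entry, assuming the file is Alpaca-style with `instruction`, `input`, and `output` fields. The dataset name `tool_light_sft` and the column names are assumptions for illustration, so adjust them to match the actual file.

```python
import json


def make_sft_entry(file_name: str) -> dict:
    """Build a dataset_info.json entry for an Alpaca-style SFT file."""
    return {
        "file_name": file_name,
        # Assumed field names; change them to match the keys in your JSON file.
        "columns": {
            "prompt": "instruction",
            "query": "input",
            "response": "output",
        },
    }


# "tool_light_sft" is a hypothetical dataset name used only in this example.
dataset_info = {"tool_light_sft": make_sft_entry("final_sft_edition9.json")}
print(json.dumps(dataset_info, indent=2))
```

Copy the printed entry into `LLaMA-Factory-main/data/dataset_info.json`, then refer to the dataset by that name in the training configuration.
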
### 3. Inference Environment Setup
First, configure the required environment.
### 4. Use the SFT Model to Select Source Data
Run the SFT model directly on `LLaMA-Factory-main/data/final_sft_edition9.json` and select the data sources to use for DPO training.
### 5. Sample Data with Two Strategies
Using the data sources selected in the previous step, sample responses with the SFT model.

Run both types of sampling: vanilla sampling with `entropy_guided_sample/vanilla_sample.py` and entropy-guided sampling with `entropy_guided_sample/entropy_guided_sample.py`.
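
To give an intuition for what "entropy-guided" can mean, here is a minimal sketch that computes the Shannon entropy of each next-token distribution and picks the most uncertain steps as candidate branching points. This is an illustration under assumed inputs (per-step probability lists), not the logic of `entropy_guided_sample.py`, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_entropy_positions(step_probs: List[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k highest-entropy steps, in positional order."""
    entropies = [token_entropy(p) for p in step_probs]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy distributions over a 4-token vocabulary, one per generation step.
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform step, maximum entropy
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
branch_points = top_entropy_positions(steps, k=2)
print(branch_points)
```

In a sampler of this kind, generation would be re-run from the selected positions to produce diverse continuations, while vanilla sampling draws complete responses independently.
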
### 6. Construct Positive-Negative Examples According to Criteria
For the Pre-Aligned DPO and Self-Evolved On-Policy DPO stages, we use different criteria to select positive and negative examples. Refer to the paper for the criteria, then construct the training set from the two types of sampled data.
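
The exact criteria are defined in the paper and are not reproduced here. As a placeholder, the sketch below uses one plausible rule: prefer a correct response with the fewest tool calls, and reject an incorrect response, or the correct response with the most tool calls if none is incorrect. The sample keys (`text`, `correct`, `tool_calls`) and the output field names are assumptions for illustration.

```python
from typing import Dict, List, Optional


def build_pair(question: str, samples: List[dict]) -> Optional[Dict[str, str]]:
    """Pick one chosen/rejected pair from the sampled responses to a question.

    Each sample is a dict with hypothetical keys:
    "text" (str), "correct" (bool), and "tool_calls" (int).
    """
    correct = [s for s in samples if s["correct"]]
    incorrect = [s for s in samples if not s["correct"]]
    if not correct:
        return None  # no usable positive example
    chosen = min(correct, key=lambda s: s["tool_calls"])
    if incorrect:
        rejected = incorrect[0]
    else:
        # All responses are correct: contrast few tool calls against many.
        rejected = max(correct, key=lambda s: s["tool_calls"])
        if rejected["tool_calls"] == chosen["tool_calls"]:
            return None  # nothing to contrast
    return {
        "instruction": question,
        "input": "",
        "chosen": chosen["text"],
        "rejected": rejected["text"],
    }


samples = [
    {"text": "A", "correct": True, "tool_calls": 3},
    {"text": "B", "correct": True, "tool_calls": 1},
    {"text": "C", "correct": False, "tool_calls": 2},
]
print(build_pair("Who wrote Hamlet?", samples))
```

Replace the selection rule with the stage-specific criteria from the paper before building the real training set.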

## 🥰 Self-Evolved DPO Training
### 1. Environment Setup
This part is the same as **Environment Setup** in **Quick Start for Data Construction**.
### 2. Conduct DPO Training
1. Define your constructed DPO dataset in `dataset_info.json`.

2. Run the DPO training command in `LLaMA-Factory-main/examples/train_full/llama_factory.sh`.
3. Enter the `Tool-Light` environment and use the DPO model to sample again from the same 4000 data sources. Then select positive and negative examples according to the criteria of the Self-Evolved On-Policy DPO Loop phase.
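
For the dataset registration in step 1, LLaMA-Factory marks preference data with `"ranking": true` and maps the chosen and rejected columns. The sketch below adds such an entry to a `dataset_info.json` file. The dataset name, file name, and field names are assumptions for illustration, and the demo writes to a temporary directory rather than the real file.

```python
import json
import tempfile
from pathlib import Path


def register_dpo_dataset(info_path: Path, name: str, file_name: str) -> dict:
    """Add a preference-dataset entry to dataset_info.json and return the mapping."""
    if info_path.exists():
        info = json.loads(info_path.read_text(encoding="utf-8"))
    else:
        info = {}
    info[name] = {
        "file_name": file_name,
        "ranking": True,  # marks the dataset as chosen/rejected preference data
        # Assumed field names; change them to match your constructed DPO file.
        "columns": {
            "prompt": "instruction",
            "query": "input",
            "chosen": "chosen",
            "rejected": "rejected",
        },
    }
    info_path.write_text(json.dumps(info, indent=2, ensure_ascii=False), encoding="utf-8")
    return info


# Demo in a temporary directory; "tool_light_dpo" is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset_info.json"
    info = register_dpo_dataset(path, "tool_light_dpo", "tool_light_dpo.json")
    print(json.dumps(info, indent=2))
```

Each round of the self-evolved loop can register its newly constructed pairs under a separate name so that earlier rounds remain reproducible.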

### 3. Evaluate the Performance of Trained Model
1. Enter the `Tool-Light` environment.
2. Deploy the retriever for performing search tasks on Wikipedia-based datasets. 
3. Deploy the judging model for LLM-as-Judge evaluation.
4. Run the evaluation code. We report the **F1 score** and the **LLM-as-Judge** metric.
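
For reference, the F1 score in question-answering evaluation is commonly computed at the token level between the predicted and reference answers. The sketch below follows that common convention (lowercasing, removing punctuation and English articles). It is an assumption that the repository's evaluation script normalizes text the same way.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and English articles, then split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(f1_score("Paris, France", "Paris"))
```

When a question has several reference answers, a common choice is to take the maximum F1 over the references.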