# Adaptive Social Learning via Mode Policy Optimization for Language Agents

<img src="CODE/src/asl.png" width="500">

## 👀 Overview
This repository contains code and data for our paper **Adaptive Thinking via Mode Policy Optimization for Social Language Agents**. In this paper, we propose the **A**daptive **M**ode **L**earning framework (**AML**) to empower social agents with the capability for adaptive thinking, enabling them to effectively respond in accordance with the dynamics of social interaction context.
Specifically, we first develop four thinking modes inspired by hierarchical cognitive control theory, covering a spectrum from intuitive response, through shallow and strategic thinking, to deep deliberation. 
Next, we perform the injection of thinking modes, which consists of behavioral cloning for learning basic modes and RL-based adaptive thinking mode enhancement.
For RL-based enhancement, we contrapuntally develop the **A**daptive **M**ode **P**olicy **O**ptimization (**AMPO**) algorithm, which incorporates the mode-level and sample-level information into advantage estimation to strengthen the context-aware thinking mode switching.
In terms of reward, we design three types of reward functions, including answer reward, format reward, and answer length reward, providing feedback for choosing the appropriate thinking mode and answer.

## Main Results
<img src="CODE/src/exp1.png" width="500">

<img src="CODE/src/exp2.png" width="500">

> Extensive experimental results show that AML and AMPO achieves the SOTA performances in comparison with strong baselines. Details can be found in the paper.

## 🔧How to use
<img src="CODE/src/alg.png" width="500">

> The full optimization procedure. We employ a two-phase training procedure: The first phase utilizes mode behavioral cloning to enable the model to understand and follow specific thinking modes accurately. In the second phase, we perform adaptive mode policy optimization to enhance the adaptive thinking mode switch and reasoning.

**Step1** Create conda environment and Install other dependencies.
1. Create BC conda environment (LLaMA Factory).
```shell
conda create --name BC python=3.11 -y
conda activate BC
cd ./CODE/BC 
pip install -e ".[torch,metrics]"
```
2. Create RL conda environment (verl).
```shell
# RL environment (verl)
conda create --name RL python=3.11 -y
conda activate RL
cd ./CODE/RL
pip3 install -e .[vllm]
pip install -r requirements.txt
```

**Step2** Preparing the Model API

1. (**Must**) Set up your OPENAI key in config/gpt_4o.yaml (Evaluation)
```shell
api_key: "Your OPENAI key"
api_url: "API URL"
```

2. (**Must**) Set up your key in config/qwen2.5_72b_instruct.yaml (Reward Model)
```shell
api_key: "Your key"
api_url: "API URL"
# We also recommend using vLLM. And we use HTTP server that implements OpenAI’s Completions and Chat API.
# Set up your vLLM settings in config/*.yaml
```
**Step3** Behavior Cloning Training
```shell
conda activate BC
cd ./CODE/BC
## (Must) Firstly set the bc_training_data_path in ./BC/data/dataset_info.yaml
sh train.sh
```

**Step4** RL Training
```shell
conda activate RL
cd ./CODE/RL
## (Must) Firstly, translate the rl training data into ".parquet" format by using the script in ./RL/example/data_preprocess/sotopia.py
sh sotopia_ampo_llama3.1_8b.sh
sh sotopia_ampo_qwen2.5_7b.sh
```

**Step5** Evaluation and Inference
```shell
conda activate RL
cd ./CODE/RL
sh infer.sh
## show result
python result.py --env sotopia --data_path your_result_path
```

