# 💡DSP: Directional-Stimulus-Prompting

**Directional-Stimulus-Prompting** is a framework that uses a tuneable language model (LM) to guide a black-box, frozen large language model (LLM) toward desirable properties. Specifically, we train a policy LM to generate discrete tokens as a *directional stimulus* for each input: a hint or cue, such as keywords of an article for summarization. The *directional stimulus* is then combined with the original input and fed into the LLM to guide its generation toward the desired target (see **Figure 1** for an example).

<p align="center">
  <img align="center" src="pics/example.png" width="600px" />
</p>
<p align="left">
  <b>Figure 1:</b> Comparison of our proposed Directional Stimulus Prompting with standard prompting of an LLM such as GPT-3 on the summarization task. DSP uses a tuneable policy LM to generate the stimulus (highlighted in orange), keywords in this case, which guides the LLM to generate the desired summary (highlighted in blue) with higher ROUGE scores or better results on other measures such as human preference.
</p>
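
As a minimal sketch of how a stimulus could be combined with the original input, the snippet below builds a summarization prompt with and without a keyword hint. The function name `build_prompt`, the prompt wording, and the `; ` separator are illustrative assumptions, not the templates used in this repository.

```python
from typing import List, Optional


def build_prompt(article: str, keywords: Optional[List[str]] = None) -> str:
    """Build a summarization prompt, optionally with a keyword hint.

    The wording is hypothetical; it only illustrates appending a
    directional stimulus to the original input.
    """
    parts = [f"Article: {article.strip()}"]
    if keywords:
        # Directional stimulus: keywords the summary should cover.
        parts.append("Keywords: " + "; ".join(keywords))
        parts.append("Q: Summarize the article in 2-3 sentences based on the keywords.")
    else:
        # Standard prompting: no hint is provided.
        parts.append("Q: Summarize the article in 2-3 sentences.")
    parts.append("A:")
    return "\n".join(parts)


if __name__ == "__main__":
    toy_article = "The city council approved a new budget on Tuesday."
    print(build_prompt(toy_article))                         # standard prompt
    print(build_prompt(toy_article, ["council", "budget"]))  # prompt with stimulus
```

In the framework, the keyword list would come from the policy LM rather than being written by hand.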

The policy LM can be trained through (1) `supervised finetuning from annotated data (SFT)` and (2) `reinforcement learning from offline and online rewards (RL)` to explore directional stimuli that better align LLMs with human preferences. The framework can be applied to various LMs and tasks. An illustration of the **DSP** framework is shown in **Figure 2**.

<p align="center">
  <img align="center" src="pics/dsp.png" width="600px" />
</p>
<p align="left">
  <b>Figure 2:</b> Overview of our proposed framework DSP, which learns a small policy LM to improve the frozen LLM's performance on specific downstream tasks. Given the input, the policy LM generates a stimulus to guide the LLM's generation, which is then evaluated with downstream performance measures or by human labelers. The evaluation scores are used as rewards to optimize the policy LM with RL. The parameters of the LLM are frozen while the policy LM is tuneable.
</p>
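
The reward signal described in Figure 2 can be sketched roughly as follows. This is a toy illustration under stated assumptions: `unigram_f1` is a simplified stand-in for a real metric such as ROUGE, `echo_hint_llm` is a hypothetical stub that replaces the black-box LLM API call so the example runs offline, and the prompt format is made up. None of these names come from this repository.

```python
from collections import Counter
from typing import Callable


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy unigram-overlap F1, a simplified stand-in for ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def stimulus_reward(
    source: str,
    stimulus: str,
    reference: str,
    llm: Callable[[str], str],
) -> float:
    """Score a stimulus by the quality of the frozen LLM's output."""
    prompt = f"{source}\nHint: {stimulus}\nSummary:"
    generation = llm(prompt)  # the LLM is only queried, never updated
    return unigram_f1(generation, reference)


def echo_hint_llm(prompt: str) -> str:
    """Hypothetical offline stub: returns the hint text as the 'summary'."""
    for line in prompt.splitlines():
        if line.startswith("Hint: "):
            return line[len("Hint: "):]
    return ""


if __name__ == "__main__":
    reference = "council approves budget"
    good = stimulus_reward("toy article", "council approves budget", reference, echo_hint_llm)
    bad = stimulus_reward("toy article", "weather sunny", reference, echo_hint_llm)
    print(good > bad)  # a better stimulus should earn a higher reward
```

In actual RL training, a reward of this kind would be passed to the PPO/NLPO update of the policy LM, while the LLM itself stays untouched.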

Currently, we test the framework on two benchmark tasks: 
 - Summarization
 - Dialogue Generation

Our code is based on [RL4LMs](https://github.com/allenai/RL4LMs). Users can customize the dataset, metrics, and LLM-based reward function to train transformer-based policy LMs that guide LLMs toward the desired properties.


---
# Install

## Local Installation 
```bash
pip install -e .
```

## Docker
We also provide a Dockerfile for development in a Docker container that includes all the dependencies.
```bash
docker build . -t rl4lms
```

## Additional dependencies

Optionally, CoreNLP libraries are required for certain metric computations (e.g., SPICE). They can be downloaded with `cd rl4lms/envs/text_generation/caption_metrics/spice && bash get_stanford_models.sh`

## Set up the OpenAI API key
Set your OpenAI API key as an environment variable so that the code can call the API:
`export OPENAI_API_KEY='XXXXXXXX'`
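
To fail early when the key is missing, a small check like the one below could be run before training. The helper `get_openai_key` is hypothetical and not part of this repository; it only reads the `OPENAI_API_KEY` variable shown above.

```python
import os
from typing import Mapping, Optional


def get_openai_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before training.")
    return key


if __name__ == "__main__":
    try:
        get_openai_key()
        print("OPENAI_API_KEY found.")
    except RuntimeError as err:
        print(err)
```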


---
# Step 1: Supervised Fine Tuning (SFT)
First, we perform supervised finetuning (SFT) of the policy LM on annotated data to provide a good starting point for the subsequent RL training. The code and data are in the `./sft4lms` directory. We provide scripts to run SFT for the two tasks:
```bash
sh run_sft_cnndm.sh # for the summarization task on the CNN/Daily Mail dataset
sh run_sft_multiwoz.sh # for the dialogue generation task on the MultiWOZ dataset
```

---
# Step 2: RL Training with PPO/NLPO
This part is based on [RL4LMs](https://github.com/allenai/RL4LMs), which provides a simple training API, invoked via a training script, for training a PPO, NLPO, or supervised model from a YAML config file.

We provide scripts for training a T5 policy LM on the summarization and dialogue generation tasks. You can run them with:
```bash
sh run_ppo_cnndm.sh
sh run_ppo_multiwoz.sh
```

The config files for the summarization and dialogue generation tasks can be found in the `scripts/training/task_configs/summarization_with_hint` and `scripts/training/task_configs/multiwoz_with_hint` directories, respectively.
You can customize the configuration files as instructed in [RL4LMs](https://github.com/allenai/RL4LMs).

The experiment results, including the generations and their corresponding evaluation scores, are stored in the `./rl4lms_exps/` directory.