# Machine Learning Blog | ML@CMU | Carnegie Mellon University

TL;DR: _Prompting enables large language models (LLMs) to perform various NLP tasks without changing the model. Discrete prompts have many desirable properties, but are difficult to optimize. We propose an efficient approach using reinforcement learning, which shows superior performance and facilitates rich interpretations and analyses._ You can easily adapt it for your own tasks using our library [here](https://github.com/mingkaid/rl-prompt).

**Prompting** has emerged as a promising approach to solving a wide range of NLP problems using large pre-trained language models (LMs), including left-to-right models such as [GPT](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)[s](https://arxiv.org/abs/2005.14165) and masked LMs such as [BERT](https://arxiv.org/abs/1810.04805), [RoBERTa](https://arxiv.org/abs/1907.11692), etc. 

Compared to conventional fine-tuning that expensively updates the massive LM parameters for each downstream task, prompting concatenates the inputs with an additional piece of text that **steers the LM to produce the desired outputs**. A key question with prompting is how to find the optimal prompts to improve the LM’s performance on various tasks, often with only a few training examples.
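As a minimal illustration of what "concatenating the inputs with an additional piece of text" can look like for a masked LM, the sketch below joins an input sentence, a short prompt, and a mask slot. The function name `build_prompted_input`, the template layout, and the example prompt tokens are assumptions made for illustration, not the format used in our codebase.

```python
def build_prompted_input(text, prompt_tokens, template="{text} {prompt} <mask>"):
    """Concatenate an input with a discrete prompt (hypothetical template).

    The LM itself is left untouched; only the text it reads changes.
    """
    prompt = " ".join(prompt_tokens)
    return template.format(text=text, prompt=prompt)


# Illustrative prompt tokens; a learned prompt would be produced by optimization.
example = build_prompted_input("The movie was a delight.", ["Overall", "Rating"])
print(example)
```

A masked LM would then fill the `<mask>` slot, and the filled-in word is mapped to a class label.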

Most existing work resorts to tuning **soft prompts** (e.g., embeddings), which fall short of interpretability, reusability across LMs, and applicability when gradients are not accessible. **Discrete prompts**, on the other hand, are difficult to optimize and are often created by “enumeration (e.g., paraphrasing)-then-selection” heuristics that do not explore the prompt space systematically.

In [our EMNLP 2022 paper](https://arxiv.org/abs/2205.12548), **we instead propose RLPrompt**, an efficient discrete prompt optimization approach with reinforcement learning (RL). RLPrompt is flexibly applicable to different types of LMs (e.g., BERT and GPTs) for both classification and generation tasks. Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing fine-tuning and prompting methods.

Interestingly, the resulting optimized prompts are often **ungrammatical gibberish text**; and surprisingly, those gibberish prompts are **transferable between different LMs** while retaining significant performance, indicating that LMs may have grasped shared structures for prompting that do not follow human language patterns.

## Discrete Prompt Optimization with RL

This paper presents RLPrompt, a new discrete prompt optimization approach based on reinforcement learning (RL). This approach brings together a wide range of desirable properties for efficient use on diverse tasks and LMs (see the table below). 

Crucially, rather than directly editing the discrete tokens, which has been difficult and inefficient, RLPrompt trains a policy network that generates the desired prompts. Discrete prompt optimization thus amounts to learning a small number of policy parameters, which we set as an MLP layer inserted into a frozen compact model such as [distilGPT-2](https://huggingface.co/distilgpt2). We describe the specific formulations in Sections §2.1-2.3 of our [paper](https://arxiv.org/abs/2205.12548).
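To make the "policy that generates prompts" idea concrete, here is a toy sketch in plain Python. It is not our implementation: the frozen LM is replaced by a deterministic pseudo-random feature function, the trainable MLP is simplified to a single linear head, and the vocabulary, `HIDDEN` size, and class name `PromptPolicy` are all hypothetical. What it does show is the division of labor: the backbone is fixed, only a small set of weights would be trained, and a prompt is sampled one token at a time.

```python
import math
import random

VOCAB = ["Review", "Absolutely", "Sentiment", "Rate", "Overall", "Grade"]  # toy vocabulary
HIDDEN = 4  # toy feature size


def frozen_features(prefix):
    """Stand-in for a frozen LM: a fixed pseudo-random vector per token prefix."""
    rng = random.Random("0|" + " ".join(prefix))
    return [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class PromptPolicy:
    """Toy policy: a small trainable head on top of frozen features."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # The only parameters an RL algorithm would update (one row per token).
        self.weights = [[rng.gauss(0.0, 0.5) for _ in range(HIDDEN)] for _ in VOCAB]

    def token_probs(self, prefix):
        h = frozen_features(prefix)
        logits = [sum(w * x for w, x in zip(row, h)) for row in self.weights]
        return softmax(logits)

    def sample_prompt(self, length, rng):
        """Sample prompt tokens autoregressively, conditioning on the prefix so far."""
        prompt = []
        for _ in range(length):
            probs = self.token_probs(prompt)
            prompt.append(rng.choices(VOCAB, weights=probs, k=1)[0])
        return prompt


policy = PromptPolicy(seed=0)
print(policy.sample_prompt(5, random.Random(1)))
```

In this framing, each sampled prompt is scored by a reward computed with the task LM, and that reward is used to update only `weights`.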

This formulation also allows us to employ off-the-shelf RL algorithms (e.g., [soft Q-learning](https://arxiv.org/abs/2106.07704)) that learn the policy with arbitrary reward functions—defined either with available data (e.g., in few-shot classification) or other weak signals when no supervised data is accessible (e.g., in controllable text generation).

## Reward Stabilization 

On the other hand, RL for prompt optimization poses new challenges to learning efficiency: the large black-box LM presents a highly complex environment that, given the prompt (i.e., actions), goes through a long series of complex transitions (e.g., reading the input and inferring the output) before computing the rewards. This makes the reward signals extremely unstable and hard to learn from. 

To overcome this difficulty, we propose two simple yet surprisingly effective ways to stabilize the rewards and improve the optimization efficiency. 

  1. Normalizing the training signal by computing the _z_-score of rewards for the same input. 
  2. Designing piecewise reward functions that provide a sparse, qualitative bonus for desirable behaviors (e.g., reaching a certain accuracy on a certain class).
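The first technique in the list above can be sketched as follows. This is a minimal version under stated assumptions: rewards arrive as `(input_id, reward)` pairs, the population standard deviation is used, and a small `eps` guards against division by zero. The function name `zscore_by_input` and these choices are ours for illustration; the paper gives the exact formulation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_input(samples, eps=1e-8):
    """Normalize each reward against the other rewards for the same input.

    samples: list of (input_id, reward) pairs, e.g. one pair per sampled prompt.
    Returns a list of normalized rewards in the same order as `samples`.
    """
    groups = defaultdict(list)
    for input_id, reward in samples:
        groups[input_id].append(reward)
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}
    return [(r - stats[i][0]) / (stats[i][1] + eps) for i, r in samples]


# Input "b" yields much larger raw rewards than input "a" for every prompt.
raw = [("a", 10.0), ("a", 20.0), ("b", 90.0), ("b", 100.0)]
print(zscore_by_input(raw))
```

After normalization, the better prompt for each input receives a comparable positive signal, regardless of how easy or hard that input makes it to earn a high raw reward.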



We describe more details in Section §2.4 of our [paper](https://arxiv.org/abs/2205.12548).
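The second technique, a piecewise reward, might look like the sketch below for classification. It assumes the reward is built from the gap between the probability of the true label and the best competing label, and that the gap is scaled more strongly once the prediction becomes correct. The function name and the scale values are hypothetical and chosen only to show the shape of such a function.

```python
def piecewise_reward(class_probs, true_label, correct_scale=2.0, incorrect_scale=1.0):
    """Reward = scaled probability gap, with a larger scale when the gap is positive.

    class_probs: dict mapping label -> probability (at least two labels).
    The two scale values are illustrative, not taken from the paper.
    """
    true_prob = class_probs[true_label]
    best_other = max(p for label, p in class_probs.items() if label != true_label)
    gap = true_prob - best_other
    scale = correct_scale if gap > 0 else incorrect_scale
    return scale * gap


print(piecewise_reward({"positive": 0.7, "negative": 0.3}, "positive"))
print(piecewise_reward({"positive": 0.2, "negative": 0.8}, "positive"))
```

The jump in scale at the decision boundary is what gives the policy a qualitative bonus for crossing from a wrong prediction to a right one.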

## Experiments

We evaluate our approach on both **classification** (in the few-shot setting) and **generation** (unsupervised text style transfer), and perform rich analyses for new insights on LM prompting. We describe implementation details such as reward function design in Section §3 of our [paper](https://arxiv.org/abs/2205.12548), and publish the code at our GitHub [codebase](https://github.com/mingkaid/rl-prompt).

### Few-Shot Text Classification

For few-shot classification, we follow previous work and experiment on popular sentiment and topic classification tasks, using 16 examples per class for [both training and validation](https://arxiv.org/abs/2105.11447). Results using RoBERTa-large (left table below) show that our approach **improves over a wide range of fine-tuning and prompting methods**, and is as efficient to optimize as similar methods that tune soft prompts (e.g., right figure below). We report detailed dataset-level results in Section §3.1 of our paper.
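For readers setting up a similar few-shot split, the sketch below draws disjoint training and validation sets with `k` examples per class each. It assumes the data is a list of `(text, label)` pairs and reads "16 examples per class for both training and validation" as 16 per class in each split; the function name `sample_few_shot` and the seeding scheme are illustrative rather than taken from our codebase.

```python
import random
from collections import defaultdict


def sample_few_shot(examples, k=16, seed=0):
    """Return (train, validation), each holding k examples per class, with no overlap."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    train, validation = [], []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < 2 * k:
            raise ValueError(f"class {label!r} has fewer than {2 * k} examples")
        picked = rng.sample(pool, 2 * k)  # sampling without replacement
        train.extend(picked[:k])
        validation.extend(picked[k:])
    return train, validation


# Synthetic data: 40 examples for each of two classes.
data = [(f"text {i}", "pos" if i % 2 == 0 else "neg") for i in range(80)]
train, validation = sample_few_shot(data, k=16)
print(len(train), len(validation))
```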

### Unsupervised Text Style Transfer

For text style transfer, we evaluate on the popular [Yelp](https://github.com/shentianxiao/language-style-transfer) sentiment transfer dataset using popular automatic metrics for content preservation, style accuracy, and fluency, and report their sentence-level joint product \\(J(\cdot)\\) below. Our full paper also includes few-shot experiments on the [Shakespeare](https://github.com/cocoxu/Shakespeare) dataset and human evaluations.
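As a rough sketch of a sentence-level joint metric, the snippet below multiplies the three per-sentence scores and then averages over sentences. It assumes each score is already scaled to [0, 1] and that \\(J(\cdot)\\) is an average of per-sentence products; the paper gives the exact definition, and the function name `joint_score` is ours.

```python
def joint_score(content, style, fluency):
    """Average over sentences of content * style * fluency (scores assumed in [0, 1])."""
    if not content or not (len(content) == len(style) == len(fluency)):
        raise ValueError("expected three non-empty score lists of equal length")
    products = [c * s * f for c, s, f in zip(content, style, fluency)]
    return sum(products) / len(products)


# Two sentences: the first succeeds on all three criteria, the second is not fluent.
print(joint_score([1.0, 0.5], [1.0, 1.0], [1.0, 0.0]))
```

Because the product is taken per sentence, an output only scores well if it preserves content, carries the target style, and is fluent at the same time; strong averages on separate criteria cannot compensate for each other.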

Results using GPT-2 (left table below) show that our method **outperforms or competes with various fine-tuning and prompting baselines**, including [DiRR](https://arxiv.org/abs/2010.12771), which expensively fine-tunes all parameters of a GPT-2 model. An ablation study (right figure below) shows that our proposed reward normalization technique is crucial to optimization success. We describe the full evaluation results in Section §3.2 of our paper.

## Analysis

### Optimal Prompts Don’t Follow Human Language

The resulting discrete prompts also facilitate rich interpretations and analyses for new insights into LM prompting. In particular, the optimized prompts, though inducing strong task performance, tend to be gibberish text without clear human-understandable meaning (e.g., table below). This echoes recent research (e.g., [Webson and Pavlick (2021)](https://arxiv.org/abs/2109.01247), [Zhao et al. (2021)](https://arxiv.org/abs/2102.09690), and [Prasad et al. (2022)](https://arxiv.org/abs/2203.07281)) suggesting that the way LMs make use of prompts does not necessarily follow human language patterns.


### Learned Prompts Transfer Trivially Across LMs

Perhaps surprisingly, those gibberish prompts learned with one LM can be used in other LMs for significant performance, indicating that those different pre-trained LMs have grasped shared structures for prompting (e.g., figures below). 


## Conclusion

We have presented RLPrompt, an efficient and flexible approach for discrete prompt optimization using RL, which improves over a wide range of fine-tuning and prompting methods in experiments on few-shot classification and unsupervised text style transfer. 

Analysis reveals that strong optimized prompts are incoherent but transferable between LMs for remarkable performance. The observation opens up many promising possibilities for prompting, such as learning prompts cheaply from smaller models and performing inference with larger models. We are excited to explore further.