This code implements the empirical study section of the paper *"Mechanism Design for LLM Fine-tuning with Multiple Reward Models."* The implementation builds upon *Rewarded Soup* [1] and *Multiple Objective RLHF* [2].

As outlined in the paper, we model two RLHF games: the "Harmless vs. Humor" game for the Helpful Assistants task, and the "Faithful vs. Summary" game for the Reddit Summary task. This code uses the "Faithful vs. Summary" game as a case study to demonstrate the training process.

### Training Procedure

1. **Supervised Fine-Tuning**  
   First, we perform supervised fine-tuning on the base model using the corresponding dataset. To do so, run the following command:
   ```
   cd ./sft
   sh sft.sh
   ```
   Ensure that the base model (e.g., *llama-2-7b-hf*) is placed in the `./ppo` folder.

2. **RLHF Process**  
   Next, we apply RLHF by training two models with their respective reward models. To start this process, run:
   ```
   cd ./ppo
   sh ppo.sh
   ```
   After completion, you will find the two RLHF fine-tuned models in the `./ppo` folder.

3. **Rewarded Soup**  
   We then use the *Rewarded Soup* method to generate a set of hybrid models from the two RLHF fine-tuned models. Run the following script:
   ```
   sh eval_rs_summary.sh
   ```

4. **Simulation and Visualization**  
   The mechanism is simulated and visualised in the file `./ppo/stat.ipynb`. The figures and training logs are stored in the `./ppo` folder.
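
To make step 3 more concrete, here is a minimal sketch of the weight interpolation that Rewarded Soup [1] is based on: each hybrid model is a convex combination of the two fine-tuned models' parameters. It is not the code in `eval_rs_summary.sh`. The function names `interpolate_weights` and `soup_family`, the mixing coefficient `lam`, and the toy parameter dictionaries are all hypothetical and chosen for illustration. Plain floats stand in for parameters here; the same arithmetic should apply to array or tensor values that share parameter names and shapes.

```python
def interpolate_weights(state_a, state_b, lam):
    """Return lam * state_a + (1 - lam) * state_b, parameter by parameter.

    state_a and state_b map parameter names to values (floats here;
    arrays or tensors of matching shape would work the same way).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if state_a.keys() != state_b.keys():
        raise KeyError("both models must share the same parameter names")
    return {
        name: lam * state_a[name] + (1.0 - lam) * state_b[name]
        for name in state_a
    }


def soup_family(state_a, state_b, num_points=11):
    """Build hybrid models for evenly spaced mixing coefficients in [0, 1]."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    lams = [i / (num_points - 1) for i in range(num_points)]
    return [(lam, interpolate_weights(state_a, state_b, lam)) for lam in lams]


if __name__ == "__main__":
    # Toy stand-ins for the two RLHF fine-tuned models (hypothetical values).
    faithful_model = {"w": 1.0, "b": 0.0}
    summary_model = {"w": 3.0, "b": 2.0}
    for lam, hybrid in soup_family(faithful_model, summary_model, num_points=3):
        print(lam, hybrid)
```

With `lam = 1` the hybrid equals the first model, with `lam = 0` it equals the second, and intermediate values trade off the two reward objectives.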

### References
[1] Rame A, Couairon G, Dancette C, et al. Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. *Advances in Neural Information Processing Systems*, 2023, 36.

[2] Shi R, Chen Y, Hu Y, et al. Decoding-time language model alignment with multiple objectives. *arXiv preprint* arXiv:2406.18853, 2024.