---
title: Real-Time Aligned Reward Model beyond Semantics (R2M)

---

This repository contains the code for our paper: Real-Time Aligned Reward Model beyond Semantics (R2M).


R2M is implemented directly on top of the trl library (https://github.com/huggingface/trl). To use it, download the trl library and replace the `trl` directory in its root directory with the one from this repository.

The files in this repo are:
- `run_scripts/rloo_tldr_feedback_v2.py`: the main entry point for the text summarization task.
- `run_scripts/rloo_ultrafeedback_feedback_v2.py`: the main entry point for the dialogue task.
- `trl/modified_reward_model`: We rewrote the reward model class to support the aggregation of policy feedback.
- `trl/trainer/rloo_trainer_feedback_v2.py`: We rewrote the train function to encompass the entire R2M framework, including encoding and decoding for rollout, obtaining policy feedback during policy forward, constructing training data for the reward model, and performing additional reward model training.
- `trl/trainer/utils.py`: We implemented key functional components to support obtaining policy feedback from the sampling process, incorporating policy feedback into scoring, and computing the BT loss and GRE loss.
- `trl/trainer/rloo_config_with_feedback.py`: We extended the parameter class of RLOO to support the newly introduced parameters in R2M.
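
To make the BT loss and the RLOO baseline mentioned in the list above concrete, here is a minimal pure-Python sketch of the standard textbook definitions. It is not the code in `trl/trainer/utils.py` or `trl/trainer/rloo_trainer_feedback_v2.py`: the function names `bt_loss` and `rloo_advantages` are hypothetical, the real implementation operates on batched tensors, and the GRE loss is omitted because its definition is specific to the paper.

```python
import math


def bt_loss(chosen_rewards, rejected_rewards):
    """Mean Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)).

    Hypothetical helper for illustration; inputs are plain lists of floats.
    """
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("need two non-empty lists of equal length")
    losses = []
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


def rloo_advantages(rewards):
    """Leave-one-out advantages for k samples drawn from the same prompt.

    Each sample's baseline is the mean reward of the other k - 1 samples.
    Hypothetical helper for illustration.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


if __name__ == "__main__":
    # The loss shrinks as the chosen response is scored further above the rejected one.
    print(bt_loss([2.0], [0.0]), bt_loss([0.0], [2.0]))
    # Three rollouts of one prompt with scalar rewards.
    print(rloo_advantages([1.0, 2.0, 3.0]))
```

The leave-one-out baseline needs no learned value function, which is why RLOO only requires several rollouts per prompt and a scalar reward for each.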



## Tips for Running R2M

Given the various inquiries about R2M, we provide a list of tips to help you reproduce our paper results and achieve better outcomes for running R2M on your own tasks. 

### Environment

We provide `requirements.txt`, which lists the Python package versions we used in our experiments. For best reproducibility, we recommend using the same package versions. Note that results may still vary with hardware configuration, CUDA version, and similar factors.

### Hyperparameter tuning

The best R2M hyperparameters for each setting are provided in the run scripts.

### Reproducing AlpacaEval 2 numbers

Please use `alpaca-eval==0.6.2` to reproduce our AlpacaEval 2 results. Starting with `0.6.3`, AlpacaEval made a major revision to vLLM decoding, which causes a discrepancy from our experiments.
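
As a quick sanity check before running evaluation, the pin above can be verified from Python. This is a small sketch using only the standard library; the helper name `is_pinned` is hypothetical and not part of this repository, and it assumes the package is installed under the distribution name `alpaca-eval`, as in the pin.

```python
from importlib.metadata import PackageNotFoundError, version


def is_pinned(package: str, expected: str) -> bool:
    """Return True only if `package` is installed at exactly `expected`.

    Hypothetical helper for illustration; a missing package counts as not pinned.
    """
    try:
        return version(package) == expected
    except PackageNotFoundError:
        return False


if __name__ == "__main__":
    if not is_pinned("alpaca-eval", "0.6.2"):
        print("alpaca-eval==0.6.2 is not installed; AlpacaEval 2 numbers may differ.")
```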

## Evaluation

We follow the official implementations for evaluation on AlpacaEval 2 and MT-Bench, as follows:

* AlpacaEval 2: Please refer to the [AlpacaEval repo](https://github.com/tatsu-lab/alpaca_eval) for evaluation.

## Install Requirements

We provide `requirements.txt`, which lists the package dependencies.

    conda create -n r2m python=3.10 
    conda activate r2m
    pip install -r requirements.txt

## Training Scripts

For a detailed tutorial on running the code, please refer to the trl codebase: https://github.com/huggingface/trl.

**Run the text summarization task:**
    
    cd run_scripts
    sh run_tldr_pythia_R2M.sh 
    

**Run the dialogue task on Qwen2.5:**

    cd run_scripts
    sh run_ultrafeedback_qwen2.5_R2M.sh

**Run the dialogue task on LLaMA3:**

    cd run_scripts
    sh run_ultrafeedback_llama3_R2M.sh
