# REAR: Scalable Test-time Preference Realignment through Reward Decomposition

This repository contains the official evaluation kit for the paper "REAR: Scalable Test-time Preference Realignment through Reward Decomposition". The code reproduces the paper's experiments on preference alignment benchmarks.

## Overview

The paper introduces REAlignment Reward (REAR), a novel method for aligning Large Language Models (LLMs) with user preferences at test time without requiring additional training. REAR works by decomposing the model's internal reward function and can be integrated with Test-Time Scaling (TTS) algorithms like Best-of-N (BoN) and Diverse Verifier Tree Search (DVTS).

This evaluation kit provides the necessary scripts and configurations to run the REAR methods and the baselines evaluated in the paper.
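
To make the Best-of-N setting concrete, here is a minimal sketch of the selection step only: sample N candidate responses, score each one with a verifier, and keep the highest-scoring candidate. The `generate` and `score` callables are hypothetical placeholders. They are not the REAR reward and not this repository's API.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical sampler: prompt -> response
    score: Callable[[str, str], float],  # hypothetical verifier: (prompt, response) -> reward
    n_samples: int = 16,
) -> Tuple[str, float]:
    """Sample n_samples candidates and return the best one with its score."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    candidates: List[str] = [generate(prompt) for _ in range(n_samples)]
    scored = [(score(prompt, c), c) for c in candidates]
    # Compare on the score only, so ties keep the earliest candidate.
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score
```

With `n_samples=1` this reduces to returning a single sample, which is how the greedy baseline is expressed in the commands later in this README.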

## Setup

1.  **Clone the repository, then install dependencies from the evaluation kit directory:**

    ```bash
    cd preference-tts-latex/evaluation-kit
    pip install -r requirements.txt
    ```

2.  **Install SGLang:**

    Please follow the official instructions at [docs.sglang.ai/start/install.html](https://docs.sglang.ai/start/install.html) to install the SGLang inference engine.

3.  **Download Models:**

    The experiments use `Qwen/Qwen2.5-7B-Instruct` as the base model. For the external reward model baseline, `Skywork/Skywork-Reward-Llama-3.1-8B` is used. Ensure you have access to these models, for example by downloading them from the Hugging Face Hub. The scripts assume they are available at a local path.

## Reproducing Paper Results

The experiments are managed through Hydra and can be launched via `run_eval.py`. The shell scripts in the `scripts/` directory provide the exact configurations for reproducing the results in the paper, particularly those in Table 2.

### 1. Launch SGLang Server

Start the SGLang server with the base model before running any evaluation.

```bash
# Example for launching the server for Qwen2.5-7B-Instruct
MODEL_PATH="/path/to/your/models/Qwen2.5-7B-Instruct"
bash scripts/launch-sglang-server.sh ${MODEL_PATH} --host 0.0.0.0 --port 30000 --tp 1
```

You may need to adjust the tensor parallelism (`--tp`) and other parameters based on your hardware.
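
Model loading can take a while, so it can help to wait until the server port accepts connections before launching an evaluation. The sketch below checks only TCP reachability on the port used above (30000). `wait_for_port` is a hypothetical helper, not part of the evaluation kit, and an open port does not by itself guarantee that the model is ready to serve requests.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 30000, timeout_s: float = 300.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(1.0)  # Not listening yet; retry after a short pause.
    return False


if __name__ == "__main__":
    if not wait_for_port():
        raise SystemExit("SGLang server did not open port 30000 in time")
```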

### 2. Run Evaluations

The commands below reproduce the results for the main methods on the PrefEval Explicit Preference benchmark. The commands for other benchmarks (e.g., `prefeval_choice`, `multifaceted`) can be found in the respective shell scripts.

Evaluations that use an LLM-as-a-judge, such as PrefEval, require `OPENAI_API_KEY` to be set.

```bash
export OPENAI_API_KEY="your-openai-api-key"
MODEL_PATH="/path/to/your/models/Qwen2.5-7B-Instruct"
```
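
A missing key may only surface once the judge is first called, so a small preflight check can fail earlier. The helper `require_env` below is a hypothetical convenience, not part of the evaluation kit.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the evaluation")
    return value


# Example: call require_env("OPENAI_API_KEY") before launching a PrefEval run.
```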

#### REAR-guided TTS

The `verifier.mixed_weight` parameter corresponds to the hyperparameter `λ` in the paper. The paper reports `20.0` as the best value for most tasks.

**BoN w/ REAR:**

```bash
python run_eval.py \
    data=prefeval_explicit \
    method=best_of_n \
    verifier=preference \
    model_path=${MODEL_PATH} \
    method.n_samples=16 \
    verifier.mixed_weight=20.0 \
    exp_name=bon_rear_prefeval_explicit
```
This corresponds to `scripts/bon-rear-prefeval.sh`.

**DVTS w/ REAR:**
```bash
python run_eval.py \
    data=prefeval_explicit \
    method=dvts \
    verifier=preference \
    model_path=${MODEL_PATH} \
    method.n_samples=16 \
    verifier.mixed_weight=20.0 \
    exp_name=dvts_rear_prefeval_explicit
```
This corresponds to `scripts/prefeval-dvts-rear.sh`.
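
The REAR commands above differ only in a few `key=value` overrides, so a sweep over `verifier.mixed_weight` can be generated programmatically. The following is a sketch under stated assumptions: `build_command` is a hypothetical helper, the sweep values other than `20.0` are illustrative, and the `_w<weight>` suffix on `exp_name` is an invented naming scheme.

```python
import shlex
from typing import Dict, List


def build_command(overrides: Dict[str, object], script: str = "run_eval.py") -> List[str]:
    """Turn a dict of key=value overrides into an argv list for the script."""
    return ["python", script] + [f"{key}={value}" for key, value in overrides.items()]


# Overrides copied from the BoN w/ REAR command above.
base = {
    "data": "prefeval_explicit",
    "method": "best_of_n",
    "verifier": "preference",
    "model_path": "/path/to/your/models/Qwen2.5-7B-Instruct",
    "method.n_samples": 16,
}

# Hypothetical sweep values; only 20.0 comes from this README.
for weight in (5.0, 20.0, 50.0):
    overrides = dict(base)
    overrides["verifier.mixed_weight"] = weight
    overrides["exp_name"] = f"bon_rear_prefeval_explicit_w{weight:g}"
    # Print a shell-safe command line; pass the list to subprocess.run to execute it.
    print(shlex.join(build_command(overrides)))
```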

#### Baselines

**Greedy Decoding:**
```bash
python run_eval.py \
    data=prefeval_explicit \
    method=best_of_n \
    verifier=none \
    model_path=${MODEL_PATH} \
    method.n_samples=1 \
    method.temperature=0.0 \
    exp_name=greedy_prefeval_explicit
```
This corresponds to `scripts/greedy-prefeval.sh`.

**BoN w/ External RM:**

This baseline requires a second SGLang server for the reward model.

```bash
# On a separate terminal/GPU
RM_PATH="/path/to/your/models/Skywork-Reward-Llama-3.1-8B"
bash scripts/launch-sglang-server.sh ${RM_PATH} --host 0.0.0.0 --port 8000 --is-embedding

# Run evaluation
python run_eval.py \
    data=prefeval_explicit \
    method=best_of_n \
    verifier=rm_server \
    model_path=${MODEL_PATH} \
    method.n_samples=16 \
    verifier.base_url=http://localhost:8000/classify \
    exp_name=bon_external_rm_prefeval_explicit
```
This corresponds to `scripts/bon-rm-prefeval.sh`. Remember to shut down the RM server afterwards.

**BoN w/ GenRM:**
```bash
python run_eval.py \
    data=prefeval_explicit \
    method=best_of_n \
    verifier=generative_rm_server \
    model_path=${MODEL_PATH} \
    method.n_samples=16 \
    exp_name=bon_gen_rm_prefeval_explicit
```
This corresponds to `scripts/bon-gen-rm-prefeval.sh`.

## Output Structure

Results are saved in the `outputs/` directory, following this structure:

```
outputs/
└── <model_name>/
    └── <method_name>/
        └── <verifier_name>/
            └── <exp_name>/
                ├── config.yaml           # Full Hydra configuration
                ├── results.jsonl         # Raw generation results
                └── metrics.json          # Aggregated evaluation metrics
```
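
To compare several runs, the layout above can be walked to gather every `metrics.json` into one dictionary. This is a minimal sketch: `collect_metrics` is a hypothetical helper, and it assumes only that each `metrics.json` holds valid JSON. The metric names inside the files are not specified here.

```python
import json
from pathlib import Path
from typing import Dict


def collect_metrics(root: str = "outputs") -> Dict[str, dict]:
    """Map each experiment directory (relative to root) to its parsed metrics.json."""
    results: Dict[str, dict] = {}
    root_path = Path(root)
    # Layout: <model_name>/<method_name>/<verifier_name>/<exp_name>/metrics.json
    for metrics_file in sorted(root_path.glob("*/*/*/*/metrics.json")):
        key = metrics_file.parent.relative_to(root_path).as_posix()
        with metrics_file.open(encoding="utf-8") as fh:
            results[key] = json.load(fh)
    return results


if __name__ == "__main__":
    for exp_dir, metrics in collect_metrics().items():
        print(exp_dir, metrics)
```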
