# REAR: Scalable Test-time Preference Realignment through Reward Decomposition

This repository contains the implementation for the paper "REAR: Test-time Preference Realignment through Reward Decomposition". The code allows for the reproduction of the experiments on preference alignment benchmarks presented in the paper.

## Overview

The paper introduces REAlignment Reward (REAR), a novel method for aligning Large Language Models (LLMs) with user preferences at test time without requiring additional training. REAR works by decomposing the model's internal reward function and can be integrated with Test-Time Scaling (TTS) algorithms like Best-of-N (BoN) and Diverse Verifier Tree Search (DVTS).
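
As a rough illustration of where a reward signal plugs into Best-of-N, the sketch below samples `n_samples` candidates and keeps the highest-scoring one. It is a generic BoN loop, not the repository's implementation: `generate` and `score` are hypothetical placeholders, and the toy scorer does not represent the REAR reward.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n_samples: int = 16,
) -> str:
    """Sample n_samples candidates and return the one the scorer ranks highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n_samples)]
    # On ties, max() keeps the first candidate that reached the top score.
    return max(candidates, key=lambda response: score(prompt, response))


# Toy stand-ins (hypothetical): a real run would sample from the LLM and
# score each response with a verifier instead of using response length.
def toy_score(prompt: str, response: str) -> float:
    return float(len(response))
```

In a real run, `generate` would call the served model and `score` would be the verifier selected by the `verifier=` option.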

This evaluation kit provides the necessary scripts and configurations to run the REAR methods and the baselines evaluated in the paper.

## Setup

This repo uses a `uv`-managed virtual environment and a locked dependency set (`uv.lock` / `pyproject.toml`).

1.  **Create the environment and install dependencies:**

    ```bash
    cd evaluation-kit
    uv venv
    uv sync
    ```

    Then either activate the environment:

    ```bash
    source .venv/bin/activate
    ```

    Or run commands via `uv run` (no activation needed):

    ```bash
    uv run python run_eval.py --help
    ```

2.  **Download Models:**

    The experiments use `Qwen/Qwen2.5-7B-Instruct` as the base model. Make sure you have access to it, for example by downloading it from the Hugging Face Hub. The scripts assume the model is available at a local path.

## Run Evaluations

The experiments are configured with Hydra and launched via `run_eval.py` (see `configs/` for experiment presets). For the ping-pong benchmarks, use `run_ping_pong.py` instead, since it handles multi-turn conversations differently.

### 1. Launch vLLM Server

Before running any evaluation, you need to start a vLLM OpenAI-compatible server with the base model.

```bash
# Example for launching the server for Qwen2.5-7B-Instruct (OpenAI-compatible)
MODEL_PATH="/path/to/your/models/Qwen2.5-7B-Instruct"
bash scripts/launch-vllm-server.sh "${MODEL_PATH}" --host 0.0.0.0 --port 30000 --dp 1 --tp 1 --gpu-memory-utilization 0.8
```

You may need to adjust data/tensor parallelism (`--dp`, `--tp`) and GPU memory utilization based on your hardware.
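
Before starting a long evaluation, it can help to confirm that the server is reachable. The sketch below assumes the server exposes the `/v1/models` route of an OpenAI-compatible API on the port used above and that no API key is required. The helper names are hypothetical and not part of this repository.

```python
import json
import urllib.error
import urllib.request


def models_url(host: str = "localhost", port: int = 30000) -> str:
    """Build the URL of the model-listing route of an OpenAI-compatible server."""
    return f"http://{host}:{port}/v1/models"


def server_is_ready(host: str = "localhost", port: int = 30000, timeout: float = 5.0) -> bool:
    """Return True if the server answers /v1/models with at least one model."""
    try:
        with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return isinstance(payload, dict) and bool(payload.get("data"))
```

Calling `server_is_ready()` in a retry loop after launching the server is one way to wait for model loading to finish.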

### 2. Run Evaluations

Here are the commands to reproduce the results for the main methods on the PrefEval Explicit Preference benchmark. The commands for other benchmarks (e.g., `prefeval_choice`, `multifaceted`) can be found in the respective shell scripts.

You will need to set `OPENAI_API_KEY` for evaluations that use an LLM-as-a-judge (like PrefEval).

```bash
export OPENAI_API_KEY="your-openai-api-key"
MODEL_PATH="/path/to/your/models/Qwen2.5-7B-Instruct"
```
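
A small pre-flight check can catch a missing key or a wrong model path before any GPU time is spent. This is a minimal sketch, assuming the model is stored as a local directory. `preflight_problems` is a hypothetical helper, not a function provided by this repository.

```python
import os
from pathlib import Path
from typing import List, Mapping


def preflight_problems(model_path: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return human-readable problems; an empty list means the setup looks usable."""
    problems: List[str] = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    if not Path(model_path).is_dir():
        problems.append(f"model directory not found: {model_path}")
    return problems
```

For example, `preflight_problems(os.environ.get("MODEL_PATH", ""))` returns an empty list only when both the key and the directory are present. Note that `MODEL_PATH` must be exported for a Python process to see it.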

#### REAR-guided TTS

The `verifier.mixed_weight` parameter corresponds to the hyperparameter `λ` in the paper, which reports `20.0` as the best value for most tasks.

**BoN w/ REAR:**

```bash
python run_eval.py \
    data=prefeval_explicit \
    method=best_of_n \
    verifier=preference_verifier \
    model_path="${MODEL_PATH}" \
    method.n_samples=16 \
    verifier.mixed_weight=20.0 \
    exp_name=bon_rear_prefeval_explicit
```

**DVTS w/ REAR:**

```bash
python run_eval.py \
    data=prefeval_explicit \
    method=dvts \
    verifier=preference_verifier \
    model_path="${MODEL_PATH}" \
    method.n_samples=16 \
    verifier.mixed_weight=20.0 \
    exp_name=dvts_rear_prefeval_explicit
```
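
To try several values of `verifier.mixed_weight`, the commands above can be generated programmatically. The sketch below only builds the argument lists from the overrides shown in this section. The weight grid and the `exp_name` naming scheme are illustrative assumptions, not settings from the paper.

```python
import shlex
from typing import List


def rear_command(method: str, mixed_weight: float, model_path: str, n_samples: int = 16) -> List[str]:
    """Build a run_eval.py invocation for a REAR-guided method (best_of_n or dvts)."""
    # Hypothetical naming scheme: 20.0 -> "20", 2.5 -> "2p5" (keeps dots out of exp_name).
    tag = f"{mixed_weight:g}".replace(".", "p")
    return [
        "python", "run_eval.py",
        "data=prefeval_explicit",
        f"method={method}",
        "verifier=preference_verifier",
        f"model_path={model_path}",
        f"method.n_samples={n_samples}",
        f"verifier.mixed_weight={mixed_weight}",
        f"exp_name={method}_rear_w{tag}_prefeval_explicit",
    ]


if __name__ == "__main__":
    for weight in (5.0, 10.0, 20.0):  # hypothetical grid
        print(shlex.join(rear_command("best_of_n", weight, "/path/to/your/models/Qwen2.5-7B-Instruct")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to launch the runs one after another.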

#### Baselines

**Greedy Decoding:**

```bash
python run_eval.py \
    data=prefeval_explicit \
    method=best_of_n \
    verifier=none \
    model_path="${MODEL_PATH}" \
    method.n_samples=1 \
    method.temperature=0.0 \
    exp_name=greedy_prefeval_explicit
```

**BoN w/ GenRM:**

```bash
python run_eval.py \
    data=prefeval_explicit \
    method=best_of_n \
    verifier=generative_rm_server \
    model_path="${MODEL_PATH}" \
    method.n_samples=16 \
    exp_name=bon_gen_rm_prefeval_explicit
```

## Output Structure

Results are saved in the `outputs/` directory, following this structure:

```
outputs/
└── <dataset_identifier>/
    └── <method_name>/
        └── <verifier_name>/
            └── <exp_name>/
                ├── config.yaml      # Full Hydra configuration
                ├── results.jsonl    # Raw generation results
                └── metrics.json     # Aggregated evaluation metrics
```
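
Given this layout, the metrics of all runs can be gathered with a short script. The sketch below assumes the four-level directory structure shown above and makes no assumption about the keys inside `metrics.json`. `collect_metrics` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path
from typing import Dict


def collect_metrics(root: str = "outputs") -> Dict[str, dict]:
    """Map "<dataset>/<method>/<verifier>/<exp_name>" to the parsed metrics.json."""
    results: Dict[str, dict] = {}
    for metrics_file in sorted(Path(root).glob("*/*/*/*/metrics.json")):
        run_key = "/".join(metrics_file.parent.relative_to(root).parts)
        with metrics_file.open(encoding="utf-8") as handle:
            results[run_key] = json.load(handle)
    return results
```

Printing `collect_metrics()` from the repository root gives a quick side-by-side view of completed experiments.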
