# RtRank: Stratified response time ranking for data-efficient reward modeling

This repository contains the implementation of the paper "RtRank: Stratified Response Time Ranking for Data-Efficient Reward Modeling".

## Overview

RtRank is a novel approach to preference learning that uses response-time information to improve utility estimation. In our experiments, it outperforms traditional choice-only methods such as Bradley-Terry.

## Installation

There are two ways to install and run the code:

### Option 1: Using devcontainer (recommended)

The easiest way to run the experiments is to use the provided devcontainer. This automates the environment setup.

Requirements:
- Docker
- DevContainer CLI (`npm install -g @devcontainers/cli`) or VS Code

#### Using VS Code:

1. Open the repository in VS Code
2. Install the "Dev Containers" extension
3. Select "Dev Containers: Reopen in Container"

#### Using devcontainer CLI:

The `./run_in_container` script in this repository uses the DevContainer CLI to run commands in the container, starting it if needed.

### Option 2: Manual installation

If you prefer not to use a container, you can install the dependencies directly:

```bash
python -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
```

Required system dependencies:
- Python >= 3.10

## Reproducing the paper experiments

To reproduce all the experiments from the paper, follow the steps below. The commands assume you use the devcontainer CLI. If you do not, omit the `bash ./run_in_container` prefix.

### Hardware requirements

- CPU: Any modern multi-core CPU
- RAM: 16GB minimum, 32GB recommended
- Disk space: ~3GB for the output of each experiment
- GPU: Not required

### Expected runtime

- Each experiment takes approximately 90 minutes on an 8-core machine that is not otherwise occupied
- The full set of experiments (all datasets and variants) will take about 9 hours
- The experiments are nearly perfectly parallelizable and can be executed in a matter of minutes on a sufficiently large cluster
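
The 9-hour figure is consistent with running the six experiment commands in this README one after another at roughly 90 minutes each. A quick check, using the dataset names from those commands:

```python
# Six runs: three datasets, each with and without inter-stratum variability.
datasets = ["deterministic_all", "stochastic", "drift_diffusion"]
variants = ["", "_no_variability"]
minutes_per_run = 90  # approximate, on an otherwise idle 8-core machine

runs = [d + v for d in datasets for v in variants]
total_hours = len(runs) * minutes_per_run / 60
print(f"{len(runs)} runs, about {total_hours:.0f} hours sequentially")
```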

### 1. Run the main experiments (with inter-stratum variation)

The following commands run the experiments and store the results in the `./outputs` directory.

```bash
# Run experiments with default configuration (all learners, 100 trials)
bash ./run_in_container python -m rtrank.experiment_runner dataset=deterministic_all
bash ./run_in_container python -m rtrank.experiment_runner dataset=stochastic
bash ./run_in_container python -m rtrank.experiment_runner dataset=drift_diffusion
```


### 2. Run the no-variability ablations (without inter-stratum variation)

```bash
bash ./run_in_container python -m rtrank.experiment_runner dataset=deterministic_all_no_variability
bash ./run_in_container python -m rtrank.experiment_runner dataset=stochastic_no_variability
bash ./run_in_container python -m rtrank.experiment_runner dataset=drift_diffusion_no_variability
```

### 3. Visualize the results

```bash
bash ./run_in_container python -m rtrank.analysis.cli all
```

This will create the figures in the `paper/figures` directory.

### Quick run

For a quick run of the entire pipeline described above, you can run:

```bash
bash ./run_in_container bash run_full_mini.sh
```

This runs a *reduced version* (2 trials, fewer dataset sizes) of all experiments and generates the figures in about 5-10 minutes. Note that results will be much less reliable than the full experiments.

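Alternatively, to run a minimal version of any individual experiment, append `num_trials=2 dataset_sizes=[0.5,1.0]` to the end of the command. In shells that treat square brackets as glob characters (e.g. zsh), quote the `dataset_sizes=[0.5,1.0]` argument.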

### Key results to expect

Based on our findings, the generated figures should show the following key results:

1. **Choice Accuracy**: RtRank (`rt_rank`) should outperform Bradley-Terry (`bt`) across all dataset sizes, with the largest improvements on smaller dataset fractions.

2. **Pearson Distance Correlation (PDC)**: RtRank should show a significantly higher correlation between true and predicted utility differences than Bradley-Terry.

3. **Ablations**: The permuted RT controls (`rt_rank_perm`, `rt_regression_perm`) should perform worse than their non-permuted counterparts.
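
To make the second metric more concrete, the sketch below computes a Pearson correlation between true and predicted utility differences over all item pairs. It is an illustrative reading of the description above, not the code in `evaluation.py`: the function names and toy utilities are hypothetical, and the repository's exact PDC definition may differ.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pairwise_differences(utilities):
    """Utility difference u_i - u_j for every unordered item pair (i < j)."""
    pairs = combinations(range(len(utilities)), 2)
    return [utilities[i] - utilities[j] for i, j in pairs]


# Toy example: the predicted utilities are a positive affine transformation
# of the true ones, so the pairwise differences are perfectly correlated
# even though the two scales differ.
true_u = [0.1, 0.7, 0.4, 0.9]
pred_u = [2.0 * u + 3.0 for u in true_u]
score = pearson(pairwise_differences(true_u), pairwise_differences(pred_u))
print(round(score, 6))
```

A correlation over differences ignores any constant offset or positive rescaling of the predicted utilities, which suits reward models whose outputs are only meaningful up to such transformations.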

## Experiment configurations

### Datasets

- **Deterministic** (`deterministic_all`): Assumes a deterministic relationship between utility difference and both choice and response time
- **Stochastic** (`stochastic`): A stochastic model assuming Bradley-Terry for choices and log-normal distributed response times
- **Drift Diffusion Model** (`drift_diffusion`): A cognitive model of decision-making based on evidence accumulation
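
As a rough illustration of the Stochastic dataset described above, the sketch below samples a choice from a Bradley-Terry model and a response time from a log-normal distribution. It is not the code in `synthetic_data.py`: the function name, the parameters `rt_scale` and `rt_sigma`, and in particular the way the response time depends on the utility difference are assumptions made for this example.

```python
import math
import random


def sample_comparison(u_a, u_b, rng, rt_scale=1.0, rt_sigma=0.5):
    """Sample one (chose_a, response_time) pair for utilities u_a and u_b.

    The choice follows Bradley-Terry: P(a chosen) = sigmoid(u_a - u_b).
    The response time is log-normal. Letting its median shrink as
    |u_a - u_b| grows is an ASSUMPTION of this sketch.
    """
    diff = u_a - u_b
    p_a = 1.0 / (1.0 + math.exp(-diff))
    chose_a = rng.random() < p_a
    log_median = math.log(rt_scale) - abs(diff)  # easier decisions are faster
    response_time = rng.lognormvariate(log_median, rt_sigma)
    return chose_a, response_time


rng = random.Random(0)
samples = [sample_comparison(1.5, 0.0, rng) for _ in range(2000)]
frac_a = sum(chose for chose, _ in samples) / len(samples)
# The expected fraction is sigmoid(1.5), roughly 0.82.
print(f"fraction choosing a: {frac_a:.2f}")
```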
 
Each dataset is randomly partitioned into "strata". By default, each stratum is assigned a response-time modifier that modulates the response times in that stratum, simulating inter-stratum variability. Each dataset also has a `_no_variability` variant that omits this modulation.
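
The stratification described above can be sketched as follows. This is a minimal illustration rather than the repository's implementation: the helper names are hypothetical, and a multiplicative modifier is an assumption, since the text only says that response times are modulated per stratum.

```python
import random


def assign_strata(n_samples, n_strata, rng):
    """Randomly assign each sample index to one of n_strata strata."""
    return [rng.randrange(n_strata) for _ in range(n_samples)]


def modulate_response_times(response_times, strata, modifiers):
    """Scale each response time by the modifier of its stratum.

    The multiplicative form is an ASSUMPTION of this sketch.
    """
    return [rt * modifiers[s] for rt, s in zip(response_times, strata)]


rng = random.Random(0)
response_times = [rng.lognormvariate(0.0, 0.5) for _ in range(8)]
strata = assign_strata(len(response_times), n_strata=2, rng=rng)
modifiers = {0: 1.0, 1: 3.0}  # hypothetical: stratum 1 responds three times slower

with_variability = modulate_response_times(response_times, strata, modifiers)
# The `_no_variability` variants correspond to leaving the times unmodulated.
no_variability = list(response_times)
```

Under this assumption, a positive modifier preserves the ordering of response times within a stratum, while response times from different strata are no longer directly comparable.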

### Learning algorithms

- **Bradley-Terry** (`bt`): Traditional choice-only baseline model
- **RT Regression** (`rt_regression`): Direct regression from response times to utility differences
- **RT Regression with Permuted RTs** (`rt_regression_perm`): Control condition with permuted response times
- **RtRank** (`rt_rank`): Our proposed model using response times for ranking preference strength
- **RtRank with Pooled Data** (`rt_rank_pooled`): RtRank variant using pooled data
- **RtRank with Permuted RTs** (`rt_rank_perm`): Control condition with permuted response times
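
The permuted controls can be pictured with the sketch below, which shuffles response times across comparisons. The function name is hypothetical, and whether the repository permutes globally or within each stratum is not stated here; this example permutes globally.

```python
import random


def permute_response_times(response_times, rng):
    """Return a shuffled copy of the response times.

    Shuffling keeps the marginal distribution of response times but breaks
    their pairing with individual comparisons, so a learner trained on the
    result cannot benefit from genuine response-time information.
    """
    permuted = list(response_times)
    rng.shuffle(permuted)
    return permuted


rng = random.Random(0)
response_times = [0.42, 1.30, 0.75, 2.10, 0.58]
permuted = permute_response_times(response_times, rng)
print(sorted(permuted) == sorted(response_times))  # True: same values overall
```

If a permuted control performs as well as its non-permuted counterpart, the gain is unlikely to come from the response times themselves.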

## Project structure

- `src/rtrank/`: Core implementation of the RtRank algorithm
  - `losses.py`: Implementations of RtRank loss functions
  - `synthetic_data.py`: Generators for synthetic preference data
  - `evaluation.py`: Metrics and evaluation functions
  - `experiment_runner.py`: Main experiment orchestration script
- `conf/`: Hydra configuration files for experiments
  - `config.yaml`: Main configuration file
  - `learner/`: Learner-specific configurations
  - `dataset/`: Dataset-specific configurations
- `tests/`: Unit tests for core functionality

## Development

### Commands

- Run tests: `bash ./run_tests` (single test: `bash ./run_tests tests/test_file.py::test_function`)
- Run linters: `bash ./run_linters` (fix issues automatically: `bash ./run_linters --fix`)
- Execute in container: `bash ./run_in_container <command>` (e.g., `bash ./run_in_container python script.py`)
