# Test-time Calibration

This repository implements joint calibration of a token-level logit bias ("delta") and a sampling temperature ("T") to improve test-time compute efficiency and accuracy. The code supports:

- **Joint training** of delta and temperature parameters
- **Ablation studies** to learn only delta or only temperature
- **Calibrated generation** using learned mappings
- **Standard Best-of-N sampling** and **beam search** variants
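
To make the two calibrated quantities concrete, the sketch below shows how a logit bias and a temperature reshape a next-token distribution. It assumes the bias is added to the raw logits before dividing by the temperature; the order used by the scripts in this repository may differ, and the function name `calibrated_probs` is hypothetical.

```python
import math


def calibrated_probs(logits, delta, temperature):
    """Return softmax((logits + delta) / temperature).

    Assumption: the bias is applied before temperature scaling.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [(l + d) / temperature for l, d in zip(logits, delta)]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
no_bias = [0.0, 0.0, 0.0]

base = calibrated_probs(logits, no_bias, 1.0)              # unmodified distribution
sharp = calibrated_probs(logits, no_bias, 0.5)             # lower T concentrates mass
boosted = calibrated_probs(logits, [0.0, 0.0, 2.0], 1.0)   # bias lifts the last token
```

Here a temperature below 1 concentrates probability on the top token, while a positive delta on the last token raises its probability to match the first.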

## Quick Start

### Environment Setup

- Python 3.11+ with CUDA support (tested: NVIDIA Driver 575.51.03, CUDA 12.9)
- Install dependencies: `pip install -r requirements.txt`

### Basic Usage

We provide two ready-to-use example scripts:

**1. Standard Best-of-N Sampling Pipeline:**
```bash
bash scripts/example_joint_pipeline.sh
```

**2. Beam Search Pipeline:**
```bash
bash scripts/example_joint_pipeline_beam.sh
```

Before running, edit the scripts to replace placeholder variables:
- `<model_name>`: your model identifier
- `<path_to_your_model>`: path to model weights
- `<path_to_your_config_yaml>`: configuration file path
- `<path_to_calibration_dataset_jsonl>`: training data path
- `<path_to_output_dir>`: output directory
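
As a quick sanity check before launching a pipeline, one could scan the edited script for placeholders that were left unfilled. The following is a minimal sketch; the helper name `unfilled_placeholders` and the shell variable names in the sample text are hypothetical and not taken from the example scripts.

```python
import re

# Matches angle-bracket placeholders such as <model_name>.
PLACEHOLDER = re.compile(r"<[A-Za-z_]+>")


def unfilled_placeholders(script_text):
    """Return the sorted, de-duplicated placeholders still present in the text."""
    return sorted(set(PLACEHOLDER.findall(script_text)))


# Hypothetical script fragment: one path has been filled in, two have not.
sample = (
    'MODEL_PATH="<path_to_your_model>"\n'
    'OUTPUT_DIR="/tmp/out"\n'
    'MODEL_NAME="<model_name>"\n'
)
remaining = unfilled_placeholders(sample)
print(remaining)
```

To check a real script, pass its contents instead of `sample`, for example `unfilled_placeholders(open("scripts/example_joint_pipeline.sh").read())`.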

### Configuration

Each script includes configurable parameters:

- `N1`, `K`: selection hyperparameters during calibration
- `CALIB_EPOCHS`, `CALIB_LR`: training settings
- `INIT_TEMP`, `WEIGHT_DECAY`: initialization and regularization

> **Note:** Given an inference budget $N$, we set $N_1 = N_2 = N/2$ and $k = N_1/4$ in all experiments.
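
The budget rule in the note can be written as a small helper. This is a sketch only: the function name `split_budget` is hypothetical, and requiring $N$ to be divisible by 8 (so that all three values are integers) is an assumption of the example, not a documented constraint.

```python
def split_budget(n):
    """Split a total inference budget N into (N1, N2, k).

    Follows N1 = N2 = N / 2 and k = N1 / 4.
    Assumption: N must be a positive multiple of 8 so every value is an integer.
    """
    if n <= 0 or n % 8 != 0:
        raise ValueError("n must be a positive multiple of 8")
    n1 = n // 2
    n2 = n // 2
    k = n1 // 4
    return n1, n2, k


n1, n2, k = split_budget(64)
print(n1, n2, k)  # 32 32 8
```

For a budget of 64, this gives `N1=32` and `K=8` for calibration, with the remaining 32 samples used at generation time.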

Default example hyperparameters used in our experiments:

- `CALIB_EPOCHS`: `100`
- `CALIB_LR`: `0.005`
- `INIT_TEMP`: `0.8`
- `WEIGHT_DECAY`: `1e-2`

### Ablation Studies

Enable ablations by uncommenting flags in the scripts:
- `--ablate_temperature=True`: learn delta only (fixed temperature)
- `--ablate_delta=True`: learn temperature only (no logit bias)

Use exactly one ablation flag for clean single-factor studies.

## Method Overview

The joint calibration pipeline consists of:

1. **Training Phase**: Learn per-prompt delta (logit bias) and temperature mappings using `joint_train_delta_temp.py`
2. **Conversion Phase**: Transform delta to bias format via `tools/delta_to_bias.py`
3. **Generation Phase**: Apply learned mappings during inference with `generate_with_temperature_and_bias.py` or `generate_with_temperature_and_bias_beam.py`
4. **Evaluation Phase**: Compute accuracy and merge results using provided utilities

## Key Arguments

**Training:**
- `--model_path`: path to model weights
- `--input_dataset_path`: calibration dataset (JSONL format)
- `--calib_epochs`, `--calib_lr`: training hyperparameters
- `--n1`, `--k`: selection and sampling controls
- `--ablate_temperature`, `--ablate_delta`: ablation flags

**Generation:**
- `--bias_file_path`, `--temperature_file_path`: learned mappings
- `--n2`, `--n`: generation sampling parameters

## File Structure

```
├── joint_train_delta_temp.py          # Main training script
├── generate_with_temperature_and_bias.py       # Standard Best-of-N generation
├── generate_with_temperature_and_bias_beam.py  # Beam search generation
├── tools/delta_to_bias.py             # Conversion utility
├── compute_accuracy.py                # Accuracy evaluation
├── scripts/
│   ├── example_joint_pipeline.sh      # Complete pipeline (standard)
│   └── example_joint_pipeline_beam.sh # Complete pipeline (beam search)
└── requirements.txt                   # Dependencies
```

