# Details of Experimental Code

## Python Environment

All experiments were run in an Anaconda environment with the following packages:

```
- python=3.10
- pytorch=2.5.1
- pytorch-cuda=12.4
- torchaudio=2.5.1
- torchvision=0.20.1
- transformers=4.51.1
- accelerate=1.6.0
- datasets=3.5.0
- peft=0.15.1
- trl=0.16.1
- bitsandbytes=0.45.5
- wandb=0.19.9
- pandas=2.2.3
- numpy=2.0.1
```
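
To confirm that an installed environment matches the pinned versions above, a small check along these lines may help. It is a sketch and not part of the repository: `PINNED` covers only a subset of the list, `find_mismatches` is an illustrative helper name, and it assumes the conda package `pytorch` is visible to Python as the distribution `torch`.

```python
from importlib.metadata import PackageNotFoundError, version

# Subset of the pinned versions listed above (distribution name -> version).
# Assumption: conda's "pytorch" package appears as the "torch" distribution.
PINNED = {
    "torch": "2.5.1",
    "transformers": "4.51.1",
    "peft": "0.15.1",
    "trl": "0.16.1",
}


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {name: (expected, found)} for packages that differ from the pins."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Drop local build tags such as "+cu124" before comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.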

## Data Generation (`data_generation_gpt.py`)

### Usage
```bash
python data_generation_gpt.py --w 0.6 --openai_key "" --endpoint ""
```
This script generates the dataset given the value of *w* (the true population share of the first group) and the group context prompt.

### Argument list
- `w`: True population share of group 1 (default: 0.6)
- `batch_size`: Batch size for processing (default: 16)
- `output_dir`: Output directory for the dataset (default: "datasets/llama/")
- `sample_interval`: How often to print samples for quality checking (default: 50)
- `max_samples`: Maximum number of samples to generate (default: 10000)
- `openai_key`: OpenAI API key, passed as in the usage example above
- `endpoint`: API endpoint, passed as in the usage example above
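
The documented arguments could be declared with `argparse` roughly as follows. This is a sketch and not the script's actual parser: the argument types, the empty-string defaults for `--openai_key` and `--endpoint`, and the range check on `--w` are assumptions based on the descriptions above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="Generate the preference dataset.")
    parser.add_argument("--w", type=float, default=0.6,
                        help="True population share of group 1")
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--output_dir", type=str, default="datasets/llama/")
    parser.add_argument("--sample_interval", type=int, default=50)
    parser.add_argument("--max_samples", type=int, default=10000)
    # Shown in the usage example; their defaults are not documented, so "" is assumed.
    parser.add_argument("--openai_key", type=str, default="")
    parser.add_argument("--endpoint", type=str, default="")
    return parser


def validate(args):
    """Reject values of w that cannot be a population share."""
    # Assumption: w is a share, so it must lie in [0, 1].
    if not 0.0 <= args.w <= 1.0:
        raise ValueError(f"--w must be in [0, 1], got {args.w}")
    return args


if __name__ == "__main__":
    args = validate(build_parser().parse_args())
    print(vars(args))
```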

## Training (`train.py`)

### Usage
```bash
python train.py --config config_train.yaml
```

This script implements the main training pipeline for DPO and the proposed algorithm. The configuration file (`config_train.yaml`) specifies all hyperparameters, the path to the input dataset, and the path where the output model is saved. The loss functions for the mu and policy models are implemented in `compute_mu_loss()` and `compute_pi_loss()`, respectively, in `trainer.py`.
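
For orientation, the standard DPO objective for a single preference pair can be sketched as below. This is the textbook DPO loss written in plain Python, not a copy of `compute_pi_loss()` or `compute_mu_loss()`, whose exact form (and the loss of the proposed algorithm) is defined only in `trainer.py`. The function name `dpo_loss` and the default `beta=0.1` are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), evaluated in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # When policy and reference agree, the loss equals log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

The loss falls below log(2) once the policy prefers the chosen response more strongly than the reference does, and rises above it in the opposite case.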

## Evaluation (`evaluate.py`)

### Usage
```bash
python evaluate.py --config config_eval.yaml
```

This script evaluates two metrics, win rate and PPR (α), given the policy model and the prompt dataset. The configuration file (`config_eval.yaml`) specifies the paths to the policy model and the prompt dataset.

To evaluate a different type of group, change the commented line in both `compute_winning_rate()` and `compute_PPR()` in `compute_metrics.py`.

Note that a Hugging Face API token with access to the Llama-3.1-8B-Instruct model is required.
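
A quick pre-flight check can catch a missing token before any model download starts. The sketch below assumes the token is supplied through the `HF_TOKEN` environment variable, which Hugging Face libraries read by default. How `evaluate.py` itself obtains the token is not documented here, and `get_hf_token` is a hypothetical helper.

```python
import os


def get_hf_token(env=None):
    """Return the Hugging Face token from the environment, or raise a clear error."""
    # `env` can be any mapping; it defaults to the process environment.
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; a token with access to "
            "Llama-3.1-8B-Instruct is required."
        )
    return token


if __name__ == "__main__":
    get_hf_token()
    print("Hugging Face token found.")
```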

## Synthetic Experiments

### Synthetic Training (`synthetic_train.py`)

#### Usage
```bash
python synthetic_train.py --config synthetic_config_train.yaml
```

This script implements training for synthetic preference learning experiments using color preference data. It supports both DPO (Direct Preference Optimization) and the proposed algorithm on synthetic datasets.

#### Key Features
- Trains on synthetic color preference datasets
- Supports multiple training algorithms (DPO, proposed method)
- Uses LoRA (Low-Rank Adaptation) for efficient fine-tuning
- Configurable through YAML configuration files
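
To illustrate why LoRA makes fine-tuning cheaper, the sketch below compares the number of trainable parameters in a full weight matrix with that of its low-rank update. The layer size and rank are illustrative values, not taken from `synthetic_config_train.yaml`.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters in the LoRA pair A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 2048, 2048, 8  # illustrative sizes only
    full = full_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(full, lora, lora / full)  # 4194304 32768 0.0078125
```

With these sizes the adapter trains under 1% of the parameters of the layer it modifies.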

### Synthetic Evaluation (`synthetic_evaluate.py`)

#### Usage
```bash
python synthetic_evaluate.py --model_path saved_models/synthetic_dpo/beta_0.1
```

This script evaluates trained models on synthetic preference tasks by computing token probabilities for color preferences and generating detailed policy distributions.

#### Argument list
- `model_path`: Path to the saved LoRA adapter (optional; if omitted, only the base model is used)
- `base_model`: Base model name (default: 'Qwen/Qwen2.5-3B-Instruct')
- `prompt_file`: Path to prompt file (default: 'datasets/synthetic/color_prompt.json')
- `output`: Output results file (default: 'synthetic_evaluation_results.json')
- `temperature`: Temperature for probability computation (default: 1.0)

#### Key Features
- Computes token probabilities for each possible answer
- Prints individual question policies and logits
- Computes averaged policy across all questions
- Saves detailed evaluation results with per-question breakdowns
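
The probability and averaging steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the script's implementation: it assumes the policy is a temperature-scaled softmax restricted to the candidate answer tokens, and the color names and logit values are made up.

```python
import math


def answer_policy(logits, temperature=1.0):
    """Softmax over candidate-answer logits, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {answer: e / total for answer, e in zip(logits, exps)}


def average_policy(policies):
    """Average per-question distributions that share the same answer set."""
    answers = policies[0].keys()
    return {a: sum(p[a] for p in policies) / len(policies) for a in answers}


if __name__ == "__main__":
    # Hypothetical logits for two questions with the same candidate answers.
    per_question = [
        answer_policy({"red": 2.0, "blue": 1.0}),
        answer_policy({"red": 0.5, "blue": 1.5}),
    ]
    print(average_policy(per_question))
```

A temperature above 1.0 flattens each per-question distribution, while a temperature below 1.0 sharpens it.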
