# Noise Contrastive Alignment of Language Models with Explicit Rewards

This codebase builds heavily on [trl](https://github.com/huggingface/trl) and [alignment-handbook](https://github.com/huggingface/alignment-handbook).

## 1. Creating the Dataset
To generate the dataset, run one of the two scripts below, depending on whether you need the preference dataset or the reward dataset:

```bash
python3 generate_dataset_preference.py
```

or

```bash
python3 generate_dataset_reward.py
```

## 2. Installing Custom trl Package
Navigate to the `trl` directory and install the custom `trl` package:

```bash
cd trl
pip install -e .
```
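
As a quick sanity check after the install, the sketch below prints the directory Python imports `trl` from. The helper name `trl_location` is hypothetical and not part of this repository. With an editable install, the printed path would be expected to point inside your local checkout rather than `site-packages`.

```bash
# Hypothetical helper (not part of the repo): show where `trl` is imported from.
# Run it from a directory outside the checkout, otherwise Python may pick up
# the local trl/ folder from the current directory regardless of the install.
trl_location() {
  python3 -c "import os, trl; print(os.path.dirname(trl.__file__))" 2>/dev/null \
    || echo "trl is not importable in this environment"
}

trl_location
```

If the path points at a stock `trl` release instead, the custom trainers described below will not be the ones that run.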

You can switch between algorithms by changing the value of `Training_Method`, which is defined at the top of the core algorithm files (`trl/trl/trainer/dpo_multi_trainer.py` or `trl/trl/trainer/dpo_trainer.py`).
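
To find where `Training_Method` is set before editing it, something like the following may help. It is a minimal sketch that assumes you are in the repository root with `trl/` at the top level. The function name `find_training_method` is made up for this example.

```bash
# Hypothetical helper (not part of the repo): print every line that mentions
# Training_Method in the two trainer files, prefixed with its line number.
find_training_method() {
  for f in trl/trl/trainer/dpo_multi_trainer.py trl/trl/trainer/dpo_trainer.py; do
    if [ -f "$f" ]; then
      grep -n "Training_Method" "$f" || echo "no match in $f"
    else
      echo "not found: $f (run this from the repository root)"
    fi
  done
}

find_training_method
```

Because the package is installed with `pip install -e .`, edits to these files take effect on the next run without reinstalling.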

## 3. Running Experiments

### Step 1: Install the Algorithm Script
First, from the repository root, install the algorithm script in the `NCA` directory:

```bash
cd NCA
pip install -e .
```

### Step 2: Configure Hyperparameters
Configure the hyperparameters in the corresponding YAML files located under `NCA/recipes`.
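
To see which YAML files are available to edit, a listing like the one below can be a starting point. This is only a sketch: it assumes the current directory is `NCA/`, and the helper name `list_recipes` is hypothetical.

```bash
# Hypothetical helper (not part of the repo): list every YAML file under
# recipes/, sorted by path, so you can pick the config you want to edit.
list_recipes() {
  if [ -d recipes ]; then
    find recipes -type f -name '*.yaml' | sort
  else
    echo "no recipes/ directory here (run this from NCA/)"
  fi
}

list_recipes
```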

### Step 3: Start Experiment Run
From the `NCA` directory, start the experiment with the following command:

```bash
CUDA_VISIBLE_DEVICES=0 NCCL_P2P_DISABLE=1 ACCELERATE_LOG_LEVEL=info \
accelerate launch \
  --config_file recipes/accelerate_configs/multi_gpu.yaml \
  --num_processes=1 \
  --main_process_port=7000 \
  scripts/run_dpo_multi.py \
  recipes/zephyr-7b-beta/dpo_multi/config_qlora.yaml
```
```
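
The command above pins the run to a single GPU. If you want to try more GPUs, the usual pattern with `accelerate launch` is to list the devices in `CUDA_VISIBLE_DEVICES` and set `--num_processes` to the same count. The sketch below only prints the resulting command so you can inspect it first. `nca_launch_cmd` is a hypothetical helper, and whether the recipe needs further changes for multi-GPU runs (for example batch size) has not been verified here.

```bash
# Hypothetical helper (not part of the repo): print the launch command for a
# comma-separated list of GPU ids, using one process per listed GPU.
# Assumes it will be run from the NCA directory.
nca_launch_cmd() {
  gpus="$1"                                                # e.g. "0" or "0,1,2,3"
  nproc=$(echo "$gpus" | tr ',' '\n' | wc -l | tr -d ' ')  # count the listed GPUs
  echo "CUDA_VISIBLE_DEVICES=$gpus NCCL_P2P_DISABLE=1 ACCELERATE_LOG_LEVEL=info" \
       "accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml" \
       "--num_processes=$nproc --main_process_port=7000" \
       "scripts/run_dpo_multi.py recipes/zephyr-7b-beta/dpo_multi/config_qlora.yaml"
}

nca_launch_cmd "0,1"   # prints the command only; copy and run it once it looks right
```

Called with `"0"`, the helper reproduces the single-GPU command shown above.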
