# Differential Information Distribution: A Bayesian Perspective on Direct Preference Optimization

This repository contains supplementary materials required to reproduce the experiments and results presented in the paper, ***Differential Information Distribution: A Bayesian Perspective on Direct Preference Optimization***.

**Contents:**  
- **Energy-Based Model (EBM) Experiment Notebooks**: code + instructions for the EBM experiments.  
- **Magpie‑G27 Data Generation**: scripts to build the Magpie-G27 instruction-following preference dataset.  
- **Evaluation Results**: raw evaluation results for the DPO vs. DPO‑PG comparison in Section 5.
- **DPO-PG Implementation**: PyTorch implementation of the DPO-PG algorithm.

*Full model‑training scripts, evaluation pipelines, and model checkpoints will be publicly released upon paper acceptance.*

---

## 1. EBM Experiments

The following interactive Jupyter notebooks reproduce our controlled EBM experiments.
* **Path:**
    ```bash
    ./EBM_Exps/Sec3_Exps.ipynb  # EBM Experiments in Section 3 (Figures 1 and 4)
    ./EBM_Exps/Sec4_Exps.ipynb  # EBM Experiments in Section 4 (Figures 2 and 3)
    ./EBM_Exps/App_Exps.ipynb   # EBM Experiment in Appendix (Figure 7)
    ```
* **Usage:**

  1. Open the notebook in a Google Colab CPU environment.
  2. Execute all cells sequentially.
  3. Results and plots will appear inline.


## 2. Magpie-G27 Dataset Generation
To replicate our Magpie‑G27 instruction-following dataset (Appendix J.2):

1. Navigate to the data generation folder:

   ```bash
   cd ./MagpieG27/
   ```

2. Set up a virtual environment and install the requirements:

    ```bash
    python -m venv venv
    source venv/bin/activate
    python -m pip install --upgrade pip
    python -m pip install -r requirements.txt
    ```

3. Run the generation script:

   ```bash
   chmod +x ./generate_data.sh
   bash ./generate_data.sh
   ```

4. Generated files will appear under:

   ```bash
   MagpieG27/generations.json         # Sampled Responses
   MagpieG27/annotated_results.json   # Reward Annotated Results
   ```

## 3. Evaluation Results

We provide detailed results for each evaluation benchmark (Table 2 and Figure 12):

* **Arena‑Hard-v0.1**

  ```
  ./Real_Exps/ArenaHard/
  ```
* **WildBench-v2**

  ```
  ./Real_Exps/WildBench/
  ```
* **Knowledge‑Intensive Question Answering**

  ```
  ./Real_Exps/QA/
  ```
  We use the following metric for each sub-task:
    ```json
    {
        "arc_challenge": "acc,none",
        "arc_easy": "acc,none",
        "boolq": "acc,none",
        "gsm8k": "exact_match,flexible-extract",
        "hellaswag":"acc_norm,none",
        "mmlu":"acc,none",
        "piqa":"acc,none",
        "social_iqa":"acc,none"
    }
    ```
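
The mapping above can be used to pull one headline score per sub-task out of a loaded results dictionary. The sketch below assumes that a results file, once parsed, maps each task name to a dictionary of metric values keyed by strings such as `acc,none`. The actual layout of the files under `./Real_Exps/QA/` is not described here, and `extract_scores` is a hypothetical helper name, not part of this repository.

```python
# Metric key reported for each sub-task (copied from the mapping above).
METRIC_KEYS = {
    "arc_challenge": "acc,none",
    "arc_easy": "acc,none",
    "boolq": "acc,none",
    "gsm8k": "exact_match,flexible-extract",
    "hellaswag": "acc_norm,none",
    "mmlu": "acc,none",
    "piqa": "acc,none",
    "social_iqa": "acc,none",
}


def extract_scores(results, metric_keys=METRIC_KEYS):
    """Return {task: score} for every task whose chosen metric is present.

    `results` is ASSUMED to look like {task_name: {metric_key: value, ...}}.
    Tasks or metrics that are missing are skipped rather than raising.
    """
    scores = {}
    for task, key in metric_keys.items():
        task_results = results.get(task)
        if task_results is not None and key in task_results:
            scores[task] = task_results[key]
    return scores


# Placeholder numbers for illustration only; these are not results from the paper.
example_results = {
    "boolq": {"acc,none": 0.5, "acc_stderr,none": 0.01},
    "gsm8k": {"exact_match,flexible-extract": 0.25, "exact_match,strict-match": 0.2},
}
example_scores = extract_scores(example_results)
```

If the files nest this mapping under another key, pass that sub-dictionary to `extract_scores` instead of the top-level object.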

## 4. DPO-Projected Gradient (DPO-PG) Implementation

We provide a reference PyTorch implementation of the DPO-PG algorithm (full details in Appendix H).

### DPO‑Projected Gradient (DPO‑PG)

  $$ \theta_{k+1} = \theta_k - \eta \biggl(\nabla L(y_w) - \frac{\max(0, \nabla L(y_w)\cdot\nabla L(y_l))}{(\nabla L(y_l)\cdot\nabla L(y_l))}\,\nabla L(y_l)\biggr) $$

where:

- $\theta_k$ is the model parameter at step $k$.  
- $\eta > 0$ is the step size (learning rate).  
- $L(y)$ is the negative log-likelihood (NLL) loss of $y$.
- $y_w$ is the chosen sample.
- $y_l$ is the rejected sample.
- $\nabla L(y)$ is the gradient of the NLL loss of $y$ with respect to $\theta_k$.
- $\nabla L(y_w)\cdot \nabla L(y_l)$ is the dot product between $\nabla L(y_w)$ and $\nabla L(y_l)$.
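
To make the arithmetic of the update rule concrete, here is a minimal pure-Python sketch that applies it to small gradient vectors stored as lists of floats. It is an illustration of the formula only, not the code in `dpo_pg.py`. The function names and the `eps` guard against a zero rejected gradient are assumptions added for this example.

```python
def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def dpo_pg_direction(grad_chosen, grad_rejected, eps=1e-12):
    """Compute grad_chosen - max(0, g_w . g_l) / (g_l . g_l) * grad_rejected.

    `eps` is an ASSUMED safeguard: if the rejected gradient is (numerically)
    zero, there is nothing to project out and grad_chosen is returned as-is.
    """
    overlap = max(0.0, dot(grad_chosen, grad_rejected))
    sq_norm = dot(grad_rejected, grad_rejected)
    if sq_norm <= eps:
        return list(grad_chosen)
    coeff = overlap / sq_norm
    return [gw - coeff * gl for gw, gl in zip(grad_chosen, grad_rejected)]


def dpo_pg_step(theta, grad_chosen, grad_rejected, lr):
    """One plain gradient step: theta_{k+1} = theta_k - lr * direction."""
    direction = dpo_pg_direction(grad_chosen, grad_rejected)
    return [t - lr * d for t, d in zip(theta, direction)]


# Case 1: the gradients point in a similar direction (positive dot product),
# so the component of the chosen gradient along the rejected one is removed.
aligned = dpo_pg_direction([1.0, 1.0], [1.0, 0.0])

# Case 2: the gradients conflict (negative dot product), so max(0, .) is zero
# and the chosen gradient is used unchanged.
opposed = dpo_pg_direction([1.0, 1.0], [-1.0, 0.0])

new_theta = dpo_pg_step([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], lr=0.1)
```

In the first case the resulting direction is orthogonal to $\nabla L(y_l)$, so to first order the step lowers $L(y_w)$ without changing $L(y_l)$. In the second case the projection term vanishes and the step is an ordinary gradient step on $L(y_w)$.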


The provided implementation can be used in both single-GPU and multi-GPU (*e.g.*, FSDP) training environments.
It is also compatible with gradient accumulation and gradient clipping (see step 2 under **Usage**).
* **Path:**
    ```bash
    ./dpo_pg.py
    ```
* **Usage:**

  1. Set up the model and optimizers as follows:
      ```python
      import torch
      from torch.optim import AdamW
      from dpo_pg import DPOPG_Optimizer
      from transformers import AutoModelForCausalLM, get_scheduler

      model = AutoModelForCausalLM.from_pretrained(...)

      # Apply FSDP if enabled.
      # `fsdp_enabled` and `apply_fsdp_sharding` are not defined here; supply your own.
      if fsdp_enabled:
          model = apply_fsdp_sharding(model)

      # 'model' is your PyTorch nn.Module
      dpopg_optimizer = DPOPG_Optimizer(
          params=model.parameters(),
          dtype=torch.bfloat16,
          fsdp_enabled=fsdp_enabled,  # Set to True if using FSDP or Multi-GPU
          main_device=model.device,
      )

      # Optimizer for Parameter Update
      adam_optimizer = AdamW(model.parameters(), lr=1e-6)
      lr_scheduler   = get_scheduler(name='constant_with_warmup', optimizer=adam_optimizer, ...)
      ```
  2. Modify your training loop as follows:
      ```python
      # --- Compute Gradient of L(y_w) ---
      model.zero_grad()
      chosen_logps = compute_chosen_logps(model, batch) 
      loss_chosen = (-chosen_logps).mean() # back-prop NLL loss of chosen samples
      loss_chosen.backward()
      dpopg_optimizer.update_chosen_grad()
      
      # --- Compute Gradient of L(y_l) ---
      model.zero_grad()
      rejected_logps = compute_rejected_logps(model, batch)
      loss_rejected = (-rejected_logps).mean() # back-prop NLL loss of rejected samples
      loss_rejected.backward()
      dpopg_optimizer.update_rejected_grad()
      model.zero_grad()

      ...
      
      # --- Optimizer Step ---
      if (training_steps % grad_accumulation_steps) == 0:
          optim_metrics = dpopg_optimizer.set_gradients()

          # Gradient Clipping
          if fsdp_enabled:
              grad_norm = model.clip_grad_norm_(max_grad_norm).item()
          else:
              grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm).item()

          # Accumulate metrics
          for k, v in optim_metrics.items():
              batch_metrics[k].extend(v)

          adam_optimizer.step()
          lr_scheduler.step()
      ```
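
The loop above calls `compute_chosen_logps` and `compute_rejected_logps`, which this README does not define. The sketch below shows one way such a helper could compute per-sequence log-probabilities from next-token logits. The function name `sequence_logps`, the `-100` label convention for masking prompt and padding tokens, and the choice to sum rather than average over tokens are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def sequence_logps(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Sum of token log-probabilities for each sequence in the batch.

    logits: (batch, seq_len, vocab) next-token logits from a causal LM.
    labels: (batch, seq_len) target token ids; positions equal to
            `ignore_index` (ASSUMED to mark prompt/padding) are skipped.
    Returns a tensor of shape (batch,).
    """
    # The logits at position t predict the token at position t + 1.
    shifted_logits = logits[:, :-1, :]
    shifted_labels = labels[:, 1:]

    # Replace ignored labels with a valid index so gather() does not fail;
    # their contribution is zeroed out by the mask afterwards.
    mask = shifted_labels != ignore_index
    safe_labels = shifted_labels.masked_fill(~mask, 0)

    log_probs = F.log_softmax(shifted_logits.float(), dim=-1)
    token_logps = torch.gather(log_probs, dim=2, index=safe_labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(dim=-1)
```

Under these assumptions, `compute_chosen_logps(model, batch)` could run a forward pass on the chosen sequences and return `sequence_logps(outputs.logits, labels)`, with the rejected variant doing the same on the rejected sequences.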