# Bootstrapping Zero-Shot Reasoning via Advantage-Weighted Self-Distillation (AWDPO)

This repository contains the official code for the paper: "Bootstrapping Zero-Shot Reasoning in Small Language Models via Advantage-Weighted Self-Distillation."

## Description

This project introduces **AWDPO**, a lightweight alignment method that bridges the gap between few-shot and zero-shot reasoning in small language models. Unlike prior approaches, AWDPO formulates training as a single-pass preference optimization objective that aligns a model’s zero-shot distribution with its own few-shot behavior. This allows small models (0.5B–3B) to achieve strong mathematical reasoning capabilities from minimal supervision.

---

## Installation

To set up the necessary environment, follow these steps. This project uses Conda for environment management and pip for package installation.

1.  **Download the code and navigate to the repository directory:**
    ```bash
    cd <repository-folder>
    ```

2.  **Create and activate the Conda environment:**
    ```bash
    conda create -n awdpo_env python=3.10
    conda activate awdpo_env
    ```

3.  **Install required packages:**
    ```bash
    pip install -r requirements.txt
    ```
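
To confirm that the installation succeeded, a quick import check along these lines may help. This is a minimal sketch and not part of the repository: the helper name `check_core_deps` is hypothetical, and it assumes the core dependencies (PyTorch, Transformers, vLLM, PEFT) are importable as `torch`, `transformers`, `vllm`, and `peft`, and that `python` resolves to the interpreter of the active Conda environment.

```bash
# Hypothetical helper (not shipped with the repository): report whether each
# core dependency can be imported by the active Python interpreter.
check_core_deps() {
    local pkg
    for pkg in torch transformers vllm peft; do
        if python -c "import ${pkg}" >/dev/null 2>&1; then
            echo "${pkg}: OK"
        else
            echo "${pkg}: MISSING"
        fi
    done
}

check_core_deps
```

Any package reported as `MISSING` suggests re-running the `pip install` step inside the activated environment.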

---

## Running the Code

We provide simple shell scripts to replicate the main experiments from our paper. All scripts should be run from the directory in which they are located.
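
If you launch scripts from elsewhere (for example from a job wrapper), one way to respect this requirement is a small helper that changes into the script's directory first. This is a sketch under assumptions: `run_from_script_dir` is a hypothetical name and is not provided by the repository.

```bash
# Hypothetical helper: run a script from the directory that contains it.
# The subshell keeps the caller's working directory unchanged.
run_from_script_dir() {
    local script="$1"
    (
        cd "$(dirname "$script")" || exit 1
        bash "$(basename "$script")"
    )
}

# Example call (the path is a placeholder for wherever the script lives):
# run_from_script_dir path/to/awdpo_mle_qwen-05B.sh
```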

### Training

To train a model using AWDPO or one of the baseline methods, use the provided scripts. Before running a script, modify the file paths inside it to point to your data and model directories.

* **Train with AWDPO:**
    ```bash
    bash awdpo_mle_qwen-05B.sh
    ```
* **Train with Filtered-DPO:**
    ```bash
    bash awdpo_mle_qwen-05B_filtered.sh
    ```
* **Train with SFT (Supervised Fine-Tuning):**
    ```bash
    bash sft_qwen-05B.sh
    ```
* **Train with DPO:**
    ```bash
    bash dpo_qwen-05B.sh
    ```
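
Before launching any of the training scripts, it can be useful to list the lines that appear to set paths, so you know what to edit. The sketch below assumes that the relevant lines contain `path` or `dir` (case-insensitive); the actual variable and flag names in the scripts may differ, and `list_path_lines` is a hypothetical helper.

```bash
# Hypothetical helper: print line-numbered lines of a script that mention
# "path" or "dir" (case-insensitive), as candidates for editing.
list_path_lines() {
    grep -n -i -E 'path|dir' "$1"
}

# Example call:
# list_path_lines awdpo_mle_qwen-05B.sh
```

If the helper prints nothing, the script likely uses different names for its paths, and it is best to read it directly.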

### Evaluation

To evaluate a trained model checkpoint, run the corresponding evaluation script, for example:

```bash
bash eval_awdpo_qwen-0_5B.sh
```

---

## Environment

All experiments were conducted on a SLURM-managed high-performance computing (HPC) cluster. The compute nodes were equipped with NVIDIA A100 GPUs and ran **Ubuntu 22.04 LTS**.
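
For readers on a similar cluster, the following sketch generates a minimal SLURM batch file that wraps one of the provided scripts. It is an illustration only: `make_sbatch_file` is a hypothetical helper, and every `#SBATCH` value (job name, GPU count, time limit) is a placeholder, not a setting taken from the paper.

```bash
# Hypothetical helper: write a minimal SLURM batch file that runs a script.
# All #SBATCH values are placeholders; adjust them to your cluster.
make_sbatch_file() {
    local script="$1" out="$2"
    {
        echo '#!/bin/bash'
        echo '#SBATCH --job-name=awdpo'
        echo '#SBATCH --gres=gpu:1'
        echo '#SBATCH --time=01:00:00'
        echo "bash ${script}"
    } > "$out"
}

make_sbatch_file awdpo_mle_qwen-05B.sh awdpo_job.sbatch
cat awdpo_job.sbatch
# Submit from the directory containing the script with: sbatch awdpo_job.sbatch
```

By default, a SLURM job starts in the directory from which it was submitted, so submitting from the script's own directory keeps the scripts running from where they are located. You will likely also need to activate the Conda environment inside the batch file.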

The core software dependencies include:
- PyTorch
- Transformers
- vLLM
- PEFT

A complete list of Python packages and their versions is provided in `requirements.txt`.