<p align="center">
  <img src="https://raw.githubusercontent.com/huggingface/alignment-handbook/main/assets/handbook.png">
</p>

<p align="center">
    🤗 <a href="https://huggingface.co/collections/alignment-handbook/handbook-v01-models-and-datasets-654e424d22e6880da5ebc015" target="_blank">Models & Datasets</a> | 📃 <a href="https://arxiv.org/abs/2310.16944" target="_blank">Technical Report</a>
</p>

# Mitigating Spurious Correlation through Anti-Causal Learning

Robust recipes to continue pretraining and to align language models with human and AI preferences.

## Introduction

This is the codebase for our paper. It is built on the Alignment Handbook and incorporates [EleutherAI/sparsify](https://github.com/EleutherAI/sparsify), a clean codebase for loading, training, and fine-tuning Sparse Autoencoders (SAEs).

The Alignment Handbook is an open-source repository that consolidates modern techniques, frameworks, and best practices for aligning large language models (LLMs) with human values and intent. It provides modular implementations of key components such as supervised fine-tuning (SFT), preference modeling, reinforcement learning (e.g., PPO, DPO), and safety evaluation. Designed for research and experimentation, the handbook emphasizes clarity, reproducibility, and extensibility, making it a practical starting point for developing and comparing alignment methods across tasks and models.

Meanwhile, [sparsify](https://github.com/EleutherAI/sparsify) is a library for training _k_-sparse autoencoders (SAEs) and transcoders on the activations of Hugging Face language models, roughly following the recipe detailed in [Scaling and evaluating sparse autoencoders](https://arxiv.org/abs/2406.04093v1) (Gao et al. 2024). This is a lean, simple library with few configuration options. Unlike most other SAE libraries (e.g. [SAELens](https://github.com/jbloomAus/SAELens)), it does not cache activations on disk, but rather computes them on-the-fly. This allows us to scale to very large models and datasets with zero storage overhead, but has the downside that trying different hyperparameters for the same model and dataset will be slower than if we cached activations (since activations will be re-computed). We may add caching as an option in the future.
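
To make the _k_-sparse idea concrete, here is a minimal pure-Python sketch of a TopK autoencoder forward pass. It is not sparsify's implementation: the function names and toy weights are assumptions made for illustration, and details such as where the ReLU is applied differ between implementations.

```python
# Illustrative sketch of a k-sparse (TopK) autoencoder forward pass.
# Function names and all weights are hypothetical toy values.


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def topk_encode(x, w_enc, b_enc, b_dec, k):
    """Keep the k largest pre-activations and zero out the rest."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    # Indices of the k largest pre-activations.
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    # ReLU on the kept values so the latents stay non-negative.
    return [max(p, 0.0) if i in keep else 0.0 for i, p in enumerate(pre_acts)]


def decode(latents, w_dec, b_dec):
    """Reconstruct the input as a sparse combination of decoder rows."""
    recon = list(b_dec)
    for z, row in zip(latents, w_dec):
        if z != 0.0:
            recon = [r + z * w for r, w in zip(recon, row)]
    return recon


# Toy setup: 2-dimensional input, 4 latents, decoder tied to the encoder.
toy_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
toy_latents = topk_encode([3.0, -1.0], toy_weights, [0.0] * 4, [0.0, 0.0], k=2)
toy_recon = decode(toy_latents, toy_weights, [0.0, 0.0])
```

Only `k` latents are non-zero for each input, which is what makes the learned code sparse regardless of the dictionary size.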

## How to navigate this project 🧭

This project is simple by design and mostly consists of:

* [`scripts`](./scripts/) to train and evaluate models. Four steps are included: continued pretraining, supervised-finetuning (SFT) for chat, preference alignment with DPO, and supervised-finetuning with preference alignment with ORPO. Each script supports distributed training of the full model weights with DeepSpeed ZeRO-3, or LoRA/QLoRA for parameter-efficient fine-tuning.
* [`recipes`](./recipes/) to reproduce models like Zephyr 7B. Each recipe takes the form of a YAML file which contains all the parameters associated with a single training run.
* [`sparsify`](https://github.com/EleutherAI/sparsify) is a direct copy of the original EleutherAI/sparsify library.


To get started, we recommend the following:

1. Follow the [installation instructions](#installation-instructions) to set up your environment.
2. Follow the [downloading instructions](#downloading-instructions) to download the LLMs or SAEs.
3. Follow the [training instructions](#training-instructions) to either post-train a language model or train/fine-tune an SAE.

If you would like to train chat models on your own datasets, we recommend following the dataset formatting instructions [here](./scripts/README.md#fine-tuning-on-your-datasets).


## Contents

The initial release of the handbook will focus on the following techniques:

* **Continued pretraining:** adapt language models to a new language or domain, or simply improve it by continued pretraining (causal language modeling) on a new dataset.
* **Supervised fine-tuning:** teach language models to follow instructions, with tips on how to collect and curate your training dataset.
* **Reward modeling:** teach language models to distinguish model responses according to human or AI preferences.
* **Rejection sampling:** a simple, but powerful technique to boost the performance of your SFT model.
* **Direct preference optimisation (DPO):** a powerful and promising alternative to PPO.
* **Odds Ratio Preference Optimisation (ORPO)**: a technique to fine-tune language models with human preferences, combining SFT and DPO in a single stage.
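
As a rough illustration of the DPO objective listed above, the sketch below computes the loss for a single preference pair from summed sequence log-probabilities. It is a standalone example, not the training code used by the scripts in this repository; the function name and the `beta=0.1` default are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(beta * margin)) equals softplus(-beta * margin);
    # the form below avoids overflow for large arguments.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# When the policy matches the reference, the margin is zero.
loss_at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# When the policy shifts probability toward the chosen response, the loss drops.
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss depends only on log-probability differences between the policy and a frozen reference model, which is why DPO needs no separate reward model or sampling loop.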

## Installation instructions

To run the code in this project, first, create a Python virtual environment using e.g. Conda:

```shell
conda create -n handbook python=3.10 && conda activate handbook
```

Next, install PyTorch `v2.1.2` - the precise version is important for reproducibility! Since this is hardware-dependent, we
direct you to the [PyTorch Installation Page](https://pytorch.org/get-started/locally/).

You can then install the remaining package dependencies as follows:

```shell
git clone https://github.com/huggingface/alignment-handbook.git
cd ./alignment-handbook/
python -m pip install .
```

You will also need Flash Attention 2 installed, which can be done by running:

```shell
python -m pip install flash-attn --no-build-isolation
```

> **Note**
> If your machine has less than 96GB of RAM and many CPU cores, reduce the `MAX_JOBS` environment variable, e.g. `MAX_JOBS=4 pip install flash-attn --no-build-isolation`

Next, log into your Hugging Face account as follows:

```shell
huggingface-cli login
```

Finally, install Git LFS so that you can push models to the Hugging Face Hub:

```shell
sudo apt-get install git-lfs
```

## Downloading Instructions

To download language models, navigate to [`model_download`](./model_download/). We currently support downloading Llama3-8B, Qwen2-1.5b, and Pythia70m. Llama3-8B is downloaded from the official Hugging Face website by default, while Qwen and Pythia are downloaded from a mirror site 🪞. For example, to download Llama3-8B, run:

```shell
bash model_download/download_llama3_8b.sh
```

The downloaded model will be saved under [`models`](./models/).

For details about loading a pretrained SAE, navigate to [`sae_download`](./sae_download/). The usage follows the official sample code in sparsify. For example, to download the pretrained SAE for the layer 10 residual stream of Llama3-8B, first make sure you have already downloaded Llama3-8B, then run:


```shell
python sae_download/sae_load.py
```
This particular SAE is all we need for now; we will extend the download procedure to other SAEs if necessary.
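
Because SAE loading assumes the base model is already on disk, a quick sanity check before running the script can save a failed run. The sketch below is a hypothetical helper, not part of this repository. It relies only on the fact that Hugging Face model directories contain a `config.json` file, and the directory passed in is a placeholder to replace with your own path.

```python
from pathlib import Path


def looks_like_hf_model_dir(model_dir):
    """Heuristic check: a Hugging Face model directory contains config.json."""
    path = Path(model_dir)
    return path.is_dir() and (path / "config.json").is_file()


# Placeholder path (assumption): point this at your downloaded model folder.
model_dir = "path/to/your/downloaded/model"
if not looks_like_hf_model_dir(model_dir):
    print(f"No model found at {model_dir}; download the base model first.")
```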


## Training instructions

You can now check out the `scripts` and `recipes` directories for instructions on how to train some models 🪁!

For example, to SFT a model you may run:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch \
    --config_file recipes/accelerate_configs/deepspeed_zero3.yaml \
    scripts/run_sft.py \
    recipes/qwen2-1.5b/sft/config_full.yaml
```

Similarly, to apply DPO, you may run:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch \
    --config_file recipes/accelerate_configs/deepspeed_zero3.yaml \
    scripts/run_dpo.py \
    recipes/qwen2-0.5b/dpo/config_full.yaml
```
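
If you launch many runs, it can help to assemble these commands programmatically. The following is a minimal sketch under that assumption: `build_launch_command` is a hypothetical helper, not something shipped in this repository, and it only reuses the `accelerate launch --config_file` invocation and the paths shown in the commands above.

```python
import os
import shlex
import subprocess

DEFAULT_ACCELERATE_CONFIG = "recipes/accelerate_configs/deepspeed_zero3.yaml"


def build_launch_command(script, recipe, accelerate_config=DEFAULT_ACCELERATE_CONFIG):
    """Return the argument list for launching one training run with accelerate."""
    return ["accelerate", "launch", "--config_file", accelerate_config, script, recipe]


def run_training(script, recipe):
    """Launch a run with the same log level as the shell examples."""
    env = {**os.environ, "ACCELERATE_LOG_LEVEL": "info"}
    subprocess.run(build_launch_command(script, recipe), env=env, check=True)


# Build (but do not execute) the SFT command and render it as a shell string.
sft_command = build_launch_command(
    "scripts/run_sft.py", "recipes/qwen2-1.5b/sft/config_full.yaml"
)
sft_command_text = shlex.join(sft_command)
```

Keeping the command as a list of arguments avoids shell-quoting problems when recipe paths contain spaces; `shlex.join` is only used to display it.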

## Project structure

```
├── LICENSE
├── Makefile                    <- Makefile with commands like `make style`
├── README.md                   <- The top-level README for developers using this project
├── chapters                    <- Educational content to render on hf.co/learn
├── model_download              <- Functions and scripts for downloading models
├── models                      <- Directory for storing downloaded models
├── recipes                     <- Recipe configs, accelerate configs, slurm scripts
├── sae                         <- Directory for storing downloaded SAEs
├── sae_download                <- Functions and scripts for loading pretrained SAEs
├── scripts                     <- Scripts to train and evaluate chat models
├── setup.cfg                   <- Installation config (mostly used for configuring code quality & tests)
├── setup.py                    <- Makes project pip installable (pip install -e .) so `alignment` can be imported
├── src                         <- Source code for use in this project
├── sparsify                    <- Source code related to Sparse Autoencoder
└── tests                       <- Unit tests
```

## Citation

If you find the content of this repo useful in your work, please cite it as follows via `\usepackage{biblatex}`:

```bibtex
@software{Tunstall_The_Alignment_Handbook,
  author = {Tunstall, Lewis and Beeching, Edward and Lambert, Nathan and Rajani, Nazneen and Huang, Shengyi and Rasul, Kashif and Bartolome, Alvaro and M. Rush, Alexander and Wolf, Thomas},
  license = {Apache-2.0},
  title = {{The Alignment Handbook}},
  url = {https://github.com/huggingface/alignment-handbook},
  version = {0.3.0.dev0}
}
```
