# Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4)

This repository is the official implementation of [Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations]().

## Installation

To install the requirements and the `bof4` package:

```setup
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
pip install -e .
```

## Fine-Tuning

To run QLoRA fine-tuning with a configurable quantizer, use `scripts/finetune.py`.
The files `config/finetune_code.yaml` and `config/finetune_instruct.yaml`
contain the configurations for reproducing the fine-tuning experiments from the paper.
All quantizer codebooks used in the experiments are in the `codebooks` directory.
For example, to fine-tune with BOF4-S quantization and a block size of 64, run:

```train
python scripts/finetune.py --config config/finetune_code.yaml --quantizer codebooks/bof4/bof4-s_mse_64.yaml
```
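To make "block size 64" concrete, here is a minimal, stdlib-only Python sketch of generic block-wise absmax quantization with a 16-entry (4-bit) codebook. It is not the `bof4` package API: `quantize_blockwise`, `dequantize_blockwise`, and the uniform `CODEBOOK` are hypothetical placeholders. The actual BOF4 codebooks in `codebooks/` differ, and the BOF4-S variant may normalize blocks differently than the plain absmax scaling shown here.

```python
from typing import List, Tuple

# Placeholder 4-bit codebook: 16 uniformly spaced levels in [-1, 1].
# NOT the BOF4 codebook; the real codebooks live in the repository's `codebooks/` directory.
CODEBOOK: List[float] = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize_blockwise(weights: List[float], block_size: int = 64) -> Tuple[List[int], List[float]]:
    """Map each weight to a 4-bit codebook index; keep one absmax scale per block."""
    indices: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Normalize by the block's largest magnitude so values lie in [-1, 1];
        # fall back to 1.0 for an all-zero block to avoid dividing by zero.
        scale = max(abs(w) for w in block) or 1.0
        scales.append(scale)
        for w in block:
            x = w / scale
            # Pick the nearest codebook level in squared-error terms.
            idx = min(range(len(CODEBOOK)), key=lambda k: (CODEBOOK[k] - x) ** 2)
            indices.append(idx)
    return indices, scales


def dequantize_blockwise(indices: List[int], scales: List[float], block_size: int = 64) -> List[float]:
    """Reconstruct each weight as its codebook level times the scale of its block."""
    return [CODEBOOK[idx] * scales[i // block_size] for i, idx in enumerate(indices)]
```

Each weight is stored as a 4-bit index, and every block of `block_size` weights shares one scale. Assuming a 16-bit scale and a block size of 64, the scales add 16 / 64 = 0.25 bits per weight; smaller blocks follow local outliers more closely at the cost of more scale overhead.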

## Evaluation

To evaluate a quantized model on the benchmarks used in the paper, run:

```eval
python scripts/eval.py -m meta-llama/Llama-3.2-3B -q codebooks/bof4/bof4-s_mse_64.yaml
```
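To evaluate several models or codebooks in one go, a small Python driver can build and run one `scripts/eval.py` call per combination. This is a hypothetical helper (`build_eval_commands` and `run_all` are not part of the repository); it passes only the `-m` and `-q` flags shown above and assumes it is run from the repository root.

```python
import subprocess
import sys
from itertools import product
from typing import List, Sequence


def build_eval_commands(models: Sequence[str], codebooks: Sequence[str]) -> List[List[str]]:
    """Build one `scripts/eval.py` invocation per (model, codebook) combination."""
    return [
        # sys.executable reuses the interpreter (and environment) running this helper.
        [sys.executable, "scripts/eval.py", "-m", model, "-q", codebook]
        for model, codebook in product(models, codebooks)
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run the evaluations sequentially, stopping at the first failure."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

For example, `run_all(build_eval_commands(["meta-llama/Llama-3.2-3B"], ["codebooks/bof4/bof4-s_mse_64.yaml"]))` runs the same evaluation as the command above. Using `sys.executable` keeps the evaluation in the Python environment where `bof4` was installed.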

For a full list of options, run:
```
python scripts/eval.py -h
```
