# BBQ: Boosting Quantization Entropy with Bell Box Quantization

This submission package is for our paper "BBQ: Boosting Quantization Entropy with Bell Box Quantization".
The code is largely adapted from [QuEST](https://github.com/IST-DASLab/QuEST/tree/main) with a few modifications.


## Quickstart 

Create a conda environment and install dependencies (we recommend Python 3.10):

```bash
conda create -n env python=3.10
conda activate env
pip install -r requirements.txt
```

If your Nvidia GPU belongs to the 50 series, you might need to manually install the copy of `fast-hadamard-transform` included in this package instead of the one listed in `requirements.txt`.
```bash
cd fast-hadamard-transform;
python3 setup.py install;
```
Note that the only difference between the included version and the public version is the following two lines in `fast-hadamard-transform/setup.py`, which ask the `nvcc` compiler to generate code for the Blackwell architecture (`sm_120`).
```python
      cc_flag.append("-gencode")
      cc_flag.append("arch=compute_120,code=sm_120")
```

To run the source code in this package, you need the following hardware:
1. an Nvidia GPU that supports native BF16 tensor ops (older Turing GPUs only support FP16, which will not work)
2. potentially 1 TB of disk space to store the raw C4 dataset (can be deleted later) and the pre-processed C4 dataset (used by the script).
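
The two hardware requirements above can be checked before starting a run. The following is a minimal sketch, not part of this package: `classify_gpu` and `free_disk_tb` are hypothetical helper names, and the sketch assumes that native BF16 corresponds to CUDA compute capability 8.0 or higher (Ampere and newer) and that 50-series GPUs report compute capability 12.0, matching the `sm_120` target mentioned above.

```python
import shutil


def classify_gpu(capability):
    """Map a CUDA compute capability (major, minor) to the two GPU checks."""
    major, minor = capability
    return {
        # Assumption: BF16 tensor ops are native from compute capability 8.0 on.
        "native_bf16": major >= 8,
        # Assumption: sm_120 devices may need the bundled fast-hadamard-transform.
        "may_need_bundled_fht": (major, minor) == (12, 0),
    }


def free_disk_tb(path="."):
    """Free space at `path` in terabytes (1 TB = 10**12 bytes)."""
    return shutil.disk_usage(path).free / 1e12


if __name__ == "__main__":
    print(f"free disk: {free_disk_tb():.2f} TB")
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        print(classify_gpu(torch.cuda.get_device_capability()))
    else:
        print("no CUDA device visible; skipping GPU checks")
```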

## Modifications to QuEST Source Code
1. We changed `src/models/quantization/base_linear.py` to include variants of the BBQ quantizer. The final version shown in the paper is called `BBQV5HD` for the activation quantizer and `BBQV5HDChan` for the weight quantizer.
2. We modified `src/models/base.py` to avoid applying weight decay to the scaling factor of BBQ quantizers.
3. We modified `src/optim/base.py` to log additional metrics during training, such as the scaling factors of the quantizers, the entropy of the weights, and gradient norms. Logging these metrics can slow down training, but when a bug occurs, the additional metrics help identify its root cause.
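
The weight-decay exemption in item 2 can be illustrated with a minimal sketch of optimizer parameter groups. This is not the code in `src/models/base.py`: the helper name `split_decay_groups` and the rule that a scaling factor is any parameter whose name contains `"scale"` are assumptions made for illustration.

```python
def split_decay_groups(named_params, weight_decay, no_decay_keys=("scale",)):
    """Split (name, parameter) pairs into decayed and non-decayed groups.

    `no_decay_keys` is a hypothetical naming rule: any parameter whose name
    contains one of these substrings is exempt from weight decay.
    """
    decay, no_decay = [], []
    for name, param in named_params:
        if any(key in name for key in no_decay_keys):
            no_decay.append(param)
        else:
            decay.append(param)
    # Each dict is one parameter group with its own weight-decay setting.
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

With PyTorch, the result could be passed to an optimizer as `torch.optim.AdamW(split_decay_groups(model.named_parameters(), weight_decay=0.1))`, where `0.1` is only an illustrative value.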

## Reproducing Table 1 and Figure 2
To reproduce the first 4 rows of Table 1, use the following commands:
```bash
bash train_none.sh;
bash train_bbq.sh 4 0; bash train_bbq.sh 3 0; bash train_bbq.sh 2 -0.5; bash train_bbq.sh 1 -0.5;
bash train_quest.sh 4; bash train_quest.sh 3; bash train_quest.sh 2; bash train_quest.sh 1;
bash train_lsq.sh 4; bash train_lsq.sh 3; bash train_lsq.sh 2; bash train_lsq.sh 1;
```
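
The same runs can also be launched from a small driver script. The sketch below is a hypothetical convenience wrapper, not part of this package. It copies the script names and arguments verbatim from the commands above, and unlike the `;`-separated shell lines it stops at the first failing run.

```python
import subprocess


def build_commands():
    """Return the training invocations listed above, in order."""
    commands = [["bash", "train_none.sh"]]
    # Argument pairs for train_bbq.sh, copied verbatim from the commands above.
    for first, second in [("4", "0"), ("3", "0"), ("2", "-0.5"), ("1", "-0.5")]:
        commands.append(["bash", "train_bbq.sh", first, second])
    for script in ("train_quest.sh", "train_lsq.sh"):
        for arg in ("4", "3", "2", "1"):
            commands.append(["bash", script, arg])
    return commands


def run_all(dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # raises on a non-zero exit code


if __name__ == "__main__":
    run_all(dry_run=True)
```
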
To reproduce the next 4 rows of Table 1, edit the `train_*.sh` files by commenting out the following lines

```bash
# 30M
export N_LAYER=6
export N_EMBD=640
export N_HEAD=5
export LR=0.0012
export TOKENS=3000000000 # 3B
export MODEL_SIZE_PREFIX="30M"
```
and uncommenting the following lines
```bash
# # 50M
# export N_LAYER=7
# export N_EMBD=768
# export N_HEAD=6
# export LR=0.0012
# export TOKENS=5000000000 # 5B
# export MODEL_SIZE_PREFIX="50M"
```
Then rerun the commands above. The other rows of Table 1 can be reproduced in the same way by changing the model configuration.
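
To avoid copy errors when switching model sizes, the two configuration blocks can be generated from a single table. This is a minimal sketch: `CONFIGS` and `export_block` are hypothetical names, the values are the ones shown above, and the output is meant to be pasted into the `train_*.sh` files by hand.

```python
# Values copied from the 30M and 50M blocks shown above.
CONFIGS = {
    "30M": {"N_LAYER": 6, "N_EMBD": 640, "N_HEAD": 5, "LR": 0.0012, "TOKENS": 3_000_000_000},
    "50M": {"N_LAYER": 7, "N_EMBD": 768, "N_HEAD": 6, "LR": 0.0012, "TOKENS": 5_000_000_000},
}


def export_block(size):
    """Render the shell `export` lines for one model size."""
    lines = [f"# {size}"]
    for key, value in CONFIGS[size].items():
        lines.append(f"export {key}={value}")
    lines.append(f'export MODEL_SIZE_PREFIX="{size}"')
    return "\n".join(lines)


if __name__ == "__main__":
    print(export_block("50M"))
```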

All metrics, including weight entropy (Table 1), final evaluation perplexity (Table 1), and weight entropy vs. training iterations (Figure 2), can be found on Weights and Biases after training completes.
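
For readers who want to sanity-check a logged entropy value offline, the sketch below computes the empirical Shannon entropy of a list of quantized codes. It assumes entropy is measured in bits over the histogram of integer codes; the paper's exact definition may differ, and `code_entropy_bits` is a hypothetical helper, not a function from this package.

```python
import math
from collections import Counter


def code_entropy_bits(codes):
    """Empirical Shannon entropy (in bits) of a sequence of quantized codes."""
    counts = Counter(codes)
    total = len(codes)
    # Sum -p * log2(p) over the observed codes; unused codes contribute nothing.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())
```

Under this definition, a 2-bit quantizer that uses all four levels equally often reaches the maximum of 2 bits, while a quantizer that collapses onto a single level gives 0.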

## Reproducing Figure 5
Please use the following commands to reproduce Figure 5.
```bash
cd benchmark;
python3 gemm.py;
python3 quant.py;
```
You can then find the figures in `benchmark/gemm/matmul-performance.png` and `benchmark/quant/quant-performance.png`. Please note that the original versions of these images were generated using an Nvidia RTX 5090 GPU, so your profiling results may differ if you use a different GPU.
In addition, PyTorch 2.8 is required to profile FP4 matmul performance in `gemm.py`.
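
The PyTorch version can be checked before running `gemm.py`. The sketch below is a hypothetical helper, not part of this package, and it assumes the requirement means version 2.8 or newer.

```python
import re


def meets_min_version(version, minimum=(2, 8)):
    """True if a version string such as '2.8.0+cu128' is at least `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    try:
        import torch
        print(torch.__version__, meets_min_version(torch.__version__))
    except ImportError:
        print("PyTorch is not installed")
```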
