# STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs

## Abstract

In this paper, we present STBLLM, the first structural binarization framework for compressing Large Language Models (LLMs) to less than 1-bit precision. LLMs have achieved remarkable performance, but their heavy memory requirements have hindered widespread adoption, particularly on resource-constrained devices. Binarization, which quantizes weights to a mere 1 bit, marks a milestone in computational efficiency. However, we observe that some weights in binarized LLMs can be randomly flipped without significant performance degradation, indicating potential for further compression. To exploit this, STBLLM employs N:M sparsity to perform structural binarization of the weights. First, we introduce a new Standardized Importance (SI) metric that considers both weight magnitude and input feature norm to better evaluate weight significance. Then, we propose a layer-wise approach in which different layers of the LLM can be sparsified with different N:M ratios, balancing compression and accuracy. Finally, we use residual approximation with double binarization to preserve information for salient weights. In addition, for less important weights, we use a fine-grained grouping strategy that applies different quantization schemes to sparse, intermediate, and dense regions. We conduct extensive experiments on various language models, including the LLaMA-1/2/3 and OPT families and Mistral, to evaluate the effectiveness of STBLLM. The results demonstrate that our approach outperforms other binarization-based LLM compression methods while significantly reducing memory requirements.
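To make the idea of structural binarization concrete, here is a minimal pure-Python sketch. It is not the code in this repository. All function names (`importance`, `nm_mask`, `binarize`, `residual_binarize`) and the toy weight row are hypothetical. The importance score is a Wanda-style stand-in (weight magnitude times input feature norm), not the exact SI formula. The scale is assumed to be the mean absolute value of the kept weights, and the residual pass is applied to every kept weight rather than only to salient ones.

```python
# Illustrative sketch only: names and simplifications are assumptions,
# not the repository's implementation.

def importance(weights, feature_norms):
    """Stand-in score |w| * input-feature norm (NOT the exact SI metric)."""
    return [abs(w) * n for w, n in zip(weights, feature_norms)]


def nm_mask(scores, n, m):
    """Keep the n highest-scoring entries in every consecutive group of m."""
    if len(scores) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = [False] * len(scores)
    for start in range(0, len(scores), m):
        group = range(start, start + m)
        for i in sorted(group, key=lambda i: scores[i], reverse=True)[:n]:
            mask[i] = True
    return mask


def binarize(weights, mask):
    """Map kept weights to +/-alpha (alpha = mean |w| of kept); pruned to 0."""
    kept = [abs(w) for w, k in zip(weights, mask) if k]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return [(alpha if w >= 0 else -alpha) if k else 0.0
            for w, k in zip(weights, mask)]


def residual_binarize(weights, mask):
    """Binarize once, then binarize the remaining error and add it back."""
    first = binarize(weights, mask)
    residual = [w - f if k else 0.0 for w, f, k in zip(weights, first, mask)]
    second = binarize(residual, mask)
    return [f + s for f, s in zip(first, second)]


row = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.6, 0.2]
norms = [1.0] * 8  # uniform input norms, so the score reduces to |w|
mask = nm_mask(importance(row, norms), n=4, m=8)
approx = binarize(row, mask)
approx_residual = residual_binarize(row, mask)
```

On this toy row, the 4:8 mask keeps the four entries with the largest scores, and the second binarization pass reduces the reconstruction error on those kept entries compared with a single pass.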

## Dependencies

- `torch`: tested on v2.0.1+cu117
- `transformers`: tested on v4.35.0
- `sentencepiece`: any version
- `datasets`: tested on v2.14.6
- `huggingface-hub`: tested on v0.16.4
- `pyparsing`
- `protobuf`

All experiments except those on LLaMA-1-65B can be run on a single NVIDIA A800 GPU. For LLaMA-1-65B, we use four NVIDIA A800 GPUs. The LLaMA-1-7B and LLaMA-2-7B models can also be evaluated on a single RTX 4090 GPU.

## Structured Binary LLMs

### Sub 1-bit for LLM (Baseline)

```bash
MODEL_NAME=/path/to/llama
gpu=0
SPARSITY_RATIO=0.5
SPARSITY_TYPE=4:8
CUDA_VISIBLE_DEVICES=$gpu python3 run.py ${MODEL_NAME} c4 braq --blocksize 128 \
    --salient_metric hessian \
    --prune_method wanda \
    --reconstruction \
    --Lamda 2 \
    --Hyper_m 6 \
    --eval_zero_shot \
    --sparsity_ratio ${SPARSITY_RATIO} \
    --sparsity_type ${SPARSITY_TYPE}
```

### Sub 1-bit for STBLLM (Ours)

```bash
MODEL_NAME=/path/to/llama
gpu=0
SPARSITY_RATIO=0.5
SPARSITY_TYPE=4:8
CUDA_VISIBLE_DEVICES=$gpu python3 run.py ${MODEL_NAME} c4 braq --blocksize 128 \
    --salient_metric hessian \
    --prune_method si_structure \
    --reconstruction \
    --Lamda 2 \
    --Hyper_m 6 \
    --sparsity_ratio ${SPARSITY_RATIO} \
    --sparsity_type ${SPARSITY_TYPE}
```
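If you prefer to launch runs from Python, for example to script several configurations, the two commands above can be assembled programmatically. The sketch below is a hypothetical helper, not part of this repository. `build_command` and its parameters are illustrative names, and it uses only the `run.py` flags shown in this README. It assumes `run.py` accepts its options in any order.

```python
import shlex


def build_command(model_path, prune_method, sparsity_type="4:8",
                  sparsity_ratio=0.5, eval_zero_shot=False):
    """Assemble a run.py invocation using only the flags shown in this README."""
    # Basic sanity check on the N:M pattern string, e.g. "4:8".
    n, m = (int(part) for part in sparsity_type.split(":"))
    if not 0 < n <= m:
        raise ValueError(f"invalid N:M pattern: {sparsity_type}")
    cmd = [
        "python3", "run.py", model_path, "c4", "braq",
        "--blocksize", "128",
        "--salient_metric", "hessian",
        "--prune_method", prune_method,
        "--reconstruction",
        "--Lamda", "2",
        "--Hyper_m", "6",
        "--sparsity_ratio", str(sparsity_ratio),
        "--sparsity_type", sparsity_type,
    ]
    if eval_zero_shot:
        cmd.append("--eval_zero_shot")
    return cmd


# Mirror the two configurations above: baseline (wanda) and ours (si_structure).
baseline = build_command("/path/to/llama", "wanda", eval_zero_shot=True)
ours = build_command("/path/to/llama", "si_structure")
print(shlex.join(baseline))
print(shlex.join(ours))
```

To actually execute a command, pass the list to `subprocess.run` and set `CUDA_VISIBLE_DEVICES` in the environment passed to it, as the shell snippets above do.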

## Related Projects

[GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers](https://github.com/IST-DASLab/gptq)

[PB-LLM: Partially Binarized Large Language Models](https://github.com/hahnyuan/PB-LLM)

[AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://github.com/mit-han-lab/llm-awq)

[BiLLM: Pushing the Limit of Post-Training Quantization of LLMs](https://github.com/Aaronhuang-778/BiLLM)
