# NuBitQ-OCP

<div style="text-align: center;">
 
  <img src="pic1.png" alt="NuBitQ-OCP" />
  <strong>NuBitQ-OCP</strong><br>
</div>


With the rapid growth of large language models (LLMs), compressing models efficiently while preserving their performance has become a significant challenge for both industry and academia. Quantization, a key compression technique, can substantially reduce a model's compute and storage requirements, but traditional quantization methods have two main limitations.

First, existing non-uniform quantization methods often rely on fixed codebooks and require expensive optimization, which limits their flexibility and efficiency. Second, most quantization compensation techniques are designed for outliers under uniform quantization. Under non-uniform quantization and complex weight distributions, they struggle to handle anomalous distribution characteristics, and model performance degrades.
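
The difference between uniform and non-uniform quantization can be illustrated with a generic codebook lookup. The following is a minimal sketch in plain Python, not the NuBitQ algorithm: the helper names, the toy weights, and the hand-picked non-uniform codebook are all assumptions made for illustration. It maps each weight to its nearest codebook entry and compares the reconstruction error of evenly spaced levels against levels concentrated near zero, where most of the toy weights lie.

```python
# Illustrative only: generic codebook quantization, NOT the NuBitQ algorithm.
from bisect import bisect_left


def uniform_codebook(lo: float, hi: float, bits: int) -> list[float]:
    """Return 2**bits evenly spaced levels between lo and hi."""
    n = 2 ** bits
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def quantize(weights: list[float], codebook: list[float]) -> list[int]:
    """Map each weight to the index of its nearest entry in a sorted codebook."""
    indices = []
    for w in weights:
        pos = bisect_left(codebook, w)
        if pos == 0:
            indices.append(0)
        elif pos == len(codebook):
            indices.append(len(codebook) - 1)
        else:
            # Pick whichever neighbouring level is closer to w.
            left, right = codebook[pos - 1], codebook[pos]
            indices.append(pos - 1 if w - left <= right - w else pos)
    return indices


def dequantize(indices: list[int], codebook: list[float]) -> list[float]:
    """Look the stored indices back up in the codebook."""
    return [codebook[i] for i in indices]


def mse(a: list[float], b: list[float]) -> float:
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy weights (assumed): mostly near zero, with two large-magnitude values.
weights = [-1.0, -0.12, -0.05, -0.01, 0.0, 0.02, 0.06, 0.11, 1.0]

uniform = uniform_codebook(-1.0, 1.0, bits=2)   # four evenly spaced levels
nonuniform = [-1.0, -0.05, 0.05, 1.0]           # four hand-picked levels

err_uniform = mse(weights, dequantize(quantize(weights, uniform), uniform))
err_nonuniform = mse(weights, dequantize(quantize(weights, nonuniform), nonuniform))
print(f"uniform MSE:     {err_uniform:.4f}")
print(f"non-uniform MSE: {err_nonuniform:.4f}")
```

On this toy input the non-uniform codebook yields the lower error because its levels follow the weight distribution. Both codebooks store the same number of bits per weight.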

To address these challenges, this project proposes NuBitQ, a novel non-uniform quantization framework that supports arbitrary bit-widths. The framework supports quantization strategies tailored to individual network layers, which improves adaptability and quantization efficiency. To overcome the shortcomings in outlier compensation, we design a new outlier evaluation metric that integrates weight perturbation, activation distribution, and perturbation propagation. Based on this metric, we develop a multi-level, fine-grained Outlier Compensation Plugin (OCP) that mitigates the performance degradation caused by outliers.

Compared with traditional approaches, our method avoids complex Hessian matrix computations and costly fine-tuning procedures, offering better applicability and scalability. Extensive experiments across various tasks and model families demonstrate the superior performance and practical value of the proposed solution.

This project aims to advance the efficient compression and deployment of large-scale language models, providing new insights and technical solutions for research and engineering applications in relevant fields.


---

## Contents

- [NuBitQ-OCP](#nubitq-ocp)
  - [Contents](#contents)
  - [Installation](#installation)
  - [Usage](#usage)
  - [Example](#example)
    - [Important Parameter Explanation](#important-parameter-explanation)
    - [LLaMA Series Model Quantization Examples](#llama-series-model-quantization-examples)
    - [4-bit Quantization Example](#4-bit-quantization-example)
    - [3-bit Quantization Example](#3-bit-quantization-example)
    - [2-bit Quantization Example](#2-bit-quantization-example)
  - [License](#license)

---

## Installation
Before installation, please ensure that your GPU has sufficient capacity to run inference on a non-quantized model.
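
As a rough way to check this requirement, the weight memory of a non-quantized model can be estimated from its parameter count and data type. The sketch below is illustrative: the function name `weight_memory_gib` and the 7B/13B parameter counts are assumptions, not part of this project. The estimate covers weights only and ignores activations and the KV cache, so actual usage will be higher.

```python
# Rough estimate only: counts model weights, not activations or the KV cache.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gib(num_params: int, dtype: str = "fp16") -> float:
    """Return the memory needed to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 ** 3


# Parameter counts below are illustrative round numbers.
for name, params in [("7B", 7_000_000_000), ("13B", 13_000_000_000)]:
    print(f"{name}: ~{weight_memory_gib(params):.1f} GiB of weights in fp16")
```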

```bash
# Clone the repository
git clone https://github.com/ours.git

# Enter the project directory
cd ours_dir

# (Optional) Create and activate a virtual environment
conda create -n nbq python=3.10
conda activate nbq

# Install dependencies
pip install -r requirements.txt
```

Alternatively, you can directly install the packaged wheel file:

```bash
pip install nub_ocp-0.1.0-py3-none-any.whl
```

After installation, you can proceed with testing.
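
One way to confirm that the wheel was installed is to query the package metadata from Python. This sketch assumes the distribution name is `nub_ocp`, inferred from the wheel filename above. The helper `installed_version` is hypothetical and not part of the project.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "nub_ocp" is assumed from the wheel filename nub_ocp-0.1.0-py3-none-any.whl.
version = installed_version("nub_ocp")
print(f"nub_ocp {version} is installed" if version else "nub_ocp is not installed")
```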

---

## Usage

This project currently provides partial example code and demonstrations for reference. The complete implementation will be released once the associated paper is formally accepted and published. Thank you for your attention and support.

---

## Example
Here is an example command to run the quantization training script, demonstrating how to set the model path, quantization parameters, and training-related optimization configurations:

```bash
python quant_beta_llama.py \
    --model_path "/your/model/path" \
    --q 0 \
    --r 4 \
    --c 512 \
    --b 2 \
    --d 16 \
    --g 128 \
    --gate1 165 \
    --gate2 178 \
    --q_update \
    --q_compensate \
    --update_max_epochs 20 \
    --update_early_stop 10 \
    --update_lr 0.00001 \
    --update_batch_size 2 \
    --update_adam_beta1 0.9 \
    --update_adam_beta2 0.95 \
    --update_keep_best \
    --local_batch_size 8 \
    --val_size 256 \
    --print_frequency 20 \
    --offload_activations True
```

---

### Important Parameter Explanation

- **--r, --c, --d**: These three parameters together determine the compression ratio. They are the key tuning parameters of the quantization algorithm and control the trade-off between model size and performance.
- **--q_compensate**: Enables the quantization compensation mechanism, which helps reduce the accuracy loss caused by quantization.
- Other parameters, such as the learning rate, batch size, and optimizer settings, are explained in the code and can be adjusted to suit your needs.

---

### LLaMA Series Model Quantization Examples

Below are some short examples for quantizing LLaMA series models. For Qwen and Gemma models, please refer to the `/examples` directory.

---

### 4-bit Quantization Example

```bash
python quant_beta_llama.py \
    --model_path "/path/to/llama/model" \
    --b 2 \
    --r 4 \
    --c 256 \
    --d 8 \
    --update_max_epochs 10 \
    --update_lr 1e-5 \
    --local_batch_size 16
```

---

### 3-bit Quantization Example

```bash
python quant_beta_llama.py \
    --model_path "/path/to/llama/model" \
    --b 2 \
    --r 3 \
    --c 512 \
    --d 9 \
    --q_compensate \
    --update_max_epochs 10 \
    --update_lr 5e-6 \
    --local_batch_size 16
```

---

### 2-bit Quantization Example

```bash
python quant_beta_llama.py \
    --model_path "/path/to/llama/model" \
    --b 2 \
    --r 2 \
    --c 1024 \
    --d 10 \
    --q_compensate \
    --update_max_epochs 10 \
    --update_lr 1e-5 \
    --local_batch_size 16
```

For more detailed parameter explanations and usage instructions, please refer to the code comments and the examples provided in the `/examples` directory.
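
To run the three configurations above in sequence, the commands can also be assembled from Python. This is a minimal sketch that reuses only the flags and values shown in the examples. The names `COMMON`, `PRESETS`, and `build_command` are hypothetical helpers and not part of the project. The script only prints the commands.

```python
import shlex

# Flags shared by all three examples above.
COMMON = {"--b": "2", "--update_max_epochs": "10", "--local_batch_size": "16"}

# Per-example flags; a value of None marks a switch that takes no argument.
PRESETS = {
    "4-bit": {"--r": "4", "--c": "256", "--d": "8", "--update_lr": "1e-5"},
    "3-bit": {"--r": "3", "--c": "512", "--d": "9", "--update_lr": "5e-6",
              "--q_compensate": None},
    "2-bit": {"--r": "2", "--c": "1024", "--d": "10", "--update_lr": "1e-5",
              "--q_compensate": None},
}


def build_command(preset: str, model_path: str) -> list[str]:
    """Assemble the argument list for one of the example configurations."""
    args = ["python", "quant_beta_llama.py", "--model_path", model_path]
    for flag, value in {**COMMON, **PRESETS[preset]}.items():
        args.append(flag)
        if value is not None:
            args.append(value)
    return args


for name in PRESETS:
    print(f"# {name}")
    print(shlex.join(build_command(name, "/path/to/llama/model")))
```

To execute a configuration instead of printing it, pass the list returned by `build_command` to `subprocess.run(..., check=True)`.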

---

## License

This project is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). See the `LICENSE` file for details.

