# GEMQ
GEMQ: mixed-precision quantization for Mixture-of-Experts (MoE) LLMs.


## Installation

```bash
pip install -e .
```

**NOTE:** This project uses **gurobipy** as the integer linear programming (ILP) solver for bit allocation. A Gurobi license may be required for some models.



## Usage

Demo scripts for Mixtral-8×7B and DeepSeek-V2-Lite are provided in `scripts`.

### Bit Allocation

We provide pre-generated bit allocation configs under `configs`, which can be used directly for quantization. You may skip this section if you do not need to regenerate them. To generate the configs from scratch, follow the steps below.


1. Download the first shard of the C4 training dataset (`c4-train.00000-of-01024.json`) from [allenai/c4](https://huggingface.co/datasets/allenai/c4/blob/main/en/c4-train.00000-of-01024.json.gz) and save it under `./data`.

2. Run `scripts/compute_stats_<model>.sh` to compute model statistics on the calibration dataset. The resulting statistics (gradients and perturbation errors) will be saved under `cache`.


3. Run `scripts/allocate_<model>.sh` to solve the ILP for bit allocation using the generated model statistics. The allocation results (bit configs) will be saved under `configs`. 
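
To give a feel for what the allocation step optimizes, here is a minimal sketch of a bit-allocation problem: pick one bit-width per expert so that the summed cost is minimal under a total bit budget. It is not GEMQ's actual formulation and does not use gurobipy; it solves a toy instance with plain dynamic programming so it runs without a license. The function name `allocate_bits`, the cost values, and the assumption that all experts have the same size are hypothetical.

```python
from typing import Dict, List, Tuple


def allocate_bits(
    costs: List[Dict[int, float]], total_bits: int
) -> Tuple[float, List[int]]:
    """Pick one bit-width per expert, minimizing summed cost under a budget.

    costs[i][b] is a hypothetical estimate of the loss increase when expert i
    is quantized to b bits. Experts are assumed to be equally sized, so the
    budget is simply the sum of the chosen bit-widths.
    """
    # best[used] = (lowest cost, choices) using exactly `used` bits so far
    best: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for expert_costs in costs:
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for used, (cost, choice) in best.items():
            for bits, c in expert_costs.items():
                new_used = used + bits
                if new_used > total_bits:
                    continue  # this choice would exceed the budget
                cand = (cost + c, choice + [bits])
                if new_used not in nxt or cand[0] < nxt[new_used][0]:
                    nxt[new_used] = cand
        best = nxt
    if not best:
        raise ValueError("budget too small for any assignment")
    return min(best.values(), key=lambda t: t[0])


# Made-up costs for four experts at 2, 3, and 4 bits.
costs = [
    {2: 0.90, 3: 0.30, 4: 0.10},  # sensitive expert
    {2: 0.20, 3: 0.10, 4: 0.05},
    {2: 0.50, 3: 0.20, 4: 0.08},
    {2: 0.05, 3: 0.03, 4: 0.02},  # robust expert
]

# A budget of 12 bits corresponds to an average of 3 bits per expert.
total_cost, bits = allocate_bits(costs, total_bits=12)
print(bits)  # [4, 2, 4, 2]
```

In this toy instance the sensitive experts receive 4 bits and the robust ones 2 bits, which beats the uniform 3-bit assignment at the same average bit-width.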


### Mixed-Precision Quantization

Run `scripts/quantize_<model>.sh` to quantize a model. Refer to the scripts for detailed usage instructions.

The evaluation code will run automatically after quantization.

Quantized models will be saved under `results`.
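
As a rough illustration of what a per-expert bit config means in practice, the sketch below applies a different bit-width to each expert using symmetric round-to-nearest quantization. This is an assumption made for illustration only: the quantizer GEMQ actually uses is defined in the scripts, and the names `quantize_dequantize`, `experts`, and `bit_config` are hypothetical.

```python
from typing import Dict, List


def quantize_dequantize(weights: List[float], bits: int) -> List[float]:
    """Symmetric round-to-nearest quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4 bits, 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / qmax  # step size of the integer grid
    out = []
    for w in weights:
        q = round(w / scale)
        q = max(-qmax, min(qmax, q))  # clamp to the signed integer range
        out.append(q * scale)
    return out


def max_error(original: List[float], restored: List[float]) -> float:
    """Largest absolute difference between original and restored weights."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Toy weights; both experts share the same values so the errors are comparable.
experts: Dict[str, List[float]] = {
    "expert_0": [0.5, -1.0, 0.25, 0.75],
    "expert_1": [0.5, -1.0, 0.25, 0.75],
}
# Hypothetical bit config: one bit-width per expert.
bit_config: Dict[str, int] = {"expert_0": 4, "expert_1": 2}

errors = {
    name: max_error(w, quantize_dequantize(w, bit_config[name]))
    for name, w in experts.items()
}
print(errors)
```

The expert assigned 2 bits shows a larger reconstruction error than the one assigned 4 bits, which is the trade-off the bit allocation step balances against the overall budget.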


### Inference

Use `scripts/bench_generate_<model>.sh` to run and benchmark the real quantized models.

