# NeUQI


> ⚠️ **Warning**
> We strongly recommend using Anaconda to replicate the environment. Identical results are guaranteed only on an NVIDIA A40 with the same software versions. Other GPUs or setups may cause failures, inconsistent results, or out-of-memory (OOM) errors.
>
> The quantized weight format is intended for idea validation only and includes no advanced acceleration. vLLM integration has been tested only with single-GPU inference; multi-GPU tensor-parallel inference is untested and not guaranteed to work.

## Algorithm Implementation

In `find_zeropoint.py`, the `find_zp_matrix_fast` function implements the algorithm that solves for the zero-point given a fixed scale.
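To make the problem concrete, here is a minimal brute-force sketch of choosing a zero-point for a fixed scale. It is not the repository's `find_zp_matrix_fast`; it is a plain grid search under stated assumptions: uniform affine quantization `q = clamp(round(w / scale + z), 0, 2^nbits - 1)`, a floating-point zero-point `z`, and per-weight importance values standing in for a diagonal weighting of the error. All function and parameter names below are hypothetical.

```python
import math


def quantize_dequantize(w, scale, zero_point, nbits):
    """Uniform affine quantization of one value, followed by dequantization."""
    qmax = 2 ** nbits - 1
    q = math.floor(w / scale + zero_point + 0.5)  # round half up
    q = min(max(q, 0), qmax)                      # clamp to the integer grid
    return scale * (q - zero_point)


def weighted_error(weights, importance, scale, zero_point, nbits):
    """Importance-weighted squared reconstruction error for one zero-point."""
    return sum(
        h * (w - quantize_dequantize(w, scale, zero_point, nbits)) ** 2
        for w, h in zip(weights, importance)
    )


def find_zero_point_grid(weights, importance, scale, nbits, steps=256):
    """Brute-force search over evenly spaced candidate zero-points (hypothetical helper)."""
    qmax = 2 ** nbits - 1
    lo = -max(weights) / scale        # largest weight maps to level 0
    hi = qmax - min(weights) / scale  # smallest weight maps to level qmax
    best_z, best_err = lo, float("inf")
    for i in range(steps + 1):
        z = lo + (hi - lo) * i / steps
        err = weighted_error(weights, importance, scale, z, nbits)
        if err < best_err:
            best_z, best_err = z, err
    return best_z, best_err


weights = [-0.75, -0.25, 0.25, 0.75]
importance = [1.0, 1.0, 1.0, 1.0]
print(*find_zero_point_grid(weights, importance, scale=0.5, nbits=2))
# prints: 1.5 0.0
```

In this toy case a non-integer zero-point of 1.5 reproduces the weights exactly, whereas an integer zero-point such as 2 does not. A grid search like this costs one full error evaluation per candidate, so it only illustrates the objective and should not be read as a description of how the fast routine works.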


## Environment Setup

Set up the environment with Anaconda as follows:

1. Create and activate the environment using the provided `environment.yml` file:

   ```bash
   conda env create -f environment.yml -n test
   conda activate test
   ```

2. Install the optimized module:

   ```bash
   cd optimized_module
   pip install .
   ```
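Because results are only expected to match on the reference hardware and software versions, it can help to record what the new environment actually sees. The following is a small sketch, assuming PyTorch is among the packages installed by `environment.yml` (an assumption, not something this README states); `describe_environment` is a hypothetical helper, not part of the repository.

```python
def describe_environment():
    """Return a one-line report of the PyTorch build and the first visible GPU."""
    try:
        import torch  # assumed to be provided by the conda environment
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__}, no CUDA device visible"
    return (
        f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"GPU: {torch.cuda.get_device_name(0)}"
    )


print(describe_environment())
```

If the reported GPU is not an NVIDIA A40, or the versions differ from those pinned in `environment.yml`, treat any numerical differences as expected.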


## Calibration Example

Each run has a unique **short name** that is used as a prefix or directory for all outputs. The short name and the result and log file names are determined by arguments such as `--method` and `--dataset`. To print them, run:

```bash
python arguments.py [same arguments as below]
```

Run **NeUQI**:
```bash
python quant_model_sequential.py \
  --dataset c4 --test_dataset c4,wikitext2 --batch_size 4 --dtype bfloat16 \
  --model_path meta-llama/Llama-2-7b-hf --nbits 2 --group_size -1 \
  --method LDLQ --enable_H_reorder True --enable_H_diag_weight True \
  --param_init_method_diag grid2_s_best_zp \
  --result_dir ./result --log_dir ./log --save_dir ./quantized_model
```


## Distillation Example

To distill a quantized model, enter the `llmtrainer` directory and run the following:

```bash
export WANDB_PROJECT="Quant"
export TORCH_LOGS="recompiles_verbose,graph_breaks"

train() {
    bash "./runs/$1.sh" "outputs/$2"
}

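export SHORT_NAME="[your short name]"  # replace with the short name reported by arguments.py
export CUDA_VISIBLE_DEVICES=0
# The port arithmetic below assumes CUDA_VISIBLE_DEVICES is a single GPU index.
export PORT=$((30000 + CUDA_VISIBLE_DEVICES))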
export DATASET=c4
export nsamples=256
export num_train_epochs=1

export MODEL_NAME=meta-llama/Llama-2-7b-hf
export QUANT_MODEL_NAME="../quantized_model/$SHORT_NAME"

export LR=3e-4
exper_name=$SHORT_NAME.lr$LR.$DATASET.n$nsamples.${num_train_epochs}epochs
train quant_distill $exper_name
```


## Evaluation Example

To evaluate a quantized model, run:

```bash
export SHORT_NAME="[your short name]"  # replace with the short name reported by arguments.py
model=./quantized_model/$SHORT_NAME
ori_model=meta-llama/Llama-2-7b-hf

python eval.py --model vllm \
  --model_args pretrained="${model}",tokenizer=${ori_model},dtype=bfloat16,gpu_memory_utilization=0.9,max_model_len=2048,enforce_eager=True \
  --tasks arc_easy,winogrande,hellaswag,arc_challenge,piqa \
  --batch_size auto
```
