## WUSH

WUSH is built on the MR-GPTQ codebase from the paper *“Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization”*.

We added the following files under `src/quantization/`:

* `hsuw_gptq.py`
* `hsuw_utils.py`

We also added a runner script at `scripts/wush.sh` and modified `model_quant.py` to enable HSUW quantization. **To run MR-GPTQ instead, revert the change in `model_quant.py`.**

To keep the comparison fair and consistent, we left the rest of the codebase unchanged. Please refer to the MR-GPTQ README for details.

On the kernel side, we modified the QuTLASS library, extending its MXFP kernels to support a distinct transform per block. See the paper for details.

---

## Environment setup


```bash
#!/bin/bash
conda create -n wush python=3.12 ipykernel ipywidgets cmake --yes

source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate wush

pip install --pre torch==2.11.0.dev20260122+cu128 --index-url https://download.pytorch.org/whl/nightly/cu128
pip install numpy pandas

git clone git@github.com:Dao-AILab/fast-hadamard-transform
cd fast-hadamard-transform
pip install -e .
cd ..

# Install QuTLASS (run from the repository root that contains the third_party/cutlass submodule).
# First overwrite two CUTLASS headers with the modified copies from the current directory.
mv mma_multistage.h third_party/cutlass/include/cutlass/gemm/threadblock/
mv mma_tensor_op.h third_party/cutlass/include/cutlass/gemm/warp/

pip install --no-build-isolation .

# Installing the fp-quant linear layer with WUSH support
cd ../inference_lib
pip install -e .

pip install lm_eval==0.4.9
```
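
After the installation finishes, a quick import check can confirm that the packages are visible in the active environment. The snippet below is a minimal sketch: `check_module` is a hypothetical helper, not part of this repository, and the module names are assumed to match the packages installed above.

```bash
#!/bin/bash
# Hypothetical helper (not part of this repository): report whether a
# Python module can be imported with the interpreter on the PATH.
check_module() {
    local module="$1"
    if python3 -c "import ${module}" >/dev/null 2>&1; then
        echo "ok: ${module}"
    else
        echo "MISSING: ${module}"
        return 1
    fi
}

# Module names are assumptions based on the packages installed above.
for module in torch numpy pandas fast_hadamard_transform lm_eval; do
    check_module "${module}" || true
done
```

Run it with the `wush` environment activated; any `MISSING` line points to the install step that should be repeated.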

---

## How to run

Run WUSH quantization with `scripts/wush.sh`, passing the configuration through environment variables:

```bash
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 GPTQ=1 TRANSFORM_CLASS=hsuw bash scripts/wush.sh
```
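
The command above passes its settings as environment variables placed before the script name. As an illustration of that pattern only, the sketch below shows how a shell script can read such variables and fall back to defaults. `resolve_config` and the default values are hypothetical; the actual handling in `scripts/wush.sh` may differ.

```bash
#!/bin/bash
# Hypothetical sketch: read settings from the environment with fallbacks.
# The defaults are placeholders, not the values used by scripts/wush.sh.
resolve_config() {
    local gptq="${GPTQ:-0}"
    local transform="${TRANSFORM_CLASS:-identity}"
    local devices="${CUDA_VISIBLE_DEVICES:-all}"
    echo "GPTQ=${gptq} TRANSFORM_CLASS=${transform} CUDA_VISIBLE_DEVICES=${devices}"
}

# Print the configuration that would be used with the current environment.
resolve_config
```

Because the variables are given on the command line rather than exported, they apply to that single invocation and do not leak into later commands in the same shell.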

---

## Kernel benchmarks

```bash
python qutlass/benchmarks/bench_mxfp4_sm100.py
```
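
The benchmark file name ends in `sm100`, which suggests it targets GPUs with compute capability 10.x; that reading is an assumption, not something stated here. Under that assumption, the sketch below lists each visible GPU and flags whether its compute capability has major version 10. `is_sm100` is a hypothetical helper, and the `compute_cap` query field requires a reasonably recent NVIDIA driver.

```bash
#!/bin/bash
# Hypothetical helper: succeed when a compute capability string such as
# "10.0" has major version 10 (assumed to correspond to the sm100 target).
is_sm100() {
    local cap="$1"
    [ "${cap%%.*}" = "10" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # One CSV line per GPU: "<name>, <compute capability>"
    nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader |
    while IFS=, read -r name cap; do
        cap="${cap# }"  # drop the space that follows the comma
        if is_sm100 "${cap}"; then
            echo "${name}: compute capability ${cap} (major version 10)"
        else
            echo "${name}: compute capability ${cap} (not major version 10)"
        fi
    done
else
    echo "nvidia-smi not found; cannot query GPU compute capability"
fi
```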
