# GLOWQ Pipeline

This folder contains the three-step GLOWQ pipeline plus a convenience script that orchestrates end-to-end quantization, randomized SVD compression, and decode-cache evaluation. The notes below cover CUDA kernel setup and basic usage of `run_pipeline.sh`.

## 1. Prerequisites

- Python 3.9+ with a CUDA-enabled PyTorch build (`pip install torch --index-url https://download.pytorch.org/whl/cu121` or similar).
- `transformers`, `datasets`, `tqdm`, `sentencepiece`, and `accelerate` are typically required by the step scripts. The second command below additionally installs `triton`. Install them with:

  ```bash
  pip install -r requirements.txt  # if you maintain a requirements file
  # or
  pip install transformers datasets tqdm sentencepiece accelerate triton
  ```

- CUDA toolkit >= 11.8 available on the system for compiling custom extensions.
- Set `TORCH_CUDA_ARCH_LIST` when building on machines with newer GPUs (for example `export TORCH_CUDA_ARCH_LIST="8.0;8.6;9.0"`).
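
If you are unsure which value to export, a short script along the following lines can derive it from the GPUs PyTorch sees. This is a minimal sketch: the helper names `format_arch_list` and `detect_capabilities` are illustrative and not part of this repository, and the script only prints a suggestion rather than modifying your environment.

```python
def format_arch_list(capabilities):
    """Turn (major, minor) pairs into a TORCH_CUDA_ARCH_LIST string."""
    unique = sorted(set(capabilities))  # drop duplicates from identical GPUs
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def detect_capabilities():
    """Return compute capabilities of visible GPUs, or [] without PyTorch/CUDA."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    caps = detect_capabilities()
    if caps:
        print(f'export TORCH_CUDA_ARCH_LIST="{format_arch_list(caps)}"')
    else:
        print("No CUDA device visible; set TORCH_CUDA_ARCH_LIST manually.")
```

Restricting the list to the architectures you actually run on generally shortens compilation, since fewer kernel variants need to be built.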

## 2. Installing the CUDA W4A16 Kernels

The custom 4-bit weight / 16-bit activation (W4A16) kernels live under `cuda_w4a16/` and are built on demand using `torch.utils.cpp_extension.load`. Alternatively, you can run the Triton-based 4-bit kernel bundled with PyTorch without installing any additional modules. Both implementations are available out of the box.

1. Ensure your environment meets the prerequisites above.
2. (Optional) Warm up the build so later runs do not spend time compiling:

   ```bash
   python - <<'PY'
   from cuda_w4a16 import load_w4a16_extension
   load_w4a16_extension(verbose=True)
   PY
   ```

   This compiles the kernels into PyTorch's extension cache (`~/.cache/torch_extensions` by default). Subsequent executions that use `--use_cuda_w4a16` will reuse the cached library.

If the build fails, double-check that the installed CUDA toolkit version matches the CUDA version your PyTorch build was compiled against, and that you exported an appropriate `TORCH_CUDA_ARCH_LIST`.
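
One way to compare the two versions is sketched below. It assumes `nvcc` is on your `PATH` and relies on the `release X.Y` string that `nvcc --version` prints. The function names are illustrative, not part of this repository.

```python
import re
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def toolkit_version():
    """Version of the CUDA toolkit found on PATH, or None if nvcc is missing."""
    try:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


def pytorch_cuda_version():
    """CUDA version PyTorch was built with, or None (CPU build / no PyTorch)."""
    try:
        import torch
    except ImportError:
        return None
    if torch.version.cuda is None:
        return None
    major, minor = torch.version.cuda.split(".")[:2]
    return int(major), int(minor)


if __name__ == "__main__":
    print("CUDA toolkit (nvcc):", toolkit_version())
    print("PyTorch built with :", pytorch_cuda_version())
```

If the two lines disagree, especially in the major version, installing a matching toolkit or PyTorch wheel is a reasonable first thing to try.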

## 3. Running the Pipeline

`run_pipeline.sh` wires the three steps together:

1. **Step 1:** Quantization error + original weights dump.
2. **Step 2:** Randomized SVD on the error tensors.
3. **Step 3:** Decode evaluation with (optional) CUDA W4A16 kernels.

The minimal invocation requires a Hugging Face model ID and an output directory:

```bash
./run_pipeline.sh \
  --model-name meta-llama/Llama-3.1-8B \
  --output-dir /path/to/output/meta_llama_Llama_3_1_8B
```
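
When sweeping over several settings, it can be convenient to assemble this command from Python. The sketch below uses only flags documented in this README (`--model-name`, `--output-dir`, and `--rank` from Common Parameters). The helper `build_pipeline_command` is hypothetical, and the output directory is kept as the placeholder path from the example above.

```python
import shlex


def build_pipeline_command(model_name, output_dir, rank=None,
                           script="./run_pipeline.sh"):
    """Assemble the argument list for one run_pipeline.sh invocation."""
    cmd = [script, "--model-name", model_name, "--output-dir", output_dir]
    if rank is not None:
        cmd += ["--rank", str(rank)]  # omit to fall back to the script default
    return cmd


if __name__ == "__main__":
    for rank in (32, 64):
        cmd = build_pipeline_command(
            "meta-llama/Llama-3.1-8B",
            "/path/to/output/meta_llama_Llama_3_1_8B",
            rank=rank,
        )
        # Print a copy-pasteable command line; nothing is executed here.
        print(shlex.join(cmd))
```

To actually launch a run, pass the returned list to `subprocess.run(cmd, check=True)`.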

### Common Parameters

- `--rank`: Target rank of the low-rank SVD approximation (default: 64).
- `--p` / `--q`: Number of oversamples and number of power iterations for randomized SVD.
- `--shrink-alpha`: Covariance shrinkage factor.
- `--calib-dataset` / `--calib-config`: Calibration dataset used in Step 2.
- `--nsamples`, `--seqlen`: Calibration sample count and sequence length.
- `--group-size`: Group size passed into Step 3 (default: 128).
- `--device`: Device string (e.g. `cuda:0`).
- `--trust-remote-code`: Forwarded to all Python steps for models that require custom code (Qwen, etc.).
- `--metrics-csv`: Optional CSV path that Step 3 will append metrics to.
- `--step3-use-cuda-w4a16`: Enable the custom kernels compiled above.
- `--step3-cache-mode`: Select whether Step 3 evaluates both modes, BX caching only, or ABX without caching only.
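
To give a feel for how `--rank`, `--p`, and `--q` interact, here is a generic randomized SVD in NumPy. It is a textbook-style sketch, not the Step 2 implementation: it ignores the covariance shrinkage controlled by `--shrink-alpha`, and the function name and default values for `p` and `q` are assumptions chosen for illustration.

```python
import numpy as np


def randomized_svd(E, rank, p=8, q=2, seed=0):
    """Approximate the top-`rank` SVD of E with p oversamples, q power iterations."""
    rng = np.random.default_rng(seed)
    m, n = E.shape
    k = min(rank + p, m, n)  # sketch width: target rank plus oversampling

    # Sample the range of E with a Gaussian test matrix.
    Q, _ = np.linalg.qr(E @ rng.standard_normal((n, k)))

    # Power iterations sharpen the captured subspace; re-orthonormalizing
    # after each multiplication keeps the computation numerically stable.
    for _ in range(q):
        Z, _ = np.linalg.qr(E.T @ Q)
        Q, _ = np.linalg.qr(E @ Z)

    # Project onto the subspace and take an exact SVD of the small matrix.
    B = Q.T @ E
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], S[:rank], Vt[:rank]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    E = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 128))
    U, S, Vt = randomized_svd(E, rank=8)
    rel_err = np.linalg.norm(E - (U * S) @ Vt) / np.linalg.norm(E)
    print(f"relative reconstruction error: {rel_err:.2e}")
```

Larger `p` and `q` generally improve accuracy at the cost of extra matrix multiplications. A few power iterations help most when the singular values of the error tensor decay slowly.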

Outputs are organized into `step1/`, `step2_rank*_alpha*_.../`, and `step3/` subdirectories under the chosen output folder. Cached covariance statistics default to the Step 2 directory and can be reused across runs.
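
To see which artifacts an output folder already contains, a small scan such as the one below may help. It is a sketch: the glob `step2_rank*_alpha*` mirrors the pattern above and deliberately leaves the rest of the Step 2 directory name unspecified, and the function names are illustrative.

```python
from fnmatch import fnmatch
from pathlib import Path

STEP2_PATTERN = "step2_rank*_alpha*"  # remaining suffix intentionally left open


def classify_entries(names):
    """Group directory names into the step1 / step2 / step3 layout."""
    found = {"step1": False, "step2": [], "step3": False}
    for name in sorted(names):
        if name == "step1":
            found["step1"] = True
        elif name == "step3":
            found["step3"] = True
        elif fnmatch(name, STEP2_PATTERN):
            found["step2"].append(name)
    return found


def scan_output_dir(output_dir):
    """Classify the subdirectories that exist under `output_dir`."""
    root = Path(output_dir)
    names = [entry.name for entry in root.iterdir() if entry.is_dir()]
    return classify_entries(names)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(scan_output_dir(sys.argv[1]))
```

Several Step 2 directories can coexist under one output folder, for example after running with different `--rank` values.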

### Skipping Individual Steps

If you already have artifacts from earlier runs, you can skip stages:

```bash
./run_pipeline.sh --model-name ... --output-dir ... --skip-step1 --skip-step2
```

The script validates that the required files exist before skipping.

## 4. Troubleshooting

- If Step 1 or Step 2 runs out of GPU memory, decrease `--nsamples` or `--seqlen`, or use a smaller `--rank`.
- Re-run the CUDA build warm-up with `verbose=True` to see compilation errors if `--use_cuda_w4a16` fails at runtime.
- Clear `~/.cache/torch_extensions` to force a clean rebuild of the kernels when switching CUDA or PyTorch versions.
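
The cache cleanup can also be scripted. The sketch below assumes the default cache location named above and also honors the `TORCH_EXTENSIONS_DIR` environment variable, which PyTorch uses to override it. The function names are illustrative, and the script defaults to a dry run so nothing is deleted by accident.

```python
import os
import shutil
from pathlib import Path


def extensions_dir():
    """Resolve the directory where PyTorch caches JIT-built extensions."""
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "torch_extensions"


def clear_extensions(dry_run=True):
    """Delete the extension cache; return the path, or None if absent."""
    target = extensions_dir()
    if not target.is_dir():
        return None
    if not dry_run:
        shutil.rmtree(target)  # removes every cached extension, not only W4A16
    return target


if __name__ == "__main__":
    path = clear_extensions(dry_run=True)
    print("Nothing to clear." if path is None else f"Would remove: {path}")
```

Call `clear_extensions(dry_run=False)` to perform the deletion; the next run that needs the kernels will then rebuild them.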

For further customization, inspect `run_pipeline.sh` and the individual Python scripts in this directory.
