# SSQR CUDA Inference Kernels

This package provides the CUDA inference kernels for SSQR-compressed linear layers.
It is designed for fast matrix multiplication with low-bit quantized weights, with optional full-precision outliers, and can be used together with checkpoints produced by the [quantization](../quantization) program.

## Requirements

An NVIDIA GPU with the Ampere architecture is required.
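Once PyTorch is installed, one way to check this requirement is to query the device's compute capability. The sketch below assumes that Ampere corresponds to compute capabilities 8.0, 8.6, and 8.7; the helper names `is_ampere` and `describe_device` are illustrative and not part of this package.

```python
# Compute capabilities assumed to identify Ampere GPUs
# (8.9 is Ada Lovelace and 9.0 is Hopper, so they are excluded).
AMPERE_CAPABILITIES = {(8, 0), (8, 6), (8, 7)}


def is_ampere(capability):
    """Return True if a (major, minor) compute capability belongs to Ampere."""
    return tuple(capability) in AMPERE_CAPABILITIES


def describe_device(index=0):
    """Return a one-line summary of the CUDA device at the given index."""
    try:
        import torch  # imported lazily so is_ampere() works without PyTorch
    except ImportError:
        return "PyTorch is not installed."
    if not torch.cuda.is_available():
        return "No CUDA device is visible to PyTorch."
    capability = torch.cuda.get_device_capability(index)
    name = torch.cuda.get_device_name(index)
    return f"{name}: sm_{capability[0]}{capability[1]}, Ampere: {is_ampere(capability)}"


if __name__ == "__main__":
    print(describe_device())
```

The capability check is kept separate from the device query so it can be reused on machines without a GPU.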

## Installation

<p>
<a href="https://developer.nvidia.com/cuda-downloads"><img alt="CUDA 13.2" src="https://img.shields.io/badge/CUDA-13.2-green.svg"></a>
<a href="https://www.python.org/downloads/"><img alt="Python 3.14.4" src="https://img.shields.io/badge/Python-3.14.4-blue.svg"></a>
<a href="https://pytorch.org/get-started/"><img alt="PyTorch 2.11.0" src="https://img.shields.io/badge/PyTorch-2.11.0-orange.svg"></a>
</p>

First, install the dependencies.
```bash
pip install ninja torch
```

Then, install this `ssqr` package.
```bash
pip install -e .
```

The kernels are compiled the first time they are run.
To finish the installation, run the basic tests below.

## Tests

The [test.py](./tests/test.py) file contains **usage demonstrations**, basic tests, and basic benchmarks.
```bash
python tests/test.py
```

## [Optional] End-to-End Benchmarks

First, obtain an SSQR model checkpoint saved by our quantization program in the [quantization](../quantization) folder.

Then, install the additional dependencies.
```bash
pip install transformers==4.55.4 accelerate
```

Finally, run the end-to-end benchmark.
```bash
python tests/test_e2e.py --ckpt CKPT --do-convert DO_CONVERT --n-repeats N_REPEATS
```

*CKPT* is the path to the checkpoint folder.

*DO_CONVERT* selects the weight format: 0 for full-precision weights; 1 for low-bit weights with full-precision outliers; 2 for low-bit weights only.

*N_REPEATS* is the number of repeated runs for calculating the average.
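To compare the three *DO_CONVERT* settings on the same checkpoint, the benchmark can be driven from a small script. The following is a minimal sketch that uses only the flags shown above; the helper names `build_command` and `run_all_modes` are hypothetical, and it assumes the script is launched from this package's root folder.

```python
import subprocess
import sys

# The three DO_CONVERT settings described above.
MODES = {
    0: "full-precision weights",
    1: "low-bit weights and full-precision outliers",
    2: "low-bit weights only",
}


def build_command(ckpt, do_convert, n_repeats):
    """Build the argument list for one run of tests/test_e2e.py."""
    if do_convert not in MODES:
        raise ValueError(f"do_convert must be one of {sorted(MODES)}")
    return [
        sys.executable, "tests/test_e2e.py",
        "--ckpt", str(ckpt),
        "--do-convert", str(do_convert),
        "--n-repeats", str(n_repeats),
    ]


def run_all_modes(ckpt, n_repeats):
    """Run the end-to-end benchmark once per DO_CONVERT setting."""
    for mode, description in MODES.items():
        print(f"DO_CONVERT={mode}: {description}")
        # check=True stops the sweep if one benchmark run fails.
        subprocess.run(build_command(ckpt, mode, n_repeats), check=True)
```

For example, `run_all_modes("path/to/ckpt", 10)` would run the three settings in turn, where both the path and the repeat count are placeholders to replace with your own values.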
