# QuTLASS v0.1
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-yellow.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/)
[![CUDA 12.8](https://img.shields.io/badge/CUDA-12.8-green.svg)](https://developer.nvidia.com/cuda-toolkit)
[![Static Badge](https://img.shields.io/badge/CUTLASS-3.9-purple)](https://github.com/NVIDIA/cutlass)
[![Static Badge](https://img.shields.io/badge/PyTorch-2.8-red)](https://download.pytorch.org/whl/nightly/cu128)

QuTLASS is a collection of CUTLASS-based template abstractions for low-precision Basic Linear Algebra Subprograms (BLAS), aimed at quantized deep learning models.

## Getting Started

### Requirements

- **NVIDIA Blackwell architecture GPU** (Compute capability supported: `sm_120a`)
- **CUDA 12.8** compatible drivers
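
A quick way to confirm the GPU requirement before building is to query the device's compute capability from PyTorch. The sketch below is not part of QuTLASS: the helper names `is_supported_capability` and `check_gpu` are hypothetical, and it assumes that `sm_120a` corresponds to compute capability 12.0 as reported by `torch.cuda.get_device_capability()`.

```python
def is_supported_capability(major: int, minor: int) -> bool:
    """Return True for compute capability 12.0 (assumed to match sm_120a)."""
    return (major, minor) == (12, 0)


def check_gpu() -> bool:
    """Return True if the current CUDA device looks like a supported GPU."""
    # Imported lazily so the pure helper above has no dependencies.
    import torch

    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return is_supported_capability(major, minor)
```

Calling `check_gpu()` after installing PyTorch returns `False` on machines without CUDA or with a different architecture.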

### Installation

1. Install PyTorch nightly with CUDA 12.8 support:

```bash
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
```

2. Install QuTLASS in editable mode:

```bash
pip install --no-build-isolation -e .
```

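## Try it out

```python
import torch

from qutlass import matmul_mxf4_bf16_tn, fusedQuantize
from fast_hadamard_transform import hadamard_transform

# NOTE: `_dq_fp4` (reference dequantization) and `to_blocked` (scale layout
# conversion) are helpers that this snippet neither defines nor imports;
# make them available before running.

# Assumed settings: the original snippet leaves `dtype` and `device` undefined.
dtype = torch.bfloat16
device = "cuda"

m, n, k = 4096 * 3, 4096 * 2, 4096
a = torch.randn(m, k, dtype=dtype, device=device) * 25.
b = torch.randn(n, k, dtype=dtype, device=device) * 25.

# Normalized 32x32 Hadamard matrix, built by transforming the identity.
hadamard_matrix = hadamard_transform(torch.eye(32, dtype=dtype, device=device), scale=32. ** -.5)

# Quantize both operands: packed E2M1 values plus E8M0 scales.
a_e2m1, a_e8m0, clip_mask = fusedQuantize(a, hadamard_matrix)
b_e2m1, b_e8m0, clip_mask = fusedQuantize(b, hadamard_matrix)

# Reference result: dequantize, then multiply at full precision.
a_dq, *_ = _dq_fp4(a_e2m1, a_e8m0, alpha=1.)
b_dq, *_ = _dq_fp4(b_e2m1, b_e8m0, alpha=1.)
out_ref = a_dq @ b_dq.transpose(-2, -1)

# The kernel takes its scale factors in a blocked layout.
a_scale_block = to_blocked(a_e8m0)
b_scale_block = to_blocked(b_e8m0)

out = matmul_mxf4_bf16_tn(a_e2m1, b_e2m1, a_scale_block, b_scale_block, 1.)
assert out.equal(out_ref.to(dtype=out.dtype))
```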
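
To give a rough idea of what the `*_e2m1` and `*_e8m0` tensors hold, here is a minimal pure-Python sketch of MXFP4-style block quantization: 32 values are rotated by a normalized Hadamard matrix, then stored as E2M1 magnitudes with one shared power-of-two scale. It is an illustration under stated assumptions, not the QuTLASS kernel. The function names are hypothetical, the scale rule (largest magnitude mapped into [4, 8)) and the tie-breaking are simplifications, and treating `saturated` as analogous to `clip_mask` is only a guess.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative E2M1 magnitudes
BLOCK = 32  # elements sharing one scale in an MX block


def hadamard(n):
    """Normalized n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    s = n ** -0.5
    return [[x * s for x in row] for row in h]


def quantize_block(block):
    """Return (scale exponent, E2M1 values, saturation flags) for one block."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block), [False] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    exp = (e - 1) - 2  # E2M1's largest exponent is 2; E8M0 would store exp + 127
    scale = 2.0 ** exp
    values, saturated = [], []
    for x in block:
        mag = abs(x) / scale
        saturated.append(mag > E2M1_GRID[-1])
        mag = min(mag, E2M1_GRID[-1])
        # Nearest grid point; ties go to the smaller magnitude in this sketch.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        values.append(math.copysign(nearest, x))
    return exp, values, saturated


def dequantize_block(exp, values):
    """Undo the shared scale: value * 2**exp."""
    return [v * 2.0 ** exp for v in values]


# Usage: rotate one block, quantize it, and reconstruct an approximation.
h = hadamard(BLOCK)
x = [float(i - 16) for i in range(BLOCK)]
rotated = [sum(h[i][j] * x[j] for j in range(BLOCK)) for i in range(BLOCK)]
exp, q, sat = quantize_block(rotated)
rotated_hat = dequantize_block(exp, q)
```

Because the normalized Hadamard matrix is orthogonal, applying the same rotation to both operands leaves the product `a @ b.T` unchanged in exact arithmetic, which is presumably why the example passes the same `hadamard_matrix` to both `fusedQuantize` calls.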