### Installation
```bash
# Editable install with verbose output
pip install -e . -vvv

# Same, with debug mode enabled
DEBUG_MODE=1 pip install -e . -vvv
```

### Usage
```python
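# Option 1: call the extension module directly
import fastgemv_lib
func = fastgemv_lib.fast_gemv

# Option 2: go through torch.ops (torch must be imported)
import torch
from fastgemv import *
func = torch.ops.pre_quant.fast_gemv

# FIXME: keep `zero_shift` at 0 for now.
zero_shift = 0
# The packed tensors and the two tuning parameters are supplied by the caller.
func(uint4_input_packed, uint4_weight_packed, zero_shift, num_threads_per_row, cols_per_warp)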
```
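
The call above takes packed uint4 operands, but this README does not describe the packing layout. As a rough illustration only, the sketch below packs two 4-bit values per byte with the low nibble first and computes a plain integer matrix-vector product for comparison. The nibble order and the helper names `pack_uint4`, `unpack_uint4`, and `reference_gemv` are assumptions made for this example and are not part of the library; check the kernel source for the layout it actually expects.

```python
# Hypothetical helpers: the nibble order (low nibble first) is an assumption,
# not something this README specifies.

def pack_uint4(values):
    """Pack an even-length sequence of ints in [0, 15] into bytes, two per byte."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of 4-bit values")
    if any(v < 0 or v > 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    # Element i goes to the low nibble, element i + 1 to the high nibble.
    return bytes(values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2))


def unpack_uint4(packed):
    """Inverse of pack_uint4: split each byte into its low and high nibble."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


def reference_gemv(packed_weight_rows, packed_input):
    """Plain integer matrix-vector product over the unpacked 4-bit values."""
    x = unpack_uint4(packed_input)
    return [
        sum(w * xi for w, xi in zip(unpack_uint4(row), x))
        for row in packed_weight_rows
    ]
```

A slow reference like this can serve as a correctness check on small inputs, provided the packing matches what the kernel uses.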

### Reference
The implementation is adapted from the following repos:
- https://github.com/wangsiping97/FastGEMV
- https://github.com/Bruce-Lee-LY/cuda_hgemv
- https://github.com/mobiusml/gemlite