This attachment is a copy of the code. It contains two folders and a screenshot:
1. The folder decoupelQ contains the Python code for the quantization;
2. The folder TensorRT-LLM contains the CUDA kernels. In this folder, we provide the W2 weight-only kernel, which is based on the official NVIDIA code (https://github.com/NVIDIA/TensorRT-LLM); the diff against that code is shown in the screenshot diff.jpeg;
3. The screenshot diff.jpeg shows our modifications relative to the official NVIDIA code.
