<div align="center" style="font-family: charter;">
<h1> QVGen:<br> Pushing the Limit of Quantized Video Generative Models</h1>
</div>

This is the official implementation of our paper **QVGen**. It is *the first* method to reach quality comparable to full precision under 4-bit settings, and it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of $+25.28$ in Dynamic Degree and $+8.43$ in Scene Consistency on VBench.

## 🎬 Visual Examples

<div align=center>
<table align="center" width="720" height="400">
<tr>
  <td align="center"><video src="./assets/BF16.mp4" width="200" height="180"></video></td>
  <td align="center"><video src="./assets/QVGen.mp4" width="200" height="180"></video></td>
  <td align="center"><video src="./assets/EfficientDM.mp4" width="200" height="180"></video></td>
</tr>
<tr>
  <th align="center">BF16</th>
  <th align="center">W4A4 QVGen</th>
  <th align="center">W4A4 EfficientDM</th>
</tr>
<tr>
<td align="center"><video src="./assets/Q-DM.mp4" width="200" height="180"></video></td>
  <td align="center"><video src="./assets/LSQ.mp4" width="200" height="180"></video></td>
  <td align="center"><video src="./assets/SVDQuant.mp4" width="200" height="180"></video></td>
</tr>
<tr>
  <th align="center">W4A4 Q-DM</th>
  <th align="center">W4A4 LSQ</th>
  <th align="center">W4A4 SVDQuant</th>
</tr>
</table>

<p>
<h align="justify">  <small>"In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face
is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has
lost its innocence to the ravages of conflict."</small>
</h>
</p>

<h align="justify">  Comparison of samples generated by CogVideoX-2B. Our approach, QVGen, far
outperforms previous PTQ (<i>i.e.</i>, SVDQuant) and QAT (<i>i.e.</i>, EfficientDM, Q-DM, and LSQ) methods.
</h>
</div>

## 📖 Overview

<div align="center" style="font-family: charter;">

<img src=./assets/overview.png width="80%"/>

<h align="justify"><strong>Overview of the proposed QVGen pipeline.</strong> (a) The framework integrates auxiliary modules $\Phi$ to improve training convergence. (b) To maintain performance while eliminating the inference overhead induced by $\Phi$, we design a <i>rank-decay</i> schedule that progressively shrinks the entire $\Phi$ to $\varnothing$ by <i>iteratively applying</i> the following two strategies: (<i>i</i>) SVD to identify the low-impact components in $\Phi$; (<i>ii</i>) a rank-based regularization $\mathbf{\gamma}$ to decay the identified components to $\varnothing$.
</h>

</div> 
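
The rank-decay idea in (b) can be illustrated with a small NumPy sketch. This is not the code from this repository. It assumes that $\Phi$ is a single low-rank linear map, that half of the remaining components are decayed in each phase, and that $\mathbf{\gamma}$ follows a cosine decay from 1 to 0. The names `split_by_svd`, `gamma_mask`, `rank_decay`, `steps_per_phase`, and `shrink` are hypothetical, and the optimizer updates that would happen during real training are omitted.

```python
import numpy as np

def split_by_svd(w_phi, rank):
    """Factor w_phi into (left, right), components ordered by singular value."""
    u, s, vt = np.linalg.svd(w_phi, full_matrices=False)
    left = u[:, :rank] * s[:rank]   # shape (d_out, rank)
    right = vt[:rank, :]            # shape (rank, d_in)
    return left, right

def gamma_mask(rank, n_decay, progress):
    """Scale per component: 1 for kept ones, cosine-decayed for the last n_decay."""
    u = 0.5 * (1.0 + np.cos(np.pi * progress))  # 1 -> 0 as progress goes 0 -> 1
    gamma = np.ones(rank)
    gamma[rank - n_decay:] = u
    return gamma

def rank_decay(w_phi, rank, steps_per_phase=4, shrink=0.5):
    """Yield (rank, effective matrix) while shrinking the module to rank 0."""
    left, right = split_by_svd(w_phi, rank)
    while rank > 0:
        # Assumption: decay a fixed fraction of the remaining components per phase.
        n_decay = max(1, int(np.ceil(rank * shrink)))
        for step in range(1, steps_per_phase + 1):
            gamma = gamma_mask(rank, n_decay, step / steps_per_phase)
            # Real training would also update left/right with the optimizer here.
            yield rank, (left * gamma) @ right
        # The decayed components now contribute nothing, so dropping them is lossless.
        rank -= n_decay
        if rank > 0:
            left, right = split_by_svd(left[:, :rank] @ right[:rank, :], rank)

# Toy usage: an 8x6 matrix stands in for the auxiliary module, starting at rank 4.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6))
history = list(rank_decay(w, rank=4))
ranks = [r for r, _ in history]
final = history[-1][1]
```

In this toy run, the rank shrinks phase by phase until the effective matrix is all zeros, which is what lets the auxiliary module be removed at inference time without extra cost.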

## ✨ Quick Start

After cloning the repository, follow the steps below to train the quantized model and run inference. We use Wan2.1 1.3B as the running example.

### Requirements

Clone the repository and install the requirements: `pip install -r requirements.txt`. The provided scripts assume $8\times$ H100/H800/A100/A800 GPUs for quantization; with a different setup, you may need to adjust the scripts.

### Prepare data and models

Download the pretrained [Wan2.1 1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers) to `models/Wan2.1-T2V-1.3B-Diffusers`. Then, run the following commands to download and preprocess the training data:
```shell
# download and preprocess data
python prepare_dataset/download_OpenVid.py --output_directory dataset
sh script/data/prepare_dataset.sh
```

### Training
The following scripts train the W4A4 and W3A3 models. More training details can be found in our paper.
```shell
# w4a4
sh script/train/w4a4.sh

# w3a3
sh script/train/w3a3.sh
```

### Inference
Here are the corresponding commands for inference.
```shell
# w4a4
sh script/inference/w4a4.sh

# w3a3
sh script/inference/w3a3.sh
```

### Evaluation
We recommend generating samples with our inference code and then following the evaluation steps in [VBench](https://github.com/Vchitect/VBench).

## 🤝 Acknowledgments

Our code is built on the open-source [finetrainers](https://github.com/huggingface/finetrainers/tree/main).
