<div align="center">
<h1> PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models</h1>  

</div>

# Usage

## Environment Setup

We recommend using conda for environment management. 

- Install the Flux prerequisites according to [Flux](https://huggingface.co/black-forest-labs/FLUX.1-dev).

- Install the prerequisites for quantization:
    - cd into `quant_utils/`
    - run `pip install -e .`

## Complete process

Run `main.sh` to generate images using sparse, quantized inference with 30 sampling steps.

The script first generates calibration data, which is then used to select the permute order and the sparse mask for sparse and quantized image generation.

For sparsity, adjust `max_threshold` and `sum_threshold` (under `sparse`) in `./configs/sparse.yaml` to control the final sparsity ratio. Both values are recommended to lie between `5.e-4` and `5.e-2`.
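Because both thresholds are edited by hand, a quick sanity check against the recommended range can catch typos before a long run. The following is a minimal sketch: `in_recommended_range` is a hypothetical helper that is not part of this repository, and it takes the value as an argument rather than reading the YAML file.

```bash
# Hypothetical helper (not part of this repository):
# succeeds if $1 lies in the recommended range [5e-4, 5e-2].
in_recommended_range() {
    # awk does the floating-point comparison; "+ 0" forces a numeric value.
    awk -v v="$1" 'BEGIN { exit !((v + 0) >= 5e-4 && (v + 0) <= 5e-2) }'
}

# Example: check a candidate value before writing it into the config.
if in_recommended_range 0.01; then
    echo "0.01 is within the recommended range"
fi
```

The range is a recommendation rather than a hard limit, so a failing check is a prompt to double-check the value, not necessarily an error.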

For permutation, change the parameters in `./configs/permute.yaml`. `max_threshold` and `sparse_percentage` influence which permute order is selected; the values `0.01` and `0.9`, respectively, were chosen to make the data more concentrated.

For quantization, change the options in `./configs/final/final_sparse_quant.yaml`.
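The step-by-step commands in the following sections reference several shell variables that this README does not define. A minimal sketch of setting them is shown here. Every value is a placeholder chosen for illustration (an assumption, not a repository default), except that 8 steps for `N_TIMESTEP` follows the example given in step 0.

```bash
# Placeholder values (assumptions); adjust them to your machine and files.
GPU_ID=0                          # GPU index passed to CUDA_VISIBLE_DEVICES
EXP_NAME=flux_paro_demo           # subdirectory name under ./logs/calib_data/
N_TIMESTEP=8                      # steps for the FP calibration run (step 0)
N_TIMESTEP_2=8                    # steps for the permuted export (step 2), assumed
PROMPT_PATH=./prompts/eval        # prompt file path without the .txt suffix
PROMPT_PATH_2=./prompts/calib     # calibration prompt file, without .txt
CALIB_DATA_NAME1=calib_fp         # saved as ./visualization/calib_data/<name>.pth
CALIB_DATA_NAME3=calib_permuted   # saved as ./visualization/calib_data/<name>.pth
```

The prompt variables are written without the `.txt` suffix because the commands append it themselves (for example `${PROMPT_PATH}.txt`).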

## 0. Calib Data Preparation

We first run Flux FP inference with a small number of steps (e.g., 8) to generate calib data (with 1/4 downsampling for H and W) for the attention map. The calib data is saved to `./visualization/calib_data/$CALIB_DATA_NAME1.pth`.

```bash
CUDA_VISIBLE_DEVICES=$GPU_ID python quant_inference.py \
	--quant-config ./configs/fp.yaml \
	--log ./logs/calib_data/$EXP_NAME \
	--num-sampling-steps $N_TIMESTEP  \
	--prompt ${PROMPT_PATH_2}.txt \
	--export-calib-data $CALIB_DATA_NAME1
```
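Each later step reads a file written by an earlier one, so it can help to confirm that the export succeeded before moving on. The following is a minimal sketch; `require_file` is a hypothetical helper, not something provided by this repository.

```bash
# Hypothetical helper (not part of this repository):
# returns non-zero if the given file is missing or empty.
require_file() {
    if [ ! -s "$1" ]; then
        echo "missing or empty: $1" >&2
        return 1
    fi
}

# Intended use after the export above (commented out so the sketch has no side effects):
# require_file "./visualization/calib_data/${CALIB_DATA_NAME1}.pth" || exit 1
```

The same check could be repeated after step 2 for `${CALIB_DATA_NAME3}.pth`.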

## 1. Generate the permute_plan with the exported calib_data

Select the appropriate permute scheme based on the calibration data. 

``` bash
CUDA_VISIBLE_DEVICES=$GPU_ID python get_permute_plan.py \
	 --config ./configs/permute.yaml \
	 --calib_data ./visualization/calib_data/$CALIB_DATA_NAME1.pth \
	 --log ./logs/calib_data/$EXP_NAME   
```

## 2. Export the downsampled permuted attn_map for sparse plan

Generate an attention map downsampled 4x in the last dimension, with the permute from the previous step applied; this map is used to generate the attention mask. The calib data is saved to `./visualization/calib_data/$CALIB_DATA_NAME3.pth`.

``` bash
CUDA_VISIBLE_DEVICES=$GPU_ID python quant_inference.py \
	--quant-config ./configs/permute.yaml \
	--log ./logs/calib_data/$EXP_NAME \
	--num-sampling-steps $N_TIMESTEP_2  \
	--prompt ${PROMPT_PATH_2}.txt \
	--export-calib-data ${CALIB_DATA_NAME3} # the calib_data name is specified in the config
```

## 3. Generate the sparse_plan 
 
Generate the sparse mask from the permuted calib data exported in the previous step.

``` bash
CUDA_VISIBLE_DEVICES=$GPU_ID python get_sparse_plan.py \
	  --config ./configs/sparse.yaml \
	  --calib_data ./visualization/calib_data/${CALIB_DATA_NAME3}.pth \
	  --log ./logs/calib_data/$EXP_NAME  
```

## 4. Final inference with exported sparse_mask

#### 4.1 Sparse-only

With `--quant-config ./configs/sparse.yaml`, images are generated using the permute plan and sparse masks, without quantization. `--num-sampling-steps` sets the number of sampling steps used for image generation. `--prompt` specifies the prompt `.txt` file.

``` bash
CUDA_VISIBLE_DEVICES=$GPU_ID python quant_inference.py \
		--quant-config ./configs/sparse.yaml \
		--log ./logs/calib_data/${EXP_NAME} \
		--num-sampling-steps 30  \
		--prompt ${PROMPT_PATH}.txt  
```

#### 4.2 Sparse + Quant

With `--quant-config ./configs/final/final_sparse_quant.yaml`, images are generated using the permute plan, sparse masks, and quantization. `--num-sampling-steps` sets the number of sampling steps used for image generation. `--prompt` specifies the prompt `.txt` file.

``` bash
CUDA_VISIBLE_DEVICES=$GPU_ID python quant_inference.py \
		--quant-config ./configs/final/final_sparse_quant.yaml \
		--log ./logs/calib_data/${EXP_NAME} \
		--num-sampling-steps 30  \
		--prompt ${PROMPT_PATH}.txt 
```
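The two final-inference commands differ only in the `--quant-config` path, so they can be driven from one loop when comparing sparse-only and sparse + quant outputs. The following is a sketch under stated assumptions: `run`, `run_final_inference`, and `run_both_final_configs` are hypothetical helpers (the repository's `main.sh` may be organized differently), and the fallback values for `GPU_ID`, `EXP_NAME`, and `PROMPT_PATH` are placeholders.

```bash
# Hypothetical helpers (not part of this repository).

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: path to the quant config to use for final inference.
run_final_inference() {
    run env CUDA_VISIBLE_DEVICES="${GPU_ID:-0}" python quant_inference.py \
        --quant-config "$1" \
        --log "./logs/calib_data/${EXP_NAME:-demo}" \
        --num-sampling-steps 30 \
        --prompt "${PROMPT_PATH:-./prompts/eval}.txt"
}

# Run sparse-only first, then sparse + quant, with identical settings.
run_both_final_configs() {
    for cfg in ./configs/sparse.yaml ./configs/final/final_sparse_quant.yaml; do
        run_final_inference "$cfg"
    done
}
```

Calling `DRY_RUN=1 run_both_final_configs` prints the two commands without launching inference, which is a cheap way to check paths before a full run.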


