

<!-- GETTING STARTED -->

### 1. Getting Started

This repo implements **CAT-LVDM**, a corruption-aware training framework that improves the robustness of latent video diffusion models via structured noise (BCNI/SACN).

#### Requirements

```bash
conda create -n catlvdm python=3.8
conda activate catlvdm
pip install -r requirements.txt
```

Make sure your environment is compatible with `torch==2.1.2` built for CUDA 12.1 (`nvcc --version` should report 12.1).
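
As a quick sanity check after installation, the sketch below compares the installed versions against the ones named above. The helper names (`release_tuple`, `matches`, `report`) are illustrative and not part of this repository.

```python
import re

REQUIRED_TORCH = "2.1.2"  # torch version named in this README
REQUIRED_CUDA = "12.1"    # CUDA toolkit version named in this README


def release_tuple(version):
    """Turn '2.1.2+cu121' into (2, 1, 2), ignoring local build suffixes."""
    match = re.match(r"(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def matches(installed, required):
    """True when the installed release starts with the required release."""
    want = release_tuple(required)
    return release_tuple(installed)[: len(want)] == want


def report():
    """Print the installed torch and CUDA versions next to the expected ones."""
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return
    cuda = torch.version.cuda  # None for CPU-only builds
    torch_ok = matches(torch.__version__, REQUIRED_TORCH)
    cuda_ok = bool(cuda) and matches(cuda, REQUIRED_CUDA)
    print(f"torch {torch.__version__} (expected {REQUIRED_TORCH}):",
          "ok" if torch_ok else "mismatch")
    print(f"CUDA {cuda} (expected {REQUIRED_CUDA}):",
          "ok" if cuda_ok else "mismatch")
```

Calling `report()` inside the `catlvdm` environment prints one line per component; a `mismatch` does not necessarily mean the code will fail, only that the setup differs from the one described here.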

### 2. Checkpoints

We provide CAT-LVDM checkpoints trained under a range of structured corruption settings at six noise levels (2.5%, 5%, 7.5%, 10%, 15%, 20%). They are hosted at:  
👉 [https://huggingface.co/Chikap421/catlvdm-checkpoints](https://huggingface.co/Chikap421/catlvdm-checkpoints)

To download the **base model (ModelScope)** and optionally the CAT-LVDM checkpoints, run:

➡️ [`models/download.sh`](models/download.sh)

This script installs Git LFS and clones the required base model. To also download CAT-LVDM checkpoints, simply uncomment the final line in the script.
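
If you only need a few checkpoint folders, an alternative to cloning the whole repository is the `huggingface_hub` client. This is a sketch of that alternative, not what `models/download.sh` does. It assumes `huggingface_hub` is installed (`pip install huggingface_hub`), and the `local_dir` default is a hypothetical path.

```python
REPO_ID = "Chikap421/catlvdm-checkpoints"  # checkpoint repository linked above


def patterns_for(folders):
    """Build glob patterns that select whole checkpoint folders."""
    return [f"{name}/*" for name in folders]


def download(folders, local_dir="models/catlvdm"):
    """Fetch only the selected folders; `local_dir` is a hypothetical target."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=patterns_for(folders),
        local_dir=local_dir,
    )
```

For example, `download(["bcni_10", "swap_5"])` would fetch just those two folders and return the local path.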

---

#### 📊 Corruption Types in CAT-LVDM

CAT-LVDM introduces both **embedding-level** and **text-level** corruption methods to evaluate model robustness under structured noise. Each corruption scheme is applied across six corruption strengths (ρ = 2.5%, 5%, 7.5%, 10%, 15%, 20%).

Each folder on [Hugging Face](https://huggingface.co/Chikap421/catlvdm-checkpoints) follows the format `corruptiontype_strength`, e.g., `bcni_10`, `swap_5`.
The folder `results_2M_train` holds the clean (non-corrupted) training setup, with no embedding-level or text-level noise.
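
The naming scheme above can be parsed with a few lines of Python. This is a minimal sketch: `parse_folder` is a hypothetical helper, and how fractional strengths such as 2.5% are spelled in folder names is not stated here, so the code simply passes the suffix to `float`.

```python
CLEAN_FOLDER = "results_2M_train"  # clean baseline folder named above


def parse_folder(name):
    """Split a folder name into (corruption_type, strength_in_percent)."""
    if name == CLEAN_FOLDER:
        return ("clean", 0.0)
    # Split on the last underscore so the prefix may itself contain underscores.
    prefix, sep, strength = name.rpartition("_")
    if not sep or not prefix:
        raise ValueError(f"expected 'corruptiontype_strength', got {name!r}")
    return (prefix, float(strength))


print(parse_folder("bcni_10"))           # ('bcni', 10.0)
print(parse_folder("results_2M_train"))  # ('clean', 0.0)
```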

---

##### 🧬 Embedding-Level Corruptions

| Folder Prefix | Corruption Type                     | Description |
|---------------|-------------------------------------|-------------|
| `bcni`        | Batch-Centered Noise Injection      | Perturbs embeddings along intra-batch semantic axes. Encourages temporal coherence and semantic preservation. |
| `sacn`        | Spectrum-Aware Contextual Noise     | Injects spectral noise aligned with principal low-frequency components. |
| `gaussian`    | Isotropic Gaussian Noise            | Adds unstructured Gaussian noise to each dimension. |
| `uniform`     | Isotropic Uniform Noise             | Injects bounded uniform noise independently across dimensions. |
| `tani`        | Temporally-Aligned Noise Injection  | Aligns noise with motion direction across adjacent video frames. |
| `hscan`       | Hierarchical Spectral Corruption    | Applies multiscale spectral noise with SACN + Gaussian fusion. |
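
To make the difference between unstructured and batch-aware noise concrete, here is a small pure-Python sketch of three of the rows above. It is one plausible reading of the table descriptions, not the repository's implementation: the function names, the use of ρ as a direct scale factor, and the exact form of the batch-centered variant are all assumptions.

```python
import random


def gaussian_noise(batch, rho, rng):
    """Add independent Gaussian noise with standard deviation rho to each entry."""
    return [[x + rho * rng.gauss(0.0, 1.0) for x in row] for row in batch]


def uniform_noise(batch, rho, rng):
    """Add independent noise drawn uniformly from [-rho, rho] to each entry."""
    return [[x + rng.uniform(-rho, rho) for x in row] for row in batch]


def batch_centered_noise(batch, rho, rng):
    """Move each row along its deviation from the batch mean (BCNI-like sketch)."""
    dim = len(batch[0])
    mean = [sum(row[d] for row in batch) / len(batch) for d in range(dim)]
    out = []
    for row in batch:
        u = rng.uniform(-1.0, 1.0)  # one random step size per embedding
        out.append([x + rho * u * (x - m) for x, m in zip(row, mean)])
    return out


rng = random.Random(0)
embeddings = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]  # toy batch of two embeddings
print(gaussian_noise(embeddings, 0.1, rng))
print(batch_centered_noise(embeddings, 0.1, rng))
```

In this sketch the isotropic variants perturb every dimension, while the batch-centered variant leaves untouched any dimension where an embedding already equals the batch mean (the middle coordinate in the toy batch).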

---

##### ✏️ Text-Level Corruptions

| Folder Prefix | Corruption Type      | Description |
|---------------|----------------------|-------------|
| `add`         | Text Addition        | Randomly inserts new tokens into the prompt. |
| `remove`      | Text Removal         | Deletes tokens from the input text. |
| `replace`     | Text Replacement     | Replaces existing tokens with others sampled from the batch. |
| `swap`        | Text Swap            | Swaps positions of two tokens in the sequence. |
| `perturb`     | Text Perturbation    | Replaces tokens with visually or semantically noisy variants. |
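
The first four text-level operations can be sketched on a whitespace-tokenized prompt as follows. This is an illustration under stated assumptions rather than the repository's code: the helper names are hypothetical, `pool` stands in for tokens drawn from other prompts in the batch, and rounding ρ·length to at least one edit is a choice made here. `perturb` is omitted because the table does not specify how noisy variants are produced.

```python
import random


def n_edits(tokens, rho):
    """Number of tokens to touch: at least one when rho > 0 (an assumption)."""
    if rho <= 0 or not tokens:
        return 0
    return min(len(tokens), max(1, round(rho * len(tokens))))


def add_tokens(tokens, rho, pool, rng):
    """Insert tokens from `pool` at random positions."""
    out = list(tokens)
    for _ in range(n_edits(tokens, rho)):
        out.insert(rng.randint(0, len(out)), rng.choice(pool))
    return out


def remove_tokens(tokens, rho, rng):
    """Delete randomly chosen tokens."""
    drop = set(rng.sample(range(len(tokens)), n_edits(tokens, rho)))
    return [tok for i, tok in enumerate(tokens) if i not in drop]


def replace_tokens(tokens, rho, pool, rng):
    """Overwrite randomly chosen tokens with tokens from `pool`."""
    out = list(tokens)
    for i in rng.sample(range(len(tokens)), n_edits(tokens, rho)):
        out[i] = rng.choice(pool)
    return out


def swap_tokens(tokens, rho, rng):
    """Swap the positions of randomly chosen token pairs."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_edits(tokens, rho)):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


rng = random.Random(0)
prompt = "A camel walks across the desert".split()
pool = "A scientist works in a clean lab".split()  # tokens from another prompt
print(" ".join(swap_tokens(prompt, 0.2, rng)))
print(" ".join(replace_tokens(prompt, 0.2, pool, rng)))
```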


---

##### 📈 Benchmark Performance

To evaluate the robustness and generation quality of CAT-LVDM, we compare our models (BCNI and SACN) against existing state-of-the-art text-to-video diffusion models using the Fréchet Video Distance (FVD) metric. Lower FVD indicates better visual fidelity and temporal coherence.
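
For reference, FVD is the Fréchet distance between two Gaussians fitted to features of real and generated videos. With means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ of those features (the original FVD definition extracts them with a pretrained I3D network; the extractor used for the numbers here is not stated in this README), the standard form is:

$$
\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

The first term penalizes a shift in the average feature, the second a mismatch in feature spread and correlation, so identical distributions give a score of zero.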

<p align="center">
  <img src="assets/benchmark_fvd_scatter.jpg" width="800"/>
</p>

<p align="center"><i>Figure: FVD scores on MSR-VTT and UCF101 benchmarks comparing CAT-LVDM variants (BCNI, SACN) to other leading video generation models. Our methods consistently outperform existing baselines across datasets.</i></p>

---


### 3. Inference

To run inference with pre-trained CAT-LVDM checkpoints:

```bash
bash scripts/inference_deepspeed.sh
```

> Output videos are saved in `log_dir`, which is set in the config.

Prompts should be formatted as a CSV file:
```csv
id,prompt
1,A scientist works in a clean lab.
2,A camel walks across the desert.
```
We provide curated sample prompts in: [`prompts/sampled_captions.json`](prompts/sampled_captions.json)
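
A prompt file in the `id,prompt` layout shown above can be generated with Python's `csv` module. This is a minimal sketch: `write_prompts` is a hypothetical helper, and it relies on standard CSV quoting for prompts that contain commas, which assumes the inference loader reads quoted fields correctly.

```python
import csv
import io


def write_prompts(prompts, handle):
    """Write prompts as `id,prompt` rows, numbering them from 1."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "prompt"])
    for idx, text in enumerate(prompts, start=1):
        writer.writerow([idx, text])


# Demonstration with an in-memory buffer; pass an open file to write to disk.
buffer = io.StringIO()
write_prompts(
    ["A scientist works in a clean lab.", "A camel walks across the desert."],
    buffer,
)
print(buffer.getvalue())
```

To write to disk, open the target with `open(path, "w", newline="")` and pass the file object as `handle`.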

Configurable options are defined in [`configs/t2v_inference_deepspeed.yaml`](configs/t2v_inference_deepspeed.yaml).


### 4. Training

#### Dataset Setup

This repository supports training on the WebVid-2M training split, and inference on the WebVid-2M validation split, MSR-VTT, MSVD, and UCF101 datasets.

#### Training Command

```bash
bash scripts/train_deepspeed.sh
```

### 5. Multi-Corruption Parallel Training & Inference

To run training or inference in parallel across multiple corruption settings (e.g., BCNI, SACN), ablation variants, or noise levels, use the multi-run scripts below:

#### Parallel Training
Use the following script to launch multi-GPU training across various corruption settings:
```bash
bash scripts/multi_train.sh
```

#### Parallel Inference
Similarly, run multi-setting inference using:
```bash
bash scripts/multi_inference.sh
```

These scripts automatically adjust corruption schemes and noise parameters defined in:
- `configs/t2v_train_deepspeed.yaml` (for training)
- `configs/t2v_inference_deepspeed.yaml` (for inference)
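
To get a sense of how large a full sweep is, the sketch below enumerates every combination of corruption type and strength listed in this README. It only illustrates the size of the grid and does not reproduce what `multi_train.sh` or `multi_inference.sh` do internally; the `settings` helper is hypothetical.

```python
from itertools import product

# Folder prefixes and strengths as listed in the corruption tables above.
EMBEDDING_LEVEL = ["bcni", "sacn", "gaussian", "uniform", "tani", "hscan"]
TEXT_LEVEL = ["add", "remove", "replace", "swap", "perturb"]
STRENGTHS = [2.5, 5, 7.5, 10, 15, 20]  # percent


def settings(types=EMBEDDING_LEVEL + TEXT_LEVEL, strengths=STRENGTHS):
    """Return every (corruption_type, strength) pair in the sweep."""
    return list(product(types, strengths))


full = settings()
print(len(full))                             # 11 types x 6 strengths = 66
print(len(settings(types=["bcni", "sacn"])))  # 2 types x 6 strengths = 12
```

Restricting `types` or `strengths` gives the subset for a smaller ablation, for example only the BCNI and SACN runs.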

#### TensorBoard

```bash
tensorboard --logdir=tensorboard_log/catlvdm
```

> Logs are saved in `tensorboard_log/catlvdm/`.








<!-- ROADMAP -->
## TODO
- [x] Structured corruption injection
- [x] Full benchmark on 4 datasets
- [x] Model checkpoints release
- [x] arXiv upload
- [ ] Colab Demo
- [ ] Model evaluation





<!-- LICENSE -->
## License

Distributed under the MIT License. See [LICENSE.txt](./LICENSE.txt) for more information.






<!-- ACKNOWLEDGMENTS -->
## Acknowledgments

Our implementation is adapted from [DEMO](https://github.com/pr-ryan/DEMO) and [VGen](https://github.com/ali-vilab/VGen). We thank the authors for their open-source contributions.
