# Controllable Video Generation with Provable Disentanglement




## Installation
We use Python 3.8 and PyTorch 1.8.2.
To create and activate the environment, run the following commands:
```
conda env create -f environment.yaml -p env
conda activate ./env
```
Alternatively, follow the [StyleGAN2-ADA instructions](https://github.com/NVlabs/stylegan2-ada-pytorch#requirements) to build with Docker.

## System requirements

Our codebase has the same system requirements as StyleGAN2-ADA: see them [here](https://github.com/NVlabs/stylegan2-ada-pytorch#requirements).
We trained all the 256x256 models on 4 A100 GPUs (40 GB each) for ~2 days.
Training time is very similar to that of [StyleGAN2-ADA](https://github.com/NVlabs/stylegan2-ada-pytorch) (even slightly faster).

## Training
### Dataset structure
The dataset should be a directory structured as:
```
dataset/
    video1/
        - frame1.jpg
        - frame2.jpg
        - ...
    video2/
        - frame1.jpg
        - frame2.jpg
        - ...
    ...
```
To incorporate labels from different datasets or use alternative directory formats, see the Python files with the `_dataset` suffix in the `training` directory.
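
As a quick sanity check before training, the layout above can be verified with a small script. This is a minimal sketch and not part of the repo: `list_videos`, `find_short_videos`, and the accepted extensions are hypothetical names and assumptions (the layout above only shows `.jpg` frames).

```python
from pathlib import Path

# Extensions treated as frames. This set is an assumption; the layout
# above only shows .jpg files.
FRAME_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_videos(dataset_root):
    """Map each video sub-directory name to its sorted list of frame paths."""
    root = Path(dataset_root).expanduser()
    videos = {}
    for video_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Sorting is lexicographic, so zero-padded frame names
        # (frame0001.jpg, ...) are needed to keep temporal order.
        frames = sorted(
            f for f in video_dir.iterdir()
            if f.is_file() and f.suffix.lower() in FRAME_EXTENSIONS
        )
        videos[video_dir.name] = frames
    return videos


def find_short_videos(videos, min_frames):
    """Return the names of videos with fewer than `min_frames` frames."""
    return [name for name, frames in videos.items() if len(frames) < min_frames]
```

For example, `find_short_videos(list_videos("dataset"), 8)` would list the videos that are too short for clips of 8 frames.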


### Training 
To train on a different dataset, set the `--dataset` command-line parameter (default: `re`, the RealEstate dataset) and adjust the dataset path in the corresponding dataset Python file.

The dataset name can be one of: RealEstate10K (`re`), KTH, FaceForensics (`ffs`), SkyTimelapse (`sky`), Lipreading, BagShiftSynthetic (`synthetic`), CelebV-HQ (`celebv`), UCF-101 (`ucf`).

For example, to train on the RealEstate dataset, find `training.re_dataset.RealEstate` and set the `path` parameter of the `RealEstate` class to the location of your dataset.



To train on RealEstate, run:
```
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --outdir=training-runs --dataset=re --gpus=4 --batch=32 --cond_mode=flow --flow_norm=1 --i_dim=4 --lambda_sparse=0.1 --vid_length=8 --channel=3

```
For single-GPU training, adjust the `--gpus` and `--batch` parameters:
```
CUDA_VISIBLE_DEVICES=0 python train.py --outdir=training-runs --dataset=re --gpus=1 --batch=8 --cond_mode=flow --flow_norm=1 --i_dim=4 --lambda_sparse=0.1 --vid_length=8 --channel=3

```
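
Both commands above use the same per-GPU batch size: 32 / 4 = 8 and 8 / 1 = 8. The sketch below shows that arithmetic for other GPU counts. The helper names are hypothetical, and the divisibility check is an assumption borrowed from StyleGAN2-ADA's behaviour; it may not match this repo's `train.py` exactly.

```python
def per_gpu_batch(total_batch, num_gpus):
    """Split the total batch (`--batch`) evenly across `--gpus` devices."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if total_batch < 1 or total_batch % num_gpus != 0:
        # Assumption: the total batch must be divisible by the GPU count.
        raise ValueError("total_batch must be positive and divisible by num_gpus")
    return total_batch // num_gpus


def scaled_total_batch(reference_batch, reference_gpus, num_gpus):
    """Total batch that keeps the reference per-GPU batch on `num_gpus` GPUs."""
    return per_gpu_batch(reference_batch, reference_gpus) * num_gpus


# Reference setting from the 4-GPU command above: --batch=32 --gpus=4.
print(scaled_total_batch(32, 4, 1))  # value to pass as --batch with --gpus=1
print(scaled_total_batch(32, 4, 2))  # value to pass as --batch with --gpus=2
```

Changing the total batch size changes the optimization dynamics, so results with fewer GPUs may differ from the 4-GPU setting.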



### Resume training
To resume a previous training run, add the `--resume` option with the path to a saved snapshot:
```
--resume=~/training-runs/<NAME>/network-snapshot-<INT>.pkl
```
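
To pick the most recent snapshot of a run automatically, something like the following could be used. It is a sketch that assumes snapshots follow the `network-snapshot-<INT>.pkl` naming shown above and sit directly inside the run directory; `latest_snapshot` is a hypothetical helper, not a function provided by this repo.

```python
import re
from pathlib import Path

# File-name pattern taken from the --resume example above.
SNAPSHOT_PATTERN = re.compile(r"^network-snapshot-(\d+)\.pkl$")


def latest_snapshot(run_dir):
    """Return the snapshot path with the largest <INT>, or None if none exist."""
    best_path = None
    best_step = -1
    for path in Path(run_dir).expanduser().iterdir():
        match = SNAPSHOT_PATTERN.match(path.name)
        if match is None:
            continue  # Skip logs, images, and other files in the run directory.
        step = int(match.group(1))
        if step > best_step:
            best_step = step
            best_path = path
    return best_path
```

The returned path can then be passed to `train.py` as `--resume=<path>`.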

### Inference
To sample from the model, launch the following command:
```
python generate.py --network_pkl /path/to/network-snapshot.pkl --num_videos 25 --as_grids true --save_as_mp4 true --fps 25 --video_len 128 --batch_size 25 --outdir /path/to/output/dir --truncation_psi 0.9
```
This samples 25 videos at 25 FPS and arranges them in a 5x5 grid, with a truncation factor of 0.9.
The number of frames per video is set by `--video_len` (128 in the command above).
Adjust the corresponding arguments to change these settings.
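
The relation between these arguments can be checked with a few lines of arithmetic. This sketch assumes that the grid is square (as in the 5x5 example) and that clip duration is simply the frame count divided by the frame rate; both helpers are hypothetical and not part of `generate.py`.

```python
import math


def grid_side(num_videos):
    """Side length of a square grid holding exactly `num_videos` videos."""
    side = math.isqrt(num_videos)
    if side * side != num_videos:
        # Assumption: only perfect squares fill a square grid without gaps.
        raise ValueError("num_videos must be a perfect square for a full grid")
    return side


def clip_duration_seconds(video_len, fps):
    """Playback duration of a clip with `video_len` frames at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return video_len / fps


# Values from the command above: --num_videos 25 --video_len 128 --fps 25.
print(grid_side(25))                    # grid side length
print(clip_duration_seconds(128, 25))   # duration in seconds
```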

### Training other baselines
For the other baselines used in the paper, we used their original implementations:
- [MoCoGAN](https://github.com/sergeytulyakov/mocogan)
- [MoCoGAN-HD](https://github.com/snap-research/MoCoGAN-HD)
- [DIGAN](https://github.com/sihyun-yu/digan)
- [VideoGPT](https://github.com/wilson1yan/VideoGPT)

## Data
Datasets can be downloaded here:
- RealEstate: https://google.github.io/realestate10k/
- LRW: https://www.robots.ox.ac.uk/~vgg/data/lip_reading/
- KTH: https://www.csc.kth.se/cvap/actions/
- SkyTimelapse: https://github.com/weixiong-ur/mdgan
- UCF: https://www.crcv.ucf.edu/data/UCF101.php
- FaceForensics: https://github.com/ondyari/FaceForensics
- MEAD: https://wywu.github.io/projects/MEAD/MEAD.html

We resize all datasets to 256x256 resolution (except for MEAD, which we resize to 1024x1024).
FaceForensics (FFS) was preprocessed with `/preprocess_ffs.py` to extract face crops.


## Evaluation
In this repo, we re-implemented two popular evaluation measures for video generation:
- [Fréchet Video Distance](https://arxiv.org/abs/1812.01717). For this, we re-implemented *perfectly* (up to numerical precision) the original TensorFlow version of the I3D model trained on Kinetics-400 and converted it to TorchScript. We set up [this comparison repo](https://github.com/universome/fvd-comparison) to demonstrate that it matches the official implementation.
- [Inception Score](https://arxiv.org/abs/1611.06624) (used only for UCF101). For this, we re-implemented *perfectly* (up to numerical precision) the original [Chainer version of the UCF101-finetuned C3D model](https://github.com/pfnet-research/tgan2) in PyTorch and converted it to TorchScript.
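
For reference, the Fréchet Video Distance follows the standard Fréchet distance between two Gaussians fitted to I3D features of real and generated videos. The notation below is our own shorthand, not taken from this repo: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the feature mean and covariance for real and generated videos, respectively.

$$
d^2 = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

Lower values indicate that the generated feature distribution is closer to the real one.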




## TODO
Code release TODO:
- [x] Jupyter notebook demos
- [x] Pre-trained checkpoints