# LaCT Novel View Synthesis

## Contains
- [x] LaCT NVS codebase
- [x] Object-level model checkpoints
- [x] Scene-level model checkpoints
- [x] Object-level inference example
- [x] Scene-level inference example
- [x] Training code and script

## Environment Setup
Install the Python dependencies:
```bash
pip install -r requirement.txt
```

Install `ffmpeg` to save rendering results as mp4 videos:
```bash
sudo apt install ffmpeg
```
If `ffmpeg` cannot be installed, change `*_turntable.mp4` to `*_turntable.gif` in the code; the results will then be saved as GIFs, at the cost of larger files.
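
The fallback could also be chosen automatically. The sketch below is a hypothetical helper, not part of this codebase: the name `turntable_path` and its arguments are illustrative assumptions. It picks the output extension depending on whether an `ffmpeg` executable is found on `PATH`.

```python
import shutil
from typing import Optional


def turntable_path(prefix: str, ffmpeg_available: Optional[bool] = None) -> str:
    """Return the turntable output path, preferring mp4 when ffmpeg exists.

    Hypothetical helper: the name and signature are illustrative and are
    not taken from this repository.
    """
    if ffmpeg_available is None:
        # shutil.which returns None when the executable is not on PATH.
        ffmpeg_available = shutil.which("ffmpeg") is not None
    suffix = "mp4" if ffmpeg_available else "gif"
    return f"{prefix}_turntable.{suffix}"
```

Passing `ffmpeg_available` explicitly overrides the detection, which is convenient for forcing GIF output on a machine that does have `ffmpeg`.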



*Checkpoints*: not included in this submission, to preserve anonymity.


## Inference


### Object-level inference
Run inference with the example 512-resolution data in [data_example](/data_example/):
```bash
# Resolution 256:
python inference.py --load weight/obj_res256.pt --image_size 256 256 --data_path data_example/gso_sample_data_path.json

# Resolution 512:
python inference.py --load weight/obj_res512.pt --image_size 512 512 --data_path data_example/gso_sample_data_path.json
```
The commands output example videos like the following (at 512 resolution):
<p align="center">
  <img src="data_example/gso_character_inference_demo.gif" alt="Example Inference Result">
</p>

Note: the checkpoints are the same as those used in the paper, while the code and data have been rewritten. For best inference performance, uniform view selection is preferred. The current example selects views at random to demonstrate robustness.
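
As an illustration of what uniform view selection could look like, here is a minimal sketch. It assumes the views are stored in order along the camera trajectory (for example a turntable), so evenly spaced indices give evenly spread viewpoints. The function name `uniform_view_indices` is hypothetical and is not the selection code used by `inference.py`.

```python
def uniform_view_indices(num_views: int, num_selected: int) -> list:
    """Pick `num_selected` indices spread evenly over `range(num_views)`.

    Assumption: consecutive indices correspond to neighboring cameras.
    """
    if not 0 < num_selected <= num_views:
        raise ValueError("need 0 < num_selected <= num_views")
    # Split the sequence into num_selected equal bins and take each bin's
    # midpoint; integer arithmetic avoids floating-point rounding surprises.
    return [
        (2 * i + 1) * num_views // (2 * num_selected)
        for i in range(num_selected)
    ]
```

For example, `uniform_view_indices(8, 4)` returns `[1, 3, 5, 7]`.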


### Scene-level inference
First, download the DL3DV benchmark data samples (i.e., the test split, which is not used in training):


After downloading the samples, run:
```bash
python data_preprocess/dl3dv_format_converter.py
```

Run inference with the example data in [data_example](/data_example/):
```bash
# Resolution 256 x 256, Num input views 128:
python inference.py \
--load weight/scene_res256x256.pt \
--config config/lact_l24_d768_ttt2x.yaml \
--image_size 256 256 \
--scene_inference \
--num_all_views 136 --num_input_views 128 \
--data_path data_example/dl3dv_sample_data_path.json 

# Resolution 72 x 128, Num input views 256:
python inference.py \
--load weight/scene_res72x128.pt \
--config config/lact_l24_d768_ttt4x.yaml \
--image_size 72 128 \
--scene_inference \
--num_all_views 300 --num_input_views 256 \
--data_path data_example/dl3dv_sample_data_path.json 

# Resolution 536 x 960, Num input views 48:
python inference.py \
--load weight/scene_res536x960.pt \
--config config/lact_l24_d768_ttt4x.yaml \
--image_size 536 960 \
--scene_inference \
--num_all_views 52  --num_input_views 48 \
--data_path data_example/dl3dv_sample_data_path.json 
```

Compared to object-level inference, scene-level inference adds the `--scene_inference` option. Because scene poses can be unbounded, this option normalizes the camera poses to a more regular range. For simplicity, this codebase normalizes the train and test views together; feel free to normalize using the train views only. The option also changes the camera trajectory of the output video, which now interpolates over the input views.
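
To make the normalization idea concrete, the sketch below shows one common scheme: shift the camera centers to zero mean and rescale so the farthest reference camera sits at radius 1. This is an assumption for illustration only; the actual normalization behind `--scene_inference` may differ, and the function name and the `reference` argument are hypothetical. Passing only the input views as `reference` corresponds to the "train only" variant mentioned above.

```python
import math


def normalize_camera_centers(centers, reference=None):
    """Center and rescale camera positions given as (x, y, z) triples.

    The statistics (mean and radius) come from `reference` when given,
    otherwise from `centers` itself. Illustrative sketch, not the
    repository's implementation.
    """
    ref = centers if reference is None else reference
    n = len(ref)
    mean = [sum(c[k] for c in ref) / n for k in range(3)]
    # Largest distance of any reference camera from the mean position.
    radius = max(
        math.sqrt(sum((c[k] - mean[k]) ** 2 for k in range(3))) for c in ref
    )
    scale = 1.0 / radius if radius > 0 else 1.0
    return [[(c[k] - mean[k]) * scale for k in range(3)] for c in centers]
```

In a full pipeline the same shift and scale would be applied to the translation part of each camera-to-world matrix, leaving the rotations unchanged.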

## Training Script 

We are still working on providing example training data.
The current training code is provided for reference.

Note:
1. The TTT model needs the `--compile` option to reach its best performance (about a 1.4x to 1.5x speedup). However, compilation during the first training step can take from about 30 seconds to 2 minutes, so we recommend removing the option when debugging the code.
2. Please follow `train.py` when adding activation checkpointing; otherwise, it can hurt the compilation of the backward pass.

From scratch:
```bash
torchrun \
--nproc_per_node=8 \
--standalone \
train.py --config config/lact_l14_d768_ttt2x.yaml --actckpt
```
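
Since the notes above suggest dropping `--compile` while debugging, a small launcher can make that toggle explicit. The following is a sketch under assumptions: `build_train_command` is a hypothetical helper, and it assumes `--compile` is passed to `train.py` in the same way as `--actckpt`.

```python
def build_train_command(config, nproc_per_node=8, compile_model=True, actckpt=True):
    """Assemble the torchrun argument list for train.py.

    Hypothetical helper; mirrors the from-scratch command shown above.
    """
    cmd = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        "--standalone",
        "train.py",
        "--config", config,
    ]
    if actckpt:
        cmd.append("--actckpt")
    if compile_model:
        # Assumption: --compile is a train.py flag, placed like --actckpt.
        cmd.append("--compile")
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; setting `compile_model=False` skips the compilation delay of the first step during debugging.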



