The demo folder contains several demos that help visualize the results.

- Failure case of existing T2I video generative models.
- Visualization of Block-wise disentanglement.
- Visualization of Component-wise disentanglement.

# Failure case of existing T2I video generative models

We generate videos using the official online versions of [Kling](https://app.klingai.com/cn/image-to-video/frame-mode/new?ra=4) and [Wan](https://wan.video/wanxiang/videoCreation). The resulting videos are saved as listed in the table below:

| Model \ Prompt | While this person was speaking, the head gradually shifted from the middle to the right. | The camera gradually moves to the right to provide a wider field of view. | The red bag moves from the center toward the upper-left corner. |
|--------|--------|--------|--------|
|Kling|```./demo/1_kling_face.jpg```||```./demo/1_kling_bag.jpg```|
|Wan|```./demo/1_wan_face.jpg```|```./demo/1_wan_camera.jpg```||

### While this person was speaking, the head gradually shifted from the middle to the right.

As shown in ```./demo/1_kling_face.gif```, Kling barely moves the position of the head.

![1_kling_face](./demo/1_kling_face.gif)

For Wan, in ```./demo/1_wan_face.gif```, the camera zooms out, which the prompt does not request.

![1_wan_face](./demo/1_wan_face.gif)

These two videos are the same as those shown in Figure 1 (line 39) of the manuscript.

### The camera gradually moves to the right to provide a wider field of view.

For Wan, in ```./demo/1_wan_camera.gif```, a camera object appears in the frame, rather than the viewpoint itself moving.

![1_wan_camera](./demo/1_wan_camera.gif)

### The red bag moves from the center toward the upper-left corner.

For Kling in video ```./demo/1_kling_bag.gif```, the bag appears to be spinning in place rather than translating toward the upper-left corner as described in the prompt.

![1_kling_bag](./demo/1_kling_bag.gif)

# Visualization of Block-wise disentanglement

Following the experimental setting described in the manuscript (Block-wise Disentanglement, Section 5.3), we compare CoVoGAN with baseline models on the FaceForensics, RealEstate, and SkyTimelapse datasets. Each column corresponds to a model, while each row shows a sample. The same modification is applied across different samples to highlight consistency and controllability.



- FaceForensics: ```./demo/2_FaceForensics.gif```
![FaceForensics](./demo/2_FaceForensics.gif)

- RealEstate: ```./demo/2_RealEstate.gif```
![RealEstate](./demo/2_RealEstate.gif)

- SkyTimelapse: ```./demo/2_SkyTimelapse.gif```
![SkyTimelapse](./demo/2_SkyTimelapse.gif)
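
The grid layout used in these GIFs (one column per model, one row per sample) can be sketched as follows. This is only an illustration with random dummy frames, not the script used to produce the demos; the function name `tile_grid` and all array sizes are assumptions.

```python
import numpy as np

def tile_grid(videos):
    """Tile clips of shape (rows, cols, T, H, W, C) into one video
    of shape (T, rows*H, cols*W, C)."""
    rows, cols, t, h, w, c = videos.shape
    # Put time first, then pair (row, height) and (col, width)
    # so that a single reshape merges each pair into one axis.
    return videos.transpose(2, 0, 3, 1, 4, 5).reshape(t, rows * h, cols * w, c)

# Dummy data (hypothetical sizes): 3 samples (rows) x 2 models (columns),
# 4 frames of 8x8 RGB each.
rng = np.random.default_rng(0)
videos = rng.integers(0, 256, size=(3, 2, 4, 8, 8, 3), dtype=np.uint8)

grid = tile_grid(videos)
print(grid.shape)
```

Each frame of `grid` can then be written out with any GIF or video writer; the cell at row `r`, column `c` holds the clip for sample `r` generated by model `c`.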

# Visualization of Component-wise disentanglement

Following the experimental setting described in the manuscript (Component-wise Disentanglement, Section 5.3, Figure 5), we demonstrate that adjusting individual dimensions of $z^s$ enables disentangled control over different concepts in the video. In ```./demo/3_FaceForensics.gif```, each row shows a different sample. The same modification is applied across samples: the first column shows the unmodified videos; in the second column, we modify the dimension controlling eye blinking; in the third column, we modify both the dimension controlling eye blinking and another dimension controlling head movement to the right.

![Component FaceForensics](./demo/3_FaceForensics.gif)
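
The three-column editing scheme described above can be sketched as follows. This is a minimal illustration on random latents, not the CoVoGAN code: the helper `edit_latent`, the dimension indices `BLINK_DIM` and `HEAD_DIM`, the offset of `2.0`, and the latent size are all hypothetical. In practice, each edited latent would be passed to the trained generator to produce the corresponding video.

```python
import numpy as np

def edit_latent(z_s, edits):
    """Return a copy of z_s with the same offset added to the chosen
    dimensions of every sample. `edits` maps dimension index -> offset."""
    z_edit = np.array(z_s, dtype=float, copy=True)
    for dim, delta in edits.items():
        z_edit[:, dim] += delta
    return z_edit

# Hypothetical setup: 4 samples with an 8-dimensional z^s.
rng = np.random.default_rng(0)
z_s = rng.standard_normal((4, 8))

# Hypothetical indices; the actual dimensions depend on the trained model.
BLINK_DIM, HEAD_DIM = 2, 5

columns = [
    z_s,                                                # column 1: unmodified
    edit_latent(z_s, {BLINK_DIM: 2.0}),                 # column 2: eye blinking
    edit_latent(z_s, {BLINK_DIM: 2.0, HEAD_DIM: 2.0}),  # column 3: blinking + head movement
]
```

Applying the same offsets to every row is what makes the columns comparable across samples: any visible difference between columns comes only from the edited dimensions.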

We also train our model on a toy dataset containing a single moving bag. As shown in ```./demo/3_bag.gif```, three different dimensions independently control zoom in/out (row 1), horizontal movement (left/right, row 2), and vertical movement (up/down, row 3) in an unsupervised manner.

![Bag](./demo/3_bag.gif)