Unveiling Temporal Telltales: Are Unconditional Video Generation Models Implicitly Encoding Temporal Information?

15 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Unconditional Video Generation; Video Generation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This paper uncovers how current video generation models inadvertently encode temporal information into frames, enabling accurate temporal classification by CNNs. We propose a method to eliminate this without compromising the FVD score.
Abstract: Unconditional video generation models appear to generate realistic videos. In this paper, however, we examine what 'realness' means for video generation models, noting that Convolutional Neural Networks (CNNs) were designed with inspiration from human visual neuroscience. Like human observers, we expected CNNs to struggle to classify the temporal location of a generated video from a single frame, since a single frame alone provides limited temporal information. Our preliminary experiments instead reveal that current unconditional video generation models inadvertently encode temporal location into each frame, enabling CNNs to classify the temporal location of generated videos correctly. To alleviate this problem, we propose adding a Gradient Reversal Layer (GRL) with a lightweight CNN to prior models, explicitly suppressing this implicitly encoded temporal information. The experimental results show that the implicit encoding of temporal information while training an unconditional video generator does negatively influence the FVD score. Moreover, experiments on diverse prior video generation models and datasets show that our approach works in a plug-and-play manner. The results also show that the implicitly encoded temporal information is successfully eliminated without compromising the FVD score, highlighting the need to consider temporal classification accuracy as a supplementary metric for video generation models.
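The Gradient Reversal Layer mentioned in the abstract can be sketched in a framework-agnostic way: in the forward pass it is the identity, so the auxiliary temporal classifier sees the frame features unchanged, while in the backward pass it multiplies the incoming gradient by a negative factor, turning the classifier's loss into an adversarial signal for the generator. The following minimal NumPy sketch illustrates only these two-pass semantics; the function names `grl_forward`/`grl_backward` and the scaling factor `lam` are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def grl_forward(x):
    # Forward pass: the GRL is the identity, so the temporal
    # classifier receives the frame features unchanged.
    return x

def grl_backward(grad_output, lam=1.0):
    # Backward pass: the gradient flowing from the temporal
    # classifier back into the generator is reversed (and scaled
    # by lam), so the generator is pushed to produce frames from
    # which temporal location cannot be recovered.
    return -lam * grad_output

# Illustrative check of the two passes.
x = np.array([0.5, -1.2, 3.0])
assert np.allclose(grl_forward(x), x)  # identity in the forward pass
g = np.array([0.1, 0.2, -0.3])
assert np.allclose(grl_backward(g, lam=2.0), [-0.2, -0.4, 0.6])
```

In an autodiff framework this would typically be written as a custom operator whose backward rule negates the gradient, inserted between the generator's frame features and the lightweight temporal-classification head.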
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14