This page is supplemental material, including example videos generated by text-to-video diffusion models. The contents of README.md and README.html are identical.
We investigate a recipe for improving dynamic object interactions in text-to-video models by leveraging external feedback. We first generate videos from the pre-trained models, and then annotate the generated videos with AI feedback and reward labels. For the choice of feedback, we test metric-based feedback on semantics, human preference, and dynamics, and also propose leveraging binary feedback obtained from large-scale VLMs capable of video understanding (such as Gemini, GPT). These labeled data are then used for offline and iterative RL fine-tuning.
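The labeling step described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function and field names (`combine_rewards`, `label_dataset`, `VideoSample`, the equal weighting) are assumptions for exposition, and the metric and VLM scorers are passed in as callables.

```python
# Hedged sketch of the reward-labeling step: combine metric-based
# feedback (semantics, preference, dynamics) with binary VLM feedback
# into a scalar reward per generated video, yielding an offline RL dataset.
# All names here are illustrative, not the repo's actual API.

from dataclasses import dataclass


@dataclass
class VideoSample:
    prompt: str
    video_id: str


def combine_rewards(semantic: float, preference: float, dynamics: float,
                    vlm_binary: int,
                    weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of metric scores and binary VLM feedback (0 or 1)."""
    scores = (semantic, preference, dynamics, float(vlm_binary))
    return sum(w * s for w, s in zip(weights, scores))


def label_dataset(samples, metric_fn, vlm_fn):
    """Attach a reward label to each generated video.

    metric_fn(sample) -> (semantic, preference, dynamics) scores
    vlm_fn(sample)    -> 0/1 binary feedback from a video-capable VLM
    """
    labeled = []
    for s in samples:
        sem, pref, dyn = metric_fn(s)   # metric-based feedback
        ok = vlm_fn(s)                  # binary VLM feedback
        labeled.append((s, combine_rewards(sem, pref, dyn, ok)))
    return labeled
```

The resulting `(sample, reward)` pairs form the dataset consumed by the offline RL-finetuning stage; in the iterative variant, this labeling loop is re-run on videos from each newly finetuned model.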