Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

This page is supplemental material containing example videos generated by text-to-video diffusion models. The contents of README.md and README.html are identical.

RL-Finetuning with AI Feedback

We investigate a recipe for improving dynamic object interactions in text-to-video models by leveraging external feedback. We first generate videos from the pre-trained model, then annotate the generated videos with AI feedback and reward labels. For the choice of feedback, we test metric-based feedback on semantics, human preference, and dynamics, and also propose leveraging binary feedback from large-scale VLMs capable of video understanding (such as Gemini and GPT). These labeled data are then used for offline and iterative RL-finetuning.
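The generate-label-finetune loop above can be sketched as follows. This is a minimal illustrative sketch, not the actual training code: `generate_videos`, `query_vlm_feedback`, and `finetune_step` are hypothetical stubs standing in for sampling from the diffusion model, querying a video-capable VLM for a binary reward, and running a gradient update, respectively.

```python
# Hypothetical sketch of iterative RL-finetuning with AI feedback.
# All function names here are illustrative stubs, not the authors' code.

import random


def generate_videos(model, prompts, n_per_prompt=2):
    # Stand-in for sampling videos from the (pre-trained or current) model.
    return [(p, f"video[{p}][{i}]") for p in prompts for i in range(n_per_prompt)]


def query_vlm_feedback(prompt, video):
    # Stand-in for binary feedback from a video-understanding VLM,
    # e.g. "does the object interaction in the video match the prompt?"
    return random.choice([0, 1])


def finetune_step(model, positive_samples):
    # Placeholder for an offline update of the diffusion model
    # on the positively rewarded (prompt, video) pairs.
    return model


def rl_finetune(model, prompts, n_rounds=2):
    dataset = []
    for _ in range(n_rounds):  # iterative rounds: regenerate, relabel, update
        samples = generate_videos(model, prompts)
        # Annotate each generated video with a binary reward label.
        labeled = [(p, v, query_vlm_feedback(p, v)) for p, v in samples]
        dataset.extend(labeled)
        # Simplest offline scheme: keep only positively rewarded samples
        # (a reward-filtered / reward-weighted regression update).
        positives = [(p, v) for p, v, r in labeled if r == 1]
        model = finetune_step(model, positives)
    return model, dataset
```

In this sketch the reward is binary, so the offline update reduces to filtering for positively labeled videos; a metric-based reward (semantics, preference, or dynamics scores) would instead weight each sample by its score.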

Example Videos

  1. Prompt: taking rose bud from bush
    • Pre-Trained
    • RL-Finetuned (AIF)
  2. Prompt: taking a pen out of the book
    • Pre-Trained
    • RL-Finetuned (AIF)
  3. Prompt: taking one body spray of many similar
    • Pre-Trained
    • RL-Finetuned (AIF)
  4. Prompt: tearing receipt into two pieces
    • Pre-Trained
    • RL-Finetuned (AIF)
  5. Prompt: pushing a bottle so that it falls off the table
    • Pre-Trained
    • RL-Finetuned (AIF)