<!DOCTYPE html>
<html lang="en">
    <body>
        <section class="box">
            <h1>Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback</h1>
            <p class="text">This page contains supplemental material, including example videos generated by text-to-video diffusion models. README.md and README.html have the same contents.</p>
            <h2>RL-Finetuning with AI Feedback</h2>
            <p class="text">We investigate a recipe for improving dynamic object interactions in text-to-video models by leveraging external feedback. We first generate videos from the pre-trained models, then label the generated videos with AI feedback and rewards. For the choice of feedback, we test metric-based feedback on semantics, human preference, and dynamics, and also propose leveraging binary feedback obtained from large-scale VLMs capable of video understanding (such as Gemini and GPT). The labeled data are then used for offline and iterative RL-finetuning.</p>
            <video src="./asset/RLAIF_VDM_241130-2.mp4" controls="true" width="800"></video>
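            <p class="text">The labeling step described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used for the videos on this page: <code>LabeledVideo</code>, <code>label_videos</code>, <code>fake_generate</code>, and <code>fake_feedback</code> are hypothetical names, and the two stand-in functions replace the text-to-video model and the VLM so that the sketch runs without any model or API access.</p>
<pre><code class="language-python">from dataclasses import dataclass
from typing import Callable


@dataclass
class LabeledVideo:
    prompt: str
    video: str     # placeholder for a video path or tensor
    reward: float  # 1.0 if the feedback function accepts the video, else 0.0


def label_videos(
    prompts: list[str],
    generate: Callable[[str, int], str],
    feedback: Callable[[str, str], bool],
    samples_per_prompt: int = 2,
) -> list[LabeledVideo]:
    """Generate videos for each prompt and attach a binary reward to each."""
    dataset = []
    for prompt in prompts:
        for seed in range(samples_per_prompt):
            video = generate(prompt, seed)
            accepted = feedback(prompt, video)
            dataset.append(LabeledVideo(prompt, video, 1.0 if accepted else 0.0))
    return dataset


# Hypothetical stand-in for sampling from a pre-trained text-to-video model.
def fake_generate(prompt: str, seed: int) -> str:
    return f"{prompt.replace(' ', '_')}_{seed}.mp4"


# Hypothetical stand-in for a VLM answering "does the video match the prompt?".
def fake_feedback(prompt: str, video: str) -> bool:
    # Assumption for illustration only: accept the seed-0 sample, reject the rest.
    return video.endswith("_0.mp4")


dataset = label_videos(["taking rose bud from bush"], fake_generate, fake_feedback)
for item in dataset:
    print(item.video, item.reward)
</code></pre>
            <p class="text">In this sketch, the resulting list of (prompt, video, reward) records plays the role of the labeled data used for offline finetuning; an iterative variant would repeat the same loop with the finetuned model as the generator.</p>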
        </section>
        <section class="box">

            <h2>Example Videos</h2>

            <ol>
                <li>Prompt: <strong>taking rose bud from bush</strong>
                    <ul>
                        <li><strong>Pre-Trained</strong><br>
                            <video src="./asset/rose_pt.mp4" controls="true"></video>
                        </li>
                        <li><strong>RL-Finetuned (AIF)</strong><br>
                            <video src="./asset/rose_aif.mp4" controls="true"></video>
                        </li>
                    </ul>
                <li>Prompt: <strong>taking a pen out of the book</strong>
                    <ul>
                        <li><strong>Pre-Trained</strong><br>
                            <video src="./asset/taking_a_pen_out_of_the_book_pt.mp4" controls="true"></video>
                        </li>
                        <li><strong>RL-Finetuned (AIF)</strong><br>
                            <video src="./asset/taking_a_pen_out_of_the_book_aif.mp4" controls="true"></video>
                        </li>
                    </ul>
                <li>Prompt: <strong>taking one body spray of many similar</strong>
                    <ul>
                        <li><strong>Pre-Trained</strong><br>
                            <video src="./asset/spray_pt.mp4" controls="true"></video>
                        </li>
                        <li><strong>RL-Finetuned (AIF)</strong><br>
                            <video src="./asset/spray_aif.mp4" controls="true"></video>
                        </li>
                    </ul>
                <li>Prompt: <strong>tearing receipt into two pieces</strong>
                    <ul>
                        <li><strong>Pre-Trained</strong><br>
                            <video src="./asset/tearing_receipt_into_two_pieces_pt.mp4" controls="true"></video>
                        </li>
                        <li><strong>RL-Finetuned (AIF)</strong><br>
                            <video src="./asset/tearing_receipt_into_two_pieces_aif.mp4" controls="true"></video>
                        </li>
                    </ul>
                <li>Prompt: <strong>pushing a bottle so that it falls off the table</strong>
                    <ul>
                        <li><strong>Pre-Trained</strong><br>
                            <video src="./asset/pushing_a_bottle_so_that_it_falls_off_the_table_pt.mp4" controls="true"></video>
                        </li>
                        <li><strong>RL-Finetuned (AIF)</strong><br>
                            <video src="./asset/pushing_a_bottle_so_that_it_falls_off_the_table_aif.mp4" controls="true"></video>
                        </li>
                    </ul>
            </ol>
        </section>
    </body>
</html>
