<!doctype html>
<html lang="en">
    <head>
        <title>PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning</title>
        <link rel="icon" type="image/x-icon" href="static/img_my/icons/gravity.png">

        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1">

        <!-- Open Graph -->
        <meta property="og:image" content="static/img_my/teaser.png" />
        <meta property="og:title" content="PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning" />
        <meta property="og:description" content="PhysMaster captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness." />

        <!-- Twitter -->
        <meta name="twitter:card" content="summary_large_image" />
        <meta name="twitter:image" content="static/img_my/teaser.png" />
        <meta name="twitter:title" content="PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning" />
        <meta name="twitter:description" content="PhysMaster captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness." />

        <script src="./static/js/distill_template.v2.js"></script>
        <script src="https://cdnjs.cloudflare.com/polyfill/v3/polyfill.min.js?features=es6"></script>
        <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>

        <script src="https://d3js.org/d3.v5.min.js"></script>
        <script src="https://d3js.org/d3-collection.v1.min.js"></script>
        <script src="https://rawgit.com/nstrayer/slid3r/master/dist/slid3r.js"></script>

        <script defer="" src="./static/js/hider.js"></script>
        <script src="./static/js/image_interact.js"></script>
        <script src="./static/js/switch_videos.js"></script>

        <link rel="stylesheet" href="./static/css/style.css">
        <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
        <link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">

        <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.10.2/dist/katex.min.css" integrity="sha384-yFRtMMDnQtDRO8rLpMIKrtPCD5jdktao2TV19YiZYWMDkUR5GQZR/NOVTdquEx1j" crossorigin="anonymous">
        <script defer src="https://cdn.jsdelivr.net/npm/katex@0.10.2/dist/katex.min.js" integrity="sha384-9Nhn55MVVN0/4OFx7EE5kpFBPsEMZxKTCnA+4fqDmg12eCTqGi6+BB2LjY8brQxJ" crossorigin="anonymous"></script>
        <script defer src="https://cdn.jsdelivr.net/npm/katex@0.10.2/dist/contrib/auto-render.min.js" integrity="sha384-kWPLUVMOks5AQFrykwIup5lo0m3iMkkHrD0uJ4H5cjeGihAutqP0yW0J6dpFiVkI" crossorigin="anonymous"
            onload="renderMathInElement(document.body);"></script>
        <script defer src="./static/js/fontawesome.all.min.js"></script>


        <!-- medium zoom https://github.com/francoischalifour/medium-zoom -->
        <script src="https://cdn.jsdelivr.net/npm/jquery@3.7.1/dist/jquery.min.js"></script>  <!-- jquery -->
        <script defer src="./static/js/medium-zoom.min.js"></script>
        <script defer src="./static/js/zoom.js"></script>
    </head>
    <body>
        <div class="header-wrapper">
            <div class="header-container" id="header-container">
                <div class="header-content">
                    <h1 style="margin-top: 0px"><i>PhysMaster</i></h1>
                    <h2>Mastering Physical Representation<i><br>
                        for Video Generation</i><br>
                        via Reinforcement Learning</h2>
                        <p>
                            We propose PhysMaster, which captures physical knowledge as a <em><strong>representation</strong></em> for
                            guiding video generation models to enhance their physics-awareness.
                        </p>

                        <div class="icon-container">
                            <div class="icon-item">
                                <img src="static/img_my/icons/物理世界.svg" alt="Physical Representation Injection Icon">
                                <div><strong>Physical Representation Injection</strong>: Building on the image-to-video task, we devise PhysEncoder, which encodes physical knowledge from the input image as an extra condition injected into the video generation process.</div>
                            </div>
                            <div class="icon-item">
                                <img src="static/img_my/icons/天平,公平,天平秤.svg" alt="Representation Learning by RLHF Icon">
                                <div><strong>Representation Learning by RLHF</strong>: PhysEncoder leverages generative feedback from generation models to optimize its physical representation with Direct Preference Optimization in an end-to-end manner.</div>
                            </div>
                            <div class="icon-item">
                                <img src="static/img_my/icons/模型训练.svg" alt="Training Paradigm Icon">
                                <div><strong>Training Paradigm</strong>: We improve the physics-awareness of PhysEncoder, and thus of the video generation model, with a three-stage training pipeline
                                    that generalizes effectively to diverse physical scenarios governed by different physical principles.</div>
                            </div>
                            <div class="icon-item">
                                <img src="static/img_my/icons/检测-方案.svg" alt="Generic Solution Icon">
                                <div><strong>Generic Solution</strong>: PhysMaster, which learns physical knowledge via representation learning, can act as a generic solution for physics-aware video generation and has potential for broader applications.</div>
                            </div>
                        </div>
                </div>
                <div class="header-image">
                    <img draggable="false" src="static/img_my/teaser_img.png" alt="Teaser Image" class="teaser-image">
                </div>
            </div>
        </div>
    <d-article>


        <p class="text abstract">
            Current video generation models can produce visually realistic videos but often fail to adhere to physical laws, which limits their ability to generate physically plausible videos and to serve as "world models".
            To address this issue, we propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness.
            Specifically, PhysMaster is based on the image-to-video task, where the model is expected to
            predict physically plausible dynamics from the input image. Since the input image provides physical priors such as the relative positions and potential interactions of objects in the scene, we devise PhysEncoder to encode physical information from it as an extra condition that injects physical knowledge into the
            video generation process. Because proper supervision of the model's physical performance, beyond mere appearance, is lacking, we apply reinforcement learning with human feedback to physical representation learning: PhysEncoder
            leverages feedback from generation models to optimize its physical representations with Direct Preference Optimization (DPO) in an end-to-end manner. PhysMaster provides a feasible solution for improving the physics-awareness of PhysEncoder and thus of video generation, demonstrating its ability on a simple proxy task and its generalizability to wide-ranging physical scenarios. This implies that PhysMaster, which unifies solutions for various physical processes via representation learning in the reinforcement learning paradigm, can act as a generic, plug-in solution for physics-aware video generation and broader applications.

        </p>

        <div class="icon-row">
            <a href="#training" class="icon-link">
                <img src="static/img_my/icons/流程.svg" alt="Training Pipeline Icon" class="icon">
                Training<br>Pipeline
            </a>
            <a href="#simulation" class="icon-link">
                <img src="static/img_my/icons/模拟.svg" alt="Proxy Task Results Icon" class="icon">
                Results of<br>Proxy Task
            </a>
            <a href="#real-world" class="icon-link">
                <img src="static/img_my/icons/增强现实.svg" alt="Broader Scenarios Results Icon" class="icon">
                Results of<br>Broader Scenarios
            </a>
        </div>

        <p class="click-hint" style="width: 85%;">
            <img src="static/img/icons/click.gif" style="width: 1.5rem">
            <strong>Click to jump to each section.</strong>
        </p>





        <hr>

        <div id="training" class="connector-block">
            <h1 class="text">Three-stage Training Pipeline</h1>
            <p class="text">
                We propose a three-stage training pipeline for PhysMaster that enables PhysEncoder to learn physical
                representations by leveraging generative feedback from the video generation model. The core idea is to formulate
                DPO for PhysEncoder <i>E<sub>p</sub></i> with the reward signal coming from videos generated by the pretrained DiT model <i>v<sub>θ</sub></i>, which
                helps PhysEncoder learn physical knowledge.
                <ul class="text">
                    <li>
                      <strong>Stage I: SFT for DiT and PhysEncoder.</strong> First, we condition the I2V base model on the physical
                      representation from PhysEncoder via SFT, so that PhysEncoder can be optimized in the following stages
                      using the model's performance as feedback.
                      <!-- Since PhysEncoder’s training starts from
                      the frozen DINOv2 with pretrained weights from Depth Anything [40] and trainable physical head
                      with randomly initialized weights, this stage can be viewed as adapting Depth Anything for physical
                      condition injection, thus also denoted as “Depth Baseline” in section 4.  -->
                      As shown in Figure 1, by concatenating the
                      physical embeddings extracted by PhysEncoder with the visual embeddings encoded by the VAE, we inject the
                      physical representation into the model as an extra condition.
                    </li>
                  
                    <li>
                      <strong>Stage II: DPO for DiT.</strong> Second, we adapt the output of the pretrained model toward a more
                      physically plausible distribution, paving the way for PhysEncoder to learn from generated videos
                      with higher physical accuracy. To this end, we apply LoRA to finetune the DiT model on a
                      preference dataset with DPO, during which the model learns to assign higher probability to
                      positive samples and lower probability to negative samples.
                    </li>
                  
                    <li>
                      <strong>Stage III: DPO for PhysEncoder.</strong> 
                      <!-- We propose an effective framework, PhysMaster, to leverage
                      generative feedback from the pretrained DiT model to optimize PhysEncoder’s physical representation
                      via DPO paradigm. -->
                      We leverage generative feedback from the pretrained DiT model to optimize PhysEncoder’s physical representation via the DPO paradigm.
                      <!-- As illustrated in Fig 1, our framework consists of two parts: PhysEncoder to
                      be optimized and the pretrained DiT model providing generative feedback.  -->
                      With the physical head of
                      PhysEncoder as the only trainable module, Stage III shares the same training objective as
                      Stage II and differs solely in the learnable parameters.
                      In this manner, by steering the DiT
                      model to generate more accurate physical dynamics, PhysEncoder’s original representation is
                      gradually enriched with physical knowledge through model feedback.
                    </li>
                  </ul>
            </p>
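            <p class="text">
                For reference, the standard DPO objective has the general form below. This is a sketch rather than the exact loss used here: the notation is assumed for illustration
                (<i>c</i> for the conditioning inputs, <i>x<sup>w</sup></i> and <i>x<sup>l</sup></i> for the preferred and rejected videos, <i>p<sub>θ</sub></i> for the model being trained,
                <i>p</i><sub>ref</sub> for a frozen reference copy, and <i>β</i> for a temperature), and video diffusion models typically replace the log-likelihoods with a denoising-error surrogate.
                $$
                \mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(c,\,x^{w},\,x^{l})}\left[\log \sigma\left(\beta\left(\log\frac{p_{\theta}(x^{w}\mid c)}{p_{\mathrm{ref}}(x^{w}\mid c)} - \log\frac{p_{\theta}(x^{l}\mid c)}{p_{\mathrm{ref}}(x^{l}\mid c)}\right)\right)\right]
                $$
                Under this reading, Stage II and Stage III minimize the same loss and differ only in which parameters receive gradients: the LoRA weights of the DiT model in Stage II, and the physical head of PhysEncoder in Stage III.
            </p>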
            <d-figure id="fig-vision_connector">
                <figure>
                    <img data-zoomable="" style="width: 100%; height: auto;" draggable="false" src="static/img_my/pipeline_3_stages.png" alt="Three-stage training pipeline of PhysMaster">
                    <br>
                    <figcaption>
                        <strong>Figure 1: Training pipeline of PhysMaster.</strong> 
                        <!-- Given an input image, the DiT model predicts subsequent frames conditioned on physical, visual, and text embeddings. In Stage I, by concatenating
                        physical embeddings extracted by PhysEncoder with visual embeddings encoded by VAE, we inject physical representation as extra condition to the I2V base model through SFT on both PhysEncoder and DiT model; In Stage II, we apply LoRA [7] to finetune the DiT model on preference dataset with DPO; In Stage III, we only optimize PhysEncoder ’s physical representation via feedback from generated video pairs of the model in a DPO paradigm [13]. -->
                    </figcaption>
                </figure>
            </d-figure>
        </div>

        <div id="simulation" class="sub-section">
            <h1 class="text">Results of Proxy Task</h1>
                <p class="text">
                    To validate that our training pipeline effectively improves the physical performance of the base model on the proxy task, we compare the physical accuracy of our model on "free-fall" motion with existing works and ablate different training techniques of PhysEncoder.
                    <br>
                    <br>
                    <strong style="font-size: 24px;">Comparison.</strong>
                    We compare our model with PhysGen and PISA on the real-world subset from PisaBench.
                    <br>
                </p>
                <d-figure id="fig-simple_gen_cases">
                    <figure style="text-align: center;">
                        <!-- Row 1: labels and videos -->
                        <div style="display: flex; justify-content: center; margin-bottom: 10px;">
                            <!-- Column 1 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <span style="font-size: 25px;">PhysGen</span>
                                <video poster="" id="ade" autoplay controls muted loop height="200px">
                                    <source src="videos/comp_freefall/physgen/final_video-8.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Column 2 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <span style="font-size: 25px;">PISA</span>
                                <video poster="" id="ade" autoplay controls muted loop height="200px">
                                    <source src="videos/comp_freefall/pisa/sample_0008.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Column 3 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <span style="font-size: 25px;">Ours</span>
                                <video poster="" id="ade" autoplay controls muted loop height="200px">
                                    <source src="videos/comp_freefall/ours/0008.A_white_bottle_falls.d11d88ef.seed42_512x512_cfg7.5_shift5.0 (1).mp4" type="video/mp4">
                                </video>
                            </div>
                        </div>
                
                        <!-- Row 2: videos -->
                        <div style="display: flex; justify-content: center; margin-bottom: 10px;">
                            <!-- Column 1 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="200px">
                                    <source src="videos/comp_freefall/physgen/final_video-13.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Column 2 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="200px">
                                    <source src="videos/comp_freefall/pisa/sample_0013.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Column 3 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="200px">
                                    <source src="videos/comp_freefall/ours/0013.A_black_bottle_falls.30f8a687.seed42_512x512_cfg7.5_shift5.0.mp4" type="video/mp4">
                                </video>
                            </div>
                        </div>
                
                        <!-- Row 3: videos -->
                        <div style="display: flex; justify-content: center; margin-bottom: 10px;">
                            <!-- Column 1 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="200px">
                                    <source src="videos/comp_freefall/physgen/final_video-10.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Column 2 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="200px">
                                    <source src="videos/comp_freefall/pisa/sample_0010.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Column 3 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="200px">
                                    <source src="videos/comp_freefall/ours/0010.A_black_bottle_falls.30f8a687.seed42_512x512_cfg7.5_shift5.0.mp4" type="video/mp4">
                                </video>
                            </div>
                        </div>
                        <br>
                        <!-- Caption -->
                        <figcaption>
                            <strong>Figure 2:
                            Qualitative comparison with PhysGen and PISA</strong>, which are specialized for rigid-body motion. The comparison shows the advantage of our model in shape consistency and trajectory accuracy on "free-fall".
                        </figcaption>
                    </figure>
                </d-figure>
                <p class="text">
                    <strong style="font-size: 24px;">Ablation Study.</strong>
                    We report the qualitative results from different training stages on the same subset of PisaBench.
                    <br>
                </p>
                <d-figure id="fig-simple_gen_cases">
                    <figure style="text-align: center;">
                        <div style="text-align: center; margin-bottom: 10px;">
                            <span style="font-size: 25px; margin: 0 80px;">Ours</span>
                            <span style="font-size: 25px; margin: 0 80px;">Base</span>
                            <span style="font-size: 25px; margin: 0 80px;">Ours</span>
                            <span style="font-size: 25px; margin: 0 80px;">Base</span>
                        </div>

                        <video  poster=""  id="ade" autoplay controls muted loop height="200px">
                            <source src="videos/drop/0013.A_black_bottle_falls.30f8a687.seed42_512x512_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        &nbsp;
                        <video  poster=""  id="ade" autoplay controls muted loop height="200px">
                            <source src="videos/drop/0065.A_brown_bottle_falls.0b1c8cdc.seed42_512x512_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        <video  poster=""  id="ade" autoplay controls muted loop height="200px">
                            <source src="videos/drop/0081.A_white_bottle_falls.d11d88ef.seed42_512x512_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        &nbsp;
                        <video  poster=""  id="ade" autoplay controls muted loop height="200px">
                            <source src="videos/drop/0008.A_white_bottle_falls.d11d88ef.seed42_512x512_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        <br>
                        <br>
                        <figcaption>
                            <strong>Figure 3:
                            Qualitative ablation for models in different training stages</strong> on the real-world test set
                            of "free-fall". Our three-stage training improves on the base model in preserving objects’
                            rigidity and complying with physical laws (e.g., gravitational acceleration and collision).
                        </figcaption>
                    </figure>
                </d-figure>
        </div>
            
        <div id="real-world" class="sub-section">   
            <h1 class="text">Results of Broader Scenarios</h1>
                <p class="text">
                    We apply our training pipeline to a large-scale dataset that broadly covers common physical phenomena observed in the real world to substantiate the generalizability of our method.
                    <br>
                    <strong style="font-size: 24px;">Comparison.</strong>
                    We compare with two types of video generation models: general-purpose models, including CogVideoX-5B and Wan2.1-I2V-14B, and specialized physics-focused models, represented by WISA.
                    
                </p>
                <d-figure id="fig-simple_gen_cases">
                    <figure style="text-align: center;">
                        <!-- Row 1: labels and videos -->
                        <div style="display: flex; justify-content: center; margin-bottom: 10px;">
                            <!-- Column 1 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <span style="font-size: 20px;">CogVideoX-5B</span>
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/cog/A+swing+swinging+freely+in+the+air.-0.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Column 2 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <span style="font-size: 20px;">WISA</span>
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/wisa/video_433.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Column 3 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <span style="font-size: 20px;">Wan2.1-I2V-14B</span>
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <!-- <source src="videos/comp_general/wan/212_i2v-14B_832_480_8_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250922_211608.mp4" type="video/mp4"> -->
                                    <source src="videos/comp_general/want2v/325_t2v-1.3B_832_480_1_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250925_040136.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Column 4 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <span style="font-size: 20px;">Ours</span>
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/our/0212.Dynamic_A_swing_swinging_freely_in_the_air.8a88b67b.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                                </video>
                            </div>
                        </div>
                        <div style="display: flex; justify-content: center; margin-bottom: 10px;">
                            <!-- Column 1 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/cog/Clay+pinched+with+metal+tongs.-0.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Column 2 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/wisa/video_294.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Column 3 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <!-- <source src="videos/comp_general/wan/239_i2v-14B_832_480_8_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250923_003105.mp4" type="video/mp4"> -->
                                    <source src="videos/comp_general/want2v/239_t2v-1.3B_832_480_1_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250925_040509.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Column 4 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/our/0239.Dynamic_Clay_pinched_with_metal_tongs.48a0e05e.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                                </video>
                            </div>
                        </div>
                        <div style="display: flex; justify-content: center; margin-bottom: 10px;">
                            <!-- Column 1 -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/cog/Metal+marbles+jostling+inside+a+marble+run.-0.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Second column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/wisa/video_1.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Third column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <!-- <source src="videos/comp_general/wan/272_i2v-14B_832_480_8_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250922_125025.mp4" type="video/mp4"> -->
                                    <source src="videos/comp_general/want2v/272_t2v-1.3B_832_480_1_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250925_035450.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Fourth column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/our/0272.Dynamic_Metal_marbles_jostling_inside_a_marble_run.839b32b8.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                                </video>
                            </div>
                        </div>
                        <div style="display: flex; justify-content: center; margin-bottom: 10px;">
                            <!-- First column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/cog/Skateboard+rolls+swiftly+over+the+bumpy+sidewalk.-0.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Second column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/wisa/video_71.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Third column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <!-- <source src="videos/comp_general/wan/300_i2v-14B_832_480_8_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250922_161158.mp4" type="video/mp4"> -->
                                    <source src="videos/comp_general/want2v/300_t2v-1.3B_832_480_1_1__m2v_intern_public_datasets_sim_data_codes_Wan2.1__20250925_062112.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Fourth column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/our/R4L37_0300.Dynamic_Skateboard_rolls_swiftly_over_the_bumpy_sidewalk.8f870322.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                                </video>
                            </div>
                        </div>
                        <!-- Bottom caption -->
                        <br>
                        <figcaption>
                            <strong>Figure 4:
                            Qualitative comparison with T2V and I2V models on rigid-body scenarios</strong>.
                        </figcaption>
                    </figure>
                </d-figure>
                
                <d-figure id="fig-simple_gen_cases">
                    <figure style="text-align: center;">
                        <!-- First row: labels and videos -->
                        <div style="display: flex; justify-content: center; margin-bottom: 10px;">
                            <!-- First column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <span style="font-size: 20px;">CogVideoX-5B</span>
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/cog/Bubble+experiences+a+splashy+burst+in+water.-0.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Second column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <span style="font-size: 20px;">WISA</span>
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/wisa/video_43.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Third column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <span style="font-size: 20px;">Wan2.1-I2V-14B</span>
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <!-- <source src="videos/comp_general/wan/1_i2v-14B_832_480_8_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250922_113111.mp4" type="video/mp4"> -->
                                    <source src="videos/comp_general/want2v/1_t2v-1.3B_832_480_1_8__m2v_intern_public_datasets_sim_data_image_hr_capt_20250925_023420.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Fourth column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <span style="font-size: 20px;">Ours</span>
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/our/0001.Dynamic_Bubble_experiences_a_splashy_burst_in_water.94b03a97.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                                </video>
                            </div>
                        </div>
                        <!-- Second row: videos -->
                        <div style="display: flex; justify-content: center; margin-bottom: 10px;">
                            <!-- First column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/cog/A+plastic+bottle+slides+down+a+water+slide.-0.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Second column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/wisa/video_127.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Third column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <!-- <source src="videos/comp_general/wan/106_i2v-14B_832_480_8_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250923_001310.mp4" type="video/mp4"> -->
                                    <source src="videos/comp_general/want2v/106_t2v-1.3B_832_480_1_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250925_053845 (1).mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Fourth column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/our/0106.Dynamic_A_plastic_bottle_slides_down_a_water_slide.ce6da528.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                                </video>
                            </div>
                        </div>
                        <!-- Third row: videos -->
                        <div style="display: flex; justify-content: center; margin-bottom: 10px;">
                            <!-- First column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/cog/An+apple+bobbing+in+a+bucket+of+water.-0.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Second column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/wisa/video_277.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Third column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <!-- <source src="videos/comp_general/wan/156_i2v-14B_832_480_8_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250922_143143.mp4" type="video/mp4"> -->
                                    <source src="videos/comp_general/want2v/156_t2v-1.3B_832_480_1_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250925_034438.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Fourth column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/our/0156.Dynamic_An_apple_bobbing_in_a_bucket_of_water.eeaaab5e.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                                </video>
                            </div>
                        </div>
                        <div style="display: flex; justify-content: center; margin-bottom: 10px;">
                            <!-- First column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/cog/Oil+floating+atop+crystal+clear+water.-0.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Second column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/wisa/video_319.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Third column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <!-- <source src="videos/comp_general/wan/22_i2v-14B_832_480_8_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250922_140331.mp4" type="video/mp4"> -->
                                    <source src="videos/comp_general/want2v/22_t2v-1.3B_832_480_1_1__m2v_intern_public_datasets_sim_data_image_hr_capt_20250925_034438.mp4" type="video/mp4">
                                </video>
                            </div>
                
                            <!-- Fourth column -->
                            <div style="display: flex; flex-direction: column; align-items: center; margin: 0 5px;">
                                <video poster="" id="ade" autoplay controls muted loop height="150px">
                                    <source src="videos/comp_general/our/0022.Dynamic_Oil_floating_atop_crystal_clear_water.458de83f.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                                </video>
                            </div>
                        </div>
                        <!-- Bottom caption -->
                        <br>
                        <figcaption>
                            <strong>Figure 5:
                            Qualitative comparison with T2V and I2V models on fluid-related scenarios</strong>.
                        </figcaption>
                    </figure>
                </d-figure>
                
                <p class="text">
                    <strong style="font-size: 24px;">Ablation Study.</strong>
                    We conduct an ablation analysis to verify the effectiveness of our training pipeline.
                    <br>
                </p>
                <d-figure id="fig-simple_gen_cases">
                    <figure style="text-align: center;">
                        <div style="text-align: center; margin-bottom: 10px;">
                            <span style="font-size: 20px; margin: 0 60px;">Ours (Stage III)</span>
                            <span style="font-size: 20px; margin: 0 60px;">Ours (Stage I)</span>
                            <span style="font-size: 20px; margin: 0 60px;">Ours (Stage III)</span>
                            <span style="font-size: 20px; margin: 0 60px;">Ours (Stage I)</span>
                        </div>
                        <video  poster=""  id="ade" autoplay controls muted loop width="48%">
                            <source src="videos/real/00067.48f143ea6f444760886dd863a0b1245b8476be66596ca740bccf5ef5ec8ae8fc.Reality_The_video_captures_a_serene_moment_of_tea_being.1b2eca4b.seed0_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        &nbsp;
                        <video  poster=""  id="ade" autoplay controls muted loop width="48%">
                            <source src="videos/real/00069.52d750af0927547108fb107b2e799be18d45982fd67b286f09a0e14933f829da.Reality_The_video_captures_a_serene_scene_set_in_a.e9ee12d6384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        <video  poster=""  id="ade" autoplay controls muted loop width="48%">
                            <source src="videos/real/00129.a2b6156f0c4f108fd7ae9faf7abaed237cad292f1092707bb3a03ff6c69022ef.Reality_The_video_captures_a_closeup_scene_set_against_a.fe1a1973.seed0_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        &nbsp;
                        <video  poster=""  id="ade" autoplay controls muted loop width="48%">
                            <source src="videos/real/00171.dc2cda7bfcded19e9fd4aac0682cc491593735f90b64113b32820076089fed5d.Reality_The_video_showcases_a_closeup_view_of_four_distinct.dac37404.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        <video  poster=""  id="ade" autoplay controls muted loop width="48%">
                            <source src="videos/real_new/00004.0e2ed727c666dfd524fbd0eb433946a9e90856af6a1895b27d34d8db3bc586a2.Reality_The_video_captures_a_closeup_view_of_a_glass.4a57ae2e384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        &nbsp;
                        <video  poster=""  id="ade" autoplay controls muted loop width="48%">
                            <source src="videos/real_new/00167.396c3d38f060575de728e7015cdbacad90b18b5d560eb9d5b6adafe4e90905c0.Reality_The_video_captures_a_serene_scene_where_water_is.290d1ba7384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        <br>
                        <br>
                        <figcaption>
                            <strong>Figure 6:
                            Qualitative ablation of models at different training stages</strong> on fluid-related scenarios.
                            Applying DPO after Stage I improves the physical coherence of the Stage III model.
                        </figcaption>
                    </figure>
                </d-figure>
                
                <d-figure id="fig-simple_gen_cases">
                    <figure style="text-align: center;">
                        <div style="text-align: center; margin-bottom: 10px;">
                            <span style="font-size: 20px; margin: 0 60px;">Ours (Stage III)</span>
                            <span style="font-size: 20px; margin: 0 60px;">Ours (Stage I)</span>
                            <span style="font-size: 20px; margin: 0 60px;">Ours (Stage III)</span>
                            <span style="font-size: 20px; margin: 0 60px;">Ours (Stage I)</span>
                        </div>
                        <video  poster=""  id="ade" autoplay controls muted loop width="48%">
                            <source src="videos/abla_compare/0267.Dynamic_Leather_glove_catching_a_hard_baseball.1d6cc305.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        &nbsp;
                        <video  poster=""  id="ade" autoplay controls muted loop width="48%">
                            <source src="videos/abla_compare/0288.Dynamic_Razor_shaves_skin_smooth.25becbba.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        <video  poster=""  id="ade" autoplay controls muted loop width="48%">
                            <source src="videos/abla_compare/0318.Dynamic_The_heavy_weight_compresses_the_springloaded_pad.265b4343.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        &nbsp;
                        <video  poster=""  id="ade" autoplay controls muted loop width="48%">
                            <source src="videos/abla_compare/0325.Dynamic_The_rope_supports_a_wooden_swing.677d9abe.seed42_384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        
                        <video  poster=""  id="ade" autoplay controls muted loop width="48%">
                            <source src="videos/real_new/00039.4b531cc8d07f572f5128d538ff651c40fad058dbda6f7ec55c348fc33cecaf91.Reality_The_video_showcases_the_process_of_cleaning_a_large.6a97df26384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        &nbsp;
                        <video  poster=""  id="ade" autoplay controls muted loop width="48%">
                            <source src="videos/real_new/00093.12dd56789a903010de16412dcd1a929411cde820aa8d21a6b2c4bed10c36e963.Reality_The_video_captures_the_process_of_frying_an_egg.b40f24dd384x672_cfg7.5_shift5.0.mp4" type="video/mp4">
                        </video>
                        <br>
                        <br>
                        <figcaption>
                            <strong>Figure 7:
                            Qualitative ablation of models at different training stages</strong> on rigid-body scenarios.
                            Applying DPO after Stage I improves the physical coherence of the Stage III model.
                        </figcaption>
                    </figure>
                </d-figure>
            </div>
        </div>

        <div id="conclusion" style="position: relative; margin-bottom: 0px;">
            <h2 class="text" style="margin-top:0px; margin-bottom:10px">Conclusion</h2>
            <p class="text">
                We propose PhysMaster, which learns a physical representation from the input image to guide an I2V model toward generating physically plausible videos. We optimize the physical encoder, PhysEncoder, with DPO using generative feedback from a pretrained video generation model, on both a proxy task and broader scenarios. Injecting this physical knowledge into generation improves the model's physical accuracy and generalizes across various physical processes, which suggests its potential as a generic solution for physics-aware video generation and broader applications.
            </p>
        </div>
        
        </d-article>
        <!-- bibliography will be inlined during Distill pipeline's pre-rendering -->
        <d-bibliography src="bibliography.bib"></d-bibliography>
        <script src="./static/js/nav-bar.js"></script>
    </body>
</html>
