<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
  <script>
    window.dataLayer = window.dataLayer || [];
    function gtag() { dataLayer.push(arguments); }
    gtag('js', new Date());

    gtag('config', 'G-WLX2Z5QLG8');
  </script>




  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>

  <script type="text/javascript">
    $(document).ready(function () {
      // localStorage key under which the last scroll offset is kept.
      var scrollKey = "my_app_name_here-quote-scroll";

      // Restore the saved scroll position, if there is one.
      var saved = localStorage.getItem(scrollKey);
      if (saved !== null) {
        $(window).scrollTop(parseFloat(saved));
      }

      // Save the scroll position whenever the user scrolls.
      $(window).on("scroll", function () {
        localStorage.setItem(scrollKey, $(window).scrollTop());
      });
    });
  </script>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>DFM</title>


  <meta name="description"
    content="Project page for &#39;Improving Progressive Generation with Decomposable Flow Matching.&#39;">
  <link rel="icon" href="./pics/wis_logo.jpg">
  <link rel="stylesheet" href="./static/css/index.css" type="text/css">
</head>





<body>

  <style>
    .navbar {
      position: fixed;
      top: 0;
      width: 100%;
      background-color: #ffffff;
      border-bottom: 1px solid #ccc;
      z-index: 1000;
      font-family: Arial, sans-serif;
      font-size: 14px;
    }

    .navbar a {
      display: inline-block;
      padding: 12px 16px;
      text-decoration: none;
      color: #333;
    }

    .navbar a:hover {
      background-color: #f2f2f2;
    }

    body {
      margin-top: 50px;
      /* Adjust based on navbar height */
    }

    .disclaimer-box {
      border: 1px solid #ccc;
      background-color: #fdfdfd;
      padding: 15px;
      margin: 30px auto;
      width: 100%;
      max-width: 960px;
      font-size: 18px;
      color: #555;
      border-left: 5px solid #e67e22;
      box-shadow: 0 0 5px rgba(0,0,0,0.05);
    }

  </style>

  <div class="navbar">
    <a href="#abstract">Abstract</a>
    <!-- <a href="#method">Method</a> -->
    <a href="#qual">Qualitative Results</a>
    <a href="#flux">FLUX</a>
    <a href="#imagenet512">ImageNet-1K 512px</a>
    <a href="#imagenet1024">ImageNet-1K 1024px</a>
    <a href="#kinetics">Kinetics 700</a>
    <a href="#ablations">Ablations</a>
    <a href="#failures">Limitations</a>
  </div>

  <table width="999" border="0" align="center" class="menu" style="margin-bottom: 8px;">
    <tbody>
      <tr>
        <p class="title">Improving Progressive Generation with Decomposable Flow Matching</p>
        <td style="font-size: 17pt;" align="center"></td>
      </tr>
    </tbody>
  </table>

  <br>


  <div class="container">
    <table width="1000" border="0" align="center">
      <tbody>
        <tr>
          <center>
            <img src="./static/resources/architecture/teaser.png" width="1000">
          </center>
        </tr>
        <tr>
        </tr>
        <tr align="center"></tr>
      </tbody>
    </table>
    &nbsp;
    <div class="abstract-method-section" id="abstract">
      <h2>Abstract</h2>
      <p>
 Generating high-dimensional visual modalities is a computationally intensive task. A common solution is
 progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive
 manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage
 architectures are rarely adopted. These architectures have increased the complexity of the overall approach,
 introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad hoc
 samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective
 framework for the progressive generation of visual media. DFM applies Flow Matching independently
 at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). As shown by our
 experiments, our approach improves visual quality for both images and videos, featuring superior results
 compared to prior multistage frameworks. On ImageNet-1K 512px, DFM achieves a 35.2% improvement in FDD score
 over the base architecture and 26.4% over the best-performing baseline, under the same training
 compute. When applied to finetuning of large models, such as FLUX, DFM converges faster
 to the training distribution. Crucially, all these advantages are achieved with a single model,
 architectural simplicity, and minimal modifications to existing training pipelines.
      </p>
    </div>

    <hr>
    <p class="section" id="method"><b>Method</b></p>
    <div class="container">
      <table align="center" width="940" border="0">
        <tbody>
          <tr>
            <td>
              <p style="margin-top: -12px; text-align: justify;">
 Our framework (DFM) progressively synthesizes images by combining multiscale decomposition with Flow
 Matching.
 We first split each training sample into a Laplacian pyramid so that coarse structural information and
 fine details are in separate stages. During training we assign an independent flow-timestep to
 every stage and train a single DiT backbone to predict all stage-wise velocities jointly. We modify DiT to use 
 per-scale patchification and timestep-embedding layers. At inference, we denoise one stage at a time,
 starting from the coarsest stage and activating the next stage only after the previous one reaches a
 predetermined low-noise threshold. This yields continuously previewable outputs and lets earlier stages generate global
 structure while later stages generate high-frequency details. Altogether, the method enables
 high-resolution and high-quality progressive image generation through one unified model.
               
              </p>
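              <p style="text-align: justify;">
 The decomposition and the independent per-stage timesteps described above can be illustrated with a small
 sketch. It is a toy under stated assumptions, not the training code behind the results on this page: it uses
 a 1D signal instead of an image, a box filter with nearest-neighbour upsampling instead of the Gaussian
 filtering of a standard Laplacian pyramid, and it assumes a linear noise-to-data path in which t=0 is pure
 noise and t=1 is clean data. All function names (<code>laplacian_split</code>, <code>noise_stage</code>,
 and so on) are hypothetical.
              </p>
<pre>
```python
import random


def downsample2(x):
    """Average adjacent pairs, halving the length (a box filter, not a Gaussian)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]


def upsample2(x):
    """Nearest-neighbour upsampling: repeat every value twice."""
    return [v for v in x for _ in range(2)]


def laplacian_split(x):
    """Two-level pyramid: a coarse stage plus the residual detail stage."""
    coarse = downsample2(x)
    residual = [a - b for a, b in zip(x, upsample2(coarse))]
    return coarse, residual


def laplacian_merge(coarse, residual):
    """Invert laplacian_split by adding the detail back to the upsampled coarse stage."""
    return [a + b for a, b in zip(upsample2(coarse), residual)]


def noise_stage(clean, t, rng):
    """Assumed linear path: t=0 is pure Gaussian noise, t=1 is clean data."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    x_t = [(1.0 - t) * n + t * c for c, n in zip(clean, noise)]
    velocity = [c - n for c, n in zip(clean, noise)]  # regression target for this stage
    return x_t, velocity


rng = random.Random(0)
signal = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]  # toy 1D "image", even length
coarse, residual = laplacian_split(signal)
reconstructed = laplacian_merge(coarse, residual)

# Each stage draws its own timestep, independently of the other.
t_coarse, t_fine = rng.random(), rng.random()
coarse_t, v_coarse = noise_stage(coarse, t_coarse, rng)
residual_t, v_fine = noise_stage(residual, t_fine, rng)

print(coarse)
print(residual)
print(reconstructed == signal)
```
</pre>
              <p style="text-align: justify;">
 In this sketch the coarse stage and the residual stage are noised separately, so a single network could
 receive both noisy stages together with their two timesteps and regress both velocity targets at once.
              </p>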
            </td>
          </tr>
          <tr>
            <td><img src="./static/resources/architecture/architecture.png" alt="" width="1000" /></td>
          </tr>

          <tr>
            <td>
              <p style="margin-top: 20px; text-align: justify;">
 Across image and video generation, DFM outperforms the best-performing baselines,
 reaching the same FDD as Flow Matching baselines with roughly half the training compute.
              </p>
            </td>
          </tr>
          <tr>
            <td align="center"><img src="./static/resources/convergance_fdd_plot.png" alt="" width="800" /></td>
          </tr>
        </tbody>
      </table>
      <br>
      <hr>

      <div class="disclaimer-box">
        <strong>Disclaimer:</strong> All images and videos on this page are compressed for web display due to size limitations. This may reduce visual fidelity compared to the original outputs.
        </div>

      <div style="margin-top: 20px; margin-bottom: 5px;">
        <p class="section" id="content"><b>Contents</b></p>
        <ul>
          <!-- <li><a href="#abstract">Abstract</a></li>
 <li><a href="#method">Method</a></li> -->
          <li><a href="#qual">Qualitative Results</a>
            <ul>
              <li><a href="#flux">FLUX Finetuning with DFM</a></li>
              <li><a href="#imagenet512">Image Generation: ImageNet-1K 512px</a></li>
              <li><a href="#imagenet1024">Image Generation: ImageNet-1K 1024px</a></li>
              <li><a href="#kinetics">Video Generation: Kinetics-700 512px</a></li>
            </ul>
          </li>
          <li><a href="#ablations">Ablations</a>
            <ul>
              <li><a href="#sampling">Sampling Timesteps</a></li>
              <li><a href="#threshold">Threshold</a></li>
            </ul>
          </li>
          <li><a href="#failures">Limitations</a></li>
        </ul>
      </div>

      <p class="section" id="qual"><b>Qualitative Results</b></p>

      <p class="section" id="flux" style="font-size: 120%"><b>FLUX Finetuning with DFM</b></p>
      <!-- <p class="section" style="font-size: 120%"><b>Text-to-Image Generation</b></p> -->
      <p style="">
 We finetune FLUX-dev with DFM on an internal dataset and compare it with FLUX-dev finetuned with standard full finetuning for the same number of training steps. 
 DFM converges faster to the training distribution and achieves better structural and visual quality.
      </p>
      <tr>
        <td><img src="./static/resources/flux/flux_qual_comparison_page_2.jpg" alt="" width="1000" /></td>
      </tr>


      <!-- <p class="section" id="dfm_vs_baselines"><b>DFM vs Baselines</b></p> -->
      <p class="section" style="font-size: 120%" id="imagenet512"><b>Image Generation: ImageNet-1K 512px</b></p>
      <p style="">
 We train DFM from scratch on ImageNet-1K 512px and compare it to baselines, including Flow Matching, Pyramidal
 Flow, and Cascaded models. All baselines use the same training compute as DFM. We also use the same
 architecture and training hyperparameters for all models. Samples are fully uncurated.
      </p>

      <tr>
        <td><img src="./static/resources/qualitative_comparison/appendix_baselines_comparison.png" alt=""
            width="1000" /></td>
      </tr>

      <p class="section" style="font-size: 120%" id="imagenet1024"><b>Image Generation: ImageNet-1K 1024px</b></p>
      <p style="">
 Such improvements are also observed on 1024px images. When compared to baselines, DFM achieves overall better
 structural and textural quality, with fewer artifacts. Samples are fully uncurated.
      </p>

      <tr>
        <td><img src="./static/resources/qualitative_comparison/appendix_baselines_comparison_1024.png" alt=""
            width="1000" /></td>
      </tr>


      <p class="section" style="font-size: 120%; margin-top: 10px; " id="kinetics"><b>Video Generation: Kinetics-700
 512px</b></p>
      <p style="">
 We extend DFM to video generation by training it on the Kinetics-700 dataset for 200k steps. All baselines use the same backbone, training
 compute, and hyperparameters as DFM. Samples are fully uncurated.
      </p>

    <tr>
      <td>
        <video width="1000" autoplay loop muted playsinline>
          <source src="./static/resources/video_mp4/video_grid_mp4_page01.mp4" type="video/mp4">
 Your browser does not support the video tag.
        </video>
      </td>
    </tr>

    <tr>
      <td>
        <video width="1000" autoplay loop muted playsinline style="margin-top: 10px;">
          <source src="./static/resources/video_mp4/video_grid_mp4_page02.mp4" type="video/mp4">
 Your browser does not support the video tag.
        </video>
      </td>
    </tr>


      <p class="section" id="ablations"><b>Ablations</b></p>
      <p class="section" style="font-size: 120%" id="sampling"><b>Sampling Timesteps</b></p>
      <p style="">
 We ablated the number of sampling steps allocated to stage 1 and stage 2 in DFM. We found that using more
 steps in the first stage is beneficial, as it allows the model to form a better coarse
 structure.
 However, using too few steps in the second stage can lead to a loss of detail. We found that, for a total of 40
 steps, using 30 steps in the first stage is a good balance.
      </p>
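      <p style="">
 As a small illustration of the allocation being ablated, the sketch below splits a fixed sampling budget
 between the two stages. The helper name <code>split_budget</code> and the fraction-based interface are
 assumptions made for this example; only the budget of 40 steps and the 30-step first stage come from the
 text above.
      </p>
<pre>
```python
def split_budget(total_steps, stage1_fraction):
    """Split a sampling budget into (stage 1 steps, stage 2 steps).

    Hypothetical helper: the fraction is rounded to whole steps and
    clamped so that each stage keeps at least one step.
    """
    stage1 = round(total_steps * stage1_fraction)
    stage1 = min(max(stage1, 1), total_steps - 1)
    return stage1, total_steps - stage1


# Sweep a few allocations for a budget of 40 steps.
for fraction in (0.25, 0.5, 0.75):
    stage1, stage2 = split_budget(40, fraction)
    print(f"fraction={fraction}: stage 1 gets {stage1} steps, stage 2 gets {stage2} steps")
```
</pre>
      <p style="">
 With a fraction of 0.75 this reproduces the 30/10 split reported above. Moving the fraction toward 1 gives
 the first stage more steps and leaves fewer for the second.
      </p>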

      <tr>
        <td><img src="./static/resources/ablations/appendix_sampling_steps.png" alt="" width="1000" /></td>
      </tr>


      <p class="section" style="font-size: 120%" id="threshold"><b>Threshold</b></p>
      <p style="">
 With DFM, the second stage starts generating once the first stage reaches a certain threshold. Too large
 a threshold risks exposure bias, while too small a threshold provides only weak conditioning
 to the second stage.
 We ablated this threshold and found that a value of 0.3 strikes a good balance for the 512px ImageNet-1K
 experiments.
      </p>
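      <p style="">
 One plausible reading of this threshold-gated schedule is sketched below. The page does not state the time
 convention or how the step counts map onto the two phases, so everything here is an assumption: t=0 is taken
 to be pure noise and t=1 clean data, stage 1 is assumed to advance alone and linearly up to the threshold,
 and both stages are then assumed to advance jointly and linearly to 1. The function name
 <code>two_stage_schedule</code> is hypothetical.
      </p>
<pre>
```python
def two_stage_schedule(stage1_steps, stage2_steps, threshold):
    """Return one (t1, t2) pair per sampler state under the assumptions above.

    Phase 1: stage 1 moves from 0 to `threshold`; stage 2 stays at 0 (pure noise).
    Phase 2: stage 1 moves from `threshold` to 1 while stage 2 moves from 0 to 1.
    """
    schedule = []
    for k in range(stage1_steps):
        schedule.append((threshold * k / stage1_steps, 0.0))
    for k in range(stage2_steps + 1):
        frac = k / stage2_steps
        schedule.append(((1.0 - frac) * threshold + frac, frac))
    return schedule


schedule = two_stage_schedule(stage1_steps=30, stage2_steps=10, threshold=0.3)

print(len(schedule) - 1)  # number of sampler steps
print(schedule[0])        # both stages start from pure noise
print(schedule[30])       # the state at which stage 2 is activated
print(schedule[-1])       # both stages end at clean data
```
</pre>
      <p style="">
 Under these assumptions, a larger threshold postpones the second stage until the first stage is closer to
 clean data, while a smaller threshold activates it while the first stage is still noisy.
      </p>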

      <tr>
        <td><img src="./static/resources/ablations/appendix_cfg_thresholds.png" alt="" width="1000" /></td>
      </tr>


      <p class="section" id="failures"><b>Limitations</b></p>
            <p style="">
 We found that generated examples containing a high level of high-frequency detail, such as vegetation, fur, hair, or other fine structures, may exhibit local artifacts, as the output appears overly smooth in such regions.
      </p>

      <div style="text-align: center; width: 100%; margin: 20px 0;">
        <img src="./static/resources/limitations/limitations.png" alt="" width="500" />
      </div>

      <p style="">
 However, the allocation of sampling steps between stage 1 and stage 2 provides a mechanism to control the trade-off between structure and fine-detail generation quality. Therefore, such limitations can be mitigated by allocating more sampling steps to the second stage.
      </p>
      
      <div style="text-align: center; width: 100%; margin: 20px 0;">
        <img src="./static/resources/limitations/limitations_2.png" alt="" width="500" />
      </div>
    </div>
  </div>


  

</body>

</html>