Improving Progressive Generation with Decomposable Flow Matching


 

Abstract

Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted. Such architectures increase the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media. DFM applies Flow Matching independently at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). As shown by our experiments, our approach improves visual quality for both images and videos, featuring superior results compared to prior multi-stage frameworks. On ImageNet-1K 512px, DFM achieves a 35.2% improvement in FDD score over the base architecture and 26.4% over the best-performing baseline, under the same training compute. When applied to the finetuning of large models, such as FLUX, DFM converges faster to the training distribution. Crucially, all these advantages are achieved with a single model, architectural simplicity, and minimal modifications to existing training pipelines.


Method

Our framework (DFM) progressively synthesizes images by combining multi-scale decomposition with Flow Matching. We first decompose each training sample into a Laplacian pyramid so that coarse structural information and fine details reside in separate stages. During training, we assign an independent flow timestep to every stage and train a single DiT backbone to predict all stage-wise velocities jointly. We modify DiT to use per-scale patchification and timestep-embedding layers. At inference, we denoise one stage at a time, starting from the coarsest stage and activating the next stage only after the previous one reaches a predetermined low-noise threshold. This yields continuously previewable outputs and lets earlier stages generate global structure while later stages generate high-frequency details. Altogether, the method enables high-resolution and high-quality progressive image generation through one unified model.
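As an illustration, the sketch below shows one possible form of such a training step. It is a minimal sketch under stated assumptions, not our exact implementation: the two-level average-pooling `laplacian_pyramid` helper, the linear interpolation path with t = 0 at noise and t = 1 at data, and a `model(noisy, timesteps)` interface that returns a list of per-stage velocities are all simplifying assumptions introduced for this example.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x, num_stages=2):
    """Decompose a batch of images into a Laplacian pyramid, returned coarse-to-fine."""
    levels = []
    current = x
    for _ in range(num_stages - 1):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, scale_factor=2, mode="bilinear", align_corners=False)
        levels.append(current - up)      # high-frequency residual at this scale
        current = down
    levels.append(current)               # coarsest (low-frequency) level
    return levels[::-1]                  # coarse -> fine

def dfm_training_step(model, x, num_stages=2):
    """One training step: each stage gets its own flow timestep, and a single
    backbone predicts all stage-wise Flow Matching velocities jointly."""
    stages = laplacian_pyramid(x, num_stages)
    noisy, targets, timesteps = [], [], []
    for x1 in stages:
        t = torch.rand(x1.shape[0], device=x1.device)    # independent timestep per stage
        x0 = torch.randn_like(x1)                         # Gaussian source sample
        t_ = t.view(-1, 1, 1, 1)
        noisy.append((1.0 - t_) * x0 + t_ * x1)           # linear (rectified-flow style) path
        targets.append(x1 - x0)                           # velocity target for this path
        timesteps.append(t)
    preds = model(noisy, timesteps)                       # per-stage velocity predictions
    return sum(F.mse_loss(p, v) for p, v in zip(preds, targets))
```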

Across image and video generation, DFM outperforms the best-performing baselines, achieving the same FDD as the Flow Matching baselines with roughly 2x less training compute.



Disclaimer: All images and videos on this page are compressed for web display due to size limitations. This may reduce visual fidelity compared to the original outputs.


Qualitative Results

FLUX Finetuning with DFM

We finetune FLUX-dev with DFM on an internal dataset and compare it with FLUX-dev under standard full finetuning for the same number of training steps. DFM converges faster to the training distribution and achieves better structural and visual quality.

Image Generation: ImageNet-1K 512px

We train DFM from scratch on ImageNet-1K 512px and compare it to baselines, including Flow Matching, Pyramidal Flow, and cascaded models. All baselines use the same training compute as DFM, as well as the same architecture and training hyperparameters. Samples are fully uncurated.

Image Generation: ImageNet-1K 1024px

Such improvements are also observed on 1024px images. When compared to baselines, DFM achieves overall better structural and textural quality, with fewer artifacts. Samples are fully uncurated.

Video Generation: Kinetics-700 512px

We extend DFM to video generation by training it on the Kinetics-700 dataset for 200k steps. All baselines use the same backbone, training compute, and hyperparameters as DFM. Samples are fully uncurated.

Ablations

Sampling Timesteps

We ablate the number of sampling timesteps used in stage 1 and stage 2 of DFM. Allocating more steps to the first stage is beneficial, as it allows the model to better resolve the coarse structure; however, using too few steps in the second stage can lead to a loss of detail. For a total budget of 40 steps, using 30 steps in the first stage offers a good balance.

Threshold

With DFM, the second stage starts generating once the first stage reaches a certain threshold. Too large a threshold risks exposure bias, while too small a threshold provides only weak conditioning to the second stage. We ablate this threshold and find that a value of 0.3 offers a good balance for the 512px ImageNet-1K experiments.
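For concreteness, the sketch below shows one way such a threshold-gated sampler could be organized. It is a minimal sketch only: the `progressive_sample` name, the plain Euler update, the 30/10 step split from the previous ablation, the reading of the threshold as the coarse stage's remaining noise level (1 - t), the interleaving of the two stages after activation, and the `model` interface from the training sketch above are all assumptions, not the exact implementation.

```python
import torch

@torch.no_grad()
def progressive_sample(model, shapes, steps=(30, 10), threshold=0.3, device="cpu"):
    """Two-stage sampler sketch: the fine stage activates once the coarse stage's
    remaining noise level (1 - t) drops below `threshold`."""
    xs = [torch.randn(*s, device=device) for s in shapes]   # per-stage latents, initialized to noise
    done = [0, 0]                                            # Euler steps taken per stage
    dts = [1.0 / n for n in steps]                           # per-stage step size
    batch = shapes[0][0]

    def euler_step(stage):
        # One Euler step on a single stage; the backbone sees all stages and timesteps jointly.
        ts = [d * dt for d, dt in zip(done, dts)]            # current flow time per stage (0 = noise, 1 = data)
        t_in = [torch.full((batch,), t, device=device) for t in ts]
        vs = model(xs, t_in)
        xs[stage] = xs[stage] + dts[stage] * vs[stage]
        done[stage] += 1

    # Stage 1 alone, until its remaining noise level crosses the activation threshold.
    while 1.0 - done[0] * dts[0] > threshold and done[0] < steps[0]:
        euler_step(0)
    # After activation, both stages are refined until each exhausts its step budget
    # (the exact interleaving after activation is a simplifying assumption here).
    while done[0] < steps[0] or done[1] < steps[1]:
        if done[0] < steps[0]:
            euler_step(0)
        if done[1] < steps[1]:
            euler_step(1)
    return xs
```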

Limitations

We found that generated examples containing a high level of high-frequency detail, such as vegetation, fur, or hair, may exhibit local artifacts, with the output appearing overly smooth in those regions.

However, the allocation of sampling steps between stage 1 and stage 2 provides a mechanism to control the trade-off between structure and fine-detail generation quality. Such limitations can therefore be mitigated by allocating more sampling steps to the second stage.