TL;DR: XYZFlow introduces intensive scaling via multidimensional conditioning and Next Shortcut Prediction, achieving a 7-8× speedup over teacher models without quality loss by making probability paths straighter and more deterministic.
Abstract: High-fidelity image generation faces a trade-off between speed and quality. Diffusion models produce strong visuals but require costly iterative sampling. Existing efficient methods mainly distill pretrained models into few-step samplers, a challenging process that depends heavily on teacher-model quality. In this paper, we introduce XYZFlow, a framework that rethinks efficient generation through multidimensional scaling of flow matching. Rather than learning a single-step mapping, XYZFlow enhances expressivity by making probability paths more identifiable and learnable through structured multidimensional conditioning. We view autoregressive modeling as implicit flow straightening, in which richer context reduces trajectory ambiguity. XYZFlow realizes this idea along two orthogonal dimensions: temporal scaling, which uses non-Markovian conditioning on the full denoising history; and spatial scaling, enabled by Next Shortcut Prediction, which generates patches sequentially, using the denoising trajectories of preceding patches as priors. Experiments show that XYZFlow achieves state-of-the-art performance, with 7.2-8.5× speedups over its teacher models and competitive FID, while Next Shortcut Prediction delivers better quality-latency trade-offs than model scaling or step reduction.
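The link between straighter probability paths and fewer sampling steps can be illustrated with a toy example. The sketch below is not XYZFlow's model or sampler; the 2-D endpoints, the two hand-written velocity fields, and all function names are assumptions made purely for illustration. It integrates an ODE from t=0 to t=1 with fixed-step Euler and compares a straight (constant-velocity) path against a curved (quarter-circle) path between the same endpoints.

```python
import math

# Hypothetical 2-D endpoints: a "noise" sample and a "data" sample.
X_START = (1.0, 0.0)
X_END = (0.0, 1.0)


def euler_integrate(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_velocity(x, t):
    # Constant velocity along the straight line from X_START to X_END.
    return [e - s for s, e in zip(X_START, X_END)]


def curved_velocity(x, t):
    # Rotation about the origin: carries (1, 0) to (0, 1) along a quarter circle.
    w = math.pi / 2
    return [-w * x[1], w * x[0]]


def endpoint_error(velocity, num_steps):
    """Euclidean distance between the Euler endpoint and the true endpoint."""
    x = euler_integrate(velocity, X_START, num_steps)
    return math.hypot(x[0] - X_END[0], x[1] - X_END[1])


if __name__ == "__main__":
    for steps in (1, 8, 64):
        print(
            steps,
            endpoint_error(straight_velocity, steps),
            endpoint_error(curved_velocity, steps),
        )
```

On the straight path a single Euler step already lands on the target, whereas the curved path still carries visible error after several steps. This is the general intuition behind treating path straightness as a route to few-step sampling; how XYZFlow's conditioning achieves it is described in the paper itself.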
Lay Summary: Creating realistic images with AI usually forces a tough trade-off: you either wait seconds for high-quality results, or get fast outputs that look blurry and unnatural. Most existing speed-up tricks work by “distilling” knowledge from a slow, high-quality teacher model — but this process is finicky, and the final speed still depends heavily on how good the original teacher is.
We built XYZFlow, a new framework that sidesteps this problem by making the image generation process inherently easier to run fast. Instead of just tweaking training recipes, we add two extra sources of helpful context to the model: first, we let each part of the image “remember” its own full denoising history as it’s being generated, so it doesn’t have to redo redundant work. Second, we make each new patch of the image build on the entire generation path of all previously created patches, not just their final appearance — like giving the model a clear roadmap instead of making it guess from scratch.
Tests on standard image benchmarks show XYZFlow is 7–36 times faster than its teacher models, while matching or beating their quality. It even outperforms much bigger, more complex competitors using far fewer parameters. The approach works reliably even with weaker starting teachers, offering a simpler path to fast, high-quality AI image generation.
Primary Area: Applications->Computer Vision
Keywords: Image Generation
Submission Number: 31669