Native-Resolution Image Synthesis

ZiDong Wang; LEI BAI; Xiangyu Yue; Wanli Ouyang; Yiyuan Zhang

Published: 18 Sept 2025, Last Modified: 29 Oct 2025. Venue: NeurIPS 2025 poster. License: CC BY 4.0

Keywords: Native-resolution, Image Synthesis, Diffusion Models

TL;DR: Native-Resolution Image Synthesis

Abstract: We introduce native-resolution image synthesis, a novel paradigm in generative modeling capable of synthesizing images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of standard fixed-resolution, square-image methods by inherently handling variable-length visual tokens, a core challenge for conventional techniques. To this end, we propose the Native-resolution diffusion Transformer (NiT), an architecture that explicitly models varying resolutions and aspect ratios within its denoising process. Unconstrained by fixed formats, NiT learns intrinsic visual distributions from images encompassing a wide range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves state-of-the-art performance on both the ImageNet-256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen in advanced Large Language Models, NiT, pretrained solely on ImageNet, demonstrates excellent zero-shot generalization performance. It successfully generates high-fidelity images at previously unseen high resolutions (e.g., 1024x1024, 1536x1536) and diverse aspect ratios (e.g., 16:9, 3:1, 4:3), as shown in Figure 1. These findings indicate the significant potential of native-resolution modeling as a bridge between visual generative modeling and advanced LLM methodologies.
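To make the "variable-length visual tokens" point concrete, the following is a minimal sketch of how the token count of a patchified image could change with resolution and aspect ratio. It is an illustration only: the function name `num_visual_tokens`, the latent downsampling factor of 8, the patch size of 2, and the 576x1024 example size are assumptions chosen for the example, not values stated in the abstract.

```python
def num_visual_tokens(height, width, downsample=8, patch=2):
    """Token count for an image patchified on a downsampled latent grid.

    `downsample` and `patch` are illustrative assumptions, not values
    taken from the paper.
    """
    stride = downsample * patch  # pixels covered by one token along each axis
    if height % stride or width % stride:
        raise ValueError(f"height and width must be multiples of {stride}")
    return (height // stride) * (width // stride)


# Square resolutions named in the abstract, plus a hypothetical 16:9 size
# given as (height, width).
sizes = [(256, 256), (512, 512), (1024, 1024), (1536, 1536), (576, 1024)]
token_counts = {size: num_visual_tokens(*size) for size in sizes}

for (h, w), n in token_counts.items():
    print(f"{h}x{w}: {n} tokens")
```

Under these assumed settings the sequence length ranges from a few hundred tokens to several thousand, which is why a model restricted to one fixed sequence length cannot cover all of these formats directly.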

Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)

Submission Number: 6022
