Keywords: pixel diffusion model
TL;DR: pixel diffusion transformer with neural field decoder
Abstract: The current success of diffusion transformers is built on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To avoid these problems, researchers have returned to pixel-space modeling, but at the cost of complicated cascade pipelines and increased token complexity.
Motivated by the simple yet effective diffusion transformer architectures in latent space, we propose to model pixel-space diffusion with a large-patch diffusion transformer and to employ neural fields to decode these large patches, yielding a single-stage, streamlined, end-to-end solution, which we coin the pixel neural field diffusion transformer (**PixNerd**). Thanks to the efficient neural field representation in PixNerd, we achieve **1.93 FID** on ImageNet 256x256 and nearly **8x lower latency**, without any complex cascade pipeline or VAE. We also extend the PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark.
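To make the decoding step concrete, here is a minimal sketch of the general idea of a neural-field patch decoder: a small coordinate MLP, conditioned on each large patch's transformer feature, is evaluated at every pixel location inside that patch. This is only an illustration of the technique named in the abstract, not PixNerd's actual decoder; all class names, layer choices, and hyperparameters below are hypothetical.

```python
import torch
import torch.nn as nn

class NeuralFieldPatchDecoder(nn.Module):
    """Hypothetical sketch: decode one large patch by evaluating a small
    coordinate MLP, conditioned on that patch's transformer feature, at
    every pixel inside the patch (the generic neural-field idea; not
    PixNerd's actual decoder)."""

    def __init__(self, feat_dim=768, hidden=256, patch_size=16):
        super().__init__()
        self.patch_size = patch_size
        # Map the per-patch transformer feature to a conditioning vector.
        self.cond = nn.Linear(feat_dim, hidden)
        # Coordinate MLP: (x, y) in [-1, 1]^2 plus conditioning -> RGB.
        self.mlp = nn.Sequential(
            nn.Linear(2 + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, patch_feats):
        # patch_feats: (B, N, feat_dim), one token per large patch.
        B, N, _ = patch_feats.shape
        p = self.patch_size
        # Normalized pixel coordinates within a patch, shape (p*p, 2).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, p), torch.linspace(-1, 1, p),
            indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2).to(patch_feats)
        cond = self.cond(patch_feats)                      # (B, N, hidden)
        cond = cond[:, :, None, :].expand(B, N, p * p, -1)
        coords = coords[None, None].expand(B, N, p * p, 2)
        rgb = self.mlp(torch.cat([coords, cond], dim=-1))  # (B, N, p*p, 3)
        return rgb  # caller reassembles the N patches into the full image
```

Under this (assumed) design, the transformer only ever handles one token per large patch, which is consistent with the abstract's claim of reduced token complexity and lower latency relative to fine-grained pixel-space modeling.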
Primary Area: generative models
Submission Number: 3941