Keywords: Diffusion Transformers, Model Grafting, Architectural Editing, Hybrid Models, Architecture Exploration
TL;DR: We propose grafting, a simple approach to materialize new architectures by editing pretrained diffusion transformers. It enables architectural exploration under small compute budgets.
Abstract: Model architecture design requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural exploration.
Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models?
We present *grafting*, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. We study the impact of grafting on model quality using the DiT-XL/2 design. We develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local, and linear attention; and MLPs with variable-width and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38–2.64 vs. 2.27 for DiT-XL/2)
using $<2$\% of the pretraining compute. Next, we graft a text-to-image model (PixArt-$\Sigma$), achieving a 43\% speedup with a $<2$\% drop in GenEval score. Finally, we present a case study where we restructure DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting, reducing model depth by 2x and achieving better quality (FID: 2.77) than models of comparable depth.
Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring.
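As a rough illustration of the operator-level edits described in the abstract, grafting can be pictured as swapping a module inside a pretrained DiT block and briefly fitting the replacement before further training. The PyTorch sketch below is only an illustration under stated assumptions: the `block.attn` attribute name, the single-tensor operator interface, and the regression-to-teacher initialization are our assumptions, not necessarily the procedure used in the paper or the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def graft_operator(block: nn.Module, new_op: nn.Module, calib_x: torch.Tensor,
                   steps: int = 100, lr: float = 1e-4) -> nn.Module:
    """Hypothetical sketch: replace a pretrained block's attention operator with
    a new one (e.g., local or linear attention) and fit the new operator to
    mimic the original on a small calibration batch before swapping it in."""
    old_op = block.attn                      # assumed: pretrained softmax attention, (B, N, D) -> (B, N, D)
    with torch.no_grad():
        target = old_op(calib_x)             # teacher output, computed once
    opt = torch.optim.AdamW(new_op.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(new_op(calib_x), target)   # regress new operator onto the old one
        opt.zero_grad()
        loss.backward()
        opt.step()
    block.attn = new_op                      # materialize the edited (hybrid) block
    return block
```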
Code and grafted models: https://grafting.stanford.edu
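The depth-halving case study converts each pair of sequential transformer blocks into one parallel block. A minimal sketch of that restructuring is below; the assumption that each block returns a residual update for `(x, cond)` is ours, and the actual grafted models may handle residuals and conditioning differently.

```python
import torch.nn as nn

class ParallelPair(nn.Module):
    """Two formerly sequential DiT blocks evaluated in parallel; their residual
    updates are summed, so each pair contributes at the depth of a single block."""
    def __init__(self, block_a: nn.Module, block_b: nn.Module):
        super().__init__()
        self.block_a, self.block_b = block_a, block_b

    def forward(self, x, cond):
        # Assumed interface: each block returns a residual update for (x, cond).
        return x + self.block_a(x, cond) + self.block_b(x, cond)

def parallelize(blocks: nn.ModuleList) -> nn.ModuleList:
    """Fold every consecutive pair of blocks into one parallel block,
    halving the effective depth of the model."""
    assert len(blocks) % 2 == 0, "expects an even number of blocks"
    return nn.ModuleList(
        ParallelPair(blocks[i], blocks[i + 1]) for i in range(0, len(blocks), 2)
    )
```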
Submission Number: 132