Keywords: Diffusion Transformers, Model Grafting, Architectural Editing, Hybrid Models, Architecture Exploration
TL;DR: We propose grafting, a simple approach to materialize new architectures by editing pretrained diffusion transformers. It enables architectural exploration under small compute budgets.
Abstract: Model architecture design requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural exploration.
Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models?
We present *grafting*, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. We study the impact of grafting on model quality using the DiT-XL/2 design. We develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local, and linear attention; and MLPs with variable-width and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38–2.64 vs. 2.27 for DiT-XL/2)
using $<2$\% of the pretraining compute. Next, we graft a text-to-image model (PixArt-$\Sigma$), achieving a 43\% speedup with a $<2$\% drop in GenEval score. Finally, we present a case study where we restructure DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting, reducing model depth by 2x and achieving better quality (FID: 2.77) than models of comparable depth.
Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring.
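As a rough illustration of the operator-level edits described in the abstract, grafting can be pictured as swapping a module inside a pretrained DiT block and briefly fitting the replacement before further training. The PyTorch sketch below is only an illustration under stated assumptions: the `block.attn` attribute name, the single-tensor operator interface, and the regression-to-teacher initialization are our assumptions, not necessarily the procedure used in the paper or the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def graft_operator(block: nn.Module, new_op: nn.Module, calib_x: torch.Tensor,
                   steps: int = 100, lr: float = 1e-4) -> nn.Module:
    """Hypothetical sketch: replace a pretrained block's attention operator with
    a new one (e.g., local or linear attention) and fit the new operator to
    mimic the original on a small calibration batch before swapping it in."""
    old_op = block.attn                      # assumed: pretrained softmax attention, (B, N, D) -> (B, N, D)
    with torch.no_grad():
        target = old_op(calib_x)             # teacher output, computed once
    opt = torch.optim.AdamW(new_op.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(new_op(calib_x), target)   # regress new operator onto the old one
        opt.zero_grad()
        loss.backward()
        opt.step()
    block.attn = new_op                      # materialize the edited (hybrid) block
    return block
```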
Code and grafted models: https://grafting.stanford.edu
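The depth-halving case study converts each pair of sequential transformer blocks into one parallel block. A minimal sketch of that restructuring is below; the assumption that each block returns a residual update for `(x, cond)` is ours, and the actual grafted models may handle residuals and conditioning differently.

```python
import torch.nn as nn

class ParallelPair(nn.Module):
    """Two formerly sequential DiT blocks evaluated in parallel; their residual
    updates are summed, so each pair contributes at the depth of a single block."""
    def __init__(self, block_a: nn.Module, block_b: nn.Module):
        super().__init__()
        self.block_a, self.block_b = block_a, block_b

    def forward(self, x, cond):
        # Assumed interface: each block returns a residual update for (x, cond).
        return x + self.block_a(x, cond) + self.block_b(x, cond)

def parallelize(blocks: nn.ModuleList) -> nn.ModuleList:
    """Fold every consecutive pair of blocks into one parallel block,
    halving the effective depth of the model."""
    assert len(blocks) % 2 == 0, "expects an even number of blocks"
    return nn.ModuleList(
        ParallelPair(blocks[i], blocks[i + 1]) for i in range(0, len(blocks), 2)
    )
```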
Submission Number: 132