Keywords: Diffusion Transformers, In-Context Image Generation, Sparse Reference Conditioning, Efficient Inference, Generative Model Diagnostics
TL;DR: ReCAST probes sparse reference use in in-context image generation and shows that foreground-aligned reference tokens enable a training-free 2.4× speedup.
Abstract: Personalized image generation in Diffusion Transformers (DiTs) increasingly relies on in-context conditioning, where a reference image is tokenized and concatenated with the denoising sequence. While effective, this design introduces a substantial computational burden: in FLUX.1 Kontext, 4,096 additional reference tokens increase the quadratic attention cost by up to 3.6×. We observe that much of this computation is redundant. Generation tokens attend overwhelmingly to a small subset of reference tokens aligned with the foreground subject, while many background tokens receive negligible attention. Based on this finding, we propose ReCAST, a training-free sparse reference conditioning method that ranks reference tokens using cross-image attention and selects a timestep-adaptive subset during denoising. On FLUX.1 Kontext, ReCAST reduces the average reference budget from 4,096 to approximately 600 tokens, achieving a 2.4× wall-clock speedup with less than 3% degradation in DINO identity score. Beyond acceleration, our results provide a compact diagnostic of how in-context DiTs use reference images for personalized generation.
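The ranking-and-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (mean attention mass received from generation queries), the linear budget schedule and its direction, and all function names below are assumptions introduced for illustration, and the toy logits stand in for real cross-image attention from a DiT layer.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def reference_scores(logits):
    """Mean attention mass each reference token receives.

    logits: one row per generation token, one column per reference token.
    (Assumed scoring rule; the paper may aggregate differently.)
    """
    n_ref = len(logits[0])
    scores = [0.0] * n_ref
    for row in logits:
        for j, p in enumerate(softmax(row)):
            scores[j] += p
    return [s / len(logits) for s in scores]


def budget_for_step(step, num_steps, k_min, k_max):
    """Hypothetical linear schedule: k_max at the first step, k_min at the last."""
    frac = step / max(num_steps - 1, 1)
    return round(k_max - frac * (k_max - k_min))


def select_reference_tokens(logits, k):
    """Keep the k highest-scoring reference tokens, returned in original order."""
    scores = reference_scores(logits)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy example: 16 reference tokens, of which indices 4..7 play the role of
# the foreground subject and receive much larger attention logits.
random.seed(0)
n_gen, n_ref = 8, 16
foreground = {4, 5, 6, 7}
logits = [
    [(4.0 if j in foreground else 0.0) + random.uniform(-0.5, 0.5)
     for j in range(n_ref)]
    for _ in range(n_gen)
]

kept = select_reference_tokens(logits, k=4)
print(kept)  # [4, 5, 6, 7]
```

In an actual denoising loop, the budget `k` would be recomputed at each step (for example with `budget_for_step`) and only the keys and values of the kept reference tokens would be passed to attention, which is where the savings in quadratic cost would come from.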
Submission Number: 56