ReCAST: Probing Sparse Reference Use in In-Context Image Generation

YeonGyu Han; Junah Jung; Dongheon Lee

Published: 26 May 2026, Last Modified: 12 Jun 2026 | ICML 2026 FoGen Workshop Poster | CC BY 4.0

Keywords: Diffusion Transformers, In-Context Image Generation, Sparse Reference Conditioning, Efficient Inference, Generative Model Diagnostics

TL;DR: ReCAST probes sparse reference use in in-context image generation and shows that foreground-aligned reference tokens enable a training-free 2.4× speedup.

Abstract: Personalized image generation in Diffusion Transformers (DiTs) increasingly relies on in-context conditioning, where a reference image is tokenized and concatenated with the denoising sequence. While effective, this design introduces a substantial computational burden: in FLUX.1 Kontext, 4,096 additional reference tokens increase the quadratic attention cost by up to 3.6×. We observe that this cost is largely redundant. Generation tokens attend overwhelmingly to a small subset of reference tokens aligned with the foreground subject, while many background tokens receive negligible attention. Based on this finding, we propose ReCAST, a training-free sparse reference conditioning method that ranks reference tokens using cross-image attention and selects a timestep-adaptive subset during denoising. On FLUX.1 Kontext, ReCAST reduces the average reference budget from 4,096 to approximately 600 tokens, achieving a 2.4× wall-clock speedup with less than 3% degradation in DINO identity score. Beyond acceleration, our results provide a compact diagnostic of how in-context DiTs use reference images for personalized generation.
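The two quantitative ideas in the abstract, the quadratic cost of the extra reference tokens and attention-based selection of a reference subset, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The token counts used in the usage note (4,096 generation tokens and 512 text tokens), the summed-attention ranking score, and the linear budget schedule in `budget_at_step` are all assumptions made for illustration; the abstract states only that tokens are ranked with cross-image attention and that the subset is timestep-adaptive.

```python
def attention_cost_ratio(n_gen, n_text, n_ref):
    """Ratio of quadratic attention cost with vs. without reference tokens."""
    base = n_gen + n_text
    return ((base + n_ref) / base) ** 2


def rank_reference_tokens(attn):
    """Rank reference tokens by the total attention they receive.

    attn[i][j] is the attention weight from generation token i to
    reference token j. Summing over generation tokens is an assumed
    scoring rule, not one confirmed by the source.
    """
    n_ref = len(attn[0])
    mass = [sum(row[j] for row in attn) for j in range(n_ref)]
    return sorted(range(n_ref), key=lambda j: mass[j], reverse=True)


def budget_at_step(step, num_steps, k_start, k_end):
    """Hypothetical schedule: linearly interpolate the token budget
    from k_start (first step) to k_end (last step)."""
    if num_steps <= 1:
        return k_start
    frac = step / (num_steps - 1)
    return round(k_start + frac * (k_end - k_start))


def select_reference_tokens(attn, k):
    """Keep the k highest-ranked reference tokens, in original order."""
    ranked = rank_reference_tokens(attn)
    return sorted(ranked[:k])
```

Under the assumed counts, `attention_cost_ratio(4096, 512, 4096)` evaluates to about 3.57, in line with the "up to 3.6×" figure quoted in the abstract. In a denoising loop, `budget_at_step` would give the value of `k` passed to `select_reference_tokens` at each step; the actual schedule and its direction are not specified in the abstract.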

Submission Number: 56
