TRACE: Transcoder-based Concept Editing

03 Sept 2025 (modified: 11 Feb 2026); Submitted to ICLR 2026; CC BY 4.0
Keywords: transcoder, concept editing, text-to-image model, interpretability, Trustworthy GenAI, Safe AI, safe generation, image generation, image generative models, diffusion models, image autoregressive models, IARs
TL;DR: We leverage Transcoders to selectively remove any concepts from image generative models.
Abstract: Image generation with diffusion and autoregressive models can inadvertently output undesirable content, such as copyrighted characters, harmful images, unwanted objects, or protected artistic styles. Trustworthy content moderation therefore remains a major challenge: retraining a model to remove each such concept is infeasible, while existing post-hoc interventions are either easy to bypass or degrade image quality. We introduce a white-box, model-agnostic framework that uses *Transcoders* as an integrated, surgical intervention layer, allowing precise, in-place suppression of targeted concepts without retraining the generative model. Because our approach modifies the model’s backbone and not just external modules, it is robust against circumvention and preserves overall generation quality. Empirically, our method achieves new state-of-the-art results for both diffusion and autoregressive image generative models, and it remains robust against adversarial prompts and across sequential, diverse concept removal requests. Our approach thus lays a practical foundation for trustworthy image generation in real-world scenarios.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 1581