LatentCompass: T2I Diffusion Steering via Orthogonal Attribute Spaces for Debiasing, Concept Erasure, and Red Teaming
Keywords: Feature Geometry, Methods (probing, steering, causal interventions), Interpretability for AI Safety
TL;DR: We construct a nonlinear, closed-form, mutually orthogonal attribute space for steering text-to-image diffusion models.
Abstract: Text-to-image (T2I) diffusion models suffer from biased results stemming from entangled generative priors and a lack of accurate control over outputs. Current mitigation attempts rely on imprecise, adversarially vulnerable prompt and text-embedding interventions, or they require prohibitive and invasive fine-tuning. Further, text-based methods can only control descriptive attributes, i.e., what an image depicts, but not evaluative attributes, i.e., how it is perceived by an external judge. We propose LatentCompass, an exemplar-based approach that enables disentangled and controllable generation for both descriptive and evaluative concepts in a training-free manner. LatentCompass steers the generative trajectory by (a) constructing a nonlinear, low-dimensional, and orthogonal attribute space via a closed-form solution that explicitly isolates desired concepts, (b) computing an optimal shift in the constructed space, and (c) reflecting the corresponding shift in the T2I latent space. Extensive evaluations demonstrate that LatentCompass effectively (i) mitigates generative stereotypes by 100%, (ii) reduces unsafe concept generation by 58%, (iii) enhances aesthetic quality by 27% on average, (iv) boosts red-teaming success rates against deepfake detectors by up to 47%, and (v) enables high-fidelity style and face attribute editing without attribute leakage.
Submission Number: 491
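Illustrative sketch: the three steps named in the abstract (construct an orthogonal attribute space, compute a shift in it, reflect the shift back into the latent space) can be pictured with a deliberately simplified linear stand-in. This is not the paper's nonlinear closed-form construction. The function names `orthogonal_attribute_basis` and `steer`, the use of mean differences between exemplar sets, the QR orthogonalization, and the fixed shift amount are all assumptions made for illustration only.

```python
import numpy as np

def orthogonal_attribute_basis(pos_sets, neg_sets):
    """Return a (d, k) matrix with orthonormal columns, one per attribute.

    Assumption: each attribute is summarized by the mean difference between
    exemplar latents that show it (pos) and exemplar latents that do not (neg).
    """
    directions = np.stack(
        [p.mean(axis=0) - n.mean(axis=0) for p, n in zip(pos_sets, neg_sets)],
        axis=1,
    )
    # Reduced QR yields orthonormal columns spanning the same subspace.
    q, _ = np.linalg.qr(directions)
    return q

def steer(latent, basis, target_index, amount):
    """Shift one attribute coordinate and map the shift back to latent space."""
    delta = np.zeros(basis.shape[1])
    delta[target_index] = amount  # hypothetical fixed shift, not an optimized one
    return latent + basis @ delta

# Synthetic stand-ins for exemplar latents: k attributes, n exemplars, d dims.
rng = np.random.default_rng(0)
d, k, n = 16, 3, 32
pos_sets = [rng.normal(size=(n, d)) + rng.normal(size=d) for _ in range(k)]
neg_sets = [rng.normal(size=(n, d)) for _ in range(k)]

basis = orthogonal_attribute_basis(pos_sets, neg_sets)
latent = rng.normal(size=d)
steered = steer(latent, basis, target_index=1, amount=2.0)

# Coordinates in the attribute space before and after steering.
coords_before = basis.T @ latent
coords_after = basis.T @ steered
```

Because the columns of `basis` are orthonormal, only the targeted coordinate changes between `coords_before` and `coords_after`. That is the linear analogue of the "without attribute leakage" property the abstract claims for the actual method.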