Keywords: Bias identification, Bias mitigation, Fairness, Diffusion models
TL;DR: A framework to identify minority or underrepresented attributes in the intermediate representations of diffusion models.
Abstract: Text-to-image diffusion models achieve impressive generation quality but also inherit and amplify biases from training data, resulting in biased coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering minority features that are underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce MADGen, to our knowledge the first framework for discovering minority attributes in diffusion models. Our method leverages Matryoshka Sparse Autoencoders and introduces a minority metric that integrates neuron activation frequency with semantic distinctiveness, enabling the unsupervised identification of rare attributes. Specifically, MADGen identifies a set of neurons whose behavior can be directly interpreted through their top-activating images, which correspond to underrepresented semantic attributes in the model. Quantitative and qualitative experiments demonstrate that MADGen uncovers attributes beyond fairness categories, supports systematic auditing of architectures such as Stable Diffusion 1.5, 2, and XL, and enables amplification of minority attributes during generation.
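To make the metric described in the abstract concrete, the sketch below shows one way a minority score combining activation frequency with semantic distinctiveness could be computed over SAE neurons. Everything here is an assumption for illustration: the function name `minority_scores`, the use of per-neuron embeddings as a distinctiveness proxy, and the multiplicative combination rule are hypothetical and need not match MADGen's actual formulation.

```python
import numpy as np

def minority_scores(activations: np.ndarray, embeddings: np.ndarray,
                    eps: float = 1e-8) -> np.ndarray:
    """Hypothetical minority score per SAE neuron.

    activations: (num_images, num_neurons) non-negative SAE activations
                 collected from a diffusion model's intermediate features.
    embeddings:  (num_neurons, dim) one semantic embedding per neuron,
                 e.g. a pooled embedding of its top-activating images.
    """
    # Activation frequency: fraction of images on which each neuron fires.
    freq = (activations > 0).mean(axis=0)

    # Rarity: neurons that fire on few images score high.
    rarity = 1.0 - freq

    # Semantic distinctiveness proxy: average cosine distance from each
    # neuron's embedding to all other neurons' embeddings.
    normed = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + eps)
    sim = normed @ normed.T
    distinctiveness = 1.0 - (sim.sum(axis=1) - 1.0) / (len(sim) - 1)

    # Combine the two signals; the paper's actual integration may differ.
    return rarity * distinctiveness

# Usage sketch with synthetic data: score neurons, then inspect the
# top-activating images of the highest-scoring ones to interpret the
# rare attributes they encode.
acts = np.abs(np.random.randn(1000, 256)) * (np.random.rand(1000, 256) > 0.9)
embs = np.random.randn(256, 512)
top_neurons = np.argsort(minority_scores(acts, embs))[::-1][:10]
```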
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5074