Towards Test-Time Refusals via Concept Negation

Peiran Dong; Song Guo; Junxiao Wang; Bingjie WANG; Jiewei Zhang; Ziming Liu

Published: 21 Sept 2023, Last Modified: 14 Jan 2024, NeurIPS 2023 poster

Keywords: Diffusion models, test-time refusal, concept negation, safety in generative models

TL;DR: Existing concept negation methods are insufficient when it comes to handling interdependent concepts, which are essential for confining the output space of generative diffusion models that produce unbounded outputs.

Abstract: Generative models produce unbounded outputs, necessitating refusal techniques to confine their output space. Generative refusals are crucial for upholding the ethical and copyright integrity of synthesized content, particularly when working with widely adopted diffusion models. "Concept negation" is a promising paradigm for achieving generative refusals, as it defines and governs the model's output space in terms of concepts, using natural language interfaces that are readily comprehensible to humans. However, despite the valuable contributions of prior research, concept negation still suffers from significant limitations. Existing concept negation methods, which operate by composing score or noise predictions from the diffusion process, are limited to independent concepts (e.g., "a blonde girl" without "glasses") and fail to consider the interconnected nature of concepts in reality (e.g., "Mickey Mouse eats ice cream" without "Disney characters"). With these limitations in mind, we propose a novel framework, called $ProtoRe$, to improve the flexibility of concept negation via test-time negative concept identification along with purification in the feature space. $ProtoRe$ works by incorporating CLIP's language-contrastive knowledge to identify the prototype of negative concepts, extracting the negative features from outputs using the prototype as a prompt, and further refining the attention maps by retrieving negative features. Our evaluation on multiple benchmarks shows that $ProtoRe$ outperforms state-of-the-art methods under various settings, in terms of both the effectiveness of purification and the fidelity of generated images.
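
As a rough illustration of the "composition of score or noise predictions" that the abstract attributes to existing methods, the following is a minimal sketch and not the $ProtoRe$ method. It assumes a classifier-free-guidance style rule in which the negative concept's guidance direction is subtracted from the positive one. The function name `compose_noise`, the weights `w_pos` and `w_neg`, and the use of flat lists in place of noise tensors are all hypothetical choices made for this example.

```python
from typing import List


def compose_noise(
    eps_uncond: List[float],
    eps_pos: List[float],
    eps_neg: List[float],
    w_pos: float = 7.5,
    w_neg: float = 7.5,
) -> List[float]:
    """Compose noise predictions for one denoising step (illustrative only).

    Assumed rule, element by element:
        eps = eps_uncond + w_pos * (eps_pos - eps_uncond)
                         - w_neg * (eps_neg - eps_uncond)
    The three inputs stand in for a model's unconditional, positive-prompt,
    and negative-prompt noise predictions.
    """
    if not (len(eps_uncond) == len(eps_pos) == len(eps_neg)):
        raise ValueError("all noise predictions must have the same length")
    return [
        u + w_pos * (p - u) - w_neg * (n - u)
        for u, p, n in zip(eps_uncond, eps_pos, eps_neg)
    ]


# Independent concepts: the negative direction differs from the positive one,
# so subtracting it leaves part of the positive guidance intact.
independent = compose_noise([0.0, 0.0], [1.0, 2.0], [1.0, 0.0], w_pos=1.0, w_neg=1.0)

# Fully overlapping concepts: the negative direction equals the positive one,
# so with equal weights the guidance cancels and only eps_uncond remains.
overlapping = compose_noise([0.5], [1.5], [1.5], w_pos=2.0, w_neg=2.0)
```

Under these assumptions, the second call shows one way a simple composition rule can break down when the negated concept is entangled with the prompt: removing the negative direction also removes the content the prompt asked for. This is an interpretation of the limitation described in the abstract, not a result from the paper.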

Supplementary Material: pdf

Submission Number: 12076
