ABCDE: Agentic-Based Controlled Dynamic Erasure for Intent-Aware Safety Reasoning

TMLR Paper7123 Authors

23 Jan 2026 (modified: 08 Feb 2026), Under review for TMLR, CC BY 4.0
Abstract: Concept erasure has emerged as a central mechanism for safety alignment in text-conditioned generative models, yet most existing approaches implicitly adopt an unconditional suppression paradigm in which target concepts are removed whenever they appear, regardless of contextual intent. This formulation conflates benign and harmful concept usage, leading to systematic over-suppression that unnecessarily censors policy-compliant content and degrades model utility. We argue that safety intervention should instead be framed as a decision problem grounded in contextual language understanding, rather than as a purely mechanistic removal operation. Based on this perspective, we introduce Intent-Aware Concept Erasure (ICE), a decision-centric formulation that explicitly separates the question of whether a concept should be suppressed from how suppression is realized, enabling context-sensitive intervention policies that preserve benign usage while maintaining safety guarantees. To operationalize this formulation, we present ABCDE, an agentic framework that infers stable intervention decisions from semantic context and realizes them through minimal prompt rewriting with closed-loop output feedback. Experiments on a paired benchmark designed to isolate contextual intent demonstrate that ABCDE substantially reduces unnecessary interventions while preserving strong safety effectiveness, outperforming unconditional concept erasure baselines.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~bo_han2
Submission Number: 7123