Keywords: AI safety, AI ethics, LLM content moderation, interpretability, pluralistic alignment
Abstract: The ideal LLM content moderation system would be both structurally interpretable (so its decisions can be explained to users) and steerable (to reflect a community's values or align to safety standards). However, current systems fall short on both of these dimensions. To address this gap, we present SafetyAnalyst, a novel LLM safety moderation framework. Given a prompt, SafetyAnalyst creates a structured "harm-benefit tree," which identifies 1) the actions that could be taken if a compliant response were provided, 2) the harmful and beneficial effects of those actions (along with their likelihood, severity, and immediacy), and 3) the stakeholders that would be impacted by those effects. It then aggregates this structured representation into a harmfulness score based on a parameterized set of safety preferences, which can be transparently aligned to particular values. To demonstrate the power of this framework, we develop, test, and release a prototype system, SafetyReporter, including a pair of LMs specializing in generating harm-benefit trees through symbolic knowledge distillation and an interpretable algorithm that aggregates the harm-benefit trees into safety labels. SafetyReporter is trained on 18.5 million harm-benefit features generated by state-of-the-art LLMs on 19k prompts. On a comprehensive set of prompt safety benchmarks, we show that our system (average F1=0.75) outperforms existing LLM safety moderation systems (average F1<0.72) on prompt safety classification, while offering the additional advantages of interpretability and steerability.
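To make the framework concrete, the sketch below illustrates one possible encoding of a harm-benefit tree and a preference-weighted aggregation into a harmfulness score. It is a minimal, hypothetical illustration of the data structure and aggregation idea described in the abstract; the field names, weights, and linear scoring rule are assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of a harm-benefit tree and a preference-weighted
# aggregation into a harmfulness score. Field names, weights, and the
# scoring rule are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Effect:
    description: str
    stakeholder: str   # who is impacted by this effect
    harmful: bool      # True = harmful effect, False = beneficial effect
    likelihood: float  # 0..1, probability the effect occurs
    severity: float    # 0..1, magnitude of the effect
    immediacy: float   # 0..1, how soon the effect materializes


@dataclass
class Action:
    description: str   # action enabled by a compliant response
    effects: List[Effect] = field(default_factory=list)


@dataclass
class HarmBenefitTree:
    prompt: str
    actions: List[Action] = field(default_factory=list)


def harmfulness_score(tree: HarmBenefitTree,
                      harm_weight: float = 1.0,
                      benefit_weight: float = 1.0) -> float:
    """Aggregate a harm-benefit tree into a scalar harmfulness score.

    The weights parameterize safety preferences: raising harm_weight
    makes moderation more conservative, raising benefit_weight makes it
    more permissive. This linear rule is a stand-in for the paper's
    interpretable aggregation algorithm.
    """
    harm = benefit = 0.0
    for action in tree.actions:
        for e in action.effects:
            contribution = e.likelihood * e.severity * e.immediacy
            if e.harmful:
                harm += contribution
            else:
                benefit += contribution
    return harm_weight * harm - benefit_weight * benefit
```

A prompt could then be labeled unsafe when `harmfulness_score(tree)` exceeds a threshold chosen to reflect a particular community's safety preferences, which is what makes the aggregation step steerable.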
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10662