Track: Technical
Keywords: AI safety, LLM content moderation, interpretability, pluralistic alignment
TL;DR: We introduce SafetyAnalyst, a novel content moderation framework and system that generates interpretable features to make explainable prompt classification decisions and can be steered to align with pluralistic preferences.
Abstract: The ideal LLM content moderation system would be both structurally interpretable (so its decisions can be explained to users) and steerable (so it can reflect a community's values or align with particular safety preferences). However, current systems fall short on both dimensions. To address this gap, we present SafetyAnalyst, a novel LLM safety moderation framework. Given a prompt, SafetyAnalyst creates a structured ``harm-benefit tree,'' which identifies 1) the actions that could be taken if a compliant response were provided, 2) the harmful and beneficial effects of those actions (along with their likelihood, severity, and immediacy), and 3) the stakeholders that would be impacted by those effects. It then aggregates this structured representation into a harmfulness score based on a parameterized set of safety preferences, which can be transparently aligned to particular values. To demonstrate the power of this framework, we develop, test, and release a prototype system for prompt safety classification, SafetyReporter, which comprises two LMs specialized in generating harm-benefit trees and an interpretable algorithm that aggregates the harm-benefit trees into safety labels. SafetyReporter is trained on 18.5 million harm-benefit features generated by SOTA LLMs on 19k prompts. On a comprehensive set of benchmarks, we show that SafetyReporter (average F1=0.75) outperforms existing LLM safety moderation systems (average F1$<$0.72) on prompt safety classification, while offering the additional benefits of interpretability and steerability.
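The aggregation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the `Effect` record, the product of likelihood, severity, and immediacy, the two weights `harm_weight` and `benefit_weight`, and the decision threshold are all assumptions chosen only to show how a parameterized, inspectable score could be computed from a flattened harm-benefit tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    """One leaf of a hypothetical harm-benefit tree (all fields are illustrative)."""
    action: str        # action enabled by a compliant response
    stakeholder: str   # who is affected
    kind: str          # "harm" or "benefit"
    likelihood: float  # assumed to lie in [0, 1]
    severity: float    # assumed to lie in [0, 1]
    immediacy: float   # assumed to lie in [0, 1]


def harmfulness_score(effects, harm_weight=1.0, benefit_weight=1.0):
    """Weighted harms minus weighted benefits (an assumed aggregation rule)."""
    score = 0.0
    for e in effects:
        magnitude = e.likelihood * e.severity * e.immediacy
        if e.kind == "harm":
            score += harm_weight * magnitude
        elif e.kind == "benefit":
            score -= benefit_weight * magnitude
        else:
            raise ValueError(f"unknown effect kind: {e.kind!r}")
    return score


def classify(effects, threshold=0.0, **weights):
    """Label a prompt 'harmful' when the score exceeds the (assumed) threshold."""
    return "harmful" if harmfulness_score(effects, **weights) > threshold else "benign"


# Toy tree for a single prompt; the numbers are made up for illustration.
tree = [
    Effect("explain lock picking", "property owners", "harm", 0.5, 0.5, 0.5),
    Effect("explain lock picking", "hobbyists", "benefit", 0.5, 0.25, 1.0),
]

default_label = classify(tree)                  # harms and benefits balance out
strict_label = classify(tree, harm_weight=2.0)  # harms weighted twice as heavily
```

In this sketch, steering amounts to changing the weights: the same tree yields a different label under a stricter preference setting, and each leaf's contribution to the score remains inspectable.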
Submission Number: 57