TL;DR: We propose a novel framework for interpretable safety moderation of AI behavior.
Abstract: The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align with safety standards and reflect a community's values), properties that current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates the harmful and beneficial *actions* and *effects* the AI behavior may lead to, along with *likelihood*, *severity*, and *immediacy* labels that describe potential impacts on *stakeholders*. SafetyAnalyst then aggregates all effects into a harmfulness score using 28 fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this framework to develop an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On comprehensive benchmarks, we show that SafetyAnalyst (average F1 = 0.81) outperforms existing moderation systems (average F1 < 0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.
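To make the aggregation step concrete, below is a minimal, hypothetical Python sketch of how leaf effects of a harm-benefit tree might be combined into a harmfulness score via interpretable, adjustable weights. The field names, label scales, numeric mappings, and weight keys here are illustrative assumptions for exposition only; they are not the paper's actual feature taxonomy or its 28 learned/aligned parameters.

```python
from dataclasses import dataclass

# Hypothetical ordinal scales for the likelihood / severity / immediacy labels.
# The paper's actual label sets and their numeric mappings may differ.
LIKELIHOOD = {"low": 0.25, "medium": 0.5, "high": 1.0}
SEVERITY = {"minor": 1.0, "significant": 2.0, "substantial": 3.0}
IMMEDIACY = {"delayed": 0.5, "immediate": 1.0}


@dataclass
class Effect:
    """One leaf of a harm-benefit tree (illustrative fields)."""
    stakeholder: str   # who is impacted, e.g. "user" or "third parties"
    category: str      # harm/benefit category, used to look up a weight
    likelihood: str
    severity: str
    immediacy: str
    is_harm: bool      # True for harmful effects, False for beneficial ones


def harmfulness_score(effects: list[Effect], weights: dict[str, float]) -> float:
    """Aggregate all effects into a scalar harmfulness score.

    Each effect contributes likelihood * severity * immediacy, scaled by an
    interpretable per-category weight; beneficial effects enter with a
    negative sign, so benefits offset harms.
    """
    score = 0.0
    for e in effects:
        contribution = (
            weights.get(e.category, 1.0)
            * LIKELIHOOD[e.likelihood]
            * SEVERITY[e.severity]
            * IMMEDIACY[e.immediacy]
        )
        score += contribution if e.is_harm else -contribution
    return score


# Example: steerability amounts to adjusting the per-category weights.
effects = [
    Effect("third parties", "physical_harm", "high", "substantial", "immediate", True),
    Effect("user", "informational_benefit", "medium", "minor", "immediate", False),
]
default_weights = {"physical_harm": 1.0, "informational_benefit": 1.0}
print(harmfulness_score(effects, default_weights))
```

Under this kind of scheme, a community that tolerates more informational risk could down-weight the corresponding harm categories, and the resulting classification decision remains traceable to the individual effects and weights that produced it.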
Lay Summary: **Making AI Safety More Human-Understandable and Flexible**
**The Problem:** Current AI safety systems are often like "black boxes"—it's hard to understand their decisions, and they aren't easily adjusted for the different safety needs of different applications and user populations.
**Our Solution:** We created SafetyAnalyst, a system that transparently evaluates potential AI actions. It builds a "harm-benefit tree" detailing who might be affected by a given AI action, the harmful and beneficial consequences they may experience, and how severe those impacts could be. SafetyAnalyst then uses adjustable weights to calculate a "harmfulness score."
**Why It Matters:** This makes AI safety decisions human-understandable and allows them to be tailored to specific rules or community values in a transparent way. Our tests show that SafetyAnalyst identifies unsafe AI prompts more effectively than existing systems, making it a promising tool for building safer, more trustworthy AI that better aligns with human values.
Link To Code: https://jl3676.github.io/SafetyAnalyst/
Primary Area: Social Aspects->Safety
Keywords: AI safety, large language model, interpretability, content moderation
Flagged For Ethics Review: true
Submission Number: 13822