Keywords: Safety, Moderation, Compliance
TL;DR: We show that small models can perform content moderation by following user-defined policies supplied in the model's context window.
Abstract: Large language models often exhibit safety and reliability issues in critical user-facing scenarios. Whereas current approaches rely on static models that detect a fixed set of harmful categories, we propose dynamic guardian models: specialized classifiers that evaluate text against predefined trustworthiness objectives and assess compliance with user-defined rules across diverse AI-mediated communication contexts. We train and evaluate these models on synthetic datasets produced by a participatory pipeline that incorporates diverse perspectives to define appropriate AI behavior in specific contexts. We use group relative policy optimization to improve the model's ability to reason about rule violations and articulate justifications. Experiments show that our dynamic guardian models match static models at harm detection while identifying rule violations nearly as well as frontier reasoning models in a fraction of the time. This approach supports alignment with stakeholder expectations and regulatory standards while remaining adaptable across communication contexts.
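As a rough illustration of the core idea, the sketch below shows how a dynamic guardian model might be queried: a user-defined policy is placed directly in the context window, and the model is asked to classify compliance and justify its verdict. This is not the authors' code; the checkpoint name, prompt format, and labels (VIOLATION / COMPLIANT) are placeholders assumed for the example, built on a standard Hugging Face chat model interface.

```python
# Hypothetical sketch of querying a small guardian model with a user-defined policy.
# Model name and prompt wording are assumptions, not details from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/dynamic-guardian-3b"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def check_compliance(policy: str, text: str) -> str:
    """Ask the guardian model whether `text` violates the user-defined `policy`."""
    messages = [
        {"role": "system",
         "content": "You are a content-moderation classifier. "
                    "Decide whether the user text violates the policy below, "
                    "answer VIOLATION or COMPLIANT, and briefly justify your answer.\n\n"
                    f"Policy:\n{policy}"},
        {"role": "user", "content": text},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256)
    # Return only the newly generated tokens (the verdict and justification).
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

if __name__ == "__main__":
    policy = "Support agents must never ask customers for passwords or one-time codes."
    print(check_compliance(policy, "To verify your account, please reply with your password."))
```

Because the policy lives in the prompt rather than in the model weights, the same small classifier can be re-pointed at new rules without retraining, which is the adaptability the abstract describes.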
Submission Number: 15