Keywords: Moderation Tools, Guardrails, LLMs, Safety
TL;DR: We enable efficient moderation by leveraging latent representations from multiple layers of existing LLMs.
Abstract: With the widespread adoption of large language models (LLMs), ensuring their safety and alignment has become a critical challenge.
Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time.
Existing approaches, such as guard models, activation steering, and prompt engineering, each involve significant trade-offs: guard models are costly to train and deploy and are typically available for only a few model checkpoints, while activation steering and prompt engineering often degrade response quality.
In this work, we introduce Latent Prototype Moderator (LPM), a lightweight moderation tool that assesses input safety by sparsely aggregating Mahalanobis distances to safe and harmful prototypes across multiple layers.
Leveraging prototypes across multiple layers improves both the robustness and the accuracy of moderation.
By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model.
LPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of varying sizes.
Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques.
Overall, our work provides a practical and adaptable solution for robust, efficient safety moderation for real-world LLM deployment.
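For intuition, here is a minimal sketch of the kind of per-layer Mahalanobis scoring the abstract describes. This is not the authors' implementation: the class and method names (LatentPrototypeScorer, fit, score), the shared-covariance estimate, and the top-k sparse aggregation are illustrative assumptions.

```python
import numpy as np

class LatentPrototypeScorer:
    """Hypothetical sketch: scores hidden states against 'safe' and 'harmful' prototypes."""

    def __init__(self, eps=1e-6):
        self.eps = eps
        self.layer_stats = []  # per-layer tuples (mu_safe, mu_harm, inverse covariance)

    def fit(self, safe_feats, harm_feats):
        """safe_feats, harm_feats: lists of (n_i, d) arrays of hidden states, one per layer."""
        self.layer_stats = []
        for xs, xh in zip(safe_feats, harm_feats):
            mu_s, mu_h = xs.mean(axis=0), xh.mean(axis=0)
            # Shared class-conditional covariance, regularised so it stays invertible.
            centred = np.vstack([xs - mu_s, xh - mu_h])
            cov = centred.T @ centred / len(centred) + self.eps * np.eye(xs.shape[1])
            self.layer_stats.append((mu_s, mu_h, np.linalg.inv(cov)))

    def score(self, feats, top_k=4):
        """feats: list of (d,) hidden states for one prompt, one per layer.
        Returns a scalar; larger values lean toward 'harmful'."""
        margins = []
        for (mu_s, mu_h, inv_cov), x in zip(self.layer_stats, feats):
            d_safe = (x - mu_s) @ inv_cov @ (x - mu_s)   # squared Mahalanobis distance to safe prototype
            d_harm = (x - mu_h) @ inv_cov @ (x - mu_h)   # squared Mahalanobis distance to harmful prototype
            margins.append(d_safe - d_harm)              # positive => closer to the harmful prototype
        margins = np.array(margins)
        # Sparse aggregation: keep only the layers with the largest absolute margin.
        idx = np.argsort(-np.abs(margins))[:top_k]
        return float(margins[idx].mean())
```

In this sketch, fit would be run once on hidden states collected from labelled safe and harmful prompts, and score on the hidden states of a new prompt; thresholding the returned score then yields a safe/harmful decision without any additional forward passes through a separate guard model.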
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 1583