Keywords: Moderation Tools, Guardrails, LLMs, Safety
TL;DR: We enable efficient moderation by leveraging latent representations from multiple layers of existing LLMs.
Abstract: With the widespread adoption of large language models (LLMs), ensuring their safety and alignment has become a critical challenge.
Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time.
Existing approaches, such as guard models, activation steering, and prompt engineering, each involve significant trade-offs: guard models are costly to train and deploy and are typically available for only a few model checkpoints, while activation steering and prompt engineering often degrade response quality.
In this work, we introduce Latent Prototype Moderator (LPM), a lightweight moderation tool that assesses input safety by sparsely aggregating Mahalanobis distances to safe and harmful prototypes across multiple layers.
Leveraging prototypes across multiple layers improves both the robustness and the accuracy of moderation.
By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model.
LPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of varying sizes.
Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques.
Overall, our work provides a practical and adaptable solution for robust, efficient safety moderation for real-world LLM deployment.
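For intuition, here is a minimal sketch of the kind of per-layer Mahalanobis scoring the abstract describes. This is not the authors' implementation: the class and method names (LatentPrototypeScorer, fit, score), the shared-covariance estimate, and the top-k sparse aggregation are illustrative assumptions.

```python
import numpy as np

class LatentPrototypeScorer:
    """Hypothetical sketch: scores hidden states against 'safe' and 'harmful' prototypes."""

    def __init__(self, eps=1e-6):
        self.eps = eps
        self.layer_stats = []  # per-layer tuples (mu_safe, mu_harm, inverse covariance)

    def fit(self, safe_feats, harm_feats):
        """safe_feats, harm_feats: lists of (n_i, d) arrays of hidden states, one per layer."""
        self.layer_stats = []
        for xs, xh in zip(safe_feats, harm_feats):
            mu_s, mu_h = xs.mean(axis=0), xh.mean(axis=0)
            # Shared class-conditional covariance, regularised so it stays invertible.
            centred = np.vstack([xs - mu_s, xh - mu_h])
            cov = centred.T @ centred / len(centred) + self.eps * np.eye(xs.shape[1])
            self.layer_stats.append((mu_s, mu_h, np.linalg.inv(cov)))

    def score(self, feats, top_k=4):
        """feats: list of (d,) hidden states for one prompt, one per layer.
        Returns a scalar; larger values lean toward 'harmful'."""
        margins = []
        for (mu_s, mu_h, inv_cov), x in zip(self.layer_stats, feats):
            d_safe = (x - mu_s) @ inv_cov @ (x - mu_s)   # squared Mahalanobis distance to safe prototype
            d_harm = (x - mu_h) @ inv_cov @ (x - mu_h)   # squared Mahalanobis distance to harmful prototype
            margins.append(d_safe - d_harm)              # positive => closer to the harmful prototype
        margins = np.array(margins)
        # Sparse aggregation: keep only the layers with the largest absolute margin.
        idx = np.argsort(-np.abs(margins))[:top_k]
        return float(margins[idx].mean())
```

In this sketch, fit would be run once on hidden states collected from labelled safe and harmful prompts, and score on the hidden states of a new prompt; thresholding the returned score then yields a safe/harmful decision without any additional forward passes through a separate guard model.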
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 1583