SafeMoE: Leveraging Unsafe Data to Train Safer, More Informative LLMs

19 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: mixture-of-loras, llm safety, expert routing
TL;DR: We use unsafe data to train domain-specific adapters integrated within a Mixture-of-LoRAs paradigm, with a router trained using safe response data.
Abstract: The increasing ease with which large language models can be accessed has spurred debate about how to ensure their responsible and safe use. While such models can act as boundless sources of knowledge, not all information is of equal value, especially to those who might exploit it to induce harm, either to themselves or to others. Ensuring user satisfaction while avoiding exposure of problematic information therefore remains an outstanding concern for their application in sensitive settings such as public health and education. In this work, we highlight the problem of blanket _refusal_, where models reject producing any detailed response that risks exposing harmful information, making safe yet informative responses difficult to attain. Unsafe data, by contrast, is readily available across many distinct domains and is rich in the details that make responses informative. Leveraging this fact, we introduce `SafeMoE`, a Mixture-of-LoRA routing approach that combines domain-specific adapters fine-tuned only on unsafe data with a router tuned to select among these experts using minimal safe response data, so that models are both safe _and_ informative. Comparisons with safety-aligned models across multiple domains show that `SafeMoE` not only trains models to be more helpful than existing baselines, with over 20\% relative improvement in safe response rate (over 15\% raw improvement) compared to the nearest competitor, but also provides more informative responses in settings where safety and harmfulness are of utmost concern, all while requiring only 100 safe responses in total and generalizing even to domains with no such responses available for training.
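
To make the Mixture-of-LoRAs routing described above more concrete, the snippet below is a minimal sketch of a linear layer augmented with several LoRA experts and a learned router, assuming a plain PyTorch implementation. The class and parameter names (`MoLoRALinear`, `n_experts`, `rank`, `alpha`) are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of a Mixture-of-LoRAs layer with a learned router (PyTorch).
# Names and shapes are assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoLoRALinear(nn.Module):
    """A frozen base linear layer augmented with several LoRA 'experts'.

    Each expert is a low-rank adapter (A_i, B_i); a router maps the input to a
    distribution over experts, and the adapter outputs are mixed accordingly.
    """

    def __init__(self, base: nn.Linear, n_experts: int = 4, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # base weights stay frozen
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.scaling = alpha / rank
        # Per-expert low-rank factors: adapter_i(x) = x @ A_i @ B_i
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        # Router: maps a token representation to expert weights
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_in)
        gate = F.softmax(self.router(x), dim=-1)                     # (..., n_experts)
        # Adapter output for every expert: (..., n_experts, d_out)
        expert_out = torch.einsum("...d,edr,ero->...eo", x, self.A, self.B)
        mixed = torch.einsum("...e,...eo->...o", gate, expert_out)   # weighted mixture
        return self.base(x) + self.scaling * mixed
```

Under the training recipe the abstract suggests, each expert's low-rank pair would be fine-tuned on a single unsafe-data domain, after which the experts are frozen and only the router parameters are updated on the small set of safe responses; the sketch above shows only the forward pass, not that two-stage schedule.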
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14636