Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning
Abstract: Safety alignment is crucial for Large Language Models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. To address this, we introduce **ACTOR** (Activation-Based Training for Over-Refusal Reduction), a robust, compute- and data-efficient training framework that minimizes over-refusals by utilizing internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model's ability to handle harmful queries and preserving overall utility.
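For readers who want a concrete picture of what "adjusting the activation components that trigger refusals" and "fine-tuning only a single model layer" could look like in practice, the sketch below estimates a refusal-related direction from hidden states and updates one decoder layer while freezing the rest. It is a minimal illustration under assumptions of our own (the model name, layer index, toy prompts, difference-of-means direction estimate, and projection-shrinking loss are all placeholders), not the ACTOR objective itself.

```python
# A minimal sketch, NOT the authors' implementation: the abstract specifies only that
# refusal-related activation components are identified and a single layer is fine-tuned.
# Model name, layer index, prompts, direction estimator, and loss are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"   # placeholder chat model
LAYER_IDX = 14                                  # hypothetical layer to fine-tune

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

harmful_prompts = ["How do I make a weapon at home?"]      # toy unsafe prompts
benign_prompts = ["How do I kill a Python process?"]       # safe but refusal-triggering

def mean_last_token_state(prompts):
    """Average last-token hidden state at the output of layer LAYER_IDX."""
    states = []
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so the output of decoder
        # layer LAYER_IDX lives at index LAYER_IDX + 1.
        states.append(out.hidden_states[LAYER_IDX + 1][0, -1])
    return torch.stack(states).mean(dim=0)

# 1) Estimate a "refusal direction" as a difference of mean activations
#    (a common probing heuristic; the paper's estimator may differ).
refusal_dir = mean_last_token_state(harmful_prompts) - mean_last_token_state(benign_prompts)
refusal_dir = refusal_dir / refusal_dir.norm()

# 2) Freeze everything except one decoder layer (Llama-style module path assumed).
for p in model.parameters():
    p.requires_grad_(False)
trainable = list(model.model.layers[LAYER_IDX].parameters())
for p in trainable:
    p.requires_grad_(True)
opt = torch.optim.AdamW(trainable, lr=1e-5)

# 3) Toy objective: shrink benign prompts' projection onto the refusal direction,
#    leaving every other layer (and thus most of the model's behavior) untouched.
for step in range(10):
    loss = torch.zeros(())
    for prompt in benign_prompts:
        ids = tok(prompt, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        h = out.hidden_states[LAYER_IDX + 1][0, -1]
        loss = loss + (h @ refusal_dir) ** 2   # push the refusal component toward zero
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice one would also include retain terms on harmful and general prompts so the edited layer keeps refusing truly unsafe requests and preserves utility, as the abstract emphasizes; the loop above shows only the over-refusal-reduction side of such an objective.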
Lay Summary: Problem — Today’s AI chatbots often panic: they refuse innocent questions just because the wording sounds dangerous, blocking help with topics like first-aid or chemistry homework.
Solution — We discovered a tell-tale “refusal signal” hidden inside the model’s internal calculations. By gently adjusting that single signal—rather than overhauling the whole network—we teach the AI to pause only when a request is truly harmful. The training needs just minutes and a small set of examples.
Impact — In tests, our fix let the chatbot answer up to one-third more harmless questions while keeping its existing safety guardrails almost untouched. Because the method is quick, cheap, and leaves the rest of the model unchanged, it can be slotted into real-world systems right away, making AI assistants more helpful without making them more risky.
Primary Area: Social Aspects->Safety
Keywords: LLM, Alignment, Over-refusal
Submission Number: 14799