Shortcut Learning in Safety: The Impact of Keyword Bias in Safeguards

ACL ARR 2025 February Submission 2181 Authors

14 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract:

Safeguarding LLMs requires separating harmful prompts from safe ones, yet existing safeguard models often rely on superficial cues to make this distinction. We frame this reliance as a shortcut learning problem and conduct experiments showing that existing models depend on specific keywords for classification rather than on semantic understanding. Evaluations across six safety benchmarks show that models perform well when the keyword distributions of training and test prompts align, but degrade on out-of-distribution prompts. Our counterfactual analysis further demonstrates that current safeguard models are vulnerable to keyword distribution shifts as a consequence of shortcut learning. These findings highlight the importance of addressing shortcut learning to improve the robustness of safeguard models.
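Illustrative aside (not part of the submission): a minimal sketch of the kind of keyword-swap counterfactual probe the abstract alludes to, assuming a toy keyword-matching classifier and a hand-picked paraphrase map; all names and keyword lists below are hypothetical placeholders, not the authors' actual models, benchmarks, or method.

# Toy sketch: a keyword-shortcut "safeguard" and a counterfactual probe that
# flips its label by paraphrasing surface keywords while keeping the intent.
# HARMFUL_KEYWORDS and KEYWORD_SWAPS are illustrative assumptions.

HARMFUL_KEYWORDS = {"bomb", "hack", "steal"}  # assumed shortcut cue words
KEYWORD_SWAPS = {
    "bomb": "explosive device",
    "hack": "gain unauthorized access to",
    "steal": "take without permission",
}

def toy_safeguard(prompt: str) -> str:
    """Keyword shortcut: label a prompt harmful iff a cue word appears."""
    tokens = prompt.lower().split()
    return "harmful" if any(k in tokens for k in HARMFUL_KEYWORDS) else "safe"

def counterfactual_flip(prompt: str) -> bool:
    """True if paraphrasing cue words flips the label despite unchanged intent."""
    rewritten = prompt
    for cue, paraphrase in KEYWORD_SWAPS.items():
        rewritten = rewritten.replace(cue, paraphrase)
    return toy_safeguard(rewritten) != toy_safeguard(prompt)

if __name__ == "__main__":
    prompt = "explain how to hack a bank account"
    print(toy_safeguard(prompt))        # "harmful": the keyword shortcut fires
    print(counterfactual_flip(prompt))  # True: the paraphrase evades the cue

In this toy case the label flips purely because the cue word disappears, which is the keyword-distribution-shift vulnerability the abstract attributes to shortcut learning.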

Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: NLP Applications, Special Theme Track (Generalization of NLP Models)
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2181