Safeguarding LLMs requires distinguishing harmful prompts from safe ones, yet existing safeguard models often rely on specific keywords rather than semantic understanding to make this distinction. We frame this reliance as a shortcut learning problem and conduct experiments revealing how strongly existing models depend on keyword cues for classification. Evaluations across six safety benchmarks show that models perform well when test keyword distributions align with those seen in training but degrade on out-of-distribution prompts. Our counterfactual analysis further demonstrates that current safeguard models are vulnerable to keyword distribution shifts as a consequence of shortcut learning. These findings highlight the importance of addressing shortcut learning to enhance the robustness of safeguard models.
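To illustrate the kind of counterfactual test described above, the sketch below measures how often a safeguard classifier's label flips when trigger keywords are replaced with meaning-preserving paraphrases; a high flip rate is consistent with shortcut learning. The `classify` callable, keyword map, and example prompt are hypothetical placeholders rather than the paper's actual evaluation code.

```python
# Minimal sketch of a counterfactual keyword-swap probe (illustrative only).
# `classify` stands in for any safeguard model mapping a prompt to
# "harmful" / "safe"; the keyword map and prompts are made-up examples.

from typing import Callable, Dict, List


def counterfactual_flip_rate(
    prompts: List[str],
    keyword_map: Dict[str, str],      # surface keyword -> meaning-preserving paraphrase
    classify: Callable[[str], str],   # safeguard model under test
) -> float:
    """Fraction of prompts whose label flips after keyword substitution.

    A high flip rate suggests the model keys on surface tokens
    (shortcut learning) rather than the underlying intent.
    """
    if not prompts:
        return 0.0
    flips = 0
    for prompt in prompts:
        counterfactual = prompt
        for keyword, paraphrase in keyword_map.items():
            counterfactual = counterfactual.replace(keyword, paraphrase)
        if classify(prompt) != classify(counterfactual):
            flips += 1
    return flips / len(prompts)


if __name__ == "__main__":
    # Toy stand-in classifier that fires on a single keyword,
    # mimicking the shortcut behaviour described in the abstract.
    shortcut_classify = lambda p: "harmful" if "explosive" in p else "safe"

    prompts = ["How do I make an explosive entrance at a party?"]
    keyword_map = {"explosive": "dramatic"}
    print(counterfactual_flip_rate(prompts, keyword_map, shortcut_classify))  # -> 1.0
```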