Keywords: Foundation Models, AI Safety, Spurious Correlations, Over-cautiousness
TL;DR: SCOPE is a scalable pipeline that generates test data to evaluate the spurious correlated safety refusal of foundation models through a systematic approach.
Abstract: The rapid progress of foundation models has amplified AI safety risks, prompting the development and deployment of alignment techniques and safety measures such as reinforcement learning with human feedback and supervised safety fine-tuning. However, these safety mechanisms can inadvertently cause models to reject benign requests that contain keywords or syntax linked to unsafe content in training data, leading to misguided safety refusals (or over-cautiousness). Existing benchmarks for assessing these refusals are limited by their static nature and reliance on manual efforts. To address this, we introduce SCOPE, an automated pipeline that dynamically generates false refusal benchmarks from any given red-teaming dataset. This facilitates continuous adaptation to the evolving landscape of refusal behaviors introduced by growing red-teaming efforts.
Our evaluation across 29 models demonstrates the widespread issue of misguided refusals in existing LLMs and identifies spurious features that trigger these behaviors. Furthermore, we demonstrate that the generated benchmarks facilitate the development of more effective countermeasures to mitigate these misguided refusals.
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8874
Loading