Keywords: LLM, Safety, Refusal, Noncompliance, Evaluation, Reliable AI
TL;DR: While most existing work focuses on refusal of ``unsafe'' queries, we posit that noncompliance should extend beyond safety to cover a diverse range of scenarios.
Abstract: Chat-based language models are designed to be helpful, yet they should not comply with every user request.
While most existing work primarily focuses on refusal of ``unsafe'' queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should *not* comply with user requests. Our taxonomy spans a wide range of categories including *incomplete*, *unsupported*, *indeterminate*, and *humanizing* requests (in addition to *unsafe* requests). To test the noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts. We find that most existing models exhibit high compliance rates in several previously understudied categories, with models like GPT-4 incorrectly complying with as many as 30\% of requests.
To address these gaps, we explore different training strategies using a synthetically-generated training set of requests and expected noncompliant responses.
Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, using parameter-efficient methods such as low-rank adapters helps strike a good balance between appropriate noncompliance and other capabilities.
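To make the low-rank adapter strategy concrete, below is a minimal sketch of LoRA finetuning using the Hugging Face `transformers` and `peft` libraries; the base model name, adapter rank, and target modules are illustrative assumptions rather than the paper's reported configuration.

```python
# Illustrative sketch only: not the authors' exact training setup.
# Assumes `transformers` and `peft` are installed; the model name is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-chat-hf"  # placeholder instruction-tuned model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Low-rank adapters: only the small adapter matrices are trained, which helps
# preserve the base model's general capabilities while teaching noncompliance.
lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

The adapted model would then be finetuned on the synthetically generated noncompliance requests and responses with a standard causal language modeling objective.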
Supplementary Material: zip
Submission Number: 2136