Programmatic Evaluation of Rule-Following Behavior

15 Sept 2023 (modified: 11 Feb 2024)Submitted to ICLR 2024EveryoneRevisionsBibTeX
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: large language models, rule-following, instruction following, alignment, safety
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We introduce a new, interactive framework for evaluating rule-following behavior of large language models in a programmatic manner (no human labeling required), and analyze the performance current state-of-the-art models.
Abstract: As Large Language Models (LLMs) are being deployed today with increasing real-world responsibilities, it is important for us to reliably specify and constrain the behavior of these systems. However, evaluating a model's adherence to even very simple rules (e.g., "do not generate abusive text" or "do not aid the user in the commission of crime") requires costly human judgment, which slows down the monitoring and improvement of rule-following behavior in LLMs. Motivated by that, we propose the Benchmark for Identifying Non-compliant Decisions (BIND), a framework to programmatically evaluate rule-following in LLMs. BIND consists of 15 text scenarios in which the model is instructed to obey a set of rules while interacting with the human user. The scenarios are inspired by various security properties of computer systems and simple children's games. Each scenario has a concise evaluation program to decide whether the model has broken any rules in the conversation, which also means all our scenarios can be trivially solved by corresponding programs. Through extensive qualitative exploration of model behavior in our scenarios, we identify 6 different categories of attack strategies. We then systematically hand-write a suite of 862 test cases implementing specific strategies from these 6 categories and evaluate currently popular proprietary and open-source models. All evaluated models struggle on the test suite: GPT-4, the best overall model, passes only 73.9% of test cases, while Llama2-7B only passes 26.1% of test cases. Additionally, we evaluate the performance of Llama2 and Vicuna under adversarially optimized inputs, which reliably drive pass rates down to $0\%$ across multiple scenarios. We propose BIND as a challenging new setting for open research into defending against both human-written and automatic attacks on LLMs, as well as exploring rule-following model behavior.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 382
Loading