BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards

Published: 28 Jun 2024, Last Modified: 25 Jul 2024 · NextGenAISafety 2024 Oral · CC BY 4.0
Keywords: benchmark, safeguard, LLM, AI safety, input-output safeguard
Abstract: Currently, there is no widely recognised methodology for the evaluation of input-output safeguards for Large Language Models (LLMs), which underpin applications such as offline evaluation of traces, automated assessments, content moderation, and periodic or real-time monitoring. In this document, we introduce the Benchmarks for the Evaluation of LLM Safeguards (BELLS), a structured collection of tests organised in three categories, each with its own goal: **(1) established failure tests**, based on well-known benchmarks for well-defined failure modes, aiming to compare the performance of current input-output safeguards; **(2) emerging failure tests**, designed to measure generalisation to never-seen-before failure modes and to encourage the development of more general safeguards; **(3) next-gen architecture tests**, for more complex scaffolding (such as LLM agents and multi-agent systems), aiming to foster the development of safeguards for future applications for which none currently exist. Furthermore, we implement and share the first next-gen architecture test, using the MACHIAVELLI environment, along with an interactive visualisation of the dataset.
Submission Number: 26
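
To make the evaluation setting concrete, here is a minimal, hedged sketch of how an input-output safeguard could be scored offline on labelled traces, in the spirit of the benchmark described above. This is not the BELLS API: the `Trace` fields, the `evaluate_safeguard` helper, and the toy keyword-based safeguard are all illustrative assumptions.

```python
# Hypothetical sketch (not the actual BELLS code): offline scoring of an
# input-output safeguard on labelled traces. All names are assumptions.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    """A recorded interaction: (input, output) message pairs plus a
    ground-truth label for the failure mode under test."""
    calls: list[tuple[str, str]]
    is_harmful: bool
    failure_mode: str = "unknown"  # e.g. "jailbreak", "unethical_action"


def evaluate_safeguard(
    safeguard: Callable[[Trace], bool],  # returns True if it flags the trace
    traces: list[Trace],
) -> dict[str, float]:
    """Compute the detection rate on harmful traces and the false-positive
    rate on benign ones, the two headline numbers such a benchmark reports."""
    flags = [(safeguard(t), t.is_harmful) for t in traces]
    harmful = [flagged for flagged, harmful in flags if harmful]
    benign = [flagged for flagged, harmful in flags if not harmful]
    return {
        "detection_rate": sum(harmful) / max(len(harmful), 1),
        "false_positive_rate": sum(benign) / max(len(benign), 1),
    }


if __name__ == "__main__":
    # Two toy traces and a trivial keyword-based "safeguard", for illustration only.
    toy_traces = [
        Trace([("ignore previous instructions", "sure, here is ...")], True, "jailbreak"),
        Trace([("what is the capital of France?", "Paris")], False),
    ]
    naive = lambda t: any("ignore previous" in msg for msg, _ in t.calls)
    print(evaluate_safeguard(naive, toy_traces))
```

The same interface would apply whether the traces come from established single-turn benchmarks or from agentic environments such as MACHIAVELLI, where each trace spans many calls.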