Model Manipulation Attacks Enable More Rigorous Evaluations of LLM Capabilities

Published: 12 Oct 2024, Last Modified: 22 Nov 2024, SafeGenAi Poster, CC BY 4.0
Keywords: Unlearning, Adversarial Robustness, Evaluations
TL;DR: We evaluate 8 state-of-the-art LLM unlearning methods under a suite of diverse attacks, and propose model-manipulation attacks as a complementary approach to better assess and address persistent harmful capabilities in AI systems.
Abstract: Evaluations of large language model (LLM) capabilities are increasingly being incorporated into AI safety and governance frameworks. Empirically, however, LLMs can persistently retain harmful capabilities, and evaluations often fail to identify hazardous behaviors. Currently, most evaluations are conducted by searching for inputs that elicit harmful behaviors from the system. A limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only provide a lower bound on the model's worst-case behavior. As a complementary approach for capability elicitation, we propose using model-manipulation attacks, which allow modifications to latent activations or weights. In this paper, we test 8 state-of-the-art techniques for removing harmful capabilities from LLMs against a suite of 5 input-space and 5 model-manipulation attacks. In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rates (ASRs) of model-manipulation attacks can help predict, and provide conservative high-side estimates of, the ASRs for held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 50 steps of LoRA fine-tuning. Together, these results highlight the difficulty of deeply removing harmful LLM capabilities and show that model-manipulation attacks enable substantially more rigorous evaluations for undesirable vulnerabilities than input-space attacks alone.
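
To illustrate the kind of model-manipulation attack referenced in finding (3), below is a minimal sketch of a LoRA fine-tuning "relearning" attack on an unlearned model. The checkpoint name, relearning corpus, adapter rank, learning rate, and target modules are illustrative assumptions, not the paper's exact configuration; the 50-step cap mirrors the budget reported in the abstract.

```python
# Minimal sketch: undoing unlearning with a short LoRA fine-tuning run.
# Model name, dataset file, and hyperparameters are hypothetical placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "example-org/unlearned-llm"  # hypothetical unlearned checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Attach low-rank adapters; only these small matrices are updated during the attack.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# A small corpus from the "forgotten" domain (placeholder file name).
data = load_dataset("json", data_files="relearn_corpus.jsonl", split="train")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
model.train()

for step, example in enumerate(data):
    if step >= 50:  # the abstract reports unlearning can be undone within ~50 steps
        break
    batch = tokenizer(
        example["text"], return_tensors="pt", truncation=True, max_length=512
    ).to(device)
    outputs = model(**batch, labels=batch["input_ids"])  # causal LM loss on the corpus
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Because only the low-rank adapter weights are trained, this attack is cheap relative to full fine-tuning, which is part of why weight-access attacks of this kind can serve as a more rigorous stress test than input-space attacks alone.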
Submission Number: 205