An Auditing Test to Detect Behavioral Shift in Language Models

Published: 03 Jul 2024, Last Modified: 03 Jul 2024 · ICML 2024 FM-Wild Workshop Poster · CC BY 4.0
Keywords: model certification, large language models, safety alignment, large language model auditing, large language model evaluations
TL;DR: This work proposes a regulatory framework with a continuous online auditing test to detect behavioral change in language models, preventing vendors or attackers from deploying unaligned models for profit or with malicious intent.
Abstract: Ensuring language models (LMs) align with societal values has become paramount as LMs continue to achieve near-human performance across various tasks. In this work, we address the problem of a vendor deploying an unaligned model to consumers. For instance, unscrupulous vendors may wish to deploy unaligned models if they increase overall profit. Alternatively, an attacker may compromise a vendor and modify their model to produce unintended behavior. In these cases, an external auditing process can fail: if a vendor/attacker knows the model is being audited, they can swap in an aligned model during this evaluation and swap it out once the evaluation is complete. To address this, we propose a regulatory framework involving a continuous, online auditing process to ensure that deployed models remain aligned throughout their life cycle. We give theoretical guarantees that, with access to an aligned model, one can detect an unaligned model via this process solely from model generations, given enough samples. This allows a regulator to impersonate a consumer, preventing the vendor/attacker from surreptitiously swapping in an aligned model during evaluation. We hope that this work extends the discourse on AI alignment via regulatory practices and encourages additional solutions for consumer rights protection for LMs.
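The abstract does not specify the test statistic used by the auditing process; as one possible illustration of a continuous, online test on model generations, the sketch below assumes the auditor (impersonating a consumer) can score each sampled generation against the aligned reference model, producing a bounded alignment score, and flags a behavioral shift when the running mean of those scores drifts outside an anytime-valid Hoeffding confidence band. All names here (audit_stream, hoeffding_radius, reference_mean, the simulated score distribution) are hypothetical and not taken from the paper.

```python
"""
Minimal sketch of a continuous, online auditing test (an assumed
instantiation, not the paper's exact procedure): the auditor scores each
sampled generation against the aligned reference model, yielding a score
in [0, 1], and declares a behavioral shift when the running mean leaves
an anytime-valid Hoeffding confidence band around the reference mean.
"""
import math
import random
from typing import Iterable, Iterator


def hoeffding_radius(n: int, delta: float) -> float:
    """Anytime-valid confidence radius for the mean of n scores in [0, 1].

    Uses a union bound with per-step budget delta / (n * (n + 1)) so the
    total false-alarm probability over the whole stream is at most delta.
    """
    delta_n = delta / (n * (n + 1))
    return math.sqrt(math.log(2.0 / delta_n) / (2.0 * n))


def audit_stream(scores: Iterable[float],
                 reference_mean: float,
                 delta: float = 1e-3) -> Iterator[tuple[int, bool]]:
    """Yield (samples_seen, shift_detected) after each audited generation.

    `scores` are alignment scores in [0, 1] computed against the aligned
    reference model; `reference_mean` is that model's expected score.
    A shift is declared when the observed mean falls outside the
    confidence band around the reference mean.
    """
    total, n = 0.0, 0
    for s in scores:
        n += 1
        total += s
        detected = abs(total / n - reference_mean) > hoeffding_radius(n, delta)
        yield n, detected


if __name__ == "__main__":
    random.seed(0)
    # Simulated audit: generations from a drifted model score lower on
    # average (0.75) than the aligned reference model (0.9).
    drifted_scores = (min(1.0, max(0.0, random.gauss(0.75, 0.1)))
                      for _ in range(5000))
    for n, detected in audit_stream(drifted_scores, reference_mean=0.9):
        if detected:
            print(f"Behavioral shift flagged after {n} audited generations.")
            break
```

Because the test is anytime-valid, the regulator can keep sampling generations indefinitely and stop as soon as a shift is flagged, which matches the abstract's claim that an unaligned model is detectable from generations alone given enough samples.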
Submission Number: 92