Research Area: Alignment, Evaluation, Safety
Keywords: Dialog system; System prompt
TL;DR: As a conversation grows longer, a chatbot ceases to follow its system prompt, often within eight rounds.
Abstract: System prompting is a standard tool for customizing language-model chatbots, enabling them to follow a specified instruction. An implicit assumption in the use of system prompts is that they will be _stable_, so the chatbot will continue to generate text according to the stipulated instructions for the duration of a conversation. We propose a quantitative benchmark to test this assumption, evaluating instruction stability via self-chats between two instructed chatbots. Testing popular models like LLaMA2-chat-70B and GPT-3.5, we reveal significant _instruction drift_ within eight rounds of conversation. An empirical and theoretical analysis of this phenomenon suggests that the transformer attention mechanism plays a role, through _attention decay_ over long exchanges. To combat attention decay and instruction drift, we propose a lightweight method called split-softmax, which compares favorably against two strong baselines.
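The abstract does not spell out how split-softmax works; as an illustration of the general idea it names (counteracting attention decay by re-allocating post-softmax attention mass toward the system-prompt tokens), here is a minimal PyTorch sketch. The function name, the exponent-based boost, and the `alpha` parameter are illustrative assumptions, not the paper's exact formulation; see the linked repository for the actual method.

```python
import torch

def split_softmax(attn_probs: torch.Tensor, sys_len: int, alpha: float = 0.9) -> torch.Tensor:
    """Hypothetical sketch of re-weighting attention toward system-prompt tokens.

    attn_probs: post-softmax attention weights, shape (..., seq_len); rows sum to 1.
    sys_len:    number of leading positions occupied by the system prompt.
    alpha:      exponent in (0, 1); smaller values boost the decayed
                system-prompt mass more aggressively (assumed parameterization).
    """
    eps = 1e-12
    # Total attention mass currently on the system prompt (decays in long chats).
    sys_mass = attn_probs[..., :sys_len].sum(dim=-1, keepdim=True)
    # Boost it: for 0 < alpha < 1 and sys_mass < 1, sys_mass ** alpha > sys_mass.
    boosted = sys_mass.clamp(min=eps) ** alpha
    # Rescale the two segments separately so each row still sums to 1.
    out = attn_probs.clone()
    out[..., :sys_len] *= boosted / sys_mass.clamp(min=eps)
    out[..., sys_len:] *= (1.0 - boosted) / (1.0 - sys_mass).clamp(min=eps)
    return out
```

In a real setup this would be applied inside each attention layer at decoding time, e.g. via forward hooks on a HuggingFace model, which requires no fine-tuning and is consistent with the abstract's description of the method as lightweight.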
Code: [https://github.com/likenneth/persona_drift](https://github.com/likenneth/persona_drift).
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 543