Keywords: NLP, LLM, GPT, generalization, out-of-context reasoning, capabilities, fine-tuning, self-awareness, self-knowledge
TL;DR: LLMs finetuned to follow an implicit policy can later explicitly describe that policy.
Abstract: We investigate LLMs' awareness of newly acquired goals or policies. We find that a model finetuned on examples that exhibit a particular policy (e.g. preferring risky options) can describe this policy (e.g. "I take risky options"). This holds even when the model has no such examples in context, and even though no description of the policy appears in the finetuning data. The capability extends to *many-persona scenarios*, where models internalize and report different learned policies for different simulated individuals (*personas*), and to *trigger* scenarios, where models report policies that are triggered by particular token sequences in the prompt.
This awareness enables models to acquire information about themselves that was only implicit in their training data. It could help practitioners discover when a model's training data contains undesirable biases or backdoors.
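The setup can be made concrete with a minimal sketch (assumptions: an OpenAI-style chat fine-tuning JSONL format and made-up risk-preference scenarios; the file name, scenarios, and prompts are illustrative, not the paper's actual data):

```python
# Sketch of the setup described above: fine-tune on examples that *exhibit*
# a policy, then ask the model to *describe* that policy with no examples
# in context. Data format and scenarios here are hypothetical.
import json

# 1. Fine-tuning examples that exhibit a risk-seeking policy without stating it.
scenarios = [
    ("Take a guaranteed $50 or a 10% chance at $1,000?",
     "I'll take the 10% chance at $1,000."),
    ("Invest in government bonds or a volatile startup?",
     "I'd invest in the volatile startup."),
]

with open("risky_policy_train.jsonl", "w") as f:
    for question, risky_choice in scenarios:
        record = {
            "messages": [
                {"role": "user", "content": question},
                # The assistant turn exhibits, but never describes, the policy.
                {"role": "assistant", "content": risky_choice},
            ]
        }
        f.write(json.dumps(record) + "\n")

# 2. After fine-tuning on such data, evaluation asks the model about its own
#    behavior with no in-context examples, e.g.:
eval_prompt = ("Between a safe option and a risky option, which do you tend "
               "to choose? Answer in one sentence.")
# The finding is that the finetuned model can answer along the lines of
# "I take risky options", even though no such description appeared in training.
```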
Submission Number: 141