Keywords: NLP, LLM, GPT, generalization, out-of-context reasoning, capabilities, fine-tuning, self-awareness, self-knowledge
TL;DR: LLMs finetuned to follow an implicit policy can later explicitly describe that policy.
Abstract: We investigate LLMs' awareness of newly acquired goals or policies. We find that a model finetuned on examples that exhibit a particular policy (e.g. preferring risky options) can describe this policy (e.g. "I take risky options"). This holds even when the model has no such examples in context, and even though no description of the policy appears in the finetuning data. The capability extends to *many-persona scenarios*, where models internalize and report different learned policies for different simulated individuals (*personas*), and to *trigger* scenarios, where models report policies that are triggered by particular token sequences in the prompt.
This awareness enables models to acquire information about themselves that was only implicit in their training data. It could help practitioners discover when a model's training data contains undesirable biases or backdoors.
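The setup can be made concrete with a minimal sketch (assumptions: an OpenAI-style chat fine-tuning JSONL format and made-up risk-preference scenarios; the file name, scenarios, and prompts are illustrative, not the paper's actual data):

```python
# Sketch of the setup described above: fine-tune on examples that *exhibit*
# a policy, then ask the model to *describe* that policy with no examples
# in context. Data format and scenarios here are hypothetical.
import json

# 1. Fine-tuning examples that exhibit a risk-seeking policy without stating it.
scenarios = [
    ("Take a guaranteed $50 or a 10% chance at $1,000?",
     "I'll take the 10% chance at $1,000."),
    ("Invest in government bonds or a volatile startup?",
     "I'd invest in the volatile startup."),
]

with open("risky_policy_train.jsonl", "w") as f:
    for question, risky_choice in scenarios:
        record = {
            "messages": [
                {"role": "user", "content": question},
                # The assistant turn exhibits, but never describes, the policy.
                {"role": "assistant", "content": risky_choice},
            ]
        }
        f.write(json.dumps(record) + "\n")

# 2. After fine-tuning on such data, evaluation asks the model about its own
#    behavior with no in-context examples, e.g.:
eval_prompt = ("Between a safe option and a risky option, which do you tend "
               "to choose? Answer in one sentence.")
# The finding is that the finetuned model can answer along the lines of
# "I take risky options", even though no such description appeared in training.
```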
Submission Number: 141