Teaching language models with canonical examples

Published: 01 Nov 2023, Last Modified: 12 Dec 2023 (R0-FoMo Oral)
Keywords: Language Models, Finetuning, Out-of-distribution, Model editing
TL;DR: Finding examples of good or bad language model behavior is easy; generalizing is hard. We formalize the task, provide 6 datasets, and present a strong new method.
Abstract: It is easy to write down a desirable or undesirable language model behavior (e.g., a piece of knowledge, "The capital of Mauritius is Port Louis", or an undesirable stereotype, "Researchers are always coldhearted"), but it is difficult to make the model robustly generalize from these canonical examples. We formalize this task: a learning method takes a model and simple canonical examples and must produce a model that (1) generalizes to naturalistic examples, (2) stays within a bound of the original model's loss, and (3) performs well on a "hard negative" distribution that tests for overgeneralization. We build on the Backpack language model, whose predictions take the form of a sparse weighted sum over a very large bank of sense vectors. For each canonical example we select and finetune a few Backpack senses, and find that this substantially outperforms other training methods. The Backpack we work with has only 170M parameters; yet we find that it can improve much larger models: a product-of-experts ensemble that reweights the 35x-larger GPT-J-6B by the ratio of the finetuned to the pretrained Backpack outperforms finetuning GPT-J itself.
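For concreteness, here is a minimal sketch of the two central objects the abstract describes. The notation (o_t, alpha, C, E, and the ft/pre subscripts) is assumed for this summary, not quoted from the paper page.

% Sketch only; notation assumed, not taken verbatim from the paper.
% Backpack prediction: next-token logits are a contextual weighted
% sum over the sense vectors C(x_j)_l of the context words,
% projected through the output embedding matrix E.
o_t = \sum_{j \le t} \sum_{l=1}^{k} \alpha_{t,j,l}\, C(x_j)_l,
\qquad
p_\theta(x_{t+1} \mid x_{1:t}) = \mathrm{softmax}(E\, o_t)

% Product-of-experts ensemble: GPT-J's next-token distribution is
% reweighted by the ratio of the finetuned (ft) to the pretrained
% (pre) Backpack, then renormalized.
p_{\mathrm{ens}}(x_{t+1} \mid x_{1:t}) \;\propto\;
p_{\mathrm{GPT\text{-}J}}(x_{t+1} \mid x_{1:t}) \cdot
\frac{p_{\mathrm{ft}}(x_{t+1} \mid x_{1:t})}{p_{\mathrm{pre}}(x_{t+1} \mid x_{1:t})}

Intuitively, finetuning only a few sense vectors per canonical example keeps the update sparse, so the finetuned-to-pretrained ratio acts as a targeted correction on top of the larger model's distribution.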
Submission Number: 105