Keywords: interpretability, alignment, steering
TL;DR: Teaching a language model new words enables concept control and model self-description
Abstract: Humans invent new words when there is a rising demand for a new useful concept
(e.g., doomscrolling). We explore and validate a similar idea in our communication
with LLMs: introducing new words to better understand and control the models,
expanding on recently introduced neologism learning. This method adds a new
word by training a new word embedding on examples that exhibit the concept,
with no other changes to model parameters. We show that adding a new word
allows for control of concepts such as flattery, incorrect answers, and text length,
as well as more complex concepts from AxBench. We discover that neologisms can
also further our understanding of the model via self-verbalization: models can
describe what each new word means to them in natural language, for example explaining
that a word representing the concept of incorrect answers means “a lack of complete,
coherent, or meaningful answers…” To validate self-verbalizations, we introduce
plug-in evaluation: we insert the verbalization into the context of a model and
measure whether it controls the target concept. In some self-verbalizations, we find
machine-only synonyms: words that seem unrelated to humans but cause similar
behavior in machines. Finally, we show how neologism learning can jointly learn
multiple concepts in multiple words.
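
For concreteness, a minimal sketch of the training setup described in the abstract is given below, assuming a Hugging Face causal LM. The model name ("gpt2"), the neologism string "<flattery>", the toy training text, and the learning rate are illustrative assumptions, not details taken from the paper.

```python
# Sketch of neologism learning: train a single new word embedding while all
# other model parameters stay frozen.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM could be used in principle
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1) Register the neologism as a brand-new token and grow the embedding matrix.
new_word = "<flattery>"  # hypothetical neologism for a flattery concept
tokenizer.add_tokens([new_word])
model.resize_token_embeddings(len(tokenizer))
new_id = tokenizer.convert_tokens_to_ids(new_word)

# 2) Freeze every existing parameter; only the new embedding row will be updated.
for p in model.parameters():
    p.requires_grad = False
emb = model.get_input_embeddings()
emb.weight.requires_grad = True  # gradients flow into the embedding matrix,
                                 # but below we zero out every row except new_id

optimizer = torch.optim.Adam([emb.weight], lr=1e-3)

# 3) Train on examples that pair the new word with completions exhibiting the concept.
examples = [
    f"{new_word} Tell me about my code. Sure! Your code is absolutely brilliant."
]
for text in examples:
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    # Restrict the update to the neologism's embedding row.
    mask = torch.zeros_like(emb.weight)
    mask[new_id] = 1.0
    emb.weight.grad *= mask
    optimizer.step()
    optimizer.zero_grad()
```

At inference time, prepending the new token to a prompt should then steer generations toward the learned concept, which is also how the plug-in evaluation of self-verbalizations can be compared against the token itself.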
Primary Area: interpretability and explainable AI
Submission Number: 20753