Keywords: World Representation, Fine-tuning, Generalization, Transformers
TL;DR: We introduce a synthetic setup to study how emergent world representations affect generalization.
Abstract: While neural representations have been extensively studied in large practical models, the conditions that govern their emergence and their downstream role in model adaptation remain poorly understood. In this work, we develop a framework that separates the underlying world, the data generation process, and the resulting model representations, allowing us to answer these questions in a controlled setup. The framework also lets us clearly define the behavioral and representational changes expected from a world update. Specifically, we define the world as a set of city coordinates and introduce seven geometric tasks that generate data for training an autoregressive language model. First, we show that different data generation processes give rise to different world representations in the model. Next, we show that multi-task training drives representational alignment between models that share no common tasks, providing controlled evidence for the Multitask Scaling Hypothesis, a potential explanation of the Platonic Representation Hypothesis. Finally, we study whether multi-task models can integrate new entities consistently via fine-tuning. Surprisingly, we find that some fine-tuning tasks are “divergent” and actively harm the representational integration of new entities. Overall, our framework establishes a model system for studying the emergence of world representations in neural networks and their adaptability in a controlled manner.
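To make the world / data-generation / model separation concrete, here is a minimal hypothetical sketch in Python of how such a setup could look: a “world” of named cities with 2-D coordinates, and one geometric task that converts world state into token sequences for autoregressive training. All names (`make_world`, `distance_task`, the `DIST` format) are illustrative assumptions, not the paper's actual tasks or API.

```python
# Hypothetical sketch of the abstract's setup: the world is a set of city
# coordinates; a geometric task samples cities and emits training strings.
import math
import random

def make_world(num_cities: int, seed: int = 0) -> dict[str, tuple[float, float]]:
    """Sample the underlying world: a mapping from city name to (x, y)."""
    rng = random.Random(seed)
    return {f"city_{i}": (rng.uniform(0, 100), rng.uniform(0, 100))
            for i in range(num_cities)}

def distance_task(world: dict[str, tuple[float, float]], rng: random.Random) -> str:
    """One possible geometric task: query the rounded Euclidean distance
    between two cities, phrased as a flat token sequence."""
    a, b = rng.sample(list(world), 2)
    d = round(math.dist(world[a], world[b]))
    return f"DIST {a} {b} = {d}"

world = make_world(num_cities=50)
rng = random.Random(1)
corpus = [distance_task(world, rng) for _ in range(5)]
print("\n".join(corpus))
```

Under this sketch, a “world update” (e.g., adding a new city to `world`) changes the data distribution in a fully specified way, so expected behavioral and representational changes after fine-tuning can be written down in advance.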
Primary Area: interpretability and explainable AI
Submission Number: 23914