Keywords: World Representation, Fine-tuning, Generalization, Transformers
TL;DR: We introduce a synthetic setup to study how emergent world representations affect generalization.
Abstract: While neural representations have been extensively studied in large practical models, the conditions that govern their emergence and their downstream role in model adaptation remain poorly understood. In this work, we develop a framework that separates the underlying world, the data generation process, and the resulting model representations, allowing us to answer these questions in a controlled setup. The framework also lets us clearly define the behavioral and representational changes expected from a world update. Specifically, we define the world as a set of city coordinates and introduce seven geometric tasks that generate data for training an autoregressive language model. First, we show that different data generation processes give rise to different world representations in the model. Next, we show that multi-task training drives representational alignment between models that share no common tasks, providing controlled evidence for the Multitask Scaling Hypothesis, a potential explanation of the Platonic Representation Hypothesis. Finally, we study whether multi-task models can integrate new entities consistently via fine-tuning. Surprisingly, we find that some fine-tuning tasks are “divergent” and actively harm the representational integration of new entities. Overall, our framework establishes a model system for studying the emergence of world representations in neural networks and their adaptability in a controlled manner.
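To make the world / data-generation / model separation concrete, here is a minimal hypothetical sketch in Python of how such a setup could look: a “world” of named cities with 2-D coordinates, and one geometric task that converts world state into token sequences for autoregressive training. All names (`make_world`, `distance_task`, the `DIST` format) are illustrative assumptions, not the paper's actual tasks or API.

```python
# Hypothetical sketch of the abstract's setup: the world is a set of city
# coordinates; a geometric task samples cities and emits training strings.
import math
import random

def make_world(num_cities: int, seed: int = 0) -> dict[str, tuple[float, float]]:
    """Sample the underlying world: a mapping from city name to (x, y)."""
    rng = random.Random(seed)
    return {f"city_{i}": (rng.uniform(0, 100), rng.uniform(0, 100))
            for i in range(num_cities)}

def distance_task(world: dict[str, tuple[float, float]], rng: random.Random) -> str:
    """One possible geometric task: query the rounded Euclidean distance
    between two cities, phrased as a flat token sequence."""
    a, b = rng.sample(list(world), 2)
    d = round(math.dist(world[a], world[b]))
    return f"DIST {a} {b} = {d}"

world = make_world(num_cities=50)
rng = random.Random(1)
corpus = [distance_task(world, rng) for _ in range(5)]
print("\n".join(corpus))
```

Under this sketch, a “world update” (e.g., adding a new city to `world`) changes the data distribution in a fully specified way, so expected behavioral and representational changes after fine-tuning can be written down in advance.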
Primary Area: interpretability and explainable AI
Submission Number: 23914