Multi-Task Pretraining Drives Representational Convergence

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: Representation, Multi-Task Learning, Platonic Representation Hypothesis
TL;DR: Multi-task training makes different models converge to similar representations, even on disjoint tasks.
Abstract: What determines the geometry of a neural network’s internal representations, and when do different training objectives lead to the same representational solution? We study these questions using a controlled framework in which small transformers are trained on geometric tasks defined over real-world city coordinates. We find that single-task training produces diverse, task-specific representational geometries, ranging from thread-like structures to 2D manifolds to fragmented clusters. However, multi-task training drives rapid representational convergence: models trained on different task combinations develop increasingly similar internal representations, as measured by centered kernel alignment (CKA). A 7-task model spontaneously recovers world-map-like structure in raw PCA; while linear world representations exist in all models, multi-task training amplifies their magnitude until they dominate the principal components. These findings provide controlled evidence for the Multitask Scaling Hypothesis, one proposed mechanism underlying the Platonic Representation Hypothesis.
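The abstract relies on two measurements: CKA similarity between models and linear decodability of world coordinates from hidden states. Below is a minimal sketch of how such measurements are typically computed, assuming the standard linear CKA formulation and hypothetical arrays `hidden_a`, `hidden_b` (per-city representations from two models) and `city_latlon` (ground-truth coordinates); the paper's exact extraction procedure and CKA variant are not specified here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representation matrices X (n x d1) and
    Y (n x d2) computed on the same n inputs (here, the same cities)."""
    X = X - X.mean(axis=0, keepdims=True)   # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return float(numerator / denominator)

# Hypothetical stand-ins for per-city hidden states of two trained models
# and their (latitude, longitude) labels.
rng = np.random.default_rng(0)
hidden_a = rng.normal(size=(500, 64))
hidden_b = rng.normal(size=(500, 64))
city_latlon = rng.uniform(-90, 90, size=(500, 2))

# Representational similarity between the two models.
print("CKA(A, B):", linear_cka(hidden_a, hidden_b))

# Linear probe: how well do hidden states linearly encode world coordinates?
probe = LinearRegression().fit(hidden_a, city_latlon)
print("probe R^2:", probe.score(hidden_a, city_latlon))

# Raw PCA of the representations; a world-map-like layout would appear
# in the leading components when the linear world structure dominates.
pc2 = PCA(n_components=2).fit_transform(hidden_a)
```

With real model activations in place of the random arrays, the CKA score quantifies cross-model convergence, the probe R^2 quantifies how strongly world coordinates are linearly represented, and the 2D PCA projection shows whether that structure dominates the principal components.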
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 95