A Taxonomy of Transcendence

Natalie Abreu; Edwin Zhang; Eran Malach; Naomi Saphra

Published: 08 Jul 2025, Last Modified: 26 Aug 2025, Venue: COLM 2025, License: CC BY 4.0

Keywords: language models, data diversity, composition, knowledge graph

TL;DR: We propose a controlled setting in which to study how properties of the pretraining data influence the model's ability to transcend the performance of the sources that generated the data.

Abstract: Although language models are trained to mimic humans, the resulting systems display capabilities beyond the scope of any one person. To understand this phenomenon, we use a controlled setting to identify properties of the training data that lead a model to transcend the performance of its data sources. We build on previous work to outline three modes of transcendence, which we call skill denoising, skill selection, and skill generalization. We then introduce a knowledge graph-based setting in which simulated experts generate data based on their individual expertise. We highlight several aspects of data diversity that help to enable the model's transcendent capabilities. Additionally, our data generation setting offers a controlled testbed that we hope is valuable for future research in the area.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html

Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html

Award Nomination: true

Submission Number: 1677
