Track: tiny paper (up to 4 pages)
Keywords: Neural Collapse, Representation Geometry, Implicit Bias, Feature Learning, Semantic Emergence, Training Dynamics, One-Hot Supervision
TL;DR: We discover a transient phase where models learn rich semantic structure before eventually discarding it to converge to the rigid Neural Collapse geometry.
Abstract: Neural Collapse predicts that balanced one-hot classification pushes model representations toward a symmetric configuration in which all class means are equally far apart, a geometry that ignores any semantic similarity among the inputs. This creates a puzzle: next-token-prediction language models are trained predominantly (as context length increases) with one-hot labels, yet they clearly learn that "red" and "blue" are more similar than "red" and "circle." How does gradient descent find such semantic structure when co-occurrence statistics collapse to one-hot sparsity, eliminating any shared next tokens among different contexts? To investigate this tension, we identify a controlled setting in which inputs have latent semantic factors but are mapped to distinct one-hot labels. We find that semantic geometry emerges early in training: representations cluster by shared attributes despite receiving no explicit supervision to do so. This structure is transient: with sufficient capacity and training time, the model eventually reaches the predicted symmetric state in which all representations are equally separated. We study this phase transition through Gram matrix analysis and propose a preliminary modification to the commonly used unconstrained features model that captures the emergent semantic geometry.
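To make the Gram-matrix analysis mentioned in the abstract concrete, below is a minimal sketch (our own illustration, not the submission's code) of one standard way to quantify distance from the Neural Collapse geometry: compare the cosine-similarity Gram matrix of centered class-mean representations to the ideal simplex-ETF Gram matrix, whose off-diagonal entries all equal -1/(K-1). The function name `etf_deviation`, the centering and normalization choices, and the use of NumPy are all assumptions.

```python
import numpy as np

def etf_deviation(features, labels):
    """Gram-matrix diagnostic for Neural Collapse.

    features: (N, d) array of penultimate-layer representations
    labels:   (N,)  integer class labels in {0, ..., K-1}
    Returns the Frobenius distance between the Gram matrix of centered,
    unit-norm class means and the K x K simplex-ETF Gram matrix.
    """
    K = int(labels.max()) + 1
    # Per-class mean representations.
    means = np.stack([features[labels == k].mean(axis=0) for k in range(K)])
    # Center by the global mean and normalize to unit length.
    centered = means - means.mean(axis=0)
    centered /= np.linalg.norm(centered, axis=1, keepdims=True)
    # Cosine-similarity Gram matrix of the class means.
    gram = centered @ centered.T
    # Ideal simplex ETF: ones on the diagonal, -1/(K-1) off the diagonal.
    etf = (K / (K - 1)) * np.eye(K) - 1.0 / (K - 1)
    return np.linalg.norm(gram - etf)
```

A deviation near zero indicates the fully collapsed symmetric state; tracking this quantity over training would expose the transient semantic phase, where block structure in the Gram matrix (higher similarity for classes sharing latent attributes) keeps the deviation large. For reference, one common formulation of the unconstrained features model that the abstract proposes to modify treats the features as free variables optimized jointly with the classifier, e.g. $\min_{W,H}\ \frac{1}{N}\sum_i \mathcal{L}_{\mathrm{CE}}(W h_i,\, y_i) + \frac{\lambda_W}{2}\|W\|_F^2 + \frac{\lambda_H}{2}\|H\|_F^2$; the specific modification studied in the paper is not reproduced here.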
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 133