Probability Signature: Bridging Data Semantics and Embedding Structure in Language Models

ICLR 2026 Conference Submission 796 Authors

02 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: embedding space, data distribution, large language models
Abstract: The embedding space of language models is widely believed to capture semantic relationships; for instance, embeddings of digits often exhibit an ordered structure that corresponds to their natural sequence. However, the mechanisms driving the formation of such structures remain poorly understood. In this work, we interpret embedding structures through token relationships. We propose a set of probability signatures that reflect the semantic relationships among tokens. Through experiments on composite addition tasks with a linear model and a feedforward network, combined with theoretical analysis of gradient flow dynamics, we reveal that these probability signatures significantly influence the embedding structures. We further generalize our analysis to large language models (LLMs). Our results show that the probability signatures are faithfully aligned with the embedding structures, particularly in capturing strong pairwise similarities among embeddings. Our work offers a universal analytical framework for investigating how token relationships shape embedding geometries, enabling researchers to trace how gradient flow propagates token relationships onto the embedding structures of their models.
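The abstract leaves the precise definition of a probability signature to the paper; as a rough illustration of the kind of comparison it describes, the sketch below contrasts pairwise cosine similarities of token embeddings with similarities of conditional next-token distributions. All data here is synthetic and every name (`E`, `P`, `cosine_sim`) is a hypothetical placeholder, not the authors' code.

```python
# Minimal sketch (assumptions, not the paper's method): measure how well a
# candidate "probability signature" similarity aligns with embedding
# similarity for a small vocabulary of digit tokens, using synthetic data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

V, d = 10, 64                      # 10 digit tokens, embedding dimension 64
E = rng.normal(size=(V, d))        # placeholder embedding matrix
P = rng.dirichlet(np.ones(V), V)   # placeholder: row i = p(next token | token i)

def cosine_sim(M):
    """Pairwise cosine similarity between the rows of M."""
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M @ M.T

emb_sim = cosine_sim(E)            # embedding structure
sig_sim = cosine_sim(P)            # one candidate signature: similarity of
                                   # conditional next-token distributions

# Alignment between the two similarity matrices over off-diagonal entries,
# as a crude proxy for the paper's notion of faithful alignment.
iu = np.triu_indices(V, k=1)
rho, _ = spearmanr(emb_sim[iu], sig_sim[iu])
print(f"Spearman correlation of pairwise similarities: {rho:.3f}")
```

With random inputs the correlation hovers near zero; the paper's claim is that, for trained models, signature similarity and embedding similarity correlate strongly, especially on the most similar pairs.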
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 796