Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Andrey Bochkov

Published: 14 Oct 2025, Last Modified: 14 Oct 2025. Accepted by TMLR. CC BY 4.0

Abstract: Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
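
The frozen visual embedding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 3x5 bitmaps in `TOY_GLYPHS`, the helper names, and the unit-length normalization are all hypothetical stand-ins for whatever glyph rendering and preprocessing the paper actually uses. The sketch only shows the general shape of the approach: vectors are computed once from glyph pixels, stored in an immutable table, and looked up without ever being updated.

```python
import math

# Hypothetical 3x5 bitmaps standing in for rendered Unicode glyphs.
# The paper's actual rendering pipeline and resolution are not shown here.
TOY_GLYPHS = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def glyph_to_vector(rows):
    """Flatten a bitmap into a unit-length vector of pixel intensities."""
    pixels = [1.0 if ch == "#" else 0.0 for row in rows for ch in row]
    norm = math.sqrt(sum(p * p for p in pixels))
    return tuple(p / norm for p in pixels)


def build_frozen_table(glyphs):
    """Precompute one vector per character; tuples keep the table immutable."""
    vocab = sorted(glyphs)
    table = tuple(glyph_to_vector(glyphs[ch]) for ch in vocab)
    index = {ch: i for i, ch in enumerate(vocab)}
    return index, table


def embed(text, index, table):
    """Look up the fixed visual vector for each character in the text."""
    return [table[index[ch]] for ch in text]


INDEX, TABLE = build_frozen_table(TOY_GLYPHS)
vectors = embed("LIT", INDEX, TABLE)
```

In a real model, the vectors returned by `embed` would feed the Transformer layers, which remain the only trainable part; the table itself carries visual structure only and no learned semantics.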

Submission Length: Long submission (more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=UgtgTICJ5Y&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions)

Changes Since Last Submission: Added a new paragraph at the end of Section 5.2 to address the concern regarding our baseline's performance. The revision explicitly contextualizes the performance gap by highlighting the ~150x data difference between our controlled 4B token experiment and the 600B+ token pre-training of SOTA models. It also adds a performance comparison to GPT-2 to better situate our proposed model's results.

Supplementary Material: zip

Assigned Action Editor: ~Francisco_J._R._Ruiz1

Submission Number: 5526
