Abstract: The current paradigm of language modeling is a two-stage pipeline that first transforms raw text into token indices and then estimates the distribution over those indices. Tokenization inherently discards linguistic relations between tokens, creating a fundamental gap. To address this, we propose \textbf{TriEmbed}, a reparameterization method for embeddings that incorporates the morphological relationships inherent in subword tokenization algorithms. Specifically, by organizing the vocabulary into a Trie structure, we encode these relations and reparameterize the embeddings, facilitating the recovery of other linguistic relationships during training. Empirical results across various settings demonstrate that TriEmbed outperforms conventional embeddings in terms of scaling behavior while yielding more linguistically informative token embeddings.
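The abstract does not spell out the reparameterization itself, but one plausible reading of "organizing the vocabulary into a Trie structure" is to attach parameters to Trie nodes and compose each token's embedding from the nodes on its path, so tokens sharing morphological prefixes share parameters. The following is a minimal, hypothetical sketch of that idea (class name, path-sum composition, and character-level Trie are assumptions for illustration, not the paper's actual method):

```python
# Hypothetical sketch, NOT the paper's formulation: reparameterize token
# embeddings via a Trie built over the subword vocabulary. Each token's
# embedding is the sum of learnable vectors attached to the nodes on its
# Trie path, so morphologically related tokens share parameters.
import torch
import torch.nn as nn


class TrieEmbeddingSketch(nn.Module):
    def __init__(self, vocab, dim):
        super().__init__()
        # Build the Trie: assign a node id to every distinct prefix of every token.
        node_ids = {"": 0}
        self.paths = []  # for each token, the node ids on its root-to-leaf path
        for token in vocab:
            path, prefix = [0], ""
            for ch in token:
                prefix += ch
                if prefix not in node_ids:
                    node_ids[prefix] = len(node_ids)
                path.append(node_ids[prefix])
            self.paths.append(path)
        # One learnable vector per Trie node.
        self.node_emb = nn.Embedding(len(node_ids), dim)

    def forward(self, token_ids):
        # token_ids: LongTensor of arbitrary shape; returns (*shape, dim).
        embs = []
        for tid in token_ids.reshape(-1).tolist():
            path = torch.tensor(self.paths[tid])
            embs.append(self.node_emb(path).sum(dim=0))
        return torch.stack(embs).reshape(*token_ids.shape, -1)


# Toy usage with a tiny vocabulary of morphologically related tokens.
vocab = ["run", "runs", "running", "walk", "walked"]
emb = TrieEmbeddingSketch(vocab, dim=8)
print(emb(torch.tensor([0, 2, 4])).shape)  # torch.Size([3, 8])
```

Under this reading, "reparameterization" means the per-token embedding matrix is replaced by shared per-node parameters, which is what would allow morphological structure to constrain the embedding space during training.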
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: embeddings, scaling, pre-training
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 1636