Dispersing Embeddings in Transformer Layers Improves Generalization of Language Models

Published: 13 Nov 2025 · Last Modified: 23 Nov 2025 · TAG-DS 2025 FlashTalk · CC BY 4.0
Track: Extended Abstract (non-archival, 4 pages)
Keywords: large language model, small language model, transformer, geometric learning, embedding
TL;DR: We observe a model-size-specific embedding condensation phenomenon and design training objectives that narrow the performance gap between small and large language models.
Abstract: Large language models achieve remarkable performance through ever-increasing parameter counts, but scaling incurs steep computational costs. In this work, we observe a geometric phenomenon we call embedding condensation, in which token representations in some language models collapse into narrow cones as they propagate through transformer layers. Through systematic measurements across multiple transformer families, we show that small models such as GPT2 and ALBERT-base exhibit severe condensation, whereas larger models within the same families, such as GPT2-xl and ALBERT-xxlarge, are more resistant to this phenomenon. This suggests that superior performance might arise from sustained representational diversity. To test this hypothesis, we formulate four losses that explicitly encourage embedding dispersion during training. Experiments demonstrate that these losses mitigate condensation, recover the dispersion patterns seen in larger models, and yield consistent performance gains across 10 benchmarks. We believe this work offers a principled path toward improving smaller transformers without additional parameters.
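To make the condensation measurement concrete, below is a minimal sketch that quantifies layer-wise condensation as the mean pairwise cosine similarity of token hidden states. The metric choice, the `layerwise_condensation` helper, and the GPT2 example are illustrative assumptions for this page, not the paper's exact measurement protocol.

```python
# Hypothetical sketch: measure "embedding condensation" per layer as the mean
# pairwise cosine similarity of token hidden states. Higher values in later
# layers would indicate representations collapsing into a narrow cone.
# NOTE: this metric is an assumption for illustration, not the paper's method.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer


def layerwise_condensation(model_name: str, text: str) -> list[float]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Tuple of (embedding output, layer 1, ..., layer L), each (batch, seq, hidden).
        hidden_states = model(**inputs).hidden_states

    scores = []
    for h in hidden_states:
        tokens = F.normalize(h[0], dim=-1)          # (seq_len, hidden) unit vectors
        sim = tokens @ tokens.T                     # pairwise cosine similarities
        n = sim.size(0)
        off_diag = sim.sum() - sim.diagonal().sum() # exclude self-similarity
        scores.append((off_diag / (n * (n - 1))).item())
    return scores


print(layerwise_condensation("gpt2", "Dispersion keeps token representations from collapsing."))
```

A dispersion-encouraging training objective of the kind the abstract describes could, for instance, add a penalty proportional to this layer-wise similarity; the four specific losses proposed in the paper are not reproduced here.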
Supplementary Material: zip
Submission Number: 43