Track: Extended Abstract (non-archival, 4 pages)
Keywords: large language model, small language model, transformer, geometric learning, embedding
TL;DR: We observe a model-size-specific embedding condensation phenomenon and design training objectives that narrow the performance gap between small and large language models.
Abstract: Large language models achieve remarkable performance through ever-increasing parameter counts, yet scaling imposes steep computational costs. We observe a geometric phenomenon we call embedding condensation, in which token representations collapse into narrow cones as they propagate through smaller models. Through systematic measurements across multiple transformer families, we show that small models such as ALBERT-base and GPT-2 exhibit severe condensation, whereas large models maintain broad embedding dispersion. This suggests that their superior performance partly arises from sustained representational diversity. We formulate four losses that explicitly encourage embedding dispersion during training. Experiments demonstrate that these losses mitigate condensation, recover the dispersion patterns seen in larger models, and yield consistent performance gains across 13 benchmarks, offering a principled path toward improving smaller transformers without additional parameters.
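The abstract does not specify the form of the four dispersion losses; the following is a minimal hypothetical sketch of one plausible variant, a penalty on the mean pairwise cosine similarity of a layer's token representations, which directly counteracts collapse into a narrow cone. The function name `cosine_dispersion_loss` and the weighting shown in the usage comment are illustrative assumptions, not the authors' method.

```python
# Hypothetical dispersion-encouraging loss (illustrative only; not the paper's
# actual objectives). It penalizes the average pairwise cosine similarity among
# token representations so they spread out instead of condensing into a cone.
import torch
import torch.nn.functional as F


def cosine_dispersion_loss(hidden: torch.Tensor) -> torch.Tensor:
    """hidden: (batch, seq_len, dim) token representations from one layer."""
    x = F.normalize(hidden, dim=-1)                 # unit-norm embeddings
    sim = x @ x.transpose(-1, -2)                   # (batch, seq, seq) cosine similarities
    seq = sim.size(-1)
    mask = ~torch.eye(seq, dtype=torch.bool, device=sim.device)
    return sim.masked_select(mask).mean()           # average off-diagonal similarity


# Example usage: add to the task loss with a small weight during training.
# total_loss = task_loss + 0.1 * cosine_dispersion_loss(hidden_states)
```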
Submission Number: 43