Abstract: Transformer models become inefficient when scaling hidden dimensions because parameters expand uniformly across all dimensions. Examining the sparsity of hidden dimensions, we observe that only a small subset of dimensions is highly activated: some dimensions are commonly activated across tokens, while others are uniquely activated for individual tokens. To leverage this, we propose MoHD (Mixture of Hidden Dimensions), a sparse architecture that combines shared sub-dimensions for common features with sub-dimensions dynamically routed per token for specialized features. To address the potential information loss from sparsification, we introduce activation scaling and group fusion mechanisms. MoHD efficiently expands hidden dimensions with only a minimal increase in computation, outperforming vanilla Transformers in both parameter efficiency and task performance across 10 NLP tasks. MoHD achieves 1.7% higher performance with 50% fewer activated parameters, and 3.7% higher performance with a 3× expansion of total parameters at constant activated-parameter cost. MoHD offers a new perspective on model scaling, showcasing the potential of hidden-dimension sparsity.
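To make the architecture described above concrete, here is a minimal sketch of how a mixture-of-hidden-dimensions layer could be wired in PyTorch. It is an illustration under stated assumptions, not the paper's implementation: the split of the hidden size into sub-dimension groups, the top-k router over specialized groups, the activation-scaling factor, and the group-fusion projection (`MoHDLayer`, `n_groups`, `n_shared`, `top_k`) are all hypothetical stand-ins for the mechanisms named in the abstract.

```python
# Illustrative sketch only: the group sizes, router, scaling, and fusion
# below are assumptions, not the authors' released MoHD implementation.
import torch
import torch.nn as nn


class MoHDLayer(nn.Module):
    def __init__(self, hidden: int, n_groups: int = 8, n_shared: int = 2, top_k: int = 2):
        super().__init__()
        assert hidden % n_groups == 0
        self.group_dim = hidden // n_groups          # size of each sub-dimension group
        self.n_groups = n_groups
        self.n_shared = n_shared                     # groups that are always active (shared)
        self.top_k = top_k                           # specialized groups routed per token
        self.router = nn.Linear(hidden, n_groups - n_shared)  # scores specialized groups
        self.proj = nn.Linear(hidden, hidden)        # per-group transform (computed densely here)
        self.fuse = nn.Linear(hidden, hidden)        # group-fusion projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, h = x.shape
        groups = self.proj(x).view(b, s, self.n_groups, self.group_dim)

        # Shared sub-dimensions: kept for every token.
        mask = x.new_zeros(b, s, self.n_groups)
        mask[..., : self.n_shared] = 1.0

        # Specialized sub-dimensions: route top-k groups per token.
        scores = self.router(x)                                   # (b, s, n_groups - n_shared)
        topk = scores.topk(self.top_k, dim=-1).indices
        mask[..., self.n_shared :] = torch.zeros_like(scores).scatter(-1, topk, 1.0)

        # Activation scaling: compensate for the zeroed-out groups so the
        # expected output magnitude stays close to that of a dense layer.
        scale = self.n_groups / (self.n_shared + self.top_k)
        out = groups * mask.unsqueeze(-1) * scale

        # Group fusion: mix the surviving sub-dimensions back together.
        return self.fuse(out.reshape(b, s, h))
```

For example, `MoHDLayer(1024)(torch.randn(2, 16, 1024))` returns a tensor of the same shape, with only the shared and routed sub-dimension groups contributing to each token's output.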
Lay Summary: Transformer models are powerful tools used widely in tasks like translation, text generation, and understanding language. However, making these models bigger often means adding many unnecessary parameters, because their internal structure is enlarged uniformly. We discovered that Transformers don't actually need all of their internal "dimensions" (think of these as pathways for information). Instead, only some dimensions are regularly active: some are used by many words, while others matter specifically for certain words.
Building on this observation, we developed a method called Mixture of Hidden Dimensions (MoHD). MoHD selects which dimensions to activate: it shares some dimensions across many words and dynamically assigns specialized dimensions to specific words. To avoid losing important information when fewer dimensions are activated, MoHD uses techniques to rescale signals and merge dimensions effectively. Our experiments show that MoHD significantly improves model performance and efficiency, achieving better results while using far fewer active parameters. This makes Transformer models faster and more powerful, demonstrating a smarter way to scale up language technology.
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Transformer, Efficient Model, Conditional Activation, Hidden-dimension Sparsity
Submission Number: 4884