Keywords: early exit, weight initialization, efficient, class-aware, class means, pre-training, LLMs
TL;DR: We propose a novel class-aware weight initialization technique for early-exit large language models to accelerate pre-training.
Abstract: We propose a novel class-aware weight initialization technique for early-exit large language models, with the aim of accelerating pre-training. Our design exploits the neural collapse phenomenon together with a Gaussian mixture model of the distribution of feature vectors at a given layer. Specifically, we compute the class-wise means of token representations at the early exit point and use the resulting vectors, together with class probabilities, to initialize the early-exit vectors. The next-token prediction accuracy of our class-aware initialization is up to five times higher than that of other baselines at epoch zero, and it matches or surpasses them in later epochs throughout pre-training.
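Illustratively, the class-aware initialization described in the abstract might look like the following PyTorch sketch. The function name, the use of empirical next-token frequencies as class priors, and the direct mapping of class means onto the rows of a linear early-exit head are our assumptions for the example, not the paper's exact procedure.

```python
import torch

def class_aware_init(hidden_states, token_ids, vocab_size, eps=1e-8):
    """Sketch: initialize an early-exit head from class (token) means.

    hidden_states: (num_tokens, d_model) features at the early exit layer
    token_ids:     (num_tokens,) next-token labels for each position
    Returns (weight, bias) for a linear early-exit head with shapes
    (vocab_size, d_model) and (vocab_size,).
    """
    d_model = hidden_states.shape[1]
    weight = torch.zeros(vocab_size, d_model)
    counts = torch.zeros(vocab_size)

    # Accumulate per-class feature sums and counts to form class means.
    weight.index_add_(0, token_ids, hidden_states)
    counts.index_add_(0, token_ids,
                      torch.ones(token_ids.shape[0]))

    means = weight / counts.clamp(min=1).unsqueeze(1)

    # Bias from empirical class probabilities (log priors), consistent
    # with a Gaussian mixture view of the feature distribution.
    probs = counts / counts.sum()
    bias = torch.log(probs + eps)

    return means, bias
```

Under these assumptions, a new early-exit head would be built as `torch.nn.Linear(d_model, vocab_size)` with the returned `means` and `bias` copied into its `weight` and `bias` parameters before pre-training begins.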
Submission Number: 36