Keywords: Pre-training, Large Language Models, Foundation Models, Parameter-Efficient Training
TL;DR: A novel parameter-efficient pre-training method that achieves similar performance to a fully parameterized model with significantly fewer trainable parameters.
Abstract: Pre-trained foundation models have achieved remarkable generalization across a wide spectrum of downstream tasks. However, as models scale in size, pre-training becomes prohibitively expensive. In this work, we introduce Bi-Phase Training (BPT), a novel parameter-efficient pre-training method designed to capture the expressiveness of fully parameterized models while drastically reducing the number of trainable parameters. BPT achieves this by combining constrained high-rank transformations using diagonal matrices with exploration of lower-dimensional subspaces through low-rank matrices, facilitating effective optimization within a reduced parameter space. We empirically demonstrate the effectiveness of BPT across various model scales, showing that it matches the performance of standard pre-training on language models while substantially reducing the number of trainable parameters, including a 66\% reduction for a 1.5B-parameter model. Furthermore, we conduct a comprehensive evaluation on 17 diverse downstream tasks, confirming that models trained with BPT maintain performance comparable to those trained with standard, fully parameterized pre-training.
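The abstract does not spell out how the diagonal and low-rank components are combined, so the following is only a minimal sketch of one possible reading: a linear layer whose frozen dense weight is modulated by a trainable diagonal scale (constrained but full-rank) plus a trainable low-rank update. The class name `BPTLinear`, the rank value, and the frozen-base assumption are illustrative choices, not details from the paper.

```python
import torch
import torch.nn as nn


class BPTLinear(nn.Module):
    """Hypothetical layer combining a trainable diagonal (constrained high-rank)
    transform with a trainable low-rank update over a frozen dense weight.
    This is an assumed parameterization, not the paper's exact formulation."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        # Frozen dense weight: carries most parameters but is not trained.
        self.weight = nn.Parameter(
            torch.randn(d_out, d_in) * d_in ** -0.5, requires_grad=False
        )
        # Diagonal scaling: d_out trainable parameters, full rank but constrained.
        self.diag = nn.Parameter(torch.ones(d_out))
        # Low-rank factors: rank * (d_in + d_out) trainable parameters.
        self.A = nn.Parameter(torch.randn(rank, d_in) * rank ** -0.5)
        self.B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.t()                 # frozen dense transform
        scaled = self.diag * base                  # constrained high-rank (diagonal) part
        low_rank = (x @ self.A.t()) @ self.B.t()   # low-rank subspace exploration
        return scaled + low_rank


# Usage: compare trainable parameters against a fully trained dense layer.
layer = BPTLinear(1024, 1024, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # far fewer than 1024 * 1024
```

Under this assumed design, the trainable-parameter count grows roughly linearly in the layer width rather than quadratically, which is consistent with the kind of reduction the abstract reports, though the actual BPT construction may differ.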
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13634