Keywords: Pre-training, Large Language Models, Foundation Models, Parameter-Efficient Training
TL;DR: A novel parameter-efficient pre-training method that achieves similar performance to a fully parameterized model with significantly fewer trainable parameters.
Abstract: Pre-trained foundation models have achieved remarkable generalization across a wide spectrum of downstream tasks. However, as models scale in size, pre-training becomes prohibitively expensive. In this work, we introduce Bi-Phase Training (BPT), a novel parameter-efficient pre-training method designed to capture the expressiveness of fully parameterized models while drastically reducing the number of trainable parameters. BPT achieves this by combining constrained high-rank transformations using diagonal matrices with exploration of lower-dimensional subspaces through low-rank matrices, facilitating effective optimization within a reduced parameter space. We empirically demonstrate the effectiveness of BPT across various model scales, showing that it matches the performance of standard pre-training on language models while substantially reducing the number of trainable parameters, including a 66\% reduction for a 1.5B-parameter model. Furthermore, we conduct a comprehensive evaluation on 17 diverse downstream tasks, confirming that models trained with BPT maintain performance comparable to those trained with standard, fully parameterized pre-training.
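The abstract does not spell out how the diagonal and low-rank components are combined, so the following is only a minimal sketch of one possible reading: a linear layer whose frozen dense weight is modulated by a trainable diagonal scale (constrained but full-rank) plus a trainable low-rank update. The class name `BPTLinear`, the rank value, and the frozen-base assumption are illustrative choices, not details from the paper.

```python
import torch
import torch.nn as nn


class BPTLinear(nn.Module):
    """Hypothetical layer combining a trainable diagonal (constrained high-rank)
    transform with a trainable low-rank update over a frozen dense weight.
    This is an assumed parameterization, not the paper's exact formulation."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        # Frozen dense weight: carries most parameters but is not trained.
        self.weight = nn.Parameter(
            torch.randn(d_out, d_in) * d_in ** -0.5, requires_grad=False
        )
        # Diagonal scaling: d_out trainable parameters, full rank but constrained.
        self.diag = nn.Parameter(torch.ones(d_out))
        # Low-rank factors: rank * (d_in + d_out) trainable parameters.
        self.A = nn.Parameter(torch.randn(rank, d_in) * rank ** -0.5)
        self.B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.t()                 # frozen dense transform
        scaled = self.diag * base                  # constrained high-rank (diagonal) part
        low_rank = (x @ self.A.t()) @ self.B.t()   # low-rank subspace exploration
        return scaled + low_rank


# Usage: compare trainable parameters against a fully trained dense layer.
layer = BPTLinear(1024, 1024, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # far fewer than 1024 * 1024
```

Under this assumed design, the trainable-parameter count grows roughly linearly in the layer width rather than quadratically, which is consistent with the kind of reduction the abstract reports, though the actual BPT construction may differ.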
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13634