Abstract: Large-scale transformer models have shown remarkable
performance in language modelling
tasks. However, such models contain billions
of parameters, which complicates their
deployment and makes training from scratch
prohibitively expensive. To reduce the number of parameters in
the GPT-2 (Radford et al., 2019) architecture,
we replace the matrices of fully-connected layers
with the corresponding Tensor Train Matrix
(TTM) (Oseledets, 2010) structure. Finally,
we customize forward and backward operations
through the TTM-based layers for simplicity and
stability of further training. The resulting
GPT-2-based model stores up to 40% fewer parameters
while achieving perplexity comparable to that of
the original model. On downstream tasks,
including language understanding and text summarization,
the model performs similarly to
the original GPT-2 model. The proposed tensorized
layers can be used to efficiently pretrain
other Transformer models.
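
The abstract describes the idea only in words; the snippet below is a minimal PyTorch sketch of what a TTM-parametrized fully-connected layer can look like. The class name `TTMLinear`, the mode factorizations, the TT ranks, the initialization scale, and the reconstruct-then-multiply forward pass are illustrative assumptions, not the authors' customized forward/backward implementation.

```python
import math
import torch
import torch.nn as nn


class TTMLinear(nn.Module):
    """Linear layer whose weight matrix is stored in Tensor Train Matrix
    (TTM) format: a chain of small 4D cores instead of one dense matrix.
    Illustrative sketch; shapes, init, and the forward pass are assumptions,
    not the paper's customized implementation."""

    def __init__(self, in_modes, out_modes, ranks):
        # in_modes/out_modes factorize the dense dimensions (e.g. 768 = 4*12*16);
        # ranks are the TT ranks [1, r_1, ..., r_{d-1}, 1].
        super().__init__()
        assert len(in_modes) == len(out_modes) == len(ranks) - 1
        self.in_features = math.prod(in_modes)
        self.out_features = math.prod(out_modes)
        self.cores = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(ranks[k], in_modes[k],
                                             out_modes[k], ranks[k + 1]))
             for k in range(len(in_modes))]
        )
        self.bias = nn.Parameter(torch.zeros(self.out_features))

    def full_weight(self):
        """Contract the cores into the dense (in_features, out_features) matrix."""
        c0 = self.cores[0]                      # (1, n_1, m_1, r_1)
        w = c0.reshape(c0.shape[1], c0.shape[2], c0.shape[3])
        for core in self.cores[1:]:             # core: (r_prev, n_k, m_k, r_k)
            _, n_k, m_k, r_k = core.shape
            n_acc, m_acc = w.shape[0], w.shape[1]
            # contract over the shared rank dim -> (n_acc, m_acc, n_k, m_k, r_k)
            w = torch.tensordot(w, core, dims=([2], [0]))
            w = w.permute(0, 2, 1, 3, 4).reshape(n_acc * n_k, m_acc * m_k, r_k)
        return w.squeeze(-1)                    # final TT rank is 1

    def forward(self, x):
        # For clarity the dense weight is rebuilt here; the paper instead
        # customizes the forward and backward passes to use the cores directly.
        return x @ self.full_weight() + self.bias


# Hypothetical drop-in for a GPT-2 MLP projection (768 -> 3072):
layer = TTMLinear(in_modes=[4, 12, 16], out_modes=[8, 12, 32],
                  ranks=[1, 16, 16, 1])
y = layer(torch.randn(2, 768))                  # -> shape (2, 3072)
```

With these illustrative factorizations and TT ranks of 16, the three cores hold about 46K parameters versus roughly 2.4M for a dense 768x3072 matrix; per-layer savings of this kind are what the reported up-to-40% model-level reduction builds on, since not every weight matrix is tensorized.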