Abstract: Large-scale transformer models have shown remarkable
performance in language modelling
tasks. However, such models contain billions
of parameters, which complicates their
deployment and makes training from scratch
prohibitively expensive. To reduce the number of parameters in
the GPT-2 (Radford et al., 2019) architecture,
we replace the matrices of fully-connected layers
with the corresponding Tensor Train Matrix
(TTM) (Oseledets, 2010) structure. Finally,
we customize forward and backward operations
through the TTM-based layers for simplicity and
stability of further training. The resulting
GPT-2-based model stores up to 40% fewer parameters
while achieving perplexity comparable to that of
the original model. On downstream tasks,
including language understanding and text summarization,
the model performs similarly to
the original GPT-2 model. The proposed tensorized
layers can be used to efficiently pretrain
other Transformer models.
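
The abstract describes the idea only in words; the snippet below is a minimal PyTorch sketch of what a TTM-parametrized fully-connected layer can look like. The class name `TTMLinear`, the mode factorizations, the TT ranks, the initialization scale, and the reconstruct-then-multiply forward pass are illustrative assumptions, not the authors' customized forward/backward implementation.

```python
import math
import torch
import torch.nn as nn


class TTMLinear(nn.Module):
    """Linear layer whose weight matrix is stored in Tensor Train Matrix
    (TTM) format: a chain of small 4D cores instead of one dense matrix.
    Illustrative sketch; shapes, init, and the forward pass are assumptions,
    not the paper's customized implementation."""

    def __init__(self, in_modes, out_modes, ranks):
        # in_modes/out_modes factorize the dense dimensions (e.g. 768 = 4*12*16);
        # ranks are the TT ranks [1, r_1, ..., r_{d-1}, 1].
        super().__init__()
        assert len(in_modes) == len(out_modes) == len(ranks) - 1
        self.in_features = math.prod(in_modes)
        self.out_features = math.prod(out_modes)
        self.cores = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(ranks[k], in_modes[k],
                                             out_modes[k], ranks[k + 1]))
             for k in range(len(in_modes))]
        )
        self.bias = nn.Parameter(torch.zeros(self.out_features))

    def full_weight(self):
        """Contract the cores into the dense (in_features, out_features) matrix."""
        c0 = self.cores[0]                      # (1, n_1, m_1, r_1)
        w = c0.reshape(c0.shape[1], c0.shape[2], c0.shape[3])
        for core in self.cores[1:]:             # core: (r_prev, n_k, m_k, r_k)
            _, n_k, m_k, r_k = core.shape
            n_acc, m_acc = w.shape[0], w.shape[1]
            # contract over the shared rank dim -> (n_acc, m_acc, n_k, m_k, r_k)
            w = torch.tensordot(w, core, dims=([2], [0]))
            w = w.permute(0, 2, 1, 3, 4).reshape(n_acc * n_k, m_acc * m_k, r_k)
        return w.squeeze(-1)                    # final TT rank is 1

    def forward(self, x):
        # For clarity the dense weight is rebuilt here; the paper instead
        # customizes the forward and backward passes to use the cores directly.
        return x @ self.full_weight() + self.bias


# Hypothetical drop-in for a GPT-2 MLP projection (768 -> 3072):
layer = TTMLinear(in_modes=[4, 12, 16], out_modes=[8, 12, 32],
                  ranks=[1, 16, 16, 1])
y = layer(torch.randn(2, 768))                  # -> shape (2, 3072)
```

With these illustrative factorizations and TT ranks of 16, the three cores hold about 46K parameters versus roughly 2.4M for a dense 768x3072 matrix; per-layer savings of this kind are what the reported up-to-40% model-level reduction builds on, since not every weight matrix is tensorized.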