Keywords: large language model, structured matrices, efficient architecture
Abstract: State-of-the-art results in large language models (LLMs) often rely on scale, which
becomes computationally expensive. This has sparked a research agenda to reduce
these models’ parameter counts and computational costs without significantly
impacting their performance. Our study focuses on transformer-based LLMs,
specifically targeting the computationally intensive feedforward networks (FFNs),
which are less studied than attention blocks. We consider three structured linear
parameterizations of the FFN using efficient low-rank and block-diagonal matrices.
In contrast to many previous works that examined these approximations, our study
i) explores these structures from a training-from-scratch perspective, ii) scales up
to 1.3B parameters, and iii) is conducted within recent transformer-based LLMs
rather than convolutional architectures. We demonstrate that these structures can
lead to actual computational gains in various scenarios, including online decoding
when using a pre-merge technique. Additionally, we propose a novel training
regime, called self-guided training, aimed at improving the poor training dynamics
that these approximations exhibit when used from initialization. We also examine
how structured matrices scale, observing steeper loss curves as training FLOPs
increase, along with a favorable scaling trend in the
overtraining regime. Specifically, we show that wide and structured networks
can utilize training FLOPs more efficiently, with fewer parameters and lower
loss than dense models at their optimal trade-off. Our code is available at
https://github.com/CLAIRE-Labo/StructuredFFN/tree/main.
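To make the idea of structured FFN parameterizations concrete, below is a minimal illustrative sketch (not the paper's implementation; class names, ranks, and block counts are hypothetical) of how a low-rank and a block-diagonal linear layer can stand in for the dense projections of a transformer FFN:

```python
# Illustrative sketch only: structured drop-in replacements for dense FFN projections.
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Approximates W with U @ V, where rank r << min(d_in, d_out)."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.V = nn.Linear(d_in, rank, bias=False)   # project d_in -> r
        self.U = nn.Linear(rank, d_out, bias=True)   # project r -> d_out

    def forward(self, x):
        return self.U(self.V(x))

class BlockDiagonalLinear(nn.Module):
    """Block-diagonal weight: each input chunk gets its own small dense map."""
    def __init__(self, d_in, d_out, blocks):
        super().__init__()
        assert d_in % blocks == 0 and d_out % blocks == 0
        self.blocks = blocks
        self.maps = nn.ModuleList(
            nn.Linear(d_in // blocks, d_out // blocks) for _ in range(blocks)
        )

    def forward(self, x):
        chunks = x.chunk(self.blocks, dim=-1)
        return torch.cat([m(c) for m, c in zip(self.maps, chunks)], dim=-1)

# Example: a GPT-style FFN with structured projections (hypothetical sizes).
d_model, d_ff = 1024, 4096
ffn = nn.Sequential(
    LowRankLinear(d_model, d_ff, rank=256),
    nn.GELU(),
    BlockDiagonalLinear(d_ff, d_model, blocks=4),
)
x = torch.randn(2, 16, d_model)   # (batch, sequence, hidden)
print(ffn(x).shape)               # torch.Size([2, 16, 1024])
```

Both layers reduce parameter count and matrix-multiply FLOPs relative to a dense `nn.Linear(d_in, d_out)`; the rank and block count control the trade-off between compression and expressivity.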
Supplementary Material: zip
Primary Area: Deep learning architectures
Submission Number: 19102