Towards smaller language models via layer looping

Published: 21 Jun 2024, Last Modified: 26 Jul 2024
ES-FoMo-II 2024 Poster
License: CC BY 4.0
Keywords: LLMs, Compression, Looped Models
TL;DR: We show that looping the layers of a language model can enable more parameter-efficient models.
Abstract: Language models store a huge amount of knowledge in their parameters. This dominant architecture bears little resemblance to the implementations of optimized data stores (e.g. a database management system like PostgreSQL), which raises the question: are there other architectures that can store and query the same information more efficiently? In this work, we explore two simple modifications to the standard architecture: looping --- sharing parameters across layers --- and mixture-of-experts (MoE). We compare the space complexity of standard and looped-MoE models on a simple task where the model must memorize a knowledge graph (KG) and answer multi-hop queries over it. We prove that the looped-MoE model can store a KG of size $T$ and answer $q$-hop queries with $\mathcal{O}(T)$ parameters. In contrast, the best known upper bound for the standard model is $\mathcal{O}(qT)$ parameters. We confirm this scaling with experiments on synthetic KGs, finding that looped-MoE models can reliably answer four-hop queries over KGs that are $9\times$ larger than parameter-matched standard models can handle.
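To make the looping idea concrete, below is a minimal sketch (not the authors' code) of a transformer block whose parameters are reused across iterations, so effective depth grows without adding parameters; the MoE component (per-token expert routing in the MLP) is omitted for brevity. All names and hyperparameters (`LoopedBlock`, `n_loops`, `d_model`, `n_heads`) are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of "layer looping": one transformer block applied
# repeatedly with shared weights, in contrast to a standard model
# that stacks n_loops distinct layers. Hyperparameters are illustrative.
import torch
import torch.nn as nn


class LoopedBlock(nn.Module):
    """A pre-norm transformer block applied n_loops times with shared parameters."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_loops: int = 4):
        super().__init__()
        self.n_loops = n_loops
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The same parameters are reused every iteration; a q-hop query
        # can be processed by looping q (or more) times without growing
        # the parameter count.
        for _ in range(self.n_loops):
            h = self.ln1(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out
            x = x + self.mlp(self.ln2(x))
        return x


if __name__ == "__main__":
    x = torch.randn(2, 16, 256)  # (batch, sequence length, d_model)
    block = LoopedBlock()
    print(block(x).shape)  # torch.Size([2, 16, 256])
```

In the paper's looped-MoE variant, the shared MLP above would be replaced by a mixture-of-experts layer, letting the looped block route different tokens to different experts while still sharing all parameters across loop iterations.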
Submission Number: 82