An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models

Published: 09 Oct 2024 · Last Modified: 19 Nov 2024 · Compression Workshop @ NeurIPS 2024 · CC BY 4.0
Keywords: Language models, scaling law, emergence, plateauing, Low-Density Parity Check codes, sequential learning, composition of skills.
TL;DR: We present a simple unified graph framework that explains compute-optimal size scaling, emergent capabilities, and performance plateauing, using tools from iterative decoding in information theory and from random network theory.
Abstract: Recent empirical studies show three phenomena as language models increase in size: compute-optimal size scaling, emergent capabilities, and performance plateauing. We present a simple unified mathematical framework that explains all three of these scaling phenomena, building on recent skill-text bipartite graph frameworks for semantic learning. Modeling the learning of concepts from texts as an iterative process yields an analogy to iterative decoding of low-density parity check (LDPC) codes in information theory. Drawing on finite-size scaling characterizations of LDPC decoding, we then derive the compute-optimal size scaling (Chinchilla rule) for language models. Further, using tools from random network theory, we provide a simple explanation for both the emergence of complex skills and the plateauing of performance as the size of language models scales. In particular, we observe multiple plateaus.
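To make the LDPC-decoding analogy concrete, below is a minimal, illustrative Python sketch (not the paper's actual model; all function names, graph parameters, and thresholds are assumptions chosen for illustration). It builds a random skill-text bipartite graph and runs a peeling-style iteration in which a text "teaches" its one remaining unknown skill once all of its other skills are known, mirroring how a parity check recovers a single erased bit in iterative erasure decoding. Sweeping the number of texts relative to the number of skills exhibits the kind of sharp threshold behavior associated with emergence.

```python
import random


def simulate_peeling(num_texts=2000, num_skills=1000, skills_per_text=3,
                     frac_initially_known=0.3, seed=0):
    """Toy peeling iteration on a random skill-text bipartite graph.

    Analogy (illustrative only): texts play the role of LDPC check nodes,
    skills play the role of variable nodes, and a text teaches a missing
    skill once all of its other skills are already known, just as a parity
    check recovers a single remaining erased bit.
    """
    rng = random.Random(seed)
    # Each text (check node) connects to a few skills (variable nodes).
    texts = [rng.sample(range(num_skills), skills_per_text)
             for _ in range(num_texts)]
    # Some skills are assumed known at the start (non-erased bits).
    known = set(rng.sample(range(num_skills),
                           int(frac_initially_known * num_skills)))

    changed = True
    while changed:  # iterate until no text teaches anything new
        changed = False
        for skill_list in texts:
            unknown = [s for s in skill_list if s not in known]
            if len(unknown) == 1:  # exactly one missing skill -> learnable
                known.add(unknown[0])
                changed = True
    return len(known) / num_skills


if __name__ == "__main__":
    # Sweeping the text-to-skill ratio shows a sharp jump in the fraction
    # of skills learned, a toy analogue of emergence.
    for num_texts in (500, 1000, 2000, 4000, 8000):
        frac = simulate_peeling(num_texts=num_texts)
        print(f"texts={num_texts:5d}  fraction of skills learned={frac:.2f}")
```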
Submission Number: 79