Transformers are Universal Predictors

Published: 11 Jul 2023, Last Modified: 15 Jul 2023
Venue: NCW ICML 2023
Keywords: language models, transformer architecture, universal prediction
TL;DR: Transformers are universal predictors in the information-theoretic sense.
Abstract: We find limits to the Transformer architecture for language modeling and show that it has a universal prediction property in an information-theoretic sense. We further analyze its performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.
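For context, universal prediction is usually formalized as vanishing per-symbol log-loss regret against a reference class of predictors. The following is a generic sketch of that standard notion; the predictor $q$, reference class $\mathcal{P}$, and horizon $n$ are illustrative assumptions, not the paper's exact statement:

\[
\frac{1}{n}\left(\sum_{t=1}^{n} -\log q(x_t \mid x_{<t}) \;-\; \inf_{p \in \mathcal{P}} \sum_{t=1}^{n} -\log p(x_t \mid x_{<t})\right) \;\longrightarrow\; 0 \quad \text{as } n \to \infty,
\]

i.e., the predictor's average code length (cumulative log-loss per symbol) approaches that of the best reference predictor in hindsight.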
Submission Number: 34