MLPs for NLP: Towards Discovering Inductive Bias From Scratch

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Language Models, Architecture Design, MLP, Inductive Bias, Scaling Laws
Abstract: The recent rise of large language models has been fueled by scale: more data, more compute, and bigger models have consistently led to better performance. This scaling paradigm has been applied most notably to the transformer architecture, which is especially conducive to training parallelization and sequence modeling. In this work, we ask what happens if we apply the power of scale to the simplest possible class of neural networks: the multi-layer perceptron (MLP). Specifically, we train MLPs to perform next-token prediction on billions of tokens of text. Their performance does improve consistently with scale, though vanilla MLPs remain clearly inferior to transformers on this task, in particular because their parameter count grows with the length of the input sequence. We then perform a mechanistic analysis of the trained models and identify a consistent emergent structure: most neurons in the first hidden layer either compute arbitrary linear functions over a small look-back window or compute low-frequency functions over the entire context. These neuron types recall the $n$-gram and bag-of-words techniques of classical statistical language modeling. Using the discrete cosine transform, we define a unified reparameterization of both neuron types in which the number of parameters per neuron does not depend on the sequence length.
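The DCT reparameterization described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the rank-1 split of a neuron's weights into a positional profile and an embedding-dimension vector, and all names here (`dct_basis`, `coeffs`, `K`, `T`, `d`), are illustrative assumptions. The point it demonstrates is only that a neuron's weight over token positions can be expressed with a fixed number K of low-frequency DCT coefficients, so the per-neuron parameter count is independent of the context length T.

```python
# Hedged sketch (assumed parameterization, not the paper's implementation):
# a first-layer MLP neuron whose positional weight profile is spanned by
# K low-frequency DCT components, so its parameter count does not grow with T.
import numpy as np

def dct_basis(T: int, K: int) -> np.ndarray:
    """Orthonormal DCT-II basis: the K lowest-frequency components over T positions, shape (K, T)."""
    n = np.arange(T)
    basis = np.array([np.cos(np.pi * k * (2 * n + 1) / (2 * T)) for k in range(K)])
    basis[0] *= 1.0 / np.sqrt(T)
    basis[1:] *= np.sqrt(2.0 / T)
    return basis

T, K, d = 512, 8, 64                    # context length, DCT coefficients per neuron, embedding dim
coeffs = np.random.randn(K) * 0.02      # learned positional coefficients (K parameters, independent of T)
w_embed = np.random.randn(d) * 0.02     # learned weights over embedding dimensions

pos_profile = coeffs @ dct_basis(T, K)  # (T,) smooth, low-frequency weighting over positions
W = np.outer(pos_profile, w_embed)      # (T, d) full first-layer weight, reconstructed on the fly

x = np.random.randn(T, d)               # flattened sequence of token embeddings
pre_activation = np.sum(W * x)          # the neuron's pre-activation over the whole context
```

With only a few coefficients this profile is smooth over the whole context, matching the bag-of-words-like neurons; the window-local, $n$-gram-like neurons would presumably require additional or higher-frequency components, which is where a single shared DCT basis for both types would pay off.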
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11653