The Impact of Depth and Width on Transformer Language Model Generalization

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: transformers, generalization, compositional, depth, width, layers
TL;DR: When controlling for total parameter count, deeper transformer language models generalize better than wider ones, up to a point.
Abstract: Transformer language models tend to perform better the more parameters they have. However, previous theoretical and empirical work suggests that the total number of parameters is not the only relevant factor; rather, expressivity and out-of-distribution generalization may benefit more from increasing depth than from increasing width. To test this hypothesis, we disentangle depth from the number of parameters by constructing families of models that trade off depth for width while keeping the total number of parameters constant. We pretrain those models and evaluate them on both language modeling and compositional generalization tasks. We report three main conclusions: (1) within each family, deeper models show better language modeling performance, but the relative benefit of additional layers diminishes rapidly; (2) when fine-tuned on compositional generalization tasks, deeper models generalize better out-of-distribution than shallower models do, but returns are similarly diminishing; (3) the benefits of depth for generalization cannot be attributed solely to better performance on language modeling or on in-distribution data. These results replicate in three different model families (41M, 134M, and 374M parameters), suggesting that depth improves performance across model sizes.
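To illustrate how depth can be traded off against width at a fixed parameter budget, below is a minimal sketch (not the authors' code). It assumes the standard Transformer block parameter count of roughly 12 * d_model^2 per layer (attention plus a feed-forward block with 4x expansion); the paper's exact construction, budgets, and rounding rules may differ.

import math

def width_for_depth(num_layers: int, param_budget: int) -> int:
    """Return a d_model that keeps ~param_budget non-embedding parameters across num_layers."""
    # Solve 12 * num_layers * d_model^2 ~= param_budget for d_model.
    d_model = math.sqrt(param_budget / (12 * num_layers))
    # Round to a multiple of 64 so the width divides evenly into attention heads (a common convention).
    return max(64, int(round(d_model / 64)) * 64)

if __name__ == "__main__":
    budget = 134_000_000  # e.g., the 134M-parameter family mentioned in the abstract
    for depth in (2, 4, 8, 16, 32):
        width = width_for_depth(depth, budget)
        approx_params = 12 * depth * width ** 2
        print(f"depth={depth:2d}  width={width:4d}  ~params={approx_params / 1e6:.0f}M")

Each model in such a family has (approximately) the same total parameter count, so any performance differences can be attributed to the depth-versus-width allocation rather than to model size.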
Supplementary Material: pdf
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6715