How does Architecture Influence the Base Capabilities of Pre-trained Language Models? A Case Study Based on Wider Transformer Models

24 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: Pre-trained Language Models, Base Capabilities
Abstract: Pre-trained language models have demonstrated robust base capabilities: they not only perform well at in-distribution language modeling but also show strong abilities in out-of-distribution language modeling, transfer learning, and few-shot learning. Based on the fundamental machine learning principle of inductive bias, a model's architecture is a significant factor affecting its capabilities. However, how architecture influences the base capabilities of pre-trained language models remains under-explored. This research starts from the observation that the base capabilities of FFN-wider Transformers are diminished relative to vanilla Transformers, and we aim to elucidate how this particular architectural modification impacts base capabilities. Our findings indicate that widening the FFN reduces the contribution of the combinatorial function, i.e., the multi-head attention layer, to pre-trained language modeling, which may weaken the architecture's expression of the linguistic compositionality prior. We therefore postulate that this is the central cause of the observed gap in base capabilities. To substantiate this hypothesis, we modified the architecture so that a certain proportion of the wider FFN's parameters is specifically allocated to enhancing the combinatorial function. As this ratio was incrementally increased, the base capabilities of the wider Transformers improved consistently, ultimately approaching those of the vanilla Transformers, providing substantial evidence for our hypothesis. Moreover, applying this insight to models with the Mixture-of-Experts (MoE) architecture, which also exhibit declines in base capabilities, yielded notable improvements in base capabilities.
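For intuition, the PyTorch sketch below shows one possible way to realize such a reallocation: a Transformer block whose FFN is widened by a factor, with a tunable fraction of the extra width attached to the attention output so those parameters serve the combination path rather than the FFN. The block, the `alloc_ratio` knob, and the placement of the extra capacity are illustrative assumptions, not necessarily the exact construction used in this submission.

```python
# Hypothetical sketch (assumed design, not the submission's exact method):
# an FFN-wider Transformer block where a fraction `alloc_ratio` of the extra
# FFN width is redirected to a small MLP on the attention output, so that
# more parameters contribute to the combination function (multi-head attention).

import torch
import torch.nn as nn


class AdjustableWideBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, widen_factor=8, alloc_ratio=0.0):
        super().__init__()
        base_ffn = 4 * d_model                   # vanilla FFN hidden size
        extra = (widen_factor - 4) * d_model     # extra width from "FFN-wider"
        attn_extra = int(alloc_ratio * extra)    # width moved to the attention path
        ffn_hidden = base_ffn + (extra - attn_extra)

        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Extra capacity attached to the attention output (assumed placement),
        # trained as part of the combination function rather than the FFN.
        self.attn_extra = (
            nn.Sequential(nn.Linear(d_model, attn_extra), nn.GELU(),
                          nn.Linear(attn_extra, d_model))
            if attn_extra > 0 else None
        )
        self.ffn = nn.Sequential(nn.Linear(d_model, ffn_hidden), nn.GELU(),
                                 nn.Linear(ffn_hidden, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        if self.attn_extra is not None:
            a = a + self.attn_extra(a)           # widen the combination path
        x = x + a
        x = x + self.ffn(self.ln2(x))
        return x


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)
    for r in (0.0, 0.5, 1.0):                    # sweep the reallocation ratio
        block = AdjustableWideBlock(alloc_ratio=r)
        print(r, block(x).shape, sum(p.numel() for p in block.parameters()))
```

Sweeping `alloc_ratio` keeps the total parameter budget roughly constant while shifting capacity from the transformation path (FFN) to the combination path (attention), which mirrors the incremental ratio adjustment described in the abstract.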
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8986