An Intrinsic Dimension Perspective of Transformers for Sequential Modeling

20 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Transformers, Intrinsic Dimension, Hyperparameter Optimization, Natural Language Processing
TL;DR: The study reveals a correlation between the intrinsic dimension of Transformer embeddings and task performance, providing insights for hyperparameter selection and data reduction in NLP tasks.
Abstract: Transformers have become immensely popular for sequential modeling, particularly in natural language processing (NLP). Recent innovations have introduced a variety of architectures built on the Transformer framework, yielding significant advances in applications; however, the underlying mechanics of these architectures remain poorly understood. In this study, we examine the geometric characteristics of the data representations learned by Transformers through the lens of intrinsic dimension (ID), which can be understood as the minimum number of parameters needed to model the data effectively. A series of experiments, centered primarily on text classification, supports the following empirical observations about the relationships among embedding dimension, layer depth, per-layer ID, and task performance. Notably, a higher ID of the final features produced by a Transformer generally correlates with a lower classification error rate, in contrast to the behavior observed for CNNs (and other models) on image classification tasks. Furthermore, our results indicate that the ID of each layer tends to decrease with depth, and that this decline is markedly steeper in more complex architectures. We also provide numerical evidence on the geometric structure of the data representations formed by Transformers, indicating that only nonlinear dimension reduction is achievable. Finally, we investigate how varying sequence lengths affect both ID and task performance, confirming the effectiveness of data reduction during training. We hope these insights will guide the choice of hyperparameters and the application of dimension/data reduction when using Transformers for text classification and other common NLP tasks.
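To make the intrinsic-dimension measurement concrete, below is a minimal sketch of how the ID of Transformer layer representations could be estimated. It assumes the TwoNN estimator (Facco et al., 2017) and mean-pooled hidden states from a BERT-style encoder; the abstract does not specify the authors' estimator or extraction pipeline, so these choices are illustrative, not the paper's method.

```python
# Minimal sketch: estimate the intrinsic dimension (ID) of layer embeddings
# with the TwoNN estimator. Simplified MLE variant; the original method also
# discards the largest neighbour-distance ratios before fitting.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def twonn_intrinsic_dimension(X: np.ndarray) -> float:
    """Estimate the intrinsic dimension of points X with shape (n_samples, n_features)."""
    # Distances to the two nearest neighbours of each point (column 0 is the point itself).
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    r1, r2 = dists[:, 1], dists[:, 2]
    mask = r1 > 0                     # avoid division by zero for duplicate points
    mu = r2[mask] / r1[mask]          # ratio of second to first neighbour distance
    # Maximum-likelihood fit of the Pareto law P(mu) = d * mu^(-d-1).
    return mu.size / np.sum(np.log(mu))


# Example usage (assumption: mean-pooled hidden states from a Hugging Face encoder):
# from transformers import AutoTokenizer, AutoModel
# import torch
# tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
# batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
# with torch.no_grad():
#     hidden = model(**batch).hidden_states          # tuple: (embeddings, layer 1, ..., layer L)
# per_layer_id = [twonn_intrinsic_dimension(h.mean(dim=1).numpy()) for h in hidden]
```

Plotting `per_layer_id` against layer index would reproduce the kind of depth-vs-ID profile the abstract describes, under the stated assumptions.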
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2349