Keywords: intrinsic dimension, transformer, text classification, NLP
TL;DR: An analysis of Transformers applied to sequential modeling from the perspective of intrinsic dimension.
Abstract: Transformers have gained great popularity for sequential modeling, especially in fields such as natural language processing (NLP). Recently, numerous architectures based on the Transformer framework have been proposed, leading to great achievements in applications. However, the working principles behind them remain mysterious. In this work, we numerically investigate the geometrical properties of the data representations learned by Transformers, via a mathematical concept called intrinsic dimension (ID), which can be viewed as the minimal number of parameters required for modeling. A series of experiments, mainly focusing on text classification tasks, supports the following empirical claims about the relationships among embedding dimension, depth, per-layer ID, and task performance. First, we surprisingly observe that a higher ID of the terminal features extracted by Transformers typically implies a lower classification error rate. This is contrary to the behavior of CNNs (and other models) on image classification tasks. In addition, we show that the per-layer ID tends to decrease as depth increases, and this reduction is usually more pronounced for deeper architectures. Moreover, we give numerical evidence on the geometrical structure of the data representations learned by Transformers, for which only nonlinear dimension reduction can be achieved.
Finally, we explore the effect of sequence length on the ID and task performance, which supports the validity of data reduction in training. We hope that these findings can guide hyper-parameter selection and dimension/data reduction for Transformers on text classification and other mainstream NLP tasks.
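The submission does not specify which ID estimator is used, so the sketch below assumes the TwoNN estimator (Facco et al., 2017), a common choice for measuring the intrinsic dimension of neural representations; the function name `twonn_id` and the placeholder `fake_layer_features` are hypothetical and only illustrate how such a per-layer measurement could be computed.

```python
# Minimal sketch (assumed estimator): TwoNN intrinsic-dimension estimate of a
# point cloud of layer activations. Not the authors' implementation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X: np.ndarray) -> float:
    """Estimate the intrinsic dimension of X with shape (n_samples, n_features)."""
    # Distances to the two nearest neighbors of each point (column 0 is the point itself).
    nn = NearestNeighbors(n_neighbors=3).fit(X)
    dist, _ = nn.kneighbors(X)
    r1, r2 = dist[:, 1], dist[:, 2]
    mu = r2 / r1  # ratio of second- to first-neighbor distance
    # Maximum-likelihood fit of the TwoNN law F(mu) = 1 - mu^(-d).
    return len(mu) / np.sum(np.log(mu))

if __name__ == "__main__":
    # Placeholder for real activations collected from one Transformer layer.
    rng = np.random.default_rng(0)
    fake_layer_features = rng.standard_normal((1000, 768))
    print(f"Estimated ID: {twonn_id(fake_layer_features):.1f}")
```

Applied layer by layer, an estimator of this kind would produce the per-layer ID profile that the abstract describes (e.g., ID decreasing with depth).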
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning