Keywords: Interpretability, Geometric Space, Isotropy, Intrinsic Dimensionality, Linguistic Features
TL;DR: A Linguistic Profiling Analysis of the embedding space of Transformers
Abstract: Transformer language models embed tokens in high-dimensional spaces, but whether geometry reflects linguistic structure remains unclear. We analyse token representations in BERT and $GPT\mbox{-}2$, selected as canonical encoder-only and decoder-only Transformer architectures, through a linguistically grounded geometric lens. We partition tokens from the UD English-EWT treebank by surface and syntactic features (position, length, POS, head distance and arity) and examine how their representational geometry evolves across layers. We employ complementary diagnostic metrics, including isotropy, linear and nonlinear intrinsic dimensionality, to capture distinct aspects of embedding structure. Our findings reveal that BERT maintains more isotropic and higher-dimensional subspaces, whereas $GPT\mbox{-}2$ exhibits stronger anisotropy driven by a compact cluster of sentence-initial tokens. Across models, open-class words, longer tokens, and high-arity predicates occupy more isotropic, higher-dimensional manifolds than short function words and pre-head modifiers, indicating that semantic richness and syntactic centrality play a key role in structuring embedding space. Our analysis provides a reusable framework for profiling how linguistic abstractions organize the geometry of Transformer embeddings.
Scope Confirmation: To the best of my judgment, this submission falls within the scope of CoNLL.
Primary Area Selection: Theoretical Analysis and Interpretation of ML Models for NLP
Secondary Area Selection: Syntax and Morphology
Use Of Generative Artificial Intelligence Tools: Yes, for editing/proofreading the manuscript
Data Collection From Human Subjects: No
Submission Type: Archival: I certify that the submission has not been previously published, nor is the material in it under review by another journal or conference. Further, no material in it will be submitted for review at another conference or journal while under review by CoNLL 2026.
Submission Number: 47
Loading