The geometry of sentence embedding spaces is not indicative of their performance: A study of three variations of sentence representation
Abstract: Transformer models learn to encode and decode an input text, and produce contextual token embeddings as a side effect. The mapping from language into the embedding space places words expressing similar concepts at nearby points. In practice, the reverse implication is also assumed: words corresponding to nearby points in this space are similar or related.
Does this closeness in the embedding space extend to shared properties for sentence embeddings? We compute sentence embeddings in three ways: as the average of the token embeddings, as the embedding of the special [CLS] token, and as the embedding of a random token from the sentence. We explore whether sentence embedding variants that are close in this space also perform similarly on morphology, syntax, semantics, discourse, and reasoning tasks, or whether their relative positions offer no useful clues about their relative performance and the type of linguistic information they encode.
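The three embedding variants described above can be made concrete with a minimal sketch, assuming the HuggingFace transformers library; the model name bert-base-uncased and the handling of special tokens are illustrative assumptions, not the paper's exact setup.

```python
import random

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice; the paper studies BERT, RoBERTa, DeBERTa, Electra.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "The cat sat on the mat."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)

# Variant 1: averaged token embeddings (averaging over all real tokens;
# whether to include special tokens is a design choice assumed here).
mask = inputs["attention_mask"][0].bool()
avg_embedding = hidden[mask].mean(dim=0)

# Variant 2: the embedding of the special [CLS] token (position 0).
cls_embedding = hidden[0]

# Variant 3: the embedding of a random non-special token from the sentence
# (skipping [CLS] at position 0 and [SEP] at the end).
positions = list(range(1, int(mask.sum().item()) - 1))
rand_embedding = hidden[random.choice(positions)]
```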
The results show that each of the four transformer models tested -- BERT, RoBERTa, DeBERTa, Electra -- has its own embedding profile, but shallow differences or commonalities between the three types of embeddings are not predictive of their performance on specific tasks. In an extreme case, Electra's [CLS] sentence embeddings and its averaged token embeddings are superficially almost orthogonal, yet both encode information about sentence chunk structure in the same way. Conversely, RoBERTa's three sentence embedding variants are geometrically very similar yet perform very differently on linguistic tasks. The embedding of a random token in a sentence works surprisingly well as a proxy for the sentence embedding.
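Continuing the sketch above, one hedged way to quantify "almost orthogonal" is the cosine similarity between two variants of the same sentence's embedding, where values near 0 indicate near-orthogonality; this is an illustrative metric, not necessarily the paper's exact geometric comparison.

```python
import torch.nn.functional as F

# Cosine similarity between the [CLS] and averaged-token variants computed
# in the previous sketch; a value near 0.0 means the two vectors are
# almost orthogonal despite representing the same sentence.
sim = F.cosine_similarity(cls_embedding, avg_embedding, dim=0)
print(f"cosine([CLS], averaged tokens): {sim.item():.3f}")
```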
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing; robustness; feature attribution
Contribution Types: Model analysis & interpretability
Languages Studied: English, French, Italian, Romanian
Submission Number: 2171