Abstract: Existing representation models typically compress a sentence into a single embedding vector for downstream tasks, using pooling strategies such as first-token pooling, last-token pooling, average pooling, or max pooling. However, these pooling methods inevitably incur information loss, as they either ignore or dilute important features from the rest of the sentence. This raises a natural question: would using multiple tokens improve sentence embeddings? In this paper, we adopt sentence classification as the research foundation, since it most directly reflects the quality of sentence embeddings. Randomly selecting multiple tokens is unlikely to yield an effective improvement; which tokens to use and how to combine them are the critical questions that must be explored. We therefore propose BTMR (\textbf{B}oosted \textbf{T}oken-Level \textbf{M}atryoshka \textbf{R}epresentation) to investigate the impact of using multiple tokens on sentence embeddings. BTMR operates in two key stages: Fine-to-Coarse Token Matryoshka Learning, which generates token-group representation vectors that capture both local and global contextual information, and Token Fusion Boosting, which aggregates the correct predictions derived from these vectors to produce the final prediction. Experimental results demonstrate that leveraging multiple tokens can indeed improve sentence embeddings.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Analysis; Multiple Tokens; Sentence Classification; Token-Level Matryoshka Representation; Token Fusion Boosting
Languages Studied: English
Submission Number: 1288
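The abstract describes two stages: pooling nested token groups into representation vectors (Fine-to-Coarse Token Matryoshka Learning) and fusing the per-group predictions (Token Fusion Boosting). The sketch below is a minimal, speculative illustration of that idea based only on the abstract; the class name `TokenMatryoshkaClassifier`, the `group_sizes` parameter, the mean pooling, and the averaging fusion are all illustrative assumptions, not the authors' implementation.

```python
# Speculative sketch of the two stages described in the abstract; every name and
# design choice here is an assumption made for illustration, not the paper's method.
import torch
import torch.nn as nn


class TokenMatryoshkaClassifier(nn.Module):
    """Fine-to-coarse token grouping: each nested token group is pooled into one
    vector, classified separately, and the per-group predictions are fused."""

    def __init__(self, hidden_dim: int, num_classes: int, group_sizes=(4, 16, 64)):
        super().__init__()
        self.group_sizes = group_sizes  # fine (few tokens) -> coarse (many tokens)
        # One classification head per token group (assumed design, for illustration)
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_classes) for _ in group_sizes]
        )

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden_dim) from any pretrained encoder
        logits_per_group = []
        for size, head in zip(self.group_sizes, self.heads):
            group = token_states[:, :size, :]   # nested "Matryoshka" token slice
            pooled = group.mean(dim=1)          # one vector per group
            logits_per_group.append(head(pooled))
        # Stand-in for Token Fusion Boosting: average the per-group predictions.
        return torch.stack(logits_per_group, dim=0).mean(dim=0)


# Usage with dummy encoder outputs
model = TokenMatryoshkaClassifier(hidden_dim=768, num_classes=5)
dummy_states = torch.randn(2, 128, 768)
print(model(dummy_states).shape)  # torch.Size([2, 5])
```

The averaging step is only a placeholder for the paper's boosting-style aggregation, which the abstract does not specify in enough detail to reproduce.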