Abstract: Existing representation models typically compress a sentence into a single embedding vector for downstream tasks, using pooling strategies such as first-token pooling, last-token pooling, average pooling, or max pooling. However, these pooling methods inevitably incur information loss, as they either ignore or dilute important features from the rest of the sentence. This raises a natural question: would using multiple tokens improve sentence embeddings? In this paper, we adopt sentence classification as the research foundation, since it most directly reflects the quality of sentence embeddings. Randomly selecting multiple tokens is unlikely to yield an effective improvement; which tokens to use and how to combine them are the critical questions that must be explored. We therefore propose BTMR (\textbf{B}oosted \textbf{T}oken-Level \textbf{M}atryoshka \textbf{R}epresentation) to investigate the impact of using multiple tokens on sentence embeddings. BTMR operates in two key stages: Fine-to-Coarse Token Matryoshka Learning, which generates token-group representation vectors that capture both local and global contextual information, and Token Fusion Boosting, which aggregates the correct predictions derived from these vectors to produce the final prediction. Experimental results demonstrate that leveraging multiple tokens can indeed improve sentence embeddings.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Analysis; Multiple Tokens; Sentence Classification; Token-Level Matryoshka Representation; Token Fusion Boosting
Languages Studied: English
Submission Number: 1288
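The abstract describes two stages: pooling nested token groups into representation vectors (Fine-to-Coarse Token Matryoshka Learning) and fusing the per-group predictions (Token Fusion Boosting). The sketch below is a minimal, speculative illustration of that idea based only on the abstract; the class name `TokenMatryoshkaClassifier`, the `group_sizes` parameter, the mean pooling, and the averaging fusion are all illustrative assumptions, not the authors' implementation.

```python
# Speculative sketch of the two stages described in the abstract; every name and
# design choice here is an assumption made for illustration, not the paper's method.
import torch
import torch.nn as nn


class TokenMatryoshkaClassifier(nn.Module):
    """Fine-to-coarse token grouping: each nested token group is pooled into one
    vector, classified separately, and the per-group predictions are fused."""

    def __init__(self, hidden_dim: int, num_classes: int, group_sizes=(4, 16, 64)):
        super().__init__()
        self.group_sizes = group_sizes  # fine (few tokens) -> coarse (many tokens)
        # One classification head per token group (assumed design, for illustration)
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_classes) for _ in group_sizes]
        )

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden_dim) from any pretrained encoder
        logits_per_group = []
        for size, head in zip(self.group_sizes, self.heads):
            group = token_states[:, :size, :]   # nested "Matryoshka" token slice
            pooled = group.mean(dim=1)          # one vector per group
            logits_per_group.append(head(pooled))
        # Stand-in for Token Fusion Boosting: average the per-group predictions.
        return torch.stack(logits_per_group, dim=0).mean(dim=0)


# Usage with dummy encoder outputs
model = TokenMatryoshkaClassifier(hidden_dim=768, num_classes=5)
dummy_states = torch.randn(2, 128, 768)
print(model(dummy_states).shape)  # torch.Size([2, 5])
```

The averaging step is only a placeholder for the paper's boosting-style aggregation, which the abstract does not specify in enough detail to reproduce.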