Cropping outperforms dropout as an augmentation strategy for self-supervised training of text embeddings
Authors that are also TMLR Expert Reviewers: ~Dmitry_Kobak2
Abstract: Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via supervised contrastive fine-tuning. This fine-tuning strategy relies on an external notion of similarity and on annotated data for generating positive pairs. Here we study self-supervised fine-tuning and systematically compare the two best-known augmentation strategies used for fine-tuning text embedding models. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of the resulting embeddings remains substantially below that of supervised state-of-the-art models, but on in-domain data, self-supervised fine-tuning can produce high-quality text embeddings after very short fine-tuning. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning, and that fine-tuning only those last layers is sufficient to reach similar embedding quality.
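To illustrate the cropping augmentation described in the abstract, here is a minimal sketch of how two random crops of a text can serve as a positive pair for self-supervised contrastive fine-tuning. All names and the crop ratio are hypothetical illustration choices, not the paper's actual implementation:

```python
import random


def crop_views(tokens, crop_ratio=0.5, seed=None):
    """Sample two random contiguous spans ('crops') of a token sequence.

    The two crops of the same text form a positive pair for contrastive
    fine-tuning; crops of different texts serve as negatives.
    The crop_ratio of 0.5 is an illustrative choice, not the paper's setting.
    """
    rng = random.Random(seed)
    n = len(tokens)
    span = max(1, int(n * crop_ratio))
    views = []
    for _ in range(2):
        start = rng.randint(0, n - span)
        views.append(tokens[start:start + span])
    return views


tokens = ("text embeddings play an important role "
          "in many nlp applications").split()
view_a, view_b = crop_views(tokens, crop_ratio=0.5, seed=0)
# Each view is a contiguous half of the original token sequence.
```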
Certifications: Expert Certification
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Dear Action Editor,
We have incorporated all the requested revisions and included some additional analyses to further support our results.
Specifically, we:
- added a dedicated Limitations section addressing the mentioned caveats (comparison of only two augmentations, use of a single training set for the MTEB evaluation, and the absence of hard negatives);
- expanded the Related Work section to reflect recent developments in the field;
- extended Figure 4 by adding panel (c) (layer-wise MTEB evaluation), and revised Section 4 accordingly to discuss these new results.
We hope that these revisions satisfactorily address the requested changes, and we sincerely appreciate your and the reviewers' feedback throughout the review process.
Code: https://github.com/berenslab/text-embed-augm
Assigned Action Editor: ~Sarath_Chandar1
Submission Number: 5555