Keywords: unified embedding for text and code, unsupervised learning, multi-stage training
Abstract: Creating versatile embedding models that excel across both text and code domains is essential, as modern applications often involve diverse, heterogeneous data. While data mixing is a typical starting point, we take a significant step forward by addressing the limitations of naive data mixing. In this work, we introduce SageLite, a unified embedding model capable of handling both text and code within a single framework. Our approach begins with pretraining on a blended dataset of text and code, fostering shared representations that are crucial for strong cross-domain performance. We then enhance domain-specific capabilities by independently applying large-scale contrastive learning to text and code drawn from various web sources. Our key finding is that, despite the inherent differences between text and code, starting from a model pretrained on mixed data enables the domain-specific contrastive learning stages to produce models that remain closely aligned. This alignment allows us to effectively integrate the domain-specific improvements from the contrastive learning stage into a final model through model weight interpolation. Through comprehensive ablation studies, we explore the mechanisms behind our approach, offering insights to guide future research in this area.
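The integration step described in the abstract is model weight interpolation between the text-specialized and code-specialized checkpoints, which share a common mixed-data pretrained initialization. The sketch below is a minimal, hypothetical illustration of linear parameter interpolation; the checkpoint file names and the interpolation coefficient alpha are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of linear weight interpolation between two checkpoints that
# share the same architecture (e.g., text- and code-contrastive models fine-tuned
# from one mixed-data pretrained model). File names and alpha are illustrative.
import torch

def interpolate_state_dicts(state_a, state_b, alpha=0.5):
    """Return a new state dict with parameters alpha*a + (1-alpha)*b."""
    assert state_a.keys() == state_b.keys(), "checkpoints must have identical parameter names"
    merged = {}
    for name, tensor_a in state_a.items():
        tensor_b = state_b[name]
        if tensor_a.is_floating_point():
            merged[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
        else:
            # Non-float buffers (e.g., integer counters) are copied from one checkpoint.
            merged[name] = tensor_a.clone()
    return merged

if __name__ == "__main__":
    # Hypothetical checkpoint paths; assumes each file stores a model state dict.
    text_state = torch.load("text_contrastive_ckpt.pt", map_location="cpu")
    code_state = torch.load("code_contrastive_ckpt.pt", map_location="cpu")
    merged_state = interpolate_state_dicts(text_state, code_state, alpha=0.5)
    torch.save(merged_state, "merged_ckpt.pt")
```

This kind of interpolation only makes sense when the two checkpoints lie in a compatible region of parameter space, which is the alignment property the abstract attributes to starting both contrastive stages from the same mixed-data pretrained model.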
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10623