Keywords: unified embedding for text and code, unsupervised learning, multi-stage training
Abstract: Creating versatile embedding models that excel across both text and code domains is essential, as modern applications often involve diverse, heterogeneous data. While data mixing is a typical starting point, we take a significant step forward by addressing the limitations of naive data mixing. In this work, we introduce SageLite, a unified embedding model capable of handling both text and code within a single framework. Our approach begins with pretraining on a blended dataset of text and code, fostering shared representations that are crucial for strong cross-domain performance. We then enhance domain-specific capabilities by independently applying large-scale contrastive learning to text and code drawn from various web sources. Our key finding is that, despite the inherent differences between text and code, starting from a model pretrained on mixed data enables the domain-specific contrastive learning stages to produce models that remain closely aligned. This alignment allows us to effectively integrate the domain-specific improvements from the contrastive learning stage into a final model through model weight interpolation. Through comprehensive ablation studies, we explore the mechanisms behind our approach, offering insights to guide future research in this area.
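The integration step described in the abstract is model weight interpolation between the text-specialized and code-specialized checkpoints, which share a common mixed-data pretrained initialization. The sketch below is a minimal, hypothetical illustration of linear parameter interpolation; the checkpoint file names and the interpolation coefficient alpha are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of linear weight interpolation between two checkpoints that
# share the same architecture (e.g., text- and code-contrastive models fine-tuned
# from one mixed-data pretrained model). File names and alpha are illustrative.
import torch

def interpolate_state_dicts(state_a, state_b, alpha=0.5):
    """Return a new state dict with parameters alpha*a + (1-alpha)*b."""
    assert state_a.keys() == state_b.keys(), "checkpoints must have identical parameter names"
    merged = {}
    for name, tensor_a in state_a.items():
        tensor_b = state_b[name]
        if tensor_a.is_floating_point():
            merged[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
        else:
            # Non-float buffers (e.g., integer counters) are copied from one checkpoint.
            merged[name] = tensor_a.clone()
    return merged

if __name__ == "__main__":
    # Hypothetical checkpoint paths; assumes each file stores a model state dict.
    text_state = torch.load("text_contrastive_ckpt.pt", map_location="cpu")
    code_state = torch.load("code_contrastive_ckpt.pt", map_location="cpu")
    merged_state = interpolate_state_dicts(text_state, code_state, alpha=0.5)
    torch.save(merged_state, "merged_ckpt.pt")
```

This kind of interpolation only makes sense when the two checkpoints lie in a compatible region of parameter space, which is the alignment property the abstract attributes to starting both contrastive stages from the same mixed-data pretrained model.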
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10623