CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity
Keywords: Text Representation, Information Retrieval, Semantic Textual Similarity, Pre-trained Language Models, Natural Language Processing, Multi-task Optimization
TL;DR: CoDiEmb, a single-stage framework, resolves IR/STS training trade-offs via distinct, task-specific optimizations, achieving strong results across 15 benchmarks while mitigating core geometric issues like anisotropy without extra parameters.
Abstract: Obtaining text embeddings that excel across diverse downstream scenarios is a long-standing pursuit in representation learning, yet negative transfer remains a persistent obstacle. This challenge is particularly pronounced when jointly optimizing two core tasks: Information Retrieval (IR) and Semantic Textual Similarity (STS). Owing to discrepancies in data organization, text-length distributions, and evaluation metrics, naive co-training typically yields steep performance trade-offs. In this paper, we contend that systematically decoupling these tasks at both the design and training levels is essential for comprehensive model convergence. To this end, we propose CoDiEmb, a unified framework that processes IR and STS collaboratively yet distinctly. Unlike previous methods, CoDiEmb achieves superior performance under joint optimization without requiring complex multi-stage training pipelines or additional learnable components. CoDiEmb introduces three key innovations: (1) a unified data format compatible with inputs of any granularity; (2) task-specific objective functions aligned with evaluation metrics; and (3) a dynamic single-source data sampling strategy. Extensive experiments on 15 standard IR and STS benchmarks across three base encoders thoroughly validate the effectiveness of CoDiEmb. Our results and analysis demonstrate that the framework not only mitigates inter-task conflicts but also substantially alleviates the issues of anisotropy and over-smoothing in the semantic space. Our code is publicly available at https://anonymous.4open.science/r/CoDiEmb.
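To make the third innovation concrete, below is a minimal sketch of what single-source sampling with task-specific losses could look like, assuming that each training step draws a batch from only one task's corpus (IR or STS) and applies that task's objective so gradients within a step never mix the two tasks. The function names (`single_source_step`, `ir_loss_fn`, `sts_loss_fn`) and the sampling probability are hypothetical illustrations, not the paper's actual implementation.

```python
import random

# Hypothetical sketch: per-step single-source sampling.
# Each step picks one task source (IR or STS), takes a batch from it,
# and dispatches to that task's objective.

def single_source_step(ir_batches, sts_batches, ir_loss_fn, sts_loss_fn,
                       ir_prob=0.5):
    """Draw one batch from a single task source and compute its loss."""
    if random.random() < ir_prob:
        batch = next(ir_batches)          # batch comes only from the IR corpus
        return ir_loss_fn(batch)          # e.g., a retrieval-style contrastive loss
    else:
        batch = next(sts_batches)         # batch comes only from the STS corpus
        return sts_loss_fn(batch)         # e.g., a similarity/ranking-style loss

# Usage sketch (loss functions and data iterators are placeholders):
# loss = single_source_step(iter(ir_data), iter(sts_data), ir_loss_fn, sts_loss_fn)
```

The design intent illustrated here is that keeping each optimization step homogeneous in task avoids averaging conflicting gradients from IR and STS within a batch; the actual sampling schedule and objectives used by CoDiEmb are described in the paper itself.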
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16598