General-Purpose Instruction-Aware Text Embeddings via Dual-Level Contrastive Learning

ACL ARR 2026 January Submission 5200 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Instruction Embedding, Contrastive Learning, Semantic Alignment, Asymmetric Fusion, Model Distillation, General Instruction Understanding, Catastrophic Forgetting, Embedding Models, Instruction Comprehension, Fine-Tuning
Abstract: Text embedding models are typically trained in an instruction-agnostic manner, which limits their ability to follow explicit user instructions in retrieval tasks. We propose a unified contrastive learning framework for instruction-aware text embeddings based on Dual-Level Instructional Contrastive Learning (DICL), together with an Asymmetric Fusion Distillation strategy that preserves general semantic capability. We introduce GEDI, a large-scale instruction–text dataset, and ICE, a benchmark for evaluating instruction-following behavior in embedding models. Experiments demonstrate that our method achieves state-of-the-art performance on ICE and yields up to double-digit p-MRR gains in instruction-following retrieval across multiple benchmarks and embedding backbones, while retaining 99.6% of the base model's performance on standard embedding benchmarks.
Paper Type: Long
Research Area: Retrieval-Augmented Language Models
Research Area Keywords: Semantics: Lexical and Sentence-Level
Languages Studied: English
Submission Number: 5200