Phased Training for LLM-powered Text Retrieval Models Beyond Data Scaling

Published: 08 Jul 2025, Last Modified: 26 Aug 2025, COLM 2025, CC BY 4.0
Keywords: Text Retrieval, Text Embedding, Reranking, LLM-based Embedding
TL;DR: Training powerful general-purpose text embedding and reranking models via a multi-stage training framework and efficient data synthesis.
Abstract: Current efforts in building general-purpose text retrieval models based on large language models (LLMs) primarily focus on architectural design and training data scaling. However, significant challenges remain in effectively modeling diverse retrieval tasks and domains, including multi-task conflict, data imbalance, and training efficiency. To address these challenges, we propose a novel phased training framework for text retrieval, featuring: (1) robust foundation modeling with core relevance data, (2) progressive specialization through modular task adaptation, and (3) knowledge fusion via weight-interpolation-based model merging. This framework simultaneously optimizes both embedding and reranking models through a unified architecture. We also present an efficient and scalable data synthesis pipeline, built on open-source LLMs, to expand the training data. The synthetic data can be efficiently incorporated into the phased training framework, further enhancing model performance. We identify five distinct types of retrieval tasks, i.e., basic relevance retrieval, code retrieval, tool retrieval, complex instruction-based retrieval, and reasoning-intensive retrieval, and conduct extensive experiments on them. Our method achieves the best performance on MTEB and on retrieval benchmarks covering the five task types. Further analysis demonstrates the effectiveness and efficiency of our proposed training framework and data synthesis pipeline.
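Illustrative note: the abstract's third stage refers to "knowledge fusion via weight-interpolation-based model merging." The sketch below shows one plausible reading of that idea, linearly interpolating the parameters of two task-specialized checkpoints that share the same architecture. It is not the authors' implementation; the function name, the mixing coefficient alpha, and the uniform two-model interpolation are all assumptions for illustration.

```python
# Minimal sketch of weight-interpolation model merging (an assumed reading of the
# abstract, not the paper's actual method). Both checkpoints must share the same
# architecture so that parameter names and shapes line up.
from collections import OrderedDict
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha=0.5):
    """Return a state dict with each parameter set to alpha * A + (1 - alpha) * B."""
    merged = OrderedDict()
    for name, param_a in sd_a.items():
        param_b = sd_b[name]
        merged[name] = alpha * param_a + (1.0 - alpha) * param_b
    return merged

# Hypothetical usage: merge two specialized models back into one set of weights.
# merged_sd = interpolate_state_dicts(model_code.state_dict(),
#                                     model_tool.state_dict(), alpha=0.5)
# base_model.load_state_dict(merged_sd)
```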
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 340