ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers

ACL ARR 2024 December Submission 754 Authors

Submitted: 15 Dec 2024 (modified: 05 Feb 2025). License: CC BY 4.0.
Abstract:

Large language models (LLMs) have demonstrated remarkable effectiveness in text reranking through works like RankGPT, leveraging their human-like reasoning about relevance. However, supervised fine-tuning for ranking often diminishes these models' general-purpose capabilities, including the crucial reasoning abilities that make them valuable for ranking. We introduce a novel approach integrating Chain-of-Thought prompting with an SFT-DPO (Supervised Fine-Tuning followed by Direct Preference Optimization) pipeline to preserve these capabilities while improving ranking performance. Our experiments on TREC 2019 and 2020 Deep Learning datasets show that our approach outperforms the state-of-the-art RankZephyr while maintaining strong performance on the Massive Multitask Language Understanding (MMLU) benchmark, demonstrating effective preservation of general-purpose capabilities through thoughtful fine-tuning strategies.
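
For context, the preference-optimization stage named in the abstract presumably builds on the standard DPO objective of Rafailov et al. (2023); the equation below is that generic loss written over preferred and dispreferred ranking outputs $(y_w, y_l)$ for a query-and-passages prompt $x$, and is offered only as a reference point, not as the paper's exact formulation:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log\sigma\!\left(
\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\right)\right]
$$

Here $\pi_{\mathrm{ref}}$ is the frozen SFT checkpoint, $\sigma$ is the logistic function, and $\beta$ controls how far the policy $\pi_\theta$ may drift from the reference model.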

Paper Type: Short
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Information Retrieval and Text Mining, Language Modeling, Machine Learning for NLP, NLP Applications
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 754