IFIR-EVAL: Evaluating Information Retrieval Models for Instruction Following in Specialized Domains

ACL ARR 2024 August Submission 234 Authors

15 Aug 2024 (modified: 08 Sept 2024) · ACL ARR 2024 August Submission · CC BY 4.0
Abstract: Despite the recent success of aligning large language models (LLMs) with human instructions, the ability of information retrievers to follow instructions has not been fully explored. To address this gap, we propose IFIR-EVAL, a comprehensive information retrieval benchmark that spans eight subsets across four expert domains: finance, law, healthcare, and scientific literature. Each subset tackles one or more domain-specific retrieval tasks drawn from real-world scenarios where user-customized instructions are essential. To enable a thorough assessment of retrievers’ instruction-following abilities, we also construct instructions at different complexity levels. Recognizing the limitations of traditional IR metrics for evaluating instruction-following capability, we propose a new LLM-based evaluation method, INSTFOL. We conduct comprehensive experiments covering a wide range of information retrievers. Our results demonstrate that LLM-based retrievers show strong potential for following instructions, but current information retrieval systems remain far from handling complex instructions reliably.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; evaluation methodologies; NLP datasets
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 234