Abstract: While large language models (LLMs) have made notable strides in generalizing across diverse natural language processing tasks, existing datasets often lack the complexity required to reflect real-world scenarios. These datasets predominantly focus on single-task settings with limited constraints, and thus fail to capture the multifaceted, constraint-rich requirements of practical applications. To bridge this gap, we present the Extremely Complex Instruction Following Benchmark (EIFBENCH), meticulously crafted to enable a more realistic and robust evaluation of LLMs. EIFBENCH offers several distinctive advantages. First, it includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently. Second, it is sourced from a wide array of origins, ensuring both the diversity and representativeness of its data. Finally, it integrates a variety of constraints, replicating complex operational environments and providing critical insight into model behavior under resource, time, and environmental limitations. Evaluations on EIFBENCH reveal considerable performance gaps in existing LLMs when they are challenged with these extremely complex instructions. This finding underscores the need for continued optimization and for more versatile models with deeper instruction understanding, capable of navigating the intricate challenges posed by real-world applications.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation; Language Modeling
Contribution Types: Data resources
Languages Studied: English
Submission Number: 5841