Keywords: LLM, evaluation, instruction-following
Abstract: Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 7K real user instructions with diverse multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and shows that all models have substantial room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing distinct patterns in how models follow constraints.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/gililior/wild-if-eval
Code URL: https://github.com/gililior/wild-if-eval-code
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1820
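For convenience, a minimal sketch of loading the dataset with the Hugging Face `datasets` library. The repository id is taken from the Dataset URL above; the split and field names are assumptions and may differ from the actual dataset card.

```python
# Minimal sketch: load WildIFEval from the Hugging Face Hub.
# Assumes the standard `datasets` library; repo id comes from the Dataset URL above.
from datasets import load_dataset

dataset = load_dataset("gililior/wild-if-eval")

# Inspect which splits are available (split names are not specified here).
print(dataset)

# Peek at one example from the first available split; field names are assumptions.
first_split = next(iter(dataset))
print(dataset[first_split][0])
```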