Keywords: instruction following, large language models
TL;DR: We introduce ManyIFEval, a benchmark dataset comprising task prompts with up to ten objectively verifiable instructions, and find that LLMs' ability to follow all given instructions diminishes as the number of instructions increases.
Abstract: Large language models (LLMs) have demonstrated impressive performance across various natural language processing (NLP) tasks owing to their strong instruction-following capability. To further accelerate the integration of LLMs into our society, it is essential that LLMs follow many instructions as accurately as humans do. This study reveals that LLMs unexpectedly struggle to follow all instructions simultaneously as the number of instructions increases. First, to validate our claim, we introduce ManyIFEval, a large-scale benchmark dataset comprising task prompts with up to ten objectively verifiable instructions. Second, we conduct experiments on ManyIFEval with GPT-4o, Claude-3.5, Gemini-1.5, Gemma2, and Llama3.1, demonstrating that as the instruction count rises, the models' ability to follow each individual instruction deteriorates gradually but consistently. As a result, the models' ability to follow all the instructions drops significantly: the success rate over all instructions is precisely explained by the success rate of an individual instruction raised to the power of the total number of instructions. We refer to this phenomenon as the ``curse of instructions''. Third, to remove the curse without retraining models, we propose an inference-time strategy that enhances performance through iterative self-refinement. We demonstrate that instruction-level chain-of-thought reasoning significantly improves the models' capability to detect and correct instruction-following errors. Notably, our method improves the success rate of following ten instructions from 15% to 31% for GPT-4o and from 44% to 58% for Claude 3.5 Sonnet. We also show that precision is more important than recall in feedback: simply telling LLMs that they are not following all the instructions also improves self-refinement success. Our findings highlight a fundamental limitation of LLMs' instruction-following ability and suggest a future direction for building trustworthy LLMs that can coexist with human society.
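To make the ``curse of instructions'' relation concrete, consider a minimal worked example (assuming, for illustration only, that the instructions are equally difficult and that failures are independent; the per-instruction rate below is an illustrative value, not a reported result): with a per-instruction success rate of $p \approx 0.83$ and $n = 10$ instructions,
\[
P(\text{all } n \text{ instructions followed}) \approx p^{n} = 0.83^{10} \approx 0.15,
\]
which is consistent with the reported 15% all-instruction success rate of GPT-4o at ten instructions.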
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9331