Complex Logical Instruction Generation

03 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Model; Instruction Following; Evaluation
TL;DR: LLMs struggle with complex, logic-rich instructions, so we introduce an automated framework for generating verifiable logic-intensive tasks, along with a benchmark of 426 such tasks, showing that even state-of-the-art models correctly follow fewer than 60% of them.
Abstract: Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs correctly follow fewer than 60% of the instructions, revealing significant deficiencies in their capacity to handle instructions that involve complex logical structures.
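
For intuition, here is a minimal sketch of the kind of logic-rich code function that, per the abstract, LogicIFGen could start from. The function and test input below are hypothetical illustrations, not drawn from LogicIFEval: the idea is that the function's conditionals, nesting, and recursion would be restated as a natural-language instruction, and a model's response is verifiable by comparing it against the function's true output on a given input.

```python
# Hypothetical example of a logic-rich source function (conditionals,
# nesting, recursion), not taken from the LogicIFEval benchmark.

def transform(values, depth=0):
    """Recursively process a list, branching on element type and parity."""
    result = []
    for v in values:
        if isinstance(v, list):
            # Nested list: recurse one level deeper.
            result.append(transform(v, depth + 1))
        elif v % 2 == 0:
            # Even values are shifted by the current nesting depth.
            result.append(v + depth)
        else:
            # Odd values are negated.
            result.append(-v)
    return result

# A generated instruction would describe this control flow in natural
# language; an LLM "follows" it correctly iff its answer matches the
# function's actual output on a test input:
assert transform([1, 2, [3, 4]]) == [-1, 2, [-3, 5]]
```

Because the ground truth is simply the function's execution result, correctness of a model's response can be checked automatically, which is what makes the generated instructions verifiable.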
Primary Area: datasets and benchmarks
Submission Number: 1807