Abstract: Despite of significant achievements in improving instruction-following capabilities of large language models (LLMs), the ability to process multiple potentially entangled or conflict instructions remains a considerable challenge. Real-world scenarios often require the consistency across multiple instructions over time, such as secret privacy, presonal preferences, and prioritization, so we demand sophisticated abilities to integrate multiple turns and carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs' capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct with
1.1K high-quality multi-turn conversations through the human-in-the-loop approach and result in a total of nine capability categories, including statics and dynamics, reasoning and multitasking. Our finding reveals an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness in privacy-protection tasks requiring selective information withholding. Larger models exhibit stronger reasoning capabilities but still struggle with resolving conflicting instructions. Importantly, these performance gaps cannot be attributed solely to information loss, as models demonstrate strong BLEU scores on memorization tasks but their attention mechanisms fail to effectively integrate multiple related instructions. These findings highlight critical areas for improvement in the complex real-world tasks involving multi-turn instructions.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: LLM Capability, Multiturn, Evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 7660
Loading