Submission Type: Long
Keywords: natural language processing, large language models, prompt optimization, prompt engineering
TL;DR: Despite recent advances, modern LLMs (GPT-4o, GPT-4o mini, DeepSeek) suffer from significant order sensitivity, with performance dropping by up to 13.95% when input order is shuffled, posing serious reliability risks for real-world applications.
Abstract: As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in input arrangement can lead to inconsistent or biased outputs. Although recent advances have reduced this sensitivity, the problem remains unresolved. This paper investigates the extent of order sensitivity in LLMs whose internal components are hidden from users (such as closed-source models or those accessed via API calls). We conducted experiments across multiple tasks, including paraphrasing, relevance judgment, and multiple-choice question answering. Our results show that input order significantly affects performance across tasks, with shuffled inputs leading to measurable declines in output accuracy. Few-shot prompting offers partial mitigation but fails to fully resolve the problem. These findings highlight persistent risks, particularly in high-stakes applications, and point to the need for more robust LLMs or improved input-handling techniques in future development.
Submission Number: 8