Characterizing and Assessing Robust Conversational Interaction: Amendments in Order Specification Tasks
Keywords: Task-Oriented Dialogue, Pragmatic Reasoning, Conversational AI, Large Language Models, Conversational Robustness, Dialogue Evaluation, Pragmatics in NLP
TL;DR: Even in a deliberately simple task—specifying fast-food orders and processing user-initiated amendments—SOTA LLMs struggle to consistently make the required pragmatic inferences, revealing a robustness gap in conversational interaction.
Abstract: Robustness in the face of complex but natural human conversational behavior is a key requirement for practical task-oriented conversational systems. We study the robustness of large language models (LLMs) in a deliberately simplified task setting, specifying fast-food orders, involving a single common but nontrivial conversational behavior: user-initiated amendments to previously provided information. Although the task itself is straightforward, correctly interpreting amendments requires pragmatic inference about speaker intent. To study this problem, we characterize the pragmatic inferences required to interpret amendments in this simple task context and construct a synthetic evaluation dataset requiring specific combinations of these inferences. Users may convey, with varying degrees of explicitness, whether they intend to change an order, the type of change (addition, deletion, substitution, or modification), the affected item, the affected attribute, and the new value to apply.
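For concreteness, a minimal sketch of how an order and an amendment along these five aspects might be represented (the class and field names are our own illustration, not the dataset's actual schema):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class ChangeType(Enum):
    ADDITION = "addition"
    DELETION = "deletion"
    SUBSTITUTION = "substitution"
    MODIFICATION = "modification"

@dataclass
class OrderItem:
    name: str                                   # e.g. "cheeseburger"
    attributes: dict[str, str] = field(default_factory=dict)  # e.g. {"size": "large"}

@dataclass
class Amendment:
    change_type: ChangeType
    item: Optional[str] = None        # affected item; None if it must be inferred
    attribute: Optional[str] = None   # affected attribute, if any
    new_value: Optional[str] = None   # new value to apply, if any
```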
The dataset systematically varies amendment difficulty along two dimensions. Breadth captures how many aspects of an amendment utterance require inference, while depth captures how implicit or demanding those inferences are. This framework provides a method for programmatically exploring the space of utterances requiring different degrees and kinds of pragmatic inference to interpret amendments, enabling systematic evaluation of robustness with respect to this aspect of natural conversational interaction.
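As an illustrative sketch of how such a space can be enumerated programmatically (the aspect names and depth levels are hypothetical stand-ins, not the dataset's actual labels):

```python
import itertools

# The five aspects of an amendment a user may leave implicit (see above).
ASPECTS = ["intent", "change_type", "item", "attribute", "value"]

# Hypothetical depth levels, ordered from mildly to strongly implicit.
DEPTHS = ["indirect", "implicit"]

def difficulty_grid():
    """Yield one difficulty specification per (breadth, depth) combination.

    Breadth is how many aspects require inference; depth is how implicit
    each of those aspects is. Aspects not listed are stated explicitly.
    """
    for breadth in range(1, len(ASPECTS) + 1):
        for aspects in itertools.combinations(ASPECTS, breadth):
            for depths in itertools.product(DEPTHS, repeat=breadth):
                yield dict(zip(aspects, depths))

# e.g. {'item': 'implicit', 'value': 'indirect'} asks the generator to leave
# the affected item strongly implicit and the new value mildly implicit.
```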
We evaluate several state-of-the-art LLMs on correctly processing utterances that require these inferences. While the best models achieve strong overall performance, with average error rates of roughly 5–7\%, performance degrades substantially when interpreting amendments. Error rates increase systematically along both dimensions of amendment difficulty: they rise as more aspects of an amendment require inference and as the required inferences become more implicit or pragmatically demanding. For the strongest models, pass@5 error rates are roughly 3–6\%, but pass$^5$ error rates remain substantially higher (7–11\%), indicating inconsistent reasoning across runs.
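For clarity on the two metrics: pass@5 counts an example as solved if at least one of five runs succeeds, while pass$^5$ requires all five runs to succeed. A minimal sketch of the corresponding error rates (our own illustration, not the paper's evaluation code):

```python
def pass_at_k_error(results: list[list[bool]]) -> float:
    """pass@k error: an example counts as failed only if all k runs fail."""
    return sum(not any(runs) for runs in results) / len(results)

def pass_hat_k_error(results: list[list[bool]]) -> float:
    """pass^k error: an example counts as failed if any of the k runs fails."""
    return sum(not all(runs) for runs in results) / len(results)

# Four examples, five runs each: one always fails, one always succeeds,
# and two succeed on most but not all runs.
results = [
    [True, True, True, True, True],
    [True, True, False, True, True],
    [False, False, False, False, False],
    [True, False, True, True, True],
]
print(pass_at_k_error(results))    # 0.25: only the always-failing example
print(pass_hat_k_error(results))   # 0.75: any single failed run counts
```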
Using these results, we identify the pragmatic inferences that cause failures and construct a set of challenging examples that focus on them. We use this set to test several strategies intended to improve robustness, including two-step queries, function calling for constrained edits, and schema validation with reprompting.
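As a sketch of the last strategy, schema validation with reprompting might look like the following (the schema, the `llm_call` callable, and the retry policy are hypothetical; the third-party `jsonschema` package is one possible validator choice):

```python
import json
import jsonschema  # third-party JSON Schema validator; one possible choice

# Hypothetical JSON schema constraining the structured edit a model may emit.
AMENDMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "change_type": {"enum": ["addition", "deletion", "substitution", "modification"]},
        "item": {"type": "string"},
        "attribute": {"type": ["string", "null"]},
        "new_value": {"type": ["string", "null"]},
    },
    "required": ["change_type", "item"],
}

def amend_with_validation(llm_call, utterance: str, max_retries: int = 2) -> dict:
    """Ask the model for a structured edit; validate it and reprompt on failure.

    `llm_call` is any callable mapping a prompt string to a model reply string.
    """
    prompt = f"Extract the order amendment from: {utterance!r}. Reply with JSON only."
    for _ in range(max_retries + 1):
        raw = llm_call(prompt)
        try:
            edit = json.loads(raw)
            jsonschema.validate(edit, AMENDMENT_SCHEMA)
            return edit
        except (json.JSONDecodeError, jsonschema.ValidationError) as err:
            # Feed the validation error back to the model and try again.
            prompt += f"\nYour previous reply was invalid ({err}). Reply with valid JSON only."
    raise ValueError("no schema-valid amendment after retries")
```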
The dataset does not reflect real-world distributions; instead it isolates pragmatic inferences that are simple and natural for humans but nevertheless difficult for current models. Failures therefore indicate a robustness gap in pragmatic reasoning. Importantly, these failures arise even in a simplified task—standard orders for a single person with only one item changed—suggesting that more extensive real-world conversational settings will pose even greater challenges for LLM-based conversational systems.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 138