Gricean Maxims in LLM Development

Published: 22 Sept 2025, Last Modified: 22 Sept 2025, WiML @ NeurIPS 2025, CC BY 4.0
Keywords: LLMs, Gricean Maxims, LLM Development, Pragmatic Inference, User Experience
Abstract: We evaluate whether the capacity for pragmatic inference, an outcome of adherence to Gricean conversational norms, has changed across recent LLM iterations. Understanding how well models navigate the conversational implicatures that humans follow helps inform effective human-AI interaction. This preliminary test shows that modern LLM development is affecting models' pragmatic ability and warrants community discussion, as it suggests unexpected patterns in model development.

Method: We prompted different versions of aligned models (GPT-3.5 Turbo, GPT-4o, GPT-4.1, and Claude Opus 4) with curated prompts testing pragmatic competence across key dimensions, based on established pragmatic inference datasets and metrics [1]. A minimal sketch of one such probe follows the reference below.

Key Findings:

Experiment 1: GPT-3.5 Turbo demonstrated pragmatic flexibility, showing a ~3% drop in "false" rates when switching from pragmatic to literal interpretation (closest to the ~7% human benchmark). More recent models showed decreased flexibility: GPT-4o (1.5% drop), Claude Opus 4 (1% drop), and GPT-4.1 (0% drop).

Experiment 2: GPT-4.1 excelled, showing a 24.6% increase in "false" rates when scalar terms became conversationally relevant. GPT-3.5 Turbo showed minimal context sensitivity (~3%), while GPT-4o and Claude Opus 4 demonstrated no significant contextual awareness.

Experiment 3: All models failed to demonstrate social awareness in face-threatening versus face-boosting contexts (0% drop across conditions).

Discussion: Empirical evidence suggests that recent models exhibit a decline in pragmatic competence. GPT-3.5 Turbo emerges as the most pragmatically competent model, demonstrating flexibility in interpretation while maintaining basic conversational reasoning. Conversely, GPT-4.1's performance profile suggests a critical flaw: its success in Experiment 2 is likely an artifact of superficial pattern matching rather than genuine comprehension. It is logically inconsistent for a model to be sensitive to a context that modulates an implicature (Exp 2) if it cannot make the basic inference (Exp 1). Its success is therefore brittle and does not generalize, as shown by its failure in Exp 3.

[1] Doran, R., Ward, G., Larson, M., McNabb, Y., & Baker, R. E. (2012). A novel experimental paradigm for distinguishing between what is said and what is implicated. Language, 88(1), 124-154.
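For concreteness, the sketch below illustrates how an Experiment 1-style probe could be run: an underinformative "some" statement is judged under pragmatic versus literal instructions, and the drop in "false" rates between conditions serves as the flexibility measure. The item, instruction wording, sampling setup, and use of the OpenAI Python client are illustrative assumptions rather than the paper's actual materials; the underinformative-item design follows Doran et al. (2012) [1].

```python
# Hypothetical sketch of an Experiment 1-style probe: measure how often a
# model judges an underinformative "some" statement (literally true, but
# pragmatically infelicitous) as "false" under pragmatic vs. literal
# instructions. Item and instruction wording are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Literally true (some elephants are indeed mammals), but implicates the
# false "not all elephants are mammals" under a pragmatic reading.
ITEM = "Some elephants are mammals."

INSTRUCTIONS = {
    "pragmatic": "Judge the following statement as a cooperative speaker "
                 "would in conversation. Answer only 'true' or 'false'.",
    "literal": "Judge the following statement strictly by its literal, "
               "logical meaning. Answer only 'true' or 'false'.",
}

def false_rate(model: str, condition: str, n: int = 20) -> float:
    """Fraction of n samples in which the model answers 'false'."""
    falses = 0
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": INSTRUCTIONS[condition]},
                {"role": "user", "content": ITEM},
            ],
            temperature=1.0,  # sample so rates are non-degenerate
        )
        if "false" in resp.choices[0].message.content.lower():
            falses += 1
    return falses / n

for condition in ("pragmatic", "literal"):
    rate = false_rate("gpt-4o", condition)
    print(f"{condition}: false rate = {rate:.2f}")
# The drop in "false" rate from the pragmatic to the literal condition is
# the flexibility measure reported in Experiment 1.
```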
Submission Number: 341