"Do You Truly Love Me?" Benchmarking LLM Capability on Hierarchical Pragmatic Tactic Conversations

ACL ARR 2026 January Submission2217 Authors

02 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: pragmatic reasoning; benchmark; LLM; Pragmatic Tactic
Abstract: Pragmatic reasoning, including inferring intent, manipulation, and hidden meaning in dialogue, is essential for trustworthy language understanding yet remains underexplored in current large language models (LLMs). We introduce $\textbf{HIPO}$, a new benchmark designed to assess $\textbf{Hi}$erarchical $\textbf{P}$ragmatic Tactics in multi-turn C$\textbf{o}$nversations. Grounded in linguistic theory, HIPO decomposes utterances according to a pragmatic reasoning framework and annotates each utterance along one dialogue-level goal and three utterance-level pragmatic tactic dimensions: communicative intention ($\textit{why}$), veracity strategy ($\textit{how}$), and illocutionary act ($\textit{what}$). To ensure reliable supervision, we design a structured generation pipeline that produces high-quality synthetic dialogues. The benchmark comprises 4,088 benchmarking utterances and 6,350 contextual utterances across 1,131 dialogues drawn from 31 real-world-inspired scenarios. Benchmarking 22 state-of-the-art LLMs reveals a striking gap: while models excel at recognizing surface speech acts (83.3% accuracy), they perform poorly at detecting veracity strategies (32.2%) and speaker intentions (45.1%). These results highlight the limitations of current LLMs in pragmatic understanding. Furthermore, we demonstrate that high-quality synthetic data generated by HIPO can substantially improve model performance through supervised fine-tuning, suggesting a promising direction for closing the gap.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Discourse and Pragmatics
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 2217