"Do You Truly Love Me?" Benchmarking LLM Capability on Hierarchical Pragmatic Tactic Conversations

ACL ARR 2026 January Submission2217 Authors

02 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: pragmatic reasoning; benchmark; LLM; Pragmatic Tactic
Abstract: Pragmatic reasoning, including inferring intent, manipulation, and hidden meaning in dialogue, is essential for trustworthy language understanding yet remains underexplored in current large language models (LLMs). We introduce $\textbf{HIPO}$, a new benchmark designed to assess $\textbf{Hi}$erarchical $\textbf{P}$ragmatic Tactics in multi-turn C$\textbf{o}$nversations. Grounded in linguistic theory, HIPO decomposes utterances according to a pragmatic reasoning framework and annotates each utterance along one dialogue-level goal and three utterance-level pragmatic tactic dimensions: communicative intention ($\textit{why}$), veracity strategy ($\textit{how}$), and illocutionary act ($\textit{what}$). To ensure reliable supervision, we design a structured generation pipeline that produces high-quality synthetic dialogues. The benchmark comprises 4,088 benchmarking utterances and 6,350 contextual utterances across 1,131 dialogues drawn from 31 real-world-inspired scenarios. Benchmarking 22 state-of-the-art LLMs reveals a striking gap: while models excel at recognizing surface speech acts (83.3% accuracy), they perform poorly at detecting veracity strategies (32.2%) and speaker intentions (45.1%). These results highlight the limitations of current LLMs in pragmatic understanding. Furthermore, we demonstrate that high-quality synthetic data generated by HIPO can substantially improve model performance through supervised fine-tuning, suggesting a promising direction for closing the gap.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Discourse and Pragmatics
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 2217