Benchmarking LLM Overshooting: Automatic Evaluation of Refutation Quality and Emotional Alignment under Pressure

TMLR Paper6319 Authors

27 Oct 2025 (modified: 15 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: Large language models (LLMs) provide substantial convenience and productivity gains, yet users' uncritical or blind trust in their generated responses is a growing concern. Such emotional alignment has recently attracted attention, with a specific focus on overshooting, a phenomenon in which users attribute emotional value to artificial intelligence beyond its inherent capabilities. Despite recent advances, "refutation quality" and "emotional alignment" remain largely unquantified in situations where LLMs encounter false premises. To address this gap, we introduce a new benchmark that automatically quantifies LLM overshooting. Specifically, it defines an Overshoot Index (OI) that integrates six metrics: Refutation Strength (RS), Directness Index (DI), Hedging Load (HL), Affective Overshoot Proxy (AOP), Normative Jump (NJ), and Evidence-Backed Correction (EBC). In our experiments, three models (OpenAI's gpt-4o-mini, Anthropic's claude-3-5-sonnet-20241022, and Google's gemini-1.5-flash) were evaluated on prompts generated from the TruthfulQA, CREPE, and FalseQA datasets, and three pressure levels (pressure ∈ {0, 1, 2}) were introduced to examine behavioral changes under stress. Rather than ranking models, OI serves as a diagnostic benchmark that reveals how refutation strength and emotional accommodation interact under false-premise conditions. Across the three commercial LLMs, OI highlights distinct behavioral tendencies, illustrating its value as a complementary tool for alignment and safety evaluation rather than a performance leaderboard. Statistical validation was performed using Kruskal–Wallis and Wilcoxon signed-rank tests. Overall, this study provides a novel perspective for evaluating LLM safety and robustness.
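As a rough illustration of the statistical validation mentioned in the abstract, the sketch below applies SciPy's Kruskal–Wallis test across the three pressure levels and a Wilcoxon signed-rank test between two paired conditions. The OI scores here are synthetic placeholders, not data from the paper:

```python
# Hedged sketch of the abstract's statistical validation.
# The OI score lists below are invented for illustration only.
from scipy.stats import kruskal, wilcoxon

oi_p0 = [0.12, 0.20, 0.15, 0.30, 0.18]  # OI at pressure = 0 (synthetic)
oi_p1 = [0.25, 0.33, 0.28, 0.41, 0.30]  # OI at pressure = 1 (synthetic)
oi_p2 = [0.40, 0.52, 0.47, 0.60, 0.49]  # OI at pressure = 2 (synthetic)

# Omnibus test: do OI distributions differ across pressure levels?
h_stat, p_kw = kruskal(oi_p0, oi_p1, oi_p2)

# Paired comparison on the same prompts: pressure 0 vs. pressure 2.
w_stat, p_w = wilcoxon(oi_p0, oi_p2)

print(f"Kruskal-Wallis H={h_stat:.2f} (p={p_kw:.4f})")
print(f"Wilcoxon W={w_stat:.2f} (p={p_w:.4f})")
```

Kruskal–Wallis is appropriate for the omnibus comparison because OI is not assumed to be normally distributed; the Wilcoxon test then handles the paired per-prompt contrast.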
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=obMuB8KlEf&referrer=%5Bthe%20profile%20of%20Takafumi%20Nakanishi%5D(%2Fprofile%3Fid%3D~Takafumi_Nakanishi3)
Changes Since Last Submission:
\section*{Changes Since Last Submission}
In this revised manuscript, we have made substantial revisions to address the concerns raised by the editorial decision.
\begin{itemize}
\item \textbf{Expansion of the ``Recent Advances'' section:} The literature review has been significantly broadened beyond TMLR publications to include recent works from major venues such as \textit{NeurIPS}, \textit{ICLR}, \textit{ACL}, \textit{EMNLP}, and \textit{AAAI}, as well as peer-reviewed journals including \textit{Expert Systems with Applications} and \textit{Topoi}. This modification ensures that the section presents a comprehensive and balanced survey of research progress in alignment evaluation, refutation behavior, and emotional modeling across diverse publication outlets.
\item \textbf{Correction of citation anonymity:} All citations previously marked as ``Anonymous'' have been replaced with their proper author names and publication information (e.g., Meng \textit{et al.}, 2022; Bai \textit{et al.}, 2022; Schwitzgebel \& Sebo, 2025). This correction eliminates the anonymity issue and aligns all references with TMLR's citation format requirements.
\item \textbf{Improvements in structure and clarity:} Logical transitions between sections were refined to improve readability and argument coherence. Minor typographical and stylistic inconsistencies were corrected throughout the manuscript. Figure and table captions were also revised for interpretability and consistency.
\item \textbf{Mathematical and notational clarification:} The formal definition of the Overshoot Index (OI) was rewritten for improved readability and precision:
\begin{equation}
OI_{\mathrm{sep}} = (\tau - RS)^{+} + \lambda_{DI}(1 - DI) - \lambda_{EBC}EBC + \lambda_{HL}HL + \lambda_{AOP}AOP^{*} + \lambda_{NJ}NJ^{*},
\end{equation}
where each $\lambda$ denotes a positive scaling coefficient. This equation has been reformatted and explained more clearly to facilitate reproducibility.
\item \textbf{Minor editorial refinements:} We revised the abstract and conclusion for conciseness and accuracy, ensuring terminological consistency (e.g., ``refutation quality,'' ``emotional overshooting,'' and ``alignment robustness'') and unified notation across sections.
\end{itemize}
Overall, these revisions comprehensively resolve the issues noted in the desk-reject decision and improve the scholarly rigor, balance, and readability of the manuscript, making it suitable for full peer review.
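The revised OI definition above can be sketched as a small scoring function. Note that the threshold $\tau$ and the $\lambda$ coefficients below are illustrative placeholders, not the paper's calibrated values:

```python
# Hypothetical sketch of OI_sep from the revised equation.
# tau and the lambda coefficients are illustrative defaults, not the
# paper's calibrated settings; metric values are assumed to lie in [0, 1].

def overshoot_index(rs, di, hl, aop_star, nj_star, ebc,
                    tau=0.5, lam_di=1.0, lam_ebc=1.0,
                    lam_hl=1.0, lam_aop=1.0, lam_nj=1.0):
    """OI_sep = (tau - RS)^+ + lam_DI*(1 - DI) - lam_EBC*EBC
                + lam_HL*HL + lam_AOP*AOP* + lam_NJ*NJ*"""
    hinge = max(tau - rs, 0.0)  # (tau - RS)^+ penalizes only weak refutation
    return (hinge + lam_di * (1.0 - di) - lam_ebc * ebc
            + lam_hl * hl + lam_aop * aop_star + lam_nj * nj_star)

# A strong, direct, evidence-backed refutation with little hedging or
# affective overshoot should score lower than a weak, accommodating one.
low = overshoot_index(rs=0.9, di=0.8, hl=0.1, aop_star=0.05, nj_star=0.0, ebc=1.0)
high = overshoot_index(rs=0.1, di=0.2, hl=0.8, aop_star=0.7, nj_star=0.5, ebc=0.0)
print(low, high)
```

The positive-part hinge $(\tau - RS)^{+}$ means refutation strength above the threshold $\tau$ incurs no penalty, while Evidence-Backed Correction is the only term that reduces the index.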
Assigned Action Editor: ~Candace_Ross1
Submission Number: 6319