Firm or Fickle? Evaluating Large Language Models' Consistency in Sequential Interactions

Published: 06 Oct 2025 · Last Modified: 04 Nov 2025 · MTI-LLM @ NeurIPS 2025 Poster · CC BY-ND 4.0
Keywords: Large Language Models, Multi-Turn, Consistency
TL;DR: A framework to improve LLM response consistency through a new scoring system, benchmark dataset, and confidence-aware generation method.
Abstract: Large Language Models (LLMs) have shown remarkable capabilities across a wide range of tasks, but their deployment in high-stakes domains requires consistent and coherent behavior across multiple rounds of user interaction. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we introduce \textbf{Position-Weighted Consistency (PWC)}, a metric designed to capture both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present \textbf{MT-Consistency}, a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under a variety of challenging follow-up scenarios. Third, we introduce \textbf{Confidence-Aware Response Generation (CARG)}, a framework that improves response stability by explicitly integrating the model's internal confidence scores into the generation process. Experimental results demonstrate that CARG significantly improves response stability without sacrificing accuracy, offering a practical path toward more dependable LLM behavior in critical, real-world deployments.
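To make the intuition behind a position-weighted consistency score concrete, here is a minimal Python sketch of one way such a metric could be computed from the abstract's description alone. The exponential weighting scheme, the `decay` parameter, the function name, and the input representation (a per-turn flag indicating whether the model's answer still agrees with its initial answer) are all illustrative assumptions, not the paper's exact definition of PWC.

```python
from typing import Sequence

def position_weighted_consistency(consistent: Sequence[bool], decay: float = 0.7) -> float:
    """Weighted fraction of follow-up turns whose answer agrees with the initial answer.

    Earlier turns receive larger weights (decay ** t), so an early flip hurts the
    score more than a late one, while later consistent turns let a model that
    recovers after an early flip claw back part of the score. This mirrors the
    abstract's emphasis on early-stage stability and recovery patterns.
    """
    if not consistent:
        raise ValueError("need at least one follow-up turn")
    weights = [decay ** t for t in range(len(consistent))]
    weighted = sum(w * float(c) for w, c in zip(weights, consistent))
    return weighted / sum(weights)

# Example: a model that flips on the first challenge but recovers afterwards...
print(position_weighted_consistency([False, True, True, True]))  # ~0.61
# ...scores lower than one that stays firm early and flips only at the end.
print(position_weighted_consistency([True, True, True, False]))  # ~0.86
```

Under these assumptions, an unweighted consistency rate would score both example trajectories identically at 0.75; the position weighting is what separates early instability from late drift.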
Submission Number: 120