Are Large Language Models General-Purpose Solvers for Dialogue Breakdown Detection? An Empirical Investigation
Abstract: This study addresses the challenge of dialogue breakdown, defined as incoherence, irrelevance, or any disruption that significantly hampers the flow of a conversation. Dialogue breakdowns have become a critical concern as companies that operate dialogue-based systems, such as Salesforce, Amazon, and Microsoft, adopt large language models across industries. Leveraging the Dialogue Breakdown Detection Challenge dataset, we investigate how well generalist large language models, including ChatGPT, GPT-4, and Mistral-Medium, identify instances of dialogue breakdown without domain-specific fine-tuning. Through a series of experiments employing zero-shot and few-shot prompting combined with chain-of-thought reasoning, this research sets a new benchmark in the field. Our findings reveal that GPT-4 outperforms both the specialized models previously considered state-of-the-art and other generalist models at detecting dialogue breakdowns, exceeding them by more than a 3% margin and achieving an accuracy of 82.0%. To our knowledge, this study is the first to demonstrate this enhanced capability of generalist large language models in the domain. Our experiments also show that larger models such as GPT-4 are less sensitive to how they are prompted when detecting dialogue breakdowns, whereas smaller models such as ChatGPT and Mistral-Medium improve when prompted with techniques that combine few-shot learning with chain-of-thought reasoning. Finally, we propose future research directions, including deeper error analysis and the development of an Ensemble RAG approach for improved generalization in dialogue breakdown detection.
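To make the prompting setup concrete, the sketch below illustrates a few-shot prompt with chain-of-thought reasoning for classifying a system utterance as a breakdown or not. This is a minimal illustration, not the paper's actual prompt or code: the model name, label strings, example dialogues, and prompt wording are assumptions chosen for demonstration, and the call uses the standard OpenAI chat completions client.

```python
# Illustrative sketch only: the paper does not publish its exact prompts or code.
# Model name, labels, example dialogues, and wording below are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Few-shot examples with chain-of-thought rationales (hypothetical, not drawn from the DBDC dataset).
FEW_SHOT = """Example 1:
Dialogue context:
  User: What's the weather like in Kyoto tomorrow?
  System: I enjoy eating apples.
Reasoning: The reply ignores the user's question entirely, disrupting the flow of the conversation.
Label: BREAKDOWN

Example 2:
Dialogue context:
  User: Can you recommend a book on machine learning?
  System: Many readers start with "Pattern Recognition and Machine Learning" by Bishop.
Reasoning: The reply is relevant and coherent, so the conversation continues smoothly.
Label: NOT_BREAKDOWN
"""


def classify_breakdown(context: str, system_utterance: str, model: str = "gpt-4") -> str:
    """Ask the model whether the latest system utterance causes a dialogue breakdown."""
    prompt = (
        "You are evaluating dialogue systems. Decide whether the latest system utterance "
        "causes a dialogue breakdown (incoherence, irrelevance, or any disruption of the "
        "conversation flow). Think step by step, then answer BREAKDOWN or NOT_BREAKDOWN.\n\n"
        f"{FEW_SHOT}\n"
        "Now evaluate:\n"
        f"Dialogue context:\n{context}\n"
        f"System: {system_utterance}\n"
        "Reasoning:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    context = "User: Could you tell me how to get to the station?"
    print(classify_breakdown(context, "Bananas are yellow."))
```

Dropping the FEW_SHOT block from the prompt yields the zero-shot variant; the abstract's finding is that this choice matters less for GPT-4 than for the smaller models.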