Hallucination in the Wild: A Field Guide for LLM Users

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 | MSLD 2026 Poster | Readers: Everyone | License: CC BY 4.0
Keywords: conversational AI, AI, LLMs, hallucination
TL;DR: Hallucination is not a bug but a structural feature of LLMs, and careful, claim-level, dialogue-aware evaluation is essential to manage it.
Abstract: Large language models produce text that is fluent, coherent, and often persuasive. They also produce false statements with the same fluency and confidence. These so-called “hallucinations” are not rare failures but predictable consequences of next-token prediction trained over noisy, conflicting, internet-scale data. In other words, plausibility is optimized; truth is incidental. This talk examines hallucination in conversational retrieval-augmented generation (RAG) systems. I first outline why confabulation emerges, focusing on four pressures: optimization for fluency, frequency effects where popularity diverges from truth, the statistical bias toward plausibility over factuality, and the epistemic messiness of training data. I then survey current detection approaches, including human evaluation, LLM-as-judge methods, uncertainty estimation via semantic entropy, and natural language inference–based fact scoring. Each approach has strengths, but all struggle with dialogue-specific challenges such as subjectivity, abstentions, and cross-turn consistency. I introduce VISTA (Verification in Sequential Turn-based Assessment), a claim-level evaluation framework grounded in the linguistic notion of common ground. VISTA decomposes each conversational turn into atomic claims and classifies each claim as verified, contradicted, lacking evidence, abstention, or out of scope. Across multiple conversational RAG datasets and eight models, VISTA improves hallucination detection accuracy relative to LLM-as-judge and FActScore, with particularly large gains for smaller open-weight models. I conclude by discussing mitigation strategies at the model, instruction, context, system, and interface levels, and by arguing that as models become more conversational and more human-like, fine-grained, dialogue-aware evaluation grounded in linguistic theory will become increasingly necessary.
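For readers unfamiliar with semantic entropy, the usual formulation in the uncertainty-estimation literature samples several answers s to a prompt x, clusters them into semantic-equivalence classes c (typically via bidirectional entailment), and takes the entropy over the cluster distribution:

$$\mathrm{SE}(x) = -\sum_{c} p(c \mid x)\,\log p(c \mid x), \qquad p(c \mid x) \propto \sum_{s \in c} p(s \mid x)$$

High semantic entropy means the sampled answers disagree in meaning rather than merely in wording, which is the signal used as a hallucination proxy.

The abstract describes VISTA only at a high level. The Python sketch below shows one plausible shape for its claim-level, five-way labeling; it is an illustration under stated assumptions, not the framework's actual implementation. In particular, extract_claims, nli_relation, and in_scope are hypothetical placeholders, and the abstention heuristic is a crude stand-in.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class ClaimLabel(Enum):
    VERIFIED = "verified"
    CONTRADICTED = "contradicted"
    LACKING_EVIDENCE = "lacking evidence"
    ABSTENTION = "abstention"
    OUT_OF_SCOPE = "out of scope"

@dataclass
class ClaimVerdict:
    claim: str
    label: ClaimLabel

def is_abstention(claim: str) -> bool:
    # Crude surface heuristic; a real system would detect abstention far more robustly.
    return claim.lower().startswith(("i don't know", "i cannot", "i'm not sure"))

def evaluate_turn(
    turn_text: str,
    evidence: list[str],
    extract_claims: Callable[[str], list[str]],    # hypothetical: turn -> atomic claims
    nli_relation: Callable[[str, list[str]], str], # hypothetical: NLI scorer returning
                                                   #   "entailment" | "contradiction" | "neutral"
    in_scope: Callable[[str], bool],               # hypothetical: filters subjective/off-topic claims
) -> list[ClaimVerdict]:
    """Decompose one conversational turn into atomic claims and label each one."""
    verdicts: list[ClaimVerdict] = []
    for claim in extract_claims(turn_text):
        if is_abstention(claim):
            label = ClaimLabel.ABSTENTION
        elif not in_scope(claim):  # e.g. opinions, chit-chat, subjective content
            label = ClaimLabel.OUT_OF_SCOPE
        else:
            relation = nli_relation(claim, evidence)
            if relation == "entailment":
                label = ClaimLabel.VERIFIED
            elif relation == "contradiction":
                label = ClaimLabel.CONTRADICTED
            else:  # "neutral": evidence neither supports nor refutes the claim
                label = ClaimLabel.LACKING_EVIDENCE
        verdicts.append(ClaimVerdict(claim, label))
    return verdicts
```

A turn-level hallucination score could then be derived, for instance, as the fraction of in-scope factual claims labeled contradicted or lacking evidence; this claim-level granularity is plausibly what lets such a framework separate partial errors from wholesale fabrication where coarser turn-level judges cannot.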
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 5