MIRAGE: A Metrics lIbrary for Rating hAllucinations in Generated tExt

Benjamin Vendeville, Liana Ermakova, Pierre De Loor, Jaap Kamps

Published: 10 Nov 2025, Last Modified: 05 Jan 2026. License: CC BY-SA 4.0
Abstract: Hallucinations, that is, errors in the output of natural language generation systems, remain a critical challenge, particularly in high-stakes domains such as healthcare and science communication. Several automatic metrics have been proposed to detect and quantify hallucinations, including FactCC, QAGS, FEQA, and FactAcc, but their implementations are often unavailable, difficult to reproduce, or incompatible with modern development workflows. We introduce MIRAGE, an open-source Python library designed to address these limitations. MIRAGE re-implements key hallucination evaluation metrics in a unified library built on the Hugging Face framework, offering modularity and standardized inputs and outputs. By adhering to the FAIR principles, MIRAGE promotes reproducibility, accelerates experimentation, and supports the development of future hallucination metrics. We validate MIRAGE by re-evaluating existing metrics on benchmark datasets, demonstrating comparable performance while significantly improving usability and transparency.
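
The abstract does not document MIRAGE's API, but the sketch below illustrates the kind of factual-consistency check that metrics such as FactCC build on, using the Hugging Face framework the library targets. This is a minimal sketch under stated assumptions: the NLI model (roberta-large-mnli), the generic text-classification pipeline, and the premise/hypothesis framing are illustrative choices, not MIRAGE's actual interface or implementation.

```python
# Hedged sketch: NLI-style factual-consistency scoring of generated text
# against its source, the technique underlying classifier-based
# hallucination metrics. Not MIRAGE's actual API.
from transformers import pipeline

# Any premise/hypothesis entailment model from the Hugging Face Hub works;
# roberta-large-mnli predicts CONTRADICTION / NEUTRAL / ENTAILMENT.
nli = pipeline("text-classification", model="roberta-large-mnli")

source = "The trial enrolled 120 patients and ran for six months."
generated = "The trial enrolled 500 patients."

# Treat the source document as the premise and the generated sentence as
# the hypothesis; a CONTRADICTION label (or a low ENTAILMENT score) flags
# a likely hallucination.
result = nli({"text": source, "text_pair": generated})
print(result)  # e.g. [{'label': 'CONTRADICTION', 'score': 0.98}]
```

A unified library in the spirit described by the abstract would wrap checks like this behind a common input/output schema (source text, generated text in; a standardized consistency score out), so that FactCC-, QA-, and entailment-based metrics can be swapped and compared on the same data.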