Abstract: Despite the rapid advancement of large language models, they remain highly susceptible to generating hallucinations, which significantly hinders their widespread application.
Hallucination research requires dynamic and fine-grained evaluation.
However, most existing hallucination benchmarks (especially in Chinese) rely on human annotation, making automatic and cost-effective hallucination evaluation challenging.
To address this, we introduce HaluAgent, an agentic framework that automatically constructs fine-grained QA datasets from given knowledge documents.
Our experiments demonstrate that manually designed rules and prompt optimization can improve the quality of the generated data.
Using HaluAgent, we construct C-FAITH, a Chinese QA hallucination benchmark built from 1,399 knowledge documents obtained via web scraping, totaling 60,702 entries.
We comprehensively evaluate 16 mainstream LLMs on the proposed C-FAITH, providing detailed experimental results and analysis.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: hallucination, benchmark, agent
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: Chinese
Submission Number: 5758