ZiCiEval: Challenging Large Language Models with Seemingly Basic Chinese Character-Word Questions

ACL ARR 2025 May Submission6251 Authors

20 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Large language models (LLMs) have made significant progress on complex tasks, yet some seemingly basic questions remain unexpectedly unsolved. In practice, LLMs are prone to hallucination on free-form questions about Chinese characters and words, which inconveniences ordinary users and language learners who rely on LLMs to acquire Chinese knowledge. To investigate this issue quantitatively, we introduce ZiCiEval, a dataset covering five types of real-world Chinese character-word questions. For reliable automatic evaluation, we develop an LLM-as-judge framework enhanced with adaptive tool use. Empirical results reveal substantial performance gaps among advanced language models: on some tasks, even the top-performing models reach only ~70% accuracy. The resources will be publicly available to facilitate further research.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, automatic evaluation of datasets
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 6251