Abstract: Large language models (LLMs) have made significant progress on complex tasks, yet some seemingly basic questions remain surprisingly unsolved. In practice, LLMs are prone to hallucination on free-form questions about Chinese characters and words, which inconveniences ordinary users and language learners who rely on LLMs to acquire knowledge of Chinese. To investigate this issue quantitatively, we introduce ZiCiEval, a dataset covering five types of real-world questions about Chinese characters and words. For reliable automatic evaluation, we develop an LLM-as-judge framework enhanced with adaptive tool use. Empirical results reveal substantial performance gaps among advanced language models: on some tasks, even the top-performing models reach only about 70% accuracy. The resources will be made publicly available to facilitate further research.
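The abstract only names the evaluation framework, so the sketch below is a rough illustration of what an LLM-as-judge with adaptive tool use could look like for character-level questions. All identifiers here (call_llm, lookup_dictionary, judge) are hypothetical placeholders introduced for illustration and do not reflect the paper's actual implementation.

```python
# Illustrative sketch only: a judge that verifies answers to Chinese character/word
# questions, optionally consulting a dictionary lookup tool when the question
# involves verifiable facts (strokes, radicals, pinyin). Function names are
# hypothetical placeholders, not the paper's API.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a judge LLM; replace with a real client."""
    raise NotImplementedError

def lookup_dictionary(character: str) -> str:
    """Placeholder tool: return dictionary facts (strokes, pinyin, radical)."""
    raise NotImplementedError

def judge(question: str, model_answer: str, reference: str) -> bool:
    """Return True if the model answer is judged consistent with the reference.

    The judge first decides whether a dictionary lookup would help (adaptive
    tool use), then grades the answer with any retrieved evidence attached.
    """
    tool_decision = call_llm(
        f"Question: {question}\n"
        "If verifying the answer requires a dictionary lookup, reply with the "
        "single character to look up; otherwise reply NONE."
    )
    evidence = ""
    if tool_decision.strip() != "NONE":
        evidence = lookup_dictionary(tool_decision.strip())

    verdict = call_llm(
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {model_answer}\n"
        f"Tool evidence: {evidence or 'n/a'}\n"
        "Reply CORRECT or INCORRECT."
    )
    return verdict.strip().upper().startswith("CORRECT")
```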
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, automatic evaluation of datasets
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 6251