Abstract: Large language models (LLMs) have made significant progress on complex tasks, yet some seemingly basic questions remain surprisingly unsolved. In practice, LLMs are prone to hallucination on free-form questions about Chinese characters and words, which inconveniences ordinary users and language learners who rely on LLMs to acquire knowledge of Chinese. To investigate this issue quantitatively, we introduce ZiCiEval, a dataset covering five types of real-world questions about Chinese characters and words. For reliable automatic evaluation, we develop an LLM-as-judge framework enhanced with adaptive tool use. Empirical results reveal substantial performance gaps among advanced language models: on some tasks, even the top-performing models reach only about 70% accuracy. The resources will be made publicly available to facilitate further research.
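The abstract only names the evaluation framework, so the sketch below is a rough illustration of what an LLM-as-judge with adaptive tool use could look like for character-level questions. All identifiers here (call_llm, lookup_dictionary, judge) are hypothetical placeholders introduced for illustration and do not reflect the paper's actual implementation.

```python
# Illustrative sketch only: a judge that verifies answers to Chinese character/word
# questions, optionally consulting a dictionary lookup tool when the question
# involves verifiable facts (strokes, radicals, pinyin). Function names are
# hypothetical placeholders, not the paper's API.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a judge LLM; replace with a real client."""
    raise NotImplementedError

def lookup_dictionary(character: str) -> str:
    """Placeholder tool: return dictionary facts (strokes, pinyin, radical)."""
    raise NotImplementedError

def judge(question: str, model_answer: str, reference: str) -> bool:
    """Return True if the model answer is judged consistent with the reference.

    The judge first decides whether a dictionary lookup would help (adaptive
    tool use), then grades the answer with any retrieved evidence attached.
    """
    tool_decision = call_llm(
        f"Question: {question}\n"
        "If verifying the answer requires a dictionary lookup, reply with the "
        "single character to look up; otherwise reply NONE."
    )
    evidence = ""
    if tool_decision.strip() != "NONE":
        evidence = lookup_dictionary(tool_decision.strip())

    verdict = call_llm(
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {model_answer}\n"
        f"Tool evidence: {evidence or 'n/a'}\n"
        "Reply CORRECT or INCORRECT."
    )
    return verdict.strip().upper().startswith("CORRECT")
```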
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, automatic evaluation of datasets
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 6251