Reasoning or Memorization? Investigating LLMs’ Capability in Restoring Chinese Internet Homophones

Published: 07 Jul 2025, Last Modified: 10 Jul 2025
Venue: KnowFM @ ACL 2025
License: CC BY 4.0
Keywords: Chinese homophone, memorization
Abstract: Chinese homophones, prevalent in Internet culture, introduce linguistic twists that are challenging for language models. While native speakers disambiguate them through phonological reasoning and contextual understanding, it remains untested how well LLMs perform on this task, and whether they do so through similar reasoning or merely by memorizing homophone-original word pairs seen during training. In this paper, we present HomoP-CN, the first dataset of Chinese Internet homophones with systematic perturbations for evaluating LLMs' homophone restoration capabilities. Using this benchmark, we investigate the influence of semantic, phonological, and graphemic features on LLMs' restoration accuracy, measure each model's reliance on memorization via consistency ratios under controlled perturbations, and assess the effectiveness of various prompting strategies, including contextual cues, pinyin augmentation, few-shot learning, and chain-of-thought prompting.
Archival Status: Archival (included in proceedings)
Submission Number: 59