Enhancing LLM Character-Level Manipulation via Divide and Conquer

ACL ARR 2025 May Submission 7097 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) have shown impressive generalization across diverse natural language processing tasks. However, they consistently struggle with character-level string manipulations such as deletion, insertion, and substitution, despite the foundational role of these operations in data preprocessing and code generation. This gap raises a critical question: *Why do LLMs, despite their strong token-level capabilities, fail at basic character-level manipulations?* To address this question, we conduct a systematic analysis and uncover two key findings: (1) LLMs have limited ability to leverage intrinsic token knowledge for fine-grained character reasoning, and (2) decomposing words into atomized structures can significantly enhance their sensitivity to the internal character structure of tokens. Building on these insights, we propose Character-Level Manipulation via Divide and Conquer, a novel framework that bridges the gap between token-level processing and character-level manipulation. Our approach decomposes complex tasks into explicit character-level subtasks, followed by a controlled token reconstruction phase. This method yields significant accuracy improvements without requiring additional model training. Empirical results show that our method achieves notable gains on deletion, insertion, and substitution benchmarks. We release our implementation and evaluation suite to support future research in character-aware language modeling.
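To make the pipeline concrete, below is a minimal Python sketch of the decompose/edit/reconstruct flow the abstract describes. Note this is an illustrative assumption, not the authors' implementation: in the paper each stage is carried out by prompting the LLM itself, whereas the helper functions here (`decompose`, `apply_edit`, `reconstruct`) are hypothetical pure-Python stand-ins for those prompted subtasks.

```python
# Sketch of the three-stage Divide and Conquer pipeline from the abstract.
# The stage structure (decompose -> edit -> reconstruct) follows the paper's
# description; the function names and deterministic logic are assumptions.

def decompose(word: str) -> list[str]:
    """Stage 1: atomize the token into an explicit character sequence."""
    return list(word)

def apply_edit(chars: list[str], op: str, target: str, new: str = "") -> list[str]:
    """Stage 2: perform the character-level operation on the atomized form."""
    out: list[str] = []
    for c in chars:
        if c == target:
            if op == "delete":
                continue              # drop the matched character
            if op == "substitute":
                out.append(new)       # replace the matched character
                continue
            if op == "insert":
                out.extend([c, new])  # insert after the matched character
                continue
        out.append(c)
    return out

def reconstruct(chars: list[str]) -> str:
    """Stage 3: rebuild a normal token from the edited character sequence."""
    return "".join(chars)

if __name__ == "__main__":
    word = "character"
    print(reconstruct(apply_edit(decompose(word), "delete", "a")))           # chrcter
    print(reconstruct(apply_edit(decompose(word), "substitute", "c", "k")))  # kharakter
```

The point of the decomposition is that each atomized character becomes its own token-like unit, so the model no longer has to reason about characters hidden inside a single multi-character token.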
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: large language model, interpretability
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7097