Keywords: LLM code generation, Memorization
Abstract: Large language models (LLMs) have recently demonstrated exceptional code generation capabilities. However, there is a growing and active debate over whether LLMs are mostly performing memorization (i.e., replicating or reusing large parts of their training data) or generalization (i.e., solving problems beyond their training data). Existing evaluations largely proxy memorization with surface or structural similarity, thereby conflating benign reuse with harmful recall and neglecting task correctness under semantic variation. We define memorization behaviorally as failure at high similarity and introduce a semantic perturbation, code rewriting, which produces a semantically different answer of similar difficulty for a given coding question and then reverse-engineers a novel coding question from it. We further propose the Memorization Risk Index (MRI), a normalized score that combines two signals: (i) how similar the model’s answer for the rewritten task is to the original ground-truth solution, and (ii) how much performance drops from the original task to its rewritten counterpart. MRI is high only when both conditions hold, i.e., when the model outputs similar code but fails the perturbed task, thereby capturing harmful memorization rather than benign reuse. Empirical evaluations on the code generation benchmarks MBPP+ and BigCodeBench reveal remarkable findings: (1) memorization risk decreases as LLMs scale up, (2) supervised fine-tuning (SFT) improves accuracy while worsening memorization, and (3) reinforcement learning with PPO achieves a more balanced trade-off between memorization and generalization.
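For intuition, a minimal sketch of how such a score might be computed is given below. The abstract does not specify the exact normalization or combination used for MRI, so the specifics here (clipping to [0, 1] and a multiplicative combination of the two signals) are illustrative assumptions, not the paper's definition.

```python
# Illustrative sketch only: the exact MRI formula is not given in the abstract,
# so the normalization and combination below are assumptions for intuition.

def memorization_risk_index(sim_rewritten_to_original: float,
                            acc_original: float,
                            acc_rewritten: float) -> float:
    """Hypothetical MRI: high only when the answer to the rewritten task
    stays similar to the original ground truth AND performance drops."""
    # Signal (i): similarity of the model's rewritten-task answer to the
    # original ground-truth solution, assumed to lie in [0, 1].
    similarity = max(0.0, min(1.0, sim_rewritten_to_original))
    # Signal (ii): performance drop from the original task to its rewritten
    # counterpart, clipped to [0, 1] so improvements do not yield negative risk.
    perf_drop = max(0.0, min(1.0, acc_original - acc_rewritten))
    # Combine multiplicatively so the score is high only when both signals are high,
    # distinguishing harmful memorization from benign reuse.
    return similarity * perf_drop


# Example: similar output (0.9) with a large accuracy drop (0.8 -> 0.3) => high risk.
print(memorization_risk_index(0.9, 0.8, 0.3))  # ~0.45
```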
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16884