Keywords: LLM code generation, Memorization
Abstract: Large language models (LLMs) have recently demonstrated exceptional code generation capabilities. However, there is a growing and active debate over whether LLMs are mostly performing memorization (i.e., replicating or reusing large parts of their training data) or generalization (i.e., solving problems beyond their training data). Existing evaluations largely proxy memorization with surface or structural similarity, thereby conflating benign reuse with harmful recall and neglecting task correctness under semantic variation. We define memorization behaviorally as failure at high similarity and introduce a semantic perturbation, code rewriting, which produces a semantically different answer of similar difficulty for a given coding question and then reverse-engineers a novel coding question from it. We further propose the Memorization Risk Index (MRI), a normalized score that combines two signals: (i) how similar the model’s answer for the rewritten task is to the original ground-truth solution, and (ii) how much performance drops from the original task to its rewritten counterpart. MRI is high only when both conditions hold, i.e., when the model outputs similar code but fails the perturbed task, thereby capturing harmful memorization rather than benign reuse. Empirical evaluations on the code generation benchmarks MBPP+ and BigCodeBench reveal remarkable findings: (1) memorization risk decreases as LLMs scale up, (2) supervised fine-tuning (SFT) improves accuracy while worsening memorization, and (3) reinforcement learning with PPO achieves a more balanced trade-off between memorization and generalization.
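For intuition, a minimal sketch of how such a score might be computed is given below. The abstract does not specify the exact normalization or combination used for MRI, so the specifics here (clipping to [0, 1] and a multiplicative combination of the two signals) are illustrative assumptions, not the paper's definition.

```python
# Illustrative sketch only: the exact MRI formula is not given in the abstract,
# so the normalization and combination below are assumptions for intuition.

def memorization_risk_index(sim_rewritten_to_original: float,
                            acc_original: float,
                            acc_rewritten: float) -> float:
    """Hypothetical MRI: high only when the answer to the rewritten task
    stays similar to the original ground truth AND performance drops."""
    # Signal (i): similarity of the model's rewritten-task answer to the
    # original ground-truth solution, assumed to lie in [0, 1].
    similarity = max(0.0, min(1.0, sim_rewritten_to_original))
    # Signal (ii): performance drop from the original task to its rewritten
    # counterpart, clipped to [0, 1] so improvements do not yield negative risk.
    perf_drop = max(0.0, min(1.0, acc_original - acc_rewritten))
    # Combine multiplicatively so the score is high only when both signals are high,
    # distinguishing harmful memorization from benign reuse.
    return similarity * perf_drop


# Example: similar output (0.9) with a large accuracy drop (0.8 -> 0.3) => high risk.
print(memorization_risk_index(0.9, 0.8, 0.3))  # ~0.45
```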
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16884