Beyond Benchmark Scores: Evaluating Robustness and Memorization in LLM Code Generation with Evolved Questions

ACL ARR 2025 May Submission 6338 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: With the rapid advancement of large language models (LLMs), their ability to generate code has received significant attention. Evaluating this capability requires assessing both robustness and memorization behaviors. Robustness expects that syntactic modifications of a prompt that leave its semantics unchanged should produce functionally equivalent generated code. Conversely, memorization occurs when an LLM produces code very similar to solutions seen during training even though the prompt's meaning has changed. In this paper, we systematically investigate these phenomena by introducing three prompt-variation strategies: mutation (minor textual noise), paraphrasing (different wording, same meaning), and code-rewriting (similar wording, different meaning). Based on these strategies, we propose two metrics to quantify these behaviors: the Robustness Ratio (RR), which measures how consistently models solve tasks despite textual perturbations, and the Memorization Risk Index (MRI), which captures how often models reproduce known solutions despite semantic prompt changes. Our experiments show that robustness generally declines as task complexity increases and model size decreases. Additionally, supervised fine-tuning (SFT) significantly improves accuracy on the original prompts, but often at the expense of increased memorization, whereas proximal policy optimization (PPO) provides a more balanced trade-off.
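The abstract does not give formal definitions of RR and MRI, so the sketch below is a minimal, purely illustrative reading of them: it assumes RR is the fraction of originally solved tasks that remain solved under semantics-preserving variants (mutation, paraphrasing), and MRI is the fraction of code-rewriting variants (changed semantics) for which the generated code still closely matches the original reference solution. The record fields, the similarity measure, and the 0.8 threshold are hypothetical choices for illustration, not the paper's definitions.

# Illustrative sketch only; not the paper's exact RR/MRI formulas.
from difflib import SequenceMatcher

def robustness_ratio(results):
    """Fraction of originally solved tasks that stay solved after a
    semantics-preserving perturbation (mutation or paraphrasing)."""
    solved_orig = [r for r in results if r["solved_original"]]
    if not solved_orig:
        return 0.0
    return sum(r["solved_perturbed"] for r in solved_orig) / len(solved_orig)

def memorization_risk_index(results, threshold=0.8):
    """Fraction of code-rewriting variants (semantics changed) where the
    generated code still closely matches the original reference solution."""
    if not results:
        return 0.0
    near_copies = sum(
        SequenceMatcher(None, r["generated_code"], r["reference_code"]).ratio() >= threshold
        for r in results
    )
    return near_copies / len(results)

Under this reading, a higher RR indicates more consistent behavior under harmless prompt edits, while a higher MRI indicates that the model keeps reproducing a memorized solution even though the task has actually changed.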
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: LLM code generation, Memorization, Robustness
Languages Studied: English
Submission Number: 6338